Developing big data solutions on Microsoft Azure HDInsight
Microsoft Azure HDInsight is a big data solution based on the open-source Apache Hadoop framework, and is an integral part of the Microsoft Business Intelligence (BI) and Analytics product range. This guide explores the use of HDInsight in a range of use cases and scenarios such as iterative exploration, data warehousing, ETL processes, and integration into existing BI systems. It also includes guidance on understanding the concepts of big data, planning and designing big data solutions, and implementing these solutions.
The guide is divided into three sections:
- Section 1, “Understanding Microsoft big data solutions,” provides an overview of the principles and benefits of big data solutions, and the differences between these and the more traditional database systems. It includes general guidance for planning and designing big data solutions, exploring in more depth topics such as defining the goals and locating data sources. It will help you decide where, when, and how you might benefit from adopting a big data solution. This section also discusses Azure HDInsight, and its place within the comprehensive Microsoft data platform.
- Section 2, “Designing big data solutions using HDInsight,” contains guidance for designing solutions to meet the typical batch processing use cases inherent in big data processing. Even if you choose not to use HDInsight as the platform for your own solution, you will find the information in this section useful.
- Section 3, “Implementing big data solutions using HDInsight,” explores a range of topics such as the options and techniques for loading data into an HDInsight cluster, the tools you can use in HDInsight to process data in a cluster, and the ways you can transfer the results from HDInsight into analytical and visualization tools to generate reports and charts, or export the results into existing data stores such as databases, data warehouses, and enterprise BI systems. This section also contains useful information to help you automate all or part of the process, and to manage and monitor your solutions.
The guide concentrates on the Azure HDInsight service, but much of the information is equally applicable to big data solutions built on any platform, and with any Hadoop-based framework.
What this guide is, and what it is not
Big data is not a new concept. Distributed processing has been the mainstay of supercomputers and high performance data storage clusters for a long time. What’s new is standardization around a set of open source technologies that make distributed processing systems easier to build—combined with the growing need to store, manage, and get information from the ever increasing volume of data that modern society generates. However, as with most new technologies, big data is surrounded by a great deal of hype that often gives rise to unrealistic expectations.
Like all of the releases from Microsoft patterns & practices, this guide avoids the hype by concentrating on the “whys” and the “hows.” In terms of the “whys,” the guide explains the concepts of big data solutions based on Hadoop, gives a focused view on what you can expect to achieve with such a solution, and explores the capabilities of these types of solutions in detail so that you can decide for yourself if it is an appropriate technology for your own scenarios. The guide does this by explaining what a Hadoop-based big data solution can do, how it does it, and the types of problems it is designed to solve.
In terms of the “hows,” the guide continues with a detailed examination of the typical use cases and models for Hadoop-based big data batch processing solutions, and the ways that these models integrate with the wider data management and business information environment, so that you can quickly understand how you might apply a big data solution in your own environment.
This guide is not a technical reference manual for big data. It concentrates on the complete end-to-end process of designing and building useful solutions on Hadoop-based systems for batch processing, with the focus mainly on Azure HDInsight. It doesn’t attempt to cover every nuance of implementing a big data solution, or writing complicated code, or pushing the boundaries of what the technology is designed to do. Neither does it cover all of the myriad details of the underlying mechanism—there is a multitude of books, websites, and blogs that explain all these things. For example, you won’t find in this guide a list of all of the configuration settings, the way that memory is allocated, the syntax of every type of query, and the many patterns for writing map and reduce components.
What you will see is guidance that will help you understand and get started using HDInsight to build realistic solutions capable of answering real world questions.
This guide is based on the version 3.0 (March 2014) release of HDInsight on Azure, but also includes some of the preview features that are available in later versions. Earlier and later releases of HDInsight may differ from the version described in this guide. For more information, see “What's new in the Hadoop cluster versions provided by HDInsight?” To sign up for the Azure service, go to the HDInsight service home page.
Who this guide is for
The three sections of this guide target specific audiences:
- Executives, information officers, and technology managers. The discussion of the principles and benefits of Hadoop-based big data solutions, defining the goals for solutions, and identifying analysis requirements in section 1, “Understanding Microsoft big data solutions,” of this guide demonstrates where, when, and how a big data solution would benefit the organization.
- Architects and system designers. The exploration of the typical use cases and scenarios for big data batch processing solutions in section 2, “Designing big data solutions using HDInsight,” of this guide provides valuable assistance in designing systems that will produce the desired results.
- Developers and database administrators. The explanation of topics such as loading, querying and manipulating data; transferring the results into analytical and visualization tools; exporting the results into existing data stores and enterprise BI systems; and automating solutions in section 3, “Implementing big data solutions using HDInsight,” of this guide will help developers and DBAs to get started implementing and working with big data solutions.
Why this guide is pertinent now
Businesses and organizations are increasingly collecting huge volumes of data that may be useful now or in the future, and they need to know how to store and query it to extract the hidden information it contains. This might be web server log files, click-through data, financial information, medical data, user feedback, location and sensor information from mobile devices, or a range of social sentiment data such as tweets or comments to blog posts.
Big data techniques and technologies such as HDInsight provide a means to efficiently store this data, analyze it to extract useful information, and export the results to tools and applications that can visualize the information in a range of ways. This approach is, realistically, the only way to handle the volume and the inherently unstructured nature of this data.
No matter what type of service you provide, or what industry or market sector you are in, and even if you only run a blog or website forum, you are highly likely to benefit from collecting and analyzing data. This data may be easily available, is often collected automatically (such as server log files), or can be obtained from other sources and combined with your own data. It can help you better understand your customers, your users, and your business, and help you plan for the future.
The team who brought you this guide
This guide from the Microsoft patterns & practices group was produced with the help of many people within the developer community.
Vision/Program Management: Masashi Narumoto
Authors: Alex Homer, Graeme Malcolm, and Masashi Narumoto
Development: Andrew Oakley, Alejandro Jezierski (Southworks), Leo Tilli (Nippur LLC), and Pablo Zaidenvoren (Nippur LLC)
Testing: Rohit Sharma, Larry Brader, Hanz Zhang, Mariano Sanchez (Lagash Systems SA), and Luis Ariel Kahrs (Lagash Systems SA)
Performance Testing: Carlos Farre and Veerapat Sriarunrungrueang (Adecco)
Documentation and illustrations: Alex Homer and Graeme Malcolm (Content Master Ltd)
Graphic design: Chris Burns (Linda Werner & Associates Inc)
Editor: RoAnn Corbisier
Production: Nelly Delgado
Reviewers: Cindy Gross, Carl Nolan, Matt Winkler, Maxim Lukiyanov, Rafael Godinho, Simon Gurevich, Scott Shaw (Hortonworks), Wenming Ye (Microsoft Research), Sherman Wang, Kuninobu Sasaki, Philip Reilly, Andre Magni, Simon Lidberg, Michael Hlobil, Chris Douglas, Min Wei, Cale Teeter, Buck Woody, Emilio D’Angelo Yofre, Mandar Inamdar, Daniel Vaughan, Sunil Sabat, Paul Glavich, Christopher Maneu, Carlos dos Santos, Nishant Thacker, Paweł Wilkosz, and Ofer Ashkenazi.
Thanks: Special thanks to Fred Pace, Kate Baroni, and the members of Microsoft Data Insights COE for supporting this project.
Thank you all for bringing this guide to life!
Community and feedback
Questions? Comments? Suggestions? To provide feedback about this guide, or to get help with any problems, please visit our Community site at http://wag.codeplex.com. The message board on the community site is the preferred feedback and support channel because it allows you to share your ideas, questions, and solutions with the entire community.