Implementing big data solutions using HDInsight
This section of the guide is aimed at developers, and explores the typical stages of implementing a big data solution. The examples focus on Microsoft Azure HDInsight. However, because the underlying big data framework is Hadoop, much of the information about loading, querying, visualization, and automation applies equally to big data solutions built on non-Microsoft operating systems and to solutions that use services other than HDInsight.
This section is divided into convenient areas that make it easier to understand the challenges, options, solutions, and considerations for each stage. It describes and demonstrates the individual tasks that are part of typical end-to-end big data solutions.
The following sections demonstrate the three main stages of the process, followed by an exploration of how you can combine and automate them to build a comprehensive managed solution. The sections are:
- Obtaining the data and submitting it to the cluster. During this stage you decide how you will collect the data you have identified as the source, and how you will get it into your big data solution for processing. Often you will store the data in its raw format to avoid losing any useful contextual information it contains, though you may choose to do some pre-processing before storing it to remove duplication or to simplify it in some other way. You must also make several decisions about how and when you will initialize a cluster and the associated storage. For more details, see Collecting and loading data into HDInsight.
- Processing the data. After you have started to collect and store the data, the next stage is to develop the processing solutions you will use to extract the information you need. Hive and Pig queries are usually sufficient for even quite complex data extraction, but you will occasionally need to create map/reduce components to perform more complex queries against the data. For more details, see Processing, querying, and transforming data using HDInsight.
- Visualizing and analyzing the results. Once you are satisfied that the solution is working correctly and efficiently, you can plan and implement the analysis and visualization approach you require. This may be loading the data directly into an application such as Microsoft Excel, or exporting it into a database or enterprise BI system for further analysis, reporting, charting, and more. For more details, see Consuming and visualizing data from HDInsight.
- Building an automated end-to-end solution. At this point it will become clear whether the solution should become part of your organization’s business management infrastructure, complementing the other sources of information that you use to plan and monitor business performance and strategy. If this is the case, you should consider how you might automate and manage some or all of the solution so that it behaves predictably and, where appropriate, runs on a schedule. For more details, see Building end-to-end solutions using HDInsight.
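
The processing stage above mentions map/reduce components. As a rough illustration of that programming model only, the following minimal sketch counts words in memory using plain Python. It does not use Hadoop or any HDInsight API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical names chosen for this example. A real job would distribute the same three steps across the nodes of a cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle step: group all emitted values by key.

    In a cluster the framework does this between the map and reduce steps;
    here it is simulated with an in-memory dictionary.
    """
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: collapse each key's list of values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in order over the input lines."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    sample = ["big data on Hadoop", "Big data queries"]
    print(word_count(sample))
```

The same aggregation is typically much shorter as a Hive or Pig query, which is why custom map/reduce components are usually reserved for the cases where those query languages are not sufficient.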