Mashups and visualizations over Big Data and Azure HDInsight using Data Explorer
With the recent announcement about Azure HDInsight, now is a good time to look at how one might use Data Explorer to connect to data sitting in Windows Azure HDInsight.
As you might be aware, HDInsight is Microsoft’s own offering of Hadoop as a service. In this example, we are going to look at how data can be consumed from HDInsight. We will be using a dataset in HDInsight that contains historical stock prices for all stocks traded on the NYSE between 1970 and 2010. While this dataset is not too big compared to “Big Data” standards, it does represent many of the challenges posed by big data as far as end user consumption goes. In case you are interested in trying this yourself, the source of this data is Infochimps. You will need to get the data into an Azure Blob Storage account that is associated with your HDInsight cluster.
The goal of this post is to show you how to use this data to build a report of those stocks that are traded on the NYSE and are part of the S&P 500 index. This dataset by alone isn’t enough, because all it provides is price and volume information by stock symbol and date. So we will need to ultimately find company name/sector information from another dataset. More on that later.
Here’s a view of the report we are attempting to build:
The steps below show you how to build this interactive report.
Step 1: Connect to HDInsight and shape the data
The first step is to connect to HDInsight and get the data in the right shape. HDInsight is a supported data source in the Data Explorer ribbon.
If you are following along and don’t see HDInsight in the Other Sources dropdown, you need to get the latest update for Data Explorer.
Once the account details are provided, we will eventually be connected up to the HDFS filesystem view:
At this point, you will likely notice the first challenge. The data that we want to consume is scattered across dozens of files – and we absolutely need the data from all of these files. However, this is an easy problem for Data Explorer.
As a first step, we need to use Data Explorer to subset the files based on a condition, so that all unnecessary files are filtered out. We can use a condition that filters down to the list of files that contain “daily_price” in the filename.
After this step, we have just the files needed. Now for the magic.
Data Explorer has a really cool feature that lets you create a logical table out of multiple text files. You can “combine” multiple files in a filesystem view by simply clicking on the Combine icon in the Content column header.
Clicking on that icon produces this:
At this point, one of the top rows can be promoted as header using the feature Data Explorer provides for creating a header row. The rest of the header rows can be filtered out also using the filter capabilities.
A few more operations to hide unwanted columns will produce our final view:
Clicking on Done will start to run the query and stream the data down into Excel.
Step 2: Find the S&P 500 list of companies along with company information
At this point, the data should be streaming down into Excel. There is quite a bit of data that will find its way down into Excel. However, since we are not done with data shaping, we need to toggle a setting that will disable evaluation/download of the results for this particular query.
Clicking on Enable Download stops the download:
In order to fully build out the report, the next thing we need is the list of companies that are part of the S&P 500 index. We can try to find this in Data Explorer’s Online Search.
Searching for S&P 500 yields a few results:
The first one looks pretty close to the data we need. So we can import that into Excel by clicking Use:
Step 3: Merging the two tables, and subsetting to the S&P 500 price data
The last step in our scenario is to combine the two tables using Data Explorer. We can do this by clicking on the Merge button in the Data Explorer ribbon:
The Merge dialog lets us pick the tables we’d like to merge, along with the common columns between the two tables so that Data Explorer can do a join. Note that a left outer join is used when merging is done this way.
Clicking on Apply completes the merge, and we are presented with a resulting table:
Columns from the second table can be added by expanding and looking up columns from the NewColumn column:
The result of selecting the columns we’d like to add to the table produces this:
Note that there are many columns that have null values for the new columns. This is expected as we have more companies in the left table (the one we pulled from HDInsight).
A simple filtering out of nulls fixes the problem, and leaves us with what we need:
We are now left with the historical end of day figures for all companies in the S&P 500. Clicking on Done will now bring the data into Excel.
Step 4: Fix up a few types in PowerPivot and visualize in Power View
Once the data is downloaded, adding it to Excel’s data model (xVelocity) is easy. Clicking on the Load to data model link puts the data into the model in one click:
Once the data is in the data model, it can easily be modeled using Excel’s PowerPivot functionality.
In order for the visualization to work correctly, we need to adjust the types of the following columns in PowerPivot:
- date – this column needs to be converted to Date type
- stock_price_close – this column needs to be converted to Decimal type
Once this is done, adding the visualization via Power View is easy.
- We can insert a Power View from the Insert ribbon/tab in Excel.
- Delete any default visualizations. From the table named Query in the Power View fields list, select the following columns:
- Change the visualization to a Line chart.
This will produce a visualization that shows all the companies on a single chart. We can then customize the chart and see just the companies we are interested in.
That’s all it takes to get all that data from HDInsight, and to combine that data with some publicly available information. Data Explorer’s Online Search is a good source for public data.
We hope this gives you an idea for how Data Explorer enables richer connectivity, discovery and data shaping scenarios while enhancing your Self Service BI experience in Excel. An interesting thought exercise will be to consider how you might accomplish this scenario without using Data Explorer.
Let us know what you think!
Data Explorer Team