Saving Spark Distributed Data Frame (DDF) To PowerBI
- The sample Jupyter Scala notebook described in this blog can be downloaded from https://github.com/hdinsight/spark-jupyter-notebooks/blob/master/Scala/SparkDataFrameToPowerBI.ipynb.
- Spark PowerBI connector source code is available at https://github.com/hdinsight/spark-powerbi-connector.
Data visualization is often the most important part of data processing as it can surface up data patterns and trends in data that cannot be otherwise easily perceptible by humans. PowerBI (https://powerbi.microsoft.com/en-us/) is a powerful data visualization platform offered by Microsoft. As PowerBI turned out to be a natural choice for data visualization we developed a Spark to PowerBI connector (https://github.com/hdinsight/spark-powerbi-connector) which implements a PowerBI REST client and extension methods on Spark
DStream in the same way described in a previous blog https://blogs.msdn.microsoft.com/azuredatalake/2016/03/01/extending-spark-with-extension-methods-in-scala-fun-with-implicits/. These extension methods can be called in a Spark application in codes that run either in the driver or the executors.
This blog describes one of the ways of using the connector for pushing Spark
DataFrame to PowerBI as part of a Spark interactive or batch job through an example Jupyter notebook in Scala which can be run on an HDInsight cluster. To learn about how to deploy Azure HDInsight Linux Spark cluster and launch Jupyter notebook refer to the Azure article at https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-jupyter-spark-sql/. PowerBI has always been able to pull data from various data sources including those supported by Spark (like Hive table, Azure blob, SQL server) through appropriate drivers but the Spark to PowerBI connector enables pushing data from Spark directly to PowerBI. Though this example shows pushing an entire
DataFrame directly to PowerBI, it is expected that users should use their judgment on the amount of data to be pushed through the
DataFrame. Be aware of the limitations of PowerBI REST APIs as detailed in https://msdn.microsoft.com/en-us/library/dn950053.aspx. An ideal use case will be displaying some metrics or aggregates of a job at certain intervals or stages as PowerBI Dashboard.
In this example we will visualize a
DataFrame on PowerBI that displays the relation between the radius and area of circles. To start with this example we need to configure Jupyter to use two additional JARs and place them in a known folder in the default container of the default storage account of the HDInsight cluster:
- adal4j.jar available at http://mvnrepository.com/artifact/com.microsoft.azure/adal4j
- spark- powerbi-connector_2.10-0.6.0.jar (compiled with Spark 1.6.1) available at https://github.com/hdinsight/spark-powerbi-connector. Follow the README for Maven and SBT co-ordinates.
If required, source codes from Github repositories can be cloned, built and packaged into appropriate JAR artifacts through IntelliJ. Refer to the appropriate section of the Azure article at https://azure.microsoft.com/en-us/documentation/articles/hdinsight-apache-spark-eventhub-streaming/ for detailed instructions.
case class Circle that will hold the data which will be used to create the
DataFrame for this example. In practice the
DataFrame data source can be anything that is supported by Spark.
Generate a list of
Circle objects and create a
DataFrame out of it.
Create a new
DataFrame by selecting only the columns from the last
DataFrame which we want to show on the PowerBI dashboard. In this example columns “radius” and “area” are selected.
Enter the PowerBI Client ID, PowerBI Account Username and PowerBI Account Password and declare a
PowerBIAuthentication object. The
PowerBIAuthentication class is defined in the spark-powerbi-connector.jar under
PowerBIAuthentication object, declare PowerBI table with column names and datatypes and create (or get) the PowerBI dataset containing the table. This step runs in the driver and actually goes to PowerBI and creates (or gets) the dataset and the underlying table(s). The column order of the PowerBI table should match that of the
DataFrame which it is storing.
Simply call the
toPowerBI method on the
DataFrame with PowerBI dataset details received when creating (or getting) the dataset in the previous step, the PowerBI table name and the
Verify that the data is actually saved and can be displayed on the PowerBI dashboard.
For further reading into how
DataFrame has been extended to support saving it to PowerBI, following is the code behind. Each partition of the
DataFrame is grouped into 1000 records and serialized into a POST request of multiple rows to PowerBI table in JSON format. Since
DataFrame and PowerBI table both maintain column order and PowerBI table and
DataFrame column orders should match, no name matching is done between columns of
DataFrame and PowerBI table.
[Contributed by Arijit Tarafdar]