Executing Spark SQL Queries using dotnet ODBC driver

Introduction

HDInsight provides numerous ways of executing Spark applications on your cluster. This blogpost outlines how to run Spark SQL queries on your cluster remotely from Visual Studio using C#.  The examples explained below is intended to serve as a framework on which you can extend it to build your custom Spark SQL queries.

Prerequisite

Before you begin, you must have the following:

Downloading Source

The project that contains the sample scenario is hosted in Microsoft GitHub. Download the source from here: https://github.com/Azure-Samples/hdinsight-dotnet-odbc-spark-sql

Sample Scenario

The sample described in the source is intended to do the following:

  • Connect to your HDInsight Spark cluster using ODBC connection string.
  • Take a list of queries as input and run each of the them serially on the cluster.
  • Measure time taken by each query to complete.
  • Measure average, min and max query running time.

Query Description

The queries used in this example perform the following operations on your cluster:

  • List all tables in your cluster.
  • Drop the table mysampletable if it exists.
  • Create an external table mysampletable which contains device platform and number of devices using that platform.
  • Extract all records from hivesampletable which come built in with your HDInsight cluster, group them by device platform and write to mysampletable.
  • Collect all entries from mysampletable and print them.

Executing the Sample

  • Go to the source directory downloaded from Github - cd hdinsight-dotnet-odbc-spark-sql
  • Open the ODBC Client csproj file using Visual Studio 2015
  • Navigate to Program.cs using Solution Explorer
  • Input your username, password and cluster name in the main function.  Username
  • Execute the project by clicking the “Start” button at the top navigation bar.

Customizing the Sample

This sample provides a framework which can be customized to fit your needs. An example scenario for customization could be as follows:

  • Running your custom queries: The sample has a queryList which contains the queries explained in this section. Replace the queries in this list by you’re the queries you want to run.  QueryList
  • Changing timeouts: The sample has currently set command timeout to 5 minutes and the connection timeout to 10 minutes. While this should work perfectly fine for the sample described here, you might want to change it when running long running queries.
    • Changing command timeout: Change the CommandTimeout property set in command context [ Value in seconds ].  CommandT
    • Changing connection timeout: This has to be set while building the connection string as part of GenerateConnectionString  ConnT