This document provides sample Hive scripts that are used to explore data in Hive tables in an HDInsight Hadoop cluster.
The following menu links to topics that describe how to use tools to explore data from various storage environments.
Prerequisites
This article assumes that you have:
- Created an Azure storage account. If you need instructions, see Create an Azure Storage account
- Provisioned a customized Hadoop cluster with the HDInsight service. If you need instructions, see Customize Azure HDInsight Hadoop Clusters for Advanced Analytics.
- The data has been uploaded to Hive tables in Azure HDInsight Hadoop clusters. If it has not, follow the instructions in Create and load data to Hive tables to upload data to Hive tables first.
- Enabled remote access to the cluster. If you need instructions, see Access the Head Node of Hadoop Cluster.
- If you need instructions on how to submit Hive queries, see How to Submit Hive Queries
Example Hive query scripts for data exploration
- Get the count of observations per partition
SELECT <partitionfieldname>, count(*) from <databasename>.<tablename> group by <partitionfieldname>; - Get the count of observations per day
SELECT to_date(<date_columnname>), count(*) from <databasename>.<tablename> group by to_date(<date_columnname>); - Get the levels in a categorical column
SELECT distinct <column_name> from <databasename>.<tablename> - Get the number of levels in combination of two categorical columns
SELECT <column_a>, <column_b>, count(*) from <databasename>.<tablename> group by <column_a>, <column_b> - Get the distribution for numerical columns
SELECT <column_name>, count(*) from <databasename>.<tablename> group by <column_name> Extract records from joining two tables
SELECT a.<common_columnname1> as <new_name1>, a.<common_columnname2> as <new_name2>, a.<a_column_name1> as <new_name3>, a.<a_column_name2> as <new_name4>, b.<b_column_name1> as <new_name5>, b.<b_column_name2> as <new_name6> FROM ( SELECT <common_columnname1>, <common_columnname2>, <a_column_name1>, <a_column_name2>, FROM <databasename>.<tablename1> ) a join ( SELECT <common_columnname1>, <common_columnname2>, <b_column_name1>, <b_column_name2>, FROM <databasename>.<tablename2> ) b ON a.<common_columnname1>=b.<common_columnname1> and a.<common_columnname2>=b.<common_columnname2>
Additional query scripts for taxi trip data scenarios
Examples of queries that are specific to NYC Taxi Trip Data scenarios are also provided in GitHub repository. These queries already have data schema specified and are ready to be submitted to run.


