Explore data in Hive tables with Hive queries
This article provides sample Hive scripts that are used to explore data in Hive tables in an HDInsight Hadoop cluster.
This task is a step in the Team Data Science Process.
Prerequisites
This article assumes that you have:
- Created an Azure storage account. If you need instructions, see Create an Azure Storage account
- Provisioned a customized Hadoop cluster with the HDInsight service. If you need instructions, see Customize Azure HDInsight Hadoop Clusters for Advanced Analytics.
- The data has been uploaded to Hive tables in Azure HDInsight Hadoop clusters. If it has not, follow the instructions in Create and load data to Hive tables to upload data to Hive tables first.
- Enabled remote access to the cluster. If you need instructions, see Access the Head Node of Hadoop Cluster.
- If you need instructions on how to submit Hive queries, see How to Submit Hive Queries
Example Hive query scripts for data exploration
Get the count of observations per partition
SELECT <partitionfieldname>, count(*) from <databasename>.<tablename> group by <partitionfieldname>;Get the count of observations per day
SELECT to_date(<date_columnname>), count(*) from <databasename>.<tablename> group by to_date(<date_columnname>);Get the levels in a categorical column
SELECT distinct <column_name> from <databasename>.<tablename>Get the number of levels in combination of two categorical columns
SELECT <column_a>, <column_b>, count(*) from <databasename>.<tablename> group by <column_a>, <column_b>Get the distribution for numerical columns
SELECT <column_name>, count(*) from <databasename>.<tablename> group by <column_name>Extract records from joining two tables
SELECT a.<common_columnname1> as <new_name1>, a.<common_columnname2> as <new_name2>, a.<a_column_name1> as <new_name3>, a.<a_column_name2> as <new_name4>, b.<b_column_name1> as <new_name5>, b.<b_column_name2> as <new_name6> FROM ( SELECT <common_columnname1>, <common_columnname2>, <a_column_name1>, <a_column_name2>, FROM <databasename>.<tablename1> ) a join ( SELECT <common_columnname1>, <common_columnname2>, <b_column_name1>, <b_column_name2>, FROM <databasename>.<tablename2> ) b ON a.<common_columnname1>=b.<common_columnname1> and a.<common_columnname2>=b.<common_columnname2>
Additional query scripts for taxi trip data scenarios
Examples of queries that are specific to NYC Taxi Trip Data scenarios are also provided in GitHub repository. These queries already have data schema specified and are ready to be submitted to run. The NYC Taxi Trip data is available through Azure Open Datasets or from the source TLC Trip Record Data.