Explore data in Hive tables with Hive queries

This article provides sample Hive scripts that are used to explore data in Hive tables in an HDInsight Hadoop cluster.

This task is a step in the Team Data Science Process.

Prerequisites

This article assumes that you have:

Example Hive query scripts for data exploration

  1. Get the count of observations per partition SELECT <partitionfieldname>, count(*) from <databasename>.<tablename> group by <partitionfieldname>;

  2. Get the count of observations per day SELECT to_date(<date_columnname>), count(*) from <databasename>.<tablename> group by to_date(<date_columnname>);

  3. Get the levels in a categorical column
    SELECT distinct <column_name> from <databasename>.<tablename>

  4. Get the number of levels in combination of two categorical columns SELECT <column_a>, <column_b>, count(*) from <databasename>.<tablename> group by <column_a>, <column_b>

  5. Get the distribution for numerical columns
    SELECT <column_name>, count(*) from <databasename>.<tablename> group by <column_name>

  6. Extract records from joining two tables

    SELECT
        a.<common_columnname1> as <new_name1>,
        a.<common_columnname2> as <new_name2>,
        a.<a_column_name1> as <new_name3>,
        a.<a_column_name2> as <new_name4>,
        b.<b_column_name1> as <new_name5>,
        b.<b_column_name2> as <new_name6>
    FROM
        (
        SELECT <common_columnname1>,
            <common_columnname2>,
            <a_column_name1>,
            <a_column_name2>,
        FROM <databasename>.<tablename1>
        ) a
        join
        (
        SELECT <common_columnname1>,
            <common_columnname2>,
            <b_column_name1>,
            <b_column_name2>,
        FROM <databasename>.<tablename2>
        ) b
        ON a.<common_columnname1>=b.<common_columnname1> and a.<common_columnname2>=b.<common_columnname2>
    

Additional query scripts for taxi trip data scenarios

Examples of queries that are specific to NYC Taxi Trip Data scenarios are also provided in GitHub repository. These queries already have data schema specified and are ready to be submitted to run. The NYC Taxi Trip data is available through Azure Open Datasets or from the source TLC Trip Record Data.