Explore data in Hive tables with Hive queries

This document provides sample Hive scripts that are used to explore data in Hive tables in an HDInsight Hadoop cluster.

The following menu links to topics that describe how to use tools to explore data from various storage environments.

Prerequisites

This article assumes that you have:

Example Hive query scripts for data exploration

  1. Get the count of observations per partition SELECT <partitionfieldname>, count(*) from <databasename>.<tablename> group by <partitionfieldname>;
  2. Get the count of observations per day SELECT to_date(<date_columnname>), count(*) from <databasename>.<tablename> group by to_date(<date_columnname>);
  3. Get the levels in a categorical column
    SELECT distinct <column_name> from <databasename>.<tablename>
  4. Get the number of levels in combination of two categorical columns SELECT <column_a>, <column_b>, count(*) from <databasename>.<tablename> group by <column_a>, <column_b>
  5. Get the distribution for numerical columns
    SELECT <column_name>, count(*) from <databasename>.<tablename> group by <column_name>
  6. Extract records from joining two tables

     SELECT
         a.<common_columnname1> as <new_name1>,
         a.<common_columnname2> as <new_name2>,
         a.<a_column_name1> as <new_name3>,
         a.<a_column_name2> as <new_name4>,
         b.<b_column_name1> as <new_name5>,
         b.<b_column_name2> as <new_name6>
     FROM
         (
         SELECT <common_columnname1>,
             <common_columnname2>,
             <a_column_name1>,
             <a_column_name2>,
         FROM <databasename>.<tablename1>
         ) a
         join
         (
         SELECT <common_columnname1>,
             <common_columnname2>,
             <b_column_name1>,
             <b_column_name2>,
         FROM <databasename>.<tablename2>
         ) b
         ON a.<common_columnname1>=b.<common_columnname1> and a.<common_columnname2>=b.<common_columnname2>
    

Additional query scripts for taxi trip data scenarios

Examples of queries that are specific to NYC Taxi Trip Data scenarios are also provided in GitHub repository. These queries already have data schema specified and are ready to be submitted to run.