Hive and XML File Processing

When I put together the “Generics based Framework for .Net Hadoop MapReduce Job Submission” code one of the goals was to support XML file processing. This was achieved by the creation of a modified Mahout document reader where one can specify the XML node to be presented for processing. But what if ones wants to process XML documents in Hive. Fortunately Hive similarly supports document readers, thus enabling the same document readers to be used as the basis of table definitions.

The process of enabling XML processing in Hive is relatively straightforward:

  • Create the table definition specifying that the input format is XML; thus exposing the necessary XML elements as columns
  • Parse the XML column data using xpath expressions in SELECT statements
  • or – Define a view on the XML table parsing out the relevant XML elements, returning them as native types

The syntax for the xpath processing in Hive can be found at: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+XPathUDF

So onto a simple example.

In “SampleScripts” folder of the MapReduce Framework download there is a script that extracts Store information, in XML format, from the sample AdventureWorks database. A sample of the output is as follows:

  1. <Root>
  2.   <Store>
  3.     <BusinessEntityID>292</BusinessEntityID>
  4.     <Name>Next-Door Bike Store</Name>
  5.     <SalesPersonID>279</SalesPersonID>
  6.     <Demographics>
  7.       <StoreSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">
  8.         <AnnualSales>800000</AnnualSales>
  9.         <AnnualRevenue>80000</AnnualRevenue>
  10.         <BankName>United Security</BankName>
  11.         <BusinessType>BM</BusinessType>
  12.         <YearOpened>1996</YearOpened>
  13.         <Specialty>Mountain</Specialty>
  14.         <SquareFeet>21000</SquareFeet>
  15.         <Brands>2</Brands>
  16.         <Internet>ISDN</Internet>
  17.         <NumberEmployees>13</NumberEmployees>
  18.       </StoreSurvey>
  19.     </Demographics>
  20.     <Modified>2008-10-13T11:15:07.497</Modified>
  21.   </Store>
  22.   ...
  23.   <Store>
  24.     <BusinessEntityID>374</BusinessEntityID>
  25.     <Name>Immense Manufacturing Company</Name>
  26.     <SalesPersonID>277</SalesPersonID>
  27.     <Demographics>
  28.       <StoreSurvey xmlns="http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/StoreSurvey">
  29.         <AnnualSales>3000000</AnnualSales>
  30.         <AnnualRevenue>300000</AnnualRevenue>
  31.         <BankName>Guardian Bank</BankName>
  32.         <BusinessType>OS</BusinessType>
  33.         <YearOpened>1998</YearOpened>
  34.         <Specialty>Touring</Specialty>
  35.         <SquareFeet>76000</SquareFeet>
  36.         <Brands>4+</Brands>
  37.         <Internet>DSL</Internet>
  38.         <NumberEmployees>73</NumberEmployees>
  39.       </StoreSurvey>
  40.     </Demographics>
  41.     <Modified>2008-10-13T11:15:07.497</Modified>
  42.   </Store>
  43. </Root>

 

The record reader to be used will be “XmlElementStreamingInputFormat”. This document reader using a configuration element to define the XML node to be located, which then outputs for each row a single column consisting of the complete node contents.

Using this record reader a table can be defined consisting of a single XML column:

add JARS file:///C:/Users/Carl/Projects/MSDN.Hadoop.MapReduce/Release/msdn.hadoop.readers.jar;
set xmlinput.element=Store;

CREATE EXTERNAL TABLE StoresXml (storexml string)
STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/Carl/stores/demographics';

The INPUTFORMAT option allows for the definition of the required document reader. The OUTPUTFORMAT specified is the default for Hive. In this example I have defined an EXTERNAL table over the directory containing the extracted XML; independently copied to the Hadoop cluster.

The ADD JARS statement ensures the document reader is available for job execution. The SET statement configures the job such that the document reader knows what XML node to extract.

Once this table is defined you can use it like any other Hive table. If one selects from this table you will get each Store element as a row. However, xpath processing allows you to extract the XML attributes as native types.

In addition to SELECT operations one also has the option of creating a VIEW that parses the XML and presents the data using native types:

CREATE VIEW Stores(BusinessEntityID, BusinessType, BankName, AnnualSales, AnnualRevenue) AS
SELECT
xpath_int (storexml, '/Store/BusinessEntityID'),
xpath_string (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'BusinessType\']'),
xpath_string (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'BankName\']'),
xpath_double (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'AnnualSales\']'),
xpath_double (storexml, '/Store/Demographics/*[local-name()=\'StoreSurvey\']/*[local-name()=\'AnnualRevenue\']')
FROM StoresXml;

Using the Stores definition one can now process the XML data files through the normal Hive operations. Continuing with the same samples in the download one can now easily generate a revenue summary across the banks:

SELECT BusinessType, BankName, CAST(SUM(AnnualSales) AS INT) AS TotalSales FROM Stores
GROUP BY BusinessType, BankName;

As expected, under the covers the necessary MapReduce jobs are executed to aggregate the data, returning:

BM Guardian Bank 43200000
BM International Bank 43200000
BM International Security 43200000
BM Primary Bank & Reserve 42800000
BM Primary International 42200000
BM Reserve Security 42200000
BM United Security 42200000
BS Guardian Bank 88400000
BS International Bank 87400000
BS International Security 88400000
BS Primary Bank & Reserve 88400000
BS Primary International 87400000
BS Reserve Security 87400000
BS United Security 87400000
OS Guardian Bank 192000000
OS International Bank 186000000
OS International Security 186000000
OS Primary Bank & Reserve 186000000
OS Primary International 186000000
OS Reserve Security 186000000
OS United Security 186000000

If you download the aforementioned code, the sample for the Hive execution can be found in the “SampleScripts” folder. The “DocumentInputReaders” folder also contains the XML document reader classes along with a usable JAR file.