Appendix A - Tools and technologies reference
This appendix contains descriptions of the tools, APIs, SDKs, and technologies commonly used in conjunction with big data solutions, including those built on Hadoop and HDInsight. The icons for each item, listed below, will help you to more easily identify the tools and technologies you should investigate.
Data consumption: Tools, APIs, SDKs, and technologies commonly used for extracting and consuming the results from Hadoop-based solutions.
Data ingestion: Tools, APIs, SDKs, and technologies commonly used for extracting data from data sources and loading it into Hadoop-based solutions.
Data processing: Tools, APIs, SDKs, and technologies commonly used for processing, querying, and transforming data in Hadoop-based solutions.
Data transfer: Tools, APIs, SDKs, and technologies commonly used for transferring data between Hadoop and other data stores such as databases and cloud storage.
Data visualization: Tools, APIs, SDKs, and technologies commonly used for visualizing and analyzing the results from Hadoop-based solutions.
Job submission: Tools, APIs, SDKs, and technologies commonly used for submitting jobs for processing in Hadoop-based solutions.
Management: Tools, APIs, SDKs, and technologies commonly used for managing and monitoring Hadoop-based solutions.
Workflow: Tools, APIs, SDKs, and technologies commonly used for creating workflows and managing multi-step processing in Hadoop-based solutions.
The tools, APIs, SDKs, and technologies are listed in alphabetical order below.
A solution for provisioning, managing, and monitoring Hadoop clusters using an intuitive, easy-to-use Hadoop management web UI backed by REST APIs.
Usage notes: Only the monitoring endpoint was available in HDInsight at the time this guide was written.
For more info, see Ambari.
A tool for high-performance transfer and synchronization of files and data sets of virtually any size, with the full access control, privacy and security. Provides maximum speed transfer under variable network conditions.
- Uses a combination of UDP and TCP, which eliminates the latency issues typically encountered when using only TCP.
- Leverages existing WAN infrastructure and commodity hardware.
For more info, see Aspera.
A data serialization system that supports rich data structures, a compact, fast, binary data format, a container file to store persistent data, remote procedure calls (RPC), and simple integration with dynamic languages. Can be used with the client tools in the .NET SDK for Azure.
- Quick and easy to use and simple to understand.
- Uses JSON to define schemas.
- The API supports C# and several other languages.
- No monitoring or logging features built in.
For more info, see Avro.
A command-line utility designed for high-performance upload and download of Azure storage blobs and files. Can be scripted for automation. Offers a number of functions to filter and manipulate content. Provides resume and logging functions.
- Transfers to and from an Azure datacenter will be constrained by the connection bandwidth available.
- Configure the number of concurrent threads based on experimentation.
- A PowerShell script can be created to monitor the logging files.
For more info, see AzCopy.
A framework for creating workflows that access Hadoop. Designed to overcome the problem of interdependencies between tasks.
- Uses a web server to schedule and manage jobs, an executor server to submit jobs to Hadoop, and either an internal H2 database or a separate MySQL database to store job details.
For more info, see Azkaban.
Azure Intelligent Systems Service (ISS)
A cloud-based service that can be used to collect data from a wide range of devices and applications, apply rules that define automated actions on the data, and connect the data to business applications and clients for analysis.
- Use it to capture, store, join, visualize, analyze, and share data.
- Supports remote management and monitoring of data transfers.
- Service was in preview at the time this guide was written.
For more info, see Azure Intelligent Systems Service (ISS).
Azure Storage Client Libraries
Exposes storage resources through a REST API that can be called by any language that can make HTTP/HTTPS requests.
- Provides programming libraries for several popular languages that simplify many tasks by handling synchronous and asynchronous invocation, batching of operations, exception management, automatic retries, operational behavior, and more.
- Libraries are currently available for .NET, Java, and C++. Others will be available over time.
For more info, see Azure Storage Client Libraries.
Azure Storage Explorer
A free GUI-based tool for viewing, uploading, and managing data in Azure blob storage. Can be used to view multiple storage accounts at the same time, each in a separate tab page.
- Create, view, copy, rename, and delete containers.
- Create, view, copy, rename, delete, upload, and download blobs.
- Blobs can be viewed as images, video, or text.
- Blob properties can be viewed and edited.
For more info, see Azure Storage Explorer.
Azure SQL Database
A platform-as-a-service (PaaS) relational database solution in Azure that offers a minimal configuration, low maintenance solution for applications and business processes that require a relational database with support for SQL Server semantics and client interfaces.
Usage notes: A common work pattern in big data analysis is to provision the HDInsight cluster when it is required, and decommission it after data processing is complete. If you want the results of the big data processing to remain available in relational format for client applications to consume, you can transfer the output generated by HDInsight into a relational database. Azure SQL Database is a good choice for this when you want the data to remain in the cloud, and you do not want to incur the overhead of configuring and managing a physical server or virtual machine running the SQL Server database engine.
For more info, see Azure SQL Database.
A project to develop support for writing native-code REST clients for Azure, with integration into Visual Studio. Provides a consistent and powerful model for composing asynchronous operations based on C++11 features.
- Provides support for accessing REST services from native code on Windows Vista, Windows 7, and Windows 8 through asynchronous C++ bindings to HTTP, JSON, and URIs.
- Includes libraries for accessing Azure blob storage from native clients.
- Includes a C++ implementation of the Erlang actor-based programming model.
- Includes samples and documentation.
For more info, see Casablanca.
A data processing API and processing query planner for defining, sharing, and executing data processing workflows. Adds an abstraction layer over the Hadoop API to simplify development, job creation, and scheduling.
- Can be deployed on a single node to efficiently test code and process local files before being deployed on a cluster, or in a distributed mode that uses Hadoop.
- Uses a metaphor of pipes (data streams) and filters (data operations) that can be assembled to split, merge, group, or join streams of data while applying operations to each data record or groups of records.
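Cascading itself is a Java framework; purely to illustrate the pipes-and-filters metaphor described above, this Python sketch assembles a stream of records from generator-based "pipes" (the record fields and operations are invented for illustration):

```python
# A minimal pipes-and-filters sketch: each "pipe" is a generator that
# applies one operation to a stream of records, mirroring the metaphor
# Cascading uses (Cascading itself is a Java framework).

def source(records):
    # Emit records into the stream.
    for r in records:
        yield r

def filter_pipe(stream, predicate):
    # Keep only records matching the predicate.
    for r in stream:
        if predicate(r):
            yield r

def transform_pipe(stream, fn):
    # Apply an operation to each record.
    for r in stream:
        yield fn(r)

records = [{"user": "ann", "score": 7},
           {"user": "bob", "score": 3},
           {"user": "cal", "score": 9}]

# Assemble the pipeline: source -> filter -> transform.
stream = source(records)
stream = filter_pipe(stream, lambda r: r["score"] > 5)
stream = transform_pipe(stream, lambda r: r["user"].upper())

result = list(stream)
print(result)  # ['ANN', 'CAL']
```

Because each pipe only consumes the previous one lazily, stages can be split, merged, or reordered without changing the individual operations, which is the property Cascading exploits when planning jobs.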
For more info, see Cascading.
Cerebrata Azure Management Studio
A comprehensive environment for managing Azure-hosted applications. Can be used to access Azure storage, Azure log files, and manage the life cycle of applications. Provides a dashboard-style UI.
- Connects through a publishing file and enables use of groups and profiles for managing users and resources.
- Provides full control of storage accounts, including Azure blobs and containers.
- Enables control of diagnostics features in Azure.
- Provides management capabilities for many types of Azure service including SQL Database.
For more info, see Cerebrata Azure Management Studio.
An automation platform that transforms infrastructure into code. Allows you to automate configuration, deployment and scaling for on-premises, cloud-hosted, and hybrid applications.
Usage notes: Available as a free open source version, and an enterprise version that includes additional management features such as a portal, authentication and authorization management, and support for multi-tenancy. Also available as a hosted service.
For more info, see Chef.
An open source data collection system for monitoring large distributed systems, built on top of HDFS and map/reduce. Also includes a flexible and powerful toolkit for displaying, monitoring, and analyzing results.
Has five primary components:
- Agents that run on each machine and emit data.
- Collectors that receive data from the agent and write it to stable storage.
- ETL Processes for parsing and archiving the data.
- Data Analytics Scripts that aggregate Hadoop cluster health information.
- Hadoop Infrastructure Care Center, a web-portal style interface for displaying data.
For more info, see Chukwa.
A free GUI-based file manager and explorer for browsing and accessing Azure storage.
- Also available as a paid-for Professional version that adds encryption, compression, multi-threaded data transfer, file comparison, and FTP/SFTP support.
For more info, see CloudBerry Explorer.
An easy-to-use GUI-based explorer for browsing and accessing Azure storage. Has a wide range of features for managing storage and transferring data, including access to compressed files. Supports auto-resume for file transfers.
- Multithreaded upload and download support.
- Provides full control of data in Azure blob storage, including metadata.
- Auto-resume upload and download of large files.
- No logging features.
For more info, see CloudXplorer.
Cross-platform Command Line Interface (X-plat CLI)
An open source command-line interface for developers and IT administrators to develop, deploy, and manage Azure applications. Supports management tasks on Windows, Linux, and Mac OS X. Commands can be extended using Node.js.
- Can be used to manage almost all features of Azure including accounts, storage, databases, virtual machines, websites, networks, and mobile services.
- The open source license allows for reuse of the library.
- Does not support SSL for data transfers.
- You must add the tool's location to the PATH environment variable.
- On Windows it is easier to use PowerShell.
For more info, see Cross-platform Command Line Interface (X-plat CLI).
A JavaScript library for manipulating documents based on data, commonly used to build interactive data visualizations in the browser.
- Allows you to bind arbitrary data to a Document Object Model (DOM) and then apply data-driven transformations to the document, such as generating an HTML table from an array of numbers, or using the same data to create an interactive SVG bar chart with smooth transitions and interaction.
- Provides a powerful declarative approach for selecting nodes and can operate on arbitrary sets of nodes called selections.
For more info, see D3.js.
A framework for simplifying data management and pipeline processing that enables automated movement and processing of datasets for ingestion, pipelines, disaster recovery, and data retention. Runs on one server in the cluster and is accessed through the command-line interface or the REST API.
- Replicates HDFS files and Hive Tables between different clusters for disaster recovery and multi-cluster data discovery scenarios.
- Manages data eviction policies.
- Uses entity relationships to allow coarse-grained data lineage.
- Automatically manages the complex logic of late data handling and retries.
- Uses higher-level data abstractions (Clusters, Feeds, and Processes) enabling separation of business logic from application logic.
- Transparently coordinates and schedules data workflows using the existing Hadoop services such as Oozie.
For more info, see Falcon.
A client-server based file transfer system that supports common and secure protocols (UDP, FTP, FTPS, HTTP, HTTPS), encryption, bandwidth management, monitoring, and logging.
- FileCatalyst Direct features are available by installing the FileCatalyst Server and one of the client-side options.
- Uses a combination of UDP and TCP, which eliminates the latency issues typically encountered when using only TCP.
For more info, see FileCatalyst.
A distributed, robust, and fault-tolerant tool for efficiently collecting, aggregating, and moving large amounts of log file data. Has a simple and flexible architecture based on streaming data flows, with a tunable reliability mechanism. The simple, extensible data model allows for automation using Java code.
- Includes several plugins to support various sources, channels, sinks and serializers. Well supported third party plugins are also available.
- Easily scaled due to its distributed architecture.
- You must manually configure SSL for each agent. Configuration can be complex and requires knowledge of the infrastructure.
- Provides a monitoring API that supports custom and third party tools.
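A Flume agent is defined through a properties-style configuration file that names its sources, channels, and sinks. The fragment below is a hedged sketch based on the agent-configuration pattern in the Flume documentation; the agent and component names (a1, r1, c1, k1) and the file paths are placeholders:

```
# Name the components of agent "a1" (all names here are arbitrary).
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# A source that tails a hypothetical log file.
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1

# Buffer events in memory between the source and the sink.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Write events to HDFS (the path is a placeholder).
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.channel = c1
```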
For more info, see Flume.
A scalable distributed monitoring system that can be used to monitor computing clusters. It is based on a hierarchical design targeted at federations of clusters, and uses common technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization.
Comprises the monitoring core, a web interface, an execution environment, a Python client, a command line interface, and RSS capabilities.
For more info, see Ganglia.
Hadoop command line
Provides access to Hadoop for executing the standard Hadoop commands. Supports scripting for managing Hadoop jobs, and shows the status of commands and jobs.
- Accessed through a remote desktop connection.
- Does not provide administrative level access.
- Focused on Hadoop and not HDInsight.
- You must create scripts or batch files for operations you want to automate.
- Does not support SSL for uploading data.
- Requires knowledge of Hadoop commands and operating procedures.
For more info, see Commands Manual.
A workflow framework based on directed acyclic graph (DAG) principles for scheduling and managing sequences of jobs by defining datasets and ensuring that each is kept up to date by executing Hadoop jobs.
- Generalizes the programming model for complex tasks through dataflow programming and incremental processing.
- Workflows are defined in XML and can include iterative steps and asynchronous operations over more than one input dataset.
For more info, see Hamake.
Provides a tabular abstraction layer that helps unify the way that data is interpreted across processing interfaces, and provides a consistent way for data to be loaded and stored, regardless of the specific processing interface being used. This abstraction exposes a relational view over the data, including support for partitions.
- Easy to incorporate into solutions. Files in JSON, SequenceFile, CSV, and RC format can be read and written by default, and a custom SerDe can be used to read and write files in other formats.
- Enables notification of data availability, making it easier to write applications that perform multiple jobs.
- Additional effort is required in custom map/reduce components because custom load and store functions must be created.
For more info, see HCatalog.
HDInsight SDK and Microsoft .NET SDK for Hadoop
The HDInsight SDKs provide the capability to create clients that can manage the cluster and execute jobs in it. Available for .NET development and other languages such as Node.js. The WebHDFS client is a .NET wrapper for interacting with WebHDFS-compliant endpoints in Hadoop and Azure HDInsight. WebHCat is the REST API for HCatalog, a table and storage management layer for Hadoop.
Can be used for a wide range of tasks including:
- Creating, customizing, and deleting clusters.
- Creating and submitting map/reduce, Pig, Hive, Sqoop, and Oozie jobs.
- Configuring Hadoop components such as Hive and Oozie.
- Serializing data with Avro.
- Using LINQ to Hive to query and extract data.
- Accessing the Ambari monitoring system.
- Performing storage log file analysis.
- Accessing the WebHCat (for HCatalog) and WebHDFS services.
An abstraction layer over the Hadoop query engine that provides a query language called HiveQL, which is syntactically very similar to SQL and supports the ability to create tables of data that can be accessed remotely through an ODBC connection. Hive enables you to create an interface to your data that can be used in a similar way to a traditional relational database.
- Data can be consumed from Hive tables using tools such as Excel and SQL Server Reporting Services, or through the ODBC driver for Hive.
- HiveQL allows you to plug in custom mappers and reducers to perform more sophisticated processing.
- A good choice for processes such as summarization, ad hoc queries, and analysis on data that has some identifiable structure; and for creating a layer of tables through which users can easily query the source data, and data generated by previously executed jobs.
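As a sketch of what a HiveQL query expresses, the following pairs a hypothetical query (the table and column names are invented for illustration) with an equivalent in-memory aggregation in Python:

```python
# HiveQL is close to SQL. This hypothetical query (table and column
# names are illustrative only) totals sales by region:
hiveql = """
SELECT region, SUM(amount) AS total
FROM sales
GROUP BY region
"""

# The same aggregation expressed over in-memory rows, to show what
# the query computes; in practice Hive runs it as map/reduce jobs
# over data in HDFS.
rows = [("east", 10), ("west", 5), ("east", 7)]

totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0) + amount

print(totals)  # {'east': 17, 'west': 5}
```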
For more info, see Hive.
A distributed, partitioned, replicated service with the functionality of a messaging system. Stores data as logs across servers in a cluster and exposes the data through consumers to implement common messaging patterns such as queuing and publish-subscribe.
- Uses the concepts of topics that are fed to Kafka by producers. The data is stored in the distributed cluster servers, each of which is referred to as a broker, and accessed by consumers.
- Data is exposed over TCP, and clients are available in a range of languages.
- Data lifetime is configurable, and the system is fault tolerant through the use of replicated copies.
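Kafka itself runs as a distributed service, but the log-plus-offset model described above can be illustrated with a minimal in-memory sketch in Python (the class names and messages are invented for illustration):

```python
# An in-memory sketch of Kafka's core ideas: producers append to a
# topic's log, and each consumer tracks its own offset into that log.
# (Real Kafka stores the log across brokers and exposes it over TCP.)

class Topic:
    def __init__(self):
        self.log = []          # append-only message log

    def produce(self, message):
        self.log.append(message)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0        # position of the next unread message

    def poll(self):
        # Return all messages this consumer has not yet seen.
        messages = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return messages

topic = Topic()
topic.produce("event-1")
topic.produce("event-2")

# Two independent consumers each see the full stream, which is how
# publish-subscribe falls out of the model; a shared offset would
# give queue semantics instead.
c1, c2 = Consumer(topic), Consumer(topic)
first = c1.poll()
second = c2.poll()
print(first, second)
```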
For more info, see Kafka.
A system that provides a single point of authentication and access for Hadoop services in a cluster. Simplifies Hadoop security for users who access the cluster data and execute jobs, and for operators who control access and manage the cluster.
- Provides perimeter security to make Hadoop security setup easier.
- Supports authentication and token verification security scenarios.
- Provides users with a single cluster endpoint that aggregates capabilities for data and jobs.
- Enables integration with enterprise and cloud identity management environments.
- Manages security across multiple clusters and multiple versions of Hadoop.
For more info, see Knox.
LINQ to Hive
A technology that supports authoring Hive queries using Language-Integrated Query (LINQ). The LINQ is compiled to Hive and then executed on the Hadoop cluster.
Usage notes: The LINQ code can be executed within a client application or as a user-defined function (UDF) within a Hive query.
For more info, see LINQ to Hive.
A scalable machine learning and data mining library used to examine data files to extract specific types of information. It provides an implementation of several machine learning algorithms, and is typically used with source data files containing relationships between the items of interest in a data processing solution.
- A good choice for grouping documents or data items that contain similar content; recommendation mining to discover user’s preferences from their behavior; assigning new documents or data items to a category based on the existing categorizations; and performing frequent data mining operations based on the most recent data.
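Mahout is a Java library, but the idea behind recommendation mining can be sketched briefly. This Python example (all ratings data is invented for illustration) finds the user most similar to a target user by cosine similarity and recommends that user's unseen items:

```python
# A toy sketch of collaborative-filtering recommendation: score user
# similarity from ratings, then suggest items liked by the most
# similar user. Mahout implements far more scalable versions of this.
import math

ratings = {
    "ann": {"item1": 5, "item2": 3},
    "bob": {"item1": 4, "item2": 3, "item3": 5},
    "cal": {"item2": 1, "item3": 4},
}

def cosine(a, b):
    # Cosine similarity over the items both users have rated.
    shared = set(a) & set(b)
    num = sum(a[i] * b[i] for i in shared)
    den = (math.sqrt(sum(v * v for v in a.values())) *
           math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Find the user most similar to "ann" and recommend their unseen items.
others = [u for u in ratings if u != "ann"]
best = max(others, key=lambda u: cosine(ratings["ann"], ratings[u]))
recommendations = set(ratings[best]) - set(ratings["ann"])
print(best, recommendations)
```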
For more info, see Mahout.
The Azure Management portal can be used to configure and manage clusters, execute HiveQL commands against the cluster, browse the file system, and view cluster activity. It shows a range of settings and information about the cluster, and a list of the linked resources such as storage accounts. It also provides the ability to connect to the cluster through RDP.
Provides rudimentary monitoring features including:
- The number of map and reduce jobs executed.
- Accumulated, maximum, and minimum data for containers in the storage accounts.
- Accumulated, maximum, and minimum data volumes, and the currently running applications.
- A list of jobs that have executed and some basic information about each one.
For more info, see Get started using Hadoop 2.2 in HDInsight.
Map/reduce code consists of two functions: a mapper and a reducer. The mapper is run in parallel on multiple cluster nodes, each node applying it to its own subset of the data. The output from the mapper function on each node is then passed to the reducer function, which collates and summarizes the results of the mapper function.
- A good choice for processing completely unstructured data by parsing it and using custom logic to obtain structured information from it; for performing complex tasks that are difficult (or impossible) to express in Pig or Hive without resorting to creating a UDF; for refining and exerting full control over the query execution process, such as using a combiner in the map phase to reduce the size of the map process output.
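The two-phase model described above can be sketched with the classic word-count example. In Hadoop the mapper instances run across nodes and the framework shuffles their output to the reducers; here both phases run in a single process purely for illustration:

```python
# A word-count sketch of the map/reduce model.

def mapper(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # Collate the mapper output for one key into a summary value.
    return (word, sum(counts))

lines = ["the quick fox", "the lazy dog", "the fox"]

# Shuffle phase: group the mapper output by key (in Hadoop the
# framework performs this between the map and reduce phases).
groups = {}
for line in lines:
    for word, count in mapper(line):
        groups.setdefault(word, []).append(count)

result = dict(reducer(w, c) for w, c in groups.items())
print(result["the"])  # 3
```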
For more info, see Map/reduce.
One of the most commonly used data analysis and visualization tools in BI scenarios. It includes native functionality for importing data from a wide range of sources, including HDInsight (via the Hive ODBC driver) and relational databases such as SQL Server. Excel also provides native data visualization tools, including tables, charts, conditional formatting, slicers, and timelines.
Usage notes: After HDInsight has been used to process data, the results can be consumed and visualized in Excel. Excel can consume output from HDInsight jobs directly from Hive tables in the HDInsight cluster or by importing output files from Azure storage, or through an intermediary querying and data modeling technology such as SQL Server Analysis Services.
For more info, see Microsoft Excel.
Azure SDK for Node.js
A set of modules for Node.js that can be used to manage many features of Azure.
Includes separate modules for:
- Core management
- Compute management
- Web Site management
- Virtual Network management
- Storage Account management
- SQL Database management
- Service Bus management
For more info, see Azure SDK for Node.js.
A tool that enables you to create repeatable, dynamic workflows for tasks to be performed in a Hadoop cluster. Actions encapsulated in an Oozie workflow can include Sqoop transfers, map/reduce jobs, Pig jobs, Hive jobs, and HDFS commands.
- Defining an Oozie workflow requires familiarity with the XML-based syntax used to define the directed acyclic graph (DAG) of workflow actions.
- You can initiate Oozie workflows from the Hadoop command line, a PowerShell script, a custom .NET application, or any client that can submit an HTTP request to the Oozie REST API.
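The XML syntax chains actions into a DAG through "to" attributes. The fragment below is a skeletal (not directly runnable) outline of a workflow definition; the workflow and action names are placeholders, and a real action would contain a body element such as map-reduce, pig, or hive:

```xml
<!-- A skeletal Oozie workflow: the DAG is expressed through the
     "to" attributes that chain start, actions, and end together. -->
<workflow-app name="sample-wf" xmlns="uri:oozie:workflow:0.2">
  <start to="first-action"/>
  <action name="first-action">
    <!-- action body goes here, e.g. a map-reduce or hive element -->
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Workflow failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```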
For more info, see Oozie.
A client-embedded JDBC driver designed to perform low latency queries over data stored in Apache HBase. It compiles standard SQL queries into a series of HBase scans, and orchestrates the running of those scans to produce standard JDBC result sets. It also supports client-side batching and rollback.
- Supports all common SQL query statement clauses including SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and more.
- Supports a full set of DML commands and DDL commands including table creation and versioned incremental table alteration.
- Allows columns to be defined dynamically at query time. Metadata for tables is stored in an HBase table and versioned so that snapshot queries over prior versions automatically use the correct schema.
For more info, see Phoenix.
A high-level data-flow language and execution framework for parallel computation that provides a workflow semantic for processing data in HDInsight. Supports complex processing of the source data to generate output that is useful for analysis and reporting. Pig statements generally involve defining relations that contain data. Relations can be thought of as result sets, and can be based on a schema or can be completely unstructured.
Usage notes: A good choice for restructuring data by defining columns, grouping values, or converting columns to rows; transforming data such as merging and filtering data sets, and applying functions to all or subsets of records; and as a sequence of operations that is often a logical way to approach many map/reduce tasks.
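As a sketch of the dataflow style Pig Latin uses, the following pairs a hypothetical script (the file name and fields are invented for illustration) with equivalent in-memory steps in Python:

```python
# Pig Latin expresses processing as a sequence of named relations.
# This hypothetical script filters records and totals a field:
pig_latin = """
logs   = LOAD 'input/logs.txt' AS (user:chararray, bytes:int);
big    = FILTER logs BY bytes > 100;
byuser = GROUP big BY user;
totals = FOREACH byuser GENERATE group, SUM(big.bytes);
"""

# The same dataflow over in-memory tuples, to show the effect of
# each step; Pig compiles such scripts into map/reduce jobs.
logs = [("ann", 150), ("bob", 50), ("ann", 300)]
big = [(u, b) for u, b in logs if b > 100]       # FILTER
totals = {}
for u, b in big:                                  # GROUP + SUM
    totals[u] = totals.get(u, 0) + b
print(totals)  # {'ann': 450}
```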
For more info, see Pig.
A service for Office 365 that builds on the data modeling and visualization capabilities of PowerPivot, Power Query, Power View, and Power Map to create a cloud-based collaborative platform for self-service BI. Provides a platform for users to share the insights they have found when analyzing and visualizing the output generated by HDInsight, and to make the results of big data processing discoverable for other, less technically proficient, users in the enterprise.
- Users can share queries created with Power Query to make data discoverable across the enterprise through Online Search. Data visualizations created with Power View can be published as reports in a Power BI site, and viewed in a browser or through the Power BI Windows Store app. Data models created with PowerPivot can be published to a Power BI site and used as a source for natural language queries using the Power BI Q&A feature.
- By defining queries and data models that include the results of big data processing, users become “data stewards” who publish the data in a way that abstracts the complexities of consuming and modeling data from HDInsight.
For more info, see Power BI.
An add-in for Excel that is available to Office 365 enterprise-level subscribers. Power Map enables users to create animated tours that show changes in geographically-related data values over time, overlaid on a map.
Usage notes: When the results of big data processing include geographical and temporal fields you can import the results into an Excel worksheet or data model and visualize them using Power Map.
For more info, see Power Map.
An add-in for Excel that you can use to define, save, and share queries. Queries can be used to retrieve, filter, and shape data from a wide range of data sources. You can import the results of queries into worksheets or into a workbook data model, which can then be refined using PowerPivot.
Usage notes: You can use Power Query to consume the results of big data processing in HDInsight by defining a query that reads files from the Azure blob storage location that holds the output of big data processing jobs. This enables Excel users to consume and visualize the results of big data processing, even after the HDInsight cluster has been decommissioned.
For more info, see Power Query.
An add-in for Excel that enables users to explore data models by creating interactive data visualizations. It is also available as a SharePoint Server application service when SQL Server Reporting Services is installed in SharePoint-Integrated mode, enabling users to create data visualizations from PowerPivot workbooks and Analysis Services data models in a web browser.
Usage notes: After the results of a big data processing job have been imported into a worksheet or data model in Excel you can use Power View to explore the data visually. With Power View you can create a set of related interactive data visualizations, including column and bar charts, pie charts, line charts, and maps.
For more info, see Power View.
A data modeling add-in for Excel that can be used to define tabular data models for “slice and dice” analysis and visualization in Excel. You can use PowerPivot to combine data from multiple sources into a tabular data model that defines relationships between data tables, hierarchies for drill-up/down aggregation, and calculated fields and measures.
Usage notes: In a big data scenario you can use PowerPivot to import a result set generated by HDInsight as a table into a data model, and then combine that table with data from other sources to create a model for mash-up analysis and reporting.
For more info, see PowerPivot.
A powerful scripting language and environment designed to manage infrastructure and perform a wide range of operations. Can be used to implement almost any manual or automated scenario. A good choice for automating big data processing when there is no requirement to build a custom user interface or integrate with an existing application. Additional packages of cmdlets and functionality are available for Azure and HDInsight. The PowerShell Integrated Scripting Environment (ISE) also provides a useful client environment for testing and exploring.
When working with HDInsight it can be used to perform a wide range of tasks including:
- Provisioning and decommissioning HDInsight clusters and Azure storage.
- Uploading data and code files to Azure storage.
- Submitting map/reduce, Pig, Hive, Sqoop, and Oozie jobs.
- Downloading and displaying job results.
- It supports SSL and includes commands for logging and monitoring actions.
- Installed with Windows, though early versions do not offer the full performance and range of operations.
- For optimum performance and capabilities all systems must be running the latest version.
- A very well formed and powerful language, but one with a reasonably high learning curve for new adopters.
- You can schedule PowerShell scripts to run automatically, or initiate them on-demand.
For more info, see PowerShell.
Automates repetitive tasks, such as deploying applications and managing infrastructure, both on-premises and in the cloud.
Usage notes: The Enterprise version can automate tasks at any stage of the IT infrastructure lifecycle, including: discovery, provisioning, OS and application configuration management, orchestration, and reporting.
For more info, see Puppet.
Reactive Extensions (Rx)
A library that can be used to compose asynchronous and event-based programs using observable collections and LINQ-style query operators. Can be used to create stream-processing solutions for capturing, storing, processing, and uploading data. Supports multiple asynchronous data streams from different sources.
- Download as a NuGet package.
- SSL support can be added using code.
- Can be used to address very complex streaming and processing scenarios, but all parts of the solution must be created using code.
- Requires a high level of knowledge and coding experience, although plenty of documentation and samples are available.
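Rx itself is a .NET library; purely as a language-neutral sketch of the push-based model it builds on (not the Rx API), this Python example shows an observable pushing items to a subscriber that filters them:

```python
# A minimal observer-pattern sketch of the push-based model behind
# Rx: an observable pushes each item to its subscribers, which can
# filter or transform what they receive. (Class names are invented.)

class Observable:
    def __init__(self):
        self.observers = []

    def subscribe(self, on_next):
        # Register a callback to receive pushed items.
        self.observers.append(on_next)

    def push(self, item):
        # Push the item to every subscriber.
        for on_next in self.observers:
            on_next(item)

received = []
source = Observable()
# A subscriber that keeps only even values, analogous to a filter
# (Where) operator composed onto an observable sequence.
source.subscribe(lambda x: received.append(x) if x % 2 == 0 else None)

for value in [1, 2, 3, 4]:
    source.push(value)

print(received)  # [2, 4]
```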
For more info, see Reactive Extensions (Rx).
Remote Desktop Connection
Allows you to remotely connect to the head node of the HDInsight cluster and gain access to the configuration and command line tools for the underlying Hortonworks Data Platform (HDP), as well as the YARN and NameNode status portals.
- You must specify a validity period after which the connection is automatically disabled.
- Not recommended for use in production applications but is useful for experimentation and one-off jobs, and for accessing Hadoop files and configuration on the cluster.
For more info, see Remote Desktop Connection.
Provide access to Hadoop services and Azure services.
Hadoop REST APIs include WebHCat (HCatalog), WebHDFS, and Ambari.
Azure REST APIs include storage management and file access, and SQL Database management features.
Requires a client tool that can create REST calls, or use of a custom application (typically using the HDInsight SDKs). REST-capable clients include:
- Simple Microsoft Azure REST API Sample Tool.
- Fiddler, a standalone web debugging proxy that can compose REST requests.
- Postman add-in for the Google Chrome browser.
- RESTClient add-in for the Mozilla Firefox browser.
- Utilities such as cURL, which is available for a wide range of platforms including Windows.
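Because the Hadoop REST endpoints are plain HTTP, a call can be composed in any language. This Python sketch only builds a WebHDFS URL; the host name and file path are placeholders, actually sending the request would require network access and valid credentials, and the port may differ between deployments:

```python
# Compose a WebHDFS request URL. WebHDFS URLs take the form:
#   http://<host>:<port>/webhdfs/v1/<path>?op=<OPERATION>
from urllib.parse import urlencode

def webhdfs_url(host, path, operation, port=50070):
    # Build the query string (e.g. op=LISTSTATUS to list a directory).
    query = urlencode({"op": operation})
    return "http://{0}:{1}/webhdfs/v1/{2}?{3}".format(
        host, port, path.lstrip("/"), query)

# Hypothetical host and path, for illustration only.
url = webhdfs_url("headnode.example.com", "/example/data", "LISTSTATUS")
print(url)
```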
A distributed stream processing framework that uses Kafka for messaging and Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
- Provides a very simple callback-based API comparable to map/reduce for processing messages in the stream.
- Pluggable architecture allows use with many other messaging systems and environments.
- Some fault-tolerance features are still under development.
For more info, see Samza.
A system that uses managers and agents to automate media technology and file-based transfers and workflows. Can be integrated with existing IT infrastructure to enable highly efficient file-based workflows.
- Subscription-based software for accelerated movement of large files between users.
- Signiant Managers+Agents is a system-to-system solution that handles the administration, control, management and execution of all system activity, including workflow modeling, from a single platform.
For more info, see Signiant.
Solr
A highly reliable, scalable, and fault-tolerant enterprise search platform from the Apache Lucene project that provides powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (such as Word and PDF) handling, and geospatial search.
- Includes distributed indexing, load-balanced querying, replication, and automated failover and recovery.
- REST-like HTTP/XML and JSON APIs make it easy to use from virtually any programming language.
- Wide ranging customization is possible using external configuration and an extensive plugin architecture.
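The HTTP/JSON API can be illustrated with a minimal query sketch. The host, port, and collection name below are assumptions based on a default local Solr installation, not values from this guide; the curl call is shown as a comment because it needs a running Solr server.

```shell
# Hypothetical Solr host and collection -- substitute your own.
SOLR="http://localhost:8983/solr"
COLLECTION="collection1"
# A full-text query returning JSON; 'q' is the query and 'wt' selects the response format.
QUERY_URL="${SOLR}/${COLLECTION}/select?q=title:hadoop&wt=json"
# The actual call, where a Solr server is running:
#   curl "${QUERY_URL}"
echo "${QUERY_URL}"
```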
For more info, see Solr.
SQL Server Analysis Services (SSAS)
A component of SQL Server that enables enterprise-level data modeling to support BI. SSAS can be deployed in multidimensional or tabular mode; and in either mode can be used to define a dimensional model of the business to support reporting, interactive analysis, and key performance indicator (KPI) visualization through dashboards and scorecards.
Usage notes: SSAS is commonly used in enterprise BI solutions where large volumes of data in a data warehouse are pre-aggregated in a data model to support BI applications and reports. As organizations start to integrate the results of big data processing into their enterprise BI ecosystem, SSAS provides a way to combine traditional BI data from an enterprise data warehouse with new dimensions and measures that are based on the results generated by HDInsight data processing jobs.
For more info, see SQL Server Analysis Services (SSAS).
SQL Server Database Engine
The core component of SQL Server that provides an enterprise-scale database engine to support online transaction processing (OLTP) and data warehouse workloads. You can install SQL Server on an on-premises server (physical or virtual) or in a virtual machine in Azure.
Usage notes: A common work pattern in big data analysis is to provision the HDInsight cluster when it is required, and decommission it after data processing is complete. If you want the results of the big data processing to remain available in relational format for client applications to consume, you must transfer the output generated by HDInsight into a relational database. The SQL Server database engine is a good choice for this when you want to have full control over server and database engine configuration, or when you want to combine the big data processing results with data that is already stored in a SQL Server database.
For more info, see SQL Server Database Engine.
SQL Server Data Quality Services (DQS)
A SQL Server instance feature that consists of knowledge base databases containing rules for data domain cleansing and matching, and a client tool that enables you to build a knowledge base and use it to perform a variety of critical data quality tasks, including correction, enrichment, standardization, and de-duplication of your data.
- The DQS Cleansing component can be used to cleanse data as it passes through a SQL Server Integration Services (SSIS) data flow. A similar DQS Matching component is available on CodePlex to support data deduplication in a data flow.
- Master Data Services (MDS) can make use of a DQS knowledge base to find duplicate business entity records that have been imported into an MDS model.
- DQS can use cloud-based reference data services provided by reference data providers to cleanse data, for example by verifying parts of mailing addresses.
For more info, see SQL Server Data Quality Services (DQS).
SQL Server Integration Services (SSIS)
A component of SQL Server that can be used to coordinate workflows that consist of automated tasks. SSIS workflows are defined in packages, which can be deployed and managed in an SSIS Catalog on an instance of the SQL Server database engine.
SSIS packages can encapsulate complex workflows that consist of multiple tasks and conditional branching. In particular, SSIS packages can include data flow tasks that perform full ETL processes to transfer data from one data store to another while applying transformations and data cleaning logic during the workflow.
- Although SSIS is often used primarily as a platform for implementing data transfer solutions, in a big data scenario it can also be used to coordinate the various disparate tasks required to ingest, process, and consume data using HDInsight.
- SSIS packages are created using the SQL Server Data Tools for Business Intelligence (SSDT-BI) add-in for Visual Studio, which provides a graphical package design interface.
- Completed packages can be deployed to an SSIS Catalog in SQL Server 2012 or later instances, or they can be deployed as files.
- Package execution can be automated using SQL Server Agent jobs, or you can run them from the command line using the DTExec.exe utility.
- To use SSIS in a big data solution, you require at least one instance of SQL Server.
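As a sketch of the command-line option mentioned above, the following assembles hypothetical DTExec.exe invocations. The package paths and server name are illustrative placeholders, not values from this guide.

```shell
# Hypothetical package paths and server name -- substitute your own.
# Run a package deployed to an SSIS Catalog (SQL Server 2012 or later):
CATALOG_CMD='DTExec.exe /ISServer "\SSISDB\BigData\LoadResults.dtsx" /Server "MYSQLSERVER"'
# Run a package deployed as a file:
FILE_CMD='DTExec.exe /File "C:\Packages\LoadResults.dtsx"'
echo "${FILE_CMD}"
```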
For more info, see SQL Server Integration Services (SSIS).
SQL Server Reporting Services (SSRS)
A component of SQL Server that provides a platform for creating, publishing, and distributing reports. SSRS can be deployed in native mode, where reports are viewed and managed in a Report Manager website, or in SharePoint-integrated mode, where reports are viewed and managed in a SharePoint Server document library.
Usage notes: When big data analysis is incorporated into enterprise business operations, it is common to include the results in formal reports. Report developers can create reports that consume big data processing results directly from Hive tables (via the Hive ODBC Driver) or from intermediary data models or databases, and publish those reports to a report server for on-demand viewing or automated distribution via email subscriptions.
For more info, see SQL Server Reporting Services (SSRS).
Sqoop
An easy-to-use tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases. It automates most of this process, relying on the database to describe the schema of the data to be imported, and uses map/reduce to import and export the data in order to provide parallel operation and fault tolerance.
- Simple to use and supports automation as part of a solution.
- Can be included in an Oozie workflow.
- Uses a pluggable architecture with drivers and connectors.
- Can support SSL by using Oracle Wallet.
- Transfers data to HDFS or Hive by using HCatalog.
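A minimal sketch of a Sqoop import follows, assuming hypothetical SQL Server connection details. The assembled command is echoed rather than executed, so you can adapt it and then run it on a node where Sqoop is installed (for example, the cluster head node).

```shell
# Hypothetical SQL Server connection details -- substitute your own.
SQLSERVER="myserver.cloudapp.net"
DB="SalesDW"
TABLE="FactSales"
# A typical bulk import into HDFS; --num-mappers controls the degree of parallelism.
CMD="sqoop import --connect jdbc:sqlserver://${SQLSERVER};database=${DB} --table ${TABLE} --target-dir /data/${TABLE} --num-mappers 4"
echo "${CMD}"
```

In a real import you would also supply credentials (for example with --username and --password-file) rather than embedding them in the command.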
For more info, see Sqoop.
Storm
A distributed real-time computation system that provides a set of general primitives. It is simple, and can be used with any programming language. Supports a high-level abstraction called Trident that provides exactly-once processing, transactional data store persistence, and a set of common stream analytics operations.
- Supports SSL for data transfer.
- Tools are available to create and manage the processing topology and configuration.
- Topology and parallelism must be manually fine-tuned, requiring some expertise.
- Supports logging that can be viewed through the Storm web UI, and a reliability API that allows custom tools and third-party services to provide performance monitoring. Some third-party tools support full real-time monitoring.
For more info, see Storm.
StreamInsight
A component of SQL Server that can be used to perform real-time analytics on streaming and other types of data. Supports the Observable/Observer pattern and an input/output adaptor model, with LINQ processing capabilities and an administrative GUI. It could be used to capture events to a local file for batch upload to the cluster, or to write the event data directly to the cluster storage. Code could append events to an existing file, create a new file for each event, or create a new file based on temporal windows in the event stream.
- Can be deployed on-premises and in an Azure Virtual Machine.
- Events are implemented as classes or structs, and the properties defined for the event class provide the data values for visualization and analysis.
- Logging can be done in code to any suitable sink.
- Monitoring information is obtained by using the diagnostic views API, which requires the Management Web Service to be enabled and connected.
- Provides a complex event processing (CEP) solution out of the box, including debugging tools.
For more info, see StreamInsight.
System Center management pack for HDInsight
Simplifies the monitoring process for HDInsight by providing capabilities to discover, monitor, and manage HDInsight clusters deployed on an Analytics Platform System (APS) Appliance or Azure. Provides views for proactive monitoring alerts, health and performance dashboards, and performance metrics for Hadoop at the cluster and node level.
- Enables near real-time diagnosis and resolution of issues detected in HDInsight.
- Includes a custom diagram view that has detailed knowledge about cluster structure and the health states of host components and cluster services.
- Requires the 2012 or 2012 SP1 version of System Center.
- Provides context-sensitive tasks to stop or start a host component, a cluster service, or all cluster services at once.
For more info, see System Center management pack for HDInsight.
Visual Studio Server Explorer
A feature available in all except the Express versions of Visual Studio. Provides a GUI-based explorer for Azure features, including storage, with facilities to upload, view, and download files.
- Simple and convenient to use.
- No cost solution for existing Visual Studio users.
- Also provides access and management features for SQL Database, useful when using a custom metastore with an HDInsight cluster.
For more info, see Server Explorer.
For more details about pre-processing and loading data, and the considerations you should be aware of, see the section Collecting and loading data into HDInsight of this guide.
For more details about processing the data using queries and transformations, and the considerations you should be aware of, see the section Processing, querying, and transforming data using HDInsight of this guide.
For more details about consuming and visualizing the results, and the considerations you should be aware of, see the section Consuming and visualizing data from HDInsight of this guide.
For more details about automating and managing solutions, and the considerations you should be aware of, see the section Building end-to-end solutions using HDInsight of this guide.