Work in the Apache Hadoop ecosystem on HDInsight from a Windows PC
Learn about development and management options on a Windows PC for working in the Apache Hadoop ecosystem on HDInsight.
HDInsight is based on Apache Hadoop and Hadoop components, open-source technologies developed on Linux. HDInsight version 3.4 and higher uses the Ubuntu Linux distribution as the underlying OS for the cluster. However, you can work with HDInsight from a Windows client or Windows development environment.
Use PowerShell for deployment and management tasks
Azure PowerShell is a scripting environment that you can use to control and automate deployment and management tasks in HDInsight from Windows.
Examples of tasks you can do with PowerShell:
- Create clusters using PowerShell.
- Run Apache Hive queries using PowerShell.
- Manage clusters with PowerShell.
To get the latest version, follow the steps in Install and configure Azure PowerShell.
Utilities you can run in a browser
The following utilities have a web UI that runs in a browser:
- Azure Cloud Shell is an interactive, command-line shell that runs in your browser and from within the Azure portal.
- Apache Ambari Web UI is a management and monitoring utility available in the Azure portal that can be used to manage different kinds of jobs.
Data Lake (Hadoop) Tools for Visual Studio
Use Data Lake Tools for Visual Studio to deploy and manage Storm topologies. Data Lake Tools also installs the SCP.NET SDK, which allows you to develop C# Storm topologies with Visual Studio.
Before you work through the following examples, install Data Lake Tools for Visual Studio and try it out.
Examples of tasks you can do with Visual Studio and Data Lake Tools for Visual Studio:
- Deploy and manage Storm topologies from Visual Studio
- Develop C# topologies for Storm using Visual Studio. The tools include example templates for Storm topologies that connect to databases, such as Azure Cosmos DB and SQL Database.
Visual Studio and the .NET SDK
You can use Visual Studio with the .NET SDK to manage clusters and develop big data applications. You can use other IDEs for the following tasks, but examples are shown in Visual Studio.
Examples of tasks you can do with the .NET SDK in Visual Studio:
- Create clusters and work in HDInsight from a .NET Framework application.
- Run Apache Hive queries using the .NET SDK.
- Use C# user-defined functions with Apache Hive and Apache Pig streaming on Apache Hadoop.
IntelliJ IDEA and Eclipse IDE for Spark clusters
Both IntelliJ IDEA and the Eclipse IDE can be used to:
- Develop and submit a Scala Spark application on an HDInsight Spark cluster.
- Access Spark cluster resources.
- Develop and run a Scala Spark application locally.
These articles show how:
- IntelliJ IDEA: Create Apache Spark applications using the Azure Toolkit for IntelliJ plug-in and the Scala SDK.
- Eclipse IDE or Scala IDE for Eclipse: Create Apache Spark applications using the Azure Toolkit for Eclipse.
Notebooks on Spark for data scientists
Apache Spark clusters in HDInsight include Apache Zeppelin notebooks and kernels that can be used with Jupyter notebooks.
- Learn how to use kernels on Apache Spark clusters with Jupyter notebooks to test Spark applications
- Learn how to use Apache Zeppelin notebooks on Apache Spark clusters to run Spark jobs
Run Linux-based tools and technologies on Windows
If you encounter a situation where you must use a tool or technology that is only available on Linux, consider the following options:
- Bash on Ubuntu on Windows 10 provides a Linux subsystem on Windows. Bash allows you to directly run Linux utilities without having to maintain a dedicated Linux installation. See Windows Subsystem for Linux Installation Guide for Windows 10 for installation steps. Other Unix shells will work as well.
- Docker for Windows provides access to many Linux-based tools, and can be run directly from Windows. For example, you can use Docker to run the Beeline client for Hive directly from Windows. You can also use Docker to run a local Jupyter notebook and remotely connect to Spark on HDInsight. To get started, see Get started with Docker for Windows.
- MobaXterm allows you to graphically browse the cluster file system over an SSH connection.
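As a rough illustration of the Beeline option above, the following Python sketch assembles the command line you might hand to a Beeline client, whether that client runs in a Docker container or in Bash on Windows. The cluster name `mycluster`, the `admin` login, and the helper names are placeholders chosen for this example. The JDBC URL follows the format commonly used to reach HiveServer2 on HDInsight through the cluster's public HTTPS endpoint, so verify it against your own cluster before relying on it.

```python
import shlex
from typing import List


def beeline_jdbc_url(cluster_name: str) -> str:
    """Build a HiveServer2 JDBC URL for an HDInsight cluster.

    Assumption: the cluster is reachable at <cluster_name>.azurehdinsight.net
    on port 443, using SSL and HTTP transport mode.
    """
    host = f"{cluster_name}.azurehdinsight.net"
    return f"jdbc:hive2://{host}:443/;ssl=true;transportMode=http;httpPath=/hive2"


def beeline_command(cluster_name: str, user: str, password: str) -> List[str]:
    """Return the argument list for starting a Beeline session.

    -u is the JDBC URL, -n the cluster login name, -p its password.
    """
    return [
        "beeline",
        "-u", beeline_jdbc_url(cluster_name),
        "-n", user,
        "-p", password,
    ]


if __name__ == "__main__":
    # Placeholder values; do not hard-code real credentials.
    args = beeline_command("mycluster", "admin", "<password>")
    # shlex.join quotes the arguments for a POSIX shell such as Bash.
    print(shlex.join(args))
```

The function only builds the argument list. How you launch Beeline, and which container image provides it, depends on your own setup.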
The Azure command-line interface (CLI) is Microsoft's cross-platform command-line experience for managing Azure resources. For more information, see Azure Command-Line Interface (CLI).
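Because the Azure CLI behaves the same way on Windows, Linux, and macOS, it is easy to drive from a script. The sketch below is one possible way to list HDInsight clusters from Python by calling `az hdinsight list`. It assumes the Azure CLI is installed and that you have already signed in with `az login`. The function names and the `my-resource-group` value are illustrative only.

```python
import shutil
import subprocess
from typing import List, Optional


def az_list_clusters_args(resource_group: Optional[str] = None) -> List[str]:
    """Build the argument list for `az hdinsight list` with JSON output."""
    args = ["az", "hdinsight", "list", "--output", "json"]
    if resource_group:
        # Limit the listing to a single resource group.
        args += ["--resource-group", resource_group]
    return args


def list_clusters(resource_group: Optional[str] = None) -> Optional[str]:
    """Run the command and return its JSON output, or None if `az` is missing."""
    # shutil.which resolves the real executable (az.cmd on Windows).
    az_path = shutil.which("az")
    if az_path is None:
        return None
    args = [az_path] + az_list_clusters_args(resource_group)[1:]
    # check=True raises CalledProcessError if the CLI reports a failure,
    # for example when you are not signed in.
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    output = list_clusters("my-resource-group")  # placeholder name
    print(output if output is not None else "Azure CLI not found on PATH.")
```

Keeping the argument construction separate from the `subprocess` call makes it possible to inspect or log the command without touching Azure.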
If you're new to working in Linux-based clusters, see the following articles: