Question 1

How do I provision an HDInsight cluster?

Accepted Answer

To review the HDInsight cluster types and the provisioning methods, see Set up clusters in HDInsight with Apache Hadoop, Apache Spark, Apache Kafka, and more.

Question 2

How do I delete an existing HDInsight cluster?

Accepted Answer

To learn more about deleting a cluster when it's no longer in use, see Delete an HDInsight cluster.

Try to leave at least 30 to 60 minutes between create and delete operations. Otherwise, the delete operation may fail with the following error message:

Conflict (HTTP Status Code: 409) error when attempting to delete a cluster immediately after creation of a cluster. If you encounter this error, wait until the newly created cluster is in operational state before attempting to delete it.
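The wait-then-delete advice above can be sketched with the Azure CLI. This is only an illustration under stated assumptions: the cluster and resource group names are placeholders, the helper names are hypothetical, and treating "Running" as the operational state is an assumption you should check against what `az hdinsight show` reports for your own cluster.

```shell
# Hypothetical helper: succeeds only when the reported cluster state is "Running".
# Assumption: "Running" is the state a healthy, fully provisioned cluster reports.
is_operational() {
  [ "$1" = "Running" ]
}

# Poll the cluster state once a minute, then delete the cluster when it is ready.
# $1 = cluster name, $2 = resource group (both placeholders supplied by the caller).
delete_when_ready() {
  local cluster="$1" group="$2" state
  while true; do
    state="$(az hdinsight show --name "$cluster" --resource-group "$group" \
      --query properties.clusterState --output tsv)" || return 1
    if is_operational "$state"; then
      break
    fi
    echo "Cluster state is '$state'; waiting before delete..." >&2
    sleep 60
  done
  az hdinsight delete --name "$cluster" --resource-group "$group" --yes
}

# Usage (requires a logged-in Azure CLI):
#   delete_when_ready mycluster my-resource-group
```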

Question 3

How do I select the correct number of cores or nodes for my workload?

Accepted Answer

The appropriate number of cores and other configuration options depend on various factors.

For more information, see Capacity planning for HDInsight clusters.

Question 4

What are the various types of nodes in an HDInsight cluster?

Accepted Answer

See Resource types in Azure HDInsight clusters.

Question 5

What are the best practices for creating large HDInsight clusters?

Accepted Answer

We recommend setting up HDInsight clusters with a custom Ambari DB to improve cluster scalability.
Use Azure Data Lake Storage Gen2 when creating HDInsight clusters to take advantage of its higher bandwidth and other performance characteristics.
Headnodes should be sufficiently large to accommodate the multiple master services running on them.
Some workloads, such as Interactive Query, also need larger ZooKeeper nodes. Consider VMs with a minimum of eight cores.
For Hive and Spark, use an external Hive metastore.

Question 6

Can I install additional components on my cluster?

Accepted Answer

Yes. To install additional components or customize cluster configuration, use:

Scripts during or after creation. Scripts are invoked via script action, a configuration option you can use from the Azure portal, HDInsight Windows PowerShell cmdlets, or the HDInsight .NET SDK.
HDInsight Application Platform to install applications.

For a list of supported components see What are the Apache Hadoop components and versions available with HDInsight?

Question 7

Can I upgrade the individual components that are pre-installed on the cluster?

Accepted Answer

If you upgrade built-in components or applications that are pre-installed on your cluster, the resulting configuration won't be supported by Microsoft. These system configurations have not been tested by Microsoft. Try to use a different version of the HDInsight cluster that may already have the upgraded version of the component pre-installed.

For example, upgrading Hive as an individual component isn't supported. HDInsight is a managed service, and many services are integrated with Ambari server and tested. Upgrading Hive on its own causes the indexed binaries of other components to change, and will cause component integration issues on your cluster.

Question 8

Can Spark and Kafka run on the same HDInsight cluster?

Accepted Answer

No, it's not possible to run Apache Kafka and Apache Spark on the same HDInsight cluster. Create separate clusters for Kafka and Spark to avoid resource contention issues.

Question 9

How do I change timezone in Ambari?

Accepted Answer

Open the Ambari Web UI at https://CLUSTERNAME.azurehdinsight.net, where CLUSTERNAME is the name of your cluster.
In the upper-right corner, select admin | Settings.
In the User Settings window, select the new timezone from the Timezone drop down, and then click Save.

Question 10

How can I migrate from the existing metastore to Azure SQL Database?

Accepted Answer

To migrate from SQL Server to Azure SQL Database, see Tutorial: Migrate SQL Server to a single database or pooled database in Azure SQL Database offline using DMS.

Question 11

Is the Hive metastore deleted when the cluster is deleted?

Accepted Answer

It depends on the type of metastore that your cluster is configured to use.

For a default metastore: The default metastore is part of the cluster lifecycle. When you delete a cluster, the corresponding metastore and metadata are also deleted.

For a custom metastore: The lifecycle of the metastore isn't tied to a cluster's lifecycle. So, you can create and delete clusters without losing metadata. Metadata such as your Hive schemas persists even after you delete and re-create the HDInsight cluster.

For more information, see Use external metadata stores in Azure HDInsight.

Question 12

Does migrating a Hive metastore also migrate the default policies of the Ranger database?

Accepted Answer

No, the policy definition is in the Ranger database, so migrating the Ranger database will migrate its policy.

Question 13

Can you migrate a Hive metastore from an Enterprise Security Package (ESP) cluster to a non-ESP cluster, and the other way around?

Accepted Answer

Yes, you can migrate a Hive metastore from an ESP to a non-ESP cluster.

Question 14

How can I estimate the size of a Hive metastore database?

Accepted Answer

A Hive metastore stores the metadata for data sources that are used by the Hive server. The size requirements depend partly on the number and complexity of your Hive data sources, which can't be estimated up front. As outlined in Hive metastore guidelines, you can start with an S2 tier, which provides 50 DTU and 250 GB of storage. If you see a bottleneck, scale up the database.

Question 15

Do you support any database other than Azure SQL Database as an external metastore?

Accepted Answer

No, Microsoft supports only Azure SQL Database as an external custom metastore.

Question 16

Can I share a metastore across multiple clusters?

Accepted Answer

Yes, you can share a custom metastore across multiple clusters as long as they're using the same version of HDInsight.

Question 17

What are the implications of blocking ports 22 and 23 on my network?

Accepted Answer

If you block ports 22 and 23, you won't have SSH access to the cluster. These ports aren't used by the HDInsight service.

For more information, see the following documents:

Question 18

Can I deploy an additional virtual machine within the same subnet as an HDInsight cluster?

Accepted Answer

Yes, you can deploy an additional virtual machine within the same subnet as an HDInsight cluster. The following configurations are possible:

Edge nodes: You can add another edge node to the cluster, as described in Use empty edge nodes on Apache Hadoop clusters in HDInsight.
Standalone nodes: You can add a standalone virtual machine to the same subnet and access the cluster from that virtual machine by using the private end point https://-int.azurehdinsight.net. For more information, see Control network traffic.

Question 19

Should I store data on the local disk of an edge node?

Accepted Answer

No, storing data on a local disk isn't a good idea. If the node fails, all data stored locally will be lost. We recommend storing data in Azure Data Lake Storage Gen2 or Azure Blob storage, or by mounting an Azure Files share for storing the data.

Question 20

Can I add an existing HDInsight cluster to another virtual network?

Accepted Answer

No, you can't. The virtual network should be specified at the time of provisioning. If no virtual network is specified during provisioning, the deployment creates an internal network that isn't accessible from outside. For more information, see Add HDInsight to an existing virtual network.

Question 21

What are the recommendations for malware protection on Azure HDInsight clusters?

Accepted Answer

For information on malware protection, see Microsoft Antimalware for Azure Cloud Services and Virtual Machines.

Question 22

How do I create a keytab for an HDInsight ESP cluster?

Accepted Answer

Create a Kerberos keytab for your domain username. You can later use this keytab to authenticate to remote domain-joined clusters without entering a password. The domain name is uppercase:


ktutil
ktutil: addent -password -p @ -k 1 -e aes256-cts-hmac-sha1-96
Password for @: 
ktutil: wkt .keytab
ktutil: q
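After writing the keytab, you may want to confirm that it works. The sketch below assumes the MIT Kerberos client tools (`klist`, `kinit`) are installed; the user name `bob`, the domain `contoso.com`, the file name `bob.keytab`, and the helper `make_principal` are hypothetical and used only for illustration.

```shell
# Hypothetical helper: build a principal name with the domain in uppercase,
# because the answer above states that the domain name is uppercase.
make_principal() {
  local user="$1" domain="$2"
  printf '%s@%s\n' "$user" "$(printf '%s' "$domain" | tr '[:lower:]' '[:upper:]')"
}

# Usage (assumes bob.keytab was written by ktutil as shown above):
#   principal="$(make_principal bob contoso.com)"
#   klist -k -t -e bob.keytab              # list keytab entries with timestamps and encryption types
#   kinit -k -t bob.keytab "$principal"    # authenticate from the keytab, no password prompt
#   klist                                  # confirm that a ticket was issued
```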

Question 23

When is salting required for AES256 encryption when creating the keytab?

Accepted Answer

If your TenantName & DomainName are different (example TenantName – bob@CONTOSO.ONMICROSOFT.COM & DomainName – bob@CONTOSOMicrosoft.ONMICROSOFT.COM), you need to add a SALT value using the -s option.

Question 24

How do I determine the proper SALT value?

Accepted Answer

Use an interactive Kerberos login to determine the proper salt value for the keytab. Interactive Kerberos login will use the highest encryption by default. Tracing should be enabled to observe the salt. Below is a sample Kerberos login:


$ KRB5_TRACE=/dev/stdout kinit  -V

Look through the output for the salt "......." line.
Use this salt value when creating the keytab.


ktutil
ktutil: addent -password -p @ -k 1 -e aes256-cts-hmac-sha1-96 -s 
Password for @: 
ktutil: wkt .keytab
ktutil: q
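Picking the salt out of a long trace by eye is error-prone, so a small filter can help. The sketch below assumes the trace contains a line with the form `salt "VALUE"`, as described above; the helper name `extract_salt`, the trace file path, and the sample principal are hypothetical.

```shell
# Hypothetical helper: print the first quoted value that follows the word "salt"
# in Kerberos trace output read from standard input.
extract_salt() {
  sed -n 's/.*salt "\([^"]*\)".*/\1/p' | head -n 1
}

# Usage (writes the trace to a file first, then extracts the salt):
#   KRB5_TRACE=/tmp/kinit.trace kinit -V bob@CONTOSO.COM
#   extract_salt < /tmp/kinit.trace
```

Pass the printed value to the -s option of addent when creating the keytab.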

Question 25

Can I use an existing Microsoft Entra tenant to create an HDInsight cluster that has the ESP?

Accepted Answer

Enable Microsoft Entra Domain Services before you can create an HDInsight cluster with ESP. Open-source Hadoop relies on Kerberos for Authentication (as opposed to OAuth).

To join VMs to a domain, you must have a domain controller. Microsoft Entra Domain Services is the managed domain controller, and is considered an extension of Microsoft Entra ID. Microsoft Entra Domain Services provides all the Kerberos requirements to build a secure Hadoop cluster in a managed way. HDInsight as a managed service integrates with Microsoft Entra Domain Services to provide security.

Question 26

Can I use a self-signed certificate in a Microsoft Entra Domain Services secure LDAP setup and provision an ESP cluster?

Accepted Answer

Using a certificate issued by a certificate authority is recommended. But using a self-signed certificate is also supported on ESP. For more information, see:

Question 27

Can I install Data Analytics Studio (DAS) on an ESP cluster?

Accepted Answer

No, DAS is not supported on ESP clusters.

Question 28

How can I pull login activity shown in Ranger?

Accepted Answer

For auditing requirements, Microsoft recommends enabling Azure Monitor logs as described in Use Azure Monitor logs to monitor HDInsight clusters.

Question 29

Can I disable `Clamscan` on my cluster?

Accepted Answer

Clamscan is the antivirus software that runs on the HDInsight cluster and is used by Azure security (azsecd) to protect your clusters from virus attacks. Microsoft strongly recommends that users refrain from making any changes to the default Clamscan configuration.

This process doesn't interfere with or take any cycles away from other processes. It always yields to other processes. CPU spikes from Clamscan should be seen only when the system is idle.

In scenarios in which you must control the schedule, you can use the following steps:

Disable automatic execution using the following command:

sudo /usr/local/bin/azsecd config -s clamav -d Disabled
sudo service azsecd restart
Add a Cron job that runs the following command as root:

/usr/local/bin/azsecd manual -s clamav

For more information about how to set up and run a cron job, see How do I set up a Cron job?
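As a rough illustration of the second step, the sketch below prints a crontab entry that runs the manual scan once a day. The schedule (02:30) and the helper name `clamav_cron_line` are assumptions chosen for the example; only the azsecd command itself comes from the answer above.

```shell
# Hypothetical helper: print a crontab entry that runs the manual ClamAV scan
# every day at the given minute and hour.
clamav_cron_line() {
  local minute="$1" hour="$2"
  printf '%s %s * * * /usr/local/bin/azsecd manual -s clamav\n' "$minute" "$hour"
}

# Usage as root (appends the entry to root's existing crontab):
#   ( crontab -l 2>/dev/null; clamav_cron_line 30 2 ) | crontab -
```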

Question 30

Why is LLAP available on Spark ESP clusters?

Accepted Answer

LLAP is enabled for security reasons (Apache Ranger), not performance. Use larger node VMs to accommodate the resource usage of LLAP (for example, minimum D13V2).

Question 31

How can I add additional Microsoft Entra groups after creating an ESP cluster?

Accepted Answer

There are two ways to achieve this goal:

1. Re-create the cluster and add the additional group at the time of cluster creation. If you're using scoped synchronization in Microsoft Entra Domain Services, make sure the additional group is included in the scoped synchronization.
2. Add the group as a nested subgroup of the group that was used to create the ESP cluster. For example, if you created an ESP cluster with group A, you can later add group B as a nested subgroup of A. After approximately one hour, group B is synced and available in the cluster automatically.

Question 32

Can I add an Azure Data Lake Storage Gen2 to an existing HDInsight cluster as an additional storage account?

Accepted Answer

No, it's currently not possible to add an Azure Data Lake Storage Gen2 storage account to a cluster that has blob storage as its primary storage. For more information, see Compare storage options.

Question 33

How can I find the currently linked Service Principal for a Data Lake storage account?

Accepted Answer

You can find your settings in Data Lake Storage Gen1 access under your cluster properties in the Azure portal. For more information, see Verify cluster setup.

Question 34

How can I calculate the usage of storage accounts and blob containers for my HDInsight clusters?

Accepted Answer

Do one of the following actions:

Use PowerShell
Find the size of the /user/hive/.Trash/ folder on the HDInsight cluster, using the following command line:

hdfs dfs -du -h /user/hive/.Trash/

Question 35

How can I set up auditing for my blob storage account?

Accepted Answer

To audit blob storage accounts, configure monitoring using the procedure at Monitor a storage account in the Azure portal. An HDFS-audit log provides auditing information only for the local HDFS filesystem (hdfs://mycluster). It doesn't include operations that are done on remote storage.

Question 36

How can I transfer files between a blob container and an HDInsight head node?

Accepted Answer

Run a script similar to the following shell script on your head node:

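# Read each path listed in filenames.txt and copy that file from storage.
# No destination is given, so files are copied into the current directory.
while IFS= read -r i
do
   hadoop fs -get "$i"
done < filenames.txt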

Note

The file filenames.txt will have the absolute path of the files in the blob containers.

Question 37

Are there any Ranger plugins for storage?

Accepted Answer

Currently, no Ranger plugin exists for blob storage or for Azure Data Lake Storage Gen1 or Gen2. For ESP clusters, you should use Azure Data Lake Storage. You can at least set fine-grained permissions manually at the file system level using HDFS tools. Also, when using Azure Data Lake Storage, ESP clusters do some of the file system access control using Microsoft Entra ID at the cluster level.

You can assign data access policies to your users' security groups by using the Azure Storage Explorer. For more information, see:

Question 38

Can I increase HDFS storage on a cluster without increasing the disk size of worker nodes?

Accepted Answer

No. You can't increase the disk size of any worker node, so the only way to increase disk size is to drop the cluster and re-create it with larger worker VMs. Don't use HDFS for storing any of your HDInsight data, because the data is deleted if you delete your cluster. Instead, store your data in Azure. Scaling up the cluster can also add capacity to your HDInsight cluster.

Question 39

Can I add an edge node after the cluster has been created?

Accepted Answer

See Use empty edge nodes on Apache Hadoop clusters in HDInsight.

Question 40

How can I connect to an edge node?

Accepted Answer

After you create an edge node, you can connect to it by using SSH on port 22. You can find the name of the edge node from the cluster portal. The names usually end with -ed.
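A minimal sketch of the SSH connection follows. It assumes the edge node is reachable at a host name of the form EDGENODENAME.CLUSTERNAME-ssh.azurehdinsight.net; that form, the helper name `edge_node_host`, and the sample names are assumptions, so confirm the actual host name in the cluster portal.

```shell
# Hypothetical helper: build the SSH host name for an edge node.
# Assumption: the endpoint has the form EDGENODENAME.CLUSTERNAME-ssh.azurehdinsight.net.
edge_node_host() {
  local edge_node="$1" cluster="$2"
  printf '%s.%s-ssh.azurehdinsight.net\n' "$edge_node" "$cluster"
}

# Usage (sshuser is a placeholder for your SSH user name):
#   ssh -p 22 sshuser@"$(edge_node_host new-edgenode mycluster)"
```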

Question 41

Why are persisted scripts not running automatically on newly created edge nodes?

Accepted Answer

You use persisted scripts to customize new worker nodes added to the cluster through scaling operations. Persisted scripts don't apply to edge nodes.

Question 42

What are the REST API calls to pull a Tez query view from the cluster?

Accepted Answer

You can use the following REST endpoints to pull the necessary information in JSON format. Use basic authentication headers to make the requests.

Tez Query View: https://.azurehdinsight.net/ws/v1/timeline/HIVE_QUERY_ID/
Tez Dag View: https://.azurehdinsight.net/ws/v1/timeline/TEZ_DAG_ID/
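For illustration, a request to these endpoints might look like the sketch below. It assumes the cluster name goes in front of `.azurehdinsight.net`, as in the Ambari URL used elsewhere in this FAQ; the helper name `timeline_url`, the cluster name `mycluster`, and the login `admin` are hypothetical.

```shell
# Hypothetical helper: build a timeline URL for an entity type such as
# HIVE_QUERY_ID or TEZ_DAG_ID. An optional third argument names one entity.
timeline_url() {
  local cluster="$1" entity_type="$2" entity_id="${3:-}"
  printf 'https://%s.azurehdinsight.net/ws/v1/timeline/%s/%s\n' \
    "$cluster" "$entity_type" "$entity_id"
}

# Usage (curl prompts for the password and sends basic authentication headers):
#   curl -sS -u admin "$(timeline_url mycluster HIVE_QUERY_ID)"
#   curl -sS -u admin "$(timeline_url mycluster TEZ_DAG_ID)"
```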

Question 43

How do I retrieve the configuration details from an HDI cluster by using a Microsoft Entra user?

Accepted Answer

To negotiate proper authentication tokens with your Microsoft Entra user, go through the gateway by using the following format:

https://.azurehdinsight.net/api/v1/clusters/testclusterdem/stack_versions/1/repository_versions/1

Question 44

How do I use the Ambari REST API to monitor YARN performance?

Accepted Answer

If you call the Curl command in the same virtual network or a peered virtual network, the command is:


curl -u  -sS -G
http://:8080/api/v1/clusters//services/YARN/components/NODEMANAGER?fields=metrics/cpu

If you call the command from outside the virtual network or from a non-peered virtual network, the command format is:

For a non-ESP cluster:


curl -u  -sS -G 
https://.azurehdinsight.net/api/v1/clusters//services/YARN/components/NODEMANAGER?fields=metrics/cpu

For an ESP cluster:


curl -u -sS -G 
https://.azurehdinsight.net/api/v1/clusters//services/YARN/components/NODEMANAGER?fields=metrics/cpu

Note

Curl prompts you for a password. You must enter a valid password for the cluster login username.
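The commands above differ only in the base address, so one way to avoid repeating the long path is a small helper. This is a sketch with assumed values: the helper name `yarn_cpu_metrics_url`, the cluster name `mycluster`, the head node host name, and the login `admin` are all placeholders.

```shell
# Hypothetical helper: build the NodeManager CPU metrics URL from a base address
# (an in-network Ambari address or the public gateway) and a cluster name.
yarn_cpu_metrics_url() {
  local base="$1" cluster="$2"
  printf '%s/api/v1/clusters/%s/services/YARN/components/NODEMANAGER?fields=metrics/cpu\n' \
    "$base" "$cluster"
}

# Usage from inside the virtual network (HEADNODE is a placeholder host name):
#   curl -u admin -sS -G "$(yarn_cpu_metrics_url http://HEADNODE:8080 mycluster)"
# Usage through the public gateway:
#   curl -u admin -sS -G "$(yarn_cpu_metrics_url https://mycluster.azurehdinsight.net mycluster)"
```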

Question 45

How much does it cost to deploy an HDInsight cluster?

Accepted Answer

For more information about pricing and FAQ related to billing, see the Azure HDInsight Pricing page.

Question 46

When does HDInsight billing start & stop?

Accepted Answer

HDInsight cluster billing starts once a cluster is created and stops when the cluster is deleted. Billing is pro-rated per minute.
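To make per-minute proration concrete, the sketch below computes a charge from an hourly rate and a run time in minutes. The rate in the example is made up for illustration and is not an actual HDInsight price; the helper name `cluster_cost` is hypothetical.

```shell
# Hypothetical helper: prorate an hourly rate over a number of minutes.
# cost = hourly_rate * minutes / 60, printed with two decimal places.
cluster_cost() {
  local hourly_rate="$1" minutes="$2"
  awk -v r="$hourly_rate" -v m="$minutes" 'BEGIN { printf "%.2f\n", r * m / 60 }'
}

# Example: a cluster billed at an assumed 3.00 per hour that exists for 90 minutes.
#   cluster_cost 3.00 90    # prints 4.50
```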

Question 47

How do I cancel my subscription?

Accepted Answer

For information about how to cancel your subscription, see Cancel your Azure subscription.

Question 48

For pay-as-you-go subscriptions, what happens after I cancel my subscription?

Accepted Answer

For information about your subscription after it's canceled, see What happens after I cancel my subscription?

Question 49

Why does the Hive version appear as 1.2.1000 instead of 2.1 in the Ambari UI even though I'm running an HDInsight 3.6 cluster?

Accepted Answer

Although only 1.2 appears in the Ambari UI, HDInsight 3.6 contains both Hive 1.2 and Hive 2.1.

Question 50

What does HDInsight offer for real-time stream processing capabilities?

Accepted Answer

For information about integration capabilities of stream processing, see Choosing a stream processing technology in Azure.

Question 51

Is there a way to dynamically kill the head node of the cluster when the cluster is idle for a specific period?

Accepted Answer

You can't do this action with HDInsight clusters. You can use Azure Data Factory for these scenarios.

Question 52

What compliance offerings does HDInsight offer?

Accepted Answer

For compliance information, see the Microsoft Trust Center.

Azure HDInsight: Frequently asked questions

Creating or deleting HDInsight clusters