Expert Contributor

Here are some key practices that help make an HDInsight cluster more manageable and better performing. Note the following best practices.

  • Do not use only one storage account for a given HDInsight cluster. For a 48-node cluster, Microsoft recommends 4-8 storage accounts. This is not for the storage space but because each storage account provides additional networking bandwidth, which opens the pipe as wide as possible so the compute nodes can finish their jobs faster.
  • Make the storage account names as random as possible, with no common prefix. This reduces the chance of hitting storage bottlenecks or common-mode failures across all storage accounts at the same time. This type of storage partitioning in WASB is meant to avoid storage throttling.
  • Use D13 for head nodes, D12 for worker nodes.
  • When containers are created, make sure to only have one container per storage account. This yields better performance.
  • The Hive metastore that is deployed by default with HDInsight is transient: when the cluster is deleted, the Hive metastore is deleted as well. Use Azure DB to store the Hive metastore so that it persists even when the cluster is blown away. Azure DB is basically SQL Server under the hood. Azure DB is not needed only if every cluster is brand new and will not recreate the same tables.
  • When the cluster is scaled down, some services stop and have to be started manually. As far as possible, scale only when no jobs are running.
  • The HDFS namespace recognizes both local storage and WASB storage. It is recommended not to change the DataNode directory in the HDFS configuration, which points to the local SSD storage.
  • NameNodes are not exposed from HDInsight, so distcp cannot be used to transfer data from a remote cluster to HDInsight. Use the WASB driver as much as possible to transfer data from an on-premises cluster to an HDInsight cluster, since it yields better performance.
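
As a rough illustration of the storage advice above, here is a minimal Python sketch that generates prefix-free random storage account names and builds the matching WASB URIs. It only produces strings and does not create anything in Azure. The function names, the container name "data", and the count of four accounts are hypothetical choices made for this example. The naming rule (3-24 characters, lowercase letters and digits) and the wasb://container@account.blob.core.windows.net/path form are the standard Azure conventions.

```python
import secrets
import string
from typing import List

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
ALPHABET = string.ascii_lowercase + string.digits


def random_account_name(length: int = 24) -> str:
    """Return a random, prefix-free storage account name (hypothetical helper)."""
    if not 3 <= length <= 24:
        raise ValueError("storage account names must be 3-24 characters long")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def wasb_uri(account: str, container: str, path: str = "") -> str:
    """Build the WASB URI for a path inside one container of one account."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def plan_storage(count: int, container: str = "data") -> List[str]:
    """Return `count` WASB root URIs, one container per randomly named account."""
    names = set()
    while len(names) < count:  # a set guards against the unlikely duplicate
        names.add(random_account_name())
    return [wasb_uri(name, container) for name in sorted(names)]


if __name__ == "__main__":
    # Assumption: four accounts, the low end of the 4-8 suggested for 48 nodes.
    for uri in plan_storage(4):
        print(uri)
```

The resulting URIs are the addresses a WASB-aware tool on the source side would write to. The accounts themselves, and their access keys, still have to be created and configured separately.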

One thing to note is that only Hadoop services can be stopped. VMs are not exposed and cannot be paused. If the goal is to reduce the cost of a running environment, it is better to delete the cluster and recreate it when needed.

New Contributor

Use an HDInsight on Linux cluster to have control over VMs through Ambari. Take a look at the public preview of Azure Data Lake Store, which gets you past the storage account throttling and total size limits. Because storage and compute are separated, you can move and load data with tools outside the cluster (SSIS, AzCopy, ADF, etc.), even when the cluster doesn't currently exist. Multiple HDInsight clusters plus Azure Data Lake Analytics can all access the same data at the same time.

Thank you for your Microsoft contributions on HCC @Cindy Gross