Here are some key practices that help make an HDInsight cluster more manageable and better performing. The following best-practice items should be noted.
Do not use only one storage account for a given HDInsight cluster. For a 48-node cluster, Microsoft recommends 4-8 storage accounts. The reason is not storage space: each storage account provides additional networking bandwidth, which opens the pipe as wide as possible so the compute nodes can finish their jobs faster.
Make the storage account names as random as possible, with no common prefix.
This is to reduce the chances that you hit storage bottlenecks or common mode failures in storage across all storage accounts at the same time. This type of storage partitioning in WASB is meant to avoid storage throttling.
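As a rough illustration of the naming advice above, the following Python sketch generates prefix-free random names. The function names are hypothetical; the only rule assumed is the Azure constraint that storage account names are 3-24 characters of lowercase letters and digits. Creating the accounts themselves is not shown.

```python
import secrets
import string

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
ALPHABET = string.ascii_lowercase + string.digits


def random_account_name(length: int = 24) -> str:
    """Return one random, prefix-free storage account name (hypothetical helper)."""
    if not 3 <= length <= 24:
        raise ValueError("storage account names must be 3-24 characters long")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def random_account_names(count: int, length: int = 24) -> list[str]:
    """Return `count` distinct random names, e.g. 4-8 of them for a 48-node cluster."""
    names: set[str] = set()
    while len(names) < count:
        names.add(random_account_name(length))
    return sorted(names)


if __name__ == "__main__":
    for name in random_account_names(6):
        print(name)
```

Names are drawn independently, so no two accounts share a deliberate prefix. The names still have to be globally unique in Azure, which this sketch does not check.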
Use D13 for head nodes, D12 for worker nodes.
When containers are created, make sure to only have one container per storage account. This yields better performance.
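To make the layout concrete, here is a small Python sketch, under assumptions, of how datasets might be spread across several storage accounts with one container each. The account names, the container name and the hash-based assignment are all hypothetical; the only fixed part is the usual WASB URI shape, wasb://container@account.blob.core.windows.net/path.

```python
import hashlib


def wasb_uri(account: str, container: str, path: str = "") -> str:
    """Build a WASB URI for a path inside the single container of an account."""
    return f"wasb://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def pick_account(dataset: str, accounts: list[str]) -> str:
    """Assign a dataset to one account by hashing its name (illustrative policy only)."""
    digest = hashlib.sha256(dataset.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(accounts)
    return accounts[index]


if __name__ == "__main__":
    # Hypothetical random account names, each holding exactly one container named "data".
    accounts = ["k3v9q1x7", "p0m4z8t2", "w6c2h5n9", "b8r1y3d0"]
    for dataset in ["clicks", "orders", "sessions"]:
        account = pick_account(dataset, accounts)
        print(dataset, "->", wasb_uri(account, "data", dataset))
```

Hashing keeps the assignment stable between runs, so a given dataset always lands in the same account. Any other even distribution scheme would serve the same purpose.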
The Hive metastore that comes by default when HDInsight is deployed is transient: when the cluster is deleted, the Hive metastore is deleted as well. Use Azure DB (Azure SQL Database, which is basically SQL Server under the hood) to store the Hive metastore so that it persists even when the cluster is blown away. The exception is a cluster that is brand new every time and never recreates the same tables; in that case Azure DB is not needed.
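For orientation only, the Python sketch below assembles the standard Hive metastore connection properties for a SQL Server style database. The server, database and credential values are placeholders, and the helper name is hypothetical. On HDInsight the external metastore is normally selected when the cluster is created rather than by editing these properties by hand, so treat this as a picture of what the settings look like, not as a deployment procedure.

```python
def metastore_properties(server: str, database: str, user: str, password: str) -> dict[str, str]:
    """Return Hive metastore JDBC settings for an Azure SQL database (illustrative)."""
    url = (
        f"jdbc:sqlserver://{server}.database.windows.net:1433;"
        f"database={database};encrypt=true;trustServerCertificate=false;"
        "loginTimeout=30"
    )
    return {
        # Standard Hive property names for the metastore connection.
        "javax.jdo.option.ConnectionURL": url,
        "javax.jdo.option.ConnectionDriverName": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }


if __name__ == "__main__":
    # Placeholder values; none of these names come from a real deployment.
    props = metastore_properties("myserver", "hivemeta", "hiveuser", "change-me")
    for key, value in props.items():
        shown = "********" if key.endswith("Password") else value
        print(f"{key} = {shown}")
```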
When the cluster is scaled down, some services stop and have to be started manually. As far as possible, scale only when no jobs are running.
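One way to check for running jobs before scaling is to look at the application list from the YARN ResourceManager REST API (the /ws/v1/cluster/apps resource). The Python sketch below only parses such a response; how the JSON is fetched and authenticated on a given cluster is assumed and not shown, and the function names are hypothetical.

```python
import json

# YARN application states that mean work is still queued or in progress.
ACTIVE_STATES = {"NEW", "NEW_SAVING", "SUBMITTED", "ACCEPTED", "RUNNING"}


def active_applications(payload: str) -> list[str]:
    """Return the ids of applications that are not yet finished."""
    body = json.loads(payload)
    # The ResourceManager returns {"apps": null} when there are no applications.
    apps = (body.get("apps") or {}).get("app") or []
    return [app["id"] for app in apps if app.get("state") in ACTIVE_STATES]


def safe_to_scale(payload: str) -> bool:
    """True when no application is queued or running."""
    return not active_applications(payload)


if __name__ == "__main__":
    sample = json.dumps({
        "apps": {"app": [
            {"id": "application_1_0001", "state": "RUNNING"},
            {"id": "application_1_0002", "state": "FINISHED"},
        ]}
    })
    print("active:", active_applications(sample))
    print("safe to scale:", safe_to_scale(sample))
```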
The HDFS namespace recognizes both local storage and WASB storage. It is recommended not to change the DataNode directory in the HDFS configuration, which points to the local SSD storage.
HDInsight does not expose its NameNodes, so distcp cannot be pointed at the cluster's HDFS to transfer data from a remote cluster. Use the WASB driver as much as possible to transfer data from an on-premises cluster to an HDInsight cluster, since it yields better performance.
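Although the HDInsight NameNodes are out of reach, the source cluster can still write straight to blob storage through the WASB driver. The Python sketch below builds a copy command that would be run on the on-premises side with a wasb:// destination. It assumes the source cluster has the Hadoop Azure module on its classpath and the storage key configured (for example via fs.azure.account.key.<account>.blob.core.windows.net); the host, account and path values are placeholders.

```python
import shlex


def copy_to_wasb_command(source: str, account: str, container: str, dest_path: str) -> list[str]:
    """Build the argv for copying an HDFS path into a blob container (illustrative)."""
    destination = (
        f"wasb://{container}@{account}.blob.core.windows.net/{dest_path.lstrip('/')}"
    )
    # Runs on the source cluster; the storage key is assumed to be configured there.
    return ["hadoop", "distcp", source, destination]


if __name__ == "__main__":
    # Placeholder NameNode address and storage account name.
    cmd = copy_to_wasb_command("hdfs://onprem-nn:8020/data/logs", "k3v9q1x7", "data", "/logs")
    print(shlex.join(cmd))
```

Once the data is in the storage account, the HDInsight cluster reads it through the same wasb:// path, so no connection to its NameNodes is needed.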
Note that only Hadoop services can be stopped. The VMs are not exposed and cannot be paused. If the goal is to reduce the cost of a running environment, it is better to delete the cluster and recreate it when needed.