You need to size and price machines and storage separately.
Use Linux VMs on Azure (not to be confused with the Ubuntu Beta offering on HDInsight).
If performance is a must, especially with Kafka and Storm, use Premium Storage, not Standard. Make sure to request Premium Storage (see link below).
Do not use A8 machines. Use either A10 or A11. A8 is backed by InfiniBand, which is more expensive and unnecessary for Hadoop.
The D Series and the newer D_v2 Series are recommended if Solid State Drives are needed.
For Premium Storage, use the DS_v2 Series.
Page Blob Storage is recommended for HBase, as opposed to Block Blob Storage. See link below.
Both options will need attached Blob Storage. The 382 GB local disk that comes with the VM is just for temp storage.
Attached Blob Storage comes in 1023 GB disks. Each VM size has a maximum
number of these disks that can be attached, e.g. A10 VMs can have a
maximum of 16 * 1 TB of storage. See the following for more details:
Use Availability sets for master and worker nodes
Use one storage account for every node in the cluster in
order to bypass IOPS limits for multiple VMs on the same storage account
You can also try Azure Data Lake Store (with adl://) to
check the performance of the new Azure service.
Also keep in mind the maintenance windows of each Azure region
relative to where your customers are: some regions can be a good choice
for new service availability (e.g. US East 2) but not from a
maintenance point of view (especially for European customers)
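The attached-disk limits in the list above translate into a simple capacity calculation. The sketch below is only an illustration: the 1023 GB disk size and the 16-disk limit for A10 come from the text, while the replication factor of 3 (the HDFS default) and the 25% headroom are assumptions you should replace with your own values. The function names are hypothetical.

```python
import math

DISK_SIZE_GB = 1023    # size of one attached data disk (from the text above)
MAX_DISKS_A10 = 16     # maximum attached disks on an A10 VM (from the text above)


def raw_capacity_per_node_gb(disks_per_node: int,
                             disk_size_gb: int = DISK_SIZE_GB) -> int:
    """Raw attached capacity of one node; the local temp disk is ignored."""
    return disks_per_node * disk_size_gb


def data_nodes_needed(dataset_gb: float, disks_per_node: int,
                      replication: int = 3, headroom: float = 0.25) -> int:
    """Estimate how many data nodes are needed to hold dataset_gb.

    replication and headroom are assumptions for illustration only:
    3 is the HDFS default replication factor, and headroom is the
    fraction of each node's disk space kept free.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    usable_per_node = raw_capacity_per_node_gb(disks_per_node) * (1 - headroom)
    return math.ceil(dataset_gb * replication / usable_per_node)
```

For example, raw_capacity_per_node_gb(MAX_DISKS_A10) gives 16368 GB per A10 node, and data_nodes_needed(100_000, MAX_DISKS_A10) estimates 25 data nodes for a 100 TB dataset under these assumptions. With one storage account per node, that is also the number of storage accounts to plan for.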
Recommendation 1 - Best Compute performance for Batch and Real Time Use Cases
For Head/Master Nodes Use:
Standard_D13_v2 (8 CPU, 56 GB) or
Standard_D5_v2 (16 CPU, 56 GB) or
Standard_D14_v2 (16 CPU, 112 GB)
For Data Nodes Use:
Standard_D14_v2 (16 CPU, 112 GB) or
Standard_DS14_v2 (16 CPU, 112 GB with Premium Storage) or
Standard_DS15_v2 (20 CPU, 140 GB with Premium Storage)
If testing Kafka and Storm, use Standard_DS13_v2, Standard_DS14_v2 or Standard_DS15_v2 with Premium Storage, especially if performance is needed to meet SLAs
Pros: CPU is about 35% faster than D Series; Local SSD Disks; VMs cheaper per hour than A or D series.
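The data-node choices in Recommendation 1 can be captured as a small lookup. This is a minimal sketch, not a sizing tool: the sizes and their CPU and memory figures are copied from the list above, while the VmSize type, the pick_data_node helper and its "smallest size that fits" rule are hypothetical and only for illustration.

```python
from typing import NamedTuple, Optional


class VmSize(NamedTuple):
    name: str
    cpus: int
    memory_gb: int
    premium_storage: bool


# Data-node sizes copied from Recommendation 1 above.
DATA_NODE_SIZES = [
    VmSize("Standard_D14_v2", 16, 112, False),
    VmSize("Standard_DS14_v2", 16, 112, True),
    VmSize("Standard_DS15_v2", 20, 140, True),
]


def pick_data_node(min_cpus: int, min_memory_gb: int,
                   need_premium: bool) -> Optional[VmSize]:
    """Return the smallest listed size that meets the requirements, or None."""
    candidates = [
        size for size in DATA_NODE_SIZES
        if size.cpus >= min_cpus
        and size.memory_gb >= min_memory_gb
        and (size.premium_storage or not need_premium)
    ]
    # Prefer fewer CPUs, then less memory, then no Premium Storage.
    return min(candidates,
               key=lambda s: (s.cpus, s.memory_gb, s.premium_storage),
               default=None)
```

For a Kafka or Storm test that needs Premium Storage, pick_data_node(16, 100, True) selects Standard_DS14_v2; without the Premium Storage requirement the same call selects Standard_D14_v2.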
Recommendation 2 - Good Compute performance
Use Standard_D13 (8 CPU, 56 GB) or Standard_D14 (16 CPU, 112 GB) for Head/Master nodes and Standard_D14 (16 CPU, 112 GB) for Data Nodes
If testing Kafka and Storm, use Standard_DS13 (8 CPU, 56 GB) or Standard_DS14 (16 CPU, 112 GB) with Premium Storage, especially if performance is needed to meet SLAs
Pros: 60% faster than A series; Local SSD Disks
Why pick this if it is slightly more expensive per hour than the D_v2 Series?
Recommendation 3 - Mostly for Batch performance
Use A10 or A11 for Head/Master nodes and A11 for Data Nodes
Microsoft's pricing effectively steers you toward the D_v2 Series