
Best practice to set up a Hadoop cluster

New Contributor

I am planning to set up a 7-node cluster on Azure VMs. We currently run a 4-node setup and want to expand it. The plan is 2 master nodes and 5 slave nodes, running the services listed below:

1) NameNode
2) Oozie
3) DataNode
4) YARN
5) Spark
6) HBase
7) ZooKeeper
8) Storm
9) Kafka
10) HDFS
11) Hive

I am looking for guidelines on the memory, cores, and storage required for each of the Hadoop services listed above. I need to buy 7 VMs on Azure, and I want to understand, from an infrastructure perspective, how much memory, how many cores, and how much storage would be optimal per service, keeping in mind that more services may be added in the future.

2 REPLIES

Expert Contributor

@satish pujari

Specific HDF-related service recommendations can be found at the following link; this article has been extremely useful for me and hopefully will be for you as well:

https://docs.hortonworks.com/HDPDocuments/HDF3/HDF-3.1.1/bk_planning-your-deployment/content/ch_hard...

Here's a good general read for HDP:

https://community.hortonworks.com/articles/16763/cheat-sheet-and-tips-for-a-custom-install-of-horto....

Regarding some of the specific components you've called out:

NameNode - the HDFS master service. It needs to be on a node with enough cores (probably 16 in your case, though you can get by with 8). RAM requirements are at least 32 GB, preferably 64 GB. You can probably stay with 32 GB given a 7-node cluster, but the Java heap will grow as your NameNode keeps track of a larger number of files distributed across the data nodes. For disk, separate the OS disk from the data disks and follow the HDP guide above.
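To put a rough number on "the heap grows with the number of files", here is a back-of-the-envelope Python sketch using the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per filesystem object; the function name, blocks-per-file ratio, and safety factor are my own assumptions, so plug in your expected file counts:

```python
# Rough NameNode heap estimate -- a back-of-the-envelope sketch, not an
# official formula. Assumes ~150 bytes of heap per filesystem object
# (file, directory, or block) plus a safety factor.

def estimate_namenode_heap_gb(num_files, avg_blocks_per_file=1.5,
                              bytes_per_object=150, safety_factor=2.0):
    """Return a rough NameNode heap size in GB for a given file count."""
    objects = num_files * (1 + avg_blocks_per_file)   # files plus their blocks
    raw_bytes = objects * bytes_per_object
    return raw_bytes * safety_factor / (1024 ** 3)

# Example: ~20 million files
print(f"{estimate_namenode_heap_gb(20_000_000):.1f} GB heap")   # ~14 GB, so 32 GB of RAM leaves headroom
```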

DataNode - the main requirements are multiple disks for parallel I/O and enough cores for parallel block processing.
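As a quick sanity check on how much usable HDFS space those data-node disks actually give you, here is a small sketch; the disk counts and the 25% non-DFS reserve are assumed values for illustration, not a recommendation from the docs:

```python
# Rough usable-capacity check for the data nodes -- assumed disk layout and
# reserve, purely illustrative. Adjust to the Azure disk sizes you actually buy.

def usable_hdfs_capacity_tb(data_nodes, disks_per_node, tb_per_disk,
                            replication=3, non_dfs_reserve=0.25):
    """Raw disk, minus a reserve for OS/logs/intermediate data, divided by replication."""
    raw_tb = data_nodes * disks_per_node * tb_per_disk
    return raw_tb * (1 - non_dfs_reserve) / replication

# Example: 5 data nodes, each with 4 x 1 TB data disks, default 3x replication
print(f"~{usable_hdfs_capacity_tb(5, 4, 1.0):.1f} TB usable HDFS capacity")   # ~5 TB
```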

Hive/YARN/Spark - Both Hive and Spark have computationally intensive workloads. Higher core counts (at least 16) and more RAM (at least 64 GB, but 128 GB is recommended) are important here. YARN NodeManagers will be co-located with the DataNodes, so you will have plenty of disk space on these nodes.
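If it helps with budgeting the worker VMs, here is a rough sketch of how node RAM and cores translate into YARN NodeManager settings and a container count. The reserved amounts and minimum container size are my own assumptions, and this is not the official HDP companion utility:

```python
# Back-of-the-envelope YARN sizing for one worker node -- my own sketch.
# Reserves headroom for the OS and co-located daemons (DataNode, NodeManager,
# possibly an HBase RegionServer).

def yarn_node_settings(node_ram_gb, node_cores,
                       reserved_ram_gb=16, reserved_cores=2, min_container_gb=4):
    """Suggest yarn.nodemanager.resource.* values and a container count per node."""
    nm_ram_gb = node_ram_gb - reserved_ram_gb
    nm_cores = node_cores - reserved_cores
    containers = min(nm_cores, nm_ram_gb // min_container_gb)
    return {
        "yarn.nodemanager.resource.memory-mb": nm_ram_gb * 1024,
        "yarn.nodemanager.resource.cpu-vcores": nm_cores,
        "yarn.scheduler.minimum-allocation-mb": min_container_gb * 1024,
        "containers_per_node": containers,
    }

# Example: the 128 GB / 16-core worker profile suggested above
for key, value in yarn_node_settings(128, 16).items():
    print(key, "=", value)
```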

HBase - needs enough RAM for the write buffers (memstore) as well as the read cache used to serve results; RAM scales up as the number of regions on the node grows. Recommendations are similar to the Hive workloads: more cores for more parallel processing and enough disk to store the regions. A RegionServer should not exceed roughly 200 regions at 10 GB/region, so you don't need more than ~3 TB spread across multiple disks.
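Just to show the arithmetic behind that region guideline (numbers taken straight from the paragraph above, ignoring HDFS replication and compaction temp space):

```python
# Quick region math -- plain arithmetic on the figures above, ignoring HDFS
# replication and compaction temp space, which you should still budget for.

regions_per_server = 200    # rough soft ceiling per RegionServer
region_size_gb = 10         # typical target region size

region_data_tb = regions_per_server * region_size_gb / 1024
print(f"~{region_data_tb:.1f} TB of region data per server")   # ~2 TB, hence ~3 TB of disk with headroom
```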

General recommendations:

Looking at the number of nodes you've got versus the number of services you plan to run, I'd recommend either at least a 12-node cluster (if not 16, which would be preferable) to create more compute capacity, or reducing the number of workloads you start with.

Hope that helps.

Cloudera Employee

Much will depend on your workloads. If you are currently running on 4 nodes, these are presumably lighter workloads, so you don't necessarily have to deal with a lot of concurrent analytic queries. For analytic workloads in production (Hive), it is not uncommon to see 256 GB of memory per node with 36+ cores. For heavy production Spark, which is typically CPU/memory bound, you might see 44+ cores per node with 256 or 512 GB of memory. If you are running light, occasional jobs in batch mode, fully loaded instances will be less important.
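Purely to put those per-node profiles side by side, here is an illustrative aggregate; the worker count of 5 comes from the original plan and the specs are the figures mentioned in this thread, nothing more:

```python
# Illustrative totals only -- assumed 5 worker nodes, specs taken from the
# figures mentioned in this thread.

profiles = {
    "moderate workers (128 GB / 16 cores)": (128, 16),
    "analytic Hive profile (256 GB / 36 cores)": (256, 36),
    "heavy Spark profile (512 GB / 44 cores)": (512, 44),
}

workers = 5
for name, (ram_gb, cores) in profiles.items():
    print(f"{name}: {workers * ram_gb} GB RAM, {workers * cores} cores across {workers} workers")
```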