CPU Configuration (cores/speed) for Master and DataNodes

Contributor

Looking for Best Practices!

 

With almost all CDH services (and CM) on the Master node, and YARN's NodeManagers, Spark's Workers, HDFS's DataNodes, and HBase's RegionServers on the Data nodes, what type of CPU configuration would be suitable?

For instance, should I provision the Master host with 20 cores at 3.00GHz (see Intel's Xeon E5-2690 v2 Ivy Bridge processor: 2 CPUs with 10 cores per socket at 3.00GHz)?

Should I provision the Data hosts with 24 cores at 2.70GHz (see Intel's Xeon E5-2697 v2 Ivy Bridge processor: 2 CPUs with 12 cores per socket, but only 2.70GHz)?

 

Again, looking for the ultimate configuration and optimizing both cores and speed...

 

7 REPLIES

Cloudera Employee

On the worker nodes, the number of cores determines the number of YARN containers (MapReduce or Spark) that can run on that node. One could use the amount of memory on the node and the number of disks to pick the number of cores. I haven't looked at the latest recommendations, but I believe 2 cores per disk is reasonable. The memory-to-cores ratio should depend on the workload itself, specifically the average container size.
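The rule of thumb in this reply can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function name `suggest_worker_shape` and the example node (12 data disks, 128 GB of RAM, 4 GB average container) are hypothetical and not taken from the thread; only the "2 cores per disk" figure comes from the reply.

```python
def suggest_worker_shape(num_data_disks, node_memory_gb, avg_container_gb,
                         cores_per_disk=2):
    """Apply the reply's rules of thumb to a single worker node."""
    if num_data_disks <= 0 or node_memory_gb <= 0 or avg_container_gb <= 0:
        raise ValueError("all inputs must be positive")
    # "2 cores per disk" is the heuristic quoted in the reply.
    suggested_cores = num_data_disks * cores_per_disk
    return {
        "suggested_cores": suggested_cores,
        # How many average-sized containers fit in memory.
        "containers_by_memory": int(node_memory_gb // avg_container_gb),
        # Memory-to-cores ratio to compare against the workload.
        "memory_per_core_gb": round(node_memory_gb / suggested_cores, 2),
    }


# Hypothetical node, chosen only for illustration.
example = suggest_worker_shape(12, 128, 4)
print(example)
```

If the memory-bound container count is far above or below the suggested core count, that is a hint that the memory-to-cores ratio does not match the average container size of the workload.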

Karthik Kambatla
Software Engineer, Cloudera Inc.

Contributor

That's a good start 🙂

 

For argument's sake, I am planning on provisioning 1 MASTER node with 2 CPUs (Intel E5-2690 v2 @ 3.00GHz, 10 cores each) and 256GB of RAM. Do I turn CPU multi-threading on? (It is on by default, which means I am getting 40 CPU threads.)

I will configure 4x300GB disks:

2 disks for OS (RAID-1)

2 disks for apps & logs (RAID-1)

DO I NEED TO CONFIGURE ANY DISKS FOR HDFS IN MASTER?

---

For the DATA nodes (3 of them), planning to have the same cpu/ram setting as MASTER.

I will configure 25x300GB disks:

2 disks for OS (RAID-1)

2 disks for apps & logs (RAID-1)

21 disks for HDFS (JBODs)

===

 

Based on the above settings, and given that CM and almost all CDH services will be running on MASTER while DataNodes, Spark Workers, and RegionServers will be running on the DATA nodes, how does it look?

 

Do you have any links or docs to share about the ratio of cores to memory to disks for a given workload?

Also, some useful documentation about configuring YARN's containers would be great!

 

Cheers!

 

Super Collaborator

You do not need to mirror the disks (besides the OS disks) if you are running HDFS HA. On the master nodes, dedicate one disk to HDFS and store all logs on the other disk. A dedicated disk for HDFS gives the best performance, since writes to that disk are synchronous. Also make sure that the CM services store their logs and databases on the disk that does not hold HDFS data.

 

On the DATA nodes, if you have 2 mirrored disks for the OS and thus 300 GB available, I would not use another 2 disks (300 GB mirrored) for apps and logs. Add those 2 disks to your HDFS disks instead. The logs and apps can live on the OS disk on those nodes. If you are going to use Spark, make sure that you use Spark on YARN. We recommend that over standalone mode: it saves resources and it has been tested far more thoroughly.

 

We do have recommendations about vcores/mem/disks in our YARN tuning documentation.
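As a rough illustration of what such tuning produces for a 256GB, 40-thread data node like the one described in this thread, the sketch below derives the two NodeManager capacity properties, `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores`. The amounts held back for the OS, DataNode, and RegionServer (32 GB and 4 threads) are assumptions made for the example, not figures from the tuning documentation, and the function name is hypothetical.

```python
def nodemanager_resources(total_memory_gb, hardware_threads,
                          reserved_memory_gb=32, reserved_threads=4):
    """Derive NodeManager capacity after holding back a reservation.

    The default reservations are illustrative assumptions only.
    """
    if reserved_memory_gb >= total_memory_gb or reserved_threads >= hardware_threads:
        raise ValueError("reservation leaves nothing for YARN")
    return {
        # Memory YARN may hand out to containers on this node, in MB.
        "yarn.nodemanager.resource.memory-mb":
            (total_memory_gb - reserved_memory_gb) * 1024,
        # Virtual cores YARN may hand out to containers on this node.
        "yarn.nodemanager.resource.cpu-vcores":
            hardware_threads - reserved_threads,
    }


# Data node from the thread: 256 GB RAM, 40 hardware threads.
settings = nodemanager_resources(256, 40)
for name, value in settings.items():
    print(f"{name} = {value}")
```

The real reservation depends on which other roles run on the host, so the defaults here should be replaced with values worked out from the tuning guide.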

 

Wilfred

Contributor

Wilfred thank you!

 

Some clarifications.

 

MASTER Node Disk Layout (Total of 4x300GB HDs)

================

-- 2 disks for OS (RAID-1)

-- 1 disk for apps & logs (CM's logs etc...)

-- 1 disk (JBOD) for HDFS (what will be stored here?????)

 

DATA Nodes Disk Layout (Total of 25x300GB HDs)

===============

-- 2 disks for OS (RAID-1)

-- 23 disks (JBODs) for HDFS

( --1. Does it make a difference if # of disks is even or odd??)

( --2. Should I go for higher capacity of disks and less # of them, i.e. 6x1.2TB HDs ??)

 

DEFINITELY SPARK ON YARN!!!!

 

The link for YARN tuning configuration is great!!!

Please provide a link for tuning network traffic within the cluster (data movement among nodes in the cluster vs. data ingestion from sources).

 

Super Collaborator

On the master node, HDFS will store things like the FSImage, the edit log, and other related metadata files on that disk. They are not huge, but they need quick access.

 

For the DN:

- Even or odd does not matter, it can handle what you give it.

- The number of spindles (disks) is important for the number of containers you can run on the host. We normally say that about 2 containers per disk can be supported. Since you have a large number of CPU cores and a lot of memory, having a larger number of disks will allow you to run more containers on the node. Decreasing the number of disks means you should also lower the number of containers. Looking at the CPU cores and disks: they seem to be nicely balanced the way you have it now with the 300GB disks.
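To make the trade-off concrete, here is a small sketch comparing the two layouts from the question (23 x 300GB versus 6 x 1.2TB) using the "about 2 containers per disk" rule. Capping the result at one container per hardware thread is an extra assumption added for the example, and `container_limits` is a hypothetical helper name.

```python
def container_limits(hdfs_disks, hardware_threads, containers_per_disk=2):
    """Estimate how many containers a node can support.

    by_disk follows the "about 2 containers per disk" rule from the reply;
    by_cpu assumes at most one container per hardware thread (an assumption).
    """
    by_disk = hdfs_disks * containers_per_disk
    return {
        "by_disk": by_disk,
        "by_cpu": hardware_threads,
        "limit": min(by_disk, hardware_threads),
    }


# The two disk layouts discussed in the thread, on a 40-thread node.
layouts = {"23 x 300GB": 23, "6 x 1.2TB": 6}
results = {label: container_limits(disks, hardware_threads=40)
           for label, disks in layouts.items()}
for label, result in results.items():
    print(label, result)
```

Under these assumptions the 23-disk layout is limited by CPU threads, while the 6-disk layout is limited by spindles and leaves most of the cores idle.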

 

Wilfred

Contributor

Cool!

 

I'll do the same for SNN's HDFS disk.

<Q1> How does Hadoop know which HDFS folder/file to use? The one(s) in MASTER or the one(s) in DATA nodes??

         Is it the HDFS parameter 'dfs.namenode.edits.dir' that will be set to the HDFS directory created on MASTER??

        (I guess based on RF Replication Factor files could be anywhere...)

        (Definitely will be faster for MASTER if it has to write to its own local disks...)

 

<Q2> Should I use RAID-1 for the 2nd 300GB disk (the one that will hold CM's logs) at MASTER?

          (I guess I should!)

 

 

Super Collaborator (accepted solution)

A1: Check the HDFS Design page for details on what is stored where. The edits log and file system image are on the NN. Look for the section on the persistence of file system metadata. For more detail on setting up the cluster, follow Cluster Setup.

 

A2: If you have the disks available, mirroring will make it more resilient. Making a backup is still a good idea 😉
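For reference, the NameNode's local metadata locations are controlled by `dfs.namenode.name.dir` (file system image) and `dfs.namenode.edits.dir` (edits log, which defaults to the same location as the name directory), while the DataNodes' block storage is controlled by `dfs.datanode.data.dir`. The sketch below only renders an `hdfs-site.xml` style fragment; every path in it is a hypothetical placeholder, and in a CM-managed cluster these values would normally be set through Cloudera Manager rather than by hand.

```python
import xml.etree.ElementTree as ET


def hdfs_site_fragment(props):
    """Render a dict of property names and values as a <configuration> block."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# All paths below are hypothetical placeholders for illustration.
props = {
    # NameNode (MASTER): local directory holding the file system image.
    "dfs.namenode.name.dir": "/data/nn/name",
    # NameNode (MASTER): local directory holding the edits log.
    "dfs.namenode.edits.dir": "/data/nn/edits",
    # DataNodes: comma-separated list, one directory per JBOD disk.
    "dfs.datanode.data.dir": "/data/1/dfs/dn,/data/2/dfs/dn",
}
xml_text = hdfs_site_fragment(props)
print(xml_text)
```

`dfs.namenode.name.dir` also accepts a comma-separated list of directories, in which case the image is written to each of them for redundancy; that is another option alongside the mirrored disk mentioned in A2.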

 

Wilfred