Need help with Cloudera Cluster sizing


Hi Everyone,

We need to migrate data from an ODS (plus some social media, web analytics, etc.) into Hadoop, and we need to create a cluster for it. The details are below:

  1. It will be Cloudera Enterprise edition, deployed on Azure.
  2. Initial expected data volume is 7.5 TB (includes a replication factor of 3 and an overhead of 20%).
  3. Incremental load is expected to be 1 GB/day.
  4. We are thinking of running Sqoop, Hive, Oozie, Flume, Spark, Kafka, and HBase as well.
  5. Initial workload will be mainly data import and ETL (Spark).
  6. Later, there could be some analytics use cases involving classification, recommendation algorithms, etc.
  7. I have come up with the following sizing (for the production environment).
    NN - NameNode, JN - JournalNode, RM - ResourceManager, ZK - ZooKeeper, CM - Cloudera Manager

Node Type                                        Disk in TB (7200 RPM)                                 RAM (GB)   Cores
NN + JN + RM + ZK                                1 (OS) + 2 (FSImage & edit logs) + 1 (JN) + 1 (ZK)    32         14
Standby NN + JN                                  Same as NN                                            32         14
Edge + CM                                        1                                                     14         4
Cloudera Director node                           1                                                     14         4
Data Nodes (4*3TB; 3 disks of 1 TB per node)     4*3                                                   32         8
(One of the DNs will also be a JN.)
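
For anyone checking the storage numbers above, here is a rough sketch of the arithmetic in Python. It assumes that the 1 GB/day incremental load is a pre-replication figure (so it also gets multiplied by the replication factor and overhead), that 1 TB = 1000 GB, and that the full 4*3 TB on the data nodes is usable by HDFS. The function name `hdfs_headroom` and its parameters are made up for illustration; this is not a Cloudera tool.

```python
def hdfs_headroom(raw_capacity_tb, initial_footprint_tb, daily_ingest_gb,
                  replication=3, overhead=0.20):
    """Rough HDFS capacity check (hypothetical helper, not a Cloudera API).

    initial_footprint_tb is the on-disk size, already including
    replication and overhead. daily_ingest_gb is ASSUMED to be the
    logical (pre-replication) size of each day's load.
    """
    # Each logical byte costs replication * (1 + overhead) bytes on disk.
    multiplier = replication * (1 + overhead)

    # Logical data size implied by the stated on-disk footprint.
    logical_tb = initial_footprint_tb / multiplier

    # On-disk growth per day once replication and overhead are applied.
    daily_footprint_gb = daily_ingest_gb * multiplier

    # Space left after the initial load, and how long it would last.
    free_tb = raw_capacity_tb - initial_footprint_tb
    days_until_full = free_tb * 1000 / daily_footprint_gb  # assumes 1 TB = 1000 GB

    return {
        "logical_tb": logical_tb,
        "daily_footprint_gb": daily_footprint_gb,
        "free_tb": free_tb,
        "days_until_full": days_until_full,
    }


# Figures from the post: 4 data nodes * 3 TB, 7.5 TB initial, 1 GB/day.
result = hdfs_headroom(raw_capacity_tb=4 * 3,
                       initial_footprint_tb=7.5,
                       daily_ingest_gb=1)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions the raw disk is not the tight resource here; the calculation ignores space needed for temporary/shuffle data and any non-HDFS reserve on the data disks, so treat the result as an upper bound.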


1. Can anyone please confirm whether I need to change anything?
2. Is it mandatory to have a separate RM node in production? If yes, what should its configuration be?
3. Can I run Director on the Edge node along with Cloudera Manager?
4. Also, please suggest what I should change (scale down) to set up a Dev environment as well.




New Contributor

My initial thought is that this is a lot of services to run without much RAM on each box.


You mentioned you would be using HBase and 7.5 TB of data. Are you planning on storing all of that data in HBase?


Thanks, Aver, for replying. Initially the data will be dumped into HDFS and, after processing, loaded into HBase (which I am assuming will be less than 2 TB).
