Minimum number of nodes, and specs for a real cluster

I've been tasked with setting up a Hadoop cluster for testing a new big data initiative. However, I'm pretty much completely new to all of this. I know that one can set up a single-node cluster for proof of concept, but I would like to know the minimum number of nodes, and what spec (amount of RAM and disk space), for a proper cluster. Assume low throughput, as it's only an initial test cluster (fewer than 10 users). We only need the Kafka, HDFS, Pig and Hive services to run.


We generally have the ability to spin up CentOS 6 VMs with 4GB RAM each, and I might be able to up that to 8GB each. But many of the setup pages I've read quote minimums of tens of GB of RAM, while the Cloudera Manager setup only asks for at least 4GB on that node and mentions nothing about the other nodes' specs.


I'm in a similar situation, so I'm also interested in any feedback on Ed's question.


Our initial test setup for a Hadoop cluster is:


1 NameNode (64GB RAM + 24 cores) + 2 HDDs: 1 for the OS, 1 for HDFS storage.

3 DataNodes (each 32GB RAM + 16 cores) + 2 HDDs each: 1 for the OS, 1 for HDFS storage.

   - the DataNodes are also used for: ZooKeeper, Kafka, Spark, YARN/MapReduce, Impala and the Pig/Hive gateway.


As a best practice for running a Hadoop environment, all servers should be bare metal rather than VMs.


IMHO, you could maybe make the NameNode server smaller, say 32GB of RAM with fewer cores. But for the DataNodes, I don't recommend going below the specs above, especially for memory.

Our test cluster (on Amazon):

- 5 workers, m4.xlarge, 250 GB magnetic disk (we increased the disk to 1 TB afterwards)

           * we used one of the 5 machines just for Flume (Kafka)

- 2 masters, m4.2xlarge, 125 GB SSD (we reduced the memory and CPU afterwards by moving to m4.xlarge)


This was perfect for us for testing purposes.
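
As a rough sanity check on disk sizing like this, usable HDFS capacity can be estimated from the raw disk. The sketch below is only a back-of-the-envelope illustration, not a sizing rule: it assumes the HDFS default replication factor of 3, assumes each worker's whole disk is given to HDFS, and uses a 25% headroom for temporary files and logs that is a made-up figure. The function name is hypothetical.

```python
def usable_hdfs_capacity_gb(data_nodes, disk_gb_per_node,
                            replication=3, headroom=0.25):
    """Estimate usable HDFS capacity in GB.

    replication: HDFS block replication factor (3 is the HDFS default).
    headroom: fraction of raw disk kept free for non-HDFS use
              (assumed value, adjust to your own workload).
    """
    if data_nodes < 1 or replication < 1:
        raise ValueError("need at least one data node and replication >= 1")
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    raw_gb = data_nodes * disk_gb_per_node
    # Every block is stored `replication` times, so divide the raw space.
    return raw_gb * (1 - headroom) / replication


# 4 workers holding data (5 minus the one kept for Flume/Kafka), 250 GB each
print(usable_hdfs_capacity_gb(4, 250))
# Same workers after growing the disks to 1 TB (taken as 1000 GB)
print(usable_hdfs_capacity_gb(4, 1000))
```

Under those assumptions, a quarter of the raw disk is set aside and the rest is divided by three, which is one reason a few hundred GB per worker fills up quickly even on a test cluster.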



Ed, what did you choose in the end? I'm in a similar position here. We don't have BIG data yet (only a couple of TB), but we're planning for the future. Thinking of using Impala on top.

As usual, it depends on what you need...

The Cloudera VM runs everything on 1 node, so it lets you see how it all works...

A fairly simple cluster could have 2 to 3 VMs for CM and the masters, and at least 3 VMs for workers.

As I said, and as you can imagine, it depends on what you want to test on it.

Believe me, you really need a Cloudera admin to get what you want...

Another thread referred to this blog.

I hope this helps.