I've been tasked with setting up a Hadoop cluster for testing a new big data initiative. However, I'm pretty much completely new to all of this. I know that one can set up a single-node cluster for a proof of concept, but I would like to know the minimum number of nodes, and what spec (amount of RAM and disk space), for a proper cluster. Assume low throughput, as it's only an initial test cluster (fewer than 10 users), and we only need the Kafka, HDFS, Pig and Hive services to run.
We can generally spin up CentOS 6 VMs with 4GB RAM each, and I might be able to raise that to 8GB each. But many of the setup pages I've read quote minimums of tens of GB of RAM (e.g. http://blog.cloudera.com/blog/2013/08/how-to-select-the-right-hardware-for-your-new-hadoop-cluster/), while the Cloudera Manager setup only asks for at least 4GB on that node (http://www.cloudera.com/content/www/en-us/documentation/enterprise/5-3-x/topics/cm_ig_cm_requirement...) and says nothing about the other nodes' specs.
Let me know if you need any more information. I realise it's probably too vague as is.
Our initial test servers for the Hadoop cluster are:
1 NameNode (64GB RAM, 24 cores) with 2 HDDs: 1 for the OS, 1 for HDFS storage.
3 DataNodes (each 32GB RAM, 16 cores) with 2 HDDs each: 1 for the OS, 1 for DFS storage.
- The DataNodes are also used for: ZooKeeper, Kafka, Spark, YARN/MapReduce, Impala and the Pig/Hive gateways.
As a best practice for running a Hadoop environment, all servers should be bare metal rather than VMs.
IMHO, you could make the NameNode server smaller, e.g. 32GB of RAM with fewer cores. For the DataNodes, though, I don't recommend going below those specs, especially on memory.
Our test cluster (on Amazon):
- 5 workers: m4.xlarge, 250 GB magnetic disk (we increased the disk to 1 TB afterwards)
* we used one of the 5 machines just for Flume (Kafka)
- 2 masters: m4.2xlarge, 125 GB SSD (we reduced the memory and CPU afterwards by moving to m4.xlarge)
This was perfect for us for testing purposes.
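As a rough sanity check on disk sizes like the ones above, here is a minimal sketch of how raw disk translates into usable HDFS space. It assumes the HDFS default replication factor of 3, that 4 of the 5 workers act as DataNodes (since one was dedicated to Flume/Kafka), and a 25% reserve for non-HDFS data such as temp and shuffle files. The reserve figure and the function name `usable_hdfs_gb` are illustrative assumptions, not numbers from our setup.

```python
def usable_hdfs_gb(data_nodes: int, disk_gb_per_node: float,
                   replication: int = 3,
                   non_dfs_reserve: float = 0.25) -> float:
    """Rough usable HDFS capacity in GB.

    replication=3 is the HDFS default (dfs.replication).
    non_dfs_reserve is an assumed fraction of each disk kept free
    for temp/shuffle/log data; tune it for your own workload.
    """
    raw_gb = data_nodes * disk_gb_per_node
    return raw_gb * (1 - non_dfs_reserve) / replication


# Assumption: 4 workers serve as DataNodes, 250 GB each.
print(usable_hdfs_gb(4, 250))

# Same nodes after growing each disk to 1 TB (taken as 1000 GB).
print(usable_hdfs_gb(4, 1000))
```

Under these assumptions the jump from 250 GB to 1 TB disks quadruples the space actually available for data, which is worth keeping in mind when a test cluster looks large on paper but fills up quickly.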
As usual, it depends on what you need...
The Cloudera VM has 1 node with everything on it, which lets you see it all running...
A fairly simple cluster could have 2 to 3 VMs for CM and the masters, and at least 3 VMs for workers.
As I said, and as you can imagine, it depends on what you want to test on it.
Believe me, you really need a Cloudera admin to get what you want...
Another thread referred to this blog.
I hope this helps.