Hi, I'm trying to set up a Hadoop cluster with Cloudera for testing, and I'm pretty much completely new to this technology. I read that you can set up a single-node cluster for a proof of concept, but I would like to know the minimum number of nodes and the spec (RAM and disk space) needed for a cluster. As I said, this is just a test, so assume low throughput. So far I've been testing with Cloudera Manager on 2 VM nodes (CentOS 6.4, 8 GB RAM, 20 GB disk), but the installation always fails with a different type of error (Oozie fails to install, heap memory issues, etc.). Is it because of the low number of nodes, not enough RAM, or something else?
Thanks for reaching out on this, and welcome to the world of Hadoop.
I helped someone out a while back with best practices for planning a Hadoop cluster. You can see the discussion here. There's quite a bit of info in my response that should give you a better idea of how to plan your cluster. I’d like to highlight some details, as they address your question about nodes.
- As stated, the bare minimum I’d recommend for a cluster would be five nodes (2 master, 3 worker)
- Total cluster disk capacity for HDFS needs to be at least three times the amount of data you plan on storing in HDFS. This is because of the default replication factor, which is three, so every block is stored three times.
- Certain services require multiple nodes. ZooKeeper needs at least three servers so it can keep a quorum if one of them fails, and HDFS needs at least three DataNodes to satisfy the default replication factor.
- The hardware recommendations are referenced in my community response. I’ll include a link at the bottom for your reference.
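To make the replication arithmetic concrete, here is a rough sketch in Python. The function names are made up for illustration, and it ignores space taken by the OS, logs, and temporary job data, so treat the numbers as an upper bound rather than a sizing tool.

```python
def raw_capacity_needed_gb(data_gb, replication=3):
    """Raw disk space HDFS needs to hold data_gb of data.

    Each block is stored `replication` times (the HDFS default is 3).
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return data_gb * replication


def usable_capacity_gb(disk_gb_per_node, worker_nodes, replication=3):
    """Upper bound on the data that fits in HDFS across the worker nodes.

    Assumes every worker dedicates its whole disk to HDFS, which is
    optimistic: the OS and other services also need space.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return disk_gb_per_node * worker_nodes / replication


if __name__ == "__main__":
    # Storing 100 GB of data at the default replication factor.
    print(raw_capacity_needed_gb(100))
    # Three workers with 20 GB disks each.
    print(usable_capacity_gb(20, 3))
```

Note that with only two worker nodes, HDFS can't place three copies of a block at all, so it will report blocks as under-replicated unless you lower the replication factor.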
As far as testing and proofs of concept go, you can use the QuickStart VM, which is the single-node cluster you mentioned. There's also a guide for setting it up. You can also use a Path B install to build a multi-node cluster for testing and proof-of-concept purposes. This is also the method you would use to set up a full production cluster.
As far as your current test setup goes, I'd like to get some clarification. Did you use the QuickStart VM for your test setup, the Path B install, or another method? Knowing this will give me a better idea of how to approach the issue you're having. It would also be very helpful if you could post any guides you've used so far.
Let me know if this info is helpful, or if you have any questions.
Setting up a cluster like a boss
A good guide for setting up a VM cluster