07-11-2014 03:28 AM
I have a cluster of 8 machines:
head node: Intel core i5 with 8GB memory.
1 i5 machine with 2GB memory
6 Dual core machines with 2-4 GB memory.
I installed CDH5 with the Cloudera Management Services, HDFS, Hive, Oozie and YARN (MR2).
My problem is that the system is quite slow and the memory of the head node is already full, even though I haven't loaded any data into HDFS yet.
Any advice or suggestion on how to make it faster is very much appreciated!
07-11-2014 10:34 AM
Those memory sizes are way too low. You can't run anything serious with so little memory.
Anything under 4 GB is pretty useless. Even 4 GB might be barely usable for very simple tasks, and anything serious will fail.
8 GB might be OK for the master host, since you aren't deploying all services. It's still on the low side though, and you'll probably see issues as your cluster activity grows.
07-11-2014 05:05 PM
By "without cloudera" I assume you mean "without Cloudera Manager". There's no memory benefit to using upstream Hadoop versus Cloudera's Distribution of Hadoop. Running Cloudera Manager only really takes up resources on the hosts that run Cloudera Manager and the monitoring daemons, each of which takes up notable memory. On small clusters, these are placed by default on the first (largest) host, which I would expect is your 8 GB host. For small clusters with low activity, I would expect that to function OK, but definitely not in a production cluster.
2 GB on a host is just not enough to run your DataNode + TaskTracker + MR job + operating system unless your MR job is tiny. It won't work for a production cluster.
So I don't expect you to see notable speed benefits from running without Cloudera Manager. You should get larger machines if you want to run in production.
07-12-2014 12:42 AM
07-13-2014 04:32 PM
4 GB isn't really a huge amount of memory. The cheapest possible consumer desktop from Dell has 4 GB or more RAM already. 1 TB is probably considered huge.
I would not compare Couchbase with MapReduce. MapReduce is more for batch processing, whereas Couchbase is a NoSQL database optimized for latency. MapReduce will never give you subsecond response times. HBase would be a more reasonable comparison with Couchbase. HBase requires HDFS.
I'm not a hardware or performance testing expert, so I can't really say what exactly you'd need to do your test, but I would strongly suspect that your test is not feasible on the current hardware. You have to at least run the HDFS + HBase daemons on your slave nodes, which take up 1 GB and 4 GB respectively by default (at least, those are the defaults Cloudera Manager uses). Leaving some for the OS, that's at least 6 GB of RAM to run with default configurations. Performance tuning, depending on your workload and whatever experts / books say, could change this further.
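To make that arithmetic concrete, here is a rough Python sketch that compares the per-node requirement with the RAM sizes from the original post. The 1 GB and 4 GB daemon figures are the ones quoted above; the 1 GB OS reserve and the host labels are my own assumptions for illustration, so treat the result as a ballpark, not a sizing guide.

```python
# Per-node memory budget in GB. The daemon figures come from this thread;
# the OS reserve is an ASSUMPTION standing in for "leaving some for the OS".
HDFS_DAEMON_GB = 1   # HDFS daemon default quoted above
HBASE_DAEMON_GB = 4  # HBase daemon default quoted above
OS_RESERVE_GB = 1    # assumed value, not a measured one


def required_gb():
    """Total RAM a slave node needs with the default configuration."""
    return HDFS_DAEMON_GB + HBASE_DAEMON_GB + OS_RESERVE_GB


def shortfall_gb(installed_gb):
    """GB missing on a host with the given RAM (0 if it has enough)."""
    return max(0, required_gb() - installed_gb)


# Hypothetical labels for the slave hosts listed in the original post.
hosts = {"i5-slave": 2, "dual-core-low": 2, "dual-core-high": 4}

for name, ram in hosts.items():
    print(f"{name}: {ram} GB installed, short by {shortfall_gb(ram)} GB")
```

With these assumptions, every slave node in the cluster falls short of the default requirement, which is why I suspect the test isn't feasible without more RAM or tuned-down heap sizes.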