Support Questions


Which cluster configuration is best for Hadoop?


Hi all,


What is the current best practice for setting up an HDP 3.1 cluster on a single machine with 20 cores and 512 GB of RAM?


Do we still need to split the machine into, e.g., 2 VMs in order to ensure at least 2x DFS replication? Or would a single-node cluster be more efficient?


If I understand correctly, even with a JBOD disk setup for HDFS, a single-node cluster will allow at most 1x replication via the ``dfs.replication`` setting?


What are your recommendations at this point? 




Cloudera Employee

A single-node cluster will only allow a single replica of the data. You can configure the DFS replication factor to 1, so this is certainly doable.
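For reference, the replication factor is typically set in hdfs-site.xml (or through the Ambari UI); a minimal sketch of the relevant property for a single-node setup:

```xml
<!-- hdfs-site.xml: single-node cluster, one replica per block -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```

Note this only affects files written after the change; files that already exist keep their replication factor unless you change it explicitly (e.g. with ``hdfs dfs -setrep``).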


The question is: what are you trying to accomplish? If you are running a simple functional test and do not care about HA, data loss, or performance, then a single-node sandbox is fine. If you want to test an HA configuration, then splitting the node into two (better, three) VMs will be necessary. If you care about performance testing, then you certainly need more than one node.


Thanks for the response! Could you elaborate a little on choosing to split into 3 VMs rather than 2? Apart from allowing 3x replication, would this bring additional performance gains?


While this will be a "dev" cluster, both data durability and performance are important. Depending on the performance and possible bottlenecks, the goal is to expand to additional machines in the future. In the meantime, however, we would want a clean and future-proof start with the existing machine/configuration.

Cloudera Employee

Hi @mRabramS ,


The comment about 3 vs. 2 is in order to accommodate the default 3x replication, and to accommodate components like ZooKeeper (ZK) that require 3 instances for HA.
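To make the ZK point concrete: ZooKeeper needs a strict majority of its servers alive to form a quorum, so an ensemble of n servers tolerates floor((n - 1) / 2) failures. A quick illustrative sketch (not Cloudera tooling, just the arithmetic):

```python
# Why ZooKeeper wants 3 (or any odd number of) servers: a quorum is a
# strict majority, so fault tolerance is floor((n - 1) / 2).

def zk_fault_tolerance(servers: int) -> int:
    """Number of server failures a ZooKeeper ensemble can survive."""
    return (servers - 1) // 2

for n in (1, 2, 3, 4, 5):
    print(f"{n} server(s) tolerate {zk_fault_tolerance(n)} failure(s)")
```

Note that 2 servers tolerate 0 failures, the same as 1, which is why 3 is the practical minimum for HA.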


If data loss and uptime are important, then I would advise against running it all on one node. Obviously, if the node goes down, you are in trouble: no HA and no data replicas can help you.


To give you an idea, what we typically recommend as a minimum HA cluster is 5 data nodes and 3 master nodes. Why not 3 data nodes? Because with 3, if one goes down, your cluster is under-replicated and a lot of issues arise (even though you don't lose data). Why not 4? Because if one goes down, you are back at 3 (which is borderline), and it also means 25% of your data now has to be re-replicated onto the other 3 data nodes, which is a lot of data movement. We consider 5 a reasonable place to start.
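The re-replication point can be sketched with some back-of-the-envelope arithmetic (illustrative only; real HDFS block placement is more complex):

```python
# Rough sketch of why 5 data nodes is a saner minimum than 4.
# Assumes data is spread evenly across the data nodes.

def rereplication_share(num_nodes: int) -> float:
    """Fraction of the cluster's blocks that must be re-replicated
    when a single data node fails."""
    return 1 / num_nodes

# With 4 nodes, losing one means ~25% of all blocks need new replicas,
# and only 3 surviving nodes remain to absorb that traffic.
print(f"4 nodes: {rereplication_share(4):.0%} of data to re-replicate")
print(f"5 nodes: {rereplication_share(5):.0%} of data to re-replicate")
```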


If you don't care about HA and want to go with minimum viable, then 3 data nodes and 1 master can be used. If you want to go even smaller, you can co-locate master and worker services, but it's unrealistic to have high expectations for performance at that point. 


In terms of performance, there are a lot of things to consider:

  • running on bare metal performs better than introducing a virtualization layer in the middle (like VMware)
  • separating masters from workers gives better and more predictable performance
  • from an I/O perspective, dedicating disks to certain master services (like ZK) and having more spindles for your data will perform better
  • also from an I/O perspective, if you do run on VMware, mapping local disks to the appropriate VMs to guarantee essentially local reads/writes is preferred
  • tuning is super important
  • etc.

At the end of the day though, performance is relative, and it depends on what your applications and SLAs are. You could have a poorly tuned cluster that still runs your workloads within SLAs. 


Sorry for the long post, but I hope all this was helpful. In summary, if you are really limited to one machine right now, you can still run Hadoop, but you need to have realistic expectations: this won't be a particularly performant, reliable, or future-proof configuration.


Thank you! This is a very helpful response! 


Regarding the 3x replication for the DFS, is this still relevant with the latest releases of Hadoop? For example, Hadoop 3 introduced Erasure Coding to address this, although I'm not yet sure whether it is used by default in HDP 3 and similar packs/platforms.


The other considerations you mentioned are very useful; I will keep them in mind. It seems like a good idea to start with the minimum viable setup and scale up from there for HA. In my experience, migrating services between nodes is simple enough through Ambari, so that should have us covered.


Cloudera Employee

Hi @mRabramS ,


Erasure coding is a whole different topic, and there are pros and cons to it, but it's not the solution to your 1-node question. 


Erasure Coding increases the efficiency of your disk space usage. So instead of 1 TB of data occupying 3 TB of disk space, it can now occupy ~1.5 TB while still maintaining good fault tolerance. However, it doesn't reduce the number of data nodes needed; on the contrary, it typically requires more nodes (although this is configurable). For example, using the typical RS(6,3) policy, you would need at least 9 data nodes.
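The overhead numbers above fall straight out of the stripe geometry: an RS(6,3) stripe has 6 data units plus 3 parity units, so 9 units of raw storage hold 6 units of logical data. A small sketch (illustrative; it ignores block sizes and small-file effects):

```python
# Storage overhead: 3x replication vs. RS(6,3) erasure coding.

def replication_overhead(replicas: int) -> float:
    """Raw disk used per unit of logical data under n-way replication."""
    return float(replicas)

def ec_overhead(data_units: int, parity_units: int) -> float:
    """Raw disk used per unit of logical data under Reed-Solomon EC."""
    return (data_units + parity_units) / data_units

print(f"3x replication: {replication_overhead(3):.1f}x raw storage")
print(f"RS(6,3):        {ec_overhead(6, 3):.1f}x raw storage")
# Each RS(6,3) stripe writes 9 distinct units, and fault tolerance
# requires each unit on a different node -- hence >= 9 data nodes.
```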