<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Which cluster configuration is best for hadoop in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Which-cluster-configuration-is-best-for-hadoop/m-p/299678#M219793</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/55507"&gt;@mRabramS&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The comment about 3 vs. 2 nodes is there to accommodate the default 3x HDFS replication and components like ZooKeeper (ZK) that require 3 instances to form an HA quorum.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If data loss and uptime are important, then I would advise against running it all on 1 node. Obviously, if that node goes down you are in trouble: there is no HA and there are no data replicas to fall back on.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To give you an idea, what we typically recommend as a minimum HA cluster is 5 data nodes and 3 master nodes. Why not 3 data nodes? Because with 3, if one goes down, your cluster is under-replicated and a lot of issues arise (even though you don't lose data). Why not 4? Because if one goes down, you are back at 3 (which is borderline), and it also means the 25% of your data that lived on the failed node now has to be re-replicated across the other 3 data nodes, which is a lot of data movement. We consider 5 a reasonable place to start.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you don't care about HA and want the minimum viable setup, then 3 data nodes and 1 master can be used. 
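&lt;/P&gt;&lt;P&gt;For reference, the 3x replication mentioned above is controlled by the dfs.replication property in hdfs-site.xml; the value shown below is the default, so this is just an illustration of where that knob lives:&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;3&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P&gt;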
If you want to go even smaller, you can co-locate master and worker services, but it's unrealistic to expect good performance at that point.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In terms of performance, there are a lot of things to consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;running on bare metal performs better than introducing a virtualization layer in the middle (like VMware)&lt;/LI&gt;&lt;LI&gt;separating masters from workers gives better and more predictable performance&lt;/LI&gt;&lt;LI&gt;from an I/O perspective, dedicating disks to certain master services (like ZK) and having more spindles for your data will perform better&lt;/LI&gt;&lt;LI&gt;also from an I/O perspective, if you do run on VMware, mapping local disks to the appropriate VMs so that reads/writes stay effectively local is also preferred&lt;/LI&gt;&lt;LI&gt;tuning is super important&lt;/LI&gt;&lt;LI&gt;etc.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;At the end of the day, though, performance is relative, and it depends on what your applications and SLAs are. You could have a poorly tuned cluster that still runs your workloads within its SLAs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry for the long post, but I hope this was helpful. In summary, if you are really limited to 1 machine right now, you can still run Hadoop, but you need realistic expectations: this won't be a particularly performant, reliable, or future-proof configuration.&lt;/P&gt;</description>
    <pubDate>Tue, 14 Jul 2020 17:55:25 GMT</pubDate>
    <dc:creator>Ifi</dc:creator>
    <dc:date>2020-07-14T17:55:25Z</dc:date>
  </channel>
</rss>

