<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Which cluster configuration is best for hadoop in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Which-cluster-configuration-is-best-for-hadoop/m-p/299678#M219793</link>
    <description>&lt;P&gt;Hi &lt;a href="https://community.cloudera.com/t5/user/viewprofilepage/user-id/55507"&gt;@mRabramS&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The comment about 3 vs. 2 nodes is there to accommodate the default 3x HDFS replication and components like ZooKeeper (ZK) that require 3 instances to form an HA quorum.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If data loss and uptime are important, then I would advise against running it all on 1 node. Obviously, if that node goes down you are in trouble: there is no HA and there are no data replicas to fall back on.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To give you an idea, what we typically recommend as a minimum HA cluster is 5 data nodes and 3 master nodes. Why not 3 data nodes? Because with 3, if one goes down, your cluster is under-replicated and a lot of issues arise (even though you don't lose data). Why not 4? Because if one goes down, you are back at 3 (which is borderline), and it also means the 25% of your data that lived on the failed node now has to be re-replicated across the other 3 data nodes, which is a lot of data movement. We consider 5 a reasonable place to start.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If you don't care about HA and want the minimum viable setup, then 3 data nodes and 1 master can be used. 
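&lt;/P&gt;&lt;P&gt;For reference, the 3x replication mentioned above is controlled by the dfs.replication property in hdfs-site.xml; the value shown below is the default, so this is just an illustration of where that knob lives:&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;dfs.replication&amp;lt;/name&amp;gt;
  &amp;lt;value&amp;gt;3&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P&gt;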
If you want to go even smaller, you can co-locate master and worker services, but it's unrealistic to expect good performance at that point.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;In terms of performance, there are a lot of things to consider:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;running on bare metal performs better than introducing a virtualization layer in the middle (like VMware)&lt;/LI&gt;&lt;LI&gt;separating masters from workers gives better and more predictable performance&lt;/LI&gt;&lt;LI&gt;from an I/O perspective, dedicating disks to certain master services (like ZK) and having more spindles for your data will perform better&lt;/LI&gt;&lt;LI&gt;also from an I/O perspective, if you do run on VMware, mapping local disks to the appropriate VMs so that reads/writes stay effectively local is also preferred&lt;/LI&gt;&lt;LI&gt;tuning is super important&lt;/LI&gt;&lt;LI&gt;etc.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;At the end of the day, though, performance is relative, and it depends on what your applications and SLAs are. You could have a poorly tuned cluster that still runs your workloads within its SLAs.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Sorry for the long post, but I hope this was helpful. In summary, if you are really limited to 1 machine right now, you can still run Hadoop, but you need realistic expectations: this won't be a particularly performant, reliable, or future-proof configuration.&lt;/P&gt;</description>
    <pubDate>Tue, 14 Jul 2020 17:55:25 GMT</pubDate>
    <dc:creator>Ifi</dc:creator>
    <dc:date>2020-07-14T17:55:25Z</dc:date>
  </channel>
</rss>

