About Ifi

Ifi · ‎07-15-2020

Hi @mRabramS , Erasure coding is a whole different topic, and there are pros and cons to it, but it's not the solution to your 1-node question. Erasure Coding increases the efficiency by which you use your disk space. So instead of 1TB of data occupying 3TB of disk space, it now can occupy ~1.5TB and still maintains great fault-tolerance. However, it doesn't reduce the number of data nodes needed. On the contrary, it typically requires more nodes (although it's configurable). For example, using the typical RS (6,3) policy, you would require at least 9 data nodes.

Ifi · ‎07-14-2020

Hi @mRabramS , The comment about 3 vs 2 is in order to accommodate the default 3x replication and to accommodate components like ZK that require 3 services for HA. If dataloss and uptime is important, then I would advise against running it all on 1 node. Obviously if the node goes down, you are in trouble. No HA and no data replicas can help you. To give you an idea, what we typically recommend as a minimum HA cluster is 5 data nodes and 3 master nodes. Why not 3? Because with 3, if one goes down, your cluster is under-replicated and a lot of issues arise (even though you don't lose data). Why not 4? Because if one goes down, then you are at 3 (which is borderline) and it also means 25% of your data now has to be replicated on your other 3 data nodes, which is a lot of data movement. We consider 5 to be a reasonable place to start. If you don't care about HA and want to go with minimum viable, then 3 data nodes and 1 master can be used. If you want to go even smaller, you can co-locate master and worker services, but it's unrealistic to have high expectations for performance at that point. In terms of performance, there are a lot of things to consider: running on bare-metal performs better than introducing a virtual layer in the middle (like VMWare) separating masters from workers gives better and more predictable performance from an I/O perspective, dedicating disks to certain master services (like ZK) and having more spindles for your data will perform better also from an I/O perspective, if you do run on VMWare, mapping local disks to appropriate VMs to guarantee basically local reads/writes is also preferred tuning is super important etc... At the end of the day though, performance is relative, and it depends on what your applications and SLAs are. You could have a poorly tuned cluster that still runs your workloads within SLAs. Sorry for the long post, but I hope all this was helpful. In summary, if you are really limited to 1 machine right now, you can still run hadoop, but you need to have realistic expectations that this won't be a particularly performant, reliable, or future-proof configuration.

Ifi · ‎07-14-2020

Good morning, Druid is not currently supported in Cloudera Manager. It is planned to be supported in future CDP versions. I imagine it's possible to install it manually (unsupported) on CDH5/6 but I have not attempted that, so I cannot offer any helpful advice...

Ifi · ‎07-14-2020

A single node cluster will only allow you for a single replica of data. You can configuration the DFS replication to 1, so this is certainly doable. The question is, what are you trying to accomplish? If you are running a simple functional test, and do not care for HA, data loss, or performance, then a single node sandbox is fine. If you want to test HA configuration, then splitting the node in two (better, 3) VMs will be necessary. If you care about performance testing, then you certainly need more than 1 node.

Online	Offline
Last Visited	‎02-05-2021 09:11 AM

Member Since	‎12-07-2015 07:07 AM
Last Visited	‎02-05-2021 09:11 AM
Posts	4
Kudos received	3

Cloudera Community

Re: Manage Druid through Cloudera Manager

Re: Which cluster configuration is best for hadoop

Re: Which cluster configuration is best for hadoop

Re: Manage Druid through Cloudera Manager

Re: Which cluster configuration is best for hadoop