Hi all, seeking some advice. We recently embarked on a big-data platform initiative and chose Cloudera for a POC. We set up Cloudera 5.10.1 on a virtualized platform, and testing with our application looks good. Although we are on a big-data platform, the dataset is small, just a few GB in size.
We've realised the current VM specification (both CPU and memory) is heavily underutilized: only a few percent utilized, and even under a stress test it only shows a short peak at 20%. When this environment was initially set up with generous CPU and memory allocations, the vendor that helped set it up also configured resource allocation per host (e.g. roughly how much CPU each process gets, etc.). Since usage is so low, we'd like to find the minimum feasible specification to run this environment. I am reluctant to shrink it at the VM level immediately, since a more granular resource allocation has already been configured at the Cloudera host level.
My question: can Cloudera's per-host resource allocation be disabled, letting Cloudera manage the processes automatically based on whatever resources the host has? (We are thinking of shrinking the current CPU/memory allocation down to 1/4 of the original.) Appreciate any feedback or comments.
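As a side note, before shrinking to 1/4 it may help to record sustained utilization on each host over a busy period, not just a one-off reading. A minimal sketch of such a sampling check is below; it assumes a Linux host and uses only the standard /proc interfaces (nothing Cloudera-specific), so the thresholds and the 1/4 target are purely illustrative:

```shell
#!/bin/sh
# Sketch: snapshot CPU core count, 1-minute load average, and memory usage
# on a Linux host, to compare against a proposed 1/4-sized allocation.
# Assumes /proc/loadavg and /proc/meminfo exist (standard on Linux).

cores=$(nproc)
load1=$(cut -d' ' -f1 /proc/loadavg)

mem_total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
mem_avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)

echo "cores=$cores load1=$load1"
awk -v t="$mem_total_kb" -v a="$mem_avail_kb" \
    'BEGIN { printf "mem_used_pct=%.1f\n", (t - a) / t * 100 }'
```

Running this periodically (e.g. from cron) during peak hours gives a simple record of whether load stays well under `cores / 4` and memory under a quarter of the current allocation, which is useful evidence before changing either the VM sizing or the per-role allocations.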
While you are waiting for someone with more experience to chime in, allow me to share a few links that may be of some assistance, or at least of interest.
This one is an older blog article but has some good descriptions: How-to: Select the Right Hardware for Your New Hadoop Cluster
Here are a couple of newer blog posts you may take interest in:
I hope this helps, or is at least interesting reading. :)