I've been trying to deploy an all-in-one HDP machine using Ambari on AWS, similar to the sandbox, using the latest versions of Ambari and HDP. I've found that if I use a Large instance size on AWS (8 GB of RAM and 2 vCPUs/cores), the Ambari server crashes every time during deployment, breaking the installation and requiring a total rebuild.
However, using an XL instance with 16 GB of RAM and 4 vCPUs/cores, the installation works fine every time and I get no errors. I have tested this three times, running installations on both instance sizes side by side (XL vs. Large).
* The services installed on the single node are: HDFS, YARN + MapReduce2, Hive, HBase, Pig, Sqoop, Oozie, ZooKeeper, Flume, Ambari Infra, Ambari Metrics, Kafka, SmartSense, Spark and Spark2.
What is strange about this is that I can't find any reference to these hardware requirements anywhere, and you can run most of these services in the sandbox with less memory and fewer cores. Hortonworks and most big data trainers drill into you that Hadoop will 'run on commodity hardware', but these hardware constraints seem to suggest otherwise.
Can anyone shed any light on this?
Hi Calvin. Before I type anything else, please realize that I do not know YOUR specific use cases, but... I doubt anyone will argue with me that very few of them would really make sense to run on a single-node (aka pseudo-distributed) cluster. If all of your data fits on one machine and everything runs within the constraints of 8 GB of memory, then... quite possibly you just don't need Hadoop for that scenario. Additionally, even HDFS cannot do what it is supposed to in a single-node configuration, since there are no additional nodes for replication to occur on.
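As a side note on that replication point: HDFS defaults to a replication factor of 3, which a single node can never satisfy, so single-node setups typically dial it down. A sketch of that adjustment in `hdfs-site.xml` (the property name `dfs.replication` is standard Hadoop; treating 1 as the right value here is an assumption for a one-node experiment, not an official Sandbox setting):

```xml
<!-- hdfs-site.xml: single-node sketch only.
     With one DataNode, replication > 1 leaves every block
     permanently under-replicated, so drop it to 1. -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>
```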
All that said, the HDP Sandbox is a way to jumpstart your initial hands-on efforts with Hadoop and to provide a playground for our publicly available tutorials and similarly sized & scoped investigative activities you may undertake. A full HDP stack takes far more resources than a single server with simple laptop- or desktop-class characteristics typically provides. The Sandbox team makes MANY configuration adjustments to try to shoehorn the whole stack into a single image. In fact, you'll notice that not all services are running at any given time, which is a pattern I'd recommend (start only what you need for an experiment and stop everything else).
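That "start only what you need" pattern can be scripted against Ambari's REST API, which stops a service by setting its desired state to `INSTALLED`. A minimal sketch in Python that only builds the request (host, port, cluster name, and the service name in the example are assumptions you'd replace with your own; actually sending it requires your Ambari credentials and an `X-Requested-By` header):

```python
import json

# Assumed values for illustration -- adjust to your environment.
AMBARI_HOST = "localhost"   # hypothetical Ambari server host
AMBARI_PORT = 8080          # Ambari's default HTTP port
CLUSTER = "Sandbox"         # hypothetical cluster name

def stop_service_request(service):
    """Build the URL and JSON body for an Ambari 'stop service' PUT.

    Ambari treats a desired state of "INSTALLED" as 'stopped'
    (running services have state "STARTED").
    """
    url = (f"http://{AMBARI_HOST}:{AMBARI_PORT}"
           f"/api/v1/clusters/{CLUSTER}/services/{service}")
    body = {
        "RequestInfo": {"context": f"Stop {service} via REST"},
        "Body": {"ServiceInfo": {"state": "INSTALLED"}},
    }
    return url, json.dumps(body)

url, payload = stop_service_request("HBASE")
print(url)
print(payload)
```

You would PUT that payload to the URL with an HTTP client of your choice; stopping heavyweights like HBase or the Spark services when you aren't using them frees a meaningful chunk of an 8 GB box.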
All that said, please do realize we are still talking about "commodity hardware", but we are not talking about "tiny hardware". Most on-prem servers are quite big, and https://community.hortonworks.com/questions/37565/cluster-sizing-calculator.html points you to https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.5.3/bk_cluster-planning/content/ch_hardware-rec... which makes some suggestions for pilot clusters and full production ones. You'll also read some additional thoughts on the whole "commodity hardware" versus "enterprise data center server" distinction in that documentation.
Good luck and happy Hadooping!