I am running an HDP cluster on AWS using Ambari, and I find myself struggling with an architectural dilemma.
Since our HDFS volumes are mounted on EBS, my computation and storage are already separated by default, so the closest I can bring these two units together (e.g. a Spark job and its destination HDFS node) is running the job on the same EC2 instance the EBS volume is attached to.
Now, in order to be able to scale my computational components and my storage components separately, I have divided the default blueprint provided by Ambari into instances containing the DataNode and HBase RegionServer components, and instances containing only the NodeManager component (plus Metrics Monitor on all nodes, naturally).
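To make the split concrete, here's a rough sketch of what the relevant `host_groups` section of such a blueprint might look like (the blueprint name, cardinalities, and the omitted master host group are illustrative, not my actual config):

```json
{
  "Blueprints": {
    "blueprint_name": "split-compute-storage",
    "stack_name": "HDP",
    "stack_version": "2.6"
  },
  "host_groups": [
    {
      "name": "storage_nodes",
      "components": [
        { "name": "DATANODE" },
        { "name": "HBASE_REGIONSERVER" },
        { "name": "METRICS_MONITOR" }
      ],
      "cardinality": "3+"
    },
    {
      "name": "compute_nodes",
      "components": [
        { "name": "NODEMANAGER" },
        { "name": "METRICS_MONITOR" }
      ],
      "cardinality": "1+"
    }
  ]
}
```

Scaling compute then just means adding or removing hosts mapped to `compute_nodes`, without touching the `storage_nodes` group.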
The purpose of this setup is to be able to launch Spark jobs on machines dedicated to computation, allowing me to scale down quickly without having to redistribute my HDFS data, and to scale my HDFS nodes up only when I'm running low on space.
Does this design make any sense, or does it go against HDP best-practices?
Hey, thanks for the tips. This is great information to help me choose between EBS and instance store but it doesn't help me understand whether my setup of datanodes and purely computational nodes is correct. Do you have more info about that subject?
Hi @Yaron Idan
The region servers should contain the data. If you are dynamically provisioning more servers (for compute) to execute Spark jobs, there will be a hit on the network (since the data isn't local), but as long as your SLA is being met, it should be fine.
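One practical knob worth knowing about in this setup: since tasks on your compute-only nodes will never achieve node-local placement anyway, Spark's default locality wait just adds scheduling delay. A hedged sketch of the relevant `spark-defaults.conf` tuning (the value is an example, not a recommendation for every workload):

```
# With no DataNode on the compute hosts, node-local placement is impossible,
# so don't make the scheduler wait for it before falling back to ANY locality.
spark.locality.wait  0s
```

You'd still want to benchmark this against your own jobs, since rack-level locality can matter if your storage and compute instances span availability zones.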