06-28-2019 01:46 PM - last edited on 07-01-2019 06:17 AM by VidyaSargur
Cloud deployment of the Hadoop stack on a public cloud provides cost/performance benefits for temporary/batch workloads. But the master/management nodes need to be kept permanently up and running to maintain the metadata for Navigator, all the other configuration that has been made, etc.
What is the best practice for running a temporary cluster on a public cloud without these caveats? For example: how do we avoid losing lineage in Navigator, given that the cluster would otherwise have to remain permanent, and how do we create a minimal permanent cluster with data nodes added/removed daily?
07-01-2019 02:21 PM
This is kind of a wide-open question, so I'll give you a wide-open answer with some ideas for implementing "transient" clusters.
To start with, you may want to think about having a cluster with a core of non-transient "master" nodes that house management information and host the HDFS namenode, YARN resource manager, and so on. You'd also want to keep around enough stateful nodes, like HDFS datanodes and Kudu tablet servers, so that fundamental data stays available (e.g., to stay at or above your chosen HDFS replication factor, which defaults to 3). Then you can scale out with stateless "compute" nodes, like YARN node managers and Spark workers, when the cluster workload increases, and tear them down when the load is lighter.
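To make the stateful/compute split concrete, here's a minimal sketch of the HDFS settings involved; the exclude-file path is an assumption, and your distribution's management tooling would normally manage these files for you.

```xml
<!-- hdfs-site.xml (sketch) -->
<property>
  <!-- default replication factor; keep at least this many datanodes running -->
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <!-- datanodes listed in this file (path is a placeholder) are gracefully
       decommissioned, so their blocks are re-replicated before removal -->
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>
```

Shrinking the stateless compute tier needs no such care: a YARN node manager can simply be stopped once its containers finish.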
Next, a good goal is to store important data in cloud storage services, like S3 for AWS and ADLS for Azure. Hadoop and other services can reference data in those services directly - for example, Hive and HDFS can be backed by S3 - or you can copy data into a cluster from the services to work on it, and then copy the final results back out. (You'd want to avoid saving intermediate data, like temporary HDFS files or Hive tables that only matter in the middle of a workflow, because that data can be regenerated.) Once you can persist data to cloud storage services, you have a basis for creating new clusters from nothing and pulling in data for them to work on.
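As one illustration of referencing cloud storage directly, the S3A connector can be configured so jobs and tables address `s3a://` paths; the credentials-provider choice here is just one option, and any bucket names are hypothetical.

```xml
<!-- core-site.xml (sketch): let Hadoop talk to S3 via the s3a:// filesystem -->
<property>
  <!-- pick up credentials from the EC2 instance profile instead of embedding keys -->
  <name>fs.s3a.aws.credentials.provider</name>
  <value>com.amazonaws.auth.InstanceProfileCredentialsProvider</value>
</property>
```

With that in place, final results can be persisted before teardown with something like `hadoop distcp /results/final s3a://my-bucket/results/final`, and a Hive table can be created with an `s3a://` LOCATION.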
Saving off metadata, such as Hive table mappings (metastore) and Sentry / Ranger authorization rules, to cloud storage is also a good idea. You can use cloud database services, like RDS on AWS, for that, or else object storage services like S3 or ADLS. Metadata usually needs to apply to all of your clusters, transient or not, because it defines common business rules. The idea behind SDX is to make saving common metadata an easy and usual thing, so that it's easier to hook up new clusters to it.
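For instance, the Hive metastore can point at an external database that outlives any one cluster; the hostname, database name, and driver below are placeholders for whatever RDS instance you'd actually stand up.

```xml
<!-- hive-site.xml (sketch): metastore backed by an external RDS MySQL database -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-db.example.rds.amazonaws.com:3306/hive_metastore</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
```

Every transient cluster then attaches to the same metastore, so table definitions survive cluster teardown.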
Automating the creation of clusters is really important, especially for transient clusters that you'd bring up and tear down all the time. That's the purpose of tools like Altus Director and Cloudbreak. We also have customers who use general tools like Ansible, Chef, or Puppet, since they are more familiar with them or have standardized on them. If you have automated cluster creation, and important data and metadata persisted in cloud storage services, then you've got the ingredients for successfully working with transient clusters in the cloud.
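If you go the general-automation route, the exact shape of the playbook matters less than having one at all. The sketch below uses Ansible, with hypothetical role names standing in for your actual provisioning and configuration steps.

```yaml
# Sketch of a transient-cluster lifecycle in Ansible; the roles named here
# (provision_compute, install_worker) are placeholders, not published roles.
- name: Grow the cluster for a batch workload
  hosts: localhost
  roles:
    - role: provision_compute   # e.g., launch instances and tag them as workers
      vars:
        worker_count: 20

- name: Configure and start the new workers
  hosts: new_workers            # inventory group populated by the previous play
  roles:
    - install_worker            # install node manager / Spark worker, join cluster
```

A matching teardown playbook would decommission the workers and terminate the instances when the batch run finishes.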
I know this isn't a precise answer for how to do transient clusters, but hopefully I've given you some avenues to explore.