I heard that Cloudera is working on Kubernetes as a platform. Is this true? If so, is there any news or updates? I would like to know if and when it will replace YARN. We are currently moving to Kubernetes to underpin all our services. It would be beneficial and simpler to maintain Kubernetes if Cloudera Manager handled this.
Thanks for your interest. Cloudera is indeed in the early stages of looking at Kubernetes to see how it might benefit Cloudera users. Nothing definite thus far.
You touched on the largest challenge: YARN. For Spark, a Spark-on-K8s project is making rapid progress on integrating these two tools. The result will be that YARN is needed only for MapReduce (MR). At present, there is no clear community solution for MR on Kubernetes, so we're looking into options.
You are right that some changes would be needed to Cloudera Manager (CM): CM need not be in the business of launching processes; CM would instead coordinate with K8s to launch containers.
It would be helpful to understand a bit more about how you'd want to use Kubernetes.
In your own deployment, do you use Spark? MR (perhaps via Hive)? Other distributed compute engines?
Would you want Kubernetes to manage your HDFS data nodes (which would require associating pods with the nodes that have disks), or would you use some other storage solution?
About how large would your cluster be (rough order-of-magnitude: 10, 50, 100, etc.)?
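To make the data node question concrete: one way to tie pods to the nodes that have disks is a DaemonSet restricted by a node label, with the disks exposed through hostPath volumes. Below is a minimal sketch that builds such a manifest as a plain Python dict and prints it as JSON (which kubectl accepts alongside YAML). The label `hdfs-datanode=true`, the image name, and the disk paths are made-up placeholders, not anything Cloudera ships.

```python
import json


def datanode_daemonset(image, disk_paths, node_label=("hdfs-datanode", "true")):
    """Build a DaemonSet manifest that runs one DataNode pod per labeled node."""
    labels = {"app": "hdfs-datanode"}
    # One hostPath volume per local disk, so data stays on the node's own drives.
    volumes = [
        {"name": f"disk{i}", "hostPath": {"path": path}}
        for i, path in enumerate(disk_paths)
    ]
    mounts = [
        {"name": f"disk{i}", "mountPath": f"/data/{i}"}
        for i in range(len(disk_paths))
    ]
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "hdfs-datanode"},
        "spec": {
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Only nodes carrying this label get a DataNode pod.
                    "nodeSelector": {node_label[0]: node_label[1]},
                    "containers": [
                        {"name": "datanode", "image": image, "volumeMounts": mounts}
                    ],
                    "volumes": volumes,
                },
            },
        },
    }


# Placeholder image and paths, for illustration only.
manifest = datanode_daemonset("example/hdfs-datanode:latest", ["/mnt/disk1", "/mnt/disk2"])
print(json.dumps(manifest, indent=2))
```

Labeling the disk-bearing nodes is then a one-time administrative step; whether that is acceptable is part of what we'd like to learn from your setup.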
In a simplified model, Kubernetes is just a container orchestration tool.
Most of the discussion I hear about Kubernetes/Docker is about the "workload" side. What about the "management" side of this platform?
I would love to see a solution where we could:
1) Build and deploy containers for individual Hadoop Service instances
2) Manage these service instances through CM
Example: Build a Hive Server container (pre-configured to point to the right cluster, etc.) which could be deployed on a client edge node. The Hive metastore could be another container, but running on the "service" node only.
Similarly for Sentry or some other Hadoop service which is not in HA mode yet. Having a container for them would simplify deployment and get rid of the need to have HA built in.
Another example could be: A gateway container with config XMLs pointing to a specific cluster. This way, clients do not need to manually or programmatically update the XMLs to point to the right cluster.
What do you all think?
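To illustrate the gateway idea, here is a rough sketch of how a container image build (or entrypoint) might render a client config for a chosen cluster. It uses the standard Hadoop `*-site.xml` layout of `<configuration>` / `<property>` / `<name>` / `<value>`. The cluster names, hostnames, and port in `CLUSTERS` are invented placeholders; in practice the values would come from CM or some other inventory.

```python
import xml.etree.ElementTree as ET

# Hypothetical cluster registry; hostnames and port are placeholders.
CLUSTERS = {
    "prod": {"fs.defaultFS": "hdfs://prod-nn.example.com:8020"},
    "dev": {"fs.defaultFS": "hdfs://dev-nn.example.com:8020"},
}


def render_site_xml(properties):
    """Render a dict of properties in the Hadoop *-site.xml layout."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


# The gateway container for "prod" would bake this in as core-site.xml.
core_site = render_site_xml(CLUSTERS["prod"])
print(core_site)
```

A client would then pick the gateway image for its target cluster instead of editing XML by hand.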
Agree on the idea of containerization.
One tricky bit is that Kubernetes grew out of the scalable web app space with stateless services fronted by load balancers. Many parts of Hadoop are stateful, and are tightly bound to their nodes. (Think ZooKeeper and HDFS.) Bringing these two worlds together is a rather interesting challenge.
There is action on the open source side. See the Kubernetes Big Data SIG and Hadoop Helm Chart project. This is not the full CM-managed stack, but it shows what can be done with just "stock" Kubernetes, Helm and Hadoop.
As a company, we are investigating a Kubernetes deployment across all our clusters, which span multiple data centers around the globe.
We currently use mostly Spark with a few legacy Hive jobs to handle our data batch processing. Spark is mainly used in coordination with Kafka to handle the streaming use case. HBase is in use as a temporary profile store until we move to something better, such as Kudu, Couchbase, or another similar alternative. HDFS is still in use for distributed data file storage.
Ideally, we would like to have Spark, Kafka, Kudu, and HDFS all on Kubernetes and easily deployable using Cloudera Manager. Then, we can proceed with our plan quickly.
Does this help?
Thanks much for the info. Yes, it does very much help.
As mentioned earlier, Spark, Kafka, Kudu, Impala and HDFS are the easiest to convert to Kubernetes.
MapReduce is a challenge because of the overlap of YARN and Kubernetes responsibilities. (Both allocate "containers". MR is tightly coupled to the YARN API. Only YARN has queues and mechanisms to handle the kinds of requests that MR makes.)
Some of the quick-and-dirty options that the community has floated for YARN are:
1. Run YARN outside of Kubernetes.
2. Run the YARN node managers as Kubernetes containers.
Would either of those work in your environment for your Hive jobs?
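For option 2, the overlap between the two schedulers shows up concretely in resource accounting: Kubernetes caps the node manager's pod, and the node manager in turn advertises capacity to YARN. A rough sketch of keeping the two in step follows. The property names `yarn.nodemanager.resource.memory-mb` and `yarn.nodemanager.resource.cpu-vcores` are the standard YARN settings; the function names and the 1 GiB daemon overhead are arbitrary assumptions for illustration.

```python
def nodemanager_settings(pod_memory_mb, pod_cpus, daemon_overhead_mb=1024):
    """Derive what a containerized NodeManager may hand out to YARN containers."""
    if pod_memory_mb <= daemon_overhead_mb:
        raise ValueError("pod memory limit must exceed the NodeManager's own overhead")
    return {
        # Leave headroom for the NodeManager process itself (assumed value).
        "yarn.nodemanager.resource.memory-mb": str(pod_memory_mb - daemon_overhead_mb),
        "yarn.nodemanager.resource.cpu-vcores": str(pod_cpus),
    }


def pod_resources(pod_memory_mb, pod_cpus):
    """Matching Kubernetes resources stanza for the NodeManager container."""
    amounts = {"memory": f"{pod_memory_mb}Mi", "cpu": str(pod_cpus)}
    # Requests equal to limits, so the pod's capacity is not overcommitted.
    return {"requests": amounts, "limits": dict(amounts)}


yarn_conf = nodemanager_settings(16384, 4)
k8s_resources = pod_resources(16384, 4)
print(yarn_conf)
print(k8s_resources)
```

With option 1 this bookkeeping disappears, at the cost of running two separate resource managers on separate hardware.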