About PaulR

PaulR · ‎03-20-2018

Hi Ben,

Thanks much for the info. Yes, it helps very much. As mentioned earlier, Spark, Kafka, Kudu, Impala and HDFS are the easiest to convert to Kubernetes. MapReduce is a challenge because the responsibilities of YARN and Kubernetes overlap: both allocate "containers", MR is tightly coupled to the YARN API, and only YARN has the queues and mechanisms to handle the kinds of requests that MR makes.

Some of the quick-and-dirty options that the community has floated for YARN are:

1. Run YARN outside of Kubernetes.
2. Run the YARN node managers as Kubernetes containers.

Would either of those work in your environment for your Hive jobs?

Thanks,
- Paul
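
To make option 2 a bit more concrete, here is a minimal sketch in Python that builds a Kubernetes DaemonSet manifest running one YARN NodeManager per node. It is only an illustration under stated assumptions, not a tested deployment: the image name, the `YARN_RM_HOST` environment variable and the ResourceManager host name are all hypothetical, and the image is assumed to turn that variable into the NodeManager's ResourceManager setting.

```python
import json


def nodemanager_daemonset(image, resourcemanager_host):
    """Build a DaemonSet manifest that runs one YARN NodeManager pod per node."""
    labels = {"app": "yarn-nodemanager"}
    return {
        "apiVersion": "apps/v1",
        "kind": "DaemonSet",
        "metadata": {"name": "yarn-nodemanager"},
        "spec": {
            # The selector must match the pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "nodemanager",
                            "image": image,  # hypothetical image
                            "command": ["yarn", "nodemanager"],
                            # Hypothetical variable; assumes the image maps it
                            # to the NodeManager's ResourceManager address.
                            "env": [
                                {"name": "YARN_RM_HOST",
                                 "value": resourcemanager_host}
                            ],
                        }
                    ]
                },
            },
        },
    }


manifest = nodemanager_daemonset(
    "example/hadoop-nodemanager:latest",   # hypothetical
    "resourcemanager.example.internal",    # hypothetical
)
print(json.dumps(manifest, indent=2))
```

Note that in this arrangement YARN would still allocate the MR task containers inside each NodeManager pod, so Kubernetes sees only the pod and not the individual tasks. That is the overlap in responsibilities described above.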

PaulR · ‎03-20-2018

Agree on the idea of containerization. One tricky bit is that Kubernetes grew out of the scalable web app space, with stateless services fronted by load balancers. Many parts of Hadoop are stateful and are tightly bound to their nodes. (Think ZooKeeper and HDFS.) Bringing these two worlds together is a rather interesting challenge. There is action on the open source side. See the Kubernetes Big Data SIG and the Hadoop Helm Chart project. This is not the full CM-managed stack, but it shows what can be done with just "stock" Kubernetes, Helm and Hadoop.

PaulR · ‎03-15-2018

Hi Ben,

Thanks for your interest. Cloudera is indeed in the early stages of looking at Kubernetes to see how it might benefit Cloudera users. Nothing definite thus far.

You touched on the largest challenge: YARN. For Spark, a Spark-on-K8s project is making rapid progress on integrating these two tools. The result will be that YARN is needed only for MapReduce (MR). At present, there is no clear community solution for MR on Kubernetes, so we're looking into options.

You are right that some changes would be needed to Cloudera Manager (CM): CM would no longer need to be in the business of launching processes; it would instead coordinate with K8s to launch containers.

It would be helpful to understand a bit more about how you'd want to use Kubernetes. In your own deployment, do you use Spark? MR (perhaps via Hive)? Other distributed compute engines? Would you want Kubernetes to manage your HDFS data nodes (which would require associating pods with the nodes that have disks), or would you use some other storage solution? About how large would your cluster be (rough order of magnitude: 10, 50, 100, etc.)?

Thanks,
- Paul
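
As a rough illustration of what "associating pods with the nodes that have disks" could look like, the Python sketch below builds a pod spec for an HDFS DataNode that uses a Kubernetes `nodeSelector` and a `hostPath` volume, and includes a small helper that mimics how a node selector picks eligible nodes. The node label `example.com/hdfs-disk`, the image name, the host directory and the mount path are all assumptions made for the example; the mount path would have to match whatever data directory the DataNode is configured to use.

```python
import json


def datanode_pod_spec(image, node_selector, host_data_dir):
    """Build a pod spec that pins an HDFS DataNode to nodes with local disks."""
    return {
        # Only nodes carrying these labels are eligible to run the pod.
        "nodeSelector": node_selector,
        "containers": [
            {
                "name": "datanode",
                "image": image,  # hypothetical image
                "command": ["hdfs", "datanode"],
                "volumeMounts": [
                    # Assumed to match the DataNode's configured data directory.
                    {"name": "hdfs-data", "mountPath": "/hadoop/dfs/data"}
                ],
            }
        ],
        # hostPath exposes a directory on the node's own disk to the pod.
        "volumes": [{"name": "hdfs-data", "hostPath": {"path": host_data_dir}}],
    }


def eligible_nodes(nodes, node_selector):
    """Return names of nodes whose labels contain every selector key/value."""
    return [
        name
        for name, labels in nodes.items()
        if all(labels.get(k) == v for k, v in node_selector.items())
    ]


selector = {"example.com/hdfs-disk": "true"}  # hypothetical label
nodes = {
    "node-a": {"example.com/hdfs-disk": "true"},
    "node-b": {},  # no data disks, so no DataNode pod here
}
print(eligible_nodes(nodes, selector))
print(json.dumps(
    datanode_pod_spec("example/hadoop-datanode:latest", selector, "/data/hdfs"),
    indent=2,
))
```

This only covers placement. Keeping a DataNode's identity and data stable across pod restarts is a separate concern that the sketch does not address.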

Member Since	‎03-15-2018 06:09 PM
Last Visited	‎04-30-2018 07:44 PM
Posts	3
Kudos received	2

Cloudera Community

Re: CDH on Kubernetes
