Created 02-11-2017 06:33 PM
I am just getting started with the HDP Sandbox and Hadoop in general, so I have quite a few noob questions that I am hoping someone can kindly help answer.
It seems the HDP 2.5 Sandbox now uses a Docker container within a VM. I discovered this, thanks to the community forums, when the Hadoop client tools didn't work after I ssh'ed onto the VM, but they did work when I ssh'ed into the Docker container (port 2222). Can someone explain to me the different roles that the VM and the Docker container play as far as the HDP 2.5 Sandbox is concerned? Am I correct to assume that, since the Docker container has the client tools installed, it at least plays the role of an "edge node"? Then, between the VM and the container, which plays the roles of "name node" and "data node"? Or does the container play all the roles, while the VM is just a minimal OS that enables running Docker?
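For reference, here is roughly how I connected to each layer. The hostnames and ports are from my setup and depend on how your virtualization platform forwards ports, so yours may differ:

```shell
# ssh into the VM itself (default forwarded SSH port on my setup)
ssh root@sandbox.hortonworks.com -p 22

# ssh into the Docker container inside the VM, where the client tools live
ssh root@sandbox.hortonworks.com -p 2222

# once inside the container, the Hadoop client tools work, e.g.:
hdfs dfs -ls /
```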
Also, out of curiosity: in theory, would it not have been possible to create a sort of virtual Hadoop cluster using multiple Docker containers playing different node roles, even on modest hardware? I ask because the HDP Sandbox contains just one container; I'd have thought there'd be multiple containers playing different roles.
Thanks in advance!
Created 02-11-2017 06:40 PM
The sandbox plays the roles of Ambari, edge, master, and data node. It is set up to get you up and running quickly to learn the Hadoop stack. In a production environment, you would separate Ambari, edge, and master services (1:m on each node) and have x number (minimum 3) of data nodes. You would scale your data nodes based on the compute and storage required for your workload.
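Once you're inside the sandbox container, you can see for yourself that all the roles live on one node. A quick sketch (this assumes the HDP services have been started, e.g. via Ambari):

```shell
# report the data nodes known to the name node
hdfs dfsadmin -report

# list the YARN node managers (the compute-side workers)
yarn node -list

# in the sandbox, both reports show a single host,
# because every role runs on the same node
```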
Created 02-12-2017 02:02 PM
Thanks for the answer.
When you say "the sandbox plays the roles...", do you mean the Docker container within the VM?
Any idea, then, what the VM actually does besides providing the platform to run Docker?
Created 02-13-2017 07:28 PM
You are correct, the VM is there to run docker. This allows the same sandbox container to be run on many different virtualization platforms, which reduces the variability of the experience for different users. The differences between the versions are only the packaging and port forwarding configurations specific to the virtualization platform. If you run docker directly on your local machine, the docker version of the sandbox does not include the intermediate VM.
You could certainly run a cluster of nodes in separate containers. In fact, there is a GitHub repository of tools for running multi-node clusters in Docker. You can find a link to that here: https://community.hortonworks.com/repos/75668/a-multi-node-docker-cluster-platform-to-quickly-sp.htm...
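As a rough illustration of the idea (the image and container names below are hypothetical, for illustration only, and not the ones the linked repo uses), a multi-container cluster mostly comes down to putting the containers on one user-defined Docker network so they can resolve each other by name:

```shell
# create a user-defined network so containers can reach each other by name
docker network create hadoop-net

# hypothetical image and container names, for illustration only
docker run -d --name namenode  --network hadoop-net my-hadoop-image
docker run -d --name datanode1 --network hadoop-net my-hadoop-image
docker run -d --name datanode2 --network hadoop-net my-hadoop-image

# each data node would then point its fs.defaultFS at hdfs://namenode:8020
```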