
You need to debug, test, and operate a Hadoop cluster, especially when you want to run dangerous dfsadmin commands, try customized packages built from modified Hadoop/Spark source code, or experiment with aggressive configuration values. You have a laptop and a production Hadoop cluster. You don't dare to operate the production cluster blindly, which your manager appreciates. You want a Hadoop cluster where you can try things out and, even if you break it, no one blames you.

You have several choices (perhaps you're using one of them now):

  1. A pseudo-distributed Hadoop cluster on a single machine, on which it is nontrivial to run HA, use per-node configurations, pause and launch multiple nodes, or test the HDFS balancer/mover.
  2. Setting up a real cluster, which is complex and heavy to use, assuming you can afford a real cluster in the first place.
  3. Building an Ambari cluster from VirtualBox/VMware virtual machines. Nice try, but if you run a 5-node cluster, you'll see your CPU overloaded and your memory eaten up.

How about using Docker containers instead of VirtualBox virtual machines? Caochong is a tool that does exactly this! Specifically, it outperforms its counterparts in that it is:

  • Customizable: you can easily specify the cluster specs, e.g. how many nodes to launch, the Ambari version, the Hadoop version repository, and per-node Hadoop configurations. Meanwhile, you can choose from the full Hadoop ecosystem stack: HDFS, YARN, Spark, HBase, Hive, Pig, Oozie... you name it!
  • Lightweight: imagine your physical machine running as many containers as you wish. I ran 10 without any problem (well, my laptop did get slow). Using Docker, you can also pause and start the containers (say you have to restart your laptop for an OS security update; you'd need a snapshot, right?).
  • Standard: caochong employs Apache Ambari, a tool for provisioning, managing, and monitoring Apache Hadoop clusters, to set up the cluster.
  • Automatic: you don't have to be an Ambari, Docker, or Hadoop expert to use it!
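To make the pause-and-start point concrete, here is a minimal sketch of a helper that applies one Docker subcommand (`docker pause`, `docker unpause`, `docker stop`, `docker start`) to every node of a cluster. It assumes the containers are named `caochong-ambari-0`, `caochong-ambari-1`, and so on; only `caochong-ambari-0` appears in this article, so the numbering of the other nodes is a guess, and the function names are hypothetical.

```shell
# ASSUMPTION: containers are named caochong-ambari-0 .. caochong-ambari-(N-1).
# Only caochong-ambari-0 is confirmed by the article; adjust to your setup.

# Print the assumed container names for a cluster of $1 nodes, one per line.
cluster_containers() {
  local nodes="$1"
  local i=0
  while [ "$i" -lt "$nodes" ]; do
    echo "caochong-ambari-$i"
    i=$((i + 1))
  done
}

# Apply a docker subcommand (pause, unpause, stop, start) to every node.
# With DRY_RUN=1 the commands are printed instead of executed.
cluster_do() {
  local action="$1"
  local nodes="$2"
  local name
  for name in $(cluster_containers "$nodes"); do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "docker $action $name"
    else
      docker "$action" "$name"
    fi
  done
}
```

For example, `DRY_RUN=1 cluster_do pause 3` prints the three commands it would run, and `cluster_do unpause 3` resumes the nodes afterwards.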

To use caochong, you only need to follow 9 steps. Only nine, indeed!

0. Download caochong, and install Docker.

1. [Optional] Choose the Ambari version in from-ambari/Dockerfile (default: Ambari 2.2).

2. Run from-ambari/ to set up an Ambari cluster and launch it

$ ./ --help
Usage: ./ [--nodes=3] [--port=8080]
--nodes      Specify the number of total nodes
--port       Specify the port of your local machine to access Ambari Web UI (8080 - 8088)

3. Open http://localhost:port in a browser on your local computer, where port is the value given with --port on the command line in step 2. By default, it is http://localhost:8080. NOTE: Ambari Server can take some time to come up fully and be ready to accept connections. Keep hitting the URL until you get the login page.
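Instead of refreshing the browser by hand, the waiting can be scripted. The following is a rough sketch that polls the Ambari URL with curl until it answers. The function names, the retry count, and the delay are all hypothetical choices, and a successful HTTP response is only a proxy for the login page being ready.

```shell
# Build the Ambari Web UI URL for a local port (defaults to 8080, as in step 3).
ambari_url() {
  echo "http://localhost:${1:-8080}"
}

# Poll the URL until curl gets a successful response or the attempts run out.
# Arguments: port, number of attempts, seconds between attempts.
# The defaults for attempts and delay are arbitrary, not taken from caochong.
wait_for_ambari() {
  local url
  url="$(ambari_url "$1")"
  local attempts="${2:-60}"
  local delay="${3:-5}"
  local n=1
  while [ "$n" -le "$attempts" ]; do
    # --fail makes curl exit non-zero on HTTP errors (status 400 and above).
    if curl --silent --fail --output /dev/null "$url"; then
      echo "Ambari is answering at $url"
      return 0
    fi
    sleep "$delay"
    n=$((n + 1))
  done
  echo "Gave up waiting for $url" >&2
  return 1
}
```

Calling `wait_for_ambari 8080` after launching the cluster blocks until the server responds, or gives up after roughly five minutes with the defaults above.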

4. Log in to the Ambari web page with the default username:password, admin:admin.

5. [Optional] Customize the repository Base URLs in the Select Stack step.

6. On the Install Options page, use the hostnames reported when the cluster was launched in step 2 as the Fully Qualified Domain Names (FQDNs). For example:

Using the following hostnames:

7. Upload from-ambari/id_rsa as your SSH Private Key to automatically register hosts when asked.

8. Follow the onscreen instructions to install Hadoop (YARN + MapReduce2, HDFS) and Spark.

9. [Optional] Log in to any of the nodes and you're all set to use an Ambari cluster!

# log in to your Ambari server node
$ docker exec -it caochong-ambari-0 /bin/bash
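When an interactive shell is not needed, a one-off command can be sent to a node through `docker exec` directly. The helper below is a small sketch of that idea; `node_exec` is a hypothetical name, and any container other than `caochong-ambari-0` is assumed rather than taken from this article.

```shell
# Run a command inside a cluster node without opening an interactive shell.
# With DRY_RUN=1 the docker command is printed instead of executed.
node_exec() {
  local node="$1"
  shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker exec $node $*"
  else
    docker exec "$node" "$@"
  fi
}
```

For instance, `node_exec caochong-ambari-0 hostname -f` prints the node's fully qualified hostname, which can be compared with the FQDN entered in step 6. Once HDFS is installed, something like `node_exec caochong-ambari-0 hdfs dfsadmin -report` should also work, assuming the HDFS client is present on that node and the user inside the container has sufficient privileges.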

To know more or to get updates, please star the Caochong project at