This document is an informal guide to setting
up a test cluster on Amazon AWS, specifically the EC2 service. This is
not a best practice guide nor is it suitable for a full PoC or
production install of HDP.
Note: when instantiating instances, I increased the root partition to
100 GB on each of them. For long-term use, you may want to create
separate volumes for each of your datanodes to store larger amounts
of data. Typical raw storage is 12-24 TB per slave node.
Note: I edit the Name column in the EC2 Instances screen to the names mentioned above so I know which box I’m dealing with.
Configure Security Groups
Used the following security group rules:
0 – 65535
50000 – 50100
0 – 65535
On each and every node (using root):
vim /etc/sysconfig/selinux (set SELINUX=disabled)
vim /etc/sysconfig/network (set HOSTNAME=<chosen_name>.hdp.hadoop where <chosen_name> is one of the following: ambarimaster, hdpmaster1, hdpmaster2, hdpslave1, hdpslave2, hdpslave3 – depending on what EC2 instance you are on)
chkconfig iptables off
chkconfig ip6tables off
shutdown -r now #(only after the commands above are completed)
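If you would rather not open vim on six boxes, the two file edits above can be scripted. This is only a minimal sketch and not part of the procedure I actually ran: set_key is a hypothetical helper, and it assumes both files use plain KEY=value lines.

```shell
#!/bin/sh
# set_key FILE KEY VALUE
# Replace an existing "KEY=..." line in FILE, or append one if none exists.
# The file is rewritten through a redirection rather than replaced, so a
# symlinked config file keeps pointing at its target.
set_key() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    if grep -q "^${key}=" "$file"; then
        sed "s|^${key}=.*|${key}=${value}|" "$file" > "$tmp"
    else
        cat "$file" > "$tmp"
        printf '%s=%s\n' "$key" "$value" >> "$tmp"
    fi
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example, run as root on the node that should become hdpmaster1:
# set_key /etc/sysconfig/selinux SELINUX disabled
# set_key /etc/sysconfig/network HOSTNAME hdpmaster1.hdp.hadoop
```

The chkconfig and shutdown commands stay exactly as listed; only the interactive edits are replaced.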
Note: when I restart the node in this manner, my external EC2
names did NOT change. They will change if you actually halt the
instance. This is a separate concern from the internal IP addresses, which
we will get to further on in these instructions.
Note: SSH on the RHEL instances has a timeout. If your session hangs,
just give it a few seconds and you will get a “Write failed: Broken
pipe” message; just reconnect to the box and everything will be fine.
Change the SSH timeout if you desire.
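One way to avoid the dropped sessions from the client side is to have your local ssh send keep-alive messages. A minimal sketch, assuming an OpenSSH client on your laptop; add_keepalive is a hypothetical helper, and the 60-second interval and the host pattern in the example are my guesses rather than values from this setup.

```shell
#!/bin/sh
# add_keepalive CONFIG PATTERN
# Append a stanza to the ssh client config file CONFIG so that connections
# to hosts matching PATTERN send a keep-alive every 60 seconds and give up
# after 3 unanswered ones.
add_keepalive() {
    config=$1
    pattern=$2
    cat >> "$config" <<EOF

Host $pattern
    ServerAliveInterval 60
    ServerAliveCountMax 3
EOF
}

# Example (the pattern is an assumption about what your EC2 names look like):
# add_keepalive "$HOME/.ssh/config" '*.amazonaws.com'
```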
Logged onto the ambarimaster ONLY: ssh-keygen -t rsa
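If you would rather not answer the ssh-keygen prompts, the key pair can be generated non-interactively. A minimal sketch, assuming an empty passphrase is acceptable for a throwaway test cluster; make_keypair is a hypothetical helper name, and the 2048-bit key size is my choice, not something this guide specifies.

```shell
#!/bin/sh
# make_keypair DIR
# Create DIR/id_rsa (private key) and DIR/id_rsa.pub (public key) with an
# empty passphrase and no prompts. Refuses to overwrite an existing key.
make_keypair() {
    dir=$1
    if [ -e "$dir/id_rsa" ]; then
        echo "refusing to overwrite $dir/id_rsa" >&2
        return 1
    fi
    mkdir -p "$dir" && chmod 700 "$dir" || return 1
    ssh-keygen -q -t rsa -b 2048 -N "" -f "$dir/id_rsa"
}

# Example, run on ambarimaster only:
# make_keypair "$HOME/.ssh"
```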
On your local box (assuming a Linux/Mac laptop/workstation; if not,
use Cygwin, WinSCP, FileZilla, etc. to accomplish the equivalent secure
Now you can log on to Ambari. Make a note of the external hostname of
your ambarimaster EC2 instance in the AWS console and go to
http://<external_hostname>:8080 using your local host’s favorite web browser.
Log on to Ambari with admin/admin.
Using Ambari to Install
Going through the Ambari cluster install process:
Name your cluster whatever you want
Install Options::Target Hosts – on each line enter the fully
qualified hostnames as below (do not add our ambarimaster to the list):
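Assuming you used the names and the .hdp.hadoop domain from the hostname step earlier in this guide, the lines to paste into Target Hosts could be generated like this. It is only a sketch; target_hosts is a hypothetical helper name, and you should adjust the list if you created a different set of nodes.

```shell
#!/bin/sh
# target_hosts
# Print one fully qualified hostname per line for every node except
# ambarimaster, which must not appear in the Target Hosts box.
target_hosts() {
    for name in hdpmaster1 hdpmaster2 hdpslave1 hdpslave2 hdpslave3; do
        printf '%s.hdp.hadoop\n' "$name"
    done
}

target_hosts
```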
Install Options::Host Registration Information – Find the id_rsa
(private key) file you downloaded from ambarimaster when you were
setting up. Click on choose file and select this file.
Install Options::Advanced Options – leave these as default
Click Register and Confirm
Confirm Hosts – Wait for the Ambari agents to be installed and
registered on each of your nodes and click Next when all have been
marked as successful. Note that you can always add nodes at a later time, but
make sure you have your two masters and at least one slave.
Choose Services – By default all services are selected. Note that you
cannot go back and reinstall services later in this version of Ambari
so choose what you want now.
Assign Masters – Likely the default is fine, but see below for a good
setup. Note that one of the slaves will need to be a ZooKeeper instance
to have an odd number for quorum.
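The odd-number remark follows from majority voting: ZooKeeper needs more than half of its servers up to form a quorum, so an even-sized ensemble tolerates no more failures than the next smaller odd one. A small sketch of the arithmetic (the function names are mine, not part of any tool):

```shell
#!/bin/sh
# zk_quorum N: smallest majority of an N-server ensemble.
zk_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# zk_tolerated N: number of servers that can fail while a majority survives.
zk_tolerated() {
    echo $(( ($1 - 1) / 2 ))
}

for n in 2 3 4 5; do
    echo "servers=$n quorum=$(zk_quorum "$n") tolerated_failures=$(zk_tolerated "$n")"
done
```

With the two masters plus one slave you get three servers, which survives the loss of one; the two masters alone would survive none.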
Assign Slaves and Clients – For a demo cluster it is fine to have all
of the boxes run datanode, tasktracker, regionserver, and client
libraries. If you want to expand this cluster with many more slave nodes,
then I would suggest running the datanode, tasktracker, and
regionserver roles only on the hdpslave nodes. The clients can be installed
where you like, but be sure at least one or two boxes have a client role.
Click Next after you are done.
Customize Services – You will note that two services have red markers next to their name: Hive/HCat and Nagios.
Select Hive/HCat and choose your password for hive user on the MySQL
database (this stores metadata only); remember the password.
Select Nagios and choose your admin password. Set your Hadoop admin
email to your email (or the email of someone you don’t like very much)
and you can experience Hadoop alerts from your cluster! Wow.
Review – Take note of the Print command in the top corner. I usually save this to a PDF. Then click Deploy. Get a coffee.
Note: you may need to refresh the web page if the installer appears
stuck (this happens very occasionally depending on various
Verify 100% installed and click Next
Summary – You should see that all of the Master services were
installed successfully and none should have failed. Click Complete.
At this point the Ambari installation and the HDP cluster are complete, so you should see the Ambari Dashboard.
You can leave your cluster running as long as you want, but be warned
that the instances and volumes will cost you on AWS. To ensure that you
will not be charged, you can terminate (not just stop) your instances and
delete your volumes in AWS. I encourage you to keep them for a week
or so as you decide how to set up your actual Hadoop PoC cluster (be it
on actual hardware, virtual machines, or another cloud solution). The
instances you created will be handy for reference as you install your
next cluster and generally are low cost. Consult AWS documentation for
details on management and pricing. Please look into Rackspace as well.