Member since: 02-09-2016
Posts: 559
Kudos Received: 422
Solutions: 98
05-23-2017
09:41 PM
2 Kudos
This tutorial will walk you through the process of using Cloudbreak to deploy an HDP 2.6 cluster with Spark 2.1. We'll copy and edit the existing hdp-spark-cluster blueprint which deploys Spark 1.6 to create a new blueprint which installs Spark 2.1. This tutorial is part one of a two-part series. The second tutorial walks you through using Zeppelin to verify the Spark 2.1 installation. You can find that tutorial here: HCC Article
Prerequisites
You should already have a Cloudbreak v1.14.0 environment running. You can follow this article to create a Cloudbreak instance using Vagrant and Virtualbox: HCC Article
You should already have updated Cloudbreak to support deploying HDP 2.6 clusters. You can follow this article to enable that functionality: HCC Article
Scope
This tutorial was tested in the following environment:
Cloudbreak 1.14.4
AWS EC2
HDP 2.6
Spark 2.1
Steps
Create Blueprint
Before we can deploy a Spark 2.1 cluster using Cloudbreak, we need to create a blueprint that specifies Spark 2.1. Cloudbreak ships with 3 blueprints out of the box:
hdp-small-default: basic HDP cluster with Hive and HBase
hdp-spark-cluster: basic HDP cluster with Spark 1.6
hdp-streaming-cluster: basic HDP cluster with Kafka and Storm
We will use the hdp-spark-cluster as our base blueprint and edit it to deploy Spark 2.1 instead of Spark 1.6.
Click on the manage blueprints section of the UI. Click on the hdp-spark-cluster blueprint. You should see something similar to this:
Click on the blue copy & edit button. You should see something similar to this:
For the Name , enter hdp26-spark21-cluster . This tells us the blueprint is for an HDP 2.6 cluster using Spark 2.1. Enter the same information for the Description . You should see something similar to this:
Now we need to edit the JSON portion of the blueprint, changing the Spark 1.6 components to their Spark 2.1 equivalents. We don't need to change where they are deployed. The following entries within the JSON are for Spark 1.6:
"name": "SPARK_CLIENT"
"name": "SPARK_JOBHISTORYSERVER"
"name": "SPARK_CLIENT"
We will replace SPARK with SPARK2 . These entries should look as follows:
"name": "SPARK2_CLIENT"
"name": "SPARK2_JOBHISTORYSERVER"
"name": "SPARK2_CLIENT"
NOTE: There are two entries for SPARK_CLIENT. Make sure you change both.
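If you prefer to script the rename rather than edit by hand, here is a minimal sketch; it assumes you have saved the blueprint JSON locally as blueprint.json (a hypothetical filename) before pasting it back into the editor:
# Rename every SPARK_* component to SPARK2_* in a local copy of the blueprint
sed -i.bak 's/"SPARK_/"SPARK2_/g' blueprint.json
# Quick check: both SPARK2_CLIENT entries and SPARK2_JOBHISTORYSERVER should appear
grep -n '"SPARK2_' blueprint.json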
We are also going to add entries for the LIVY2_SERVER and SPARK2_THRIFTSERVER components, placing both on the same node as the SPARK2_JOBHISTORYSERVER. Let's add those two entries just below SPARK2_CLIENT in the host_group_master_2 section.
Change the following:
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "SPARK2_CLIENT"
},
to this:
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SPARK2_THRIFTSERVER"
},
{
"name": "LIVY2_SERVER"
},
We also need to update the blueprint_name to hdp26-spark21-cluster and the stack_version to 2.6. You should have something similar to this:
"Blueprints": {
"blueprint_name": "hdp26-spark21-cluster",
"stack_name": "HDP",
"stack_version": "2.6"
}
If you prefer, you can copy and paste the following blueprint JSON:
{
"host_groups": [
{
"name": "host_group_client_1",
"configurations": [],
"components": [
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "PIG"
},
{
"name": "OOZIE_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HCAT"
},
{
"name": "KNOX_GATEWAY"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "FALCON_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SLIDER"
},
{
"name": "SQOOP"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "METRICS_COLLECTOR"
},
{
"name": "MAPREDUCE2_CLIENT"
}
],
"cardinality": "1"
},
{
"name": "host_group_master_3",
"configurations": [],
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "APP_TIMELINE_SERVER"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "HBASE_MASTER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "SECONDARY_NAMENODE"
}
],
"cardinality": "1"
},
{
"name": "host_group_slave_1",
"configurations": [],
"components": [
{
"name": "HBASE_REGIONSERVER"
},
{
"name": "NODEMANAGER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "DATANODE"
}
],
"cardinality": "6"
},
{
"name": "host_group_master_2",
"configurations": [],
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "PIG"
},
{
"name": "MYSQL_SERVER"
},
{
"name": "HIVE_SERVER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SPARK2_THRIFTSERVER"
},
{
"name": "LIVY2_SERVER"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HIVE_METASTORE"
},
{
"name": "ZEPPELIN_MASTER"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "RESOURCEMANAGER"
},
{
"name": "WEBHCAT_SERVER"
}
],
"cardinality": "1"
},
{
"name": "host_group_master_1",
"configurations": [],
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "HISTORYSERVER"
},
{
"name": "OOZIE_CLIENT"
},
{
"name": "NAMENODE"
},
{
"name": "OOZIE_SERVER"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "FALCON_SERVER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "MAPREDUCE2_CLIENT"
}
],
"cardinality": "1"
}
],
"Blueprints": {
"blueprint_name": "hdp26-spark21-cluster",
"stack_name": "HDP",
"stack_version": "2.6"
}
}
Once you have all of the changes in place, click the green create blueprint button.
Create Security Group
We need to create a new security group to use with our cluster. By default, the existing security groups only allow ports 22, 443, and 9443. As part of this tutorial, we will use Zeppelin to test Spark 2.1. We'll create a new security group that opens all ports to our IP address.
Click on the manage security groups section of the UI. You should see something similar to this:
Click on the green create security group button. You should see something similar to this:
First you need to select the appropriate cloud platform. I'm using AWS, so that is what I selected. We need to provide a unique name for our security group; I used all-ports-my-ip, but you should use something descriptive. Provide a helpful description as well. Now we need to enter our personal IP address CIDR. I am using #.#.#.#/32; your IP address will obviously be different. You need to enter the port range. There is a known issue in Cloudbreak that prevents you from using 0-65535, so we'll use 1-65535. For the protocol, use tcp. Once you have everything entered, you should see something similar to this:
Click the green Add Rule button to add this rule to our security group. You can add multiple rules, but we have everything covered with our single rule. You should see something similar to this:
If everything looks good, click the green create security group button. This will create our new security group. You should see something like this:
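If you are not sure which address to use for the CIDR value, you can look up your public IP from the command line. A quick sketch (checkip.amazonaws.com is just one of several services that echo your public address):
# Print your public IP address and turn it into a single-host CIDR
MY_IP=$(curl -s https://checkip.amazonaws.com)
echo "${MY_IP}/32"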
Create Cluster
Now that our blueprint has been created and we have a new security group, we can begin building the cluster. Ensure you have selected the appropriate credential for your cloud environment. Then click the green create cluster button. You should see something similar to this:
Give your cluster a descriptive name. I used spark21test, but you can use whatever you like. Select an appropriate cloud region; I'm using AWS and selected US East (N. Virginia). You should see something similar to this:
Click on the Setup Network and Security button. You should see something similar to this:
We are going to keep the default options here. Click on the Choose Blueprint button. You should see something similar to this:
Expand the blueprint dropdown menu. You should see the blueprint we created before, hdp26-spark21-cluster . Select the blueprint. You should see something similar to this:
You should notice the new security group is already selected. Cloudbreak did not automatically figure this out; the instance templates and security groups are simply selected alphabetically by default.
Now we need to select a node on which to deploy Ambari. I typically deploy Ambari on the master1 server. Check the Ambari check box on one of the master servers. If everything looks good, click on the green create cluster button. You should see something similar to this:
Once the cluster has finished building, you can click on the arrow for the cluster we created to get expanded details. You should see something similar to this:
Verify Versions
Once the cluster is fully deployed, we can verify the versions of the components. Click on the Ambari link on the cluster details page. Once you login to Ambari, you should see something similar to this:
You should notice that Spark2 is shown in the component list. Click on Spark2 in the list. You should see something similar to this:
You should notice that both the Spark2 Thrift Server and the Livy2 Server have been installed. Now let's check the overall cluster versions. Click on the Admin link in the Ambari menu and select Stacks and Versions, then click on the Versions tab. You should see something similar to this:
As you can see, HDP 2.6.0.3 was deployed.
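If you also want to confirm the Spark version from the command line, you can SSH into one of the cluster nodes and ask the Spark 2 client directly. This is a sketch that assumes the standard /usr/hdp/current symlink layout on HDP nodes:
# Print the version of the Spark 2 client installed by the blueprint
/usr/hdp/current/spark2-client/bin/spark-submit --version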
Review
If you have successfully followed along with this tutorial, you should know how to create a new security group and blueprint. The blueprint allows you to deploy HDP 2.6 with Spark 2.1. The security group allows you to access all ports on the cluster from your IP address. Follow along in part 2 of the tutorial series to use Zeppelin to test Spark 2.1.
05-18-2017
03:00 PM
6 Kudos
Prerequisites
You should already have a Cloudbreak v1.14.4 environment running. You can follow this article to create a Cloudbreak instance using Vagrant and Virtualbox: HCC Article
You should already have credentials created in Cloudbreak for deploying on AWS (or Azure).
Scope
This tutorial was tested in the following environment:
macOS Sierra (version 10.12.4)
Cloudbreak 1.14.4
AWS EC2
NOTE: Cloudbreak 1.14.0 (TP) had a bug which caused HDP 2.6 cluster installs to fail. You should upgrade your Cloudbreak deployer instance to 1.14.4.
Steps
Create application.yml file
UPDATE 05/24/2017: The creation of a custom application.yml file is not required with Cloudbreak 1.14.4. This version of Cloudbreak includes support for HDP 2.5 and HDP 2.6. This step remains for educational purposes for future HDP updates.
You need to create an application.yml file in the etc directory within your Cloudbreak deployment directory. This file will contain the repo information for HDP 2.6. If you followed my tutorial linked above, then your Cloudbreak deployment directory should be /opt/cloudbreak-deployment . If you are using a Cloudbreak instance on AWS or Azure, then your Cloudbreak deployment directory is likely /var/lib/cloudbreak-deployment/ .
Edit your <cloudbreak-deployment>/etc/application.yml file using your favorite editor. Copy and paste the following in the file:
cb:
  ambari:
    repo:
      version: 2.5.0.3-7
      baseurl: http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.5.0.3
      gpgkey: http://public-repo-1.hortonworks.com/ambari/centos6/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
    database:
      vendor: embedded
      host: localhost
      port: 5432
      name: postgres
      username: ambari
      password: bigdata
  hdp:
    entries:
      2.5:
        version: 2.5.0.1-210
        repoid: HDP-2.5
        repo:
          stack:
            repoid: HDP-2.5
            redhat6: http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.5.5.0
            redhat7: http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.5.5.0
          util:
            repoid: HDP-UTILS-1.1.0.21
            redhat6: http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos6
            redhat7: http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos7
      2.6:
        version: 2.6.0.0-598
        repoid: HDP-2.6
        repo:
          stack:
            repoid: HDP-2.6
            redhat6: http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.6.0.3
            redhat7: http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.0.3
          util:
            repoid: HDP-UTILS-1.1.0.21
            redhat6: http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos6
            redhat7: http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos7
Start Cloudbreak
Once you have created your application.yml file, you can start Cloudbreak.
$ cbd start
NOTE: It may take a couple of minutes before Cloudbreak is fully running.
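If you want to watch the deployer come up, you can follow its logs. This assumes the standard cloudbreak-deployer CLI, where cbd logs tails the logs of the named service:
# Follow the Cloudbreak application logs while the deployer starts
cbd logs cloudbreak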
Create HDP 2.6 Blueprint
To create an HDP 2.6 cluster, we need to update our blueprint to specify HDP 2.6. On the main Cloudbreak UI, click on manage blueprints . You should see something similar to this:
You should see 3 default blueprints. We are going to use the hdp-small-default blueprint as our base. Click on the hdp-small-default blueprint name. You should see something similar to this:
Now click on the blue copy & edit button. You should see something similar to this:
For the Name, you should enter something unique and descriptive. I suggest hdp26-small-default. For the Description, you can enter the same information. You should see something similar to this:
Now we need to edit the JSON portion of the blueprint. Scroll down to the bottom of the JSON. You should see something similar to this:
Now edit the blueprint_name value to be hdp26-small-default and edit the stack_version to be 2.6 . You should see something similar to this:
Now click on the green create blueprint button. You should see the new blueprint visible in the list of blueprints.
Create HDP 2.6 Small Default Cluster
Now that our blueprint has been created, we can create a cluster and select this blueprint to install HDP 2.6. Select the appropriate credential for your Cloud environment. Click on the create cluster button. You should see something similar to this:
Provide a unique, but descriptive Cluster Name . Ensure you select an appropriate Region . I chose hdp26test as my cluster name and I'm using the US East region:
Now advance to the next step by clicking on Setup Network and Security. You should see something similar to this:
We don't need to make any changes here, so click on the Choose Blueprint button. You should see something similar to this:
In the Blueprint dropdown, you should see the blueprint we created. Select the hdp26-small-default blueprint. You should see something similar to this:
You need to select which node Ambari will run on. I typically select the master1 node. You should see something similar to this:
Now you can click on the Review and Launch button. You should see something similar to this:
Verify the information presented. If everything looks good, click on the create and start cluster button . Once the cluster build process has started, you should see something similar to this:
Verify HDP Version
Once the cluster has finished building, you can click on the cluster in the Cloudbreak UI. You should see something similar to this:
Click on the Ambari link to load Ambari. Login using the default username and password of admin . Now click on the Admin link in the menu. You should see something similar to this:
Click on the Stack and Versions link. You should see something similar to this:
You should notice that HDP 2.6.0.3 has been deployed.
Review
If you have successfully followed along with this tutorial, you should know how to create/update the etc/application.yml file in your Cloudbreak deployment directory to add specific Ambari and HDP repositories. You should have successfully created an updated blueprint and deployed HDP 2.6 on your cloud of choice.
05-13-2017
01:57 AM
5 Kudos
Objectives
This tutorial will walk you through the process of using Cloudbreak recipes to install Anaconda on your HDP cluster during cluster provisioning. This process can be used to automate many tasks on the cluster, both pre-install and post-install.
Prerequisites
You should already have a Cloudbreak v1.14.0 environment running. You can follow this article to create a Cloudbreak instance using Vagrant and Virtualbox: HCC Article
You should already have credentials created in Cloudbreak for deploying on AWS (or Azure).
Scope
This tutorial was tested in the following environment:
macOS Sierra (version 10.12.4)
Cloudbreak 1.14.0 TP
AWS EC2
Anaconda 2.7.13
Steps
Create Recipe
Before you can use a recipe during a cluster deployment, you have to create the recipe. In the Cloudbreak UI, look for the "manage recipes" section. It should look similar to this:
If this is your first time creating a recipe, you will have 0 recipes instead of the 2 recipes shown in my interface.
Now click on the arrow next to manage recipes to display available recipes. You should see something similar to this:
Now click on the green create recipe button. You should see something similar to this:
Now we can enter the information for our recipe. I'm calling this recipe anaconda . I'm giving it the description of Install Anaconda . You can choose to install Anaconda as either pre-install or post-install. I'm choosing to do the install post-install. This means the script will be run after the Ambari installation process has started. So choose the Execution Type of POST . Choose Script so we can copy and paste the shell script. You can also specify a file to upload or a URL (gist for example). Our script is very basic. We are going to download the Anaconda install script, then run it in silent mode. Here is the script:
#!/bin/bash
# Download the Anaconda2 4.3.1 installer and run it in silent (batch) mode, installing into /opt/anaconda
wget https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
bash ./Anaconda2-4.3.1-Linux-x86_64.sh -b -p /opt/anaconda
When you have finished entering all of the information, you should see something similar to this:
If everything looks good, click on the green create recipe button.
After the recipe has been created, you should see something similar to this:
Create a Cluster using a Recipe
Now that our recipe has been created, we can create a cluster that uses the recipe. Go through the process of creating a cluster up to the Choose Blueprint step. This step is when you select the recipe you want to use. The recipes are not selected by default; you have to select the recipes you wish to use. You specify recipes for 1 or more host groups. This allows you to run different recipes across different host groups (masters, slaves, etc). You can also select multiple recipes.
We want to use the hdp-small-default blueprint. This will create a basic HDP cluster.
If you select the anaconda recipe, you should see something similar to this:
In our case, we are going to run the recipe on every host group. If you intend to use something like Anaconda across the cluster, you should install it on at least the slave nodes and the client nodes.
After you have selected the recipe for the host groups, click the Review & Launch button, then launch the cluster. As the cluster is building, you should see a message in the Cloudbreak UI that indicates the recipe is running. When that happens, you will see something similar to this:
Cloudbreak will create logs for each recipe that runs on each host. These logs are located under /var/log/recipes and are named after the recipe and whether it is pre- or post-install. For example, our recipe log is called post-anaconda.log. You can tail this log file to follow the execution of the script.
NOTE: Post-install scripts won't be executed until the Ambari server is installed and the cluster is building. You can always monitor the /var/log/recipes directory on a node to see when the script is being executed. The time it takes to run the script will vary depending on the cloud environment and how long it takes to spin up the cluster.
On your cluster, you should be able to see the post-install log:
$ ls /var/log/recipes
post-anaconda.log post-hdfs-home.log
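To follow the recipe while it runs on a node, you can tail the corresponding log, for example:
# Watch the post-install Anaconda recipe as it executes
tail -f /var/log/recipes/post-anaconda.log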
Once the install process is complete, you should be able to verify that Anaconda is installed. You need to ssh into one of the cloud instances. You can get the public IP address from the Cloudbreak UI. You will login using the private key corresponding to the public key you entered when you created the Cloudbreak credential. You should login as the cloudbreak user. You should see something similar to this:
$ ssh -i ~/Downloads/keys/cloudbreak_id_rsa cloudbreak@#.#.#.#
The authenticity of host '#.#.#.# (#.#.#.#)' can't be established.
ECDSA key fingerprint is SHA256:By1MJ2sYGB/ymA8jKBIfam1eRkDS5+DX1THA+gs8sdU.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '#.#.#.#' (ECDSA) to the list of known hosts.
Last login: Sat May 13 00:47:41 2017 from 192.175.27.2
__| __|_ )
_| ( / Amazon Linux AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/
25 package(s) needed for security, out of 61 available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2017.03 is available.
Once you are on the server, you can check the version of python:
$ /opt/anaconda/bin/python --version
Python 2.7.13 :: Anaconda 4.3.1 (64-bit)
Review
If you have successfully followed along with this tutorial, you should know how to create pre and post install scripts. You should have successfully deployed a cluster on either AWS or Azure with Anaconda installed under /opt/anaconda on the nodes you specified.
07-14-2017
02:44 PM
Answering my own question: it doesn't work with the latest version of Cloudbreak (1.16.1). After logging in to the GUI I get the error "Cannot retrieve csrf token". But it does work with version 1.14.4.
05-09-2017
09:36 PM
Thanks for your explanation @Michael Young ! Helped a lot.
04-13-2017
03:05 PM
@Michael Young Thanks ! That worked like a charm. I still have no idea why it doesn't let me upload using the HDFS UI so if you know why then I would love to know.
03-08-2017
06:26 PM
1 Kudo
@glupu this is exactly what I did: re-imported a new sandbox (and deleted the previous one). The one thing I lost due to that was all my Zeppelin notebooks (I should have taken a backup of them).
03-18-2017
12:46 AM
@Yogesh Sharma The _all field is analyzed by default, so you shouldn't have problems performing case-insensitive queries. You are also specifying the analyze_wildcard: true parameter which will attempt to analyze the query string with wildcards before running the query. As you have shown, the query itself returns hits. So the problem is with the aggregations. For your aggregations you are using the include parameter. Can you try using ".*drama.*" as the include value instead of "*drama*"?
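For reference, here is a rough sketch of what I mean; the index name (movies) and field name (genre) are placeholders, so substitute your own:
# Terms aggregation using a regular expression for the include parameter
curl -s -XPOST "http://localhost:9200/movies/_search?size=0" -H 'Content-Type: application/json' -d '{
  "query": { "query_string": { "query": "*drama*", "analyze_wildcard": true } },
  "aggs": {
    "genres": { "terms": { "field": "genre", "include": ".*drama.*" } }
  }
}'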
03-05-2017
06:58 PM
4 Kudos
Objective
This tutorial will walk you through the process of using Ansible to deploy Hortonworks Data Platform (HDP) on Amazon Web Services (AWS). We will use the ansible-hadoop Ansible playbook from ObjectRocket to do this. You can find more information on that playbook here: ObjectRocket Ansible-Hadoop
This tutorial is part 2 of a 2-part series. Part 1 in the series shows you how to use Ansible to create instances on Amazon Web Services (AWS). Part 1 is available here: HCC Article Part 1
This tutorial was created as a companion to the Ansible + Hadoop talk I gave at the Ansible NOVA Meetup in February 2017. You can find the slides to that talk here: SlideShare
Prerequisites
You must have an existing AWS account.
You must have access to your AWS Access and Secret keys.
You are responsible for all AWS costs incurred.
You should have 3-6 instances created in AWS. If you completed Part 1 of this series, then you have an easy way to do that.
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6 and 10.12.3
Amazon Web Services
Anaconda 4.1.6 (Python 2.7.12)
Ansible 2.1.3.0
git 2.10.1
Steps
Create python virtual environment
We are going to create a Python virtual environment for installing the required Python modules. This will help eliminate module version conflicts between applications.
I prefer to use Continuum Anaconda for my Python distribution. Therefore the steps for setting up a python virtual environment will be based on that. However, you can use standard python and the virtualenv command to do something similar.
To create a virtual environment using Anaconda Python, you use the conda create command. We will name our virtual environment ansible-hadoop. The following command: conda create --name ansible-hadoop python will create our virtual environment with the name specified. You should see something similar to the following:
$ conda create --name ansible-hadoop python
Fetching package metadata .......
Solving package specifications: ..........
Package plan for installation in environment /Users/myoung/anaconda/envs/ansible-hadoop:
The following NEW packages will be INSTALLED:
openssl: 1.0.2k-1
pip: 9.0.1-py27_1
python: 2.7.13-0
readline: 6.2-2
setuptools: 27.2.0-py27_0
sqlite: 3.13.0-0
tk: 8.5.18-0
wheel: 0.29.0-py27_0
zlib: 1.2.8-3
Proceed ([y]/n)? y
Linking packages ...
cp: /Users/myoung/anaconda/envs/ansible-hadoop:/lib/libcrypto.1.0.0.dylib: No such file or directory
mv: /Users/myoung/anaconda/envs/ansible-hadoop/lib/libcrypto.1.0.0.dylib-tmp: No such file or directory
[ COMPLETE ]|################################################################################################| 100%
#
# To activate this environment, use:
# $ source activate ansible-hadoop
#
# To deactivate this environment, use:
# $ source deactivate
#
Switch python environments
Before installing python packages for a specific development environment, you should activate the environment. This is done with the command source activate <environment> . In our case the environment is the one we just created, ansible-hadoop . You should see something similar to the following:
$ source activate ansible-hadoop
As you can see there is no output to indicate if we were successful in changing our environment.
To verify, you can use the conda info --envs command to list the available environments. The active environment will have a * . You should see something similar to the following:
$ conda info --envs
# conda environments:
#
ansible-hadoop * /Users/myoung/anaconda/envs/ansible-hadoop
root /Users/myoung/anaconda
As you can see, the ansible-hadoop environment has the * which means it is the active environment.
If you want to remove your python virtual environment, you can use the following command: conda remove --name <environment> --all . If you want to remove the environment we just created you should see something similar to the following:
$ conda remove --name ansible-hadoop --all
Package plan for package removal in environment /Users/myoung/anaconda/envs/ansible-hadoop:
The following packages will be REMOVED:
openssl: 1.0.2k-1
pip: 9.0.1-py27_1
python: 2.7.13-0
readline: 6.2-2
setuptools: 27.2.0-py27_0
sqlite: 3.13.0-0
tk: 8.5.18-0
wheel: 0.29.0-py27_0
zlib: 1.2.8-3
Proceed ([y]/n)? y
Unlinking packages ...
[ COMPLETE ]|################################################################################################| 100%
HW11380:test myoung$ conda info --envs
# conda environments:
#
root * /Users/myoung/anaconda
Install Python modules in virtual environment
The ansible-hadoop playbook requires a specific version of Ansible. You need to install Ansible 2.1.3.0 before using the playbook. You can do that easily with the following command:
pip install ansible==2.1.3.0
Using a Python virtual environment allows us to easily use Ansible 2.1.3.0 for our playbook without impacting the default Python versions.
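A quick sanity check that the pinned version is the one on your path (run inside the activated environment):
# Confirm the playbook's required Ansible version is active
ansible --version   # the first line should report ansible 2.1.3.0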
Clone ansible-hadoop github repo
You need to clone the ansible-hadoop github repo to a working directory on your computer. I typically do this in ~/Development.
$ cd ~/Development
$ git clone https://github.com/objectrocket/ansible-hadoop.git
You should see something similar to the following:
$ git clone https://github.com/objectrocket/ansible-hadoop.git
Cloning into 'ansible-hadoop'...
remote: Counting objects: 3879, done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 3879 (delta 1), reused 0 (delta 0), pack-reused 3873
Receiving objects: 100% (3879/3879), 6.90 MiB | 0 bytes/s, done.
Resolving deltas: 100% (2416/2416), done.
Configure ansible-hadoop
You should make the ansible-hadoop repo directory your current working directory. There are a few configuration items we need to change.
$ cd ansible-hadoop
You should already have 3-6 instances available in AWS. You will need the public IP address of those instances.
Configure ansible-hadoop/inventory/static
We need to modify the inventory/static file to include the public IP addresses of our AWS instances. We need to assign master and slave nodes in the file. The instances are all the same configuration by default, so it doesn't matter which IP addresses you put for master and slave.
The default version of the inventory/static file should look similar to the following:
[master-nodes]
master01 ansible_host=192.168.0.2 bond_ip=172.16.0.2 ansible_user=rack ansible_ssh_pass=changeme
#master02 ansible_host=192.168.0.2 bond_ip=172.16.0.2 ansible_user=root ansible_ssh_pass=changeme
[slave-nodes]
slave01 ansible_host=192.168.0.3 bond_ip=172.16.0.3 ansible_user=rack ansible_ssh_pass=changeme
slave02 ansible_host=192.168.0.4 bond_ip=172.16.0.4 ansible_user=rack ansible_ssh_pass=changeme
[edge-nodes]
#edge01 ansible_host=192.168.0.5 bond_ip=172.16.0.5 ansible_user=rack ansible_ssh_pass=changeme
I'm going to be using 6 instances in AWS. I will put 3 instances as master servers and 3 instances as slave servers. There are a couple of extra options in the default file we don't need. The only values we need are:
hostname : which should be master, slave or edge with a 1-up number like master01 and slave01
ansible_host : should be the AWS public IP address of the instances
ansible_user : should be the username you use to SSH into the instance with the private key.
You can easily get the public IP address of your instances from the AWS console. Here is what mine looks like:
If you followed the part 1 tutorial, then the username for your instances should be centos . Edit your inventory/static . You should have something similar to the following:
[master-nodes]
master01 ansible_host=#.#.#.# ansible_user=centos
master02 ansible_host=#.#.#.# ansible_user=centos
master03 ansible_host=#.#.#.# ansible_user=centos
[slave-nodes]
slave01 ansible_host=#.#.#.# ansible_user=centos
slave02 ansible_host=#.#.#.# ansible_user=centos
slave03 ansible_host=#.#.#.# ansible_user=centos
#[edge-nodes]
Your public IP addresses will be different. Also note the #[edge-nodes] value in the file. Because we are not using any edge nodes, we should comment out that host group line in the file.
Once you have all of your edits in place, save the file.
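If you would rather pull the public IP addresses from the command line than read them off the console, here is a sketch using the AWS CLI; it assumes the CLI is configured with your credentials and that the instances still carry the aws-demo Name tag from part 1:
# List the public IP addresses of all running instances tagged Name=aws-demo
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=aws-demo" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" \
  --output text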
Configure ansible-hadoop/ansible.cfg
There are a couple of changes we need to make to the ansible.cfg file. This file provides overall configuration settings for Ansible. The default file in the playbook should look similar to the following:
[defaults]
host_key_checking = False
timeout = 60
ansible_keep_remote_files = True
library = playbooks/library/cloudera
We need to change the library line to be library = playbooks/library/site_facts . We will be deploying HDP which requires the site_facts module. We also need to tell Ansible where to find the private key file for connecting to the instances.
Edit the ansible.cfg file. You should modify the file to be similar to the following:
[defaults]
host_key_checking = False
timeout = 60
ansible_keep_remote_files = True
library = playbooks/library/site_facts
private_key_file=/Users/myoung/Development/ansible-hadoop/ansible.pem
Note the path of your private_key_file will be different. Once you have all of your edits in place, save the file.
Configure ansible-hadoop/group_vars/hortonworks
This step is optional. The group_vars/hortonworks file allows you to change how HDP is deployed. You can modify the version of HDP and Ambari. You can modify which components are installed. You can also specify custom repos and Ambari blueprints.
I will be using the default file, so there are no changes made.
Run bootstrap_static.sh
Before installing HDP, we need to ensure our OS configuration on the AWS instances meets the installation prerequisites. This includes things like ensuring DNS and NTP are working and all of the OS packages are updated. These are tasks that you often find people doing manually. This would obviously be tedious across 100s or 1000s of nodes. It would also introduce a far greater number of opportunities for human error. Ansible makes it incredibly easy to perform these kinds of tasks.
Running the bootstrap process is as easy as bash bootstrap_static.sh . This script essentially runs ansible-playbook -i inventory/static playbooks/bootstrap.yml for you. This process will typically take 7-10 minutes depending on the size of the instances you selected.
When the script is finished, you should see something similar to the following;
PLAY RECAP *********************************************************************
localhost : ok=3 changed=2 unreachable=0 failed=0
master01 : ok=21 changed=15 unreachable=0 failed=0
master03 : ok=21 changed=15 unreachable=0 failed=0
slave01 : ok=21 changed=15 unreachable=0 failed=0
slave02 : ok=21 changed=15 unreachable=0 failed=0
slave03 : ok=21 changed=15 unreachable=0 failed=0
As you can see, all of the nodes had 21 total tasks performed. Of those tasks, 15 required modifications to be compliant with the desired configuration state.
Run hortonworks_static.sh
Now that the bootstrap process is complete, we can install HDP. The hortonworks_static.sh script is all you have to run to install HDP. This script essentially runs ansible-playbook -i inventory/static playbooks/hortonworks.yml for you. The script installs the Ambari Server on the last master node in our list; in my case, the last master node is master03. The script also installs the Ambari Agent on all of the nodes. The installation of HDP is performed by submitting a request to the Ambari Server API using an Ambari Blueprint.
This process will typically take 10-15 minutes depending on the size of the instances you selected, the number of master nodes and the list of HDP components you have enabled.
If you forgot to install the specific version of Ansible, you will likely see something similar to the following:
TASK [site facts processing] ***************************************************
fatal: [localhost]: FAILED! => {"failed": true, "msg": "ERROR! The module sitefacts.py dnmemory=\"31.0126953125\" mnmemory=\"31.0126953125\" cores=\"8\" was not found in configured module paths. Additionally, core modules are missing. If this is a checkout, run 'git submodule update --init --recursive' to correct this problem."}
PLAY RECAP *********************************************************************
localhost : ok=4 changed=2 unreachable=0 failed=1
master01 : ok=8 changed=0 unreachable=0 failed=0
master03 : ok=8 changed=0 unreachable=0 failed=0
slave01 : ok=8 changed=0 unreachable=0 failed=0
slave02 : ok=8 changed=0 unreachable=0 failed=0
slave03 : ok=8 changed=0 unreachable=0 failed=0
To resolve this, simply perform the pip install ansible==2.1.3.0 command within your Python virtual environment. Now you can rerun the bash hortonworks_static.sh script.
The last task of the playbook is to install HDP via an Ambari Blueprint. It is normal to see something similar to the following:
TASK [ambari-server : Create the cluster instance] *****************************
ok: [master03]
TASK [ambari-server : Wait for the cluster to be built] ************************
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (180 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (179 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (178 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (177 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (176 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (175 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (174 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (173 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (172 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (171 retries left).
FAILED - RETRYING: TASK: ambari-server : Wait for the cluster to be built (170 retries left).
Once you see 3-5 of the retry messages, you can access the Ambari interface via your web browser. The default login is admin and the default password is admin . You should see something similar to the following:
Click on the Operations icon that shows 10 operations in progress. You should see something similar to the following:
The installation tasks each take between 400-600 seconds. The start tasks each take between 20-300 seconds. The master servers typically take longer to install and start than the slave servers.
When everything is running properly, you should see something similar to this:
If you look back at your terminal window, you should see something similar to the following:
ok: [master03]
TASK [ambari-server : Fail if the cluster create task is in an error state] ****
skipping: [master03]
TASK [ambari-server : Change Ambari admin user password] ***********************
skipping: [master03]
TASK [Cleanup the temporary files] *********************************************
changed: [master03] => (item=/tmp/cluster_blueprint)
changed: [master03] => (item=/tmp/cluster_template)
changed: [master03] => (item=/tmp/alert_targets)
ok: [master03] => (item=/tmp/hdprepo)
PLAY RECAP *********************************************************************
localhost : ok=5 changed=3 unreachable=0 failed=0
master01 : ok=8 changed=0 unreachable=0 failed=0
master03 : ok=30 changed=8 unreachable=0 failed=0
slave01 : ok=8 changed=0 unreachable=0 failed=0
slave02 : ok=8 changed=0 unreachable=0 failed=0
slave03 : ok=8 changed=0 unreachable=0 failed=0
Destroy the cluster
You should remember that you will incur AWS costs while the cluster is running. You can either shutdown or terminate the instances. If you want to use the cluster later, then use Ambari to stop all of the services before shutting down the instances.
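If you prefer the command line to the AWS console for this step, here is a hedged sketch with the AWS CLI; it assumes the instances are tagged Name=aws-demo as in part 1 (use terminate-instances instead of stop-instances if you want them gone for good):
# Stop every running instance tagged Name=aws-demo
aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=aws-demo" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].InstanceId" \
  --output text | xargs aws ec2 stop-instances --instance-ids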
Review
If you successfully followed along with this tutorial, you should have been able to easily deploy Hortonworks Data Platform 2.5 on AWS using the Ansible playbook. The process to deploy the cluster typically takes 10-20 minutes.
For more information on how the instance types and number of master nodes impact the installation time, review the Ansible + Hadoop slides I linked at the top of the article.
03-04-2017
06:05 PM
4 Kudos
Objective
This tutorial will walk you through the process of using Ansible, an agent-less automation tool, to create instances on AWS. The Ansible playbook we will use is relatively simple; you can use it as a base to experiment with more advanced features. You can read more about Ansible here: Ansible.
Ansible is written in Python and is installed as a Python module on the control host. The only requirement for the hosts managed by Ansible is the ability to login with SSH. There is no requirement to install any software on the host managed by Ansible.
If you have never used Ansible, you can become more familiar with it by going through some basic tutorials. The following two tutorials are a good starting point:
Automate All Things With Ansible: Part One
Automate All Things With Ansible: Part Two
This tutorial is part 1 of a 2 part series. Part 2 in the series will show you how to use Ansible to deploy Hortonworks Data Platform (HDP) on Amazon Web Services (AWS).
This tutorial was created as a companion to the Ansible + Hadoop talk I gave at the Ansible NOVA Meetup in February 2017. You can find the slides to that talk here: SlideShare
You can get a copy of the playbook from this tutorial here: Github
Prerequisites
You must have an existing AWS account.
You must have access to your AWS Access and Secret keys.
You are responsible for all AWS costs incurred.
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.11.6 and 10.12.3
Amazon Web Services
Anaconda 4.1.6 (Python 2.7.12)
Ansible 2.0.0.2 and 2.1.3.0
Steps
Create a project directory
You need to create a directory for your Ansible playbook. I prefer to create my project directories in ~/Development.
mkdir ~/Development/ansible-aws
cd ~/Development/ansible-aws
Install Ansible module
If you use the Anaconda version of Python, you already have access to Ansible. If you are not using Anaconda, then you can usually install Ansible using the following command:
pip install ansible
To read more about how to install Ansible: Ansible Installation
Overview of our Ansible playbook
Our playbook is relatively simple. It consists of a single inventory file, single group_vars file and a single playbook file. Here is the layout of the file and directory structure:
+- ansible-aws/
|
+- group_vars/
| +- all
|
+- inventory/
| +- hosts
|
+- playbooks/
| +- ansible-aws.yml
group_vars/all
You can use variables in your playbooks using the {{variable name}} syntax. These variables are populated based on values stored in your variable files. You can explicitly load variable files in your playbooks.
However, all playbooks will automatically load the variables in the group_vars/all variable file. The all variable file is loaded for all hosts regardless of the groups the host may be in. In our playbook, we are placing our AWS configuration values in the all file.
Edit the group_vars/all file. Copy and paste the following text into the file:
aws_access_key: <enter AWS access key>
aws_secret_key: <enter AWS secret key>
key_name: <enter private key file alias name>
aws_region: <enter AWS region>
vpc_id: <enter VPC ID>
ami_id: ami-6d1c2007
instance_type: m4.2xlarge
my_local_cidr_ip: <enter cidr_ip>
aws_access_key : You need to enter your AWS Access key
aws_secret_key : You need to enter your AWS Secret key
key_name : The alias name you gave to the AWS private key which you will use to SSH into the instances. In my case I created a key called ansible .
aws_region : The AWS region where you want to deploy your instances. In my case I am using us-east-1 .
vpc_id : The specific VPC in which you want to place your instances.
ami_id : The specific AMI you want to deploy for your instances. The ami-6d1c2007 AMI is a CentOS 7 image.
instance_type : The type of AWS instance. For deploying Hadoop, I recommend at least m4.2xlarge . A faster alternative is c4.4xlarge .
my_local_cidr_ip : Your local computer's CIDR IP address. This is used for creating the security rules that allow your local computer to access the instances. An example CIDR format is 192.168.1.1/32 . Make sure this is set to your computer's public IP address.
After you have entered your appropriate settings, save the file.
inventory/hosts
Ansible requires a list of known hosts against which playbooks and tasks are run. We will tell Ansible to use a specific host file with the -i inventory/hosts parameter.
Edit the inventory/hosts file. Copy and paste the following text into the file:
[local]
localhost ansible_python_interpreter=/Users/myoung/anaconda/bin/python
[local] : Defines the group the host belongs to. You have the option for a playbook to run against all hosts, a specific group of hosts, or an individual host. This AWS playbook only runs on your local computer. That is because it uses the AWS APIs to communicate with AWS.
localhost : This is the hostname. You can list multiple hosts, 1 per line under each group heading. A host can belong to multiple groups.
ansible_python_interpreter : Optional entry that tells Ansible which specific version of Python to run. Because I am using Anaconda Python, I've included that setting here.
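If you are not sure what path to use for ansible_python_interpreter, you can ask the shell which interpreter is currently active; the path shown in the example hosts file is mine and yours will differ:
# Show the full path of the python interpreter on your PATH
which python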
After you have entered your appropriate settings, save the file.
playbooks/ansible-aws.yml
The playbook is where we define the list of tasks we want to perform. Our playbook will consist of 2 tasks. The first task is to create a specific AWS security group. The second task is to create a specific configuration of 6 instances on AWS.
Edit the file playbooks/ansible-aws.yml . Copy and paste the following text into the file:
---
# Basic provisioning example
- name: Create AWS resources
hosts: localhost
connection: local
gather_facts: False
tasks:
- name: Create a security group
ec2_group:
name: ansible
description: "Ansible Security Group"
region: "{{aws_region}}"
vpc_id: "{{vpc_id}}"
aws_access_key: "{{aws_access_key}}"
aws_secret_key: "{{aws_secret_key}}"
rules:
- proto: all
cidr_ip: "{{my_local_cidr_ip}}"
- proto: all
group_name: ansible
rules_egress:
- proto: all
cidr_ip: 0.0.0.0/0
register: firewall
- name: Create an EC2 instance
ec2:
aws_access_key: "{{aws_access_key}}"
aws_secret_key: "{{aws_secret_key}}"
key_name: "{{key_name}}"
region: "{{aws_region}}"
group_id: "{{firewall.group_id}}"
instance_type: "{{instance_type}}"
image: "{{ami_id}}"
wait: yes
volumes:
- device_name: /dev/sda1
volume_type: gp2
volume_size: 100
delete_on_termination: true
exact_count: 6
count_tag:
Name: aws-demo
instance_tags:
Name: aws-demo
register: ec2
This playbook uses the Ansible ec2 and ec2_group modules. You can read more about the options available to those modules here:
ec2
ec2_group
The task to create the EC2 security group creates a group named ansible . It defines 2 ingress rules and 1 egress rule for that security group. The first ingress rule allows all inbound traffic from your local computer's IP address. The second ingress rule allows all inbound traffic from any host in the ansible security group. The egress rule allows all traffic out from all of the hosts.
The task to create the EC2 instances creates 6 hosts because of the exact_count setting. It creates a tag called aws-demo on each of the instances and uses that tag to determine how many hosts exist. You can choose to use a smaller number of hosts.
You can specify volumes to mount on each of the instances. The default volume size is 8 GB and is too small for deploying Hadoop later. I recommend setting the size to at least 100 GB as above. I also recommend you set delete_on_termination to true . This will tell AWS to delete the storage after you have deleted the instances. If you do not do this, then storage will be kept and you will be charged for it.
After you have entered your appropriate settings, save the file.
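Before running the playbook against AWS, you can optionally have Ansible parse it first; the --syntax-check flag validates the playbook structure without executing anything:
# Validate the playbook without creating any AWS resources
ansible-playbook -i inventory/hosts playbooks/ansible-aws.yml --syntax-check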
Running the Ansible playbook
Now that our 3 files have been created and saved with the appropriate settings, we can run the playbook. To run the playbook, you use the ansible-playbook -i inventory/hosts playbooks/ansible-aws.yml command. You should see something similar to the following:
$ ansible-playbook -i inventory/hosts playbooks/ansible-aws.yml
PLAY [Create AWS resources] ****************************************************
TASK [Create a security group] *************************************************
changed: [localhost]
TASK [Create an EC2 instance] **************************************************
changed: [localhost]
PLAY RECAP *********************************************************************
localhost : ok=2 changed=2 unreachable=0 failed=0
The changed lines indicate that Ansible found a configuration that needed to be modified to be consistent with our requested state. For the security group task, you would see this if your security group didn't exist or if you had a different set of ingress or egress rules. For the instance task, you would see this if there were fewer or more than 6 hosts tagged as aws-demo .
Check the AWS console
If you check your AWS console, you should be able to confirm the instances are created. You should see something similar to the following:
Review
If you successfully followed along with this tutorial, you have created a simple Ansible playbook with 2 tasks using the ec2 and ec2_group Ansible modules. The playbook creates an AWS security group and instances which can be used later for deploying HDP on AWS.