05-24-2017
06:06 PM
3 Kudos
This tutorial will walk you through the process of using Cloudbreak recipes to install TensorFlow for Anaconda Python on an HDP 2.6 cluster during cluster provisioning. We'll then update Zeppelin to use the newly installed version of Anaconda and run a quick TensorFlow test.
Prerequisites
You should already have a Cloudbreak v1.14.4 environment running. You can follow this article to create a Cloudbreak instance using Vagrant and Virtualbox: HCC Article
You should already have created a blueprint that deploys HDP 2.6 with Spark 2.1. You can follow this article to get the blueprint setup. Do not create the cluster yet, as we will do that in this tutorial: HCC Article
You should already have credentials created in Cloudbreak for deploying on AWS (or Azure). This tutorial does not cover creating credentials.
Scope
This tutorial was tested in the following environment:
Cloudbreak 1.14.4
AWS EC2
HDP 2.6
Spark 2.1
Anaconda 2.7.13
TensorFlow 1.1.0
Steps
Create Recipe
Before you can use a recipe during a cluster deployment, you have to create the recipe. In the Cloudbreak UI, look for the manage recipes section. It should look similar to this:
If this is your first time creating a recipe, you will have 0 recipes instead of the 2 recipes shown in my interface.
Now click on the arrow next to manage recipes to display available recipes. You should see something similar to this:
Now click on the green create recipe button. You should see something similar to this:
Now we can enter the information for our recipe. I'm calling this recipe tensorflow and giving it the description Install TensorFlow Python . You can choose to run the script as either pre-install or post-install . I'm choosing post-install, which means the script will run after the Ambari installation process has started, so choose the Execution Type of POST . The script is fairly basic: we download the Anaconda install script, run it in silent mode, and then use the Anaconda version of pip to install TensorFlow. Here is the script:
#!/bin/bash
# Download the Anaconda installer
wget https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
# Run the installer in silent (batch) mode and install to /opt/anaconda
bash ./Anaconda2-4.3.1-Linux-x86_64.sh -b -p /opt/anaconda
# Use Anaconda's pip to install the TensorFlow 1.1.0 wheel
/opt/anaconda/bin/pip install --ignore-installed --upgrade https://storage.googleapis.com/tensorflow/linux/cpu/tensorflow-1.1.0-cp27-none-linux_x86_64.whl
You can read more about installing TensorFlow on Anaconda here: TensorFlow Docs.
When you have finished entering all of the information, you should see something similar to this:
If everything looks good, click on the green create recipe button.
You should be able to see the recipe in your list of recipes:
NOTE: You will most likely have a different list of recipes.
Create a Cluster using a Recipe
Now that our recipe has been created, we can create a cluster that uses the recipe. Go through the process of creating a cluster up to the Choose Blueprint step. This step is where you select the recipe you want to use. The recipes are not selected by default; you have to select the recipes you wish to use. You can specify recipes for 1 or more host groups. This allows you to run different recipes across different host groups (masters, slaves, etc). You can also select multiple recipes.
We want to use the hdp26-spark21-cluster blueprint. This will create an HDP 2.6 cluster with Spark 2.1 and Zeppelin. You should have created this blueprint when you followed the prerequisite tutorial. You should see something similar to this:
In our case, we are going to run the tensorflow recipe on every host group. If you intend to use something like TensorFlow across the cluster, you should install it on at least the slave nodes and the client nodes.
After you have selected the recipe for the host groups, click the Review & Launch button, then launch the cluster. As the cluster is building, you should see a message in the Cloudbreak UI that indicates the recipe is running. When that happens, you will see something similar to this:
If you click on the building cluster, you can see more detailed information. You should see something similar to this:
Once the cluster has finished building, you should see something similar to this:
Cloudbreak will create logs for each recipe that runs on each host. These logs are located at /var/log/recipe and have the name of the recipe and whether it is pre or post install. For example, our recipe log is called post-tensorflow.log . You can tail this log file to follow the execution of the script.
NOTE: Post install scripts won't be executed until the Ambari server is installed and the cluster is building. You can always monitor the /var/log/recipe directory on a node to see when the script is being executed. The time it takes to run the script will vary depending on the cloud environment and how long it takes to spin up the cluster.
On your cluster, you should be able to see the post-install log:
$ ls /var/log/recipes
post-tensorflow.log post-hdfs-home.log
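If you want to watch a recipe while it is still running, you can tail its log on the node; a minimal example, assuming the same log directory shown in the listing above:
$ tail -f /var/log/recipes/post-tensorflow.log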
Verify Anaconda Install
Once the install process is complete, you should be able to verify that Anaconda is installed. You need to ssh into one of the cloud instances. You can get the public IP address from the Cloudbreak UI. You will log in as the cloudbreak user, using the private key that corresponds to the public key you entered when you created the Cloudbreak credential. You should see something similar to this:
$ ssh -i ~/Downloads/keys/cloudbreak_id_rsa cloudbreak@#.#.#.#
The authenticity of host '#.#.#.# (#.#.#.#)' can't be established.
ECDSA key fingerprint is SHA256:By1MJ2sYGB/ymA8jKBIfam1eRkDS5+DX1THA+gs8sdU.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '#.#.#.#' (ECDSA) to the list of known hosts.
Last login: Sat May 13 00:47:41 2017 from 192.175.27.2
__| __|_ )
_| ( / Amazon Linux AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/
25 package(s) needed for security, out of 61 available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2017.03 is available.
Once you are on the server, you can check the version of Python:
$ /opt/anaconda/bin/python --version
Python 2.7.13 :: Anaconda 4.3.1 (64-bit)
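You can also confirm that the TensorFlow module installed by the recipe imports correctly. A quick check, assuming the /opt/anaconda install path used in the recipe script:
$ /opt/anaconda/bin/python -c "import tensorflow as tf; print(tf.__version__)"
If the wheel installed cleanly, this should print 1.1.0.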
Update Zeppelin Interpreter
We need to update the default spark2 interpreter configuration in Zeppelin. We need to access the Zeppelin UI from Ambari. You can login to Ambari for the new cluster from the Cloudbreak UI cluster details page. Once you login to Ambari, you can access the Zeppelin UI from the Ambari Quicklink. You should see something similar to this:
After you access the Zeppelin UI, click the blue login button in the upper right corner of the interface. You can login using the default username and password of admin . After you login to Zeppelin, click the admin button in the upper right corner of the interface. This will expose the options menu. You should see something similar to this:
Click on the Interpreter link in the menu. This will display all of the configured interpreters. Find the spark2 interpreter. You can see the default setting for zeppelin.pyspark.python is set to python . This will use whichever Python is found in the path. You should see something similar to this:
We will need to change this to /opt/anaconda/bin/python which is where we have Anaconda Python installed. Click on the edit button and change zeppelin.pyspark.python to /opt/anaconda/bin/python . You should see something similar to this:
Now we can click the blue save button at the bottom. The configuration changes are now saved, but we need to restart the interpreter for the changes to take effect. Click on the restart button to restart the interpreter.
Create Zeppelin Notebook
Now that our spark2 interpreter configuration has been updated, we can create a notebook to test Anaconda + TensorFlow. Click on the Notebook menu. You should see something similar to this:
Click on the Create new note link. You can give the notebook any descriptive name you like. Select spark2 as the default interpreter. You should see something similar to this:
Your notebook will start with a blank paragraph. For the first paragraph, let's test the version of Spark we are using. Enter the following in the first paragraph:
%spark2.pyspark
sc.version
Now click the run button for the paragraph. You should see something similar to this:
u'2.1.0.2.6.0.3-8'
As you can see, we are using Spark 2.1. Now in the second paragraph, we'll test the version of Python. We already know the command-line version is 2.7.13. Enter the following in the second paragraph:
%spark2.pyspark
import sys
print sys.version_info
Now click the run button for the paragraph. You should see something similar to this:
sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
As you can see, we are running Python version 2.7.13.
Now we can test TensorFlow. Enter the following in the third paragraph:
%spark2.pyspark
import tensorflow as tf
hello = tf.constant('Hello, TensorFlow!')
sess = tf.Session()
print(sess.run(hello))
a = tf.constant(10)
b = tf.constant(32)
print(sess.run(a + b))
This simple code comes from the TensorFlow website: [TensorFlow](https://www.tensorflow.org/versions/r0.10/get_started/os_setup#anaconda_installation). Now click the run button for the paragraph. You may see some warning messages the first time you run it, but you should also see the following output:
Hello, TensorFlow!
42
As you can see, TensorFlow is working from Zeppelin, which is using Spark 2.1 and Anaconda. If everything works properly, your notebook should look similar to this:
Admittedly this example is very basic, but it demonstrates the components are working together. For next steps, try running other TensorFlow code. Here are some examples you can work with: GitHub.
Review
If you have successfully followed along with this tutorial, you should have deployed an HDP 2.6 cluster in the cloud with Anaconda installed under /opt/anaconda and added the TensorFlow Python modules using a Cloudbreak recipe. You should have created a Zeppelin notebook which uses Anaconda Python, Spark 2.1 and TensorFlow.
05-23-2017
11:16 PM
1 Kudo
This tutorial is part two of a two-part series. In this tutorial, we'll verify Spark 2.1 functionality using Zeppelin on an HDP 2.6 cluster deployed using Cloudbreak. The first tutorial covers using Cloudbreak to deploy the cluster. You can find the first tutorial here: HCC Article
Prerequisites
You should already have completed part one of this tutorial series and have a Cloudbreak-deployed HDP 2.6 cluster with Spark 2.1 running.
Scope
This tutorial was tested in the following environment:
Cloudbreak 1.14.4
AWS EC2
HDP 2.6
Spark 2.1
Zeppelin 0.7
Steps
Login into Ambari
As mentioned in the prerequisites, you should already have a cluster built using Cloudbreak. Click on the cluster summary box in the Cloudbreak UI to display the cluster details. Now click on the link to your Ambari cluster. You may see something similar to this:
Your screen may vary depending on your browser of choice. I'm using Chrome. This warning is because we are using self-signed certificates which are not trusted. Click on the ADVANCED link. You should see something similar to this:
Click on the Proceed link to open the Ambari login screen. You should be able to login to Ambari using the username and password admin .
Login to Zeppelin
Now click on the Zeppelin component in the component status summary. You should see something similar to this:
Click on the Quicklinks link. You should see something similar to this:
Click on the Zeppelin UI link. This will load Zeppelin in a new browser tab. You should see something similar to this:
You should notice the blue Login button in the upper right corner of the Zeppelin UI. Click on this button. You should see something similar to this:
You should be able to login to Zeppelin using the username and password admin . Once you login, you should see something similar to this:
Load Getting Started Notebook
Now let's load the Apache Spark in 5 Minutes notebook by clicking on the Getting Started link. You should see something similar to this:
Click on the Apache Spark in 5 Minutes notebook. You should see something similar to this:
This is showing you the Zeppelin interpreters associated with this notebook. As you can see, the spark2 and livy2 interpreters are enabled. Click the blue Save button. You should see something similar to this:
This notebook defaults to using the Spark 2.x interpreter. You should be able to run the paragraphs without any changes. Scroll down to the notebook paragraph called Verify Spark Version . Click the play button on this paragraph. You should see something similar to this:
You should notice the Spark version is 2.1.0.2.6.0.3-8 . This confirms we are using Spark 2.1. It also confirms that Zeppelin is able to properly interact with Spark 2 on our HDP 2.6 cluster built with Cloudbreak. Try running the next two paragraphs. These paragraphs download a json file from GitHub and then move it to HDFS on our cluster. Now run the Load data into a Spark DataFrame paragraph. You should see something similar to this:
As you can see, the DataFrame should be properly loaded from the json file.
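For reference, the download-and-move paragraphs in the notebook are roughly equivalent to running the following shell commands on the cluster; the URL and file name here are placeholders, not the exact ones used in the notebook:
$ wget https://example.com/path/to/sample.json
$ hdfs dfs -put sample.json /tmp/sample.json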
Next Steps
Try running the remaining paragraphs to ensure everything is working ok. For an extra challenge, try running some of the other Spark 2 notebooks that are included. You can also attempt to modify the Spark 1.6 notebooks to work with Spark 2.1.
Review
If you have successfully followed along with this tutorial, you should have been able to confirm Spark 2.1 works on our HDP 2.6 cluster deployed with Cloudbreak.
05-23-2017
09:41 PM
2 Kudos
This tutorial will walk you through the process of using Cloudbreak to deploy an HDP 2.6 cluster with Spark 2.1. We'll copy and edit the existing hdp-spark-cluster blueprint which deploys Spark 1.6 to create a new blueprint which installs Spark 2.1. This tutorial is part one of a two-part series. The second tutorial walks you through using Zeppelin to verify the Spark 2.1 installation. You can find that tutorial here: HCC Article
Prerequisites
You should already have a Cloudbreak v1.14.0 environment running. You can follow this article to create a Cloudbreak instance using Vagrant and Virtualbox: HCC Article
You should already have updated Cloudbreak to support deploying HDP 2.6 clusters. You can follow this article to enable that functionality: HCC Article
Scope
This tutorial was tested in the following environment:
Cloudbreak 1.14.4
AWS EC2
HDP 2.6
Spark 2.1
Steps
Create Blueprint
Before we can deploy a Spark 2.1 cluster using Cloudbreak, we need to create a blueprint that specifies Spark 2.1. Cloudbreak ships with 3 blueprints out of the box:
hdp-small-default: basic HDP cluster with Hive and HBase
hdp-spark-cluster: basic HDP cluster with Spark 1.6
hdp-streaming-cluster: basic HDP cluster with Kafka and Storm
We will use the hdp-spark-cluster as our base blueprint and edit it to deploy Spark 2.1 instead of Spark 1.6.
Click on the manage blueprints section of the UI. Click on the hdp-spark-cluster blueprint. You should see something similar to this:
Click on the blue copy & edit button. You should see something similar to this:
For the Name , enter hdp26-spark21-cluster . This tells us the blueprint is for an HDP 2.6 cluster using Spark 2.1. Enter the same information for the Description . You should see something similar to this:
Now, we need to edit the JSON portion of the blueprint. We need to change the Spark 1.6 components to Spark 2.1 components. We don't need to change where they are deployed. The following entries within the JSON are for Spark 1.6:
"name": "SPARK_CLIENT"
"name": "SPARK_JOBHISTORYSERVER"
"name": "SPARK_CLIENT"
We will replace SPARK with SPARK2 . These entries should look as follows:
"name": "SPARK2_CLIENT"
"name": "SPARK2_JOBHISTORYSERVER"
"name": "SPARK2_CLIENT"
NOTE: There are two entries for SPARK_CLIENT. Make sure you change both.
We are also going to add entries for the LIVY2_SERVER and SPARK2_THRIFTSERVER components, placing them on the same node as the SPARK2_JOBHISTORYSERVER . Let's add those two entries just below SPARK2_CLIENT in the host_group_master_2 section.
Change the following:
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "SPARK2_CLIENT"
},
to this:
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SPARK2_THRIFTSERVER"
},
{
"name": "LIVY2_SERVER"
},
We also need to update the blueprint_name to hdp26-spark21-cluster and the stack_version to 2.6 . You should have something similar to this:
"Blueprints": {
"blueprint_name": "hdp26-spark21-cluster",
"stack_name": "HDP",
"stack_version": "2.6"
}
If you prefer, you can copy and paste the following blueprint JSON:
{
"host_groups": [
{
"name": "host_group_client_1",
"configurations": [],
"components": [
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "PIG"
},
{
"name": "OOZIE_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HCAT"
},
{
"name": "KNOX_GATEWAY"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "FALCON_CLIENT"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SLIDER"
},
{
"name": "SQOOP"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "HIVE_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "METRICS_COLLECTOR"
},
{
"name": "MAPREDUCE2_CLIENT"
}
],
"cardinality": "1"
},
{
"name": "host_group_master_3",
"configurations": [],
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "APP_TIMELINE_SERVER"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "HBASE_MASTER"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "SECONDARY_NAMENODE"
}
],
"cardinality": "1"
},
{
"name": "host_group_slave_1",
"configurations": [],
"components": [
{
"name": "HBASE_REGIONSERVER"
},
{
"name": "NODEMANAGER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "DATANODE"
}
],
"cardinality": "6"
},
{
"name": "host_group_master_2",
"configurations": [],
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "ZOOKEEPER_CLIENT"
},
{
"name": "PIG"
},
{
"name": "MYSQL_SERVER"
},
{
"name": "HIVE_SERVER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "SPARK2_JOBHISTORYSERVER"
},
{
"name": "SPARK2_CLIENT"
},
{
"name": "SPARK2_THRIFTSERVER"
},
{
"name": "LIVY2_SERVER"
},
{
"name": "TEZ_CLIENT"
},
{
"name": "HBASE_CLIENT"
},
{
"name": "HIVE_METASTORE"
},
{
"name": "ZEPPELIN_MASTER"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "MAPREDUCE2_CLIENT"
},
{
"name": "RESOURCEMANAGER"
},
{
"name": "WEBHCAT_SERVER"
}
],
"cardinality": "1"
},
{
"name": "host_group_master_1",
"configurations": [],
"components": [
{
"name": "ZOOKEEPER_SERVER"
},
{
"name": "HISTORYSERVER"
},
{
"name": "OOZIE_CLIENT"
},
{
"name": "NAMENODE"
},
{
"name": "OOZIE_SERVER"
},
{
"name": "HDFS_CLIENT"
},
{
"name": "YARN_CLIENT"
},
{
"name": "FALCON_SERVER"
},
{
"name": "METRICS_MONITOR"
},
{
"name": "MAPREDUCE2_CLIENT"
}
],
"cardinality": "1"
}
],
"Blueprints": {
"blueprint_name": "hdp26-spark21-cluster",
"stack_name": "HDP",
"stack_version": "2.6"
}
}
Once you have all of the changes in place, click the green create blueprint button.
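If you want to sanity-check the blueprint JSON before pasting it into the form, you can run it through a JSON parser on your workstation. A quick way to do that, assuming Python is installed locally and the blueprint is saved to a hypothetical file named hdp26-spark21-cluster.json:
$ python -m json.tool < hdp26-spark21-cluster.json > /dev/null && echo "JSON is valid"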
Create Security Group
We need to create a new security group to use with our cluster. By default, the existing security groups only allow ports 22, 443, and 9443. As part of this tutorial, we will use Zeppelin to test Spark 2.1. We'll create a new security group that opens all ports to our IP address.
Click on the manage security groups section of the UI. You should see something similar to this:
Click on the green create security group button. You should see something similar to this:
First you need to select the appropriate cloud platform. I'm using AWS, so that is what I selected. We need to provide a unique name for our security group. I used all-ports-my-ip . You should use something descriptive. Provide a helpful description as well. Now we need to enter our personal IP address CIDR. I am using #.#.#.#/32 ; your IP address will obviously be different. You need to enter the port range. There is a known issue in Cloudbreak that prevents you from using 0-65535 , so we'll use 1-65535 . For the protocol, use tcp . Once you have everything entered, you should see something similar to this:
Click the green Add Rule button to add this rule to our security group. You can add multiple rules, but we have everything covered with our single rule. You should see something similar to this:
If everything looks good, click the green create security group button. This will create our new security group. You should see something like this:
Create Cluster
Now that our blueprint has been created and we have a new security group, we can begin building the cluster. Ensure you have selected the appropriate credential for your cloud environment. Then click the green create cluster button. You should see something similar to this:
Give your cluster a descriptive name. I used spark21test , but you can use whatever you like. Select an appropriate cloud region. I'm using AWS and selected US East (N. Virginia) , but you can use whatever you like. You should see something similar to this:
Click on the Setup Network and Security button. You should see something similar to this:
We are going to keep the default options here. Click on the Choose Blueprint button. You should see something similar to this:
Expand the blueprint dropdown menu. You should see the blueprint we created before, hdp26-spark21-cluster . Select the blueprint. You should see something similar to this:
You should notice the new security group is already selected. Cloudbreak did not automatically figure this out; the instance templates and security groups are simply selected alphabetically by default, so verify that the correct security group is chosen.
Now we need to select a node on which to deploy Ambari. I typically deploy Ambari on the master1 server. Check the Ambari check box on one of the master servers. If everything looks good, click on the green create cluster button. You should see something similar to this:
Once the cluster has finished building, you can click on the arrow for the cluster we created to get expanded details. You should see something similar to this:
Verify Versions
Once the cluster is fully deployed, we can verify the versions of the components. Click on the Ambari link on the cluster details page. Once you login to Ambari, you should see something similar to this:
You should notice that Spark2 is shown in the component list. Click on Spark2 in the list. You should see something similar to this:
You should notice that both the Spark2 Thrift Server and the Livy2 Server have been installed. Now let's check the overall cluster versions. Click on the Admin link in the Ambari menu and select Stacks and Versions . Then click on the Versions tab. You should see something similar to this:
As you can see, HDP 2.6.0.3 was deployed.
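If you prefer the command line, the deployed stack version can also be read from the Ambari REST API. A rough example, assuming the default admin/admin credentials and substituting the Ambari address and cluster name shown in your Cloudbreak cluster details (-k skips validation of the self-signed certificate):
$ curl -k -u admin:admin https://<ambari-host>/api/v1/clusters/<cluster-name>/stack_versions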
Review
If you have successfully followed along with this tutorial, you should know how to create a new security group and blueprint. The blueprint allows you to deploy HDP 2.6 with Spark 2.1. The security group allows you to access all ports on the cluster from your IP address. Follow along in part 2 of the tutorial series to use Zeppelin to test Spark 2.1.
05-18-2017
03:00 PM
6 Kudos
Prerequisites
You should already have a Cloudbreak v1.14.4 environment running. You can follow this article to create a Cloudbreak instance using Vagrant and Virtualbox: HCC Article
You should already have credentials created in Cloudbreak for deploying on AWS (or Azure).
Scope
This tutorial was tested in the following environment:
macOS Sierra (version 10.12.4)
Cloudbreak 1.14.4
AWS EC2
NOTE: Cloudbreak 1.14.0 (TP) had a bug which caused HDP 2.6 cluster installs to fail. You should upgrade your Cloudbreak deployer instance to 1.14.4.
Steps
Create application.yml file
UPDATE 05/24/2017: The creation of a custom application.yml file is not required with Cloudbreak 1.14.4. This version of Cloudbreak includes support for HDP 2.5 and HDP 2.6. This step remains for educational purposes for future HDP updates.
You need to create an application.yml file in the etc directory within your Cloudbreak deployment directory. This file will contain the repo information for HDP 2.6. If you followed my tutorial linked above, then your Cloudbreak deployment directory should be /opt/cloudbreak-deployment . If you are using a Cloudbreak instance on AWS or Azure, then your Cloudbreak deployment directory is likely /var/lib/cloudbreak-deployment/ .
Edit your <cloudbreak-deployment>/etc/application.yml file using your favorite editor. Copy and paste the following in the file:
cb:
  ambari:
    repo:
      version: 2.5.0.3-7
      baseurl: http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.5.0.3
      gpgkey: http://public-repo-1.hortonworks.com/ambari/centos6/RPM-GPG-KEY/RPM-GPG-KEY-Jenkins
    database:
      vendor: embedded
      host: localhost
      port: 5432
      name: postgres
      username: ambari
      password: bigdata
  hdp:
    entries:
      2.5:
        version: 2.5.0.1-210
        repoid: HDP-2.5
        repo:
          stack:
            repoid: HDP-2.5
            redhat6: http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.5.5.0
            redhat7: http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.5.5.0
          util:
            repoid: HDP-UTILS-1.1.0.21
            redhat6: http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos6
            redhat7: http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos7
      2.6:
        version: 2.6.0.0-598
        repoid: HDP-2.6
        repo:
          stack:
            repoid: HDP-2.6
            redhat6: http://public-repo-1.hortonworks.com/HDP/centos6/2.x/updates/2.6.0.3
            redhat7: http://public-repo-1.hortonworks.com/HDP/centos7/2.x/updates/2.6.0.3
          util:
            repoid: HDP-UTILS-1.1.0.21
            redhat6: http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos6
            redhat7: http://public-repo-1.hortonworks.com/HDP-UTILS-1.1.0.21/repos/centos7
Start Cloudbreak
Once you have created your application.yml file, you can start Cloudbreak.
$ cbd start
NOTE: It may take a couple of minutes before Cloudbreak is fully running.
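To watch the startup progress, you can follow the Cloudbreak logs with the deployer's log command and wait for the Started CloudbreakApplication message:
$ cbd logs cloudbreak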
Create HDP 2.6 Blueprint
To create an HDP 2.6 cluster, we need to update our blueprint to specify HDP 2.6. On the main Cloudbreak UI, click on manage blueprints . You should see something similar to this:
You should see 3 default blueprints. We are going to use the hdp-small-default blueprint as our base. Click on the hdp-small-default blueprint name. You should see something similar to this:
Now click on the blue copy & edit button. You should see something similar to this:
For the Name , you should enter something unique and descriptive. I suggest hdp26-small-default . For the Description , you can enter the same information. You should see something similar to this:
Now we need to edit the JSON portion of the blueprint. Scroll down to the bottom of the JSON. You should see something similar to this:
Now edit the blueprint_name value to be hdp26-small-default and edit the stack_version to be 2.6 . You should see something similar to this:
Now click on the green create blueprint button. You should see the new blueprint visible in the list of blueprints.
Create HDP 2.6 Small Default Cluster
Now that our blueprint has been created, we can create a cluster and select this blueprint to install HDP 2.6. Select the appropriate credential for your Cloud environment. Click on the create cluster button. You should see something similar to this:
Provide a unique, but descriptive Cluster Name . Ensure you select an appropriate Region . I chose hdp26test as my cluster name and I'm using the US East region:
Now advance to the next step by clicking on Setup Network and Security . You should see something similar to this:
We don't need to make any changes here, so click on the Choose Blueprint button. You should see something similar to this:
In the Blueprint dropdown, you should see the blueprint we created. Select the hdp26-small-default blueprint. You should see something similar to this:
You need to select which node Ambari will run on. I typically select the master1 node. You should see something similar to this:
Now you can click on the Review and Launch button. You should see something similar to this:
Verify the information presented. If everything looks good, click on the create and start cluster button. Once the cluster build process has started, you should see something similar to this:
Verify HDP Version
Once the cluster has finished building, you can click on the cluster in the Cloudbreak UI. You should see something similar to this:
Click on the Ambari link to load Ambari. Login using the default username and password of admin . Now click on the Admin link in the menu. You should see something similar to this:
Click on the Stack and Versions link. You should see something similar to this:
You should notice that HDP 2.6.0.3 has been deployed.
Review
If you have successfully followed along with this tutorial, you should know how to create or update the etc/application.yml file in your Cloudbreak deployment directory to add specific Ambari and HDP repositories. You should have successfully created an updated blueprint and deployed HDP 2.6 on your cloud of choice.
05-13-2017
01:57 AM
5 Kudos
Objectives
This tutorial will walk you through the process of using Cloudbreak recipes to install Anaconda on your HDP cluster during cluster provisioning. This process can be used to automate many tasks on the cluster, both pre-install and post-install.
Prerequisites
You should already have a Cloudbreak v1.14.0 environment running. You can follow this article to create a Cloudbreak instance using Vagrant and Virtualbox: HCC Article
You should already have credentials created in Cloudbreak for deploying on AWS (or Azure).
Scope
This tutorial was tested in the following environment:
macOS Sierra (version 10.12.4)
Cloudbreak 1.14.0 TP
AWS EC2
Anaconda 2.7.13
Steps
Create Recipe
Before you can use a recipe during a cluster deployment, you have to create the recipe. In the Cloudbreak UI, look for the "manage recipes" section. It should look similar to this:
If this is your first time creating a recipe, you will have 0 recipes instead of the 2 recipes shown in my interface.
Now click on the arrow next to manage recipes to display available recipes. You should see something similar to this:
Now click on the green create recipe button. You should see something similar to this:
Now we can enter the information for our recipe. I'm calling this recipe anaconda and giving it the description Install Anaconda . You can choose to run the script as either pre-install or post-install. I'm choosing post-install, which means the script will run after the Ambari installation process has started, so choose the Execution Type of POST . Choose Script so we can copy and paste the shell script; you can also specify a file to upload or a URL (a gist, for example). Our script is very basic: we download the Anaconda install script, then run it in silent mode. Here is the script:
#!/bin/bash
# Download the Anaconda installer
wget https://repo.continuum.io/archive/Anaconda2-4.3.1-Linux-x86_64.sh
# Run the installer in silent (batch) mode and install to /opt/anaconda
bash ./Anaconda2-4.3.1-Linux-x86_64.sh -b -p /opt/anaconda
When you have finished entering all of the information, you should see something similar to this:
If everything looks good, click on the green create recipe button.
After the recipe has been created, you should see something similar to this:
Create a Cluster using a Recipe
Now that our recipe has been created, we can create a cluster that uses the recipe. Go through the process of creating a cluster up to the Choose Blueprint step. This step is where you select the recipe you want to use. The recipes are not selected by default; you have to select the recipes you wish to use. You can specify recipes for 1 or more host groups. This allows you to run different recipes across different host groups (masters, slaves, etc). You can also select multiple recipes.
We want to use the hdp-small-default blueprint. This will create a basic HDP cluster.
If you select the anaconda recipe, you should see something similar to this:
In our case, we are going to run the recipe on every host group. If you intend to use something like Anaconda across the cluster, you should install it on at least the slave nodes and the client nodes.
After you have selected the recipe for the host groups, click the Review & Launch button, then launch the cluster. As the cluster is building, you should see a message in the Cloudbreak UI that indicates the recipe is running. When that happens, you will see something similar to this:
Cloudbreak will create logs for each recipe that runs on each host. These logs are located at /var/log/recipe and have the name of the recipe and whether it is pre or post install. For example, our recipe log is called post-anaconda.log . You can tail this log file to follow the execution of the script.
NOTE: Post install scripts won't be executed until the Ambari server is installed and the cluster is building. You can always monitor the /var/log/recipe directory on a node to see when the script is being executed. The time it takes to run the script will vary depending on the cloud environment and how long it takes to spin up the cluster.
On your cluster, you should be able to see the post-install log:
$ ls /var/log/recipes
post-anaconda.log post-hdfs-home.log
Once the install process is complete, you should be able to verify that Anaconda is installed. You need to ssh into one of the cloud instances. You can get the public IP address from the Cloudbreak UI. You will log in as the cloudbreak user, using the private key that corresponds to the public key you entered when you created the Cloudbreak credential. You should see something similar to this:
$ ssh -i ~/Downloads/keys/cloudbreak_id_rsa cloudbreak@#.#.#.#
The authenticity of host '#.#.#.# (#.#.#.#)' can't be established.
ECDSA key fingerprint is SHA256:By1MJ2sYGB/ymA8jKBIfam1eRkDS5+DX1THA+gs8sdU.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '#.#.#.#' (ECDSA) to the list of known hosts.
Last login: Sat May 13 00:47:41 2017 from 192.175.27.2
__| __|_ )
_| ( / Amazon Linux AMI
___|\___|___|
https://aws.amazon.com/amazon-linux-ami/2016.09-release-notes/
25 package(s) needed for security, out of 61 available
Run "sudo yum update" to apply all updates.
Amazon Linux version 2017.03 is available.
Once you are on the server, you can check the version of Python:
$ /opt/anaconda/bin/python --version
Python 2.7.13 :: Anaconda 4.3.1 (64-bit)
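You can also check the conda tool itself and list the packages Anaconda installed; for example, assuming the same /opt/anaconda prefix:
$ /opt/anaconda/bin/conda --version
$ /opt/anaconda/bin/conda list | head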
Review
If you have successfully followed along with this tutorial, you should know how to create pre and post install scripts. You should have successfully deployed a cluster on either AWS or Azure with Anaconda installed under /opt/anaconda on the nodes you specified.
05-12-2017
10:17 PM
17 Kudos
Note: A newer version of this article is available here: https://community.hortonworks.com/articles/194076/using-vagrant-and-virtualbox-to-create-a-local-ins.html
Objectives
This tutorial is designed to walk you through the process of using Vagrant and Virtualbox to create a local instance of Cloudbreak. This will allow you to start your Cloudbreak deployer when you want to spin up an HDP cluster on the cloud without incurring costs associated with hosting your Cloudbreak instance on the cloud itself.
Prerequisites
You should already have installed VirtualBox 5.1.x. Read more here: VirtualBox
You should already have installed Vagrant 1.9.x. Read more here: Vagrant
You should already have installed the vagrant-vbguest plugin. This plugin will keep the VirtualBox Guest Additions software current as you upgrade your kernel and/or VirtualBox versions. Read more here: vagrant-vbguest
You should already have installed the vagrant-hostmanager plugin. This plugin will automatically manage the /etc/hosts file on your local computer and in your virtual machines. Read more here: vagrant-hostmanager
Scope
This tutorial was tested in the following environment:
macOS Sierra (version 10.12.4)
VirtualBox 5.1.22
Vagrant 1.9.4
vagrant-vbguest plugin 0.14.1
vagrant-hostmanager plugin 1.8.6
Cloudbreak 1.14.0 TP
Steps
Setup Vagrant
Create Vagrant project directory
Before we get started, determine where you want to keep your Vagrant project files. Each Vagrant project should have its own directory. I keep my Vagrant projects in my ~/Development/Vagrant directory. You should also use a helpful name for each Vagrant project directory you create.
$ cd ~/Development/Vagrant
$ mkdir centos7-cloudbreak
$ cd centos7-cloudbreak
We will be using a CentOS 7.3 Vagrant box, so I include centos7 in the Vagrant project name to differentiate it from a CentOS 6 project. The project is for cloudbreak, so I include that in the name.
Create Vagrantfile
The Vagrantfile tells Vagrant how to configure your virtual machines. You can copy/paste my Vagrantfile below:
# -*- mode: ruby -*-
# vi: set ft=ruby :
# Using yaml to load external configuration files
require 'yaml'
Vagrant.configure("2") do |config|
# Using the hostmanager vagrant plugin to update the host files
config.hostmanager.enabled = true
config.hostmanager.manage_host = true
config.hostmanager.manage_guest = true
config.hostmanager.ignore_private_ip = false
# Loading in the list of commands that should be run when the VM is provisioned.
commands = YAML.load_file('commands.yaml')
commands.each do |command|
config.vm.provision :shell, inline: command
end
# Loading in the VM configuration information
servers = YAML.load_file('servers.yaml')
servers.each do |servers|
config.vm.define servers["name"] do |srv|
srv.vm.box = servers["box"] # Specify the name of the Vagrant box to use
srv.vm.hostname = servers["name"] # Set the hostname of the VM
srv.vm.network "private_network", ip: servers["ip"], adapter: 2 # Add a second adapter with a specified IP
srv.vm.provision :shell, inline: "sed -i'' '/^127.0.0.1\t#{srv.vm.hostname}\t#{srv.vm.hostname}$/d' /etc/hosts" # Remove the extraneous first entry in /etc/hosts
srv.vm.provider :virtualbox do |vb|
vb.name = servers["name"] # Name of the VM in VirtualBox
vb.cpus = servers["cpus"] # How many CPUs to allocate to the VM
vb.memory = servers["ram"] # How much memory to allocate to the VM
end
end
end
end
Create a servers.yaml file
The servers.yaml file contains the configuration information for our VMs. Here is the content from my file:
---
- name: cloudbreak
box: bento/centos-7.3
cpus: 2
ram: 4096
ip: 192.168.56.100
NOTE: You may need to modify the IP address to avoid conflicts with your local network.
Create commands.yaml file
The commands.yaml file contains the list of commands that should be run on each VM when they are first provisioned. This allows us to automate configuration tasks that would otherwise be tedious and/or repetitive. Here is the content from my file:
- "sudo yum -y update"
- "sudo yum -y install net-tools ntp wget lsof unzip tar iptables-services"
- "sudo systemctl enable ntpd && sudo systemctl start ntpd"
- "sudo systemctl disable firewalld && sudo systemctl stop firewalld"
- "sudo iptables --flush INPUT && sudo iptables --flush FORWARD && sudo service iptables save"
- "sudo sed -i --follow-symlinks 's/^SELINUX=.*/SELINUX=disabled/g' /etc/sysconfig/selinux"
Start Virtual Machines
Once you have created the 3 files in your Vagrant project directory, you are ready to start your VM. Creating the VM for the first time and starting it every time after that uses the same command:
$ vagrant up
You should notice Vagrant automatically updating the packages on the VM.
Once the process is complete, you should have 1 server running. You can verify by looking at the Virtualbox UI, where you should see the cloudbreak VM running. You should see something similar to this:
Connect to the virtual machine
You are able to login to the VM via ssh using the vagrant ssh command.
$ vagrant ssh
[vagrant@cloudbreak ~]$
Install Cloudbreak
Most of the Cloudbreak installation is covered well in the docs:
Cloudbreak Install Docs. However, the first couple of steps in the docs have you install a few packages, change iptables settings, etc. That part of the install is actually handled by the Vagrant provisioning step, so you can skip those steps. You should be able to start at the Docker Service section of the docs.
We need to be root for most of this, so we'll use sudo.
sudo -i
Create Docker Repo
We need to add a repo so we can install Docker.
cat > /etc/yum.repos.d/docker.repo <<"EOF"
[dockerrepo]
name=Docker Repository
baseurl=https://yum.dockerproject.org/repo/main/centos/7
enabled=1
gpgcheck=1
gpgkey=https://yum.dockerproject.org/gpg
EOF
Install Docker Service
Now we need to install Docker and enable the service.
yum install -y docker-engine-1.9.1 docker-engine-selinux-1.9.1
systemctl start docker
systemctl enable docker
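Before moving on, it's worth confirming that the Docker daemon is installed and running; a quick check:
docker --version
systemctl status docker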
Install Cloudbreak Deployer
Now we can install Cloudbreak itself.
yum -y install unzip tar
curl -Ls s3.amazonaws.com/public-repo-1.hortonworks.com/HDP/cloudbreak/cloudbreak-deployer_1.14.0_$(uname)_x86_64.tgz | sudo tar -xz -C /bin cbd
Once the Cloudbreak Deployer is installed, you can check the version of the install software.
cbd --version
You should see something similar to this:
[root@cloudbreak cloudbreak-deployment]# cbd --version
Cloudbreak Deployer: 1.14.0
NOTE: Notice that we are installing version 1.14.0. You may want to consider installing the latest version, which is 1.16.1 as of August 2017.
Create Cloudbreak Profile
You should make a Cloudbreak application directory. This is where the Cloudbreak configuration files and logs will be located.
cd /opt
mkdir cloudbreak-deployment
cd cloudbreak-deployment
Now you need to set up the Profile file. This file contains environment variables that determine how Cloudbreak runs. Edit Profile using your editor of choice.
I recommend the following settings for your profile:
export UAA_DEFAULT_SECRET='[SECRET]'
export UAA_DEFAULT_USER_EMAIL='<myemail>'
export UAA_DEFAULT_USER_PW='<mypassword>'
export PUBLIC_IP=192.168.56.100
export CLOUDBREAK_SMTP_SENDER_USERNAME='<myemail>'
export CLOUDBREAK_SMTP_SENDER_PASSWORD='<mypassword>'
export CLOUDBREAK_SMTP_SENDER_HOST='smtp.gmail.com'
export CLOUDBREAK_SMTP_SENDER_PORT=25
export CLOUDBREAK_SMTP_SENDER_FROM='<myemail>'
export CLOUDBREAK_SMTP_AUTH=true
export CLOUDBREAK_SMTP_STARTTLS_ENABLE=true
export CLOUDBREAK_SMTP_TYPE=smtp
You should set the UAA_DEFAULT_USER_EMAIL variable to the email address you want to use. This is the account you will use to login to Cloudbreak. You should set the UAA_DEFAULT_USER_PW variable to the password you want to use. This is the password you will use to login to Cloudbreak.
You should set the CLOUDBREAK_SMTP_SENDER_USERNAME variable to the username you use to authenticate to your SMTP server. You should set the CLOUDBREAK_SMTP_SENDER_PASSWORD variable to the password you use to authenticate to your SMTP server.
NOTE: The SMTP variables are how you enable Cloudbreak to send you an email when the cluster operations are done. This is optional and is only required if you want to use the checkbox to get emails when you build a cluster. The example above assumes you are using GMail. You should use the settings appropriate for your SMTP server.
Initialize Cloudbreak Configuration
Now that you have a profile, you can initialize your Cloudbreak configuration files.
cbd generate
You should see something similar to this:
[root@cloudbreak cloudbreak-deployment]# cbd generate
* Dependency required, installing sed latest ...
* Dependency required, installing jq latest ...
* Dependency required, installing docker-compose 1.9.0 ...
* Dependency required, installing aws latest ...
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
03310923a82b: Pulling fs layer
6fc6c6aca926: Pulling fs layer
6fc6c6aca926: Verifying Checksum
6fc6c6aca926: Download complete
03310923a82b: Verifying Checksum
03310923a82b: Download complete
03310923a82b: Pull complete
6fc6c6aca926: Pull complete
Digest: sha256:7875e46eb14555e893e7c23a7f90a0d2396f6b56c8c3dcf68f9ed14879b8966c
Status: Downloaded newer image for alpine:latest
Generating Cloudbreak client certificate and private key in /opt/cloudbreak-deployment/certs.
generating docker-compose.yml
generating uaa.yml
[root@cloudbreak cloudbreak-deployment]#
Start Cloudbreak Deployer
You should be able to start the Cloudbreak Deployer application. This process will first pull down the Docker images used by Cloudbreak.
cbd pull
cbd start
You should notice a bunch of images being pulled down:
[root@cloudbreak cloudbreak-deployment]# cbd start
generating docker-compose.yml
generating uaa.yml
Pulling haveged (hortonworks/haveged:1.1.0)...
1.1.0: Pulling from hortonworks/haveged
ca26f34d4b27: Pull complete
bf22b160fa79: Pull complete
d30591ea011f: Pull complete
22615e74c8e4: Pull complete
ceb5854e0233: Pull complete
Digest: sha256:09f8cf4f89b59fe2b391747181469965ad27cd751dad0efa0ad1c89450455626
Status: Downloaded newer image for hortonworks/haveged:1.1.0
Pulling uluwatu (hortonworks/cloudbreak-web:1.14.0)...
1.14.0: Pulling from hortonworks/cloudbreak-web
16e32a1a6529: Pull complete
8e153fce9343: Pull complete
6af1e6403bfe: Pull complete
075e3418c7e0: Pull complete
9d8191b4be57: Pull complete
38e38dfe826c: Pull complete
d5d08e4bc6be: Pull complete
955b472e3e42: Pull complete
02e1b573b380: Pull complete
Digest: sha256:06ceb74789aa8a78b9dfe92872c45e045d7638cdc274ed9b0cdf00b74d118fa2
...
Creating cbreak_periscope_1
Creating cbreak_logsink_1
Creating cbreak_identity_1
Creating cbreak_uluwatu_1
Creating cbreak_haveged_1
Creating cbreak_consul_1
Creating cbreak_mail_1
Creating cbreak_pcdb_1
Creating cbreak_uaadb_1
Creating cbreak_cbdb_1
Creating cbreak_sultans_1
Creating cbreak_registrator_1
Creating cbreak_logspout_1
Creating cbreak_cloudbreak_1
Creating cbreak_traefik_1
Uluwatu (Cloudbreak UI) url:
https://192.168.56.100
login email:
<myemail>
password:
****
creating config file for hdc cli: /root/.hdc/config
The start command will output the IP address and the login username, which are based on what we set up in the Profile.
Check Cloudbreak Logs
You can always look at the Cloudbreak logs in /opt/cloudbreak-deployment/cbreak.log. You can also use the cbd logs cloudbreak command to view logs in real time. Cloudbreak is ready to use when you see a message similar to Started CloudbreakApplication in 64.156 seconds (JVM running for 72.52).
Login to Cloudbreak
Cloudbreak should now be running. We can login to the UI using the IP address specified in the Profile. In our case that is https://192.168.56.100. Notice Cloudbreak uses https.
You should see a login screen similar to this:
At this point you should be able to see the Cloudbreak UI screen where you can manage your credentials, blueprints, etc. This tutorial doesn't cover setting up credentials or deploying a cluster. Before you can deploy a cluster you need to set up a platform and credentials. See these links for setting up your credentials:
AWS: Cloudbreak AWS Credentials
Azure: Cloudbreak Azure Credentials
Stopping Cloudbreak
When you are ready to shut down Cloudbreak, the process is simple. First you need to stop the Cloudbreak deployer:
$ cbd kill
You should see something similar to this:
[root@cloudbreak cloudbreak-deployment]# cbd kill
Stopping cbreak_traefik_1 ... done
Stopping cbreak_cloudbreak_1 ... done
Stopping cbreak_logspout_1 ... done
Stopping cbreak_registrator_1 ... done
Stopping cbreak_sultans_1 ... done
Stopping cbreak_uaadb_1 ... done
Stopping cbreak_cbdb_1 ... done
Stopping cbreak_pcdb_1 ... done
Stopping cbreak_mail_1 ... done
Stopping cbreak_haveged_1 ... done
Stopping cbreak_consul_1 ... done
Stopping cbreak_uluwatu_1 ... done
Stopping cbreak_identity_1 ... done
Stopping cbreak_logsink_1 ... done
Stopping cbreak_periscope_1 ... done
Going to remove cbreak_traefik_1, cbreak_cloudbreak_1, cbreak_logspout_1, cbreak_registrator_1, cbreak_sultans_1, cbreak_uaadb_1, cbreak_cbdb_1, cbreak_pcdb_1, cbreak_mail_1, cbreak_haveged_1, cbreak_consul_1, cbreak_uluwatu_1, cbreak_identity_1, cbreak_logsink_1, cbreak_periscope_1
Removing cbreak_traefik_1 ... done
Removing cbreak_cloudbreak_1 ... done
Removing cbreak_logspout_1 ... done
Removing cbreak_registrator_1 ... done
Removing cbreak_sultans_1 ... done
Removing cbreak_uaadb_1 ... done
Removing cbreak_cbdb_1 ... done
Removing cbreak_pcdb_1 ... done
Removing cbreak_mail_1 ... done
Removing cbreak_haveged_1 ... done
Removing cbreak_consul_1 ... done
Removing cbreak_uluwatu_1 ... done
Removing cbreak_identity_1 ... done
Removing cbreak_logsink_1 ... done
Removing cbreak_periscope_1 ... done
[root@cloudbreak cloudbreak-deployment]#
Now exit the Vagrant box:
[root@cloudbreak cloudbreak-deployment]# exit
logout
[vagrant@cloudbreak ~]$ exit
logout
Connection to 127.0.0.1 closed.
Now we can shut down the Vagrant box:
$ vagrant halt
==> cbtest: Attempting graceful shutdown of VM...
Starting Cloudbreak
To startup Cloudbreak, the process is the opposite of stopping it. First you need to start the Vagrant box:
$ vagrant up
Once the Vagrant box is up, you need to ssh in to the box:
$ vagrant ssh
You need to be root:
$ sudo -i
Now start Cloudbreak:
$ cd /opt/cloudbreak-deployment
$ cbd start
You should see something similar to this:
[root@cloudbreak cloudbreak-deployment]# cbd start
generating docker-compose.yml
generating uaa.yml
Creating cbreak_consul_1
Creating cbreak_periscope_1
Creating cbreak_sultans_1
Creating cbreak_uluwatu_1
Creating cbreak_identity_1
Creating cbreak_uaadb_1
Creating cbreak_pcdb_1
Creating cbreak_mail_1
Creating cbreak_haveged_1
Creating cbreak_logsink_1
Creating cbreak_cbdb_1
Creating cbreak_logspout_1
Creating cbreak_registrator_1
Creating cbreak_cloudbreak_1
Creating cbreak_traefik_1
Uluwatu (Cloudbreak UI) url:
https://192.168.56.100
login email:
<myemail>
password:
****
creating config file for hdc cli: /root/.hdc/config
[root@cloudbreak cloudbreak-deployment]#
It takes a minute or two for the Cloudbreak application to fully start up. Now you can login to the Cloudbreak UI.
Review
If you have successfully followed along with this tutorial, you should now have a Vagrant box you can spin up via vagrant up, start Cloudbreak via cbd start, and then create your clusters on the cloud.
04-28-2017
02:58 PM
@Raphaël MARY Yes, more than likely. You can read more about Twitter TLS here: https://dev.twitter.com/overview/api/tls
04-24-2017
07:58 PM
2 Kudos
@Stefan Schuster The Sandbox is set up to assume that you have "sandbox.hortonworks.com" in your local computer's hosts file, so all of the links will typically use "sandbox.hortonworks.com". If you don't update your local hosts file, you will fail to connect. Are you on Windows, Mac or Linux? That will determine the appropriate approach. Mac and Linux hosts files are usually /etc/hosts. I'm using the Docker Sandbox and I'm on a Mac. My /etc/hosts file looks like this: 127.0.0.1 localhost sandbox.hortonworks.com sandbox
04-24-2017
07:47 PM
6 Kudos
Objective
This tutorial will walk you through the process of using the PyHive Python module from Dropbox to query HiveServer2. You can read more about PyHive here: PyHive
Prerequisites
You should already have Python 2.7 installed.
You should already have a version of the Hortonworks Sandbox 2.5 setup.
Scope
This tutorial was tested using the following environment and components:
Mac OS X 10.12.3
Anaconda 4.3.1 (Python 2.7.13)
Hortonworks HDP Sandbox 2.5
PyHive 0.1.5
Steps
Install PyHive and Dependencies
Before we can query Hive using Python, we have to install the PyHive module and associated dependencies. Because I'm using Anaconda, I chose to use the conda command to install PyHive. Because the PyHive package is provided by a third-party channel, Blaze, you must specify -c blaze on the command line. You can read more about Blaze PyHive for Anaconda here: Blaze PyHive
We need to install PyHive using the following command:
$ conda install -c blaze pyhive
You will be doing this installation on your local computer. You should see something similar to the following:
$ conda install -c blaze pyhive
Fetching package metadata ...........
Solving package specifications: .
Package plan for installation in environment /Users/myoung/anaconda:
The following NEW packages will be INSTALLED:
pyhive: 0.1.5-py27_0 blaze
sasl: 0.1.3-py27_0 blaze
thrift: 0.9.2-py27_0 blaze
Proceed ([y]/n)? y
thrift-0.9.2-p 100% |#####################################################################################################################################| Time: 0:00:00 3.07 MB/s
sasl-0.1.3-py2 100% |#####################################################################################################################################| Time: 0:00:00 15.18 MB/s
pyhive-0.1.5-p 100% |#####################################################################################################################################| Time: 0:00:00  10.92 MB/s
As you can see, PyHive is dependent on the SASL and Thrift modules. Both of these modules were installed.
Create Python Script
Now that our local computer has the PyHive module installed, we can create a very simple Python script which will query Hive. Edit a file called pyhive-test.py . You can do this anywhere you like, but I prefer to create a directory under ~/Development for this.
$ mkdir ~/Development/pyhive
$ cd ~/Development/pyhive
Now copy and paste the following text into your file. You can use any text editor you like. I usually use Microsoft Visual Studio Code or Atom.
from pyhive import hive

# Connect to HiveServer2 on the Sandbox and grab a cursor
cursor = hive.connect('sandbox.hortonworks.com').cursor()
# Run a simple query against the sample_07 table and print all returned rows
cursor.execute('SELECT * FROM sample_07 LIMIT 50')
print cursor.fetchall()
The sample_07 table is already on the Sandbox, so this query should work without any problems.
Start Hortonworks HDP Sandbox
Before we can run our Python script, we have to make sure the Sandbox is started. Go ahead and do that now.
Run Python Script
Now that the Sandbox is running, we can run our script to execute the query.
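Before running the script, you can optionally confirm that HiveServer2 is reachable from your computer. A minimal check, assuming HiveServer2 is listening on its default port of 10000 and that nc is installed locally:
$ nc -z sandbox.hortonworks.com 10000 && echo "HiveServer2 is reachable"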
$ python pyhive-test.py
You should see something similar to the following:
$ python pyhive-test.py
[[u'00-0000', u'All Occupations', 134354250, 40690], [u'11-0000', u'Management occupations', 6003930, 96150], [u'11-1011', u'Chief executives', 299160, 151370], [u'11-1021', u'General and operations managers', 1655410, 103780], [u'11-1031', u'Legislators', 61110, 33880], [u'11-2011', u'Advertising and promotions managers', 36300, 91100], [u'11-2021', u'Marketing managers', 165240, 113400], [u'11-2022', u'Sales managers', 322170, 106790], [u'11-2031', u'Public relations managers', 47210, 97170], [u'11-3011', u'Administrative services managers', 239360, 76370], [u'11-3021', u'Computer and information systems managers', 264990, 113880], [u'11-3031', u'Financial managers', 484390, 106200], [u'11-3041', u'Compensation and benefits managers', 41780, 88400], [u'11-3042', u'Training and development managers', 28170, 90300], [u'11-3049', u'Human resources managers, all other', 58100, 99810], [u'11-3051', u'Industrial production managers', 152870, 87550], [u'11-3061', u'Purchasing managers', 65600, 90430], [u'11-3071', u'Transportation, storage, and distribution managers', 92790, 81980], [u'11-9011', u'Farm, ranch, and other agricultural managers', 3480, 61030], [u'11-9012', u'Farmers and ranchers', 340, 42480], [u'11-9021', u'Construction managers', 216120, 85830], [u'11-9031', u'Education administrators, preschool and child care center/program', 47980, 44430], [u'11-9032', u'Education administrators, elementary and secondary school', 218820, 82120], [u'11-9033', u'Education administrators, postsecondary', 101160, 85870], [u'11-9039', u'Education administrators, all other', 28640, 74230], [u'11-9041', u'Engineering managers', 184410, 115610], [u'11-9051', u'Food service managers', 191460, 48660], [u'11-9061', u'Funeral directors', 24020, 57660], [u'11-9071', u'Gaming managers', 3740, 69600], [u'11-9081', u'Lodging managers', 31890, 51140], [u'11-9111', u'Medical and health services managers', 242640, 84980], [u'11-9121', u'Natural sciences managers', 39370, 113170], [u'11-9131', u'Postmasters and mail superintendents', 26500, 57850], [u'11-9141', u'Property, real estate, and community association managers', 159660, 53530], [u'11-9151', u'Social and community service managers', 112330, 59070], [u'11-9199', u'Managers, all other', 356690, 91990], [u'13-0000', u'Business and financial operations occupations', 6015500, 62410], [u'13-1011', u'Agents and business managers of artists, performers, and athletes', 11680, 82730], [u'13-1021', u'Purchasing agents and buyers, farm products', 12930, 53980], [u'13-1022', u'Wholesale and retail buyers, except farm products', 132550, 53580], [u'13-1023', u'Purchasing agents, except wholesale, retail, and farm products', 281950, 56060], [u'13-1031', u'Claims adjusters, examiners, and investigators', 279400, 55470], [u'13-1032', u'Insurance appraisers, auto damage', 12150, 52020], [u'13-1041', u'Compliance officers, except agriculture, construction, health and safety, and transportation', 231910, 52740], [u'13-1051', u'Cost estimators', 219070, 58640], [u'13-1061', u'Emergency management specialists', 11610, 51470], [u'13-1071', u'Employment, recruitment, and placement specialists', 193620, 52710], [u'13-1072', u'Compensation, benefits, and job analysis specialists', 109870, 55740], [u'13-1073', u'Training and development specialists', 202820, 53040], [u'13-1079', u'Human resources, training, and labor relations specialists, all other', 211770, 56740]]
Review
As you can see, using Python to query Hive is fairly straightforward.
We were able to install the required Python modules in a single command, create a quick Python script, and run the script to get 50 records from the sample_07 table in Hive.
04-21-2017
08:14 PM
@Raphaël MARY Which endpoint are you using for the processor? You should use the sample or filter endpoint. I don't believe you can use the firehose endpoint unless you pay Twitter for access.