Created on 01-07-2016 02:23 AM
Google’s Cloud Platform provides the infrastructure to perform MapReduce data analysis using open source software such as Hadoop with Hive and Pig. Google's Compute Engine provides the compute power and Cloud Storage is used to store the input and output of the MapReduce jobs.
HDP deployment using CloudBreak.
Before we deploy HDP on GCE, we need to set up accounts for GCE and Cloudbreak.
Sign up for a free trial account at https://cloud.google.com/free-trial/
Step 1) Log in to your Google dashboard and click Create a project. For example, I created a project called hadoop-install.
Step 2) Create credentials.
Click Create new Client ID and then choose Service account.
Click Okay, got it and a JSON key is downloaded (we won’t be using this file). After the download, the same window shows the Client ID, Email address and Certificate fingerprints, along with an option to Generate new P12 key. Generate the P12 key; the rest of this walkthrough refers to that file as hadoop.p12.
Step 3) Enable API
A default set of APIs is already enabled when you log in.
Search for google compute and click Google Compute Engine. Then click the Enable API option.
These are the APIs that I have enabled.
For the HDP deployment, you need the project ID, the service account email address and the P12 key file.
The GCE setup is complete, so let’s move on to the Cloudbreak setup.
Sign up for a Cloudbreak account.
Once you are logged in to the Cloudbreak UI, set up the GCP credentials.
You will need the project ID and the following details from the Credentials tab.
My Cloudbreak UI looks like the following. We will create credentials, a template and a blueprint for the HDP deployment. This is a one-time process.
Under manage credentials, choose GCP.
Name – Credential name
Description – As you like
Project ID – hadoop-install (get this value from google dashboard)
Service Account Email Address – Credentials tab in google dashboard “Email address” under Service account
Service Account Key – Upload the P12 key file that you renamed to hadoop.p12
SSH public key – Mac users can copy the content of id_rsa.pub. Windows users need to get this from PuTTY (Google search – putty public ssh keys)
The next step is to manage resources (create a template).
Name – Template name
Description – As you like
Instance-Type – You can choose as per your requirement (I chose n1-standard-2 for this test)
Volume Type – Magnetic/SSD
Attached volumes per instance – 1 for this test
Volume Size – 100GB (Increase this value as per your requirement)
You can download the blueprint from here. Copy the content and paste it into the create blueprint window.
I am saving the blueprint as hivegoogle. If you receive a blueprint error while creating the blueprint in Cloudbreak, you can use jsonvalidate to validate and format the blueprint.
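If you prefer to check the blueprint locally before pasting it, the sketch below does a plain JSON syntax check. It swaps in Python's standard json.tool module for the online validator, and it assumes python3 is on the PATH and that you saved the blueprint as hivegoogle.json (a hypothetical file name). It only confirms the file parses as JSON; it does not check that Cloudbreak will accept the blueprint.

```shell
# Hypothetical helper: exit status 0 when the file parses as JSON,
# non-zero otherwise. Assumes python3 is installed.
validate_blueprint() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "no such file: $file" >&2
    return 2
  fi
  # json.tool prints the parse error and exits non-zero on malformed JSON
  python3 -m json.tool "$file" > /dev/null
}

# Example (file name is an assumption):
# validate_blueprint hivegoogle.json && echo "blueprint is valid JSON"
```

The function stays silent on success, so it can be chained with && in front of whatever you do next with the file.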
Select your credentials
Click create cluster
Clustername: Name your cluster
Region: Choose region to deploy the cluster
Network: Choose network
Blueprint: Choose blueprint created, hivegoogle
cbgateway, master and slave – I am using minviable-gcp, but you can choose the template as per your own choice.
Click “create and start cluster”
You can see the progress in the Event history.
Final snapshot of the cluster looks like this:
Verify the Google Cloud related settings and provide the project ID and the Google Cloud service account email. You can find both in the Google dashboard.
Verify tez.aux.uris and make sure the GCS connector is copied to that location. The copy process is covered in the environment setup section below.
Let’s set up the environment before running HDFS and Hive commands.
We need hadoop.p12 and the GCS connector on all the nodes.
First, copy hadoop.p12 from the local machine to the VM instance (the external IP is listed in the Google dashboard under VM Instances).
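As a rough sketch of the settings being verified, the helper below prints the core-site properties that the 1.x GCS connector reads for service account authentication. The property names are written from my recollection of the connector documentation, so treat them as assumptions and check them against the connector version you deploy. The function name, the sample email address and the /tmp/hadoop.p12 key path are placeholders, not values taken from the cluster.

```shell
# Hypothetical helper: print key=value pairs for the custom core-site section.
# Property names are assumed from the GCS connector 1.x documentation;
# verify them for your connector version.
gcs_core_site_properties() {
  local project_id="$1" email="$2" keyfile="$3"
  cat <<EOF
fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
fs.gs.project.id=${project_id}
google.cloud.auth.service.account.enable=true
google.cloud.auth.service.account.email=${email}
google.cloud.auth.service.account.keyfile=${keyfile}
EOF
}

# Example (email address and key path are placeholders):
# gcs_core_site_properties hadoop-install svc-account@example.com /tmp/hadoop.p12
```

Printing the values first makes it easy to compare them against what the cluster manager shows before changing anything.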
HW11326:.ssh nsabharwal$ scp ~/Downloads/hadoop.p12 email@example.com:/tmp
hadoop.p12 100% 2572 2.5KB/s 00:00
Log in to the VM instance
HW11326:.ssh nsabharwal$ ssh location
[hdfs@hdpgcp-1-1435537523061 ~]$ wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
--2015-06-28 21:05:59-- https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
Resolving storage.googleapis.com... 184.108.40.206, 2607:f8b0:4001:c01::80
Connecting to storage.googleapis.com|220.127.116.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2494559 (2.4M) [application/java-archive]
Saving to: `gcs-connector-latest-hadoop2.jar'
100%[============================================================================================================================================================>] 2,494,559 7.30M/s in 0.3s
2015-06-28 21:05:59 (7.30 MB/s) - `gcs-connector-latest-hadoop2.jar' saved [2494559/2494559]
Copy the connector to the HDFS location
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put gcs-connector-latest-hadoop2.jar /apps/tez/aux-jars/
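Since hadoop.p12 and the connector jar are needed on every node, the per-node scp can be wrapped in a loop. This is a minimal sketch, not the exact procedure used for this cluster: the function name, the DRY_RUN switch and the host names are hypothetical, and it assumes both files are in the current directory and that /tmp is the target, as in the scp command earlier.

```shell
# Files every node needs, per this walkthrough.
FILES="hadoop.p12 gcs-connector-latest-hadoop2.jar"

# Hypothetical helper: copy FILES to /tmp on each host listed after the SSH user.
# Set DRY_RUN=1 to print the scp commands instead of running them.
copy_to_nodes() {
  local user="$1"; shift
  local host file
  for host in "$@"; do
    for file in $FILES; do
      if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "scp ${file} ${user}@${host}:/tmp/"
      else
        scp "${file}" "${user}@${host}:/tmp/" || return 1
      fi
    done
  done
}

# Example (host names are placeholders):
# DRY_RUN=1 copy_to_nodes hdfs node1.example.com node2.example.com
```

Running it once with DRY_RUN=1 shows exactly which copies will happen before any file leaves your machine.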
Let’s create a storage bucket called hivetest in Google Cloud Storage.
Log in to your Google Compute Engine account and click Storage.
We need to copy the connector into the hadoop-client lib directory; otherwise you will hit the error “Google FileSystem not found”.
cp gcs-connector-latest-hadoop2.jar /usr/hdp/current/hadoop-client/lib/
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/
15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'
Found 3 items
drwx------ - hdfs hdfs 0 2015-06-28 15:29 gs://hivetest/ns
drwx------ - hdfs hdfs 0 2015-06-28 12:44 gs://hivetest/test
drwx------ - hdfs hdfs 0 2015-06-28 15:30 gs://hivetest/tmp
bash-4.1# su - hive
[hive@hdpgcptest-1-1435590069329 ~]$ hive
hive> create table testns ( info string) location 'gs://hivetest/testns';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found)
To avoid the above error, we have to copy the GCS connector to the hive-client lib directory on all the nodes
cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib
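The two cp commands used so far can be folded into one helper that installs the jar into every client lib directory you pass in. This is a sketch under the assumption that the jar sits in /tmp, as in the command above; the function name is hypothetical. Directories that do not exist on a node are skipped with a message rather than created.

```shell
# Hypothetical helper: copy the connector jar into each existing lib directory.
# Returns non-zero if any copy fails; missing directories are only reported.
install_connector() {
  local jar="$1"; shift
  local dir status=0
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      if cp "$jar" "$dir/"; then
        echo "installed: $dir"
      else
        status=1
      fi
    else
      echo "skipped (no such directory): $dir" >&2
    fi
  done
  return "$status"
}

# Example, using the two directories from this walkthrough:
# install_connector /tmp/gcs-connector-latest-hadoop2.jar \
#   /usr/hdp/current/hadoop-client/lib /usr/hdp/current/hive-client/lib
```

Run it on each node, since the jar has to be present locally wherever a client is started.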
Let’s run the following Apache Hive test.
We are writing to gs://hivetest.
hive> create table batting (col_value STRING) location 'gs://hivetest/batting';
Time taken: 1.518 seconds
Run the following command to verify the location, 'gs://hivetest/batting'
hive> show create table batting;
CREATE TABLE `batting`(
ROW FORMAT SERDE
STORED AS INPUTFORMAT
Time taken: 0.981 seconds, Fetched: 12 row(s)
hive> select count(1) from batting;
hive> drop table batting;
You will notice that Batting.csv is deleted from the bucket, because batting was a managed (internal) table and dropping a managed table also deletes its data.
In the case of an external table, Batting.csv won’t be removed from the storage bucket.
If you want to test MapReduce using Hive:
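To try the external table behaviour, the sketch below generates the DDL for an external table over the same bucket path. The table name batting_ext, the function name and the /tmp/batting_ext.hql path are hypothetical; the column definition mirrors the batting table above. Dropping an external table removes only the metastore entry, so the files under the location are left in place.

```shell
# Hypothetical helper: print HiveQL that creates an external table
# with a single STRING column at the given location.
external_table_ddl() {
  local table="$1" location="$2"
  cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS ${table} (col_value STRING)
LOCATION '${location}';
EOF
}

# Example (table name and script path are assumptions):
# external_table_ddl batting_ext gs://hivetest/batting > /tmp/batting_ext.hql
# hive -f /tmp/batting_ext.hql
```

Writing the statement to a file first lets you review it before handing it to hive -f.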
hive> add jar /usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar;
Added [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] to class path
Added resources: [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar]
hive> select count(1) from batting;
Query ID = hive_20150702095454_c17ae70f-b77e-4599-87e6-022d9bb9a00d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
In order to limit the maximum number of reducers:
In order to set a constant number of reducers:
Starting Job = job_1435841827745_0003, Tracking URL = http://hdpgcptest-1-1435590069329.node.dc1.consul:8088/proxy/application_1435841827745_0003/
Kill Command = /usr/hdp/18.104.22.168-2800/hadoop/bin/hadoop job -kill job_1435841827745_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-07-02 09:54:33,468 Stage-1 map = 0%, reduce = 0%
2015-07-02 09:54:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.2 sec
2015-07-02 09:54:51,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.6 sec
MapReduce Total cumulative CPU time: 4 seconds 600 msec
Ended Job = job_1435841827745_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.6 sec HDFS Read: 187 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 600 msec
Time taken: 29.855 seconds, Fetched: 1 row(s)
First, copy the GCS connector to spark-historyserver to avoid “Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found”
I am following this article for the Spark test
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@140dcdc5
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS batting ( col_value STRING) location 'gs://hivetest/batting' ")
scala> sqlContext.sql("select count(*) from batting").collect().foreach(println)
15/07/01 15:38:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 187 bytes
15/07/01 15:38:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 286 ms on hdpgcptest-2-1435590069361.node.dc1.consul (1/1)
15/07/01 15:38:42 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/01 15:38:42 INFO DAGScheduler: Stage 1 (collect at SparkPlan.scala:84) finished in 0.295 s
15/07/01 15:38:42 INFO DAGScheduler: Job 0 finished: collect at SparkPlan.scala:84, took 8.872396 s