Created on 01-07-2016 02:23 AM
Google Cloud Platform provides the infrastructure to perform MapReduce data analysis using open source software such as Hadoop with Hive and Pig. Google Compute Engine provides the compute power, and Cloud Storage stores the input and output of the MapReduce jobs.
HDP deployment using Cloudbreak
Before we deploy HDP on GCE, we need to set up accounts in GCE and Cloudbreak.
Sign up for a free trial account at https://cloud.google.com/free-trial/
Step 1) Log in to your Google dashboard and then click Create a project. For example, I created a project called hadoop-install.
Step 2) Create credentials.
Click Create new Client ID and then choose Service account.
Click Okay, got it and it will download a JSON key (we won’t be using this file). After the JSON key is downloaded, you will see the Client ID, Email address and Certificate fingerprints in the same window, along with an option to Generate new P12 key.
Step 3) Enable API
Default APIs when you log in
Search for google compute and click Google Compute Engine, then click the Enable API option.
These are the APIs that I have enabled:
For the HDP deployment, you will need the Project ID, the service account Email address and the P12 key file.
The GCE setup is complete, so let’s move on to the Cloudbreak setup.
Sign up for a Cloudbreak account:
https://accounts.sequenceiq.com/
Login URL:
https://cloudbreak.sequenceiq.com/
Once you are logged into the Cloudbreak UI, set up the GCP credentials.
You will need the Project ID and the following detail from the Credentials tab:
Email address
My Cloudbreak UI looks like the following. We will be creating a credential, a template and a blueprint for the HDP deployment; this is a one-time process.
Credentials:
Under manage credentials, choose GCP.
Name – Credential name
Description – As you like
Project ID – hadoop-install (get this value from the Google dashboard)
Service Account Email Address – the “Email address” shown under Service account on the Credentials tab of the Google dashboard
Service Account Key – upload the P12 key file that you renamed to hadoop.p12
SSH public key – Mac users can copy the contents of ~/.ssh/id_rsa.pub; Windows users need to generate this with PuTTY (search for “putty public ssh keys”). See the sketch below.
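If you do not already have a key pair, here is a minimal sketch for Mac/Linux (assumes OpenSSH is installed; PuTTYgen is the Windows equivalent):
ssh-keygen -t rsa -b 2048 -f ~/.ssh/id_rsa -N ""   # skip if ~/.ssh/id_rsa already exists
cat ~/.ssh/id_rsa.pub                              # paste this output into the SSH public key field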
Template:
The next step is to manage resources (create a template):
Name – Template name
Description – As you like
Instance-Type – You can choose as per your requirement (I chose n1-standard-2 for this test)
Volume Type – Magnetic/SSD
Attached volumes per instance – 1 for this test
Volume Size – 100GB (Increase this value as per your requirement)
Blueprint
You can download the blueprint from here. Copy the content and paste it into the create blueprint window.
I am saving the blueprint as hivegoogle. In case you receive a blueprint error while creating the blueprint in Cloudbreak, you can use a JSON validator to validate/format the blueprint.
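A quick local sanity check (assumes Python is installed and that you saved the blueprint as hivegoogle.json; adjust the filename to yours):
python -m json.tool hivegoogle.json > /dev/null && echo "blueprint JSON is valid"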
Cluster Deployment
Select your credentials
Click create cluster
Clustername: Name your cluster
Region: Choose region to deploy the cluster
Network: Choose network
Blueprint: Choose blueprint created, hivegoogle
Hostgroup configuration:
cbgateway, master and slave – I am using minviable-gcp, but you can choose templates of your own.
Click “create and start cluster”
You can see the progress in the Event history.
Final snapshot of the cluster looks like this:
Verify the Google Cloud related settings and provide the project ID & the Google Cloud service account email. You can find these details in the Google dashboard.
Verify tez.aux.uris and make sure the GCS connector is copied to that location. I have covered the copy process in the environment setup section below.
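Once the cluster is up, a rough way to read these values back from the command line (the property names below are the usual gcs-connector ones; confirm them against your connector version):
hdfs getconf -confKey fs.gs.project.id
hdfs getconf -confKey google.cloud.auth.service.account.email
hdfs getconf -confKey google.cloud.auth.service.account.keyfile   # should point to the uploaded hadoop.p12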
Let’s set up the environment before running HDFS and Hive commands.
We need hadoop.p12 and the GCS connector on all the nodes.
First, copy hadoop.p12 from the localhost to the VM instance (the external IP can be found in the Google dashboard under VM Instances):
HW11326:.ssh nsabharwal$ scp ~/Downloads/hadoop.p12 cloudbreak@130.211.184.135:/tmp
hadoop.p12 100% 2572 2.5KB/s 00:00
HW11326:.ssh nsabharwal$
Log in to the VM instance:
HW11326:.ssh nsabharwal$ ssh location
[hdfs@hdpgcp-1-1435537523061 ~]$ wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
--2015-06-28 21:05:59-- https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar
Resolving storage.googleapis.com... 74.125.201.128, 2607:f8b0:4001:c01::80
Connecting to storage.googleapis.com|74.125.201.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2494559 (2.4M) [application/java-archive]
Saving to: `gcs-connector-latest-hadoop2.jar'
100%[============================================================================================================================================================>] 2,494,559 7.30M/s in 0.3s
2015-06-28 21:05:59 (7.30 MB/s) - `gcs-connector-latest-hadoop2.jar' saved [2494559/2494559]
Copy the connector to the HDFS location:
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put gcs-connector-latest-hadoop2.jar /apps/tez/aux-jars/
[hdfs@hdpgcp-1-1435537523061 ~]$
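A quick way to confirm the jar landed where tez.aux.uris expects it (the path follows the command above):
hdfs dfs -ls /apps/tez/aux-jars/gcs-connector-latest-hadoop2.jar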
Let’s create a storage bucket called hivetest in Google Cloud Storage.
Log in to your Google Compute Engine account and click Storage.
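If you prefer the command line over the console, the same bucket can be created with gsutil (assumes the Google Cloud SDK is installed and authenticated):
gsutil mb gs://hivetest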
HDFS test
We need to copy the connector into the hadoop-client location, otherwise you will hit the error “Google FileSystem not found”:
cp gcs-connector-latest-hadoop2.jar /usr/hdp/current/hadoop-client/lib/
[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/
15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2
15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'
Found 3 items
drwx------ - hdfs hdfs 0 2015-06-28 15:29 gs://hivetest/ns
drwx------ - hdfs hdfs 0 2015-06-28 12:44 gs://hivetest/test
drwx------ - hdfs hdfs 0 2015-06-28 15:30 gs://hivetest/tmp
[hdfs@hdpgcp-1-1435537523061 ~]$
Hive test
bash-4.1# su - hive
[hive@hdpgcptest-1-1435590069329 ~]$ hive
hive> create table testns ( info string) location 'gs://hivetest/testns';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found)
hive>
To avoid the above error, we have to copy the GCS connector to all the nodes under hive-client:
cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib
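A sketch for pushing the jar out to every node from the node where it was downloaded (the host names here are placeholders; substitute your own node list):
for host in node2 node3; do
  scp /tmp/gcs-connector-latest-hadoop2.jar cloudbreak@"$host":/tmp/
  ssh cloudbreak@"$host" "sudo cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib/"
done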
Let’s run the following Apache Hive test.
Data Set: http://seanlahman.com/files/database/lahman591-csv.zip
We are writing to gs://hivetest
hive> create table batting (col_value STRING) location 'gs://hivetest/batting';
OK
Time taken: 1.518 seconds
Run the following command to verify the location, 'gs://hivetest/batting'
hive> show create table batting;
OK
CREATE TABLE `batting`(
`col_value` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'gs://hivetest/batting'
TBLPROPERTIES (
'transient_lastDdlTime'='1435766262')
Time taken: 0.981 seconds, Fetched: 12 row(s)
hive> select count(1) from batting;
Upload Batting.csv to the table location, gs://hivetest/batting.
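One way to do the upload (a sketch; assumes Batting.csv from the zip above has already been extracted):
hdfs dfs -put Batting.csv gs://hivetest/batting/   # from a cluster node, via the GCS connector
gsutil cp Batting.csv gs://hivetest/batting/       # or from any machine with the Cloud SDK installed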
hive> drop table batting;
You will notice that Batting.csv is deleted from the storage bucket, because it was a managed table.
In the case of an external table, Batting.csv would not be removed from the storage bucket.
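For comparison, a sketch of the external variant (the table name batting_ext is just an example):
hive> create external table batting_ext (col_value STRING) location 'gs://hivetest/batting';
hive> drop table batting_ext;
Dropping batting_ext removes only the table metadata; the files under gs://hivetest/batting stay in the bucket.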
In case you want to test MapReduce using Hive:
hive> add jar /usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar;
Added [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] to class path
Added resources: [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar]
hive> select count(1) from batting;
Query ID = hive_20150702095454_c17ae70f-b77e-4599-87e6-022d9bb9a00d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1435841827745_0003, Tracking URL = http://hdpgcptest-1-1435590069329.node.dc1.consul:8088/proxy/application_1435841827745_0003/
Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435841827745_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2015-07-02 09:54:33,468 Stage-1 map = 0%, reduce = 0%
2015-07-02 09:54:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.2 sec
2015-07-02 09:54:51,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.6 sec
MapReduce Total cumulative CPU time: 4 seconds 600 msec
Ended Job = job_1435841827745_0003
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.6 sec HDFS Read: 187 HDFS Write: 6 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 600 msec
OK
95196
Time taken: 29.855 seconds, Fetched: 1 row(s)
hive>
Spark SQL
First, copy the GCS connector into the spark-historyserver lib directory to avoid “Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found”, then export the classpath:
export SPARK_CLASSPATH=/usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar
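For reference, the full sequence might look like this (a sketch: it assumes the jar is still under /tmp, and yarn-client mode matches the logs below):
cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/spark-historyserver/lib/
export SPARK_CLASSPATH=/usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar
spark-shell --master yarn-client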
I am following this article for the Spark test.
scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@140dcdc5
scala>
scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS batting ( col_value STRING) location 'gs://hivetest/batting' ")
scala> sqlContext.sql("select count(*) from batting").collect().foreach(println)
15/07/01 15:38:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 187 bytes
15/07/01 15:38:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 286 ms on hdpgcptest-2-1435590069361.node.dc1.consul (1/1)
15/07/01 15:38:42 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/07/01 15:38:42 INFO DAGScheduler: Stage 1 (collect at SparkPlan.scala:84) finished in 0.295 s
[95196]15/07/01 15:38:42 INFO DAGScheduler: Job 0 finished: collect at SparkPlan.scala:84, took 8.872396 s
Created on 03-14-2018 03:57 PM
Very nice, complete article. It worked as described. However, I ran into a small issue: Hive is throwing a "Permission denied" error for GCS data. Here is the full trace. I was able to do "ls" on GCS successfully. The only difference I see from your trace is that the owner on GCS shows as "hdfs" for you and "root" for me. Any suggestions, please? I tried making the file in the bucket a public link as well.
hive> !hdfs dfs -ls gs://bmasna-csv-test/ ;
18/03/14 15:50:39 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.7.0-hadoop2
18/03/14 15:50:40 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://bmasna-csv-test/'
Found 1 items
drwx------ - root root 0 2018-03-13 17:23 gs://bmasna-csv-test/data1
hive> CREATE EXTERNAL TABLE IF NOT EXISTS EMP(
    > EmployeeID INT, FirstName STRING, Title STRING,
    > State STRING)
    > COMMENT 'Employee'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY ','
    > STORED AS TEXTFILE
    > LOCATION 'gs://bmasna-csv-test/data1/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException: Permission denied: user=root, path="gs://bmasna-csv-test/data1":atxhive:atxhive:drwx------)
Created on 05-28-2018 02:44 PM
Hi Neeraj,
We are trying to test the GCS connector with HDP 2.6.5 ( https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/authentication... ) with GCS as our storage.
When trying to create a Hive external table, it fails with the following error:
Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException: Permission denied: user=hdpuser1, path="gs://hive_metastore/":hive:hive:drwx------) (state=08S01,code=1)
Syntax: CREATE EXTERNAL TABLE test1256(name string,id int) LOCATION 'gs://hive_metastore/' ;