
Hive and Google Cloud Storage

Google Cloud Platform provides the infrastructure to perform MapReduce data analysis using open source software such as Hadoop with Hive and Pig. Google Compute Engine provides the compute power, and Cloud Storage is used to store the input and output of the MapReduce jobs.

HDP deployment using Cloudbreak

Before we deploy HDP on GCE, we need to set up accounts with GCE and Cloudbreak.

Sign up for a free trial account at https://cloud.google.com/free-trial/

Step 1) Log in to your Google Cloud dashboard and click Create a project. For example, I created a project called hadoop-install.

Step 2) Create credentials.

Click Create new Client ID and then choose Service account.

Click Okay, got it and it will download a JSON key (we won’t be using this file). After downloading the JSON key, you will see the Client ID, Email address, and Certificate fingerprints in the same window, along with an option to Generate new P12 key.

Step 3) Enable API

These are the default APIs when you log in.

Search for google compute and click Google Compute Engine. You will see an Enable API option that you need to click.

These are the APIs that I have enabled:

For the HDP deployment, you will need the project ID, the service account email address, and the P12 key file.

The GCE setup is complete, so let’s move on to the Cloudbreak setup.

Sign up for a Cloudbreak account:

https://accounts.sequenceiq.com/

Login URL:

https://cloudbreak.sequenceiq.com/

Once you are logged in to the Cloudbreak UI, set up the GCP credentials.

You will need the project ID and the service account email address from the Credentials tab.

My Cloudbreak UI looks like the following. We will be creating the credential, template, and blueprint for the HDP deployment; this is a one-time process.

Credentials:

Under manage credentials, choose GCP.

Name – Credential name

Description – As you like

Project ID – hadoop-install (get this value from google dashboard)

Service Account Email Address – the “Email address” shown under Service account on the Credentials tab of the Google dashboard

Service Account Key – Upload the P12 key file that you renamed to hadoop.p12

SSH public key – Mac users can copy the contents of id_rsa.pub. Windows users need to generate this with PuTTY (search for “putty public ssh keys”).
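If you do not have a key pair yet, here is a minimal sketch for Mac/Linux users, assuming the default ~/.ssh/id_rsa path (skip the first command if the key already exists):

ssh-keygen -t rsa -f ~/.ssh/id_rsa

cat ~/.ssh/id_rsa.pub

Paste the output of the cat command into the SSH public key field.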

Template:

The next step is to manage resources (create a template).

Name – Template name

Description – As you like

Instance-Type – You can choose as per your requirement (I chose n1-standard-2 for this test)

Volume Type – Magnetic/SSD

Attached volumes per instance – 1 for this test

Volume Size – 100GB (Increase this value as per your requirement)

Blueprint

You can download the blueprint from here. Copy the content and paste it into the create blueprint window.

I am saving the blueprint as hivegoogle. If you receive a blueprint error while creating the blueprint in Cloudbreak, you can use a JSON validator to validate and format the blueprint.
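For example, one way to check the JSON locally before pasting it, assuming you saved the blueprint as hivegoogle.json (the file name is just an example):

python -m json.tool hivegoogle.json

If the JSON is malformed, the command reports the line and column of the problem; otherwise it pretty-prints the blueprint.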

Cluster Deployment

Select your credentials

Click create cluster

Clustername: Name your cluster

Region: Choose region to deploy the cluster

Network: Choose network

Blueprint: Choose blueprint created, hivegoogle

Hostgroup configuration:

cbgateway, master and slave – I am using minviable-gcp, but you can choose the template of your own choice.

Click “create and start cluster”

You can see the progress in the Event history.

Final snapshot of the cluster looks like this:

Verify the Google Cloud related settings and provide the project ID and the Google Cloud service account email. You can find these details in the Google dashboard.

Verify tez.aux.uris and make sure the GCS connector is copied to that location. The copy process is covered in the environment setup section below.
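As a quick sanity check on any node, you can read these values back from the client configuration. This is only a sketch, assuming the standard GCS connector property names; the exact names can vary with the connector version and how Ambari exposes them:

hdfs getconf -confKey fs.gs.project.id

hdfs getconf -confKey google.cloud.auth.service.account.email

hdfs getconf -confKey google.cloud.auth.service.account.keyfile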

Let’s set up the environment before running HDFS and Hive commands.

We need hadoop.p12 and the GCS connector on all the nodes.

  • Copy hadoop.p12 to the path defined in the Ambari parameter google.cloud.auth.service.account.keyfile.
  • You can upload hadoop.p12 to Dropbox and wget it, or copy it from your localhost.
  • Copy hadoop.p12 from the local machine to the VM instance. Cloudbreak uses Docker containers to deploy the cluster, so we need to copy the file from the local desktop to the VM instance and then into the container (a sketch of the container step follows the SSH login below).

First, copy from the localhost to the VM instance (the external IP can be found in the Google dashboard under VM instances):

HW11326:.ssh nsabharwal$ scp ~/Downloads/hadoop.p12 cloudbreak@130.211.184.135:/tmp

hadoop.p12 100% 2572 2.5KB/s 00:00

HW11326:.ssh nsabharwal$

Log in to the VM instance:

HW11326:.ssh nsabharwal$ ssh cloudbreak@130.211.184.135
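Once on the VM, the key file still has to be copied into the container where the Hadoop clients run. A rough sketch, assuming a recent enough Docker that docker cp can copy into a running container; the container name ambari-agent and the target path are placeholders, so pick the container from docker ps and use the path you configured in google.cloud.auth.service.account.keyfile:

sudo docker ps

sudo docker cp /tmp/hadoop.p12 ambari-agent:/usr/lib/hadoop/hadoop.p12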

[hdfs@hdpgcp-1-1435537523061 ~]$ wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar

--2015-06-28 21:05:59-- https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar

Resolving storage.googleapis.com... 74.125.201.128, 2607:f8b0:4001:c01::80

Connecting to storage.googleapis.com|74.125.201.128|:443... connected.

HTTP request sent, awaiting response... 200 OK

Length: 2494559 (2.4M) [application/java-archive]

Saving to: `gcs-connector-latest-hadoop2.jar'

100%[============================================================================================================================================================>] 2,494,559 7.30M/s in 0.3s

2015-06-28 21:05:59 (7.30 MB/s) - `gcs-connector-latest-hadoop2.jar' saved [2494559/2494559]

Copy the connector to the HDFS location referenced by tez.aux.uris:

[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -put gcs-connector-latest-hadoop2.jar /apps/tez/aux-jars/

[hdfs@hdpgcp-1-1435537523061 ~]$
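You can confirm the jar landed in the right place with a quick listing:

hdfs dfs -ls /apps/tez/aux-jars/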

Let’s create a storage bucket called hivetest in Google Cloud Storage.

Log in to your Google Cloud account and click Storage.
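Alternatively, if you have the Google Cloud SDK installed and authenticated against the hadoop-install project, the bucket can be created from the command line (a sketch):

gsutil mb gs://hivetest/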

HDFS test

We need to copy the connector into the hadoop-client lib directory; otherwise you will hit the error “Google FileSystem not found”.

cp gcs-connector-latest-hadoop2.jar /usr/hdp/current/hadoop-client/lib/

[hdfs@hdpgcp-1-1435537523061 ~]$ hdfs dfs -ls gs://hivetest/

15/06/28 21:15:32 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.4.0-hadoop2

15/06/28 21:15:33 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://hivetest/'

Found 3 items

drwx------ - hdfs hdfs 0 2015-06-28 15:29 gs://hivetest/ns

drwx------ - hdfs hdfs 0 2015-06-28 12:44 gs://hivetest/test

drwx------ - hdfs hdfs 0 2015-06-28 15:30 gs://hivetest/tmp

[hdfs@hdpgcp-1-1435537523061 ~]$

Hive test

bash-4.1# su - hive

[hive@hdpgcptest-1-1435590069329 ~]$ hive

hive> create table testns ( info string) location 'gs://hivetest/testns';

FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found)

hive>

To avoid the above error, we have to copy the GCS connector to all the nodes under hive-client/lib:

cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib
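If you have more than a couple of nodes, one way to push the jar everywhere is a small loop from a host that can SSH to all of them. This is only a sketch: the host names are placeholders for your own node list, it assumes the jar sits in /tmp on the machine you run it from, and it assumes the cloudbreak user can sudo without a password:

for host in node1 node2 node3; do
  scp /tmp/gcs-connector-latest-hadoop2.jar cloudbreak@$host:/tmp/
  ssh cloudbreak@$host 'sudo cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/hive-client/lib/'
done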

Let’s run the following Apache Hive test.

Data Set: http://seanlahman.com/files/database/lahman591-csv.zip

We are writing to gs://hivetest

hive> create table batting (col_value STRING) location 'gs://hivetest/batting';

OK

Time taken: 1.518 seconds

Run the following command to verify the location, 'gs://hivetest/batting'

hive> show create table batting;

OK

CREATE TABLE `batting`(

`col_value` string)

ROW FORMAT SERDE

'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT

'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

'gs://hivetest/batting'

TBLPROPERTIES (

'transient_lastDdlTime'='1435766262')

Time taken: 0.981 seconds, Fetched: 12 row(s)

hive> select count(1) from batting;

Upload Batting.csv into the table location gs://hivetest/batting (one way to do this is sketched below).
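A sketch of one way to do the upload, run from a node that has the GCS connector configured (the unzip target directory is arbitrary, and the CSV layout inside the zip is assumed to be flat):

wget http://seanlahman.com/files/database/lahman591-csv.zip

unzip lahman591-csv.zip -d lahman591

hdfs dfs -put lahman591/Batting.csv gs://hivetest/batting/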

hive> drop table batting;

You will notice that Batting.csv is deleted from the storage bucket, as it was a managed (internal) table.

In the case of an external table, Batting.csv would not be removed from the storage bucket.
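For example, the same table defined as external (a sketch; the table name batting_ext is just an example):

hive> create external table batting_ext (col_value STRING) location 'gs://hivetest/batting';

Dropping batting_ext would remove only the table metadata; the files under gs://hivetest/batting would stay in the bucket.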

In case you want to test MR using Hive

hive> add jar /usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar;

Added [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar] to class path

Added resources: [/usr/hdp/current/hive-client/lib/gcs-connector-latest-hadoop2.jar]

hive> select count(1) from batting;

Query ID = hive_20150702095454_c17ae70f-b77e-4599-87e6-022d9bb9a00d

Total jobs = 1

Launching Job 1 out of 1

Number of reduce tasks determined at compile time: 1

In order to change the average load for a reducer (in bytes):

set hive.exec.reducers.bytes.per.reducer=<number>

In order to limit the maximum number of reducers:

set hive.exec.reducers.max=<number>

In order to set a constant number of reducers:

set mapreduce.job.reduces=<number>

Starting Job = job_1435841827745_0003, Tracking URL = http://hdpgcptest-1-1435590069329.node.dc1.consul:8088/proxy/application_1435841827745_0003/

Kill Command = /usr/hdp/2.2.6.0-2800/hadoop/bin/hadoop job -kill job_1435841827745_0003

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1

2015-07-02 09:54:33,468 Stage-1 map = 0%, reduce = 0%

2015-07-02 09:54:42,947 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.2 sec

2015-07-02 09:54:51,719 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.6 sec

MapReduce Total cumulative CPU time: 4 seconds 600 msec

Ended Job = job_1435841827745_0003

MapReduce Jobs Launched:

Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.6 sec HDFS Read: 187 HDFS Write: 6 SUCCESS

Total MapReduce CPU Time Spent: 4 seconds 600 msec

OK

95196

Time taken: 29.855 seconds, Fetched: 1 row(s)

hive>

Spark SQL

First, copy the GCS connector to the spark-historyserver lib directory and set the Spark classpath to avoid “Caused by: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found”.
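A sketch of the copy step, assuming the connector jar is still in /tmp on the node (use whatever path you downloaded it to):

cp /tmp/gcs-connector-latest-hadoop2.jar /usr/hdp/current/spark-historyserver/lib/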

export SPARK_CLASSPATH=/usr/hdp/current/spark-historyserver/lib/gcs-connector-latest-hadoop2.jar

I am following this article for the Spark test.

scala> val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

sqlContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@140dcdc5

scala>

scala> sqlContext.sql("CREATE TABLE IF NOT EXISTS batting ( col_value STRING) location 'gs://hivetest/batting' ")

scala> sqlContext.sql("select count(*) from batting").collect().foreach(println)

15/07/01 15:38:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 187 bytes

15/07/01 15:38:42 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 2) in 286 ms on hdpgcptest-2-1435590069361.node.dc1.consul (1/1)

15/07/01 15:38:42 INFO YarnClientClusterScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool

15/07/01 15:38:42 INFO DAGScheduler: Stage 1 (collect at SparkPlan.scala:84) finished in 0.295 s

[95196]15/07/01 15:38:42 INFO DAGScheduler: Job 0 finished: collect at SparkPlan.scala:84, took 8.872396 s

Comments
Contributor

@Neeraj Sabharwal

Very nice, complete article. It worked as described. However, I ran into a small issue: Hive is throwing a "Permission denied" error for the GCS data. Here is the full trace. I was able to do "ls" on GCS successfully. The only difference I see from your trace is that the owner for GCS shows as "hdfs" for you and "root" for me. Any suggestions, please? I tried making the file in the bucket a public link as well.

hive> !hdfs dfs -ls gs://bmasna-csv-test/ ;
18/03/14 15:50:39 INFO gcs.GoogleHadoopFileSystemBase: GHFS version: 1.7.0-hadoop2
18/03/14 15:50:40 WARN gcs.GoogleHadoopFileSystemBase: No working directory configured, using default: 'gs://bmasna-csv-test/'
Found 1 items
drwx------ - root root 0 2018-03-13 17:23 gs://bmasna-csv-test/data1

hive> CREATE EXTERNAL TABLE IF NOT EXISTS EMP(
> EmployeeID INT, FirstName STRING, Title STRING,
> State STRING)
> COMMENT 'Employee'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION 'gs://bmasna-csv-test/data1/';
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException: Permission denied: user=root, path="gs://bmasna-csv-test/data1":atxhive:atxhive:drwx------)

Contributor

Hi Neeraj,

We are trying to test the GCS connector with HDP 2.6.5 ( https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/authentication... ), with GCS as the storage.

When trying to create a Hive external table, it fails with the following error:

Error: Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.security.AccessControlException: Permission denied: user=hdpuser1, path="gs://hive_metastore/":hive:hive:drwx------) (state=08S01,code=1)

Syntax: CREATE EXTERNAL TABLE test1256(name string,id int) LOCATION 'gs://hive_metastore/' ;