Installing HDFS Google Cloud Connector

Rising Star

Hi,

 

I'm looking to ingest a large amount of data from a public Google Cloud Storage bucket, but our cluster is currently missing the Google Cloud Storage connector. I thought copying the connector jar to /opt/cloudera/parcels/CDH/lib/hadoop would be sufficient, but after running the command below I get the following error: No FileSystem for scheme: gs.

 

hdfs dfs -cp gs://gnomad-public/release-170228/gnomad.genomes.r2.0.1.sites.vds /my/local/hdfs/filesystem

 

Are any additional steps beyond this necessary?
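
One quick diagnostic (a minimal sketch; fs.gs.impl is the property name the connector uses for the gs scheme) is to ask the HDFS client what, if anything, is configured for it:

# prints the configured implementation class for gs:, or reports the key as missing
hdfs getconf -confKey fs.gs.impl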

1 ACCEPTED SOLUTION

Champion
It took a bit to track it down, but here is the info you are looking for. There are really just two steps: download the jar and put it on the Hadoop classpath, as you have already done, and then add the gs settings to core-site.xml so that HDFS is aware of it as an additional filesystem (i.e. knows the scheme and the implementation class names).

https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/INSTALL.md
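
On a CDH parcel layout, the download-and-place step might look roughly like the sketch below (the URL is the commonly used "latest" Hadoop 2 link and may have changed, so check the guide above for the current one; the paths are the CDH parcel directories mentioned elsewhere in this thread):

# download the Hadoop 2 build of the GCS connector
# (URL/version are examples; see the INSTALL.md above for the current link)
wget https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-latest-hadoop2.jar

# keep the jar alongside the other connector jars, then symlink it onto the Hadoop classpath
cp gcs-connector-latest-hadoop2.jar /opt/cloudera/parcels/CDH/jars/
ln -s /opt/cloudera/parcels/CDH/jars/gcs-connector-latest-hadoop2.jar /opt/cloudera/parcels/CDH/lib/hadoop/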


6 REPLIES


Rising Star

Appreciate the quick response! I was actually following the guide you posted originally. In particular, I modified core-site.xml located at /etc/hadoop/conf as shown below. However, when running hdfs dfs -ls gs://gnomad-public (a public GCS bucket), I get the following ClassNotFoundException: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

 

It appears that this issue is tied to the connector jar not being part of the classpath, which is odd, as I created symbolic links to it in both /opt/cloudera/parcels/CDH/lib/hadoop and /opt/cloudera/parcels/CDH/lib/hadoop-mapreduce. (The file is actually located in /opt/cloudera/parcels/CDH/jars - this is where all of the default connector jars were initially placed, and I figured following the default placement would be best practice.)
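
One quick way to double-check what the client actually picks up (a minimal sketch; --glob expands the wildcard entries and the grep pattern assumes the jar name contains "gcs"):

# expand the Hadoop client classpath and look for the connector jar
hadoop classpath --glob | tr ':' '\n' | grep -i gcs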

 

 

<!-- GCloud connection modification here! -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  <description>The FileSystem for gs: (GCS) uris.</description>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
  <description>
    The AbstractFileSystem for gs: (GCS) uris. Only necessary for use with Hadoop 2.
  </description>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>PROJECTNAME</value>
  <description>
    Required. Google Cloud Project ID with access to configured GCS buckets.
  </description>
</property>
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
  <description>
    Whether to use a service account for GCS authorization. If an email and
    keyfile are provided (see google.cloud.auth.service.account.email and
    google.cloud.auth.service.account.keyfile), then that service account
    will be used. Otherwise the connector will look to see if it is running on
    a GCE VM with some level of GCS access in its service account scope, and
    use that service account.
  </description>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/etc/hadoop/conf/KEYFILENAME.json</value>
  <description>
    The JSON key file of the service account used for GCS
    access when google.cloud.auth.service.account.enable is true.
  </description>
</property>

Champion
Did you restart HDFS after these changes?

Champion

Based on the error that you have, the jar file is clearly missing a .class file (GoogleHadoopFileSystem).

You can extract the jar using 7-Zip or any similar tool to inspect it.

The .class file path that is missing:

com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem

The solution is to download a fresh copy of the required jar, extract it, check that it contains all of the .class files, and add it to the classpath. This should resolve the error.
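
If you prefer checking from the command line instead of 7-Zip, a minimal sketch (the jar path and filename are examples; use whichever connector jar you placed on the cluster):

# list the jar contents and look for the class the error says is missing
# a good jar should contain com/google/cloud/hadoop/fs/gcs/GoogleHadoopFileSystem.class
jar tf /opt/cloudera/parcels/CDH/jars/gcs-connector-latest-hadoop2.jar | grep GoogleHadoopFileSystem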

 

Champion

[screenshot attachment: Untitled.png]

Rising Star

Went ahead and downloaded a fresh .jar and followed the steps in the guide posted above - got it working! Appreciate the help.