Making files stored on GCS accessible and queryable by Impala in CDH

Hello,
 
I have a CDH cluster with Cloudera Manager deployed on a set of Google Compute Engine machines. Only basic services are enabled in my CDH cluster: HDFS, YARN, ZooKeeper, Spark, Hive, and Impala. I also have a lot of files stored in Google Cloud Storage (GCS) that I want to access from the Hadoop cluster. So I followed the steps from this guide https://cloud.google.com/dataproc/docs/concepts/connectors/install-storage-connector and made the GCS files available to HDFS and Spark by using the "gs://" prefix instead of "hdfs://".
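For reference, the setup from that guide boiled down to roughly the following on my cluster (the jar name, paths, and bucket name are from my environment and may differ in yours):

```shell
# Copy the GCS connector jar onto the Hadoop classpath on every node
# (jar file name is an example; use the version you downloaded)
sudo cp gcs-connector-hadoop2-latest.jar /opt/cloudera/parcels/CDH/lib/hadoop/lib/

# Then register the gs:// filesystem in core-site.xml
# (via the Cloudera Manager safety valve):
#   <property>
#     <name>fs.gs.impl</name>
#     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
#   </property>
#   <property>
#     <name>fs.AbstractFileSystem.gs.impl</name>
#     <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
#   </property>

# After restarting the services, GCS files are visible over the gs:// prefix:
hadoop fs -ls gs://my-bucket/
```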
 
Then, for the Hive service, I modified the "Hive Auxiliary JARs Directory" configuration and pointed it at the directory containing the GCS connector jar. After that, GCS files became available to Hive: I can create external tables based on files stored in GCS, or create regular tables with the LOCATION parameter set to a GCS path. These Hive tables are queryable (MapReduce jobs also recognize the gs:// prefix), and everything works fine.
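For example, a table like this works fine in Hive (the bucket, table name, and schema are made up for illustration; the JDBC URL will depend on your HiveServer2 host):

```shell
# Create an external Hive table over files already sitting in GCS
beeline -u jdbc:hive2://localhost:10000 -e "
CREATE EXTERNAL TABLE gcs_events (
  event_id BIGINT,
  payload  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'gs://my-bucket/events/';

SELECT COUNT(*) FROM gcs_events;
"
```

Both the DDL and the query run without problems through Hive.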
 
The problem is that Impala does not load the GCS connector jar no matter where I put it, so it cannot operate on files located in GCS. Whenever I try to do anything involving GCS files, I see the following error:
(see attached screenshot: impala-error.png)
I tried putting the GCS connector jar into "/opt/cloudera/parcels/CDH/lib/impala/lib" (there are a lot of jars there that appear to be used by Impala), but Impala seems to just ignore it. I don't know how to force it to load the GCS connector jar. Is it even possible to make Impala work with the gs:// prefix?
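Concretely, this is what I tried, restart included (the table name is just a placeholder for one of my GCS-backed Hive tables):

```shell
# Drop the connector jar next to Impala's other jars (had no effect in my case)
sudo cp gcs-connector-hadoop2-latest.jar /opt/cloudera/parcels/CDH/lib/impala/lib/

# Restart the Impala service from Cloudera Manager, then retry a query
# against a GCS-backed table -- it still fails with the error above:
impala-shell -q "SELECT COUNT(*) FROM my_gcs_table;"
```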
 
This situation seems strange to me, because there are official Cloudera guides on integrating Impala with S3 and ADLS, but there is no such integration for GCS. Are there any plans to add one in the near future?