I have a CDH cluster with Cloudera Manager deployed on several Google Compute Engine machines. Only basic services are enabled in my CDH: HDFS, YARN, ZooKeeper, Spark, Hive and Impala. I also have many files stored in Google Cloud Storage and want to access them from the Hadoop cluster. So I followed the steps from this guide https://cloud.google.com/dataproc/docs/concepts/connectors/install-storage-connector and made GCS files available to HDFS and Spark by using the "gs://" prefix instead of "hdfs://".
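For reference, the connector setup from that guide boils down to putting the connector jar on the Hadoop classpath and adding properties like the following to core-site.xml. This is a minimal sketch; the project ID, keyfile path, and the choice of service-account JSON auth are assumptions, not values from my cluster:

```xml
<!-- Register the GCS connector as the handler for gs:// URIs -->
<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<!-- Placeholder project ID -->
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value>
</property>
<!-- Authenticate with a service-account JSON keyfile (placeholder path) -->
<property>
  <name>google.cloud.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/etc/hadoop/conf/gcs-key.json</value>
</property>
```

After deploying this config, `hadoop fs -ls gs://some-bucket/` should list the bucket contents if everything is wired up.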
Then, for the Hive service, I modified the "Hive Auxiliary JARs Directory" configuration and pointed it at the directory containing the GCS connector jar. After that, GCS files became available to Hive: I can create external tables over files stored in GCS, or create regular tables with the LOCATION parameter set to a GCS path. These Hive tables are queryable (MapReduce jobs also recognize the gs:// prefix) and everything works fine.
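As an illustration, an external table over GCS looks like an ordinary Hive external table with a gs:// LOCATION. The table, column, and bucket names below are hypothetical:

```sql
-- Hypothetical table over CSV files stored in a GCS bucket
CREATE EXTERNAL TABLE logs_gcs (
  ts  STRING,
  msg STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'gs://my-bucket/logs/';
```

Queries against such a table run as normal MapReduce jobs once Hive can load the connector.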
The problem is that Impala refuses to load the GCS connector jar wherever I put it, so it cannot operate on files located in GCS. Whenever I try to do anything involving GCS files, I see the following error:
I tried putting the GCS connector jar into "/opt/cloudera/parcels/CDH/lib/impala/lib" (which contains many jars that appear to be used by Impala), but Impala seems to simply ignore it. I don't know how to force it to load the GCS connector jar. Is it even possible to make Impala work with the gs:// prefix?
This situation seems strange to me, because there are official Cloudera guides on integrating Impala with S3 and ADLS, but nothing for GCS.
I checked our internal Cloudera Jira instance and found that GCS support is still under evaluation, with no ETA on a decision one way or the other. For now, as you have observed, some points of integration may work, but full CDH integration and support are not available at this time.