Starting with version 2.6.5 on the 2.x line and 3.0.0 on the 3.x line, HDP supports running workloads against Google Cloud Storage (GCS). This article describes how to get started using GCS for workloads.
The FileSystem Abstraction
Apache Hadoop includes a ‘FileSystem’ abstraction, which is used to access data on HDFS. The same abstraction can be used to talk not just to HDFS, but to any other storage layer that provides an implementation of the FileSystem abstract class.
Existing applications don’t need to change to make use of GCS. Instead, some configuration and path changes are all that is required for existing jobs to run against Cloud Storage.
Google Storage Connector
The Google Storage Connector provides an implementation of the FileSystem abstraction to facilitate access to GCS, and is available in the HDP versions mentioned above.
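As a rough illustration of how one abstraction can serve several storage layers, the sketch below picks an implementation class from the scheme of a path. It is plain Python, not the Hadoop API. Hadoop keeps comparable `fs.<scheme>.impl` entries in its configuration; the dictionary, the function name `resolve_filesystem` and the bucket name `my-bucket` are assumptions made for this example.

```python
from urllib.parse import urlparse

# Stand-in for Hadoop configuration entries of the form fs.<scheme>.impl.
# The class names are the HDFS and GCS connector FileSystem classes; the
# dictionary itself exists only for this sketch.
FS_IMPLS = {
    "fs.hdfs.impl": "org.apache.hadoop.hdfs.DistributedFileSystem",
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
}


def resolve_filesystem(path, conf=FS_IMPLS):
    """Return (scheme, bucket or host, implementation class name) for a path."""
    parsed = urlparse(path)
    key = "fs.%s.impl" % parsed.scheme
    if key not in conf:
        raise ValueError("no FileSystem implementation configured for %r" % parsed.scheme)
    return parsed.scheme, parsed.netloc, conf[key]


# my-bucket is a hypothetical bucket name.
print(resolve_filesystem("gs://my-bucket/data/file.orc"))
```

The application only ever sees a path; which storage layer serves it is decided by the scheme.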
“gs” is the filesystem prefix used for Google Cloud Storage, e.g. gs://tmp
Setting up Credentials to Access GCS
There are several ways in which credentials can be configured to access Cloud Storage, each with its own implications.
“Configuring access to Google Cloud Storage (GCS)” has details on the permissions required for the service accounts, as well as steps for configuring credentials.
Option 1: Accessing GCS via VMs deployed on Google Compute Engine
When running on GCE VMs, it is possible to associate a service account with the launched VMs. This can be done via the Google Cloud Console when launching VMs, or it can be configured in Cloudbreak while deploying a cluster.
This is the simplest mechanism to get credentials set up.
Keep in mind that this mechanism allows any process running on the VM to access the Google Cloud resources that the service account has access to. For example, users with ssh access to the nodes would be able to access data based on the service account permissions.
Option 2: Configuring credentials via configuration files
This is a more complicated approach, which involves propagating keys for the service account to all nodes in the cluster, and then making changes to relevant configuration files to point to these keys.
This approach can be used on VMs deployed on GCE, or on other nodes such as on-prem deployments. It also provides more fine-grained access control. As an example, it’s possible to configure Hive to use specific credentials while not configuring any default credentials for the rest of the cluster.
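A minimal sketch of what pointing a configuration file at a service account key might look like. The text here does not list the property names, so the two used below are an assumption taken from the GCS connector's service account settings and should be checked against the connector version in use. The helper `render_properties` and the key file path are hypothetical.

```python
import xml.etree.ElementTree as ET


def render_properties(props):
    """Render a dict as a Hadoop-style <configuration> XML string."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Assumed property names (verify for your connector version); the key file
# path is a placeholder for wherever the key was copied on each node.
credentials = {
    "google.cloud.auth.service.account.enable": "true",
    "google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
}

print(render_properties(credentials))
```

The same pair of properties could be placed in a service-specific file rather than a cluster-wide one, which is what makes the per-service setup described above possible.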
This approach can also be used by end users to provide their own credentials when running a job, instead of relying on cluster default credentials.
Validating setup
Once credentials have been set up, access can be verified by running the following command:
hadoop fs -ls gs://<bucket-name>
where ‘<bucket-name>’ is a GCS bucket which can be accessed using the configured service account.
Additional Configuration Changes
Some additional configuration changes may be required depending on the version of Cloudbreak and Ambari being used. These are listed
These changes need to be made to core-site.xml
Make sure to restart affected services, as well as Hive and Spark, after configuring credentials and making these changes.
Running Hive Jobs against GCS
Option 1: Configure the Hive Warehouse Directory so that all new tables are created on GCS
Set the following configuration property in hive-site.xml
After this, any new tables created will have their data stored on GCS.
Option 2: Specify an explicit path when creating tables
In this case, the path for a table needs to be explicitly set during table creation.
CREATE TABLE employee (e_id int) STORED AS ORC
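The statement above does not show where the path is set. In HiveQL a table's storage path is normally given with a LOCATION clause, so the sketch below builds such a statement as a string. The helper name, the bucket `my-bucket` and the path `warehouse/employee` are hypothetical, and the helper does no quoting or validation, so it is only suitable for trusted input.

```python
def create_table_on_gcs(table, columns, bucket, path, fmt="ORC"):
    """Build a CREATE TABLE statement whose data lives under gs://<bucket>/<path>."""
    location = "gs://%s/%s" % (bucket, path.strip("/"))
    return "CREATE TABLE %s (%s) STORED AS %s LOCATION '%s'" % (
        table, columns, fmt, location)


# Hypothetical bucket and path, used only to show the shape of the statement.
print(create_table_on_gcs("employee", "e_id int", "my-bucket", "warehouse/employee"))
```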
When running a cluster where Hive manages the data (doAs=false), it may be desirable to configure Cloud Storage credentials in a manner where only Hive has access to the data.
To do this, make the credentials configuration changes described earlier in hiveserver2-site and/or hiveserver2-interactive-site.
Running Spark Jobs against GCS
Similar to Hive, this is as simple as replacing the path used in the script with a path pointing at a Google Cloud Storage location.
For example:
# sc is the SparkContext (predefined in the pyspark shell)
text_file = sc.textFile("gs://<bucket_name>/file")
counts = text_file.flatMap(lambda line: line.split(" ")) \
    .map(lambda word: (word, 1)) \
    .reduceByKey(lambda a, b: a + b)
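Since the only change to the script is the path, a small helper can make the swap explicit when a job has many inputs. This is a sketch under the assumption that the objects in the bucket mirror the original directory layout. The function name, bucket and example paths are hypothetical.

```python
from urllib.parse import urlparse


def to_gcs_path(path, bucket):
    """Map an hdfs:// URI or a scheme-less absolute path onto gs://<bucket>/..."""
    parsed = urlparse(path)
    if parsed.scheme == "gs":
        return path  # already a GCS path, leave it alone
    return "gs://%s/%s" % (bucket, parsed.path.lstrip("/"))


# Hypothetical inputs; the result can be passed to sc.textFile(...) as above.
print(to_gcs_path("hdfs://namenode:8020/data/input.txt", "my-bucket"))
print(to_gcs_path("/data/input.txt", "my-bucket"))
```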
Things to Watch Out For
As mentioned previously, configuring credentials at the instance level gives access to all users who have access to the VM (e.g. via ssh from the Google Cloud Console into the running node).
File ownership and permissions returned by listing commands (hadoop fs -ls) do not reflect the actual owner or permissions for the file. These are controlled via Google Cloud IAM instead.
Wrapping Up
Accessing data stored on Google Cloud Storage requires very few changes to existing applications, because they go through the FileSystem abstraction. Cloudbreak provides a simple mechanism to launch these clusters on GCE with credentials configured. Deploying clusters via Cloudbreak is covered
Hi @sudi ts, can you share some more information about this deployment?
- Is doAs enabled (hive.server2.enable.doAs)?
- What is the authorization mechanism? Is the Ranger Authorizer being used?
If you can pull a stack trace from the HiveServer2 logs, that'll be very useful. HDP-2.6.5 ships with the Google connector, so there's no need to replace any jars. The GS connectivity is working, given that you can create this table when logged in as the hive user and can list files via hadoop fs -ls.
Cloud storage access control is generally handled via cloud provider constructs, such as IAM roles. Hadoop's view of file owners and permissions doesn't capture this. The user returned by hadoop fs -ls will typically be the logged in user, and the permissions don't indicate much.
This error normally indicates that the LLAP daemons themselves have not started up. (The error message has been improved in more recent versions.) As pointed out in the previous reply, I'd suggest looking into the LLAP configs, NM parameters, AM sizes, and specifically the LLAP logs to figure out why the daemons did not start up. More context: the ZK paths are created by the daemons when they start up. The Hive LLAP client will not create the ZK path if it does not exist, and will end up running into this error if the daemons fail to start.
This error typically indicates that the LLAP daemons are not running. The error message does need to be improved. What needs to be looked at here is why the LLAP daemons are not up. If they are, we can look at next steps. More detail on the error: the client generates splits based on the number of instances up and running. If there are no instances, it's unable to generate splits and fails with an error indicating that there are 0 locations available. (HostAffinitySplitLocationProvider needs at least 1 location to function at com.google.common.base.Preconditions.checkState(Preconditions.java:149))