We are trying to implement a solution on our on-prem installation of CDH 6.3.3. We are reading from AWS S3 as a DataFrame and saving it as a CSV file on HDFS. We need to assume a role to connect to the S3 bucket, so we are using the following Hadoop configuration.
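The configuration itself is not shown above; a typical S3A assumed-role setup looks roughly like the sketch below. The role ARN and credential provider choice are placeholders, not the poster's actual values:

```xml
<!-- Hypothetical example of an S3A assumed-role configuration (core-site.xml).
     The role ARN below is a placeholder. -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value>
</property>
<property>
  <name>fs.s3a.assumed.role.arn</name>
  <value>arn:aws:iam::123456789012:role/example-role</value>
</property>
<property>
  <!-- Credentials used for the STS AssumeRole call itself -->
  <name>fs.s3a.assumed.role.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
```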
While doing this, we were getting the error below, which was resolved by setting AWS_REGION=ca-central-1 as an environment variable.
Instantiate org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider on : com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
Now: running the Spark job with --master local works fine, since AWS_REGION is defined, but running with --master yarn still gives the same error. Our CDH admin has tried defining AWS_REGION as a global environment variable on all cluster nodes and restarting the Spark service, but the error remains.
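One likely reason for the difference is that YARN containers do not inherit the login shell environment of the nodes they run on, so a globally exported AWS_REGION never reaches the executors. Spark can pass environment variables into the application master and executors explicitly; a sketch (not verified on CDH 6.3.3 specifically):

```shell
# Hypothetical spark-submit invocation: propagate AWS_REGION into the
# YARN application master and every executor container.
spark-submit \
  --master yarn \
  --conf spark.yarn.appMasterEnv.AWS_REGION=ca-central-1 \
  --conf spark.executorEnv.AWS_REGION=ca-central-1 \
  your_job.py
```

`your_job.py` stands in for whatever application is actually being submitted.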
Even better: if you know the region your data actually lives in, set fs.s3a.endpoint to the regional endpoint. This saves an HTTP request to the central endpoint every time an S3A filesystem instance is created.
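For the ca-central-1 region mentioned above, that setting would look like this (in core-site.xml or the equivalent Spark `spark.hadoop.` prefixed configuration):

```xml
<!-- Pin S3A to the regional endpoint instead of the central one -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.ca-central-1.amazonaws.com</value>
</property>
```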
We are working on the fix for this and will backport it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.