AWS_REGION environment variable declaration
Labels: Apache Spark
Created 02-06-2021 03:32 PM
We are trying to implement a solution on our on-premises installation of CDH 6.3.3. We read from AWS S3 as a DataFrame and save it as a CSV file on HDFS. We need to assume a role to connect to the S3 bucket, so we are using the following Hadoop configuration:
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", "A***********")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xfJ*******************")
sc.hadoopConfiguration.set("fs.s3a.assumed.role.arn", "arn:aws:iam::45**********:role/can-********************/can********************")
While doing this, we were getting the error below, which was resolved by declaring AWS_REGION=ca-central-1 as an environment variable.
Error:
Created on 06-23-2021 10:09 AM - edited 06-23-2021 10:14 AM
I'm afraid you've just hit a problem which we've only just started encountering:
HADOOP-17771. S3AFS creation fails "Unable to find a region via the region provider chain."
This failure surfaces when _all_ the following conditions are met:
- The deployment is outside EC2.
- The configuration option `fs.s3a.endpoint` is unset.
- The file `~/.aws/config` does not exist, or it does not set a region.
- The JVM system property `aws.region` does not declare a region.
- The environment variable `AWS_REGION` does not declare a region.
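(Not from the original reply, just to illustrate the last three conditions: breaking any one of them supplies a region. A minimal sketch in Scala, assuming the ca-central-1 region from the question; note this only affects the JVM it runs in, so executors may need the property passed via spark.executor.extraJavaOptions instead.)

// Hypothetical workaround: set the aws.region JVM system property before the
// first S3A filesystem instance (and hence the AWS SDK region lookup) is created.
System.setProperty("aws.region", "ca-central-1")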
You can make this go away by setting the S3 endpoint to s3.amazonaws.com, either in core-site.xml:
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.amazonaws.com</value>
</property>
or in your Scala code:
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.amazonaws.com")
Even better, if you know the actual region your data lives in, set fs.s3a.endpoint to the regional endpoint. This will save an HTTP request to the central endpoint whenever an S3A filesystem instance is created.
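For example, if the data lives in ca-central-1 (as the question suggests), the regional endpoint would look something like this; the exact hostname is assumed from AWS's s3.&lt;region&gt;.amazonaws.com naming convention:

// Point S3A directly at the regional endpoint instead of the central one.
sc.hadoopConfiguration.set("fs.s3a.endpoint", "s3.ca-central-1.amazonaws.com")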
We are working on the fix for this and will backport it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.
