We are trying to implement a solution on our on-prem installation of CDH 6.3.3. We are reading from AWS S3 as a DataFrame and saving it as a CSV file on HDFS. We need to assume a role to connect to the S3 bucket, so we are using the following Hadoop configuration.
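The configuration itself is not shown above; a typical S3A assumed-role setup looks roughly like the sketch below. The role ARN and credential provider choice are placeholders, not the poster's actual values:

```xml
<!-- Hypothetical example of an S3A assumed-role configuration (core-site.xml).
     The role ARN below is a placeholder. -->
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider</value>
</property>
<property>
  <name>fs.s3a.assumed.role.arn</name>
  <value>arn:aws:iam::123456789012:role/example-role</value>
</property>
<property>
  <!-- Credentials used for the STS AssumeRole call itself -->
  <name>fs.s3a.assumed.role.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider</value>
</property>
```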
While doing this, we were getting the error below, which was resolved by setting AWS_REGION=ca-central-1 as an environment variable.
Instantiate org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider on : com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
Now: running the Spark job with --master local works fine, since AWS_REGION is defined, but running with --master yarn still gives the same error. Our CDH admin has tried defining AWS_REGION as a global environment variable on all cluster nodes and restarting the Spark service, but the error remains.
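One likely reason for the difference is that YARN containers do not inherit the login shell environment of the nodes they run on, so a globally exported AWS_REGION never reaches the executors. Spark can pass environment variables into the application master and executors explicitly; a sketch (not verified on CDH 6.3.3 specifically):

```shell
# Hypothetical spark-submit invocation: propagate AWS_REGION into the
# YARN application master and every executor container.
spark-submit \
  --master yarn \
  --conf spark.yarn.appMasterEnv.AWS_REGION=ca-central-1 \
  --conf spark.executorEnv.AWS_REGION=ca-central-1 \
  your_job.py
```

`your_job.py` stands in for whatever application is actually being submitted.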
Even better: if you know the region your data actually lives in, set fs.s3a.endpoint to the regional endpoint. This saves an HTTP request to the central endpoint every time an S3A filesystem instance is created.
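For the ca-central-1 region mentioned above, that setting would look like this (in core-site.xml or the equivalent Spark `spark.hadoop.` prefixed configuration):

```xml
<!-- Pin S3A to the regional endpoint instead of the central one -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>s3.ca-central-1.amazonaws.com</value>
</property>
```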
We are working on the fix for this and will backport it where needed. I was not expecting CDH 6.3.x to need it, but clearly it does.