We are trying to implement a solution on our on-premises installation of CDH 6.3.3. We read from AWS S3 into a DataFrame and save the result as a CSV file on HDFS. We need to assume a role to connect to the S3 bucket, so we are using the following Hadoop configuration:
sc.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
sc.hadoopConfiguration.set("fs.s3a.access.key", "A***********")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "xfJ*******************")
sc.hadoopConfiguration.set("fs.s3a.assumed.role.arn", "arn:aws:iam::45**********:role/can-********************/can********************")
While doing this we were getting the error below, which we resolved by declaring AWS_REGION=ca-central-1 as an environment variable.
Error:
Instantiate org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider on : com.amazonaws.SdkClientException: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.: Unable to find a region via the region provider chain. Must provide an explicit region in the builder or setup environment to supply a region.
Now, when we run the Spark job with --master local it runs fine, since AWS_REGION is defined, but when we run it with --master yarn we still get the same error. Our CDH admin has tried defining AWS_REGION as a global environment variable on all the cluster nodes and restarting the Spark service, but we still get the same error.
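In case it is relevant, this is how we understand the environment variable could be forwarded to the YARN containers using Spark's documented spark.yarn.appMasterEnv.* and spark.executorEnv.* properties; we have not confirmed whether this is the correct fix, and the app name below is just a placeholder:

import org.apache.spark.sql.SparkSession

// Forward AWS_REGION to the YARN application master and executors (client mode);
// in cluster mode these would need to be passed via spark-submit --conf or spark-defaults.conf,
// since the application master starts before this driver code runs.
val spark = SparkSession.builder()
  .appName("s3-to-hdfs")
  .config("spark.yarn.appMasterEnv.AWS_REGION", "ca-central-1")
  .config("spark.executorEnv.AWS_REGION", "ca-central-1")
  .getOrCreate()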
Please suggest a resolution.
TIA