Access S3 Bucket from Spark

Hi,

We have a default S3 bucket, say A, which is configured in core-site.xml. Accessing that bucket from Spark works fine in both client and cluster mode. But for another bucket, say B, which is not configured in core-site.xml, access works in client mode but fails in cluster mode with the exception below. As a workaround, we pass a copy of core-site.xml that points to bucket B's jceks file, and that works.

Why is this property not working in cluster mode? Please let me know if we need to set any other property for cluster mode.

spark.hadoop.hadoop.security.credential.provider.path

Client-Mode (Working fine)

spark-submit --class <ClassName> --master yarn --deploy-mode client --files .../conf/hive-site.xml --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/.../1.jceks --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --queue default <jar_path>

Cluster-Mode (Not working)

spark-submit --class <ClassName> --master yarn --deploy-mode cluster --files .../conf/hive-site.xml --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/.../1.jceks --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --queue default <jar_path>

Exception:

diagnostics: User class threw exception: java.nio.file.AccessDeniedException: s3a://<bucketname>/server_date=2017-08-23: getFileStatus on s3a://<bucketname>/server_date=2017-08-23: com.amazonaws.services.s3.model.AmazonS3Exception: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 92F94B902D52D864), S3 Extended Request ID: 1Df3YxG5znruRbsOpsGhCO40s4d9HKhvD14FKk1DSt//lFFuEdXjGueNg5+MYbUIP4aKvsjrZmw=

Cluster-Mode (Workaround which is working)

Step 1: cp /usr/hdp/current/hadoop-client/conf/core-site.xml /<home_dir>/core-site.xml
Step 2: Edit core-site.xml and replace jceks://hdfs/.../default.jceks with jceks://hdfs/.../1.jceks
Step 3: Pass core-site.xml to the spark-submit command

spark-submit --class <ClassName> --master yarn --deploy-mode cluster --files .../conf/hive-site.xml,../conf/core-site.xml --conf spark.hadoop.hadoop.security.credential.provider.path=jceks://hdfs/.../1.jceks --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --verbose --queue default <jar_path>

Thanks

Subacini

Re: Access S3 Bucket from Spark

S3A actually has an extra option that lets you set per-bucket jceks files: fs.s3a.security.credential.provider.path. It takes the same values as the normal one, but lets you take advantage of the per-bucket config feature of S3A, where every bucket-specific option under fs.s3a.bucket.* is remapped to fs.s3a.* before the bucket is set up.

You should be able to add a reference to it like so:

spark.hadoop.fs.s3a.bucket.b.security.credential.provider.path jceks://hdfs/something.jceks
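
As a rough sketch of how that could fit into the cluster-mode submit from the question: the snippet below builds the per-bucket property name from a bucket name and prints the resulting spark-submit command instead of running it. The bucket name `b`, the keystore location `jceks://hdfs/something.jceks` (written in the same jceks:// form as the command in the question), and the `<ClassName>` / `<jar_path>` placeholders are assumptions for illustration. It also assumes your Hadoop/S3A version supports per-bucket configuration.

```shell
# Hypothetical values: replace with your bucket and the keystore holding its keys.
BUCKET="b"
JCEKS_PATH="jceks://hdfs/something.jceks"

# Per-bucket S3A option, prefixed with "spark.hadoop." so that Spark copies it
# into the Hadoop configuration it hands to the S3A filesystem.
CONF_KEY="spark.hadoop.fs.s3a.bucket.${BUCKET}.security.credential.provider.path"

# Dry run: print the command rather than executing it, so it can be reviewed first.
SUBMIT_CMD="spark-submit --class <ClassName> --master yarn --deploy-mode cluster --conf ${CONF_KEY}=${JCEKS_PATH} --queue default <jar_path>"
echo "${SUBMIT_CMD}"
```

Once the printed command looks right, fill in the placeholders and run it directly.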

Hopefully this helps. One challenge we always have with authentication work is that we can't log at the level of detail we'd like, because that would leak secrets too easily. So even when logging at debug, not enough information gets printed. Sorry.

See also: https://hortonworks.github.io/hdp-aws/s3-security/index.html

Oh, one more thing: spark-submit copies your local AWS_ environment variables over to the fs.s3a.secret.key and fs.s3a.access.key values. Try unsetting them before you submit work and see if that makes a difference.
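
A minimal sketch of that check, assuming the standard AWS variable names (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and, on setups that use session credentials, `AWS_SESSION_TOKEN`) are the ones set in your shell. It clears them and then lists the names of any AWS_ variables that remain, without printing their values:

```shell
# Clear AWS credentials from the submitting shell so spark-submit has
# nothing to copy into the fs.s3a.* options.
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# Collect the names (not the values) of any AWS_ variables still set.
REMAINING="$(env | grep '^AWS_' | cut -d= -f1 || true)"
if [ -z "${REMAINING}" ]; then
  echo "No AWS_ variables set"
else
  echo "Still set: ${REMAINING}"
fi
```

Run this in the same shell session that you then use for spark-submit, since unset only affects the current shell.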