Access S3 Bucket from Spark



We have a default S3 bucket, say A, which is configured in core-site.xml. Accessing that bucket from Spark works fine in both client and cluster mode. Bucket B, which is not configured in core-site.xml, works fine in client mode, but in cluster mode it fails with the exception below. As a workaround, we pass a copy of core-site.xml that points to bucket B's jceks file, and that works.

Why is this property not working in cluster mode? Let me know if we need to set any other property for cluster mode.

Client-Mode (Working fine)

spark-submit --class <ClassName> --master yarn --deploy-mode client --files .../conf/hive-site.xml --conf --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --queue default <jar_path>

Cluster-Mode (Not working)

spark-submit --class <ClassName> --master yarn --deploy-mode cluster --files .../conf/hive-site.xml --conf --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --queue default <jar_path>

diagnostics: User class threw exception: java.nio.file.AccessDeniedException: s3a://<bucketname>/server_date=2017-08-23: getFileStatus on s3a://<bucketname>/server_date=2017-08-23: Forbidden (Service: Amazon S3; Status Code: 403; Error Code: 403 Forbidden; Request ID: 92F94B902D52D864), S3 Extended Request ID: 1Df3YxG5znruRbsOpsGhCO40s4d9HKhvD14FKk1DSt//lFFuEdXjGueNg5+MYbUIP4aKvsjrZmw=

Cluster-Mode (Workaround which is working)

Step 1: cp /usr/hdp/current/hadoop-client/conf/core-site.xml /<home_dir>/core-site.xml
Step 2: Edit core-site.xml and replace jceks://hdfs/.../default.jceks with jceks://hdfs/.../1.jceks
Step 3: Pass core-site.xml to the spark-submit command

spark-submit --class <ClassName> --master yarn --deploy-mode cluster --files .../conf/hive-site.xml,../conf/core-site.xml --conf --jars $SPARK_HOME/lib/datanucleus-api-jdo-3.2.6.jar,$SPARK_HOME/lib/datanucleus-rdbms-3.2.9.jar,$SPARK_HOME/lib/datanucleus-core-3.2.10.jar --verbose --queue default <jar_path>
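Steps 1 and 2 of the workaround can be scripted. The following is only a sketch: the function name make_bucket_core_site and its two arguments are made up for illustration, and it assumes the only change needed is swapping the default.jceks file name for 1.jceks wherever it appears in the copied file.

```shell
#!/bin/sh
# Hypothetical helper: write a copy of core-site.xml that points at
# bucket B's credential store (1.jceks) instead of default.jceks.
#   $1 = source core-site.xml, $2 = destination file
make_bucket_core_site() {
  src="$1"
  dest="$2"
  # Copy and edit in one pass; the source file is left untouched.
  sed 's#/default\.jceks#/1.jceks#g' "$src" > "$dest"
}

# Example call (commented out; paths are the ones from the steps above):
# make_bucket_core_site /usr/hdp/current/hadoop-client/conf/core-site.xml "$HOME/core-site.xml"
```

The resulting file is then passed through --files exactly as in Step 3.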





S3A actually has an extra option that lets you set per-bucket jceks files. It takes the same values as the normal one, but lets you take advantage of the per-bucket config feature of S3A, where every bucket-specific option of the form fs.s3a.bucket.* is remapped to fs.s3a.* before the bucket is set up.

You should be able to add a reference to it like so: hdfs:///something.jceks

Hopefully this helps. One challenge we always have with the authentication work is that we can't log it at the level of detail we'd like, because that would leak secrets. So even when logging at debug, not enough information gets printed. Sorry.
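As a rough sketch of what that could look like on the command line: the option name is missing from the text above, so this assumes it is fs.s3a.security.credential.provider.path, which in per-bucket form becomes fs.s3a.bucket.<bucketname>.security.credential.provider.path. Whether it is available depends on the Hadoop version in your HDP release. The bucket name bucket-b is a placeholder, and the jceks path is elided exactly as in the question.

```shell
#!/bin/sh
# Hypothetical bucket name; substitute the real name of bucket B.
BUCKET="bucket-b"
# Credential store for bucket B, with the path elided as in the question.
JCEKS_URI="jceks://hdfs/.../1.jceks"

# Assumed property: fs.s3a.bucket.<bucket>.security.credential.provider.path
# The spark.hadoop. prefix asks Spark to copy the setting into the Hadoop
# configuration used by the job.
BUCKET_CONF="spark.hadoop.fs.s3a.bucket.${BUCKET}.security.credential.provider.path=${JCEKS_URI}"

# Print the extra argument to add to the existing cluster-mode spark-submit command.
echo "--conf ${BUCKET_CONF}"
```

If this works, the modified copy of core-site.xml should no longer be needed, since the setting only applies to bucket B and leaves the default bucket's configuration alone.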


Oh, one more thing: spark-submit copies your local AWS_ environment variables over to the fs.s3a.secret.key and fs.s3a.access.key values. Try unsetting them before you submit work and see if that makes a difference.
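A quick way to try this, as a sketch: it assumes the variables in question are the standard AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, with AWS_SESSION_TOKEN included for completeness. It reports which ones are exported without printing their values, then unsets them for the current shell only.

```shell
#!/bin/sh
# Report which AWS credential variables are exported, without echoing secrets.
for name in AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN; do
  if env | grep -q "^${name}="; then
    echo "${name} is set; unsetting it for this shell"
  fi
done

# Remove them so spark-submit has nothing to copy into fs.s3a.access.key
# and fs.s3a.secret.key. This affects the current shell session only.
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
```

Run spark-submit from the same shell afterwards, so that the job is submitted without those variables.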