Spark on S3

Expert Contributor

I am unable to run queries on S3 data using Spark and PySpark. The job throws the error below.

: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)

at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)

….

….

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)

at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)

We tried adding the parameter below, but with no luck.

Parameter name: fs.s3a.impl

Parameter value: org.apache.hadoop.fs.s3a.S3AFileSystem

We added this parameter to hdfs-site.xml, core-site.xml, and hive-site.xml, and also added the AWS jar files to the classpath in mapred-site.xml.


Super Guru

Hi @Kirk Haslbeck,

I don't know which version you are using, but if you haven't seen it yet, take a look at the Jira below; it might help.

https://issues.apache.org/jira/browse/SPARK-7442

Guru

Take a look at https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.... which gives details on how to access S3 from Spark.

Master Mentor

Yep, the S3A implementation is not complete yet. Try using S3N for now, or follow Alex's article referenced below.

Expert Contributor

Thanks all @Artem Ervits @Tom McCuch for the comments. I did get it resolved by passing all the S3 jars properly on the classpath. The articles included in your threads helped.
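Since the fix came down to getting the S3 jars onto the classpath, one way to check a jar directory before resubmitting is to look for the class named in the ClassNotFoundException. The following is a minimal sketch using only the Python standard library; the helper name `jars_containing` and the directory path in the usage note are hypothetical, not something from this thread.

```python
import zipfile
from pathlib import Path

# Class named in the ClassNotFoundException from the question,
# written as the entry path it would have inside a jar.
S3A_CLASS_ENTRY = "org/apache/hadoop/fs/s3a/S3AFileSystem.class"


def jars_containing(class_entry, jar_dir):
    """Return the sorted paths of jars under jar_dir that contain class_entry."""
    matches = []
    for jar_path in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar_path) as jar:
                if class_entry in jar.namelist():
                    matches.append(str(jar_path))
        except zipfile.BadZipFile:
            # Skip files that end in .jar but are not valid archives.
            continue
    return matches
```

Calling `jars_containing(S3A_CLASS_ENTRY, "/path/to/your/jars")` (placeholder path) returns an empty list when no jar in that directory provides the class, which would be consistent with the error above. It only inspects files on disk; it does not tell you whether Spark actually loads those jars.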

@Kirk Haslbeck - I was working on something similar: writing PySpark that uses SparkSQL to analyze data in S3 through the S3A filesystem client. I documented my work, with instructions, here:

https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.h...
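For readers who cannot open the linked article, here is a rough sketch of how the two pieces discussed in this thread (the S3 jars and the fs.s3a.impl setting) might be passed to spark-submit. It is not taken from the article. The jar paths and the script name `query_s3.py` are hypothetical placeholders, and it assumes your Spark version forwards `spark.hadoop.*` properties to the Hadoop configuration.

```python
import shlex

# Value of fs.s3a.impl quoted in the question.
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def build_spark_submit(script, jars):
    """Build a spark-submit argument list that ships the given jars and sets fs.s3a.impl."""
    return [
        "spark-submit",
        "--jars", ",".join(jars),  # comma-separated list of jars to ship with the job
        "--conf", f"spark.hadoop.fs.s3a.impl={S3A_IMPL}",
        script,
    ]


# Hypothetical jar paths and script name; substitute the ones from your install.
cmd = build_spark_submit(
    "query_s3.py",
    ["/path/to/hadoop-aws.jar", "/path/to/aws-java-sdk.jar"],
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The list form can be handed to `subprocess.run(cmd)` directly; the printed line is only for copying into a shell. Which jar versions are needed depends on the Hadoop build in use, so treat the two names above as stand-ins.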