Support Questions

khaslbeck · ‎05-26-2016

We are unable to run queries on S3 data using Spark and PySpark. It throws the error below.

: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)

at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)

….

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)

at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)

We tried adding the parameter below, but no luck.

Parameter name: fs.s3a.impl

Parameter value: org.apache.hadoop.fs.s3a.S3AFileSystem

We added this parameter to hdfs-site.xml, core-site.xml, and hive-site.xml, and also added the AWS jar files to the classpath in mapred-site.xml.
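A ClassNotFoundException like the one above generally means the JVM could not load the class from its classpath, so setting fs.s3a.impl alone would not be expected to help if no jar on the classpath contains the class. As a rough diagnostic sketch (not something from the original post), the Python below checks which of a given list of jars contain S3AFileSystem.class. The function name jars_containing and the way jar paths are passed in are illustrative assumptions.

```python
import sys
import zipfile
from pathlib import Path

# Entry name inside a jar, derived from the class name in the stack trace.
S3A_CLASS_ENTRY = "org/apache/hadoop/fs/s3a/S3AFileSystem.class"


def jars_containing(jar_paths, class_entry=S3A_CLASS_ENTRY):
    """Return the paths (as strings) of the jars that contain class_entry.

    Hypothetical helper for illustration; jar_paths is any iterable of paths.
    """
    found = []
    for jar in jar_paths:
        path = Path(jar)
        # Skip paths that do not exist or are not zip/jar archives.
        if not path.is_file() or not zipfile.is_zipfile(str(path)):
            continue
        with zipfile.ZipFile(str(path)) as archive:
            if class_entry in archive.namelist():
                found.append(str(path))
    return found


if __name__ == "__main__":
    # Pass the jars you believe are on the Spark classpath as arguments.
    hits = jars_containing(sys.argv[1:])
    if hits:
        print("S3AFileSystem found in:")
        for hit in hits:
            print("  " + hit)
    else:
        print("No given jar contains " + S3A_CLASS_ENTRY)
```

If none of the jars you pass in contains the class, the classpath is the more likely culprit than the fs.s3a.impl setting.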

ravi1 · ‎05-26-2016

Take a look at https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.... which explains how to access S3 from Spark.

tmccuch · ‎05-26-2016

Sorry, this was the article I meant to point you to:

https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.h...

jyadav · ‎05-26-2016

Hi @Kirk Haslbeck,

I don't know which version you are using, but if you haven't seen it yet, take a look at the Jira below; it might help.

https://issues.apache.org/jira/browse/SPARK-7442

aervits · ‎05-26-2016

Yep, the S3A implementation is not complete yet. Try using S3N for now, or follow Alex's article referenced below.

khaslbeck · ‎05-26-2016

Thanks, all (@Artem Ervits, @Tom McCuch), for the comments. I got it resolved by passing all the S3 jars properly on the classpath. The articles linked in your replies helped.
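The post does not say exactly how the jars were passed, so the following is only a sketch of one common approach: handing the jars to spark-submit with --jars and setting fs.s3a.impl through a spark.hadoop.* property. The helper name build_spark_submit, the jar paths, and the script name are all placeholders, not values from this thread.

```python
import shlex

# Class name taken from the error message and parameter in the question.
S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def build_spark_submit(app, jars, s3a_impl=S3A_IMPL):
    """Build a spark-submit argument list (hypothetical helper).

    app  -- path to the PySpark script or application jar
    jars -- list of jar paths to ship on the driver and executor classpaths
    """
    cmd = ["spark-submit"]
    if jars:
        # --jars takes a single comma-separated list.
        cmd += ["--jars", ",".join(jars)]
    # Properties prefixed with spark.hadoop. are passed to the Hadoop configuration.
    cmd += ["--conf", "spark.hadoop.fs.s3a.impl=" + s3a_impl]
    cmd.append(app)
    return cmd


if __name__ == "__main__":
    # Placeholder paths; substitute the jars that match your Hadoop version.
    example = build_spark_submit(
        "my_job.py",
        ["/path/to/hadoop-aws.jar", "/path/to/aws-java-sdk.jar"],
    )
    print(shlex.join(example))
```

The jar versions need to match the Hadoop build in use; the linked articles cover the specifics for HDP.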

bmathew · ‎05-30-2016

@Kirk Haslbeck - I was working on something similar: writing PySpark code that uses Spark SQL to analyze data in S3 through the S3A filesystem client. I documented my work with instructions here:

https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.h...

Cloudera Community

Spark on S3