Archives of Support Questions (Read Only)

This is an archived board for historical reference. Information and links may no longer be available or relevant

Spark on S3

Expert Contributor

We are unable to run queries on S3 data using Spark and PySpark. The job throws the error below.

: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
    ….
    ….
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)

We tried adding the parameter below, but with no luck.

Parameter name: fs.s3a.impl

Parameter value: org.apache.hadoop.fs.s3a.S3AFileSystem

We added this parameter to hdfs-site.xml, core-site.xml, and hive-site.xml, and also added the AWS jar files to the classpath in mapred-site.xml.
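A ClassNotFoundException for S3AFileSystem usually means the JVM cannot see the jar that contains the class. Setting fs.s3a.impl only names the class; it does not put it on the classpath. As a quick diagnostic, here is a minimal sketch in Python that scans a directory of jars and reports which ones actually contain the class. The function name `jars_containing` and the directory you pass in are assumptions for illustration and do not come from this thread.

```python
import zipfile
from pathlib import Path

# The class named in the stack trace above.
S3A_CLASS = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def jars_containing(class_name, jar_dir):
    """Return the .jar files under jar_dir that contain class_name."""
    # A class a.b.C is stored inside a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    hits = []
    for jar in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    hits.append(jar)
        except zipfile.BadZipFile:
            # Skip files that end in .jar but are not valid archives.
            continue
    return hits
```

Calling `jars_containing(S3A_CLASS, some_directory)` on the directory that Spark loads its jars from (the location depends on your installation) should return at least one jar, typically the hadoop-aws jar. An empty result suggests the jar is missing from that location rather than misconfigured.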

Super Guru

Hi @Kirk Haslbeck,

I don't know which version you are using, but if you haven't seen it yet, take a look at the Jira below; it might help.

https://issues.apache.org/jira/browse/SPARK-7442

Guru

Take a look at https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.... which gives details on how to access S3 from Spark.

Master Mentor

Yep, the S3A implementation is not complete yet. Try using S3N for now, or follow Alex's article referenced below.

Expert Contributor

Thanks to all (@Artem Ervits, @Tom McCuch) for the comments. I got it resolved by passing all the S3 jars properly on the classpath. The articles linked in your replies helped.
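The exact command used for the fix is not given in this thread, so the following is only a sketch of one way to pass the S3 jars on the classpath. It assembles a spark-submit command line that uses the standard `--jars` and `--conf` options. The helper name `spark_submit_command`, the jar paths, and the script name are all hypothetical placeholders. The `spark.hadoop.` prefix is how Spark forwards a property such as fs.s3a.impl into the Hadoop configuration.

```python
def spark_submit_command(app, jars, extra_conf=None):
    """Build a spark-submit argument list that ships the given jars.

    app        -- path to the application script (hypothetical)
    jars       -- jar paths to place on the driver and executor classpaths
    extra_conf -- optional dict of additional Spark properties
    """
    conf = {
        # Same parameter as in the question, forwarded to Hadoop by Spark.
        "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    }
    conf.update(extra_conf or {})

    # --jars takes a single comma-separated list.
    cmd = ["spark-submit", "--jars", ",".join(jars)]
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# Placeholder paths: substitute the hadoop-aws jar and the AWS SDK jar
# that match your own Hadoop version.
example = spark_submit_command(
    "my_job.py",
    ["/path/to/hadoop-aws.jar", "/path/to/aws-java-sdk.jar"],
)
print(" ".join(example))
```

The list form can be handed to `subprocess.run` directly, which avoids shell quoting problems. Whether `--jars` alone is enough may depend on the Spark version and cluster setup, so treat this as a starting point.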

@Kirk Haslbeck - I was working on something similar: writing PySpark code that uses Spark SQL to analyze data in S3 through the S3A filesystem client. I documented my work, with instructions, here:

https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.h...