Spark on S3

khaslbeck — Thu, 26 May 2016 20:02:05 GMT

I am unable to run queries on S3 data using Spark or PySpark. The following error is thrown:

: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)

at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)

….

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)

at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)

We tried adding the parameter below, but had no luck.

Parameter name: fs.s3a.impl

Parameter value: org.apache.hadoop.fs.s3a.S3AFileSystem

I added this parameter to hdfs-site.xml, core-site.xml, and hive-site.xml, and also added the AWS jar files to the classpath in mapred-site.xml.
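A ClassNotFoundException like the one above usually means that no jar on the classpath contains the class, regardless of how fs.s3a.impl is set. The following is a minimal Python sketch for checking which jars in a directory actually ship the class named in the stack trace. The helper name `jars_containing` and the directory you pass in are illustrative assumptions, not part of Spark or Hadoop.

```python
import zipfile
from pathlib import Path

# Path of the class inside a jar, derived from the class name in the stack trace.
S3A_CLASS_ENTRY = "org/apache/hadoop/fs/s3a/S3AFileSystem.class"


def jars_containing(class_entry, jar_dir):
    """Return the sorted file names of jars in jar_dir that contain class_entry.

    Hypothetical helper for illustration: jar files are zip archives,
    so the standard zipfile module can list their entries.
    """
    matches = []
    for jar in sorted(Path(jar_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if class_entry in archive.namelist():
                    matches.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that end in .jar but are not valid archives.
            continue
    return matches
```

Calling `jars_containing(S3A_CLASS_ENTRY, "/path/to/jars")` (the path is a placeholder) and getting an empty list back would suggest that the jar providing S3AFileSystem is not in that directory.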

Re: Spark on S3

tmccuch — Thu, 26 May 2016 20:26:06 GMT

Sorry, this was the article I meant to point you to:

https://community.hortonworks.com/articles/25578/how-to-access-data-files-stored-in-aws-s3-buckets.html

Re: Spark on S3

jyadav — Thu, 26 May 2016 20:38:30 GMT

Hi @Kirk Haslbeck,

I don't know which version you are using, but if you haven't seen it yet, take a look at the Jira below; it might help.

https://issues.apache.org/jira/browse/SPARK-7442

Re: Spark on S3

ravi1 — Thu, 26 May 2016 21:01:42 GMT

Take a look at https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.html, which gives details on how to access S3 from Spark.

Re: Spark on S3

aervits — Thu, 26 May 2016 22:41:00 GMT

Yep, the S3A implementation is not complete yet. Try using S3N for now, or follow Alex's article referenced below.

Re: Spark on S3

khaslbeck — Fri, 27 May 2016 02:41:49 GMT

Thanks all @Artem Ervits @Tom McCuch for the comments. I did get it resolved by passing all the S3 jars properly on the classpath. The articles included in your threads helped.
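The post does not say exactly how the jars were passed. One common approach is the `--jars` option of spark-submit, which adds the listed jars to the driver and executor classpaths, combined with a `spark.hadoop.`-prefixed `--conf` entry so the fs.s3a.impl setting reaches the Hadoop configuration. Below is a minimal sketch in Python that assembles such a command line. The function name `build_spark_submit`, the jar paths, and the script name are hypothetical placeholders.

```python
def build_spark_submit(app, jars, s3a_impl="org.apache.hadoop.fs.s3a.S3AFileSystem"):
    """Assemble a spark-submit argument list (illustrative helper, not a Spark API).

    app  -- path to the application script or jar to run
    jars -- paths of the extra jars to put on the classpath
    """
    return [
        "spark-submit",
        # --jars expects a single comma-separated list of jar paths.
        "--jars", ",".join(jars),
        # Properties prefixed with spark.hadoop. are copied into the Hadoop configuration.
        "--conf", f"spark.hadoop.fs.s3a.impl={s3a_impl}",
        app,
    ]


# Placeholder paths: substitute the hadoop-aws and AWS SDK jars from your own install.
command = build_spark_submit(
    "my_job.py",
    ["/path/to/hadoop-aws.jar", "/path/to/aws-java-sdk.jar"],
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command)` on a machine where spark-submit is on the PATH.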

Re: Spark on S3

bmathew — Mon, 30 May 2016 23:57:05 GMT

@Kirk Haslbeck - I was working on something similar: writing PySpark code that uses SparkSQL to analyze data in S3 through the S3A filesystem client. I documented my work, with instructions, here:

https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.html