Created 05-26-2016 01:02 PM
We are unable to run queries against S3 data from Spark and PySpark. The job throws the error below.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
….
….
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
We tried adding the parameter below, but with no luck.
Parameter name: fs.s3a.impl
Parameter value: org.apache.hadoop.fs.s3a.S3AFileSystem
We added this parameter to hdfs-site.xml, core-site.xml, and hive-site.xml, and also added the AWS jar files to the classpath in mapred-site.xml.
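A ClassNotFoundException for S3AFileSystem means the JVM could not load that class from any jar on its classpath, whatever fs.s3a.impl is set to. One way to check is to scan a directory of jars for the class file. The following is only a minimal sketch: the function name find_jars_with_class is hypothetical, and the directory you pass in depends on your installation. In stock Hadoop distributions the class ships in the hadoop-aws jar.

```python
import zipfile
from pathlib import Path

# Class named in the stack trace above.
S3A_CLASS = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def find_jars_with_class(jar_dir, class_name=S3A_CLASS):
    """Return the paths of all jars under jar_dir that contain class_name.

    find_jars_with_class is a hypothetical helper, not part of Hadoop or Spark.
    """
    # A jar is a zip archive; classes are stored as path/to/Name.class entries.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(jar_dir).rglob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    matches.append(str(jar))
        except (zipfile.BadZipFile, OSError):
            # Skip unreadable or corrupt files instead of aborting the scan.
            continue
    return matches
```

If the function returns an empty list for every directory on the Spark driver and executor classpaths, the jar is missing rather than misconfigured.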
Created 05-26-2016 02:01 PM
Take a look at https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets.... which gives details on how to access S3 from Spark.
Created 05-26-2016 01:26 PM
Sorry, this was the article I meant to point you to:
Created 05-26-2016 01:38 PM
Hi @Kirk Haslbeck,
I don't know which version you are using, but if you haven't seen it yet, take a look at the Jira below; it might help.
Created 05-26-2016 03:41 PM
Yep, the S3A implementation is not complete yet. Try using S3N for now, or follow Alex's article referenced below.
Created 05-26-2016 07:41 PM
Thanks all @Artem Ervits @Tom McCuch for the comments. I did get it resolved by passing all the S3 jars properly on the classpath. The articles included in your threads helped.
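For later readers, "passing the S3 jars on the classpath" can be sketched as assembling a spark-submit command. This is an illustration under assumptions, not the exact command used in this thread: the jar paths, the script name query_s3.py, and the helper build_spark_submit are all placeholders. The --jars option and the spark.hadoop. configuration prefix are standard spark-submit features, and the hadoop-aws and AWS SDK jars generally need to match your Hadoop build.

```python
import shlex

S3A_IMPL = "org.apache.hadoop.fs.s3a.S3AFileSystem"


def build_spark_submit(script, jars, extra_conf=None):
    """Build a spark-submit argument list that ships the given jars.

    build_spark_submit is a hypothetical helper for illustration only.
    """
    # spark.hadoop.* settings are copied into the Hadoop configuration.
    conf = {"spark.hadoop.fs.s3a.impl": S3A_IMPL}
    conf.update(extra_conf or {})

    # --jars takes a comma-separated list and adds the jars to the
    # driver and executor classpaths.
    args = ["spark-submit", "--jars", ",".join(jars)]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


# Placeholder paths: substitute the jars from your own distribution.
cmd = build_spark_submit(
    "query_s3.py",
    ["/path/to/hadoop-aws.jar", "/path/to/aws-java-sdk.jar"],
)
print(shlex.join(cmd))
```

Building the argument list separately from running it makes it easy to inspect the exact command before submitting the job.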
Created 05-30-2016 04:57 PM
@Kirk Haslbeck - I was working on something similar: writing PySpark code that uses SparkSQL to analyze data in S3 through the S3A filesystem client. I documented my work, with instructions, here: