Support Questions

bharatbs13 · ‎06-13-2018

I am trying to read a CSV file from S3. The variable url is already set to some value.

>>> DF = spark.read.load(url,
...                           format="com.databricks.spark.csv",
...                           header="true",
...                           inferschema="true",
...                           delimiter=",")
18/06/13 11:16:24 WARN DataSource: Error while looking for metadata directory.
Traceback (most recent call last):
  File "<stdin>", line 5, in <module>
  File "/opt/sw/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/readwriter.py", line 149, in load
    return self._df(self._jreader.load(path))
  File "/opt/sw/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/opt/sw/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/opt/sw/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

How can I fix this issue? Your help is appreciated.

jsensharma · ‎06-13-2018

@bharat sharma

The following error indicates that the hadoop-aws jars are not on the classpath:

py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Please download the AWS SDK for Java from https://aws.amazon.com/sdk-for-java/ and upload it to the Hadoop directory.

Also check whether your "spark.driver.extraClassPath" includes the "hadoop-aws*.jar" and "aws-java-sdk*.jar".
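If it helps, here is a rough sketch of that classpath check in plain Python. The helper names (expand_entry, missing_s3a_jars) and the /opt/jars paths are made up for illustration, and the ":" separator assumes a Linux host. It only looks at jar file names, so it cannot tell you whether the jar versions match your Hadoop build.

```python
import fnmatch
import glob
import os

# Jar name patterns mentioned in the reply above.
REQUIRED_PATTERNS = ("hadoop-aws*.jar", "aws-java-sdk*.jar")


def expand_entry(entry):
    """Return the jar file names contributed by one classpath entry."""
    if entry.endswith("/*"):
        # JVM-style wildcard entry (e.g. /opt/jars/*): list the jars in that directory.
        return [os.path.basename(p) for p in glob.glob(entry + ".jar")]
    return [os.path.basename(entry)]


def missing_s3a_jars(classpath, sep=":"):
    """Return the required patterns that no entry on the classpath satisfies."""
    names = []
    for entry in classpath.split(sep):
        if entry:
            names.extend(expand_entry(entry))
    return [pat for pat in REQUIRED_PATTERNS if not fnmatch.filter(names, pat)]


if __name__ == "__main__":
    # Hypothetical classpath value, used only to show the check.
    example = "/opt/jars/hadoop-aws-2.7.3.jar:/opt/jars/guava.jar"
    print(missing_s3a_jars(example))  # ['aws-java-sdk*.jar']
```

In a running pyspark shell you could pass in the real setting with spark.sparkContext.getConf().get("spark.driver.extraClassPath", ""). An empty list means both jar patterns were found by name.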

For more details, please refer to:

https://community.hortonworks.com/articles/25523/hdp-240-and-spark-160-connecting-to-aws-s3-buckets....

https://community.hortonworks.com/articles/36339/spark-s3a-filesystem-client-from-hdp-to-access-s3.h...

Cloudera Community


py4j.protocol.Py4JJavaError in pyspark while reading file from S3