Support Questions

edmund_prout · ‎09-28-2016

I'm attempting to write a parquet file to an S3 bucket, but getting the below error:

    py4j.protocol.Py4JJavaError: An error occurred while calling o36.parquet.
    : java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.<init>(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
        at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:453)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)

The line of Python code that fails is:

df.write.parquet("s3a://myfolder/myotherfolder")

The same line of code works if I write to HDFS instead of S3:

df.write.parquet("hdfs://myfolder/myotherfolder")

I'm using spark-2.0.2-bin-hadoop2.7 and aws-java-sdk-1.11.38 binaries. Right now I'm running it interactively in PyCharm on my Mac.

kasperaaquist · ‎10-04-2016

Hi @Ed Prout,

I have had the same error in some Scala code, and I came across this post while looking for a fix:

Site: http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/

It says that if you see this kind of error, you need to downgrade the aws-java-sdk JAR to 1.7.4.

"If you see a different exception message:

`java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V`

Then make sure you're using aws-java-sdk-1.7.4.jar and not a more recent version."

I downgraded my JAR to 1.7.4, and the problem disappeared.

I hope this helps.

/Kasper
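
For the PySpark setup in the original question, the downgrade could look roughly like the sketch below. It is only an illustration under some assumptions: that the `spark-2.0.2-bin-hadoop2.7` build bundles Hadoop 2.7.3 (so `hadoop-aws` is pinned to 2.7.3), that the jars are pulled through `spark.jars.packages` rather than copied by hand, and that no JVM has been started yet in the Python session. The helper names `s3a_packages` and `build_session` are made up for this example.

```python
# Versions to pin. AWS SDK 1.7.4 comes from the answer above;
# hadoop-aws 2.7.3 is an ASSUMPTION about the bundled Hadoop version.
HADOOP_AWS_VERSION = "2.7.3"
AWS_SDK_VERSION = "1.7.4"


def s3a_packages(hadoop_aws_version=HADOOP_AWS_VERSION,
                 aws_sdk_version=AWS_SDK_VERSION):
    """Return a comma-separated list of Maven coordinates for spark.jars.packages."""
    return ",".join([
        "org.apache.hadoop:hadoop-aws:" + hadoop_aws_version,
        "com.amazonaws:aws-java-sdk:" + aws_sdk_version,
    ])


def build_session(app_name="s3a-parquet-demo"):
    """Create a SparkSession that resolves the pinned jars at startup.

    pyspark is imported lazily so that s3a_packages() can be used
    (and checked) on a machine without Spark installed.
    """
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)
        .config("spark.jars.packages", s3a_packages())
        .getOrCreate()
    )
```

With a session from `build_session()`, the failing call from the question stays the same, e.g. `df.write.parquet("s3a://myfolder/myotherfolder")`. Any newer `aws-java-sdk` jar still on the classpath would have to be removed as well, otherwise the same `NoSuchMethodError` can come back.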

imai · ‎01-14-2017

I ran into the same problem and this worked for me, thanks!

stevel · ‎01-14-2017

If things aren't working with HDP 2.5 or HDCloud, I'd recommend starting with [Troubleshooting S3a](https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html)

If you are using ASF-released binaries, those docs are mostly valid too. However, because we pulled in many of the later S3A features coming in Hadoop 2.8 (after writing them!), the docs are a bit inconsistent. The closest ASF docs on troubleshooting are those for [Hadoop 2.8](https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md#troubleshooting-s3a).

As Kasper pointed out, this is due to AWS JAR versioning. The Amazon SDK has been pretty brittle against change, and you *must* run with the same version of the AWS SDK that Hadoop was built with (which also needs a consistent version of Jackson, ...).

Hadoop 2.7.x: AWS SDK 1.7.4

Hadoop 2.8.x: AWS SDK 1.10.6

Hadoop 2.9+: probably AWS SDK 1.11 or later, with Jackson bumped up to 2.7.8 to match.
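
As a rough illustration of the pairing rule above, here is a small Python sketch that checks an `aws-java-sdk` jar name against the Hadoop release line. The function names are hypothetical, the table holds only the two pairings stated firmly above (Hadoop 2.9+ is left out because its version is only a guess), and the filename pattern is an assumption about how the jars are usually named.

```python
import re

# Pairings taken from the list above. Hadoop 2.9+ is deliberately omitted
# because only a tentative SDK version is given for it.
SDK_FOR_HADOOP = {
    "2.7": "1.7.4",
    "2.8": "1.10.6",
}

# ASSUMED naming scheme: aws-java-sdk[-module]-<version>.jar
JAR_PATTERN = re.compile(r"aws-java-sdk(?:-[a-z][a-z0-9]*)*-(\d+(?:\.\d+)+)\.jar$")


def expected_sdk(hadoop_version):
    """Return the AWS SDK version listed for a Hadoop release line, or None."""
    major_minor = ".".join(hadoop_version.split(".")[:2])
    return SDK_FOR_HADOOP.get(major_minor)


def sdk_version_from_jar(jar_name):
    """Extract '1.11.38' from 'aws-java-sdk-1.11.38.jar'; None if the name does not match."""
    match = JAR_PATTERN.match(jar_name)
    return match.group(1) if match else None


def check_jar(hadoop_version, jar_name):
    """Return 'ok', 'unknown', or a mismatch message for one jar."""
    expected = expected_sdk(hadoop_version)
    found = sdk_version_from_jar(jar_name)
    if expected is None or found is None:
        return "unknown"
    if found == expected:
        return "ok"
    return "mismatch: expected {} but found {}".format(expected, found)
```

For the combination in the original question, `check_jar("2.7.3", "aws-java-sdk-1.11.38.jar")` returns `"mismatch: expected 1.7.4 but found 1.11.38"`, while the 1.7.4 jar gives `"ok"`.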
