<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Spark S3 write failed in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173619#M42178</link>
    <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/10750/edmundprout.html" nodeid="10750"&gt;@Ed Prout&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;I hit the same error in some Scala code and came across this post while looking for a solution:&lt;/P&gt;&lt;P&gt;&lt;A href="http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/" target="_blank"&gt;http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;It states that if you see this error, you need to downgrade "aws-java-sdk" to 1.7.4. Quoting the post:&lt;/P&gt;&lt;P&gt;"If you see a different exception message:&lt;/P&gt;&lt;PRE&gt;java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V
&lt;/PRE&gt;
&lt;P&gt;Then make sure you're using &lt;CODE&gt;aws-java-sdk-1.7.4.jar&lt;/CODE&gt; and not a more recent version."&lt;/P&gt;&lt;P&gt;I downgraded my jar to 1.7.4, and the problem disappeared.&lt;/P&gt;&lt;P&gt;I hope this helps.&lt;/P&gt;&lt;P&gt;/Kasper&lt;/P&gt;</description>
    <pubDate>Tue, 04 Oct 2016 16:31:24 GMT</pubDate>
    <dc:creator>kasperaaquist</dc:creator>
    <dc:date>2016-10-04T16:31:24Z</dc:date>
    <item>
      <title>Spark S3 write failed</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173618#M42177</link>
      <description>&lt;P&gt;I'm attempting to write a parquet file to an S3 bucket, but getting the below error:&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;py4j.protocol.Py4JJavaError: An error occurred while calling o36.parquet.
: java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManager.&amp;lt;init&amp;gt;(Lcom/amazonaws/services/s3/AmazonS3;Ljava/util/concurrent/ThreadPoolExecutor;)V
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:287)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:453)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:488)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;&lt;/P&gt;&lt;P&gt;The line of Python code that fails is:&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;df.write.parquet("s3a://myfolder/myotherfolder")&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;&lt;/P&gt;&lt;P&gt;The same line of code works successfully if I write to HDFS instead of S3:&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;df.write.parquet("hdfs://myfolder/myotherfolder")&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;&lt;/P&gt;&lt;P&gt;I'm using the spark-2.0.2-bin-hadoop2.7 and aws-java-sdk-1.11.38 binaries. Right now I'm running it interactively in PyCharm on my Mac.&lt;/P&gt;</description>
      <pubDate>Thu, 29 Sep 2016 01:47:56 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173618#M42177</guid>
      <dc:creator>edmund_prout</dc:creator>
      <dc:date>2016-09-29T01:47:56Z</dc:date>
    </item>
    <item>
      <title>Re: Spark S3 write failed</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173619#M42178</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/10750/edmundprout.html" nodeid="10750"&gt;@Ed Prout&lt;/A&gt;,&lt;/P&gt;&lt;P&gt;I hit the same error in some Scala code and came across this post while looking for a solution:&lt;/P&gt;&lt;P&gt;&lt;A href="http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/" target="_blank"&gt;http://deploymentzone.com/2015/12/20/s3a-on-spark-on-aws-ec2/&lt;/A&gt;&lt;/P&gt;&lt;P&gt;It states that if you see this error, you need to downgrade "aws-java-sdk" to 1.7.4. Quoting the post:&lt;/P&gt;&lt;P&gt;"If you see a different exception message:&lt;/P&gt;&lt;PRE&gt;java.lang.NoSuchMethodError: com.amazonaws.services.s3.transfer.TransferManagerConfiguration.setMultipartUploadThreshold(I)V
&lt;/PRE&gt;
&lt;P&gt;Then make sure you're using &lt;CODE&gt;aws-java-sdk-1.7.4.jar&lt;/CODE&gt; and not a more recent version."&lt;/P&gt;&lt;P&gt;I downgraded my jar to 1.7.4, and the problem disappeared.&lt;/P&gt;&lt;P&gt;I hope this helps.&lt;/P&gt;&lt;P&gt;/Kasper&lt;/P&gt;</description>
      <pubDate>Tue, 04 Oct 2016 16:31:24 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173619#M42178</guid>
      <dc:creator>kasperaaquist</dc:creator>
      <dc:date>2016-10-04T16:31:24Z</dc:date>
    </item>
    <item>
      <title>Re: Spark S3 write failed</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173620#M42179</link>
      <description>&lt;P&gt;I ran into the same problem and this worked for me, thanks!&lt;/P&gt;</description>
      <pubDate>Sat, 14 Jan 2017 14:42:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173620#M42179</guid>
      <dc:creator>imai</dc:creator>
      <dc:date>2017-01-14T14:42:23Z</dc:date>
    </item>
    <item>
      <title>Re: Spark S3 write failed</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173621#M42180</link>
      <description>&lt;P&gt;If things aren't working with HDP 2.5 or HDCloud, I'd recommend starting with &lt;A href="https://docs.hortonworks.com/HDPDocuments/HDCloudAWS/HDCloudAWS-1.8.0/bk_hdcloud-aws/content/s3-trouble/index.html" target="_blank"&gt;Troubleshooting S3a&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;If you are using ASF release binaries, those docs are mostly valid too, though since we pulled in many of the later S3a features coming in Hadoop 2.8 (after writing them!), the docs are a bit inconsistent. The closest ASF troubleshooting docs are those for &lt;A href="https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md#troubleshooting-s3a" target="_blank"&gt;Hadoop 2.8&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;As Kasper pointed out, this is due to AWS JAR versioning. The Amazon SDK has been pretty brittle against change, and you *must* run with the same version of the AWS SDK that Hadoop was built with (which in turn needs a consistent version of Jackson, ...).&lt;/P&gt;&lt;P&gt;Hadoop 2.7.x: AWS SDK 1.7.4&lt;/P&gt;&lt;P&gt;Hadoop 2.8.x: AWS SDK 1.10.6&lt;/P&gt;&lt;P&gt;Hadoop 2.9+: probably 1.10.11 or later, with Jackson bumped up to 2.7.8 to match.&lt;/P&gt;</description>
      <pubDate>Sat, 14 Jan 2017 21:24:23 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Spark-S3-write-failed/m-p/173621#M42180</guid>
      <dc:creator>stevel</dc:creator>
      <dc:date>2017-01-14T21:24:23Z</dc:date>
    </item>
  </channel>
</rss>