How to Export DF data to S3 bucket

Hi All,

I am trying to export DataFrame (DF) data to an S3 bucket, but I am not able to. I am getting the error below.

WARN FileOutputCommitter: Could not delete s3a://bucketname/Output/CheckResult/_temporary/0/_temporary/attempt_20181114215639_0002_m_000000_0
18/11/14 21:56:40 ERROR FileFormatWriter: Job job_20181114215639_0002 aborted.

I have tried the code below for testing.

res.coalesce(1).write.format("csv").save("s3a://bucketname/Output/CheckResult")

I am not sure what the issue is here exactly. I have heard that Spark does not really support writes to non-distributed storage.

Kindly help me understand how to achieve this.

Many thanks.

2 REPLIES

Any help on this request, please?

Sorry, missed this.
The issue here is that S3 isn't a "real" filesystem: there is no file/directory rename, so instead the committer has to list every file created and copy it over. That relies on the listings being correct, which S3, being eventually consistent, doesn't always guarantee. It looks like you've hit an inconsistency during a job commit.

To get consistent listings (on HDP 3), enable S3Guard.
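For reference, S3Guard is configured through the S3A connector settings, which you can pass via Spark's Hadoop configuration passthrough. This is only a minimal sketch; the DynamoDB table name and region below are placeholders to replace with your own:

import org.apache.spark.sql.SparkSession

// Minimal sketch: enable S3Guard (DynamoDB metadata store) for the S3A connector.
// "my-s3guard-table" and "us-east-1" are placeholder values.
val spark = SparkSession.builder()
  .appName("S3GuardExample")
  .config("spark.hadoop.fs.s3a.metadatastore.impl",
          "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
  .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")
  .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-1")
  .config("spark.hadoop.fs.s3a.s3guard.ddb.table.create", "true")
  .getOrCreate()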

To avoid the slow rename process and the problems caused by inconsistency within a single query, switch to the "S3A committers", which come with Spark on HDP 3.0. These are specifically designed to write work safely into S3.
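As a rough sketch (assuming the cloud-integration classes that ship with Spark on HDP 3.0 are on the classpath), switching to the "directory" staging committer looks something like this; the committer name can also be "partitioned" or "magic" depending on your setup:

import org.apache.spark.sql.SparkSession

// Rough sketch: use the S3A "directory" staging committer instead of the
// rename-based FileOutputCommitter. Requires the Spark cloud-integration
// classes (spark-hadoop-cloud / HDP 3.0) on the classpath.
val spark = SparkSession.builder()
  .appName("S3ACommitterExample")
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  .config("spark.sql.sources.commitProtocolClass",
          "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
          "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()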

If you can't do either of those, you cannot safely use S3 as a direct destination for your work. You should write to HDFS first and then copy the result to S3 afterwards.
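For example (the HDFS staging path below is hypothetical), you could write the output to HDFS and then upload it with DistCp:

// Sketch: write the result to HDFS first, where rename-based commits are safe.
// "hdfs:///tmp/CheckResult" is a hypothetical staging path.
res.coalesce(1)
  .write
  .format("csv")
  .save("hdfs:///tmp/CheckResult")

// Then copy the finished output up to S3, e.g. with DistCp:
//   hadoop distcp hdfs:///tmp/CheckResult s3a://bucketname/Output/CheckResult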