<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How can I use Spark to empty/delete data from an S3 bucket? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135965#M31718</link>
    <description>&lt;P&gt;There are two things you could do here:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Use the AWS SDK to do a prefix listing, parallelize the result, and then probably do a mapPartitions, applying the following approach for multi-deletes: &lt;A href="http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html" target="_blank"&gt;http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Use two buckets: one for the original files, with a lifecycle policy that will apply the deletes, and another for the rolled-up data: &lt;A href="http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle" target="_blank"&gt;http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle&lt;/A&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
    <pubDate>Tue, 14 Jun 2016 01:00:08 GMT</pubDate>
    <dc:creator>jfrazee</dc:creator>
    <dc:date>2016-06-14T01:00:08Z</dc:date>
    <item>
      <title>How can I use Spark to empty/delete data from an S3 bucket?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135964#M31717</link>
      <description>&lt;P&gt;My ingest pipeline writes small files to S3 frequently. I have a periodic job that aggregates these into bigger files. Is there a way to use Spark to empty an S3 path? Something like "insert overwrite s3://bucket/my_folder" with an empty DataFrame?&lt;/P&gt;</description>
      <pubDate>Mon, 13 Jun 2016 22:37:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135964#M31717</guid>
      <dc:creator>rgelhausen</dc:creator>
      <dc:date>2016-06-13T22:37:22Z</dc:date>
    </item>
    <item>
      <title>Re: How can I use Spark to empty/delete data from an S3 bucket?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135965#M31718</link>
      <description>&lt;P&gt;There are two things you could do here:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Use the AWS SDK to do a prefix listing, parallelize the result, and then probably do a mapPartitions, applying the following approach for multi-deletes: &lt;A href="http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html" target="_blank"&gt;http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Use two buckets: one for the original files, with a lifecycle policy that will apply the deletes, and another for the rolled-up data: &lt;A href="http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle" target="_blank"&gt;http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle&lt;/A&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Tue, 14 Jun 2016 01:00:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135965#M31718</guid>
      <dc:creator>jfrazee</dc:creator>
      <dc:date>2016-06-14T01:00:08Z</dc:date>
    </item>
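    <!--
Editor's note: the SDK approach in the reply above can be sketched in Python; a minimal, hypothetical example assuming boto3 is available (the bucket and prefix names are illustrative, not from the thread). S3's DeleteObjects API accepts at most 1,000 keys per request, so the listed keys must be chunked; inside a Spark job, each partition of keys could run this same batching logic.

```python
def chunk_keys(keys, batch_size=1000):
    """Split a list of S3 keys into DeleteObjects-sized request payloads."""
    return [
        {"Objects": [{"Key": k} for k in keys[i:i + batch_size]]}
        for i in range(0, len(keys), batch_size)
    ]

def delete_prefix(s3_client, bucket, prefix):
    """List every object under `prefix` and delete them in batches."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]
    for batch in chunk_keys(keys):
        s3_client.delete_objects(Bucket=bucket, Delete=batch)

# Usage (requires AWS credentials and boto3; names are hypothetical):
# import boto3
# delete_prefix(boto3.client("s3"), "my-bucket", "my_folder/")
```
    -->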
    <item>
      <title>Re: How can I use Spark to empty/delete data from an S3 bucket?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135966#M31719</link>
      <description>&lt;P&gt;If you're looking for a way to delete the S3 path directly from your code and you're using PySpark, then the following will work:&lt;/P&gt;&lt;PRE&gt;import os&#10;&#10;cmd = "hdfs dfs -rm -r -skipTrash s3a://my-bucket/test_delete_me"&#10;os.system(cmd)&lt;/PRE&gt;</description>
      <pubDate>Tue, 14 Jun 2016 02:00:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135966#M31719</guid>
      <dc:creator>bmathew</dc:creator>
      <dc:date>2016-06-14T02:00:01Z</dc:date>
    </item>
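    <!--
Editor's note: the shell-out in the reply above can be made a little more robust by building the argv list explicitly and running it with subprocess.run (which reports failures) instead of os.system. A minimal sketch; the S3 path is the post's own example, and actually running the command requires a configured Hadoop client on the PATH.

```python
import subprocess

def build_delete_cmd(s3_path):
    """argv for a recursive, trash-skipping HDFS-shell delete of an S3 path."""
    return ["hdfs", "dfs", "-rm", "-r", "-skipTrash", s3_path]

# Usage (requires a configured Hadoop client; raises on a non-zero exit code):
# subprocess.run(build_delete_cmd("s3a://my-bucket/test_delete_me"), check=True)
```
    -->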
  </channel>
</rss>

