Support Questions

How can I use Spark to empty/delete data from an S3 bucket?

My ingest pipeline writes small files to S3 frequently. I have a periodic job that aggregates these into bigger files. Is there a way to use Spark to empty an S3 path? Something like "insert overwrite s3://bucket/my_folder" with an empty DataFrame?

1 ACCEPTED SOLUTION


If you're looking for a way to simply delete the S3 path from your code, and you're using PySpark, the following will work:

import os

# Recursively remove the S3 path via the Hadoop CLI, bypassing the trash
cmd = "hdfs dfs -rm -r -skipTrash s3a://my-bucket/test_delete_me"

os.system(cmd)
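The same shell call can also be sketched with subprocess instead of os.system, so a failed delete raises an error rather than being silently ignored (the bucket and path are placeholders, and the helper names are illustrative):

```python
import subprocess

def build_delete_cmd(path):
    # Hadoop CLI arguments for a recursive, trash-bypassing delete
    return ["hdfs", "dfs", "-rm", "-r", "-skipTrash", path]

def delete_s3_path(path):
    # check_call raises CalledProcessError on a nonzero exit status
    subprocess.check_call(build_delete_cmd(path))
```

Usage would look like delete_s3_path("s3a://my-bucket/test_delete_me"); checking the exit status matters here because os.system quietly returns a nonzero code that is easy to ignore.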


2 REPLIES

There are two things you could do here:

  1. Use the AWS SDK to do a prefix listing, parallelize the result, and then probably do a mapPartitions, applying the following approach for multi-deletes: http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html
  2. Use two buckets, one for the original files with a lifecycle policy that will apply the deletes, and another for the rolled up data: http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle
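For option 1, a minimal sketch of the multi-delete approach in Python with boto3 (the counterpart of the Java SDK call linked above); note that S3's DeleteObjects API accepts at most 1,000 keys per request, and the bucket/prefix names and helper names here are placeholders, assuming boto3 is installed and AWS credentials are configured:

```python
def batches(keys, size=1000):
    # S3's DeleteObjects API accepts at most 1,000 keys per request
    for i in range(0, len(keys), size):
        yield keys[i:i + size]

def delete_prefix(bucket, prefix):
    # Assumes boto3 is installed and AWS credentials are configured
    import boto3
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    keys = [obj["Key"]
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
            for obj in page.get("Contents", [])]
    for batch in batches(keys):
        s3.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch]})
```

To run this from Spark as the reply suggests, you could parallelize the listed keys and issue the batched deletes inside foreachPartition on the executors, rather than deleting everything from the driver.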
