Archives of Support Questions (Read Only)

This board is archived and read-only for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

How can I use Spark to empty/delete data from an S3 bucket?


My ingest pipeline writes small files to S3 frequently. I have a periodic job that aggregates these into bigger files. Is there a way to use Spark to empty an S3 path? Something like "insert overwrite s3://bucket/my_folder" with an empty DataFrame?

1 ACCEPTED SOLUTION


If you're looking for a way to delete the S3 path directly from your code, and you're using PySpark, the following will work:

import os

# Shell out to the Hadoop CLI; -rm -r deletes the path recursively, and
# -skipTrash removes the objects immediately instead of moving them to trash.
cmd = "hdfs dfs -rm -r -skipTrash s3a://my-bucket/test_delete_me"
os.system(cmd)
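
If you would rather not shell out, the same recursive delete can be done through Hadoop's FileSystem API. This is a minimal sketch, assuming an active SparkSession named spark and an s3a filesystem configured on the cluster; note that spark._jvm and spark._jsc are Spark internals rather than public API:

# Sketch: recursive delete via the Hadoop FileSystem API.
# Assumes an active SparkSession named `spark`; `_jvm`/`_jsc` are internals.
path = spark._jvm.org.apache.hadoop.fs.Path("s3a://my-bucket/test_delete_me")
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
fs.delete(path, True)  # True = recursive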


2 REPLIES


There are two things you could do here:

  1. Use the AWS SDK to do a prefix listing, parallelize the result, and then do a mapPartitions, applying the following approach for multi-deletes (see the first sketch after this list): http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html
  2. Use two buckets: one for the original files, with a lifecycle policy that will apply the deletes, and another for the rolled-up data (see the second sketch after this list): http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle
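
For option 1, here is a minimal PySpark sketch that uses boto3 in place of the Java SDK from the link. The bucket and prefix names are placeholders, and it assumes AWS credentials are available to both the driver and the executors:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

BUCKET = "my-bucket"      # placeholder
PREFIX = "small_files/"   # placeholder

# List the keys under the prefix on the driver.
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

def delete_partition(part):
    # One client per partition; delete_objects accepts up to 1000 keys per call.
    client = boto3.client("s3")
    batch = []
    for key in part:
        batch.append({"Key": key})
        if len(batch) == 1000:
            client.delete_objects(Bucket=BUCKET, Delete={"Objects": batch})
            batch = []
    if batch:
        client.delete_objects(Bucket=BUCKET, Delete={"Objects": batch})
    return []

# Distribute the delete calls across the cluster; count() forces execution.
sc.parallelize(keys, numSlices=8).mapPartitions(delete_partition).count()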
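
For option 2, the lifecycle policy lives on the bucket itself, so S3 applies the deletes and no Spark job is needed for cleanup. Here is a sketch of attaching such a rule with boto3, with a hypothetical bucket name, prefix, and a one-day expiration:

import boto3

s3 = boto3.client("s3")
# Expire objects under the ingest prefix one day after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ingest-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-small-files",
                "Filter": {"Prefix": "small_files/"},  # hypothetical
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)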
