Archives of Support Questions (Read Only)

This board is archived and read-only for historical reference. Information and links may no longer be available or relevant. To ask a new question, please post a new topic on the appropriate active board.

How can I use Spark to empty/delete data from an S3 bucket?


My ingest pipeline writes small files to S3 frequently. I have a periodic job that aggregates these into bigger files. Is there a way to use Spark to empty an S3 path? Something like "insert overwrite s3://bucket/my_folder" with an empty DataFrame?

1 ACCEPTED SOLUTION


If you're looking for a way to delete the S3 path directly from your code, and you're using PySpark, the following will work:

import os

# Shell out to the Hadoop CLI; -rm -r deletes the path recursively, and
# -skipTrash removes the objects immediately instead of moving them to trash.
cmd = "hdfs dfs -rm -r -skipTrash s3a://my-bucket/test_delete_me"
os.system(cmd)
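
If you would rather not shell out, the same recursive delete can be done through Hadoop's FileSystem API. This is a minimal sketch, assuming an active SparkSession named spark and an s3a filesystem configured on the cluster; note that spark._jvm and spark._jsc are Spark internals rather than public API:

# Sketch: recursive delete via the Hadoop FileSystem API.
# Assumes an active SparkSession named `spark`; `_jvm`/`_jsc` are internals.
path = spark._jvm.org.apache.hadoop.fs.Path("s3a://my-bucket/test_delete_me")
fs = path.getFileSystem(spark._jsc.hadoopConfiguration())
fs.delete(path, True)  # True = recursive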


2 REPLIES


There are two things you could do here:

  1. Use the AWS SDK to do a prefix listing, parallelize the result, and then do a mapPartitions, applying the following approach for multi-deletes (see the first sketch after this list): http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html
  2. Use two buckets: one for the original files, with a lifecycle policy that will apply the deletes, and another for the rolled-up data (see the second sketch after this list): http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle
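
For option 1, here is a minimal PySpark sketch that uses boto3 in place of the Java SDK from the link. The bucket and prefix names are placeholders, and it assumes AWS credentials are available to both the driver and the executors:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

BUCKET = "my-bucket"      # placeholder
PREFIX = "small_files/"   # placeholder

# List the keys under the prefix on the driver.
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

def delete_partition(part):
    # One client per partition; delete_objects accepts up to 1000 keys per call.
    client = boto3.client("s3")
    batch = []
    for key in part:
        batch.append({"Key": key})
        if len(batch) == 1000:
            client.delete_objects(Bucket=BUCKET, Delete={"Objects": batch})
            batch = []
    if batch:
        client.delete_objects(Bucket=BUCKET, Delete={"Objects": batch})
    return []

# Distribute the delete calls across the cluster; count() forces execution.
sc.parallelize(keys, numSlices=8).mapPartitions(delete_partition).count()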
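
For option 2, the lifecycle policy lives on the bucket itself, so S3 applies the deletes and no Spark job is needed for cleanup. Here is a sketch of attaching such a rule with boto3, with a hypothetical bucket name, prefix, and a one-day expiration:

import boto3

s3 = boto3.client("s3")
# Expire objects under the ingest prefix one day after creation.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ingest-bucket",  # hypothetical
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-small-files",
                "Filter": {"Prefix": "small_files/"},  # hypothetical
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)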
