How can I use Spark to empty/delete data from an S3 bucket?
Labels: Apache Spark
Created 06-13-2016 03:37 PM
My ingest pipeline writes small files to S3 frequently. I have a periodic job that aggregates these into bigger files. Is there a way to use Spark to empty an S3 path? Something like "insert overwrite s3://bucket/my_folder" with an empty DataFrame?
Created 06-13-2016 07:00 PM
If you're looking for a way to delete the S3 path from your code, and you're using PySpark, then the following will work:
import os

# Recursively remove the prefix, bypassing the trash (requires the s3a connector to be configured)
cmd = "hdfs dfs -rm -r -skipTrash s3a://my-bucket/test_delete_me"
os.system(cmd)
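If you would rather not shell out, here is a minimal sketch of the same delete through the Hadoop FileSystem API via Spark's JVM gateway; note that _jvm and _jsc are internal PySpark handles rather than public API, and the path below is the same placeholder as above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

# Resolve the filesystem for the s3a path and delete it recursively
path = jvm.org.apache.hadoop.fs.Path("s3a://my-bucket/test_delete_me")
fs = path.getFileSystem(hadoop_conf)
fs.delete(path, True)  # True = recursive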
Created 06-13-2016 06:00 PM
There are two things you could do here:
- Use the AWS SDK to do a prefix listing, parallelize the result, and then do a mapPartitions that applies the multi-delete approach described here (see the first sketch after this list): http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html
- Use two buckets: one for the original small files, with a lifecycle policy that applies the deletes (see the second sketch after this list), and another for the rolled-up data: http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle
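A minimal PySpark sketch of the first option using boto3; the bucket name, prefix, and partition count are placeholders, and it assumes AWS credentials are available to the executors:

import boto3
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

bucket = "my-bucket"   # placeholder bucket
prefix = "my_folder/"  # placeholder prefix to empty

# List every key under the prefix on the driver
s3 = boto3.client("s3")
keys = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

def delete_partition(part_keys):
    # One client per partition; delete_objects accepts at most 1000 keys per call
    import boto3
    client = boto3.client("s3")
    batch = list(part_keys)
    for i in range(0, len(batch), 1000):
        client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch[i:i + 1000]]},
        )
    return iter([])

# Spread the key list across the cluster and issue the multi-deletes in parallel
sc.parallelize(keys, numSlices=8).mapPartitions(delete_partition).count()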
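And a sketch of the lifecycle-policy option set from boto3 rather than the console; the bucket name, rule ID, prefix, and one-day expiration are all assumptions to adapt:

import boto3

s3 = boto3.client("s3")

# Expire (delete) objects under the ingest prefix one day after creation;
# write the rolled-up output to a separate bucket that has no such rule.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-ingest-bucket",                       # placeholder bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-small-ingest-files",   # placeholder rule name
                "Filter": {"Prefix": "my_folder/"},  # placeholder prefix
                "Status": "Enabled",
                "Expiration": {"Days": 1},
            }
        ]
    },
)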