Created 06-13-2016 03:37 PM
My ingest pipeline writes small files to S3 frequently. I have a periodic job that aggregates these into bigger files. Is there a way to use Spark to empty an S3 path? Something like "insert overwrite s3://bucket/my_folder" with an empty DataFrame?
Created 06-13-2016 06:00 PM
There are two things you could do here:
Created 06-13-2016 07:00 PM
If you're looking for a way to delete the S3 path from your code, and you're using PySpark, the following will work:

import os

# Shell out to the Hadoop CLI to recursively delete the S3 path, skipping the trash.
cmd = "hdfs dfs -rm -r -skipTrash s3a://my-bucket/test_delete_me"
os.system(cmd)
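If shelling out to the hdfs CLI is not an option, an alternative is to call the Hadoop FileSystem API directly through PySpark's JVM gateway. This is only a sketch: it assumes an active SparkSession, reuses the illustrative path s3a://my-bucket/test_delete_me, and relies on the internal _jvm and _jsc accessors, which are not part of the public PySpark API.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a Hadoop Path for the S3 location and obtain its FileSystem.
jvm = spark.sparkContext._jvm
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path("s3a://my-bucket/test_delete_me")
fs = path.getFileSystem(hadoop_conf)

# Recursively delete the path if it exists (second argument True = recursive).
if fs.exists(path):
    fs.delete(path, True)

This keeps everything inside the Spark driver process and uses whatever S3 credentials the Hadoop configuration already carries, rather than depending on the hdfs command being installed and configured on the machine running the job.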