<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How can I use Spark to empty/delete data from an S3 bucket? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135965#M31718</link>
    <description>&lt;P&gt;There are two things you could do here:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Use the AWS SDK to do a prefix listing, parallelize the result, and then probably do a mapPartitions, applying the following approach for multi-deletes: &lt;A href="http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html" target="_blank"&gt;http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Use two buckets: one for the original files, with a lifecycle policy that will apply the deletes, and another for the rolled-up data: &lt;A href="http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle" target="_blank"&gt;http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle&lt;/A&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
    <pubDate>Tue, 14 Jun 2016 01:00:08 GMT</pubDate>
    <dc:creator>jfrazee</dc:creator>
    <dc:date>2016-06-14T01:00:08Z</dc:date>
    <item>
      <title>How can I use Spark to empty/delete data from an S3 bucket?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135964#M31717</link>
      <description>&lt;P&gt;My ingest pipeline writes small files to S3 frequently. I have a periodic job that aggregates these into bigger files. Is there a way to use Spark to empty an S3 path? Something like "insert overwrite s3://bucket/my_folder" with an empty DataFrame?&lt;/P&gt;</description>
      <pubDate>Mon, 13 Jun 2016 22:37:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135964#M31717</guid>
      <dc:creator>rgelhausen</dc:creator>
      <dc:date>2016-06-13T22:37:22Z</dc:date>
    </item>
    <item>
      <title>Re: How can I use Spark to empty/delete data from an S3 bucket?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135965#M31718</link>
      <description>&lt;P&gt;There are two things you could do here:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Use the AWS SDK to do a prefix listing, parallelize the result, and then probably do a mapPartitions, applying the following approach for multi-deletes: &lt;A href="http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html" target="_blank"&gt;http://docs.aws.amazon.com/AmazonS3/latest/dev/DeletingMultipleObjectsUsingJava.html&lt;/A&gt;&lt;/LI&gt;&lt;LI&gt;Use two buckets: one for the original files, with a lifecycle policy that will apply the deletes, and another for the rolled-up data: &lt;A href="http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle" target="_blank"&gt;http://docs.aws.amazon.com/AmazonS3/latest/dev/delete-or-empty-bucket.html#delete-bucket-lifecycle&lt;/A&gt;&lt;/LI&gt;&lt;/OL&gt;</description>
      <pubDate>Tue, 14 Jun 2016 01:00:08 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135965#M31718</guid>
      <dc:creator>jfrazee</dc:creator>
      <dc:date>2016-06-14T01:00:08Z</dc:date>
    </item>
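    <!--
Editor's note: the SDK approach in the reply above can be sketched in Python; a minimal, hypothetical example assuming boto3 is available (the bucket and prefix names are illustrative, not from the thread). S3's DeleteObjects API accepts at most 1,000 keys per request, so the listed keys must be chunked; inside a Spark job, each partition of keys could run this same batching logic.

```python
def chunk_keys(keys, batch_size=1000):
    """Split a list of S3 keys into DeleteObjects-sized request payloads."""
    return [
        {"Objects": [{"Key": k} for k in keys[i:i + batch_size]]}
        for i in range(0, len(keys), batch_size)
    ]

def delete_prefix(s3_client, bucket, prefix):
    """List every object under `prefix` and delete them in batches."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = [
        obj["Key"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
    ]
    for batch in chunk_keys(keys):
        s3_client.delete_objects(Bucket=bucket, Delete=batch)

# Usage (requires AWS credentials and boto3; names are hypothetical):
# import boto3
# delete_prefix(boto3.client("s3"), "my-bucket", "my_folder/")
```
    -->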
    <item>
      <title>Re: How can I use Spark to empty/delete data from an S3 bucket?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135966#M31719</link>
      <description>&lt;P&gt;If you're looking for a way to delete the S3 path directly from your code and you're using PySpark, then the following will work:&lt;/P&gt;&lt;PRE&gt;import os&#10;&#10;cmd = "hdfs dfs -rm -r -skipTrash s3a://my-bucket/test_delete_me"&#10;os.system(cmd)&lt;/PRE&gt;</description>
      <pubDate>Tue, 14 Jun 2016 02:00:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-can-I-use-Spark-to-empty-delete-data-from-an-S3-bucket/m-p/135966#M31719</guid>
      <dc:creator>bmathew</dc:creator>
      <dc:date>2016-06-14T02:00:01Z</dc:date>
    </item>
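    <!--
Editor's note: the shell-out in the reply above can be made a little more robust by building the argv list explicitly and running it with subprocess.run (which reports failures) instead of os.system. A minimal sketch; the S3 path is the post's own example, and actually running the command requires a configured Hadoop client on the PATH.

```python
import subprocess

def build_delete_cmd(s3_path):
    """argv for a recursive, trash-skipping HDFS-shell delete of an S3 path."""
    return ["hdfs", "dfs", "-rm", "-r", "-skipTrash", s3_path]

# Usage (requires a configured Hadoop client; raises on a non-zero exit code):
# subprocess.run(build_delete_cmd("s3a://my-bucket/test_delete_me"), check=True)
```
    -->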
  </channel>
</rss>

