<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>How to save each partition of a DataFrame/Dataset in parallel with partitionBy or insertInto to Hive</title>
    <link>https://community.cloudera.com/t5/Support-Questions/How-to-save-each-partition-of-a-Dataframe-Dataset-in/m-p/46811#M24813</link>
    <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I currently use Spark 2.0.1 and I am trying to save my Dataset either into a partitioned Hive table with insertInto(), or to S3 storage with partitionBy("col"), running the jobs concurrently (in parallel). With both methods, each partition of my Dataset is saved sequentially, one by one, which is very slow.&lt;/P&gt;&lt;P&gt;I already know that I must use either insertInto() or partitionBy(), not both at the same time.&lt;/P&gt;&lt;P&gt;I assume that in Spark 2.0.1 DataFrames are backed by Resilient Distributed Datasets (RDDs).&lt;/P&gt;&lt;P&gt;My current code:&lt;/P&gt;&lt;PRE&gt;df.write.mode(SaveMode.Append).partitionBy("col").save("s3://bucket/diroutput")&lt;/PRE&gt;&lt;P&gt;Or:&lt;/P&gt;&lt;PRE&gt;df.write.mode(SaveMode.Append).insertInto("TableHivealreadypartitioned")&lt;/PRE&gt;&lt;P&gt;So I tried something with df.foreachPartition, like this:&lt;/P&gt;&lt;PRE&gt;df.foreachPartition{datasetpartition =&amp;gt; datasetpartition.foreach(row =&amp;gt; row.sometransformation)}&lt;/PRE&gt;&lt;P&gt;Unfortunately, I still have not found a way to write/save each partition of my Dataset in parallel.&lt;/P&gt;&lt;P&gt;Has someone already done this?&lt;/P&gt;&lt;P&gt;Can you tell me how to proceed?&lt;/P&gt;&lt;P&gt;Or is this the wrong direction?&lt;/P&gt;&lt;P&gt;Thanks for your help.&lt;/P&gt;</description>
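    <!-- Editor's note: partitionBy/insertInto do not inherently serialize the write; each
         executor task writes its own files. What often looks sequential is skew, i.e. all
         rows for each output key landing in one task. A common mitigation is to repartition
         by the partition column before writing. A minimal sketch, assuming a DataFrame `df`
         with a partition column "col"; the bucket path is hypothetical: -->

```scala
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.col

// Spread rows across tasks by the partition column so that the
// per-partition output files are produced by many tasks in parallel,
// rather than one task iterating over every partition value.
df.repartition(col("col"))
  .write
  .mode(SaveMode.Append)
  .partitionBy("col")
  .save("s3://bucket/diroutput")
```

    <!-- Trade-off: repartition(col("col")) yields one task per distinct key value, which
         avoids many small files per key but can itself skew if one key dominates. -->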
    <pubDate>Fri, 16 Sep 2022 10:46:14 GMT</pubDate>
    <dc:creator>damdr</dc:creator>
    <dc:date>2022-09-16T10:46:14Z</dc:date>
  </channel>
</rss>