<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How split calculate in Spark ? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116209#M26195</link>
    <description>&lt;P&gt;The two answers above are very good. One caveat: when reading compressed file formats from disk, Spark partitioning depends on whether the format is &lt;EM&gt;splittable&lt;/EM&gt;. For instance, &lt;STRONG&gt;bzip2&lt;/STRONG&gt; and &lt;STRONG&gt;LZO&lt;/STRONG&gt; (if indexed) are splittable; &lt;STRONG&gt;snappy&lt;/STRONG&gt; is splittable only when used inside a container format such as SequenceFile or Parquet; and &lt;STRONG&gt;gzip&lt;/STRONG&gt; is not splittable, so a gzip file is read into a single partition. Here is documentation explaining why:&lt;/P&gt;&lt;P&gt;&lt;A href="http://comphadoop.weebly.com/"&gt;http://comphadoop.weebly.com/&lt;/A&gt;&lt;/P&gt;</description>
    <pubDate>Wed, 27 Apr 2016 02:02:33 GMT</pubDate>
    <dc:creator>phargis</dc:creator>
    <dc:date>2016-04-27T02:02:33Z</dc:date>
    <item>
      <title>How split calculate in Spark ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116206#M26192</link>
      <description>&lt;P&gt;When loading a file from HDFS into an RDD, how is the data split across partitions? Is there anything like a Hadoop input split?&lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2016 14:16:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116206#M26192</guid>
      <dc:creator>vadivel_samband</dc:creator>
      <dc:date>2016-04-26T14:16:10Z</dc:date>
    </item>
    <item>
      <title>Re: How split calculate in Spark ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116207#M26193</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/3812/vadivelsambandam.html" nodeid="3812"&gt;@vadivel sambandam&lt;/A&gt; &lt;/P&gt;&lt;P&gt;Spark input splits works same way as Hadoop input splits, it uses same underlining hadoop InputFormat API's. When it comes to the spark partitions, by default it will create one partition for each hdfs blocks, For example: if you have file with 1GB size and your hdfs block size is 128 MB then you will have  total 8 HDFS blocks and spark will create 8 partitions by default . But incase if you want further split within partition then it would be done on line split.&lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2016 16:34:25 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116207#M26193</guid>
      <dc:creator>jyadav</dc:creator>
      <dc:date>2016-04-26T16:34:25Z</dc:date>
    </item>
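The block-to-partition arithmetic in the answer above (1 GB file, 128 MB blocks, 8 partitions) can be sketched in plain Python; `default_partition_count` is an illustrative helper, not a Spark API, and pyspark is not required:

```python
import math

def default_partition_count(file_size_bytes, block_size_bytes):
    """One partition per HDFS block, so round up for a partial last block."""
    return math.ceil(file_size_bytes / block_size_bytes)

# 1 GB file with a 128 MB HDFS block size: 8 blocks, hence 8 default partitions.
one_gb = 1024 * 1024 * 1024
block = 128 * 1024 * 1024
print(default_partition_count(one_gb, block))  # prints 8

# A 1.5 GB file would occupy 12 full blocks, so 12 partitions.
print(default_partition_count(one_gb + one_gb // 2, block))  # prints 12
```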
    <item>
      <title>Re: How split calculate in Spark ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116208#M26194</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/3812/vadivelsambandam.html" nodeid="3812"&gt;@vadivel sambandam&lt;/A&gt;&lt;P&gt;On ingest, Spark relies on HDFS settings to determine the splits based on block size which maps 1:1 to RDD partition. However, Spark then gives you fine grain control over the number of partitions at run time. Spark provides transformation like repartition, coalesce, and repartitionAndSortWithinPartition give you direct control over the number of partitions being computed. When these transformations are used correctly, they can greatly improve the efficiency of the Spark job. &lt;/P&gt;</description>
      <pubDate>Tue, 26 Apr 2016 19:55:49 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116208#M26194</guid>
      <dc:creator>vvaks</dc:creator>
      <dc:date>2016-04-26T19:55:49Z</dc:date>
    </item>
    <item>
      <title>Re: How split calculate in Spark ?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116209#M26195</link>
      <description>&lt;P&gt;The two answers above are very good. One caveat: when reading compressed file formats from disk, Spark partitioning depends on whether the format is &lt;EM&gt;splittable&lt;/EM&gt;. For instance, &lt;STRONG&gt;bzip2&lt;/STRONG&gt; and &lt;STRONG&gt;LZO&lt;/STRONG&gt; (if indexed) are splittable; &lt;STRONG&gt;snappy&lt;/STRONG&gt; is splittable only when used inside a container format such as SequenceFile or Parquet; and &lt;STRONG&gt;gzip&lt;/STRONG&gt; is not splittable, so a gzip file is read into a single partition. Here is documentation explaining why:&lt;/P&gt;&lt;P&gt;&lt;A href="http://comphadoop.weebly.com/"&gt;http://comphadoop.weebly.com/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Wed, 27 Apr 2016 02:02:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-split-calculate-in-Spark/m-p/116209#M26195</guid>
      <dc:creator>phargis</dc:creator>
      <dc:date>2016-04-27T02:02:33Z</dc:date>
    </item>
  </channel>
</rss>

