<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: increasing textFile() partitioning number anomaly in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/increasing-textFile-partitioning-number-anomoly/m-p/105778#M29974</link>
    <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/122/lmartin.html" nodeid="122"&gt;@Lester Martin&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Look at this blog post, which describes the internal workings of textFile: &lt;A href="http://www.bigsynapse.com/spark-input-output" target="_blank"&gt;http://www.bigsynapse.com/spark-input-output&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This PR discussion gives the rationale for why the default values are what they are: &lt;A href="https://github.com/mesos/spark/pull/718" target="_blank"&gt;https://github.com/mesos/spark/pull/718&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Hope this helps.&lt;/P&gt;</description>
    <pubDate>Mon, 30 May 2016 00:20:38 GMT</pubDate>
    <dc:creator>ahadjidj</dc:creator>
    <dc:date>2016-05-30T00:20:38Z</dc:date>
    <item>
      <title>increasing textFile() partitioning number anomaly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/increasing-textFile-partitioning-number-anomoly/m-p/105776#M29972</link>
      <description>&lt;P&gt;I've got a weird/wild one for sure and am wondering if anyone has any insight. Heck, I'm giving out "BONUS POINTS" for this one. I'm dabbling with sc.textFile()'s optional minPartitions parameter to make my Hadoop file have more RDD partitions than the number of HDFS blocks.&lt;/P&gt;&lt;P&gt;When testing with a single-block HDFS file, all works fine up to 8 partitions, but from 9 onward it seems to add one extra partition, as shown below.&lt;/P&gt;&lt;PRE&gt;&amp;gt;&amp;gt;&amp;gt; rdd1 = sc.textFile("statePopulations.csv",8)
&amp;gt;&amp;gt;&amp;gt; rdd1.getNumPartitions()
8
&amp;gt;&amp;gt;&amp;gt; rdd1 = sc.textFile("statePopulations.csv",9)
&amp;gt;&amp;gt;&amp;gt; rdd1.getNumPartitions()
10
&amp;gt;&amp;gt;&amp;gt; rdd1 = sc.textFile("statePopulations.csv",10)
&amp;gt;&amp;gt;&amp;gt; rdd1.getNumPartitions()
11 &lt;/PRE&gt;&lt;P&gt;I was wondering if there was some magical implementation activity happening at 9 partitions (or 9x the number of blocks), but I didn't see a similar behavior on a 5-block file I have. &lt;/P&gt;&lt;PRE&gt;&amp;gt;&amp;gt;&amp;gt; rdd2 = sc.textFile("/proto/2000.csv")
&amp;gt;&amp;gt;&amp;gt; rdd2.getNumPartitions()
5
&amp;gt;&amp;gt;&amp;gt; rdd2 = sc.textFile("/proto/2000.csv",9)
&amp;gt;&amp;gt;&amp;gt; rdd2.getNumPartitions()
9
&amp;gt;&amp;gt;&amp;gt; rdd2 = sc.textFile("/proto/2000.csv",45)
&amp;gt;&amp;gt;&amp;gt; rdd2.getNumPartitions()
45 &lt;/PRE&gt;&lt;P&gt;Really not a pressing concern, but sure has made me ask WTH? (What The Hadoop?)  Anyone know what's going on?&lt;/P&gt;</description>
      <pubDate>Sat, 28 May 2016 05:13:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/increasing-textFile-partitioning-number-anomoly/m-p/105776#M29972</guid>
      <dc:creator>LesterMartin</dc:creator>
      <dc:date>2016-05-28T05:13:10Z</dc:date>
    </item>
    <item>
      <title>Re: increasing textFile() partitioning number anomaly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/increasing-textFile-partitioning-number-anomoly/m-p/105777#M29973</link>
      <description>&lt;P&gt;Under the covers, Spark uses a Hadoop TextInputFormat to read the file. The minPartitions number is given as an input to the FileInputFormat getSplits method.&lt;/P&gt;&lt;P&gt;&lt;A href="http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/FileInputFormat.java#FileInputFormat.getSplits%28org.apache.hadoop.mapred.JobConf%2Cint%29" target="_blank"&gt;http://grepcode.com/file/repo1.maven.org/maven2/org.jvnet.hudson.hadoop/hadoop-core/0.19.1-hudson-2/org/apache/hadoop/mapred/FileInputFormat.java#FileInputFormat.getSplits%28org.apache.hadoop.mapred.JobConf%2Cint%29&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This function is pretty complex and uses a goalSize, blockSize and minSize to divide the file into splits, with goalSize being totalSize / numSplits.&lt;/P&gt;&lt;P&gt;Looking at it, it should normally honour your request, but you might be running into a scenario where you have a very small file and hit some rounding issues. You could try running the code with your block size to see if that is the case.&lt;/P&gt;&lt;P&gt;It should not matter though, since Hadoop will make sure that each record is processed exactly once (by ignoring the first unfinished record of any block and reading past the end of the split to finish the last record).&lt;/P&gt;</description>
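The rounding scenario described above can be sketched in a few lines of Python. This is a simplified, single-file approximation of the goalSize logic in the old org.apache.hadoop.mapred.FileInputFormat.getSplits, not the actual Hadoop code. The function name count_splits, the 128 MB default block size, and the 168-byte file length are assumptions chosen for illustration; the real size of statePopulations.csv is not given in the thread.

```python
# Simplified sketch of the split-counting logic; names and defaults are hypothetical.
SPLIT_SLOP = 1.1  # the final split may be up to 10% larger than split_size


def count_splits(file_len, num_splits, block_size=128 * 1024 * 1024, min_size=1):
    """Count the splits produced for one splittable file of file_len bytes."""
    # goalSize uses integer division, so a remainder can be left over.
    goal_size = file_len // max(num_splits, 1)
    split_size = max(min_size, min(goal_size, block_size))
    remaining = file_len
    splits = 0
    # Carve off full-size splits while more than 1.1 split sizes remain.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes one final split.
    if remaining != 0:
        splits += 1
    return splits


if __name__ == "__main__":
    # Hypothetical 168-byte file: requested split count vs. resulting count.
    for n in (8, 9, 10):
        print(n, count_splits(168, n))
```

With the assumed 168-byte file, requests for 8, 9 and 10 splits yield 8, 10 and 11 splits, the same pattern as in the question. For a multi-block file the remainder of the integer division is tiny relative to the split size, so it is absorbed into the last split, which would be consistent with the 5-block file returning exactly the requested count.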
      <pubDate>Sun, 29 May 2016 23:50:06 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/increasing-textFile-partitioning-number-anomoly/m-p/105777#M29973</guid>
      <dc:creator>bleonhardi</dc:creator>
      <dc:date>2016-05-29T23:50:06Z</dc:date>
    </item>
    <item>
      <title>Re: increasing textFile() partitioning number anomaly</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/increasing-textFile-partitioning-number-anomoly/m-p/105778#M29974</link>
      <description>&lt;P&gt;Hi &lt;A rel="user" href="https://community.cloudera.com/users/122/lmartin.html" nodeid="122"&gt;@Lester Martin&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Look at this blog post, which describes the internal workings of textFile: &lt;A href="http://www.bigsynapse.com/spark-input-output" target="_blank"&gt;http://www.bigsynapse.com/spark-input-output&lt;/A&gt;&lt;/P&gt;&lt;P&gt;This PR discussion gives the rationale for why the default values are what they are: &lt;A href="https://github.com/mesos/spark/pull/718" target="_blank"&gt;https://github.com/mesos/spark/pull/718&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Hope this helps.&lt;/P&gt;</description>
      <pubDate>Mon, 30 May 2016 00:20:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/increasing-textFile-partitioning-number-anomoly/m-p/105778#M29974</guid>
      <dc:creator>ahadjidj</dc:creator>
      <dc:date>2016-05-30T00:20:38Z</dc:date>
    </item>
  </channel>
</rss>

