<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: How to know the degree of parallelism available in my Hadoop Cluster? in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169880#M54008</link>
    <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/15981/leonardobrunoaraujo.html" nodeid="15981"&gt;@Leonardo Araujo&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The number of mappers is determined by the split size. Use the --direct-split-size option to specify how much data one mapper will handle, and use --split-by to specify which column to split on. The following is from the Sqoop documentation:&lt;/P&gt;&lt;P&gt;When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a &lt;EM&gt;splitting column&lt;/EM&gt; to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of &lt;CODE&gt;id&lt;/CODE&gt; whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form &lt;CODE&gt;SELECT * FROM sometable WHERE id &amp;gt;= lo AND id &amp;lt; hi&lt;/CODE&gt;, with &lt;CODE&gt;(lo, hi)&lt;/CODE&gt; set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.&lt;/P&gt;&lt;P&gt;If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the &lt;CODE&gt;--split-by&lt;/CODE&gt; argument. For example, &lt;CODE&gt;--split-by employee_id&lt;/CODE&gt;. Sqoop cannot currently split on multi-column indices.&lt;/P&gt;</description>
    <pubDate>Fri, 10 Feb 2017 23:37:38 GMT</pubDate>
    <dc:creator>mqureshi</dc:creator>
    <dc:date>2017-02-10T23:37:38Z</dc:date>
    <item>
      <title>How to know the degree of parallelism available in my Hadoop Cluster?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169879#M54007</link>
      <description>&lt;P&gt;How to know the degree of parallelism available in my Hadoop Cluster? I'd like to understand the proper/good value for the number of mappers in job sqoop (--num-mappers argument). HDP version: 2.2.0&lt;/P&gt;&lt;P&gt;&lt;A href="http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_controlling_parallelism" target="_blank"&gt;http://sqoop.apache.org/docs/1.4.6/SqoopUserGuide.html#_controlling_parallelism&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2017 21:43:40 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169879#M54007</guid>
      <dc:creator>Leo_BR</dc:creator>
      <dc:date>2017-02-10T21:43:40Z</dc:date>
    </item>
    <item>
      <title>Re: How to know the degree of parallelism available in my Hadoop Cluster?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169880#M54008</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/15981/leonardobrunoaraujo.html" nodeid="15981"&gt;@Leonardo Araujo&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The number of mappers is determined by the split size. Use the --direct-split-size option to specify how much data one mapper will handle, and use --split-by to specify which column to split on. The following is from the Sqoop documentation:&lt;/P&gt;&lt;P&gt;When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a &lt;EM&gt;splitting column&lt;/EM&gt; to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly-sized components of the total range. For example, if you had a table with a primary key column of &lt;CODE&gt;id&lt;/CODE&gt; whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form &lt;CODE&gt;SELECT * FROM sometable WHERE id &amp;gt;= lo AND id &amp;lt; hi&lt;/CODE&gt;, with &lt;CODE&gt;(lo, hi)&lt;/CODE&gt; set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.&lt;/P&gt;&lt;P&gt;If the actual values for the primary key are not uniformly distributed across its range, then this can result in unbalanced tasks. You should explicitly choose a different column with the &lt;CODE&gt;--split-by&lt;/CODE&gt; argument. For example, &lt;CODE&gt;--split-by employee_id&lt;/CODE&gt;. Sqoop cannot currently split on multi-column indices.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Feb 2017 23:37:38 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169880#M54008</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2017-02-10T23:37:38Z</dc:date>
    </item>
    <item>
      <title>Re: How to know the degree of parallelism available in my Hadoop Cluster?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169881#M54009</link>
      <description>&lt;P style="margin-left: 20px;"&gt;Thank you &lt;A rel="user" href="https://community.cloudera.com/users/10969/mqureshi.html" nodeid="10969"&gt;@mqureshi&lt;/A&gt;, but I'd like to know if there is a best practice for calculating a good/proper value for the --num-mappers argument, for instance: 6, 8, 10, 30, 40, etc. How do I know which of them is the most appropriate for my sqoop job? Thanks, Leonardo&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2017 00:59:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169881#M54009</guid>
      <dc:creator>Leo_BR</dc:creator>
      <dc:date>2017-02-11T00:59:33Z</dc:date>
    </item>
    <item>
      <title>Re: How to know the degree of parallelism available in my Hadoop Cluster?</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169882#M54010</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/15981/leonardobrunoaraujo.html" nodeid="15981"&gt;@Leonardo Araujo&lt;/A&gt;&lt;P&gt;Check this link:&lt;/P&gt;&lt;P&gt;&lt;A href="https://wiki.apache.org/hadoop/HowManyMapsAndReduces"&gt;https://wiki.apache.org/hadoop/HowManyMapsAndReduces&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Target one map task per block. If the file you are reading has five blocks distributed across three (or four, or five) nodes on five disks, then you should have five mappers, one for each disk.&lt;/P&gt;</description>
      <pubDate>Sat, 11 Feb 2017 01:06:42 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/How-to-know-the-degree-of-parallelism-available-in-my-Hadoop/m-p/169882#M54010</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2017-02-11T01:06:42Z</dc:date>
    </item>
  </channel>
</rss>

