<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Hbase bulk load help, the last reducer is taking forever to finish... in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25288#M5181</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am upgrading our cluster from CDH3 to CDH4. As part of this project I created a parallel cluster that's now running CDH4, and I am now importing the HBase data that I exported and copied onto the new cluster.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am using the bulk load tool to import the data into the tables. Here is how it's been done:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. Exported HBase tables on CDH3&lt;/P&gt;&lt;P&gt;2. Copied them to the new cluster with distcp&lt;/P&gt;&lt;P&gt;3. Created tables with pre-split regions&lt;/P&gt;&lt;P&gt;4. Importing data using the bulk load tool. Here is the command that's being used:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/backup/TABLE_NAME TABLE_NAME /import/TABLE_NAME&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The map phase of this process goes pretty fast, but the reduce phase takes forever to finish. I pre-split the regions to increase the number of reducers, but the load still spends a lot of time on the last reducer.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there any way I can improve the speed by getting all the reducers to finish at close to the same time?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To give some context, a 1.3 TB table took 45 min to finish the map phase, and another 1:15 to finish all but one reducer. The last reducer is still running after nearly 4 hours and is only 33% complete. I have more tables to import and they are much larger. Any help would be greatly appreciated.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please let me know if you need more information.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you all in advance,&lt;/P&gt;&lt;P&gt;Venkat&lt;/P&gt;</description>
    <pubDate>Fri, 16 Sep 2022 09:23:12 GMT</pubDate>
    <dc:creator>Spinhoo</dc:creator>
    <dc:date>2022-09-16T09:23:12Z</dc:date>
    <item>
      <title>Hbase bulk load help, the last reducer is taking forever to finish...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25288#M5181</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am upgrading our cluster from CDH3 to CDH4. As part of this project I created a parallel cluster that's now running CDH4, and I am now importing the HBase data that I exported and copied onto the new cluster.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am using the bulk load tool to import the data into the tables. Here is how it's been done:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;1. Exported HBase tables on CDH3&lt;/P&gt;&lt;P&gt;2. Copied them to the new cluster with distcp&lt;/P&gt;&lt;P&gt;3. Created tables with pre-split regions&lt;/P&gt;&lt;P&gt;4. Importing data using the bulk load tool. Here is the command that's being used:&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/backup/TABLE_NAME TABLE_NAME /import/TABLE_NAME&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;The map phase of this process goes pretty fast, but the reduce phase takes forever to finish. I pre-split the regions to increase the number of reducers, but the load still spends a lot of time on the last reducer.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Is there any way I can improve the speed by getting all the reducers to finish at close to the same time?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;To give some context, a 1.3 TB table took 45 min to finish the map phase, and another 1:15 to finish all but one reducer. The last reducer is still running after nearly 4 hours and is only 33% complete. I have more tables to import and they are much larger. Any help would be greatly appreciated.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Please let me know if you need more information.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thank you all in advance,&lt;/P&gt;&lt;P&gt;Venkat&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 09:23:12 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25288#M5181</guid>
      <dc:creator>Spinhoo</dc:creator>
      <dc:date>2022-09-16T09:23:12Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase bulk load help, the last reducer is taking forever to finish...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25290#M5182</link>
      <description>You could try taking a jstack of the reducer 4-5 times, about a minute apart, to see whether it is hung or just busy.&lt;BR /&gt;&lt;BR /&gt;Moreover, you'll need the following option to import from CDH3 to CDH5: -Dhbase.import.version=0.94.&lt;BR /&gt;Could you try again and let us know?&lt;BR /&gt;&lt;BR /&gt;# sudo -u hdfs hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import t1 /import&lt;BR /&gt;&lt;BR /&gt;</description>
      <pubDate>Wed, 04 Mar 2015 22:10:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25290#M5182</guid>
      <dc:creator>GautamG</dc:creator>
      <dc:date>2015-03-04T22:10:34Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase bulk load help, the last reducer is taking forever to finish...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25292#M5183</link>
      <description>&lt;P&gt;It just moved from the COPY phase to the SORT phase, so it's not hung, just terribly busy.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I will try the solution you mentioned during the next import. I just hope each reducer does its own copy/sort/reduce for its region (which they are doing partially) instead of one big long one at the end...&lt;/P&gt;</description>
      <pubDate>Wed, 04 Mar 2015 22:39:19 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25292#M5183</guid>
      <dc:creator>Spinhoo</dc:creator>
      <dc:date>2015-03-04T22:39:19Z</dc:date>
    </item>
    <item>
      <title>Re: Hbase bulk load help, the last reducer is taking forever to finish...</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25322#M5184</link>
      <description>&lt;P&gt;I figured out why the last reducer is taking so long: user error (it's me!)...&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;When I pre-split the table based on the target regions, I failed to include all the keys. This resulted in a table in which the last region was responsible for 80 times more data than the other regions. This is what caused that reducer to spend so much time.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;If the table is split evenly, all reducers seem to finish close to each other.&lt;/P&gt;</description>
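      <!--
A minimal sketch of the fix described in the post above, assuming a representative sample of row keys is available (for instance, read from the exported data). The names choose_split_keys and rows_per_region are hypothetical, and nothing here calls HBase itself. The idea is to pick split keys at evenly spaced quantiles of the sample, then count how many sampled rows each resulting region would receive, so a skewed pre-split shows up before the import job is launched.

```python
from bisect import bisect_right
from collections import Counter


def choose_split_keys(sample_keys, num_regions):
    """Pick up to num_regions - 1 split keys at evenly spaced quantiles.

    sample_keys is an assumed, representative sample of row keys.
    """
    keys = sorted(sample_keys)
    if num_regions < 2 or not keys:
        return []
    splits = []
    for i in range(1, num_regions):
        candidate = keys[(i * len(keys)) // num_regions]
        # Skip repeats so the split list stays strictly increasing.
        if not splits or candidate > splits[-1]:
            splits.append(candidate)
    return splits


def rows_per_region(sample_keys, split_keys):
    """Count sampled rows per region for a sorted list of split keys.

    A key equal to a split key is counted in the region that starts
    at that split key (start keys are treated as inclusive).
    """
    counts = Counter(bisect_right(split_keys, key) for key in sample_keys)
    return [counts.get(i, 0) for i in range(len(split_keys) + 1)]
```

Calling rows_per_region with the split keys actually used to create the table gives one count per region. If the last count is far larger than the rest, the split list is probably missing keys at the high end of the key range, which matches the symptom in this thread.
      -->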
      <pubDate>Thu, 05 Mar 2015 19:14:17 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Hbase-bulk-load-help-the-last-reducer-is-taking-forever-to/m-p/25322#M5184</guid>
      <dc:creator>Spinhoo</dc:creator>
      <dc:date>2015-03-05T19:14:17Z</dc:date>
    </item>
  </channel>
</rss>

