HBase bulk load help: the last reducer is taking forever to finish

Explorer

Hi,

 

I am upgrading our cluster from CDH3 to CDH4. As part of this project I created a parallel cluster that is now running CDH4, and I am importing the HBase data that I exported from the old cluster and copied onto the new one.

 

I am using the bulk load tool to import the data into the tables. Here is how it's being done:

 

1. Exported the HBase tables on CDH3

2. Copied the exports to the new cluster with distcp

3. Created the tables with pre-split regions

4. Importing the data using the bulk load tool. Here is the command that's being used:

 

hbase org.apache.hadoop.hbase.mapreduce.Import -Dimport.bulk.output=/backup/TABLE_NAME TABLE_NAME /import/TABLE_NAME

 

The map phase of this process goes pretty fast, but the reduce phase takes forever to finish. I pre-split the regions to increase the number of reducers, but the load still spends a lot of time on the last reducer.

 

Is there any way I can improve the speed by getting all the reducers to finish at close to the same time?

 

To give some context, a 1.3 TB table took 45 minutes to finish the map phase and another 1 hour 15 minutes to finish all but one reducer. The last reducer is still running after nearly 4 hours and is only 33% complete. I have more tables to import and they are much larger. Any help would be greatly appreciated.

 

Please let me know if you need more information.

 

Thank you all in advance,

Venkat

1 ACCEPTED SOLUTION

Explorer

I figured out why the last reducer is taking so long: user error (it's me!).

 

When I pre-split the table based on the target regions, I failed to include all the split keys. This left the last region responsible for about 80 times more data than the other regions, which is why that reducer spent so much time.

 

If the table is split evenly, all the reducers seem to finish close to each other.
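
A minimal sketch of one way to get even splits, assuming you can produce a sorted sample of the table's row keys, one per line. The helper name even_splits, the file names, and the column family 'cf' mentioned afterwards are hypothetical. It picks evenly spaced keys from the sample as split points, so each region covers roughly the same number of sampled rows:

```shell
# even_splits REGIONS < sorted_keys.txt
# Reads sorted row keys (one per line) on stdin and prints REGIONS-1
# split keys chosen at evenly spaced positions in the sample.
even_splits() {
  regions=$1
  awk -v n="$regions" '
    { keys[NR] = $0 }                      # hold the whole sample in memory
    END {
      last = ""
      for (i = 1; i < n; i++) {
        idx = int(i * NR / n) + 1          # first key of the i-th slice
        # Skip empty input and repeated keys (sample smaller than n).
        if (idx <= NR && keys[idx] != last) {
          print keys[idx]
          last = keys[idx]
        }
      }
    }'
}
```

For example, `even_splits 100 < sorted_keys.txt > splits.txt` would write the split keys for 100 regions, provided the sample is much larger than the region count. If your HBase shell version supports the SPLITS_FILE option, the table could then be created with something like create 'TABLE_NAME', 'cf', SPLITS_FILE => 'splits.txt'. Equal row counts only approximate equal data volume when row sizes are similar.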


3 REPLIES

You could try taking a jstack of the reducer 4-5 times, about a minute apart,
to see whether it is hung or just busy.
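
The jstack suggestion can be scripted. This is a rough sketch, assuming jstack from the JDK is on the PATH of the node running the reducer and that you already know the reducer JVM's process ID. The function name take_dumps, its arguments, and the JSTACK override variable are made up for illustration:

```shell
# take_dumps PID [COUNT] [INTERVAL_SECONDS] [OUTPUT_DIR]
# Writes COUNT thread dumps of the JVM with the given PID, spaced
# INTERVAL_SECONDS apart, into OUTPUT_DIR.
take_dumps() {
  pid=$1
  count=${2:-5}
  interval=${3:-60}
  outdir=${4:-.}
  mkdir -p "$outdir"
  i=1
  while [ "$i" -le "$count" ]; do
    # JSTACK can point at a different jstack binary; defaults to "jstack".
    "${JSTACK:-jstack}" "$pid" > "$outdir/jstack.$pid.$i.txt" 2>&1
    # No need to wait after the final dump.
    if [ "$i" -lt "$count" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

A call might look like `take_dumps 12345 5 60 /tmp/dumps`, where 12345 is a placeholder PID, run as the user that owns the task JVM. If the stacks differ from one dump to the next, the reducer is most likely busy rather than hung.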

Moreover, you'll need the following option to import from CDH3 to
CDH5: -Dhbase.import.version=0.94.
Could you try again and let us know?

# sudo -u hdfs hbase -Dhbase.import.version=0.94 org.apache.hadoop.hbase.mapreduce.Import t1 /import

Regards,
Gautam Gopalakrishnan

Explorer

It just moved from the COPY phase to the SORT phase, so it's not hung, just terribly busy.

 

I will try the solution you mentioned during the next import. I just hope each reducer does its own copy/sort/reduce for its region (which they are doing partially) instead of one big long one at the end...
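
To see how the work would be spread before starting a long import, here is a small sketch that counts how many sampled row keys fall into each region for a given list of split keys. It assumes the bulk-load job runs one reducer per region of the target table (which matches the observation that pre-splitting increases the number of reducers), that both inputs hold one key per line, and that the split keys are sorted in byte order. The name region_counts and the file names are hypothetical:

```shell
# region_counts SPLITS_FILE < sampled_keys.txt
# Prints how many of the keys on stdin land in each region defined by
# the sorted split keys in SPLITS_FILE (region 0 precedes the first split).
region_counts() {
  splits_file=$1
  # LC_ALL=C keeps string comparison in plain byte order.
  LC_ALL=C awk -v sf="$splits_file" '
    BEGIN {
      n = 0
      while ((getline line < sf) > 0) split_key[++n] = line
      close(sf)
    }
    {
      r = 0
      # A key belongs to the last region whose start key is <= the key.
      # Appending "" forces string (not numeric) comparison.
      for (i = 1; i <= n; i++) {
        if (($0 "") >= (split_key[i] "")) r = i
        else break
      }
      count[r]++
    }
    END {
      for (r = 0; r <= n; r++) printf "region %d: %d\n", r, count[r] + 0
    }'
}
```

Usage would be along the lines of `region_counts splits.txt < sampled_keys.txt`. A region whose count is far larger than the rest points to a reducer that will run long, for example when the split list stops short of the highest keys.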
