Improve performance when loading an HBase table using Pig

Expert Contributor

I am trying to load an HBase table from an HDFS file using Pig. The file is only 3 GB, with 30350496 records.

The load takes a long time. Pig is running on Tez. Can you suggest any ways to improve the performance?

How can I identify where the performance bottleneck is? I cannot get much out of the Pig explain output.

Is there a way to tell whether a single HBase region server is overloaded, or whether the load is being distributed properly?

How can I identify the HBase region server splits?

1 ACCEPTED SOLUTION

Mentor

You are experiencing hotspotting: all of the table's regions reside on one RS (region server), so a single server handles every write. Use the move command in the HBase shell to move regions around: https://learnhbase.wordpress.com/2013/03/02/hbase-shell-commands/

Also consider a different rowkey, since the rowkey is what causes all rows to go to the same RS.

9 REPLIES

Use the HBase web UI to determine whether one region or region server is being overloaded with requests compared to the others while your Pig job is running. You can also examine the number of regions (and their splits) via this UI.

You can reach this page via the Ambari HBase service page's "Quick Links" menu.
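Once the region-to-server assignments have been read off the UI, a few lines of code can make the skew explicit. This is only a rough sketch: the region and server names below are hypothetical, and the input is assumed to be a hand-built list of (region, server) pairs rather than anything fetched from HBase.

```python
from collections import Counter


def skew_report(assignments, all_servers):
    """Return (regions per server, largest share held by one server).

    assignments: iterable of (region_name, server_name) pairs.
    all_servers: every region server in the cluster, so idle ones show as 0.
    """
    counts = {server: 0 for server in all_servers}
    counts.update(Counter(server for _region, server in assignments))
    total = sum(counts.values())
    max_share = max(counts.values()) / total if total else 0.0
    return counts, max_share


# Hypothetical example: six regions of one table, all hosted on rs1.
assignments = [(f"region-{i:02d}", "rs1") for i in range(6)]
servers = ["rs1", "rs2", "rs3"]

counts, max_share = skew_report(assignments, servers)
print(counts)
print(f"largest share on one server: {max_share:.0%}")
```

A share close to 1.0 means one server holds nearly all of the table, which matches the hotspotting pattern described in this thread.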

Expert Contributor

Hi Josh,

I looked at the table splits as you suggested and see that the table has 18 regions. The problem is that all 18 regions are on the same node (region server). How do I spread the regions across multiple region servers?

Also, is there a command to check the table splits and HBase regions via the CLI?

Is there any parameter I can use to improve the load performance with Pig? Currently I am only using hbase.scanner.caching to reduce round trips.

Thanks for your help!

Expert Contributor

Hi Artem, thanks for your reply. I will try hashing my row key. Even then, I think I will need to pre-split the HBase table so that there are many regions from the start of the load (I was just reading about creating pre-splits).

But even then, HBase might end up creating all the regions on the same server, right? Is there any option to ensure that all the regions of a table are distributed across multiple servers?
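To make the "hash the row key and pre-split" idea concrete, here is a minimal sketch. The bucket count of 16, the `NN|` prefix layout, and the function names are all assumptions chosen for illustration; nothing here is taken from the actual job.

```python
import bisect
import hashlib

NUM_BUCKETS = 16  # assumption: one bucket per pre-split region


def salted_key(seq_no, buckets=NUM_BUCKETS):
    """Prefix the sequence number with a bucket derived from its MD5 hash."""
    digest = hashlib.md5(str(seq_no).encode("utf-8")).digest()
    bucket = digest[0] % buckets
    return f"{bucket:02d}|{seq_no:010d}"


def split_points(buckets=NUM_BUCKETS):
    """One split point before every bucket except the first."""
    return [f"{b:02d}|" for b in range(1, buckets)]


def region_for(key, splits):
    """Index of the pre-split region whose key range contains `key`."""
    return bisect.bisect_right(splits, key)


splits = split_points()
hits = [0] * NUM_BUCKETS
for seq in range(10_000):
    hits[region_for(salted_key(seq), splits)] += 1

print(splits[:3])
print(hits)
```

Consecutive sequence numbers land in different buckets, so a sequential load touches every pre-split region instead of filling them one after another. In the HBase shell, split points like these can be supplied at table creation through the SPLITS option of create.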

Mentor

The only way is to make sure you have a good rowkey. Pre-splitting will help on the initial load, but good key design is essential in the long run. Play around with keys until you get it right. Java's UUID, like Pig's UniqueID, will ensure good key distribution, but you will lose the ability to do lookups. It all depends on the access patterns your application requires.
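The lookup trade-off can be illustrated with a small sketch. A plain dict stands in for the HBase table, and both key functions are hypothetical helpers: the point is only that a random key cannot be rebuilt at read time, while a key derived from the sequence number can.

```python
import hashlib
import uuid


def uuid_key(seq_no):
    """Random key: spreads writes well, but is unrelated to seq_no."""
    return uuid.uuid4().hex


def hashed_key(seq_no):
    """Deterministic key: the same seq_no always yields the same row key."""
    prefix = hashlib.md5(str(seq_no).encode("utf-8")).hexdigest()[:4]
    return f"{prefix}-{seq_no}"


# Stand-in for an HBase table: row key -> value.
table = {}
for seq in range(5):
    table[hashed_key(seq)] = f"row {seq}"

# A reader that only knows the sequence number can rebuild the key.
print(table[hashed_key(3)])

# A UUID generated at read time is a fresh random value, so it will not
# match any key that was generated at write time.
print(uuid_key(3) in table)
```

With UUID keys, a lookup by sequence number would need the key stored somewhere else or a full scan, which is why the choice depends on the access pattern.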

Mentor

Another option is to try Phoenix. In Phoenix you can create a table and ensure good key distribution. There is also Pig integration: https://phoenix.apache.org/pig_integration.html

Bottom line: Phoenix makes working with HBase easy, still uses the HBase APIs underneath, and has a familiar SQL syntax.

Expert Contributor

One more question. I am loading the HBase table through Pig, and I generate the sequence number with RANK before loading the table. Even though I pre-split the table, the UI showed that only three regions of the HBase table were being loaded for the first few minutes. Why would this happen? When the table is pre-split, shouldn't all of its regions receive data from the previous operator simultaneously and load in parallel? Why were only three of the 10 regions loading? Could it be because the data was streamed to the storer as the previous operator generated each row number? Any suggestions would be helpful. Thanks!
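One possible explanation, offered as an assumption and not a diagnosis, is that each parallel task writes a contiguous run of sequence numbers, so at any moment each task is writing into a single region. The sketch below simulates that. The row count and the 10 regions come from the question; the three writers, the evenly spaced split points, and keys that sort in numeric order (for example zero-padded) are assumptions.

```python
import bisect

TOTAL_ROWS = 30_350_496  # from the question
NUM_REGIONS = 10         # from the question
NUM_WRITERS = 3          # assumption: parallel tasks feeding the storer

# Assumption: split points evenly spaced over the sequence-number range.
splits = [TOTAL_ROWS * i // NUM_REGIONS for i in range(1, NUM_REGIONS)]


def region_of(seq_no):
    """Index of the region whose key range contains seq_no."""
    return bisect.bisect_right(splits, seq_no)


def active_regions(progress):
    """Regions being written when every writer is `progress` (0..1)
    of the way through its own contiguous slice of sequence numbers."""
    slice_size = TOTAL_ROWS // NUM_WRITERS
    active = set()
    for writer in range(NUM_WRITERS):
        current = writer * slice_size + int(progress * (slice_size - 1))
        active.add(region_of(current))
    return active


for progress in (0.0, 0.1, 0.5, 0.9):
    print(progress, sorted(active_regions(progress)))
```

Under these assumptions no more than three regions receive writes at any instant, however many regions the table was pre-split into. A hashed or salted key prefix would break up the contiguous runs.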

Expert Contributor

Actually, the main purpose of the table is to serve as a lookup table. The major problem is that lookups against this table are based on the sequence number, so I chose the sequence number as the row key, and I think that is what caused the hot spots. May I know how Phoenix ensures good distribution? I thought Phoenix was just a SQL layer on top of HBase that enables queries on HBase tables.
