HBase Scan slow after inserting a million records in table

Expert Contributor

Hi,

I am on HDP 2.3.4 (3-node cluster). My HBase scans have become slow after inserting a million rows of data.

As I am a newbie to HBase, are there any suggestions the experts can give me to tune performance?

Would really appreciate the help.

Thanks,

Divya

1 ACCEPTED SOLUTION

Expert Contributor

@Divya Gehlot Are you specifying start and stop keys in your scans? An open-ended scan that specifies neither a start nor a stop key usually ends up as a complete table scan and is therefore slow. As @Randy Gelhausen mentioned, an optimal rowkey design will help you specify start and stop keys.

5 REPLIES

Master Guru

@Divya Gehlot

A couple of suggestions:

  • HBase is a database built for random reads and writes, so large scans are not its strength.
  • If scans have to be performed, scan on the row key and not on the columns.

Master Guru

Hi @Divya Gehlot, go to HBase -> Quick Links -> HBase Master UI, then select Table details at the top, and locate and click on your table. It will show you the table's regions, their server layout, and the number of requests per region. You can then consider splitting regions that are too busy and moving some regions to other nodes for better load balancing. Refer to this for split/move, and to this for a good backgrounder. Since you have only 3 nodes, the improvement might be limited. Regarding other properties, if you can afford it, be sure to give the RegionServers enough RAM, at least 16 GB.

@Divya Gehlot, as @Sunile Manjee noted, HBase is an indexed lookup system which can also perform scans. This means you need to think a bit about your data access/query patterns before you can create an optimal table design.

In general, you want to design your rowkeys around your access patterns. Ensure your highest order rowkey bits can always be known to your application at HBase read-time, else your access will be a full-scan instead of a range scan.
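To make the range-scan point concrete, here is a minimal Python sketch that simulates it. It does not talk to HBase: a sorted list stands in for HBase's lexicographically ordered rows, and the key layout (`customer_id|timestamp`) and the helper names `make_rowkey` and `prefix_stop_key` are hypothetical, chosen only for illustration. The idea is that when the leading part of the key is known, a start and stop key can be derived from it and only the matching slice of rows is touched.

```python
import bisect


def make_rowkey(customer_id: str, timestamp: int) -> bytes:
    """Hypothetical composite key: the field known at read time comes first."""
    return f"{customer_id}|{timestamp:013d}".encode()


def prefix_stop_key(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key beginning with `prefix`.

    Simplification: assumes the last byte of the prefix is not 0xFF.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


# Sorted list standing in for a table of 1000 customers x 10 events each.
rows = sorted(
    make_rowkey(f"cust{c:04d}", 1_700_000_000_000 + t)
    for c in range(1000)
    for t in range(10)
)

# Range scan: the leading key part is known, so start/stop keys can be built.
start = b"cust0042|"
stop = prefix_stop_key(start)
lo = bisect.bisect_left(rows, start)
hi = bisect.bisect_left(rows, stop)
range_scan = rows[lo:hi]          # touches only the matching rows

# Full scan: without a usable key prefix, every row has to be examined.
full_scan = [r for r in rows if r.startswith(start)]
```

Both approaches return the same rows, but the range scan locates them with two binary searches while the full scan inspects all 10,000 keys. In a real table the same start and stop keys would be passed to the scan itself.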

Users of the raw HBase API often find themselves performing logic in their application code instead of server-side within HBase's RegionServer processes. To avoid both writing large amounts of client application code and pulling significant chunks of data back to the client, consider using Apache Phoenix on top of HBase. It makes it easy to run more selective HBase queries via SQL, which also:

1. Lends itself more naturally to thinking about how data is laid out in your tables

2. Lets you define secondary indices on the data your queries access regardless of whether your application knows a specific rowkey (or range) it needs to access.

Contributor

If you are using the HBase shell for scanning, you can try:

> scan '<table>', CACHE => 1000

CACHE tells the HBase RegionServer how many rows to return to the client per call, so the client fetches rows in batches instead of one at a time, which can save a lot of RPC calls.
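As a rough back-of-the-envelope illustration of why this helps, the Python sketch below estimates how many fetch round trips a scan needs for a given caching value. It is a simplified model, not HBase code: the function name `scan_rpc_count` is made up for this example, and it ignores scanner open/close calls and any size-based limits on a batch.

```python
def scan_rpc_count(total_rows: int, caching: int) -> int:
    """Approximate number of fetch round trips to stream `total_rows` rows
    when each round trip returns up to `caching` rows."""
    if caching < 1:
        raise ValueError("caching must be at least 1")
    # Integer ceiling division: a partial last batch still costs one call.
    return -(-total_rows // caching)


one_by_one = scan_rpc_count(1_000_000, 1)      # one row per round trip
batched = scan_rpc_count(1_000_000, 1000)      # 1000 rows per round trip
```

Under this model, scanning a million rows one row at a time takes a million round trips, while a caching value of 1000 brings that down to a thousand. Larger values use more client and server memory per call, so the right setting depends on row size.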