I am on HDP 2.3.4 (3-node cluster). My HBase scans are slow after inserting a million rows of data.
As I am a newbie to HBase, are there any suggestions experts can provide to help me tune performance?
Would really appreciate the help.
Hi @Divya Gehlot, go to HBase -> Quick Links -> HBase Master UI, then select Table details at the top, and locate and click on your table. It will show you the table's regions, their server layout, and the number of requests per region. You can then consider splitting regions that are too busy and moving some regions to other nodes for better load balancing. Refer to this for split/move, and to this for a good backgrounder. Since you have only 3 nodes, the gains might be limited. Regarding other properties, if you can afford it, be sure the RegionServers have enough RAM, not less than 16 GB.
@Divya Gehlot, as @Sunile Manjee noted, HBase is an indexed lookup system which can also perform scans. This means you need to think a bit about your data access/query patterns before you can create an optimal table design.
In general, you want to design your rowkeys around your access patterns. Ensure the leading (highest-order) part of your rowkey is always known to your application at read time; otherwise your access will be a full table scan instead of a range scan.
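To make the rowkey point concrete, here is a minimal sketch in plain Java (no HBase dependency). The layout is purely hypothetical and not taken from your table: a device ID, a 0x00 separator, then an 8-byte big-endian timestamp. The class and method names are made up for illustration, and it assumes the ID never contains a 0x00 byte and that timestamps are non-negative. Because the application always knows the device ID, it can compute a start and stop row and read only that slice of the table.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical rowkey layout: <deviceId> 0x00 <8-byte big-endian timestamp>.
class RowKeySketch {
    // Build the full rowkey for one reading of one device.
    static byte[] rowKey(String deviceId, long timestampMillis) {
        byte[] id = deviceId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + 1 + Long.BYTES)
                .put(id).put((byte) 0).putLong(timestampMillis).array();
    }

    // Inclusive start of the range that holds every row for one device.
    static byte[] startRow(String deviceId) { return withSuffix(deviceId, (byte) 0); }

    // Exclusive end of that range: same prefix, separator bumped by one.
    static byte[] stopRow(String deviceId) { return withSuffix(deviceId, (byte) 1); }

    private static byte[] withSuffix(String deviceId, byte suffix) {
        byte[] id = deviceId.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(id.length + 1).put(id).put(suffix).array();
    }

    // Unsigned lexicographic comparison, the order in which HBase sorts rowkeys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] key = rowKey("dev42", 1_450_000_000_000L);
        boolean inRange = compare(startRow("dev42"), key) <= 0
                && compare(key, stopRow("dev42")) < 0;
        System.out.println("dev42 row inside range: " + inRange);
    }
}
```

In a real client you would pass the two byte arrays as the start and stop rows of your scan. If the device ID were instead placed after the timestamp, there would be no known prefix to bound the scan with, which is the full-scan case described above.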
Users of the raw HBase API often find themselves performing logic in their application code instead of server-side within HBase's RegionServer processes. A simple but powerful way to avoid both writing large amounts of client application code and pulling significant chunks of data back is to use Apache Phoenix on top of HBase. It makes it easy to run more selective HBase queries via SQL, which also:
1. Lends itself more naturally to thinking about how data is laid out in your tables
2. Lets you define secondary indices on the data your queries access, regardless of whether your application knows the specific rowkey (or range) it needs.
If you are using the HBase shell for scanning, you can try:
> scan '<table>', CACHE => 1000
CACHE tells the HBase RegionServer how many rows to batch up before returning them to the client, which can save a lot of RPC calls.
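A rough back-of-the-envelope sketch of why this helps, in plain Java. This is a simplification under stated assumptions, not a model of HBase internals: it assumes each round trip returns exactly `caching` rows and ignores result-size limits and the calls that open and close the scanner. The class and method names are made up for illustration. In the Java client, the corresponding setting is `Scan.setCaching(int)`.

```java
// Hypothetical helper: estimates scanner round trips for a given caching value.
class ScanCachingSketch {
    // Rough number of "fetch next batch" round trips needed to read all rows.
    static long estimatedRoundTrips(long rows, int caching) {
        if (caching <= 0) {
            throw new IllegalArgumentException("caching must be positive");
        }
        return (rows + caching - 1) / caching; // ceiling division
    }

    public static void main(String[] args) {
        long rows = 1_000_000L; // the table size mentioned in the question
        System.out.println("caching=1    -> " + estimatedRoundTrips(rows, 1));
        System.out.println("caching=1000 -> " + estimatedRoundTrips(rows, 1000));
    }
}
```

For a million rows the estimate drops from 1,000,000 round trips with one row per call to 1,000 with a caching value of 1000. Larger values cost more memory per batch on both client and RegionServer, so it is worth trying a few values rather than simply picking the largest.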