HBase Java API scan is too slow

Explorer

Hi all,

I am using the HBase Java API to fetch data from a table that contains 40 million rows. I use a PrefixFilter on a scanner to retrieve the data. My Java application runs on a Windows PC, while HBase runs on a Hortonworks cluster with 4 region servers.

When I use ROWPREFIXFILTER in the HBase shell, I get the data within 2 seconds. But my Java client application takes around 10 minutes to return the same result. Can someone explain to me why there is such a difference?

1 ACCEPTED SOLUTION

Master Collaborator

One aspect to note w.r.t. using PrefixFilter is that the start row is not automatically set.

You need to pass the correct start row along with PrefixFilter. Otherwise the scan starts at the beginning of the table, and the number of rows scanned before the first matching row may be quite high.
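
To make the point above concrete, here is a minimal sketch of how the scan bounds for a row-key prefix could be derived. It uses only the Java standard library; the class name `PrefixScanBounds`, the helper names, and the sample prefix `user123|` are made up for illustration and are not taken from the original poster's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper (not part of HBase): derives scan bounds for a row-key prefix.
class PrefixScanBounds {

    // The start row of a prefix scan is the prefix itself: no row that sorts
    // before it can match, so the scan does not need to visit those rows.
    static byte[] startRow(byte[] prefix) {
        return prefix.clone();
    }

    // Exclusive stop row: the smallest key that sorts after every key with this
    // prefix. Trailing 0xFF bytes cannot be incremented, so they are dropped and
    // the last remaining byte is incremented instead. An empty result means
    // "no upper bound" (scan to the end of the table).
    static byte[] stopRow(byte[] prefix) {
        int end = prefix.length;
        while (end > 0 && prefix[end - 1] == (byte) 0xFF) {
            end--;
        }
        if (end == 0) {
            return new byte[0];
        }
        byte[] stop = Arrays.copyOf(prefix, end);
        stop[end - 1]++;
        return stop;
    }

    public static void main(String[] args) {
        // Assumed sample prefix, for demonstration only.
        byte[] prefix = "user123|".getBytes(StandardCharsets.UTF_8);
        System.out.println("start = " + new String(startRow(prefix), StandardCharsets.UTF_8));
        System.out.println("stop  = " + new String(stopRow(prefix), StandardCharsets.UTF_8));
    }
}
```

With the HBase client, these two arrays would typically be passed to `Scan.setStartRow` and `Scan.setStopRow` (`withStartRow` and `withStopRow` in HBase 2.x), next to `setFilter(new PrefixFilter(prefix))`. `Scan.setRowPrefixFilter(prefix)` sets both bounds in one call. As far as I know, that is also what the shell's ROWPREFIXFILTER option relies on, which would explain why the shell returns within seconds.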

4 REPLIES

Super Guru
@Samie WALA

I am assuming your PC is remote to the cluster? You are working from home, the cluster is your work cluster, and you are connected through a VPN over your home internet connection. When you log in to the shell, that shell is running on the same machine as HBase. Is that right? And unlike the shell, your Java application is running on your home PC?

When you run your query in the shell, it doesn't have to stream the result over the network. The result stays right there and is displayed right away. The shell adds very little overhead of its own, so it tends to be the fastest option.

Your application running on your PC has to go over the network to make a request, which seems to be pretty slow in this case. You didn't mention how big the result being streamed over the network to your PC is. If it's big, network issues might become more pronounced. You have not shared your code, but there could be some room for optimization there too.

One way to check, if possible, is to run your code on an edge node or some other machine on the same network as the cluster and see the difference.

Explorer

@mqureshi

Thank you for this answer.

I'm using a VPN to access the cluster, but the query returns only 312 rows. I ran the same application this morning while my PC was on our LAN and got the same duration. So I think there must be something else I'm missing.

Explorer

@Ted Yu

Hi Ted. Thank you for your reply. Once I provided the StartRow with the suitable information, I could retrieve the data in less than 2 seconds. Thank you very much.