Created 04-14-2017 03:01 PM
(HBase is installed on a CentOS machine.
I am fetching HBase data from my Windows 10 computer.)
I am running some performance tests on the HBase Java client, Thrift, and REST interfaces.
I have a table called “Airline” which has 500K rows.
I am fetching all 500K rows from the table through four different Java programs (using the Java client, Thrift, Thrift2, and REST).
Following are the performance numbers with various fetch sizes.
For all of these, the batch size is set to 100000.
| Fetch size (rows) | 1000   | 2000  | 5000  | 7500  | 10000 | 15000 | 20000 |
| REST              | 135923 | 67520 | 31293 | 22417 | 18210 | 14281 | 12348 |
| Thrift            | 135912 | 78630 | 38525 | 32470 | 27617 | 25223 | 27127 |
| Thrift2           | 133807 | 74559 | 39691 | 32457 | 28241 | 27189 | 25426 |
| Java API          | 45086  | 43945 | 44591 | 45393 | 44936 | 45849 | 45060 |
I can see that there is a performance improvement as the fetch size increases for REST, Thrift, and Thrift2.
But with the Java API, performance is roughly constant, irrespective of the fetch size.
Why does the fetch size have no impact with the Java client?
Here is a snippet of my Java program:
---------------------------------------
Table table = conn.getTable(TableName.valueOf("Airline"));
Scan scan = new Scan();
ResultScanner scanner = table.getScanner(scan);
// Keep pulling chunks of fetchSize rows until the scanner is exhausted
for (Result[] result = scanner.next(fetchSize); result.length != 0; result = scanner.next(fetchSize))
{
    // process the rows
}
scanner.close();
---------------------------------------
Can someone help me with this? Am I using the wrong methods/classes for fetching data through the Java client?
Created 04-15-2017 01:34 AM
A few follow-up questions:
1. How do I enable caching on the Java client?
I tried scan.setCaching(Integer.MAX_VALUE); scan.setCacheBlocks(true); but I did not see any difference in performance on subsequent runs.
2. I shut everything down and retried REST with a fetch size of 20000, and it is still faster than the Java client.
3. Why is the fetch size not taking effect with the Java client? Am I doing anything wrong in my program?
Created 04-16-2017 01:37 AM
Use Scan.setBatch(int) to control the number of records fetched per RPC with the Java API. The call you are making only wraps calls to ResultScanner.next(); it does not affect the underlying RPCs. You may also have to increase hbase.client.scanner.max.result.size, as this caps the amount of data returned in a single RPC (default 2MB).
The Thrift and REST servers do NOT cache results. Please disregard the comment which asserts this.
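For illustration, here is a minimal sketch of a scan configured along those lines, assuming the standard HBase 1.x client API; the connection setup, the 10 MB cap, and the caching/batch values are example assumptions for this sketch, not tuned recommendations:
---------------------------------------
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

Configuration conf = HBaseConfiguration.create();
// Raise the per-RPC result size cap (default 2MB); example value only
conf.setLong("hbase.client.scanner.max.result.size", 10L * 1024 * 1024);

try (Connection conn = ConnectionFactory.createConnection(conf);
     Table table = conn.getTable(TableName.valueOf("Airline"))) {
    Scan scan = new Scan();
    scan.setCaching(10000);     // rows buffered per RPC from the RegionServer
    scan.setCacheBlocks(true);  // allow the server to cache blocks read by this scan
    scan.setBatch(100);         // max cells per Result; mainly matters for very wide rows
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result result : scanner) {
            // process each row
        }
    }
}
---------------------------------------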
Created 04-17-2017 04:43 PM
Thanks for your reply, Josh Elser.
scan.setMaxResultSize() is set to 10 MB.
I tried setting Scan.setBatch() with different values, but I did not see any variation in performance. For any batch size, the performance is consistent; I did not see any improvement with a higher batch size.
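For reference, a rough sketch of how those two settings are applied to the scan from my earlier snippet (the batchSize variable stands for whatever value I varied per run; this is an illustration, not my exact program):
---------------------------------------
Scan scan = new Scan();
scan.setMaxResultSize(10L * 1024 * 1024); // 10 MB cap on data returned per scanner RPC
scan.setBatch(batchSize);                 // varied across runs
ResultScanner scanner = table.getScanner(scan);
---------------------------------------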
Created 04-18-2017 05:56 PM
Thanks for your reply, Josh Elser.
scan.setMaxResultSize() is set to 10 MB.
I tried setting Scan.setBatch() with different values, and I could see an improvement compared to earlier, but I did not see any variation in performance across different fetch sizes.
After setting scan.setMaxResultSize() to 10 MB+, the new performance numbers are as below:
| Fetch size (rows) | 1000  | 2000  | 5000  | 7500  | 10000 | 15000 | 20000 |
| Java API time     | 17692 | 17158 | 21524 | 21289 | 18802 | 18786 | 18786 |
For any batch size, the performance is almost consistent, whereas with REST I can see an improvement at higher fetch sizes.
Up to a batch size of 10000, the Java client looks good; above a batch size of 10000, REST looks better. Why?
What other parameters might be impacting this?