Hef

Same HBase scan returns inconsistent results on multiple runs for same dataset

I'm seeing strange behavior in MapReduce when using HBase as the input format. I run my MR job multiple times on the same table and the same dataset, with the same FuzzyRowFilter pattern. The Input Records counters are not consistent: the smallest count can be 40% lower than the largest one.
 
More specifically, 
- CDH version 5.9 
- The table is split into 18 regions, distributed across 3 region servers. The TTL is set to 10 days for the records, though the dataset for the MR job only includes rows inserted within the last 7 days.
 
- The row key is defined as:
salt(1byte) + time_of_hour(4bytes) + uuid(36bytes)
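
To make the layout above concrete, here is a minimal sketch of how such a 41-byte key could be assembled. The salt derivation (hash of the uuid modulo 18 buckets), the meaning of time_of_hour (hours since the Unix epoch, big-endian int), and the class and method names are all assumptions for illustration, not necessarily what my real writer does.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

public class RowKeySketch {
    // Assumption: one salt bucket per region (18 regions in this table).
    static final int SALT_BUCKETS = 18;

    // Assumption: time_of_hour is the number of whole hours since the Unix epoch.
    static int timeOfHour(long epochMillis) {
        return (int) (epochMillis / 3_600_000L);
    }

    // Assumption: the salt is derived from the uuid's hash code.
    static byte salt(String uuid) {
        return (byte) Math.floorMod(uuid.hashCode(), SALT_BUCKETS);
    }

    // Layout: salt (1 byte) + time_of_hour (4 bytes, big-endian) + uuid (36 ASCII bytes).
    static byte[] buildRowKey(long epochMillis, String uuid) {
        byte[] uuidBytes = uuid.getBytes(StandardCharsets.US_ASCII);
        if (uuidBytes.length != 36) {
            throw new IllegalArgumentException("uuid must be 36 characters");
        }
        return ByteBuffer.allocate(1 + 4 + 36)
                .put(salt(uuid))
                .putInt(timeOfHour(epochMillis))
                .put(uuidBytes)
                .array();
    }

    public static void main(String[] args) {
        byte[] key = buildRowKey(System.currentTimeMillis(), UUID.randomUUID().toString());
        System.out.println("row key length = " + key.length); // prints "row key length = 41"
    }
}
```

Because the salt comes first, rows for one time_of_hour are spread over every salt bucket, which is why a plain start/stop row range is not enough and a fuzzy match on bytes 1 to 4 is used.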
 
 
- The scan is created as below:
Scan scan = new Scan();
scan.setBatch(100);
scan.setCaching(10000);
scan.setCacheBlocks(false);
scan.setMaxVersions(1);
 
The row filter for the scan is a FuzzyRowFilter that matches only events of a given time_of_hour.
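
For reference, a sketch of the pattern and mask such a filter would use for this key layout. It relies only on the JDK so it can be run without a cluster; the `matches` method is a simplified local stand-in for checking the mask, not HBase's implementation, and the class and method names are hypothetical. The mask follows the documented FuzzyRowFilter convention: 0 means the byte is fixed, 1 means any value is accepted.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class FuzzyMaskSketch {
    static final int KEY_LEN = 1 + 4 + 36; // salt + time_of_hour + uuid

    // Pattern: only bytes 1..4 (time_of_hour, big-endian) carry meaning.
    static byte[] pattern(int timeOfHour) {
        ByteBuffer buf = ByteBuffer.allocate(KEY_LEN);
        buf.position(1);        // skip the salt byte
        buf.putInt(timeOfHour); // remaining uuid bytes stay zero (ignored by the mask)
        return buf.array();
    }

    // Mask: 1 = byte may be anything, 0 = byte must equal the pattern.
    static byte[] mask() {
        byte[] mask = new byte[KEY_LEN];
        Arrays.fill(mask, (byte) 1);
        Arrays.fill(mask, 1, 5, (byte) 0); // fix only the 4 time_of_hour bytes
        return mask;
    }

    // Simplified local check (assumption: keys have exactly KEY_LEN bytes).
    static boolean matches(byte[] rowKey, byte[] pattern, byte[] mask) {
        if (rowKey.length != pattern.length) {
            return false;
        }
        for (int i = 0; i < rowKey.length; i++) {
            if (mask[i] == 0 && rowKey[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    // With HBase on the classpath, the two arrays would be passed as:
    //   new FuzzyRowFilter(Arrays.asList(new Pair<>(pattern(hour), mask())))
    // and attached with scan.setFilter(...).
}
```

Running known keys through a local check like this is a quick way to rule out a wrong mask or byte order before suspecting the server side.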
 
Everything looks fine, but the results are not what I expect.
When the same job is run 10 times, the Input Records counters show 6 different numbers, and the final output shows 6 different results.
 
I also noticed that there was once an issue with FuzzyRowFilter: https://issues.apache.org/jira/browse/HBASE-14269. However, I think this has already been fixed in the version of HBase I'm using.
 
Has anyone run into this problem before?
What could cause this inconsistency in the HBase scan results, and how can I fix it?
 
 
 
 
Thanks