11-04-2014 12:53 AM
When I use bulkload to load 2~3 GB of data into a table every 3 minutes, I see some regions get stuck during split/compaction (a split may start but not finish even after 1 hour). Eventually there are orphan HFiles in HBase, which causes some data loss. Does anyone know whether this is a bug when bulkloading too often? Thanks!
HBase version: 0.94
HBase region size: 18 GB
Bulkload frequency: load 3 GB of data into a single table every 3 minutes
11-06-2014 10:15 AM
Frequent bulk loads are not a good long-term use case for HBase, for exactly the reasons you mention. The compaction queue and data-maintenance overhead just never catch up.
Can you either A) stream the writes into HBase continuously using puts, or B) hold off on the bulk loads until the HFiles are a full 18 GB (i.e., one region) in size? That would at least reduce the compactions and splits.
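As a rough back-of-the-envelope check of option B, using only the numbers from the question (3 GB every 3 minutes, 18 GB regions), you could batch the load so each bulkload delivers one region's worth of data. A minimal sketch of that arithmetic (the class and method names are mine, just for illustration):

```java
public class BatchInterval {
    // How many minutes of ingest to accumulate before one batch
    // equals a full region. Inputs: region size in GB, GB per load,
    // and minutes between loads (i.e., the ingest rate).
    static long minutesPerRegionSizedBatch(long regionGb, long loadGb, long loadEveryMin) {
        // rate = loadGb / loadEveryMin GB per minute; time = regionGb / rate
        return regionGb * loadEveryMin / loadGb;
    }

    public static void main(String[] args) {
        // 18 GB regions, 3 GB every 3 minutes -> batch every 18 minutes
        System.out.println(minutesPerRegionSizedBatch(18, 3, 3)); // prints 18
    }
}
```

In other words, batching to region size here would mean one bulkload roughly every 18 minutes instead of every 3, cutting the bulkload count (and the resulting compaction/split pressure) by a factor of six.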
11-12-2014 06:45 AM
I agree with Clint; bulk loading into HBase every 3 minutes is too often and will cause a ton of compactions. To remedy the splits, you should have an overall picture of what your data will look like 6 months to 1 year from now and pre-split the table at creation time. This should give you enough regions to load all of your data without having to split every time. This is a best practice for puts as well. Also, with regard to bulk loading, early versions of CDH4 had some issues with sequence numbers, and I would advise moving to CDH 5.1.3.
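To pre-split, you need to choose the split keys up front. A common approach (the idea behind HBase's RegionSplitter with HexStringSplit) is to divide a hex-encoded keyspace into N evenly sized ranges; the keys you compute can then be passed to table creation (e.g. the SPLITS option in the HBase shell). A standalone sketch of the key computation, with names of my own choosing:

```java
import java.util.ArrayList;
import java.util.List;

public class PreSplit {
    // Generate (numRegions - 1) evenly spaced 8-char hex split keys
    // over a 32-bit keyspace, similar in spirit to HexStringSplit.
    // Assumes row keys are uniformly distributed hex strings.
    static List<String> hexSplitKeys(int numRegions) {
        List<String> keys = new ArrayList<>();
        long max = 0xFFFFFFFFL; // top of the 32-bit keyspace
        for (int i = 1; i < numRegions; i++) {
            long boundary = max * i / numRegions;
            keys.add(String.format("%08x", boundary));
        }
        return keys;
    }

    public static void main(String[] args) {
        // 4 regions -> 3 split points
        System.out.println(hexSplitKeys(4)); // [3fffffff, 7fffffff, bfffffff]
    }
}
```

Note this only works if your row keys are actually spread across that keyspace (e.g. hashed/salted prefixes); if your keys are monotonically increasing timestamps, evenly spaced splits will not help, and you should derive split points from a sample of your real keys instead.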