Regions stop splitting/compacting after bulk loading data into HBase every 3 minutes

New Contributor

Hi all,

When I use bulk load to load 2~3 GB of data into a table every 3 minutes, I see some regions stop splitting and compacting (a split may start but has still not finished after 1 hour). Eventually, some orphan HFiles are left in HBase, which causes some data loss. Does anyone know whether this is a bug triggered by bulk loading too often? Thanks!

Platform: CDH4
HBase version: 0.94
HBase region size: 18 GB
Bulk load frequency: 3 GB of data loaded into a single table every 3 minutes

1 ACCEPTED SOLUTION

Guru

Frequent bulk loads are not a good long-term use case for HBase, for the exact reasons you mentioned. The compaction queue and data maintenance overhead just never catch up.

Can you either A) just stream the writes into HBase constantly using puts, or B) hold off on the bulk loads until the HFiles are a full 18 GB (i.e., one region) in size? That would at least reduce the compactions and splits.
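
To put rough numbers on option B, here is a minimal Java sketch of the arithmetic, using the figures from the question (3 GB per batch, one batch every 3 minutes, 18 GB regions). The class and method names are made up for illustration and are not part of any HBase API.

```java
// Hypothetical helper, not an HBase class: estimates how long to accumulate
// data before a single bulk load reaches one full region's worth of HFiles.
final class BulkLoadCadence {

    // Ceiling division: number of fixed-size batches needed to reach the target size.
    static long batchesToFill(long targetGb, long batchGb) {
        if (targetGb <= 0 || batchGb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (targetGb + batchGb - 1) / batchGb;
    }

    // Minutes between bulk loads if one batch arrives every batchIntervalMinutes.
    static long minutesBetweenLoads(long targetGb, long batchGb, long batchIntervalMinutes) {
        return batchesToFill(targetGb, batchGb) * batchIntervalMinutes;
    }

    public static void main(String[] args) {
        long batches = batchesToFill(18, 3);          // 18 GB region, 3 GB batches
        long minutes = minutesBetweenLoads(18, 3, 3); // one batch every 3 minutes
        System.out.println(batches + " batches per bulk load, one load every " + minutes + " minutes");
    }
}
```

With these figures, that works out to 6 batches per bulk load, or one load roughly every 18 minutes instead of every 3.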

New Contributor
Thanks for your reply. I have revised my solution to write into HBase using puts. It now works without the problems mentioned before.

Contributor

I agree with Clint. Bulk loading into HBase every 3 minutes is too often and will cause a ton of compactions. To remedy the splits, you should have an overall understanding of what your data will look like 6 months to 1 year from now and pre-split the table upon creation. This should give you enough regions to load all of your data without having to split every time. This is a best practice for puts as well. Also, with regard to bulk loading, early versions of CDH4 had some issues with sequence numbers, and I would advise moving to CDH 5.1.3.
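
As a rough illustration of the pre-splitting advice, here is a minimal Java sketch that estimates a region count from a projected data size and generates evenly spaced split keys. It assumes row keys start with a uniformly distributed byte (for example a hashed or salted prefix), which may not match your schema. The class and method names are hypothetical, and the 100 GB projection is only a placeholder.

```java
// Hypothetical helper, not an HBase class. Assumes row keys begin with a
// uniformly distributed byte, so one-byte split points spread the load evenly.
final class PreSplitSketch {

    // Ceiling of projected size over region size: regions needed to hold the data.
    static int regionsNeeded(long projectedGb, long regionGb) {
        if (projectedGb <= 0 || regionGb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return (int) ((projectedGb + regionGb - 1) / regionGb);
    }

    // Returns numRegions - 1 single-byte split keys, evenly spaced over 0x00..0xFF.
    static byte[][] uniformSplitKeys(int numRegions) {
        if (numRegions < 1 || numRegions > 256) {
            throw new IllegalArgumentException("numRegions must be in 1..256 for one-byte keys");
        }
        byte[][] splits = new byte[numRegions - 1][];
        for (int i = 1; i < numRegions; i++) {
            splits[i - 1] = new byte[] { (byte) (i * 256 / numRegions) };
        }
        return splits;
    }

    public static void main(String[] args) {
        int regions = regionsNeeded(100, 18); // placeholder: 100 GB projected, 18 GB regions
        StringBuilder keys = new StringBuilder();
        for (byte[] key : uniformSplitKeys(regions)) {
            keys.append(String.format("%02x ", key[0] & 0xFF));
        }
        System.out.println(regions + " regions, split keys: " + keys.toString().trim());
        // In HBase 0.94 the keys would then be passed to
        // HBaseAdmin.createTable(HTableDescriptor desc, byte[][] splitKeys).
    }
}
```

If your row keys are not uniformly distributed, the split points should be chosen from the actual key distribution instead.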