When is HBase CF TTL too short?

Rising Star

Would a short CF TTL of 30 minutes on a 2 to 36 million row table be detrimental to performance?

This HBase table would be queried and written to from Storm at a target rate of 20k entries per second at peak, and we would like the current entries to expire after 30 minutes.
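For context, the TTL is set in seconds on the column family itself. A minimal HBase shell sketch (the table name 'events' and family 'cf' here are just placeholders):

    # 30 minutes = 1800 seconds, set per column family
    create 'events', {NAME => 'cf', TTL => 1800}
    # or on an existing table:
    alter 'events', {NAME => 'cf', TTL => 1800}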


4 REPLIES

Master Mentor

@Enis @Devaraj Das I'd be curious to know the answer; my guess would be no. How many records are inserted per day?


For the sake of discussion, let's say:

  • the system is running at peak load 24 hours a day
  • out of the 20K operations per second, 10K are reads and 10K are inserts

So, after the first 30 minutes of running, the system will start adding another 10K deletes per second, for a total of 30K operations per second. It's definitely not that straightforward, though, and HBase will batch the actual deletes somehow internally.

30K TPS is not a lot for HBase, but the question is how big a cluster we are talking about.

The other thing to consider will be the memory available to the RegionServer: it makes sense to keep as much data in memory as possible so that I/O is minimal, since the data is to be deleted after 30 minutes anyway. So the next set of questions is: what memory is available on the box and to the RegionServer? How big is each message?
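If memory residency matters that much, one knob worth knowing is the IN_MEMORY flag on the column family, which gives its blocks higher priority in the block cache. A hedged sketch, reusing the placeholder names from above:

    # IN_MEMORY raises this family's priority in the block cache;
    # it is a caching hint, not a guarantee that all data stays in RAM
    alter 'events', {NAME => 'cf', IN_MEMORY => 'true'}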

Guru (Accepted Solution)

20K entries per second is quite achievable. The TTL feature does not insert DELETE markers; cells that have expired according to the TTL are simply filtered out of the returned result set. Eventually a compaction will run, the expired entries will be dropped by it, and they won't be written to the new files produced by the compaction. There is also another mechanism: if, before compacting, we can be sure that every cell in an HFile has expired (by looking at the HFile's min and max timestamps), we can safely delete the whole file without rewriting anything at all.

A TTL of 30 minutes is very short, so you can also look into strategies for not running ANY compaction at all (depending on the write rate, there may not be many HFiles even with compactions disabled). The recently introduced FIFO compaction policy (https://issues.apache.org/jira/browse/HBASE-14468), coming in the next version of HDP-2.3, seems like a great fit.
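For illustration, here is roughly how that could look in the HBase shell once FIFO compaction is available. The table and family names are placeholders, and the policy class is the one introduced by HBASE-14468; treat this as a sketch rather than the exact HDP syntax:

    # FIFO compaction never rewrites data; it simply drops whole HFiles
    # once every cell in them has expired, so the family must have a TTL
    create 'events', {NAME => 'cf', TTL => 1800,
      CONFIGURATION => {
        'hbase.hstore.defaultengine.compactionpolicy.class' =>
          'org.apache.hadoop.hbase.regionserver.compactions.FIFOCompactionPolicy'}}

    # Alternatively, compactions can be switched off entirely at the table level:
    alter 'events', COMPACTION_ENABLED => false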

Rising Star

Thanks Enis!