Created 12-30-2015 06:18 PM
Would a short CF TTL of 30 minutes on a 2-to-36-million-row table be detrimental to performance?
This HBase table would be written to and queried from Storm at a target rate of 20K entries per second at peak, and we would like entries to expire after 30 minutes.
Created 01-04-2016 08:51 AM
20K entries per sec is pretty achievable. The TTL feature does not insert DELETE markers; cells that have expired according to the TTL are simply filtered from the returned result set. Eventually a compaction will run, and the expired entries will not be seen by the compaction runner, so they won't be written to the new files produced by the compaction. There is also another mechanism: before compaction, if we can be sure that all the cells in an HFile are expired (by looking at the min and max timestamps of the HFile), the whole file is safely deleted without being rewritten at all.
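A minimal sketch of the two mechanisms described above (read-time filtering and whole-file drops); the cell and HFile representations here are simplified stand-ins, not HBase's actual data structures:

```python
import time

def filter_expired(cells, ttl_seconds, now=None):
    """Read-time TTL filtering: expired cells are dropped from the result
    set on the fly; no DELETE markers are ever written."""
    now = time.time() if now is None else now
    return [c for c in cells if now - c["ts"] < ttl_seconds]

def can_drop_whole_file(hfile_max_ts, ttl_seconds, now=None):
    """Whole-file drop: if even the newest cell in an HFile (its max
    timestamp) is past the TTL, the file can be deleted outright instead
    of being rewritten by compaction."""
    now = time.time() if now is None else now
    return now - hfile_max_ts >= ttl_seconds
```

The second check is why a short TTL is cheap: with 30-minute data, whole files age out quickly and can be removed without any compaction I/O.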
A TTL of 30 mins is very short, so you can also look into strategies for not running ANY compaction at all (depending on the write rate, there may not be many HFiles even with compactions disabled). The recently introduced FIFO compaction policy (https://issues.apache.org/jira/browse/HBASE-14468), coming in the next version of HDP 2.3, seems like a great fit.
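For reference, a table can be created with FIFO compaction per HBASE-14468 by setting the compaction policy class in the column family's CONFIGURATION; the table and CF names below are placeholders, and this only works on a release that ships the feature:

```
hbase> create 'events', {NAME => 'cf', TTL => 1800,
  CONFIGURATION => {'hbase.hstore.defaultengine.compactionpolicy.class' =>
    'org.apache.hadoop.hbase.regionserver.compactions.FIFOCompactionPolicy'}}
```

Since FIFO compaction never rewrites data and only drops fully expired files, the store file count can grow higher than usual, so raising `hbase.hstore.blockingStoreFiles` is generally recommended alongside it.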
Created 12-31-2015 02:37 AM
@Enis @Devaraj Das I'd be curious to know the answer; my guess would be no. How many records are inserted per day?
Created 12-31-2015 03:12 AM
For the sake of discussion, let's say
So, after the first 30 mins of running, the system will add an additional 10K deletes per second, for a total of 30K hits. It's definitely not that straightforward, though, and HBase is going to batch the actual deletes internally somehow.
30K TPS is not a lot for HBase, but the question is: how big a cluster are we talking about?
Another consideration is the memory available to the RegionServer: it makes sense to keep as much data in memory as possible so that I/O is minimal, since the data is to be deleted after 30 mins anyway. So the next set of questions is: what is the memory available on the box and to the RegionServer? How big is each message?
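A quick back-of-the-envelope sketch of those sizing questions. With a TTL, the live row count plateaus once the oldest rows start expiring, which is consistent with the 36 million figure in the original question; the 500-byte message size below is purely an assumed placeholder:

```python
def steady_state_rows(inserts_per_sec, ttl_seconds):
    # Once the table is older than the TTL, rows expire at the same
    # rate new ones arrive, so the live count plateaus here.
    return inserts_per_sec * ttl_seconds

def working_set_bytes(inserts_per_sec, ttl_seconds, bytes_per_message):
    # Rough memory needed to keep the whole live window resident.
    return steady_state_rows(inserts_per_sec, ttl_seconds) * bytes_per_message

# 20K inserts/sec with a 30-minute TTL plateaus at 36 million live rows.
rows = steady_state_rows(20_000, 30 * 60)
# At an assumed 500 bytes/message, that's ~18 GB of live data.
mem = working_set_bytes(20_000, 30 * 60, 500)
```

That working-set estimate is what you'd compare against the RegionServer heap to decide whether the 30-minute window can stay memory-resident.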
Created 01-04-2016 05:33 PM
Thanks Enis!