Created 06-21-2017 05:53 PM
The docs say:
When an explicit deletion occurs in HBase, the data is not actually deleted. Instead, a tombstone marker is written. The tombstone marker prevents the data from being returned with queries. During a major compaction, the data is actually deleted, and the tombstone marker is removed from the StoreFile. If the deletion happens because of an expired TTL, no tombstone is created. Instead, the expired data is filtered out and is not written back to the compacted StoreFile.
What does "expired data is filtered out and is not written back to the compacted StoreFile." mean?
I have done some testing with TTL. I put 1 million records into a table and checked the file size. Then I set the TTL to 1 minute and all the data disappeared (the actual file got much smaller). Our database has major compaction shut off.
Does a major compaction have to happen with TTL?
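For context, this is roughly what the experiment looked like. TTL is set per column family; a minimal hbase shell sketch (the table and family names here are placeholders, not the real ones):

```
# hbase shell -- 'mytable' and 'cf' are placeholder names
alter 'mytable', {NAME => 'cf', TTL => 60}   # TTL in seconds
describe 'mytable'                           # confirm the TTL took effect
```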
Created 06-21-2017 06:45 PM
Mark, were you at the HBase/Phoenix Birds of a Feather at the San Jose summit last week? If so, I was sitting three seats away. This question, or something very similar, was asked there.
Created 06-21-2017 07:19 PM
Yup. That's me! We are still trying to figure this out. We have gotten 4 different answers from 4 different people.
Hope things are going well!
Created 06-22-2017 05:04 AM
"What does "expired data is filtered out and is not written back to the compacted StoreFile." mean?"
Data filtered out by TTL is removed on compaction. That is what this statement means.
As to your confusion from your test, remember that there is a difference between a "minor compaction" and a "major compaction". A "major compaction" rewrites all files in a region, whereas a "minor compaction" rewrites (possibly) a subset of the files in a region. The fact that (I'm guessing) you've disabled scheduled major compactions doesn't mean that compactions will never run in your system (this is actually a really idea if you've somehow done this, by the way).
A minor compaction can remove data that has expired via TTL -- this is the simple case. However, tombstones can *only* be removed when a major compaction runs, because a tombstone may be masking records in a file that was not included in the compaction.
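The distinction above can be sketched in a few lines of plain Python. This is a conceptual model, not real HBase code: it only illustrates why TTL-expired cells are safe to drop in any compaction, while a delete tombstone can only be dropped once a major compaction has read every file it could be masking.

```python
NOW = 1_000_000  # pretend "current time" in seconds

def compact(files, ttl_seconds, major):
    """Merge store files (lists of (key, timestamp, kind) cells) into one.

    kind is 'put' or 'tombstone'. TTL-expired puts are always dropped;
    tombstones are dropped only when major=True, because a minor compaction
    cannot prove a tombstone masks nothing in the files it did not read.
    """
    out = []
    for f in files:
        for key, ts, kind in f:
            if kind == 'put' and NOW - ts > ttl_seconds:
                continue  # expired by TTL: safe to drop in any compaction
            if kind == 'tombstone' and major:
                continue  # only a major compaction may drop the marker
            out.append((key, ts, kind))
    return out

fresh   = ('row1', NOW - 10,  'put')        # younger than the TTL
expired = ('row2', NOW - 600, 'put')        # older than the TTL
delete  = ('row3', NOW - 10,  'tombstone')  # explicit delete marker
files = [[fresh, expired], [delete]]

minor_result = compact(files, ttl_seconds=60, major=False)
major_result = compact(files, ttl_seconds=60, major=True)
print(minor_result)  # keeps the tombstone, drops the expired put
print(major_result)  # drops both
```

Note how the expired put vanishes in both cases, which matches the file shrinking in your test even with scheduled major compactions off.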
Long story short: if you want to verify why your experiment behaves as it does, just grep the RegionServer log for your table's region identifier. You should see a message at INFO (if not INFO, certainly at DEBUG) telling you that a compaction occurred on that region, along with the number and size of the input files and the size of the output file. Before that compaction message, the file on disk would be the full size; after it, the reduced size.
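Something along these lines -- the exact log wording varies by HBase version, so the sample line below is made up just to show the grep pattern; the encoded region name comes from the HBase master UI:

```shell
# Fabricated sample RegionServer log line (real wording varies by version)
LOG='2017-06-21 18:02:11,431 INFO regionserver.HStore: Completed major compaction of 4 file(s) in f1 of mytable,,1498003200000.abc123def456cafe. into 1 file(s), total size 42.0M'
REGION='abc123def456cafe'
echo "$LOG" | grep "$REGION" | grep -i 'compaction'
```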
https://hbase.apache.org/book.html#compaction does a pretty good job explaining these nuances, but feel free to ask for more clarification.
Created 06-22-2017 12:42 PM
So you said:
"this is actually a really idea if you've somehow done this, by the way..."
But I think you left an adjective out. I suppose you mean: bad?
We have been told by experts at Azure and Hashmap that we don't need to do major compaction and it is currently shut off.
1. We don't do any deletions from our system. Would this be the reason they say this?
2. We have been told that major compaction will block any writes to our tables (we can't have this). I was told this is untrue at PhoenixCon but when I asked Hashmap, they said that HDInsight has rewritten major compaction and that it blocks writes.
3. We want to start using TTL. If minor compaction deletes these records (that is what I took from the above), is major compaction required?
4. Why is there so much confusion about this?! Everyone seems to think TTL requires major compaction.
Created 06-22-2017 04:04 PM
"But I think you left an adjective out. I suppose you mean: bad?" Haha, oh my. This is why I shouldn't write responses late at night. Yes, "bad" 🙂
"We have been told by experts at Azure and Hashmap that we don't need to do major compaction and it is currently shut off." -- Again, what do you mean by "compactions are shut off"? HBase is still running compactions and will automatically trigger compactions that include all files in a region (and those are, by definition, "major compactions").
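If "shut off" means what it often does in practice -- setting the periodic major-compaction interval to 0 in hbase-site.xml -- note that this disables only the time-based trigger; compactions selected by the usual size/ratio policy still run, and one that happens to include every file in a store is still a major compaction:

```xml
<!-- hbase-site.xml: disables only the TIME-BASED major compaction trigger.
     Size/ratio-selected compactions still run, and one that includes every
     file in a store still counts as a major compaction. -->
<property>
  <name>hbase.hregion.majorcompaction</name>
  <value>0</value>
</property>
```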
"We have been told that major compaction will block any writes to our tables (we can't have this). I was told this is untrue at PhoenixCon but when I asked Hashmap, they said that HDInsight has rewritten major compaction and that it blocks writes." -- This sounds completely false to me. Last I checked, HDInsight was still HDP, and the version of HBase included in HDP does not block writes during compactions.
"We want to start using TTL. If minor compaction deletes these records (that is what I took from above) is major compaction required?" No -- TTLs are applied during minor compactions as well, so a major compaction is not required for TTL expiry. This is explicitly called out in the documentation: https://hbase.apache.org/book.html#ttl
Major compactions will still run on your system whether or not you have a scheduled (daily/weekly) configuration. You cannot, and should not, try to prevent compactions from running. Not running compactions means that the number of files in your system will continue to grow, query performance will degrade significantly, and unnecessary pressure will be put on the NameNode.
"Why is there so much confusion about this?! Everyone seems to think TTL requires major compaction." I have to assume you're just frustrated and that this is rhetorical. I can't tell you why people think what they do, but I can point you at the official documentation.
Created 06-22-2017 08:33 PM
Awesome. Thanks for sharing your knowledge!