We are using Cloudera CDH 5.12.1 on Ubuntu Xenial 16.04.
We have two distinct Hadoop/HBase clusters that use Kerberos authentication from the same realm.
We take nightly HBase snapshots of all our tables on our primary cluster and are attempting to use the ExportSnapshot utility to transfer our HBase data to our backup cluster for offline backups and a few other uses. This had been working fine for some time on an older version of CDH, but since our upgrade to 5.12.1 and the securing of our cluster with Kerberos we have run into an issue that I just cannot figure out.
We have 12 tables that we snapshot and then export one after another. Some of these tables are quite large, and the distcp/export process takes many hours. The large tables always fail during the "Verify snapshot integrity" step: the export reports that it cannot find an hfile on the remote site. After turning on Hadoop and HBase debug logging, I can see that the hfiles are copied over to the remote site but are then deleted there before ExportSnapshot completes, which makes the export fail. What I cannot figure out is which process on the remote side is deleting the hfiles outside of the export itself. My theory is that it has something to do with the hfile cleaner task, but I have not been able to verify that.
I was wondering if anyone else has experienced this or has an idea as to what might be going on.
This is the command we use to export the HBase snapshots:
HADOOP_OPTS="-Dmapred.job.map.memory.mb=4096 -Dmapreduce.map.memory.mb=4096" HADOOP_HEAPSIZE=4096 hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -chuser hbase -chgroup hbase -copy-to hdfs://hdnn001.ch1.tnbsound.com:8020/hbase -snapshot 20171207210001.idx_entity_dma
This is the HBase log that shows the error:
2017-12-21 06:04:07,947 INFO [main] snapshot.ExportSnapshot: Finalize the Snapshot Export
2017-12-21 06:04:08,306 INFO [main] snapshot.ExportSnapshot: Verify snapshot integrity
2017-12-21 06:04:08,310 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
2017-12-21 06:05:04,333 ERROR [VerifySnapshot-pool1-t1] snapshot.SnapshotReferenceUtil: Can't find hfile: 5d3803fa1bd44fa7b035de9b6feb850a in the real (hdfs://hdnn001.ch1.tnbsound.com:8020/hbase/data/default/idx_entity_asset_subregion/35367cf285b59a1a66067fd168caf2b6/m/5d3803fa1bd44fa7b035de9b6feb850a) or archive (hdfs://hdnn001.ch1.tnbsound.com:8020/hbase/archive/data/default/idx_entity_asset_subregion/35367cf285b59a1a66067fd168caf2b6/m/5d3803fa1bd44fa7b035de9b6feb850a) directory for the primary table.
2017-12-21 06:05:04,338 ERROR [main] snapshot.ExportSnapshot: Snapshot export failed
org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Can't find hfile: 5d3803fa1bd44fa7b035de9b6feb850a in the real (hdfs://hdnn001.ch1.tnbsound.com:8020/hbase/data/default/idx_entity_asset_subregion/35367cf285b59a1a66067fd168caf2b6/m/5d3803fa1bd44fa7b035de9b6feb850a) or archive (hdfs://hdnn001.ch1.tnbsound.com:8020/hbase/archive/data/default/idx_entity_asset_subregion/35367cf285b59a1a66067fd168caf2b6/m/5d3803fa1bd44fa7b035de9b6feb850a) directory for the primary table.
As a workaround, you could export to a different directory, such as one under /tmp, and once the distcp has completed, move the snapshot and archive into their proper locations.
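A rough sketch of that staging approach follows. The NameNode URI and snapshot name are taken from the command in the question; the staging path /tmp/hbase-export, the table name, the default namespace and the `run`/`stage_and_move` helpers are assumptions made for illustration. It moves the snapshot metadata before the hfiles on the assumption that the cleaner spares hfiles referenced by a completed snapshot. It only prints the commands unless DRY_RUN=0 is set, and it needs a valid Kerberos ticket when run for real.

```shell
#!/bin/sh
# Sketch only: export into a staging root, then move the results into place.
SNAPSHOT="20171207210001.idx_entity_dma"
TABLE="idx_entity_dma"                      # assumption: table behind the snapshot, default namespace
TARGET_NN="hdfs://hdnn001.ch1.tnbsound.com:8020"
STAGING_ROOT="$TARGET_NN/tmp/hbase-export"  # hypothetical staging directory
HBASE_ROOT="$TARGET_NN/hbase"

# Print the command by default; execute it only when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

stage_and_move() {
  # 1. Export into the staging root instead of the live HBase root.
  run hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
    -chuser hbase -chgroup hbase \
    -copy-to "$STAGING_ROOT" -snapshot "$SNAPSHOT" || return 1

  # 2. Move the snapshot metadata first, so the hfiles are already
  #    referenced by a snapshot when they arrive in the archive.
  run hdfs dfs -mv "$STAGING_ROOT/.hbase-snapshot/$SNAPSHOT" \
    "$HBASE_ROOT/.hbase-snapshot/$SNAPSHOT" || return 1

  # 3. Move the exported hfiles. This assumes the destination parent exists
  #    and the table has no archive directory there yet; if it does, the
  #    region directories would have to be merged by hand.
  run hdfs dfs -mv "$STAGING_ROOT/archive/data/default/$TABLE" \
    "$HBASE_ROOT/archive/data/default/$TABLE"
}

stage_and_move
```

Review the printed commands first, then rerun with DRY_RUN=0. The memory settings from the original command are left out here for brevity.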
If the exported HFiles are being deleted on the target, and you can confirm that it is the Master's HFileCleaner thread deleting them, then something is going wrong in the initial stage of ExportSnapshot, where the snapshot manifest/references are copied over. Check whether any warnings or errors are reported in the console logs. Also check that the manifest actually exists on the target cluster.
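One way to check for the manifest on the target while the export is still running is sketched below. It assumes the usual snapshot layout, with `.snapshotinfo` and `data.manifest` under `.hbase-snapshot/<name>` for a completed snapshot and under `.hbase-snapshot/.tmp/<name>` for one still in progress; confirm those paths against your own cluster. The `manifest_paths` and `check_manifest` helpers are hypothetical names, and the check itself only runs when RUN_CHECK=1 is set and a valid Kerberos ticket is available.

```shell
#!/bin/sh
# Sketch only: look for the snapshot manifest files on the target cluster.
SNAPSHOT="20171207210001.idx_entity_dma"
HBASE_ROOT="hdfs://hdnn001.ch1.tnbsound.com:8020/hbase"

# List the candidate manifest locations (assumed layout): completed
# snapshot directory first, then the in-progress .tmp directory.
manifest_paths() {
  for dir in ".hbase-snapshot/$SNAPSHOT" ".hbase-snapshot/.tmp/$SNAPSHOT"; do
    for f in .snapshotinfo data.manifest; do
      echo "$HBASE_ROOT/$dir/$f"
    done
  done
}

# Report which of those files exist, using `hdfs dfs -test -e`.
check_manifest() {
  manifest_paths | while read -r p; do
    # </dev/null keeps the hdfs client from consuming the list on stdin
    if hdfs dfs -test -e "$p" </dev/null; then
      echo "present: $p"
    else
      echo "missing: $p"
    fi
  done
}

if [ "${RUN_CHECK:-0}" = "1" ]; then
  check_manifest
fi
```

If neither location has a manifest while hfiles are already arriving under the archive directory, that would be consistent with the cleaner seeing them as unreferenced.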