Data loss in MOB snapshot and clone

Explorer

I asked this on the HBase dev@ list, where it was suggested I move the question here, as it might be a CDH-specific issue.

 

We have just experienced data loss in HBase 1.0.0-cdh5.4.10.

 

The initial state was a table (tim_test) built with MOB support, holding a few tens of millions of rows and tens of billions of cells.

 

I wanted to rename the table to get this into production and did so as follows:

 

 

  snapshot 'tim_test', 'tim_test-snapshot'
  clone_snapshot 'tim_test-snapshot', 'prod_b_map'

 

At this stage everything looked good in the application, so I continued with:

 

  delete_snapshot 'tim_test-snapshot'
  disable 'tim_test'
  drop 'tim_test'

 

Then things went awry: data started dropping out of the app and, before long, all MOB data was seemingly gone.

 

The references in the new table's MOB folder appear to point to the source table

 

(e.g. /hbase/mobdir/data/default/prod_b_map/ba42a2e8e9b669d9fc85bdfeed2f5f2a/EPSG_4326/tim_test=14bf5f1737ac65c34615ed97c0b7de06-d41d8cd98f00b204e9800998ecf8427e20161006ff8baa70d21f408caefe8ae6318dfba2). 
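
One way to check where such a reference points is to split its file name. The sketch below assumes the name follows an HFileLink-style `<table>=<region>-<hfile>` pattern, which is what the example path above looks like; the regular expression and the `parse_link` helper are illustrative only and are not part of HBase.

```python
import re

# Assumed pattern for an HFileLink-style name: <table>=<region>-<hfile>.
# Tables outside the default namespace may carry an extra prefix,
# which this sketch does not handle.
LINK_NAME = re.compile(
    r"^(?P<table>[^=]+)=(?P<region>[0-9a-f]+)-(?P<hfile>[0-9a-f]+)$"
)


def parse_link(path):
    """Split the last path component into the table, region and file it names."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    match = LINK_NAME.match(name)
    return match.groupdict() if match else None


# Path copied from the post above.
path = (
    "/hbase/mobdir/data/default/prod_b_map/ba42a2e8e9b669d9fc85bdfeed2f5f2a/EPSG_4326/"
    "tim_test=14bf5f1737ac65c34615ed97c0b7de06-"
    "d41d8cd98f00b204e9800998ecf8427e20161006ff8baa70d21f408caefe8ae6318dfba2"
)
link = parse_link(path)
print(link["table"], link["region"])
```

For the path above, the parsed table is the source table (tim_test) rather than prod_b_map, which is consistent with the clone holding references instead of copies.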

 

The RS logs are full of ERROR entries like:

 

2016-10-12 15:19:14,640 ERROR org.apache.hadoop.hbase.regionserver.HStore: The mob file d41d8cd98f00b204e9800998ecf8427e20161006b59865f80e604781a79ebfa2ddd66b48 could not be found in the locations [hdfs://ha-nn/hbase/mobdir/data/default/tim_test/14bf5f1737ac65c34615ed97c0b7de06/EPSG_4326, hdfs://ha-nn/hbase/archive/data/default/tim_test/14bf5f1737ac65c34615ed97c0b7de06/EPSG_4326]
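
To see at a glance which table the RegionServer is searching, the locations in such a line can be pulled apart. This is a rough sketch: the regular expression is written against the single log line quoted above, the `missing_mob_file` helper is hypothetical, and the `.../<table>/<region>/<family>` layout is inferred from those two paths only.

```python
import re
from urllib.parse import urlparse

# Log line copied from the post above.
log_line = (
    "2016-10-12 15:19:14,640 ERROR org.apache.hadoop.hbase.regionserver.HStore: "
    "The mob file d41d8cd98f00b204e9800998ecf8427e20161006b59865f80e604781a79ebfa2ddd66b48 "
    "could not be found in the locations "
    "[hdfs://ha-nn/hbase/mobdir/data/default/tim_test/14bf5f1737ac65c34615ed97c0b7de06/EPSG_4326, "
    "hdfs://ha-nn/hbase/archive/data/default/tim_test/14bf5f1737ac65c34615ed97c0b7de06/EPSG_4326]"
)

MISSING = re.compile(
    r"The mob file (\S+) could not be found in the locations \[(.*)\]"
)


def missing_mob_file(line):
    """Return (mob file name, searched paths, table names) or None if no match."""
    match = MISSING.search(line)
    if match is None:
        return None
    mob_file, raw_locations = match.groups()
    locations = [urlparse(loc.strip()).path for loc in raw_locations.split(",")]
    # Assumed layout: .../<table>/<region>/<family>, so the table is third from the end.
    tables = {loc.split("/")[-3] for loc in locations}
    return mob_file, locations, tables


mob_file, locations, tables = missing_mob_file(log_line)
print(mob_file)
print(tables)
```

Both searched locations (the live MOB directory and the archive) belong to tim_test, the table that was dropped.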

 

 

What I don't know is:

  1) was this running a background task to copy the MOB data when the snapshot was cloned and I just deleted the source before the copy was complete?

- or

  2) when running "snapshot and clone" it just references the source MOB data until a cell write occurs?

- or  

  3) snapshot and clone just doesn't support MOB?

 

Can anyone shed some light on this please?

 

I'd just like to know the state of MOB and in particular snapshots with MOB tables in the CDH releases.

 

Thanks folks.

 

2 ACCEPTED SOLUTIONS

Explorer

This has been confirmed by an HBase committer and resulted in https://issues.apache.org/jira/browse/HBASE-16841

 

It is an open issue on HBase master and the recommended workaround is to issue 'flush tablename' before running 'snapshot tablename snapshotname'.  
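
Applied to the tables from the question, the workaround only changes the order of the shell commands. The sketch below just prints that sequence for pasting into the HBase shell; `rename_commands` is a hypothetical helper, and the later delete_snapshot and drop steps are left out on purpose because the thread does not say when they become safe.

```python
def rename_commands(source, snapshot, target):
    """Return HBase shell commands, in order, for cloning `source` into `target`."""
    return [
        f"flush '{source}'",  # workaround: flush before taking the snapshot
        f"snapshot '{source}', '{snapshot}'",
        f"clone_snapshot '{snapshot}', '{target}'",
    ]


# Table and snapshot names taken from the original question.
commands = rename_commands("tim_test", "tim_test-snapshot", "prod_b_map")
print("\n".join(commands))
```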

 

 


Cloudera Employee
Thanks for confirming this! We will watch it upstream and backport the fix.


7 REPLIES

Cloudera Employee

Thanks for reporting the issue. My answers are inline below.

 

  1) was this running a background task to copy the MOB data when the snapshot was cloned and I just deleted the source before the copy was complete?

 

As far as I know, no data is copied during the snapshot process; it uses references.

 

- or

  2) when running "snapshot and clone" it just references the source MOB data until a cell write occurs?

 

Yes, it always references the source MOB data files. When a new cell is written, it goes into the memstore first (it is not yet flushed to files), so the snapshot is not affected in that case.

 

- or  

  3) snapshot and clone just doesn't support MOB?

I think it is supported. Let me verify and come back. There may be an issue with the cleaner, which could be removing MOB files that it is not supposed to remove.

 

Explorer
Thanks!

I've got to fix up our data from source which is going to take a few days of processing, but once done I can try and reproduce this situation. It seems rather important (i.e. an enterprise customer presumably could experience data loss) so if a Clouderan has a bit of time to try and reproduce this that would be appreciated.
[I've chatted with Lars G about this and he confirms I'm not doing anything obviously wrong]

Cloudera Employee

Yeah, this is important. I created an internal JIRA to track it. If this turns out to be a common issue, we will create an upstream JIRA and post a fix. Thanks!

Explorer

One other symptom I should probably mention:  

 

The new table showed 100 to 500 requests per second on all regions for some hours after this.  I could not find out why or what they were (they did not come from the app and were for sure being triggered internally in HBase).  After maybe 2 hours, I disabled the table and then re-enabled it, and the requests dropped to 0.  Ganglia and the RS logs didn't suggest there was much going on, but it seemed wrong.

Cloudera Employee

Let me quickly check with people here to see what could be happening. I agree with you that this does not seem right.
