Snapshots and backups

Rising Star

Are snapshots the preferred way for taking backups in Hadoop production environments?

Are snapshots used for taking backups of both NameNode metadata (fsimage, edits) and DataNode HDFS data? I would be very interested to know how snapshots are used in forum members' Hadoop environments.

Appreciate the feedback.


7 REPLIES

Super Guru

Snapshots capture metadata at a point in time, so you can recover the state of your cluster to a certain point if someone accidentally deletes something, or if after some other failure you would like to go back to that point. One thing to remember with snapshots is that they only capture state at a point in time; no data copying occurs when a snapshot is taken, so it is a pretty quick O(1) operation.

However, if you are keeping many snapshots, then when you delete data, Hadoop will see a snapshot still referencing that data and, instead of freeing the blocks, will keep the deleted files around (they remain accessible under the directory's .snapshot path). Many people are surprised that their disk space is not freeing up even though they have deleted a lot of data, and they don't even see the deleted files because those now live only under the snapshot path. The culprit in these cases is usually a snapshot.
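For reference, a minimal sketch of that workflow on the command line (the /data/sales path and the snapshot names below are just placeholders):

# allow snapshots on a directory (admin operation, done once)
hdfs dfsadmin -allowSnapshot /data/sales

# take a named snapshot; this is the quick, metadata-only O(1) step
hdfs dfs -createSnapshot /data/sales snap-20160101

# list what changed between two snapshots of the same directory
hdfs snapshotDiff /data/sales snap-20160101 snap-20160102

# delete a snapshot you no longer need so its blocks can actually be freed
hdfs dfs -deleteSnapshot /data/sales snap-20160101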

But, see, this is barely enough for true backups. Across the industry, the mechanism for backups and disaster recovery is replication (both for Hive and HDFS as well as for HBase). So what you should be looking at is replication, and then see whether taking snapshots also makes sense for your use cases (it's almost always useful to have some snapshots, and fewer snapshots means less disk space used).

To copy data between clusters, you will use distcp. Check the following link:

https://hadoop.apache.org/docs/r1.2.1/distcp2.html
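As an illustration, a typical distcp run from an active cluster to a backup cluster looks something like this (the NameNode hostnames and paths are made up):

# copy /data to the backup cluster, transferring only files that are new or have changed
hadoop distcp -update -p hdfs://active-nn.example.com:8020/data hdfs://backup-nn.example.com:8020/data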

If my answer helped, please accept.

Rising Star

Thanks for the insights on snapshots. Can I ask where you use snapshots, i.e. which files/directories you keep snapshots of?

>the mechanism for backups and disaster recovery is replication (both for Hive and HDFS as well as for HBase).

So by replication do you mean a distcp copy across two clusters? But distcp copies are not real time, so can they really be called replication?

I have read that Flume can be used to copy from a source to two different clusters, but even such a configuration would exist outside HDFS. Is this what you meant by replication?

But I guess by 'real time' I am talking about changes to data in HDFS being propagated as they happen, which is not really a design feature of Hadoop.

Appreciate the feedback.

Rising Star

I read that Hive can be backed up using snapshots in an incremental way. That is, take a snapshot of the Hive data at one point in time, then keep taking snapshots and use the snapshot diff feature to get the incrementals between the current snapshot and the previous one. Data could then be recovered by restoring the full snapshot plus the incrementals up to a point in time (like RDBMS recovery). Is this workable?

Appreciate the insights.

Rising Star

One related question here is the time it would take to copy a large file into HDFS. If that file is lost or corrupted in HDFS, is there any other efficient/quick way, besides snapshots, to get the data loaded back into HDFS (provided we have an OS-level copy of that file, of course)?

Super Guru

See my answers inline below:

--->So by replication do you mean a distcp copy across two clusters? But distcp copies are not real time, so can they really be called replication?

Yes. Replication can be enabled either by distcp or, if you are streaming data in with some ingestion tool like NiFi, by sending the data to both clusters (active and backup) at the same time, in near real time. In practice, and because of physics, you cannot have 100 percent sync between two clusters, so distcp is almost always the way to go.

--->I have read that Flume can be used to copy from a source to two different clusters, but even such a configuration would exist outside HDFS. Is this what you meant by replication?

I would prefer NiFi over Flume, but by replication my idea is more about using distcp. Two data centers that may be 1,000 miles apart cannot be kept in sync for every single second, and the cost and complexity of trying to achieve that is also very high. I have seen some very low latency telco use cases, and those are among the very few use cases where active-active is justified (remember, very complex). I highly recommend you read this.

---> I read that Hive can be backed up using snapshots in an incremental way. That is, take a snapshot of the Hive data at one point in time, then keep taking snapshots and use the snapshot diff feature to get the incrementals between the current snapshot and the previous one. Data could then be recovered by restoring the full snapshot plus the incrementals up to a point in time (like RDBMS recovery). Is this workable?

I am not aware of that method. There are two things you need to do for Hive replication (a rough sketch follows after the two steps below):

a. Replicate the metadata (use MySQL replication techniques, or the equivalent for whichever metastore database you are using).

b. Use distcp to replicate HDFS files.
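A minimal sketch of those two steps, assuming a MySQL metastore database named hive and a warehouse directory of /apps/hive/warehouse (both are assumptions; adjust for your environment):

# step a: dump the Hive metastore database (or set up MySQL master/slave replication instead)
mysqldump -u hive -p --single-transaction hive > hive_metastore_backup.sql

# step b: copy the warehouse data to the backup cluster with distcp
hadoop distcp -update /apps/hive/warehouse hdfs://backup-nn.example.com:8020/apps/hive/warehouse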

Rising Star

But distcp is only a point-in-time copy, right? You need to run the command every time you want a copy to be made. It is not automatic, so a change to an existing file, or a newly added file, is not copied across on its own. So I am not sure why you refer to it as replication, which normally means that any change in the data is reflected at the replica site automatically.

Appreciate the clarification.

Super Guru

It is point in time (the time at which distcp runs), and it can be automated using scripts. It is replication because you are replicating data; you are confusing replication with real-time replication. Replication doesn't have to be real time. And it is impossible, because of physics, for every change in one cluster (or any other database) to be reflected in the other instantly.
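For example, a very rough sketch of scripted distcp (the hostnames, paths, and schedule are placeholders):

#!/bin/bash
# replicate.sh - copy only new or changed files to the backup cluster
hadoop distcp -update hdfs://active-nn.example.com:8020/data hdfs://backup-nn.example.com:8020/data

# crontab entry to run it every hour:
# 0 * * * * /opt/scripts/replicate.sh >> /var/log/replicate.log 2>&1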

I'll give you the example of my HBase use case. We were using active-active replication. But even then, we knew there might be a situation where data is written to one data center and a power failure occurs while that data is still being replicated to the remote data center, so some of it never makes it across (let's say up to 10 seconds of data). The only other way to make sure that this does not happen is to "ack" a write only once the data has been written to the remote data center. That slows down every write, and we had tens of thousands of writes per second. See, you have to make a choice. If you want 100 percent sync, then you have to ack every single record at the remote site, slowing down all your writes. Or you can do asynchronous replication, which works 99.99% of the time, but in case of network issues between the two data centers you know that some data, some of the time, will not be replicated right away. There is absolutely nothing technology can do about that. This is simple physics.
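For context, setting up that kind of HBase replication looks roughly like the following; the exact shell syntax varies by HBase version, and the peer id, ZooKeeper quorum, table, and column family names here are all placeholders:

# run from the source cluster
hbase shell <<'EOF'
add_peer '1', CLUSTER_KEY => "zk1.backup.example.com,zk2.backup.example.com,zk3.backup.example.com:2181:/hbase"
alter 'mytable', {NAME => 'cf', REPLICATION_SCOPE => '1'}
EOF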