What is a suggested offsite/cold backup method for HDFS, besides AWS S3?

Expert Contributor
 
1 ACCEPTED SOLUTION

Contributor

@Cassandra,

Ideally, you don't need to back up HDFS since it stores 3 copies by default. If you need a DR strategy, a good option is a separate cluster in another datacenter. Use Apache Falcon or distcp to mirror the data to the DR cluster. If you need to back up certain high-value datasets, take a snapshot of the data and back it up to tape (ugh!) or put it on your corporate SAN/NAS (if permitted). This will give you a way to recover the data if disaster strikes. I don't know if you are averse to cloud storage (based on your S3 comment), but it is cheap and always online, so you can recover data whenever needed.
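
As a rough illustration of the snapshot-then-distcp approach described above, here is a minimal sketch that only builds the command lines, so it can be reviewed before anything touches a cluster. The directory `/data/sales`, the snapshot name, and the DR NameNode URI `hdfs://dr-nn:8020` are hypothetical placeholders, and the function name `build_backup_commands` is made up for this example. It assumes the `hdfs` and `hadoop` CLIs are on the PATH of whatever host eventually runs the commands.

```python
import shlex


def build_backup_commands(src_dir, snapshot_name, dr_uri):
    """Return argv lists that snapshot src_dir and mirror the snapshot to dr_uri.

    All paths and names passed in are placeholders chosen by the caller.
    """
    # Snapshots are exposed read-only under <dir>/.snapshot/<name>.
    snapshot_path = f"{src_dir.rstrip('/')}/.snapshot/{snapshot_name}"
    return [
        # One-time step, needs HDFS superuser rights: allow snapshots on the directory.
        ["hdfs", "dfsadmin", "-allowSnapshot", src_dir],
        # Take a point-in-time snapshot of the directory.
        ["hdfs", "dfs", "-createSnapshot", src_dir, snapshot_name],
        # Copy the frozen snapshot contents to the DR cluster.
        # -update skips files already present and unchanged; -p preserves file attributes.
        ["hadoop", "distcp", "-update", "-p", snapshot_path, dr_uri],
    ]


if __name__ == "__main__":
    commands = build_backup_commands(
        "/data/sales",                      # hypothetical source directory
        "backup-2024-01-01",                # hypothetical snapshot name
        "hdfs://dr-nn:8020/backups/sales",  # hypothetical DR cluster target
    )
    for argv in commands:
        # Print instead of executing, so this is a dry run.
        print(shlex.join(argv))
```

To actually execute the commands, each list could be passed to `subprocess.run(argv, check=True)`. Copying from the snapshot path rather than the live directory means distcp reads a consistent view even if jobs keep writing to the source.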

I hope this helps,

Eric


3 REPLIES

Super Collaborator

"you don't need to back up HDFS since it stores 3 copies by default": IMHO, we need to be careful with that message. Having replicas doesn't protect us against human error or a rogue administrator (hdfs dfs -rmr /), nor against an application bug.

It's just like RAID1: it's good but no IT department would consider that it serves as a backup.

Master Mentor

@Cassandra

HDFS snapshots, HBase snapshots, and Hive metadata (a DBA can set this up based on the database flavor used for HCatalog).
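
For the HDFS snapshot part, recovering a file after an accidental delete might look roughly like the sketch below. It only builds the commands. The directory, snapshot name, and file path are hypothetical, as is the helper name `build_restore_commands`, and it assumes a snapshot of the directory was taken before the deletion. Keep in mind that snapshots live in the same cluster, so they help with human error but not with losing the cluster itself.

```python
import shlex


def build_restore_commands(snapshottable_dir, snapshot_name, relative_path):
    """Return argv lists that list the available snapshots and restore one file.

    relative_path is the file's path relative to snapshottable_dir.
    """
    base = snapshottable_dir.rstrip("/")
    rel = relative_path.strip("/")
    source = f"{base}/.snapshot/{snapshot_name}/{rel}"  # read-only copy in the snapshot
    target = f"{base}/{rel}"                            # original location of the file
    return [
        # Show which snapshots exist for the directory.
        ["hdfs", "dfs", "-ls", f"{base}/.snapshot"],
        # Copy the file back; -ptopax preserves timestamps, ownership,
        # permissions, ACLs, and extended attributes.
        ["hdfs", "dfs", "-cp", "-ptopax", source, target],
    ]


if __name__ == "__main__":
    # Hypothetical directory, snapshot name, and file.
    for argv in build_restore_commands("/data/sales", "backup-2024-01-01", "2024/part-00000"):
        print(shlex.join(argv))  # dry run: print rather than execute
```

The copy assumes the parent directory of the target still exists. If a whole subtree was removed, it would have to be recreated first or the copy done at directory level.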

Going back to your original question

This is helpful for understanding the architecture. We can point it to a DR cluster (on-prem or in the cloud), as Eric mentioned.