Expert Contributor
Posts: 108
Registered: ‎02-15-2016

Hadoop Cluster cold backup to S3

Hi,

I want to back up the cluster to S3 and then wipe out the cluster, and maybe spin it up again later.

What is the best approach for taking the backup to S3?

1- Copy the entire HDFS content to S3 (in a single bucket, or create multiple buckets)?

2- Do I need to take an HBase snapshot along with copying /hbase to S3, or will either one work?

3- Encryption at rest is enabled. Are there any special considerations when moving the backup to S3?

4- How will the restore from S3 work? Just copy back to HDFS?

Posts: 7
Registered: ‎02-19-2018

Re: Hadoop Cluster cold backup to S3

Hello MSharma. 


I hope you have solved your problem by now but here are some thoughts.


Is your data entirely HBase? I think that makes things more difficult, and it is outside of my expertise. I think you need to look into procedures for backing up an HBase database. It is almost irrelevant that you are using S3: the problem you face would be the same no matter what the backup medium is.


Normally, for most files and Hive tables, I would lift and shift: read from HDFS and copy to S3. If you have "at rest" encryption, then I would expect the reading process to decrypt the encrypted HDFS blocks, and you could use server-side encryption on the S3 bucket instead. (Test this out first so you are comfortable with it.) You would keep the data files but lose any HDFS block information. Restoring those files would mean writing them into your cluster again as if they were brand new.
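
As a rough illustration of the lift and shift, here is a minimal Python sketch that only builds (and prints) a distcp command line. The bucket name, paths, and map count are made up for the example. It assumes the s3a connector is configured on the cluster and that your Hadoop version honours the fs.s3a.server-side-encryption-algorithm property; check both before relying on it.

```python
import shlex

def distcp_to_s3_cmd(src, bucket, prefix, sse_algorithm="AES256", max_maps=20):
    """Build a `hadoop distcp` argv list for copying an HDFS path to S3.

    bucket, prefix and max_maps are illustrative values, not recommendations.
    """
    dest = f"s3a://{bucket}/{prefix.strip('/')}"
    return [
        "hadoop", "distcp",
        # Assumption: ask S3 to encrypt objects server-side on write.
        f"-Dfs.s3a.server-side-encryption-algorithm={sse_algorithm}",
        "-update",            # skip files already present at the target on re-runs
        "-m", str(max_maps),  # cap the number of parallel copy tasks
        src, dest,
    ]

# Hypothetical source path and bucket, for illustration only.
cmd = distcp_to_s3_cmd("hdfs:///data/warehouse", "my-backup-bucket", "cluster1/warehouse")
print(shlex.join(cmd))
```

Pasting the printed line into a shell on an edge node (or passing the list to subprocess.run) would start the copy. Try it on a small directory first and confirm the objects in S3 show up as encrypted.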


If this data is not being updated, though, you might consider keeping it in S3 and reading it directly through the S3A connector (fs.s3a).


I hope that helps, but I am sorry I don't know how to back up HBase.

Expert Contributor
Posts: 108
Registered: ‎02-15-2016

Re: Hadoop Cluster cold backup to S3

Thanks Alex.
Yeah, I used distcp for the HBase files as well and copied them to S3. I was also able to restore that backup to another cluster, but you need to repair meta offline. My requirement was to copy everything to S3, shut down the EC2 instances, and then later restore the data from the S3 backup if required.
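
For anyone following along, the restore can be sketched roughly as below. This is only a sketch under assumptions: the bucket and prefix are hypothetical, HBase is stopped on the target cluster, the target root directory does not exist yet (otherwise distcp nests the source directory inside it), and the HBase version still ships the OfflineMetaRepair tool (HBase 1.x does). The script just builds the two command lines and prints them.

```python
import shlex

def restore_hbase_cmds(bucket, prefix, hbase_root="hdfs:///hbase"):
    """Return the two argv lists for a file-level HBase restore from S3.

    bucket and prefix are illustrative; hbase_root should match
    hbase.rootdir on the target cluster.
    """
    src = f"s3a://{bucket}/{prefix.strip('/')}"
    return [
        # Step 1: copy the backed-up HBase files from S3 into HDFS.
        ["hadoop", "distcp", src, hbase_root],
        # Step 2: rebuild hbase:meta from the region directories while
        # HBase is offline (assumes an HBase version that has this tool).
        ["hbase", "org.apache.hadoop.hbase.util.hbck.OfflineMetaRepair"],
    ]

# Hypothetical bucket and prefix, for illustration only.
cmds = restore_hbase_cmds("my-backup-bucket", "cluster1/hbase")
for c in cmds:
    print(shlex.join(c))
```

Run the steps in order, then start HBase and check that the tables and regions come online.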

Exporting the HBase tables is also a good option, but then you need to have enough space on /tmp, since Export copies the tables locally before copying them to S3. We ran out of space with this method, so we decided to copy the files instead.
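
If you do go the Export route, a small pre-flight check on free space can save a failed run. The Python sketch below assumes that /tmp here means local disk on the node running the job, and that you already have a rough size estimate for the table. The table name, output directory, and 20% headroom are made-up values for illustration.

```python
import shutil

def enough_local_space(path, needed_bytes, headroom=1.2):
    """Return True if `path` has at least needed_bytes * headroom free."""
    free = shutil.disk_usage(path).free
    return free >= needed_bytes * headroom

def hbase_export_cmd(table, output_dir):
    """Build the argv list for the HBase MapReduce Export tool."""
    return ["hbase", "org.apache.hadoop.hbase.mapreduce.Export", table, output_dir]

# Hypothetical table and output directory, for illustration only.
export_cmd = hbase_export_cmd("my_table", "/tmp/hbase-export/my_table")
print(" ".join(export_cmd))
```

Calling enough_local_space("/tmp", estimated_table_bytes) before launching the export tells you whether the staging area is likely to fill up, which is the problem we hit.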