Created on 04-26-2016 03:36 PM - edited 09-16-2022 03:15 AM
Hi
What are my options for HDFS replication in a DR scenario? What are the pros and cons of each option?
Thanks
Created 04-26-2016 03:59 PM
Hi @David Lays
You have mainly two high level approaches for data replication:
I hope this helps
Created 04-26-2016 03:55 PM
The two main options for replicating the HDFS structure are Falcon and distcp. The distcp command is not very feature rich: you give it a path in the HDFS structure and a destination cluster, and it copies everything to the same path on the destination. If the copy fails, you have to start it again, and so on.
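A minimal sketch of what that looks like, with placeholder NameNode hosts (prod-nn, dr-nn) and an example path:

```bash
# Copy /data/events from the production cluster to the same path on the
# DR cluster. distcp runs as a MapReduce job on the cluster you launch it from.
hadoop distcp hdfs://prod-nn:8020/data/events hdfs://dr-nn:8020/data/events

# Re-running with -update copies only files whose size differs on the target,
# which is how repeated runs pick up newly arrived data.
hadoop distcp -update hdfs://prod-nn:8020/data/events hdfs://dr-nn:8020/data/events
```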
Another method for maintaining a replica of your HDFS structure is Falcon. There are more data movement options and you can more effectively manage the lifecycle of all of the data on both sides.
If you're moving Hive table structures, there is some added complexity in making sure the tables are created on the DR side, but moving the actual files works the same way.
Created 04-26-2016 03:57 PM
The main options are (I know this has been answered before, but it's a bit hard to find things):
a) WANdisco
This essentially duplicates HDFS across two clusters by hooking into the NameNode and mirroring every command and block change to the DR cluster. It is the only solution here that actually gives you a transactionally consistent duplicate HDFS, i.e. no data loss between the two clusters for committed transactions.
+ Immediate replication for HDFS
+ Only transactionally safe approach
- Additional cost
- Additional servers
- may have compatibility issues with some HDFS versions
- doesn't cover anything above HDFS (Hive tables, Oozie jobs, ...); for Hive, for example, the data alone without the Metastore metadata doesn't give you a working system
b) DistCP in Oozie
A relatively manual approach, but distcp provides some nice features for mirroring folders. You can schedule it yourself (Hadoop is often built around batch jobs with timestamped folders, so you can simply add a distcp action after the ingestion step). It also provides a delta load for a folder (going by file size, so there is a small danger of data loss if you modified a file without changing its byte size). If you want to move a consistent snapshot over, you can use the HDFS snapshot feature; see the sketch after this list.
+ Robust and good enough, provided the operator knows what they are doing
- Kerberos setup can be a bit of a pain (same REALM helps)
- Relatively manual.
- no transactional duplication
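As mentioned above, a sketch of the snapshot-based delta copy (the directory, snapshot names, and NameNode hosts are placeholders; check the distcp documentation for the exact preconditions on the target):

```bash
# One-time: make the source directory snapshottable (HDFS admin).
hdfs dfsadmin -allowSnapshot /data/events

# Snapshot before and after a batch of ingestion.
hdfs dfs -createSnapshot /data/events s1
# ... ingestion runs ...
hdfs dfs -createSnapshot /data/events s2

# Copy only the changes between the two snapshots to the DR cluster.
# The target must still be unchanged relative to snapshot s1.
hadoop distcp -update -diff s1 s2 \
  hdfs://prod-nn:8020/data/events hdfs://dr-nn:8020/data/events
```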
c) Falcon DR
Built on Oozie, but does a couple of things out of the box that you would otherwise need to configure in Oozie yourself. For example, it can mirror Hive tables (it watches for partitions being added). It also keeps track of the different servers, so it has built-in concepts that you would need to hard-code in Oozie; see the CLI sketch after this list.
+ Can mirror Hive and HDFS
+ some nice multi-cluster concepts
- same downsides as Oozie really
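A rough sketch of driving this from the Falcon CLI, assuming you have already written the cluster and feed entity definitions as XML (file and entity names here are hypothetical):

```bash
# Register both clusters with Falcon; the XML files describe each cluster's
# NameNode, Oozie, and messaging endpoints.
falcon entity -type cluster -submit -file primary-cluster.xml
falcon entity -type cluster -submit -file dr-cluster.xml

# Submit and schedule a feed that is defined on both clusters; Falcon then
# replicates new feed instances (e.g. partitions) from source to target.
falcon entity -type feed -submit -file events-feed.xml
falcon entity -type feed -schedule -name events-feed
```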
d) Duplicate your data ingestion into both clusters (run two hot clusters)
This approach is perhaps the safest one: once you have it running, both clusters keep working, so in case of a failover you KNOW that your DR cluster is actually up to the job.
+ Will cover everything on top of HDFS, like Hive and Oozie
+ Will continuously test the DR cluster
- you might end up with non-identical clusters if your code promotion is faulty
- additional duplicated work (the data transformations also run in the DR cluster)
e) Kafka MirrorMaker
If your data sources are mostly real-time, you will normally have a Kafka ingestion layer. You can use MirrorMaker to duplicate Kafka topics across the two clusters.
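For example, with the MirrorMaker tool that ships with Apache Kafka (the property file names and topic whitelist below are placeholders):

```bash
# The consumer config points at the source cluster, the producer config at
# the DR cluster; topics matching the whitelist are continuously mirrored.
kafka-mirror-maker.sh \
  --consumer.config source-cluster-consumer.properties \
  --producer.config dr-cluster-producer.properties \
  --whitelist "ingest.*" \
  --num.streams 4
```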
Created 04-27-2016 01:40 AM
Thanks for this information. Right now I am working at the architectural level, and these details will be useful for the next steps.
Created 10-04-2019 02:46 PM
Hello, I'm looking at your answer 3 years later because I'm in a similar situation :). In my company (a telco) we're planning to use two hot clusters with dual ingest because our RTO is demanding, and we're looking for mechanisms to monitor both clusters and keep them in sync. We ingest data in real time with Kafka + Spark Streaming, load it into HDFS, and consume it with Hive/Impala. As a first approach, I'm thinking of running simple counts on the Hive/Impala tables on both clusters every hour or half hour and comparing the results. If something is missing in one of the clusters, we will have to "manually" re-ingest the missing data (or copy it with Cloudera BDR from one cluster to the other) and re-process the enriched data. I'm wondering if you have dealt with similar scenarios, or have any suggestions. Thanks in advance!
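A minimal sketch of that kind of check, assuming HiveServer2 is reachable on both clusters and using a hypothetical partitioned table (the JDBC URLs, table, and partition column are placeholders):

```bash
#!/usr/bin/env bash
# Compare the row count of one partition on both clusters and flag mismatches.
TABLE="events"
PART="2019-10-04-14"   # the hour being checked

count() {
  beeline -u "$1" --silent=true --outputformat=tsv2 \
    -e "SELECT COUNT(*) FROM ${TABLE} WHERE hour_partition = '${PART}'" | tail -1
}

PRIMARY=$(count "jdbc:hive2://primary-hs2:10000/default")
DR=$(count "jdbc:hive2://dr-hs2:10000/default")

if [ "$PRIMARY" != "$DR" ]; then
  echo "MISMATCH ${TABLE}/${PART}: primary=${PRIMARY} dr=${DR}"
  # here: alert, re-ingest the partition, or trigger a BDR copy
fi
```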
Created 04-27-2016 01:35 AM
Thanks. That's what I'm looking for.
Created 01-03-2018 03:37 PM
Can you point me to documentation on setting up NiFi real-time replication? Thanks.