Support Questions

david_lays · ‎04-26-2016

Hi

What are my options for HDFS replication in a DR scenario ? What are the pros and cons of each option ?

Thanks

ahadjidj · ‎04-26-2016

Hi @David Lays

You have mainly two high level approaches for data replication:

Replication in Y (Teeing): in this scenario you do replication at ingestion time. Each new data is stored in primary and DR clusters. NiFi is great for this double ingestion. The pro of this method is that you have data immediately in both clusters. The cons is that you have only the raw data and not processing results. If you want to get the same result on the DR cluster, you need to do the same processing in the DR cluster.
Replication in L (copying): in this scenario you ingest data at the primary cluster and later copy it to the DR cluster. Tools like Distcp or Falcon can be used to implement this. The pro is that you can replicate raw data and processing results in the same process. The cons is that the DR cluster is lagging behind n-in terms of data. The replication is usually scheduled and if you cluster goes down between you will loose data generated (ingested or computed) since the last replication.

I hope this helps

View solution in original post

emaxwell · ‎04-26-2016

@David Lays

The two main options for replicating the HDFS structure are Falcon and distcp. The distcp command is not very feature rich, you give it a path in the HDFS structure and a destination cluster and it will copy everything to the same path on the destination. If the copy fails, you will need to start it again, etc.

Another method for maintaining a replica of your HDFS structure is Falcon. There are more data movement options and you can more effectively manage the lifecycle of all of the data on both sides.

If you're moving Hive table structures, there is some more complexity to making sure the tables are created on the DR side, but moving the actual files is done the same way

bleonhardi · ‎04-26-2016

The main options are ( I know it has been answered before but its a bit hard finding things )

a) Wandisco

This essentially duplicates HDFS across 2 clusters by hooking into the Namenode and mirroring any command and block change to the DR cluster. It is the only solution here that actually allows for a transactionally duplicate HR system i.e. no data loss for two clusters for all committed transactions

+ Immediate replication for HDFS

+ Only transactionally safe approach

- Additional cost

- Additional servers

- may have some compatibility issues for HDFS

- doesn't really fix anything above HDFS ( hive tables, oozie jobs, .... ) for example for hive the data on its own without the metadata of the Metastore doesn't give you a working system on its own

b) DistCP in Oozie

Relatively manual approach but Distcp provides some nice features to mirror folders. You can either do it manually ( Hadoop often is build on batch jobs with timestamped folders so you can simply add another action for distcp after the ingestion. It also provides a deltaload for a folder ( going by file size so there is a small danger of data loss if you modified a file with the same byte size ). If you want to move a snapshot over you can use HDFS snapshot feature.

+ Robust and fair enough in case operator knows what he is doing

- Kerberos setup can be a bit of a pain ( same REALM helps )

- Relatively manual.

- no transactional duplication

c) Falcon DR

Built on Oozie but does a couple things out of the box that need to be configured in Oozie. For example it can mirror Hive tables ( looks for partitions being added ) It also keeps track of different servers so it has built in concepts that you would need to hard code in Oozie.

+ Can mirror Hive and HDFS

+ some nice multi cluster concepts

- same downsides as Oozie really

d) Duplicate your data ingestion in both clusters ( Have two hot clusters )

This approach is perhaps the safest one since once you have it going both clusters will continue working so in case of a failover you KNOW that your DR cluster will actually be up for the job.

+ Will cover everything on top of HDFS like Hive and oozie

+ Will continuously test the DR cluster

- you might end up with non identical clusters if your code promotion is faulty

- additional unnecessary work ( data transformations in DR cluster

e) Kafka Mirror Maker

If you have mostly realtime datasources you will normally have a Kafka ingestion layer. You can use MirrorMaker to duplicate Kafka topics across two clusters.

david_lays · ‎04-27-2016

Thanks for these information. Right now I am working on the architectural level and these details will be useful for further steps

josemartinezs · ‎10-04-2019

Hello, I'm looking your answer 3 years later because I'm in a similar situation :). In my company (telco) we're planning using 2 hot clusters with dual ingest because our RTO is demanding and we're looking for mechanisms to monitor and keep in sync both clusters. We ingest data in real-time with kafka + spark streaming, loading to HDFS and consuming with Hive/Impala. I'm thinking about a first approach making simple counts with Hive/Impala tables on both clusters each hour/half hour and comparing. If something is missing in one of the clusters, we will have to "manually" re-ingest the missing data (or copy it with cloudera BDR from one cluster to the other) and re-process enriched data. I'm wondering if have you dealt with similar scenarios or suggestions you may have. Thanks in advance!

ahadjidj · ‎04-26-2016

Hi @David Lays

You have mainly two high level approaches for data replication:

Replication in Y (Teeing): in this scenario you do replication at ingestion time. Each new data is stored in primary and DR clusters. NiFi is great for this double ingestion. The pro of this method is that you have data immediately in both clusters. The cons is that you have only the raw data and not processing results. If you want to get the same result on the DR cluster, you need to do the same processing in the DR cluster.
Replication in L (copying): in this scenario you ingest data at the primary cluster and later copy it to the DR cluster. Tools like Distcp or Falcon can be used to implement this. The pro is that you can replicate raw data and processing results in the same process. The cons is that the DR cluster is lagging behind n-in terms of data. The replication is usually scheduled and if you cluster goes down between you will loose data generated (ingested or computed) since the last replication.

I hope this helps

david_lays · ‎04-27-2016

Thanks. That's what I'm looking for.

darindenio · ‎01-03-2018

Hi @Abdelkrim Hadjidj

Can you point me to documentation on setting up nifi RT replication? Thanks.

Cloudera Community

Support Questions

HDFS replication for DR