Is there a solution to make Hadoop continuously available across two or more datacenters?
Is there a solution for cross-datacenter replication? What latency should we expect? Is there any risk of data loss?
If we lose a whole datacenter, is it possible to switch to a second one? Is that automatic?
WANdisco: A third-party solution that replicates every file/block written to HDFS to the backup/DR cluster, essentially keeping two HDFS instances in sync. It has very low latency, but it does not take care of metadata such as Hive tables.
DistCp: The do-it-yourself option. Schedule DistCp jobs between the two clusters. You can combine this with HDFS snapshots so that the clusters are synced at fixed, atomic points in time. The risk is losing whatever was written since the last snapshot.
Falcon: A solution that uses Oozie and DistCp to keep clusters in sync. It can schedule file transfers and can also keep Hive tables in sync. Often the preferred solution, with the same data-loss caveat as DistCp.
Distributing one cluster across multiple datacenters: Don't do it, it does not work. The latency wreaks havoc on the HDFS processes.
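To make the DistCp-plus-snapshots option a bit more concrete, here is a minimal sketch of one incremental sync cycle. It only builds the command lines and does not run them. The cluster URIs, paths, and snapshot names are made up for illustration, and it assumes both directories were already made snapshottable (`hdfs dfsadmin -allowSnapshot`), that an initial full copy was done, and that the target has not been modified since the previous sync. Treat it as a starting point, not a tested DR procedure.

```python
from typing import List


def snapshot_sync_commands(
    src_dir: str,
    dst_dir: str,
    prev_snapshot: str,
    new_snapshot: str,
) -> List[List[str]]:
    """Build the commands for one incremental, snapshot-based sync cycle.

    Nothing is executed here; the caller decides how to run the commands
    (for example with subprocess.run on a node that has the Hadoop client).
    """
    return [
        # 1. Freeze a consistent view of the source directory.
        ["hdfs", "dfs", "-createSnapshot", src_dir, new_snapshot],
        # 2. Copy only what changed between the two source snapshots.
        ["hadoop", "distcp", "-update", "-diff",
         prev_snapshot, new_snapshot, src_dir, dst_dir],
        # 3. Snapshot the target under the same name so the next cycle
        #    has a matching baseline on both sides.
        ["hdfs", "dfs", "-createSnapshot", dst_dir, new_snapshot],
    ]


if __name__ == "__main__":
    # Hypothetical NameNode addresses and snapshot names.
    for cmd in snapshot_sync_commands(
        "hdfs://prod-nn:8020/data/events",
        "hdfs://dr-nn:8020/data/events",
        "s20240101",
        "s20240102",
    ):
        print(" ".join(cmd))
```

Anything written to the source after step 1 only reaches the DR cluster in the next cycle, which is the data-loss window mentioned above.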
In addition, we often have streaming input data that is normally buffered in Kafka. For that there is a tool called MirrorMaker that syncs Kafka topics across datacenters.
Once a datacenter dies, it is up to you to do the failover. Nothing in the product does that automatically, since it depends on the clients. Customers often use load balancers for this.
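If you don't have a load balancer in front, the client-side part of a failover can be as simple as probing the clusters in priority order. The sketch below is only an illustration of that idea: the hostnames and port are placeholders, and a successful TCP connect says nothing about whether the DR copy is up to date, so a real setup would want a stronger health check.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]


def tcp_reachable(endpoint: Endpoint, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection(endpoint, timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False


def pick_active(
    endpoints: Sequence[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_reachable,
) -> Optional[Endpoint]:
    """Return the first endpoint, in priority order, that passes the probe."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None  # Neither datacenter is reachable.
```

A client would call something like `pick_active([("prod-nn.example.com", 8020), ("dr-nn.example.com", 8020)])` before connecting, with the primary listed first.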
Thanks for replying. Kafka seems to be the best solution to avoid data loss and reduce replication delay.
I also heard about "Asynchronous Rack Awareness" for replicating data across datacenters, but I'm really not sure about it. Do you have any information about this feature?
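Even with Kafka you will have some data loss, since MirrorMaker is just a consumer of the topic: whatever it has not yet consumed when the source datacenter dies is missing on the other side. But it will be very little.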
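To put a number on "very little", you can compare the source topic's end offsets with the offsets the mirror's consumer group has committed. The function below is a rough sketch of that arithmetic only. How you fetch the offsets is up to you, the example numbers are invented, and treating a partition with no committed offset as 0 is a simplifying assumption.

```python
from typing import Dict


def unmirrored_messages(
    source_end_offsets: Dict[int, int],
    mirror_committed_offsets: Dict[int, int],
) -> Dict[int, int]:
    """Per partition: messages written to the source but not yet
    committed by the mirroring consumer group."""
    lag = {}
    for partition, end_offset in source_end_offsets.items():
        # Assumption: no committed offset means nothing was mirrored yet.
        committed = mirror_committed_offsets.get(partition, 0)
        lag[partition] = max(end_offset - committed, 0)
    return lag


if __name__ == "__main__":
    # Invented offsets for a two-partition topic.
    lag = unmirrored_messages({0: 1000, 1: 500}, {0: 990, 1: 500})
    print(lag, "total at risk:", sum(lag.values()))
```

The total is roughly what you would lose if the source datacenter went down at that moment, so it is a useful number to monitor.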
I have never heard of "asynchronous" rack awareness. If you have a link to that, let me know. AFAIK no customer who tried to spread a cluster across datacenters has been happy. HDFS is built on low latency between the DataNodes and the NameNode (there is a LOT of communication between them), so having DataNodes in a different datacenter will be bad for performance.
But if you have some links on that feature, perhaps we can drill down a bit.