<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Best practice for data replication/sync between two data centers in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5061#M29629</link>
    <description>&lt;P&gt;Cloudera Enterprise offers a backup and disaster recovery (BDR) tool which handles HDFS replication and other mechanisms like what you are seeking. &amp;nbsp;I also wrote &lt;A href="http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/" target="_self"&gt;this blog entry&lt;/A&gt; regarding the different mechanisms that are available for HBase backup and disaster recovery. &amp;nbsp;You didn't specify if you were using HBase, but that might help.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Some customers set up their user applications such that the data is written simultaneously to two clusters. &amp;nbsp;This is a cheap form of replication. &amp;nbsp;All data is written to cluster A and cluster B up front. &amp;nbsp;You will have to write this code yourself and also make it fault tolerant, etc.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To answer your other questions, I would definitely recommend you have two independent clusters. &amp;nbsp;One cluster spanning a WAN will not work very well, if at all.&lt;/P&gt;</description>
    <pubDate>Mon, 20 Jan 2014 21:04:10 GMT</pubDate>
    <dc:creator>Clint</dc:creator>
    <dc:date>2014-01-20T21:04:10Z</dc:date>
    <item>
      <title>Best practice for data replication/sync between two data centers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5041#M29628</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;given two data centers and the requirement that a cluster survive the failure of a whole data center, what would be the preferred setup?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;a) ONE Hadoop cluster spanning both data centers, or&lt;/P&gt;&lt;P&gt;b) TWO independent Hadoop clusters with (somehow) synced data&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Questions:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;SPAN style="line-height: 14px;"&gt;It seems obvious that for option a) the interconnection between the data centers needs to be very good, at least 1 Gbit/s?&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN style="line-height: 14px;"&gt;Is it possible to configure Hadoop to replicate blocks to different data centers, in preference to replicating to different racks via the rack topology script?&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN style="line-height: 14px;"&gt;If option b) is chosen, how can automatic, continuous data replication between the two clusters be established (are there tools for this)?&lt;/SPAN&gt;&lt;/LI&gt;&lt;LI&gt;&lt;SPAN style="line-height: 14px;"&gt;What are the main considerations and recommendations for the requirement mentioned at the start?&lt;/SPAN&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;many thanks in advance...Gerd...&lt;/P&gt;</description>
      <pubDate>Fri, 16 Sep 2022 08:52:39 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5041#M29628</guid>
      <dc:creator>geko</dc:creator>
      <dc:date>2022-09-16T08:52:39Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice for data replication/sync between two data centers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5061#M29629</link>
      <description>&lt;P&gt;Cloudera Enterprise offers a backup and disaster recovery (BDR) tool which handles HDFS replication and other mechanisms like what you are seeking. &amp;nbsp;I also wrote &lt;A href="http://blog.cloudera.com/blog/2013/11/approaches-to-backup-and-disaster-recovery-in-hbase/" target="_self"&gt;this blog entry&lt;/A&gt; regarding the different mechanisms that are available for HBase backup and disaster recovery. &amp;nbsp;You didn't specify if you were using HBase, but that might help.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Some customers set up their user applications such that the data is written simultaneously to two clusters. &amp;nbsp;This is a cheap form of replication. &amp;nbsp;All data is written to cluster A and cluster B up front. &amp;nbsp;You will have to write this code yourself and also make it fault tolerant, etc.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To answer your other questions, I would definitely recommend you have two independent clusters. &amp;nbsp;One cluster spanning a WAN will not work very well, if at all.&lt;/P&gt;</description>
      <pubDate>Mon, 20 Jan 2014 21:04:10 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5061#M29629</guid>
      <dc:creator>Clint</dc:creator>
      <dc:date>2014-01-20T21:04:10Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice for data replication/sync between two data centers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5085#M29630</link>
      <description>&lt;P&gt;Hi Clint,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;many thanks for your very helpful answer and the brilliant blog post about HBase replication.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;There's just one more question:&lt;/P&gt;&lt;P&gt;If Cloudera Enterprise is not an option ($$$) and the synchronisation needs to be done on the storage layer, is running distcp repeatedly an appropriate low-cost solution, or how would you tackle this problem?&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;br...: Gerd :....&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2014 08:00:01 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5085#M29630</guid>
      <dc:creator>geko</dc:creator>
      <dc:date>2014-01-21T08:00:01Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice for data replication/sync between two data centers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5103#M29631</link>
      <description>&lt;P&gt;Yes, DistCp is usually what people use for that. &amp;nbsp;It has rudimentary functionality for syncing data between clusters. &amp;nbsp;However, in a very busy cluster where files are frequently added or deleted, or where other data is changing, replicating those changes between clusters will require custom logic on top of HDFS. &amp;nbsp;Facebook developed &lt;A href="http://www.facebook.com/notes/paul-yang/moving-an-elephant-large-scale-hadoop-data-migration-at-facebook/10150246275318920" target="_self"&gt;their own replication layer&lt;/A&gt;, but it is proprietary to their engineering department.&lt;/P&gt;</description>
      <pubDate>Tue, 21 Jan 2014 17:30:22 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5103#M29631</guid>
      <dc:creator>Clint</dc:creator>
      <dc:date>2014-01-21T17:30:22Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice for data replication/sync between two data centers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5207#M29632</link>
      <description>&lt;P&gt;Clint, thank you very much.&lt;/P&gt;</description>
      <pubDate>Thu, 23 Jan 2014 20:41:57 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/5207#M29632</guid>
      <dc:creator>geko</dc:creator>
      <dc:date>2014-01-23T20:41:57Z</dc:date>
    </item>
    <item>
      <title>Re: Best practice for data replication/sync between two data centers</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/42057#M29633</link>
      <description>&lt;P&gt;Can we monitor the NameNode edit logs and use them to trigger file copies continuously from one cluster to another?&lt;/P&gt;</description>
      <pubDate>Fri, 17 Jun 2016 03:05:41 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Best-practice-for-data-replication-sync-between-two-data/m-p/42057#M29633</guid>
      <dc:creator>karthikmr</dc:creator>
      <dc:date>2016-06-17T03:05:41Z</dc:date>
    </item>
  </channel>
</rss>

