<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Re: Data transfer between two clusters in Support Questions</title>
    <link>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115278#M78072</link>
    <description>&lt;P&gt;If you have the same components on the target cluster, it's fine.&lt;/P&gt;&lt;P&gt;1) DistCp is one of the best options for transferring HDFS data between clusters:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_Sys_Admin_Guides/content/using_distcp.html" target="_blank"&gt;https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_Sys_Admin_Guides/content/using_distcp.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;2) For HBase, follow the links below; there are two methods:&lt;/P&gt;&lt;P&gt;&lt;A href="http://hbase.apache.org/book.html#ops.backup" target="_blank"&gt;http://hbase.apache.org/book.html#ops.backup&lt;/A&gt; :- using DistCp for a full dump&lt;/P&gt;&lt;P&gt;&lt;A href="https://hbase.apache.org/book.html#ops.snapshots" target="_blank"&gt;https://hbase.apache.org/book.html#ops.snapshots&lt;/A&gt; :- HBase snapshots&lt;/P&gt;&lt;P&gt;3) Hive Metastore: check which type of database it is using (e.g. MySQL) and take a full export of that database as below.&lt;/P&gt;&lt;P&gt;For a full dump, you can use the "root" user:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;mysqldump -u [username] -p [dbname] &amp;gt; filename.sql&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;And if you wish to compress it at the same time:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;mysqldump -u [username] -p [dbname] | gzip &amp;gt; filename.sql.gz&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;You can then move this file between servers with:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;scp user@xxx.xxx.xxx.xxx:/path_to_your_dump/filename.sql.gz your_destination_path/&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Once copied, import all objects into the MySQL database on the new cluster and start the Hive server, e.g.:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;gunzip filename.sql.gz
mysql -u [username] -p [dbname] &amp;lt; filename.sql&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Tue, 23 Aug 2016 09:26:47 GMT</pubDate>
    <dc:creator>shivkumar82015</dc:creator>
    <dc:date>2016-08-23T09:26:47Z</dc:date>
    <item>
      <title>Data transfer between two clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115277#M78071</link>
      <description>&lt;P&gt;What are the different options to transfer data from old cluster to new one. (HDFS/Hive/HBase) ? &lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 08:29:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115277#M78071</guid>
      <dc:creator>mpandit</dc:creator>
      <dc:date>2016-08-23T08:29:46Z</dc:date>
    </item>
    <item>
      <title>Re: Data transfer between two clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115278#M78072</link>
      <description>&lt;P&gt;If you have the same components on the target cluster, it's fine.&lt;/P&gt;&lt;P&gt;1) DistCp is one of the best options for transferring HDFS data between clusters:&lt;/P&gt;&lt;P&gt;&lt;A href="https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_Sys_Admin_Guides/content/using_distcp.html" target="_blank"&gt;https://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.3.6/bk_Sys_Admin_Guides/content/using_distcp.html&lt;/A&gt;&lt;/P&gt;&lt;P&gt;2) For HBase, follow the links below; there are two methods:&lt;/P&gt;&lt;P&gt;&lt;A href="http://hbase.apache.org/book.html#ops.backup" target="_blank"&gt;http://hbase.apache.org/book.html#ops.backup&lt;/A&gt; :- using DistCp for a full dump&lt;/P&gt;&lt;P&gt;&lt;A href="https://hbase.apache.org/book.html#ops.snapshots" target="_blank"&gt;https://hbase.apache.org/book.html#ops.snapshots&lt;/A&gt; :- HBase snapshots&lt;/P&gt;&lt;P&gt;3) Hive Metastore: check which type of database it is using (e.g. MySQL) and take a full export of that database as below.&lt;/P&gt;&lt;P&gt;For a full dump, you can use the "root" user:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;mysqldump -u [username] -p [dbname] &amp;gt; filename.sql&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;And if you wish to compress it at the same time:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;mysqldump -u [username] -p [dbname] | gzip &amp;gt; filename.sql.gz&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;You can then move this file between servers with:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;scp user@xxx.xxx.xxx.xxx:/path_to_your_dump/filename.sql.gz your_destination_path/&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Once copied, import all objects into the MySQL database on the new cluster and start the Hive server, e.g.:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;gunzip filename.sql.gz
mysql -u [username] -p [dbname] &amp;lt; filename.sql&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Tue, 23 Aug 2016 09:26:47 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115278#M78072</guid>
      <dc:creator>shivkumar82015</dc:creator>
      <dc:date>2016-08-23T09:26:47Z</dc:date>
    </item>
    <item>
      <title>Re: Data transfer between two clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115279#M78073</link>
      <description>&lt;P&gt;&lt;A rel="user" href="https://community.cloudera.com/users/9842/mpandit.html" nodeid="9842"&gt;@milind pandit&lt;/A&gt;&lt;/P&gt;&lt;P&gt;Apache Falcon is specifically designed to replicate data between clusters.  It is another tool in your toolbox in addition to the suggestions provided by &lt;A rel="user" href="https://community.cloudera.com/users/11907/shivkumar82015.html" nodeid="11907"&gt;@zkfs&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;A target="_blank" href="http://hortonworks.com/apache/falcon/"&gt;http://hortonworks.com/apache/falcon/&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 23 Aug 2016 21:08:29 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115279#M78073</guid>
      <dc:creator>myoung</dc:creator>
      <dc:date>2016-08-23T21:08:29Z</dc:date>
    </item>
    <item>
      <title>Re: Data transfer between two clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115280#M78074</link>
      <description>&lt;P&gt;@&lt;A href="https://community.hortonworks.com/users/9842/mpandit.html"&gt;milind pandit&lt;/A&gt;&lt;/P&gt;&lt;P&gt;I won't repeat the content of the responses from @&lt;A href="https://community.hortonworks.com/users/11907/shivkumar82015.html"&gt;zkfs&lt;/A&gt; and &lt;A rel="user" href="https://community.cloudera.com/users/2695/myoung.html" nodeid="2695"&gt;@Michael Young&lt;/A&gt;. The above responses are great, but they are not mutually exclusive, just complementary; my 2c. Falcon will help with HDFS, but it won't help with HBase. I would use Falcon for active-active clusters or disaster recovery. Your question implies that data is being migrated from an old cluster to a new one. As such, you could go with the options from @&lt;A href="https://community.hortonworks.com/users/11907/shivkumar82015.html"&gt;zkfs&lt;/A&gt;. Falcon is also an option for the HDFS part, but the effort to set it up and administer it is only worth it for continuous replication, not a one-time deal. For that case, HBase replication should also be considered; it was not mentioned in the above responses.&lt;/P&gt;</description>
      <pubDate>Wed, 24 Aug 2016 11:06:46 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115280#M78074</guid>
      <dc:creator>cstanca</dc:creator>
      <dc:date>2016-08-24T11:06:46Z</dc:date>
    </item>
    <item>
      <title>Re: Data transfer between two clusters</title>
      <link>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115281#M78075</link>
      <description>&lt;P&gt;We're going through this process now, migrating a non-trivial amount of data from an older cluster onto a new cluster and environment. We have a couple of requirements and constraints that limited some of the options:&lt;/P&gt;&lt;OL&gt;
&lt;LI&gt;The datanodes on the two clusters don't have network connectivity; each cluster resides in its own private firewalled network. (As an added complication, we also use the same hostnames in each of the two private environments.) distcp requires the datanodes in the two clusters to be able to communicate directly.&lt;/LI&gt;
&lt;LI&gt;We have different security models in the two clusters. The old cluster uses simple authentication; the new cluster uses Kerberos for authentication. I've found that getting some of the tools to work with two different authentication models can be difficult.&lt;/LI&gt;
&lt;LI&gt;I want to preserve the file metadata from the old cluster on the new cluster, e.g. file creation time, ownership, and file system permissions. Some of the options can move the data from the source cluster, but they write 'new' files on the target cluster. The old cluster has been running for around 2 years, so there's a lot of useful information in those file timestamps.&lt;/LI&gt;
&lt;LI&gt;I need to perform a near-live migration. I have to keep the old cluster running in parallel while migrating data and users to the new cluster; we can't just cut access to the old cluster.&lt;/LI&gt;
&lt;/OL&gt;&lt;P&gt;After trying a number of tools and combinations, including WebHDFS and Knox combinations, we've settled on the following:&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;Export the old cluster via NFS gateways. We lock the NFS access controls to only allow the edge servers on the new cluster to mount the HDFS NFS volume. The edge servers in our target cluster are airflow workers running as a grid. We've created a source NFS gateway for each target edge server airflow worker, enabling a degree of scale-out. Not as good as distcp scale-out, but better than a single-point pipe.&lt;/LI&gt;
&lt;LI&gt;Run good old-fashioned &lt;CODE&gt;hdfs dfs -copyFromLocal -p &amp;lt;old_cluster_nfs_dir&amp;gt; &amp;lt;new_cluster_hdfs_dir&amp;gt;&lt;/CODE&gt;. This enables us to preserve the file timestamps as well as ownerships.&lt;/LI&gt;
&lt;/UL&gt;&lt;P&gt;As part of managing the migration process, we're also making use of HDFS snapshots on both source and target to enable consistency management. Our migration jobs take snapshots at the beginning and end of each migration job and issue delta or difference reports to identify whether data was modified, and possibly missed, during the migration process. I'm expecting that some of our larger data sets will take hours to complete; for the largest few, possibly &amp;gt; 24 hrs. In order to perform the snapshot management we also added some additional wrapper code: WebHDFS can be used to create and list snapshots, but it doesn't yet have an operation for returning a snapshot difference report.&lt;/P&gt;&lt;P&gt;For the Hive metadata, the majority of our Hive DDL exists in git/source code control. We're actually using this migration as an opportunity to enforce this for our production objects. For end-user objects, e.g. analysts' data labs, we're exporting the DDL on the old cluster and re-playing the DDL on the new cluster, with tweaks for any reserved-word collisions.&lt;/P&gt;&lt;P&gt;We don't have HBase operating on our old cluster, so I didn't have to come up with a solution for that problem.&lt;/P&gt;</description>
      <pubDate>Fri, 02 Sep 2016 05:13:34 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Support-Questions/Data-transfer-between-two-clusters/m-p/115281#M78075</guid>
      <dc:creator>chris_ottinger</dc:creator>
      <dc:date>2016-09-02T05:13:34Z</dc:date>
    </item>
  </channel>
</rss>

