<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>question Distcp between secured clusters in Archives of Support Questions (Read Only)</title>
    <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Distcp-between-secured-clusters/m-p/149414#M35993</link>
    <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We
have two secured clusters with NameNode HA set up. Let's call them PRIMARY and DR. We are now
implementing a DR solution between the clusters using HDFS snapshots and distcp
(we are on HDP 2.4.2, and Falcon doesn't support HDFS snapshots until HDP 2.5, so
we had to use HDFS snapshots with distcp) to replicate the data from the PRIMARY to the DR
cluster. All the Hadoop daemon accounts on the clusters are prefixed with the
cluster name, for example PRIMARY-hdfs, DR-yarn, etc.
I have a few questions in this regard:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Q: On which node should the
     distcp job be running?&lt;UL&gt;&lt;LI&gt;My Understanding: For DR
      purposes, distcp job should ideally be run on one of the machines on the
      DR cluster as it has unused YARN capacity. The requirement for the node
      is to have hadoop client libraries available for it to run distcp. For
      example, assume the node is dr-host1@HADOOP.COM&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Which user should the distcp
     job run as? Should it be someone with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user, for
     example a new user created for this purpose, replication-user (replication-user@HADOOP.COM)?&lt;/LI&gt;&lt;LI&gt;If it's the hdfs user (DR-hdfs@HADOOP.COM), how do we ensure the user is
     allowed access on the PRIMARY cluster? (probably through auth_to_local
     settings like below?)&lt;UL&gt;&lt;LI&gt;RULE:
      [1:$1@$0] (.*-hdfs@HADOOP.COM) s/.*/PRIMARY-hdfs/&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;If it’s a non-standard user
     like replication-user, what are the considerations to be taken? Is it
     required / recommended to have the same user replication-user on both the clusters and
     have auth_to_local setting similar to above?&lt;/LI&gt;&lt;LI&gt;As the clusters are secured
     by Kerberos and the principals are going to be different on the clusters,
     how to make this work? The replication-user's keytab file is going to be
     different on PRIMARY and DR cluster. What is the best approach to handle
     this?&lt;/LI&gt;&lt;LI&gt;What's the impact on the
     solution if the two clusters are part of separate Kerberos realms,
     like PRIMARY.HADOOP.COM and DR.HADOOP.COM?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Apologies if some of
these are trivial. Hadoop security is still a grey area for me, hence
the majority of these questions concern security.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Vijay&lt;/P&gt;</description>
    <pubDate>Wed, 27 Jul 2016 00:24:55 GMT</pubDate>
    <dc:creator>bhoomireddy_vij</dc:creator>
    <dc:date>2016-07-27T00:24:55Z</dc:date>
    <item>
      <title>Distcp between secured clusters</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Distcp-between-secured-clusters/m-p/149414#M35993</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;We
have two secured clusters with NameNode HA set up. Let's call them PRIMARY and DR. We are now
implementing a DR solution between the clusters using HDFS snapshots and distcp
(we are on HDP 2.4.2, and Falcon doesn't support HDFS snapshots until HDP 2.5, so
we had to use HDFS snapshots with distcp) to replicate the data from the PRIMARY to the DR
cluster. All the Hadoop daemon accounts on the clusters are prefixed with the
cluster name, for example PRIMARY-hdfs, DR-yarn, etc.
I have a few questions in this regard:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Q: On which node should the
     distcp job be running?&lt;UL&gt;&lt;LI&gt;My Understanding: For DR
      purposes, distcp job should ideally be run on one of the machines on the
      DR cluster as it has unused YARN capacity. The requirement for the node
      is to have hadoop client libraries available for it to run distcp. For
      example, assume the node is dr-host1@HADOOP.COM&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;Which user should the distcp
     job run as? Should it be someone with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user, for
     example a new user created for this purpose, replication-user (replication-user@HADOOP.COM)?&lt;/LI&gt;&lt;LI&gt;If it's the hdfs user (DR-hdfs@HADOOP.COM), how do we ensure the user is
     allowed access on the PRIMARY cluster? (probably through auth_to_local
     settings like below?)&lt;UL&gt;&lt;LI&gt;RULE:
      [1:$1@$0] (.*-hdfs@HADOOP.COM) s/.*/PRIMARY-hdfs/&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;If it’s a non-standard user
     like replication-user, what are the considerations to be taken? Is it
     required / recommended to have the same user replication-user on both the clusters and
     have auth_to_local setting similar to above?&lt;/LI&gt;&lt;LI&gt;As the clusters are secured
     by Kerberos and the principals are going to be different on the clusters,
     how to make this work? The replication-user's keytab file is going to be
     different on PRIMARY and DR cluster. What is the best approach to handle
     this?&lt;/LI&gt;&lt;LI&gt;What's the impact on the
     solution if the two clusters are part of separate Kerberos realms,
     like PRIMARY.HADOOP.COM and DR.HADOOP.COM?&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Apologies if some of
these are trivial. Hadoop security is still a grey area for me, hence
the majority of these questions concern security.&lt;/P&gt;&lt;P&gt;Thanks&lt;/P&gt;&lt;P&gt;Vijay&lt;/P&gt;</description>
      <pubDate>Wed, 27 Jul 2016 00:24:55 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Distcp-between-secured-clusters/m-p/149414#M35993</guid>
      <dc:creator>bhoomireddy_vij</dc:creator>
      <dc:date>2016-07-27T00:24:55Z</dc:date>
    </item>
    <item>
      <title>Re: Distcp between secured clusters</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Distcp-between-secured-clusters/m-p/149415#M35994</link>
      <description>&lt;A rel="user" href="https://community.cloudera.com/users/2733/bhoomireddyvijay.html" nodeid="2733"&gt;@Vijaya Narayana Reddy Bhoomi Reddy&lt;/A&gt;&lt;P&gt;Please see my answers inline below:&lt;/P&gt;&lt;P&gt;Q: On which node should the distcp job be running?
&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;My Understanding: For DR purposes, distcp job should ideally be run on one of the machines on the DR cluster as it has unused YARN capacity. The requirement for the node is to have hadoop client libraries available for it to run distcp. For example, assume the node as dr-host1@HADOOP.COM&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-left: 20px;"&gt;-&amp;gt; Running the job on the destination is fine. Just remember that distcp builds a "copylist" of the files to copy. For a large cluster with thousands of directories and subdirectories, this can be an expensive operation, especially when run from a remote cluster. It's totally okay; you just need to be aware of it.&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;Which user should the distcp job be running as? Is it someone with hdfs privileges (For example, DR-hdfs@HADOOP.COM) or any other user for example, a new user created for this purpose -replication-user (replication-user@HADOOP.COM)&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-left: 20px;"&gt;-&amp;gt; First, don't use hdfs. The Kerberos principal you use must have read permission on the files you will copy. If that means every file, then grant the appropriate permissions. If you are going to use two different principals, then you need to make the destination principal a proxy user (that is, allow it to impersonate) on your source cluster. Add the following to your source cluster's core-site.xml and restart the source cluster. Use the new core-site.xml to connect to the source cluster.&lt;/P&gt;&lt;PRE&gt;&amp;lt;property&amp;gt;
     &amp;lt;name&amp;gt;hadoop.proxyuser.hdfsdestuser.hosts&amp;lt;/name&amp;gt;
     &amp;lt;!-- replace with the destination host, or wherever this user is connecting from --&amp;gt;
     &amp;lt;value&amp;gt;DESTINATION_HOST&amp;lt;/value&amp;gt;
   &amp;lt;/property&amp;gt;
   &amp;lt;property&amp;gt;
     &amp;lt;name&amp;gt;hadoop.proxyuser.hdfsdestuser.groups&amp;lt;/name&amp;gt;
     &amp;lt;!-- replace with the groups whose users this user can impersonate; you might want to start with * and then restrict --&amp;gt;
     &amp;lt;value&amp;gt;GROUPS&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P style="margin-left: 20px;"&gt;
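With the placeholders filled in, the two properties above could look like the following. This is only a sketch under assumptions: replication-user is the non-hdfs user name from the question, while the host name dr-host1.example.com and the group name replication are hypothetical values chosen for illustration.&lt;/P&gt;&lt;PRE&gt;&amp;lt;!-- Sketch for the source (PRIMARY) cluster's core-site.xml; all values are assumptions --&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;hadoop.proxyuser.replication-user.hosts&amp;lt;/name&amp;gt;
  &amp;lt;!-- hypothetical host from which replication-user connects --&amp;gt;
  &amp;lt;value&amp;gt;dr-host1.example.com&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;
&amp;lt;property&amp;gt;
  &amp;lt;name&amp;gt;hadoop.proxyuser.replication-user.groups&amp;lt;/name&amp;gt;
  &amp;lt;!-- hypothetical group; replication-user may impersonate its members --&amp;gt;
  &amp;lt;value&amp;gt;replication&amp;lt;/value&amp;gt;
&amp;lt;/property&amp;gt;&lt;/PRE&gt;&lt;P style="margin-left: 20px;"&gt;Listing a specific host and group, rather than *, keeps the impersonation scope narrow once the setup is known to work.&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;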
&lt;/P&gt;&lt;P style="margin-left: 20px;"&gt;This should enable your destination cluster to read the source data. Also remember that if these users are in different Kerberos realms, you need to set up cross-realm trust. Check this &lt;A href="https://community.hortonworks.com/articles/18686/kerberos-cross-realm-trust-for-distcp.html"&gt;link&lt;/A&gt;.&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;If its hdfs user (DR-hdfs@HADOOP.COM), how to ensure the user is allowed access on the PRIMARY cluster? (probably through auth_to_local settings like below?)&lt;UL&gt;&lt;LI&gt;RULE: [1:$1@$0] (.*-hdfs@HADOOP.COM) s/.*/PRIMARY-hdfs/&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;
&lt;/UL&gt;&lt;P&gt; -&amp;gt; See the previous answer: don't use the hdfs user. An auth_to_local rule may or may not be required, depending on what access you give the destination user.&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;If it’s a non-standard user like replication-user, what are the considerations to be taken? Is it required / recommended to have the same user replication-user on both the clusters and have auth_to_local setting similar to above?&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-left: 40px;"&gt;-&amp;gt; See above again. If it's the same user on both clusters, that makes things easy. If the users are different, changing core-site.xml to add a proxy user isn't very complicated either.&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;As the clusters are secured by Kerberos and the principals are going to be different on the clusters, how to make this work? The replication-user's keytab file is going to be different on PRIMARY and DR cluster. What is the best approach to handle this?&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-left: 20px;"&gt;-&amp;gt; Check my answer to your question number 2.&lt;/P&gt;&lt;UL&gt;
&lt;LI&gt;What's the impact on the solution if the both the clusters are part of separate Kerberos realms like PRIMARY.HADOOP.COM and DR.HADOOP.COM?&lt;/LI&gt;&lt;/UL&gt;&lt;P style="margin-left: 20px;"&gt;-&amp;gt; Check this &lt;A href="https://community.hortonworks.com/articles/18686/kerberos-cross-realm-trust-for-distcp.html"&gt;link&lt;/A&gt; (already referenced earlier).&lt;/P&gt;</description>
      <pubDate>Wed, 27 Jul 2016 04:08:09 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Distcp-between-secured-clusters/m-p/149415#M35994</guid>
      <dc:creator>mqureshi</dc:creator>
      <dc:date>2016-07-27T04:08:09Z</dc:date>
    </item>
    <item>
      <title>Re: Distcp between secured clusters</title>
      <link>https://community.cloudera.com/t5/Archives-of-Support-Questions/Distcp-between-secured-clusters/m-p/149416#M35995</link>
      <description>&lt;P&gt;Thanks &lt;A href="#"&gt;@mqureshi&lt;/A&gt; for your response. To explain my case better, I have created another &lt;A href="https://community.hortonworks.com/questions/47981/distcp-between-secured-clusters-1.html"&gt;question&lt;/A&gt; with more detail. Please have a look at it.&lt;/P&gt;</description>
      <pubDate>Thu, 28 Jul 2016 20:36:33 GMT</pubDate>
      <guid>https://community.cloudera.com/t5/Archives-of-Support-Questions/Distcp-between-secured-clusters/m-p/149416#M35995</guid>
      <dc:creator>bhoomireddy_vij</dc:creator>
      <dc:date>2016-07-28T20:36:33Z</dc:date>
    </item>
  </channel>
</rss>

