
Distcp between secured clusters

Rising Star

Hi,

We have two secured clusters, each with a NameNode HA setup. Let's call them PRIMARY and DR. We are now implementing a DR solution between the clusters, using HDFS snapshots and distcp to replicate the data from the PRIMARY to the DR cluster. (We are on HDP 2.4.2, and Falcon doesn't support HDFS snapshots until HDP 2.5, so we had to use HDFS snapshots with distcp.) All the Hadoop daemon accounts on the clusters are prefixed with the cluster name, for example PRIMARY-hdfs, DR-yarn, etc. I have a few questions in this regard (a rough sketch of the snapshot-plus-distcp flow we have in mind follows the list):

  • Q: On which node should the distcp job be running?
    • My understanding: For DR purposes, the distcp job should ideally run on one of the machines in the DR cluster, since that cluster has unused YARN capacity. The requirement for the node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.
  • Which user should the distcp job run as? Should it be a user with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user, for example a new user created for this purpose, replication-user (replication-user@HADOOP.COM)?
  • If it's the hdfs user (DR-hdfs@HADOOP.COM), how do we ensure the user is allowed access on the PRIMARY cluster? (Probably through an auth_to_local setting like the one below?)
    • RULE:[1:$1@$0](.*-hdfs@HADOOP.COM)s/.*/PRIMARY-hdfs/
  • If it's a non-standard user like replication-user, what considerations need to be taken into account? Is it required or recommended to have the same replication-user on both clusters, with an auth_to_local setting similar to the above?
  • As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do we make this work? The replication-user's keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?
  • What is the impact on the solution if the two clusters are in separate Kerberos realms, like PRIMARY.HADOOP.COM and DR.HADOOP.COM?
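To make the setup above concrete, here is the rough snapshot-plus-distcp flow we have in mind (the directory, snapshot names, and nameservice URIs below are made up for illustration):

    # On PRIMARY: the directory must already be snapshottable
    # (hdfs dfsadmin -allowSnapshot /data/app); then take a new snapshot
    hdfs dfs -createSnapshot /data/app s2

    # From a DR node: copy only what changed between snapshots s1 and s2.
    # Requires that s1 exists on both clusters and that the DR copy has
    # not been modified since s1.
    hadoop distcp -update -diff s1 s2 hdfs://PRIMARY/data/app hdfs://DR/data/app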

Apologies if some of these are trivial. Hadoop security is still a grey area for me, hence most of these questions revolve around security.

Thanks

Vijay

1 ACCEPTED SOLUTION

Super Guru
@Vijaya Narayana Reddy Bhoomi Reddy

Please see my answers inline below:

Q: On which node should the distcp job be running?

  • My understanding: For DR purposes, the distcp job should ideally run on one of the machines in the DR cluster, since that cluster has unused YARN capacity. The requirement for the node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.

-> Running the job on the destination is fine. Just remember that distcp builds a "copy list" of the files to copy. For a large cluster with thousands of directories and subdirectories this can be an expensive operation, especially when run from the remote cluster. It's totally okay; you just need to be aware of it.
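For illustration, a pull-style run from a DR node might look like the sketch below; the nameservice URIs, path, and tuning values are placeholders, not recommendations. The -m and -bandwidth options cap the number of map tasks and the per-map bandwidth, which helps keep the transfer from saturating the link between the clusters:

    # Run from a DR node: pull /data/app from PRIMARY into DR,
    # with at most 20 maps, each throttled to 50 MB/s
    hadoop distcp -update -m 20 -bandwidth 50 hdfs://PRIMARY/data/app hdfs://DR/data/app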

  • Which user should the distcp job run as? Should it be a user with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user, for example a new user created for this purpose, replication-user (replication-user@HADOOP.COM)?

First, don't use hdfs. Now, the Kerberos principal you use must have read permission on the files you are going to copy; if that's all that is required, just grant the appropriate permissions. If you are going to use two different principals, then you need to configure the destination principal as a proxy user (i.e. impersonation) on your source cluster. Add the following to your source cluster's core-site.xml, restart the source cluster, and use the new core-site.xml to connect to the source cluster.

   <property>
     <name>hadoop.proxyuser.hdfsdestuser.hosts</name>
     <!-- the destination host(s), or wherever this user connects from -->
     <value>destination-host</value>
   </property>
   <property>
     <name>hadoop.proxyuser.hdfsdestuser.groups</name>
     <!-- the groups whose members this user can impersonate;
          you might want to start with * and then restrict -->
     <value>*</value>
   </property>

This should enable your destination cluster to read the source data. Also remember that if these users are in different Kerberos realms, then you need to set up cross-realm trust. Check this link.
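For the cross-realm case, the usual pattern (sketched here with the realm names from the question; the exact steps vary by KDC) is to create matching krbtgt principals in both KDCs and describe the trust path in krb5.conf:

    # In both KDCs, create the cross-realm ticket-granting principals,
    # with identical passwords/kvnos on both sides:
    #   krbtgt/DR.HADOOP.COM@PRIMARY.HADOOP.COM
    #   krbtgt/PRIMARY.HADOOP.COM@DR.HADOOP.COM

    # In /etc/krb5.conf on the hosts involved ("." means a direct path):
    [capaths]
      DR.HADOOP.COM = {
        PRIMARY.HADOOP.COM = .
      }
      PRIMARY.HADOOP.COM = {
        DR.HADOOP.COM = .
      }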

  • If it's the hdfs user (DR-hdfs@HADOOP.COM), how do we ensure the user is allowed access on the PRIMARY cluster? (Probably through an auth_to_local setting like the one in the question?)

-> Check the previous answer; don't use the hdfs user. auth_to_local may or may not be required, depending on what access you give the destination user.
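If you do end up needing a mapping, it goes in hadoop.security.auth_to_local in core-site.xml on the cluster that must accept the foreign principal. A minimal sketch, assuming a hypothetical replication-user@DR.HADOOP.COM principal should map to a local replication-user account on PRIMARY:

    <property>
      <name>hadoop.security.auth_to_local</name>
      <value>
        RULE:[1:$1@$0](replication-user@DR.HADOOP.COM)s/.*/replication-user/
        DEFAULT
      </value>
    </property>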

  • If it's a non-standard user like replication-user, what considerations need to be taken into account? Is it required or recommended to have the same replication-user on both clusters, with an auth_to_local setting similar to the above?

-> Check the above again. If it's the same user, that will make things easy. For users that are different, changing core-site.xml to add a proxy user isn't very complicated either.

  • As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do we make this work? The replication-user's keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?

-> Check my answer to your question number 2.
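In practice the two keytabs never need to meet: you authenticate on whichever cluster runs the job, using that cluster's keytab, and rely on the proxy-user (or cross-realm) setup above for access to the other side. A minimal sketch, with a hypothetical keytab path:

    # On the DR node that runs distcp, obtain a ticket before launching the job
    kinit -kt /etc/security/keytabs/replication-user.keytab replication-user@HADOOP.COM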

-> Check this link (already referenced earlier).


Rising Star

Thanks @mqureshi for your response. In order to explain my case better, I have created another question with more detail. Request you to please have a look at it.