
Distcp between secured clusters

Rising Star

Hi,

We have two secured clusters, each with a NameNode HA setup. Let's call them PRIMARY and DR. We are now implementing a DR solution between the clusters, using HDFS snapshots and distcp to replicate the data from the PRIMARY to the DR cluster. (We are on HDP 2.4.2, and Falcon doesn't support HDFS snapshots until HDP 2.5, so we had to use HDFS snapshots with distcp.) All the Hadoop daemon accounts on the clusters are prefixed with the cluster name, for example PRIMARY-hdfs, DR-yarn, etc. I have a few questions in this regard (a rough sketch of the snapshot-plus-distcp flow we have in mind follows the list):

  • Q: On which node should the distcp job be running?
    • My understanding: For DR purposes, the distcp job should ideally run on one of the machines in the DR cluster, since that cluster has unused YARN capacity. The requirement for the node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.
  • Which user should the distcp job run as? Should it be a user with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user, for example a new user created for this purpose, replication-user (replication-user@HADOOP.COM)?
  • If it's the hdfs user (DR-hdfs@HADOOP.COM), how do we ensure the user is allowed access on the PRIMARY cluster? (Probably through an auth_to_local setting like the one below?)
    • RULE:[1:$1@$0](.*-hdfs@HADOOP.COM)s/.*/PRIMARY-hdfs/
  • If it's a non-standard user like replication-user, what considerations need to be taken into account? Is it required or recommended to have the same replication-user on both clusters, with an auth_to_local setting similar to the above?
  • As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do we make this work? The replication-user's keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?
  • What is the impact on the solution if the two clusters are in separate Kerberos realms, like PRIMARY.HADOOP.COM and DR.HADOOP.COM?
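To make the setup above concrete, here is the rough snapshot-plus-distcp flow we have in mind (the directory, snapshot names, and nameservice URIs below are made up for illustration):

    # On PRIMARY: the directory must already be snapshottable
    # (hdfs dfsadmin -allowSnapshot /data/app); then take a new snapshot
    hdfs dfs -createSnapshot /data/app s2

    # From a DR node: copy only what changed between snapshots s1 and s2.
    # Requires that s1 exists on both clusters and that the DR copy has
    # not been modified since s1.
    hadoop distcp -update -diff s1 s2 hdfs://PRIMARY/data/app hdfs://DR/data/app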

Apologies if some of these are trivial. Hadoop security is still a grey area for me, hence most of these questions revolve around security.

Thanks

Vijay

1 ACCEPTED SOLUTION

Super Guru
@Vijaya Narayana Reddy Bhoomi Reddy

Please see my answers inline below:

Q: On which node should the distcp job be running?

  • My understanding: For DR purposes, the distcp job should ideally run on one of the machines in the DR cluster, since that cluster has unused YARN capacity. The requirement for the node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.

-> Running the job on the destination is fine. Just remember that distcp builds a "copy list" of the files to copy. For a large cluster with thousands of directories and subdirectories this can be an expensive operation, especially when run from the remote cluster. It's totally okay; you just need to be aware of it.
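For illustration, a pull-style run from a DR node might look like the sketch below; the nameservice URIs, path, and tuning values are placeholders, not recommendations. The -m and -bandwidth options cap the number of map tasks and the per-map bandwidth, which helps keep the transfer from saturating the link between the clusters:

    # Run from a DR node: pull /data/app from PRIMARY into DR,
    # with at most 20 maps, each throttled to 50 MB/s
    hadoop distcp -update -m 20 -bandwidth 50 hdfs://PRIMARY/data/app hdfs://DR/data/app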

  • Which user should the distcp job run as? Should it be a user with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user, for example a new user created for this purpose, replication-user (replication-user@HADOOP.COM)?

First, don't use hdfs. Now, the Kerberos principal you use must have read permission on the files you are going to copy; if that's all that is required, just grant the appropriate permissions. If you are going to use two different principals, then you need to configure the destination principal as a proxy user (i.e. impersonation) on your source cluster. Add the following to your source cluster's core-site.xml, restart the source cluster, and use the new core-site.xml to connect to the source cluster.

   <property>
     <name>hadoop.proxyuser.hdfsdestuser.hosts</name>
     <!-- the destination host(s), or wherever this user connects from -->
     <value>destination-host</value>
   </property>
   <property>
     <name>hadoop.proxyuser.hdfsdestuser.groups</name>
     <!-- the groups whose members this user can impersonate;
          you might want to start with * and then restrict -->
     <value>*</value>
   </property>

This should enable your destination cluster to read the source data. Also remember that if these users are in different Kerberos realms, then you need to set up cross-realm trust. Check this link.
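For the cross-realm case, the usual pattern (sketched here with the realm names from the question; the exact steps vary by KDC) is to create matching krbtgt principals in both KDCs and describe the trust path in krb5.conf:

    # In both KDCs, create the cross-realm ticket-granting principals,
    # with identical passwords/kvnos on both sides:
    #   krbtgt/DR.HADOOP.COM@PRIMARY.HADOOP.COM
    #   krbtgt/PRIMARY.HADOOP.COM@DR.HADOOP.COM

    # In /etc/krb5.conf on the hosts involved ("." means a direct path):
    [capaths]
      DR.HADOOP.COM = {
        PRIMARY.HADOOP.COM = .
      }
      PRIMARY.HADOOP.COM = {
        DR.HADOOP.COM = .
      }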

  • If it's the hdfs user (DR-hdfs@HADOOP.COM), how do we ensure the user is allowed access on the PRIMARY cluster? (Probably through an auth_to_local setting like the one in the question?)

-> Check the previous answer; don't use the hdfs user. auth_to_local may or may not be required, depending on what access you give the destination user.
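If you do end up needing a mapping, it goes in hadoop.security.auth_to_local in core-site.xml on the cluster that must accept the foreign principal. A minimal sketch, assuming a hypothetical replication-user@DR.HADOOP.COM principal should map to a local replication-user account on PRIMARY:

    <property>
      <name>hadoop.security.auth_to_local</name>
      <value>
        RULE:[1:$1@$0](replication-user@DR.HADOOP.COM)s/.*/replication-user/
        DEFAULT
      </value>
    </property>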

  • If it's a non-standard user like replication-user, what considerations need to be taken into account? Is it required or recommended to have the same replication-user on both clusters, with an auth_to_local setting similar to the above?

-> Check the above again. If it's the same user, that will make things easy. For users that are different, changing core-site.xml to add a proxy user isn't very complicated either.

  • As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do we make this work? The replication-user's keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?

-> Check my answer to your question number 2.
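In practice the two keytabs never need to meet: you authenticate on whichever cluster runs the job, using that cluster's keytab, and rely on the proxy-user (or cross-realm) setup above for access to the other side. A minimal sketch, with a hypothetical keytab path:

    # On the DR node that runs distcp, obtain a ticket before launching the job
    kinit -kt /etc/security/keytabs/replication-user.keytab replication-user@HADOOP.COM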

-> Check this link (already referenced earlier).


Rising Star

Thanks @mqureshi for your response. In order to explain my case better, I have created another question with more detail. Request you to please have a look at it.