Distcp between secured clusters
Labels: Apache Hadoop
Created 07-26-2016 05:24 PM
Hi,
We have two secured clusters, each with a NameNode HA setup. Let's call them PRIMARY and DR. We are implementing a DR solution between the clusters, using HDFS snapshots and distcp to replicate data from the PRIMARY cluster to the DR cluster (we are on HDP 2.4.2 and Falcon doesn't support HDFS snapshots until HDP 2.5, so we had to use HDFS snapshots with distcp). All the Hadoop daemon accounts on the clusters are prefixed with the cluster name, for example PRIMARY-hdfs, DR-yarn, etc. I have a few questions in this regard:
- Q: On which node should the distcp job be running?
- My understanding: For DR purposes, the distcp job should ideally be run on one of the machines in the DR cluster, as it has unused YARN capacity. The only requirement for the node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.
- Which user should the distcp job run as? Is it someone with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user created for this purpose, for example replication-user (replication-user@HADOOP.COM)?
- If it's the hdfs user (DR-hdfs@HADOOP.COM), how do I ensure the user is allowed access on the PRIMARY cluster? (Probably through auth_to_local settings like the rule below?)
- RULE:[1:$1@$0](.*-hdfs@HADOOP.COM)s/.*/PRIMARY-hdfs/
- If it's a non-standard user like replication-user, what considerations need to be taken into account? Is it required / recommended to have the same replication-user on both clusters, with an auth_to_local setting similar to the above?
- As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do I make this work? The replication-user's keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?
- What's the impact on the solution if the two clusters are part of separate Kerberos realms, like PRIMARY.HADOOP.COM and DR.HADOOP.COM?
Apologies if some of these are trivial. Hadoop security is still a grey area for me, hence most of these questions revolve around security.
Thanks
Vijay
Created 07-26-2016 09:08 PM
Please see my answers inline below:
Q: On which node should the distcp job be running?
- My understanding: For DR purposes, the distcp job should ideally be run on one of the machines in the DR cluster, as it has unused YARN capacity. The only requirement for the node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.
-> Running the job on the destination is fine. Just remember that distcp builds a "copy list" of the files to copy. For a large cluster with thousands of directories and subdirectories this can be an expensive operation, especially when run from the remote cluster. It's totally okay; you just need to be aware of it.
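To make this concrete, here is a minimal sketch of what a pull-style run from a DR edge node could look like. The principal, keytab path, nameservice names (primarycluster, drcluster), data paths, and snapshot names are all placeholders, not values from this thread; the incremental form assumes the snapshot-diff support in HDP 2.4 (Hadoop 2.7) and that the target also has the starting snapshot with no changes since it was taken. The DR-side client configuration must also know both HA nameservices for the hdfs://primarycluster URI to resolve.

# Authenticate as the replication principal on the DR side (placeholder names)
kinit -kt /etc/security/keytabs/replication-user.keytab replication-user@HADOOP.COM

# Initial full copy, pulling from PRIMARY while the job runs on the DR cluster
hadoop distcp -update -delete hdfs://primarycluster/data/projects hdfs://drcluster/data/projects

# Later incremental copies driven by HDFS snapshots s1 -> s2 on the source path
# (requires that the target also has snapshot s1 and has not changed since)
hadoop distcp -update -diff s1 s2 hdfs://primarycluster/data/projects hdfs://drcluster/data/projects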
- Which user should the distcp job run as? Is it someone with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user created for this purpose, for example replication-user (replication-user@HADOOP.COM)?
-> First, don't use hdfs. The Kerberos principal you use needs read permission on the files you are going to copy; if that is all it needs, just grant the appropriate permissions. If you are going to use two different principals, then you need to allow the destination principal to act as a proxy user (a.k.a. impersonation) on your source cluster. Add the following to your source cluster's core-site.xml, restart the source cluster, and use the new core-site.xml to connect to the source cluster.
<property>
  <name>hadoop.proxyuser.hdfsdestuser.hosts</name>
  <value><destination host, or wherever this user is connecting from></value>
</property>
<property>
  <name>hadoop.proxyuser.hdfsdestuser.groups</name>
  <value><all the groups whose members this user can impersonate></value>
  <!-- might want to start with * and then restrict -->
</property>
This should enable your destination cluster to read the source data. Also remember that if these users are in different Kerberos realms, then you need to set up a cross-realm trust. Check this link.
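As a quick sanity check before wiring up distcp, you can confirm from the DR side that the chosen principal can actually read the source paths. A sketch with placeholder principal, keytab, nameservice, and path names, assuming the proxyuser change above has been applied and the relevant services restarted:

# From a DR node, obtain a ticket and list a source directory
kinit -kt /etc/security/keytabs/replication-user.keytab replication-user@HADOOP.COM
hdfs dfs -ls hdfs://primarycluster/data/projects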
- If it's the hdfs user (DR-hdfs@HADOOP.COM), how do I ensure the user is allowed access on the PRIMARY cluster? (Probably through auth_to_local settings like the rule below?)
- RULE:[1:$1@$0](.*-hdfs@HADOOP.COM)s/.*/PRIMARY-hdfs/
-> See the previous answer: don't use the hdfs user. auth_to_local may or may not be required; it depends on what access you give the destination user.
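If a mapping does turn out to be needed, it would normally go into the hadoop.security.auth_to_local property in core-site.xml on the PRIMARY cluster. A minimal sketch, assuming a hypothetical replication-user principal that should map to a local account of the same name (the names are placeholders, not values confirmed in this thread):

<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](replication-user@HADOOP.COM)s/.*/replication-user/
    DEFAULT
  </value>
</property>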
- If it's a non-standard user like replication-user, what considerations need to be taken into account? Is it required / recommended to have the same replication-user on both clusters, with an auth_to_local setting similar to the above?
-> See the answer above again. If it's the same user, that will make things easy. For users that are different, changing core-site.xml to add a proxy user isn't very complicated either.
- As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do I make this work? The replication-user's keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?
-> Check my answer to your question number 2.
- What's the impact on the solution if the two clusters are part of separate Kerberos realms, like PRIMARY.HADOOP.COM and DR.HADOOP.COM?
-> Check this link (already referred to earlier).
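For the separate-realm case, the usual pattern is a cross-realm trust: matching krbtgt principals created in both KDCs, plus realm and path mappings in krb5.conf on the cluster nodes. A rough sketch, assuming the realm names from the question and placeholder KDC hostnames; this is illustrative only, not a verified configuration for this environment:

# In both KDCs, create the cross-realm ticket-granting principals with identical passwords:
#   addprinc krbtgt/DR.HADOOP.COM@PRIMARY.HADOOP.COM
#   addprinc krbtgt/PRIMARY.HADOOP.COM@DR.HADOOP.COM

# krb5.conf on the cluster nodes (KDC hostnames are placeholders)
[realms]
  PRIMARY.HADOOP.COM = {
    kdc = primary-kdc.hadoop.com
    admin_server = primary-kdc.hadoop.com
  }
  DR.HADOOP.COM = {
    kdc = dr-kdc.hadoop.com
    admin_server = dr-kdc.hadoop.com
  }

[domain_realm]
  .primary.hadoop.com = PRIMARY.HADOOP.COM
  .dr.hadoop.com = DR.HADOOP.COM

[capaths]
  PRIMARY.HADOOP.COM = {
    DR.HADOOP.COM = .
  }
  DR.HADOOP.COM = {
    PRIMARY.HADOOP.COM = .
  }

On the client side you will typically also need to relax dfs.namenode.kerberos.principal.pattern (for example to *) in hdfs-site.xml so the distcp client will accept NameNode service principals from the other realm; verify the exact property and value against the documentation for your HDP version.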