
distcp between secured clusters

Rising Star

I am still getting familiar with security aspects in Hadoop and hence need some guidance.

I am trying to set up a distcp job between two secure clusters. Let's say the clusters are called primary_cluster and dr_cluster. Both clusters are connected to a single Active Directory instance and share the same Kerberos realm, AD.HADOOP.COM.

On primary_cluster, assume there are two directories that need to be replicated to dr_cluster: /abc, owned by abc-usr, and /xyz, owned by xyz-usr (ownership is the same on both clusters). Also, /abc and /xyz are designated as encryption zones and hence are encrypted using KMS keys. My customer doesn't want a superuser like the hdfs user running the distcp job and prefers that it be executed by the owner of the HDFS directory, i.e. abc-usr or xyz-usr in this case. So I am thinking of making keytab files for both abc-usr and xyz-usr available on the node from which the distcp job will be triggered (let's call it distcp-node; I plan to trigger it on dr_cluster, as dr_cluster's YARN capacity is lightly used). Below is the sequence of steps I have in mind; a scripted sketch of the same sequence follows the list.

# Running from distcp-node, which is part of dr_cluster

  1. su abc-usr
  2. kinit -k -t /keytab-location abc-usr (at this point, abc-usr obtains a TGT from AD's KDC)
  3. hadoop distcp hdfs://primary_cluster_nn:50070/abc hdfs://dr_cluster_nn:8020/abc (using the TGT acquired above, Kerberos service tickets are obtained)
  4. su xyz-usr
  5. kinit -k -t /keytab-location xyz-usr (at this point, xyz-usr obtains a TGT from AD's KDC)
  6. hadoop distcp hdfs://primary_cluster_nn:50070/xyz hdfs://dr_cluster_nn:8020/xyz (using the TGT acquired above, Kerberos service tickets are obtained)
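
For illustration, below is a rough shell sketch of the same sequence as a single wrapper script rather than interactive su sessions. The keytab paths, principal names, and credential-cache locations are assumptions, and 8020 is assumed to be the NameNode RPC port that the hdfs:// scheme expects (50070 is the NameNode HTTP port); adjust everything to your environment.

  #!/bin/bash
  # Hypothetical sketch: replicate one directory as its owning user.
  replicate() {
    local user="$1" keytab="$2" dir="$3"
    # Per-user ticket cache so abc-usr and xyz-usr don't overwrite each other's TGT
    export KRB5CCNAME="/tmp/krb5cc_distcp_${user}"
    kinit -k -t "${keytab}" "${user}@AD.HADOOP.COM" || return 1
    hadoop distcp "hdfs://primary_cluster_nn:8020${dir}" "hdfs://dr_cluster_nn:8020${dir}"
  }

  replicate abc-usr /path/to/abc-usr.keytab /abc
  replicate xyz-usr /path/to/xyz-usr.keytab /xyz

In a Kerberized cluster the effective Hadoop user comes from the Kerberos principal in the active ticket cache, so the kinit determines which user each copy runs as.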

My Queries:

  1. During steps 3 and 6, will it get service tickets for the NameNodes of both primary_cluster and dr_cluster, or only for dr_cluster's NameNode, since the command is being run on a host that is part of dr_cluster?
  2. Are the keytab files required anywhere else across the clusters apart from distcp-node?
  3. Is there a need to configure Hadoop auth_to_local settings? If so, what rules are required? I presume these are required; as a bare minimum, auth_to_local rules for abc-usr and xyz-usr are needed on both clusters to translate Kerberos principals to user short names.
  4. Is there any need to configure proxy user rules in Hadoop?
  5. As both /abc and /xyz are encryption zones, how do I ensure the data is transferred properly? I presume that as and when the data is read by distcp on primary_cluster, it is transparently decrypted by primary_cluster's KMS, sent over the wire, and re-encrypted on the DR side using dr_cluster's KMS.
  6. If the above statement is incorrect, should I run the distcp command on the /abc/.reserved/raw and /xyz/.reserved/raw directories and securely transfer the appropriate KMS keys? What would be the impact in this case if I intend to run distcp using HDFS snapshots?

PS: The purpose of using distcp-based replication instead of Falcon is to make use of HDFS snapshots. The Falcon version that ships with HDP 2.4.2 doesn't yet support HDFS snapshot-based replication.

1 ACCEPTED SOLUTION

Super Guru

Before I answer this, you need to understand that everything I say here will not necessarily work perfectly in your environment. You will find some issues in your testing; share them here and we'll help you along. That said, theoretically my answers should help you implement this with minimal issues:

1. During steps 3 and 6, will it get service tickets for the NameNodes of both primary_cluster and dr_cluster, or only for dr_cluster's NameNode, since the command is being run on a host that is part of dr_cluster?

-> Since the same Kerberos realm from your AD is used for both clusters, it should get service tickets for both NameNodes. You should not run into any issues here. That said, if you hit an error during your testing, please share it here and we'll help fix it.
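
One way to sanity-check this after a test run, shown here as a hedged sketch: klist lists the tickets in the current credential cache, so after a distcp attempt you should see service tickets for the NameNode principals of both clusters (the principal names below are illustrative, assuming the standard nn/_HOST format).

  klist
  # Expect entries resembling:
  #   nn/primary_cluster_nn@AD.HADOOP.COM
  #   nn/dr_cluster_nn@AD.HADOOP.COM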

2. Are the keytab files required anywhere else across the clusters apart from distcp-node?

-> No. They are only needed on the destination cluster, on the node you are running distcp from.
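
If it helps, you can verify each keytab on distcp-node before relying on it; the paths below are placeholders.

  klist -k -t /path/to/abc-usr.keytab                                  # list principals in the keytab
  kinit -k -t /path/to/abc-usr.keytab abc-usr@AD.HADOOP.COM && klist   # confirm it can obtain a TGT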

3. Is there a need to configure Hadoop auth_to_local settings? If so, what rules are required? I presume these are required; as a bare minimum, auth_to_local rules for abc-usr and xyz-usr are needed on both clusters to translate Kerberos principals to user short names.

-> Yes, you will need auth_to_local settings. The rule depends on your principal names; usually you just need to strip the Kerberos realm. Please see this link for the rules you should set up.
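
As a hedged illustration (the realm and rule below are assumptions based on the AD.HADOOP.COM realm mentioned in the question), a rule that strips the realm could look like the commented value below, and you can test how a principal maps without running a job:

  # Candidate hadoop.security.auth_to_local value in core-site.xml on both clusters:
  #   RULE:[1:$1@$0](.*@AD\.HADOOP\.COM)s/@.*//
  #   DEFAULT
  # Test the mapping from the command line on either cluster:
  hadoop org.apache.hadoop.security.HadoopKerberosName abc-usr@AD.HADOOP.COM
  # Prints something like: Name: abc-usr@AD.HADOOP.COM to abc-usr

Note that if AD.HADOOP.COM is already the default realm in krb5.conf, the built-in DEFAULT rule alone may be enough.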

4. Is there any need to configure proxy user rules in Hadoop?

-> Yes, you need this if your distcp job runs as a special user that only does distcp while the files are owned by different users. In that case, while you run the job as that user, you should let it impersonate whoever owns the data.
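
A minimal sketch of what that could look like, assuming a hypothetical dedicated service user named distcp-svc (not something from the original question; if you run distcp directly as abc-usr / xyz-usr as planned above, no impersonation settings should be needed):

  # Hypothetical core-site.xml entries on both clusters (scope them as tightly as possible):
  #   hadoop.proxyuser.distcp-svc.hosts  = distcp-node
  #   hadoop.proxyuser.distcp-svc.groups = abc-grp,xyz-grp
  # Check what a cluster currently has configured for a given proxy user:
  hdfs getconf -confKey hadoop.proxyuser.distcp-svc.hosts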

5. As both /abc and /xyz are encryption zones, how do I ensure the data is transferred properly? I presume that as and when the data is read by distcp on primary_cluster, it is transparently decrypted by primary_cluster's KMS, sent over the wire, and re-encrypted on the DR side using dr_cluster's KMS.

-> You are talking about encryption at rest. When you read the data, it should be decrypted automatically using whatever mechanism is used when you access the data otherwise (for example, how is data decrypted when you run a Hive query? The same mechanism should automatically kick in, unless there are authorization issues, which of course you'll have to take care of regardless).
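
One practical detail worth hedging on: because the destination encryption zone re-encrypts the data with its own keys, file checksums on the two clusters generally won't match, so the usual guidance for distcp between encryption zones is to skip the CRC comparison. A sketch, using the RPC port assumed earlier:

  # Data is transparently decrypted on read (source KMS) and re-encrypted on write (DR KMS);
  # checksums differ, so skip the CRC check (-skipcrccheck requires -update).
  hadoop distcp -update -skipcrccheck \
    hdfs://primary_cluster_nn:8020/abc \
    hdfs://dr_cluster_nn:8020/abc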

6. If the above statement is incorrect, should I run the distcp command on the /abc/.reserved/raw and /xyz/.reserved/raw directories and securely transfer the appropriate KMS keys? What would be the impact in this case if I intend to run distcp using HDFS snapshots?

-> Number 5 should work. Honestly, I don't fully understand question 6, but number 5 should work.
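
For completeness, since the point of using distcp here is HDFS snapshots, here is a hedged sketch of the two approaches question 6 contrasts: snapshot-diff-based incremental copies over the normal decrypt/re-encrypt path, and a raw copy that moves ciphertext without decryption (only viable if the encryption-zone keys are available to both clusters' KMS instances). Snapshot names and paths are illustrative, and the source directories are assumed to have been made snapshottable by an admin.

  # Incremental copy using snapshot diffs; requires a matching snapshot "s1" on the target
  # and that the target has not been modified since s1:
  hdfs dfs -createSnapshot hdfs://primary_cluster_nn:8020/abc s2
  hadoop distcp -update -diff s1 s2 \
    hdfs://primary_cluster_nn:8020/abc hdfs://dr_cluster_nn:8020/abc

  # Raw copy of ciphertext, preserving extended attributes with -px:
  hadoop distcp -px \
    hdfs://primary_cluster_nn:8020/.reserved/raw/abc \
    hdfs://dr_cluster_nn:8020/.reserved/raw/abc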

@Vijaya Narayana Reddy Bhoomi Reddy

I believe the property you need to check is hadoop.security.auth_to_local, in core-site.xml.
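
If it helps, you can check what a cluster currently resolves for that property from the command line:

  hdfs getconf -confKey hadoop.security.auth_to_local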

You can read more about securing DistCp here.