Member since: 02-10-2016
Posts: 34
Kudos Received: 16
Solutions: 2

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1157 | 06-20-2016 12:48 PM
 | 838 | 06-11-2016 07:45 AM
05-17-2018
08:54 AM
Thanks @Felix Albani. I am aware that we can turn auditing on or off at the policy level. However, I am looking for granular control at the operation level, so that we can disable logging for operations that are perceived as unimportant.
05-16-2018
10:35 AM
Hi, is there a mechanism to choose particular operations / audit types for auditing in Ranger while ignoring other types? We are exploring the possibility of turning off the logging mechanism for a few categories, in anticipation of better search performance in the Ranger Solr audit store. Regards, Vijay
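For what it's worth, this level of control was not available in the Ranger versions shipped at the time, but more recent Apache Ranger releases (2.1 and later) added per-service audit filters that can drop audit events by operation, user or result. A minimal sketch, assuming a recent Ranger version; the config name (commonly ranger.plugin.audit.filters, set in the service's config in Ranger Admin), the field names, and the operation names below should all be verified against your release:

```
# JSON value for the per-service audit-filter config in Ranger Admin.
# Here: always audit denied requests, but stop auditing two hypothetical
# high-volume read operations considered unimportant.
cat > audit-filters.json <<'EOF'
[
  { "accessResult": "DENIED", "isAudited": true },
  { "actions": ["listStatus", "getfileinfo"], "isAudited": false }
]
EOF
```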
Labels:
- Apache Ranger
01-30-2017
10:54 PM
Thanks @Sriharsha Chintalapani for bringing out this article, a much-needed one given the growing importance of Kafka in every data-centric organisation. It covers a lot of ground from the MirrorMaker perspective. Thanks.
01-23-2017
06:12 PM
1 Kudo
Hi, I am trying to distcp data between two encryption zones located on two different clusters. The data is copied successfully; however, when I read it on the target cluster, I see gibberish printed on the terminal. The encryption zone on the source was created with a key (test-key). As it is a DR requirement, I created a key on the target cluster with the same key name, i.e. test-key; fundamentally, however, the two clusters are completely independent. I presumed that when DistCp reads the data from the source cluster, it would read and transfer it transparently using the source-side key and key material, and then write to the target using the target's key and material. Wondering where this has gone wrong. Any pointers?
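For reference, a minimal sketch of the pattern usually recommended when copying between encryption zones that do not share key material: copy through the normal file paths, so the data is decrypted by the source KMS on read and re-encrypted with the target key on write, and skip the CRC comparison because the encrypted blocks on the two sides can never have matching checksums. Host names and paths are placeholders:

```
# Decrypt-on-read / re-encrypt-on-write copy between independent encryption zones.
# -update -skipcrccheck: file checksums cannot be compared across the two zones.
# Avoid /.reserved/raw paths here - raw EDEKs copied from the source cannot be
# decrypted by the target cluster's KMS, which produces exactly this kind of
# gibberish on read.
hadoop distcp -update -skipcrccheck \
  hdfs://source-nn:8020/secure/data \
  hdfs://target-nn:8020/secure/data
```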
Labels:
- Apache Hadoop
12-01-2016
05:09 PM
2 Kudos
Hi, I am trying to understand the benefits of using plain distcp -update versus distcp -update with HDFS snapshot diffs. As I understand it, -update without any snapshot options only replicates data that was modified at the source and doesn't touch files that already exist at the destination. What additional benefits would one realise by using HDFS snapshots in distcp-based replication? Thanks, Vijay
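For context, a minimal sketch comparing the two forms; paths, snapshot names and NameNode addresses are placeholders. With -diff, DistCp only has to process the changes recorded between two snapshots instead of walking and comparing the full source and target listings on every run, and source-side renames and deletes are propagated to the target as well. The prerequisites are that the target already has the older snapshot (s1) and has not been modified since the previous sync:

```
# One-time: make the source directory snapshottable, then snapshot per sync cycle.
hdfs dfsadmin -allowSnapshot /data/app
hdfs dfs -createSnapshot /data/app s1     # taken at the previous sync
hdfs dfs -createSnapshot /data/app s2     # taken before the current sync

# Plain incremental copy: full listing comparison of source and target each run.
hadoop distcp -update hdfs://src-nn:8020/data/app hdfs://dst-nn:8020/data/app

# Snapshot-diff copy: applies only the s1 -> s2 delta (creates, deletes, renames).
hadoop distcp -update -diff s1 s2 \
  hdfs://src-nn:8020/data/app hdfs://dst-nn:8020/data/app
```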
Labels:
- Apache Hadoop
11-08-2016
04:22 PM
@Mike Garris Hi, LUKS is disk-level encryption and hence is independent of the encryption supported by HDFS. Please see the link below for an overview of the various levels of encryption and where TDE sits. https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html Hope that answers your query.
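To make the TDE layer concrete, here is a minimal sketch of how an encryption zone is created on top of a KMS-managed key (the key name and path are placeholders); this operates entirely at the HDFS level and is independent of any LUKS volume encryption underneath:

```
# Create a key in the configured KMS, then turn an empty directory into an
# encryption zone backed by that key. Files written under the zone are
# encrypted and decrypted transparently for authorised clients.
hadoop key create test-key
hdfs dfs -mkdir -p /secure/zone
hdfs crypto -createZone -keyName test-key -path /secure/zone
hdfs crypto -listZones
```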
09-30-2016
10:19 AM
1 Kudo
Hi, we are running distcp / Falcon-based replication between clusters. As depicted in the diagram above, we have edge nodes configured on both clusters, and a dedicated private network link has been established between them. For normal cluster traffic, I presume the regular, firewalled network is used. However, my understanding of distcp is that it works as NameNode-to-NameNode communication and hence would probably take the firewall route rather than the private link. Can anyone please guide me on how to make use of the private link so that all the replication traffic (which is expected to be huge and also has to adhere to SLAs) is directed through it? I am also looking for alternative suggestions and ideas to make this more performant. Thanks, Vijay
Labels:
- Apache Falcon
- Apache Hadoop
08-03-2016
04:35 PM
Thanks @Sunile Manjee. This is the approach I think I need to follow; I was just trying to understand whether there is any other alternative. To answer your question about Falcon: we are not using it because we are on HDP 2.4.2 and need to leverage HDFS snapshots, which Falcon doesn't support until HDP 2.5. So we are going with this approach for now.
08-03-2016
02:56 PM
Hi, I am working on a distcp solution between two clusters. On cluster01's HDFS there are multiple directories, each owned by a different application team. The requirement is to distcp these directories onto cluster02 while preserving the access privileges. Both clusters are secured. I was thinking of having a service user, something like "distcp-user", with its own Kerberos principal, which would manage the distcp process and also make auditing easy.
Would it be possible for distcp-user to complete the distcp process without having read access on cluster01 and write access on cluster02? Is this something impersonation can help with? For example, if dir1 on cluster01 is owned by appuser1 and dir2 by appuser2, can distcp-user impersonate both appuser1 and appuser2 and perform the distcp jobs on their behalf without being able to read the actual underlying data? Or is it only possible if distcp-user has the appropriate read access on cluster01 and write access on cluster02, managed through Ranger / HDFS ACLs? Thanks, Vijay
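For illustration, a minimal sketch of what the impersonation route could look like, assuming the hypothetical distcp-user principal from the question and that core-site.xml on both clusters allows it as a proxy user (hadoop.proxyuser.distcp-user.hosts / hadoop.proxyuser.distcp-user.groups). With this approach each copy runs as the owning application user, so distcp-user needs no direct read/write access of its own, and the audit trail shows the impersonated user:

```
# Authenticate as the service principal (keytab path is a placeholder).
kinit -kt /etc/security/keytabs/distcp-user.keytab distcp-user

# Impersonate the directory owner for each copy via the proxy-user mechanism
# honoured by the Hadoop client (requires the proxyuser entries noted above).
HADOOP_PROXY_USER=appuser1 hadoop distcp -update \
  hdfs://cluster01-nn:8020/data/dir1 hdfs://cluster02-nn:8020/data/dir1

HADOOP_PROXY_USER=appuser2 hadoop distcp -update \
  hdfs://cluster01-nn:8020/data/dir2 hdfs://cluster02-nn:8020/data/dir2
```

Whether direct Ranger / HDFS ACL grants to distcp-user are preferable is really an audit-versus-privilege trade-off: direct grants are simpler, but they mean the service account can read everyone's data.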
Labels:
- Apache Hadoop
07-28-2016
01:36 PM
Thanks @mqureshi for your response. In order to explain my case better, I have created another question with more detail; please have a look at it.
07-28-2016
01:34 PM
I am still getting familiar with the security aspects of Hadoop and hence need some guidance. I am trying to set up a distcp job between two secure clusters; let's call them primary_cluster and dr_cluster. Both clusters are connected to a single Active Directory instance and share the same Kerberos realm, AD.HADOOP.COM. On the primary_cluster, assume there are two directories that need to be replicated to the dr_cluster: /abc owned by abc-usr and /xyz owned by xyz-usr (the same ownership exists on both clusters). Both /abc and /xyz are designated as encryption zones and hence are encrypted using KMS keys. My customer doesn't want a superuser like hdfs to run the distcp job and prefers it to be executed by the owner of the HDFS directory, i.e. abc-usr or xyz-usr in this case. So I'm thinking of making keytab files for both abc-usr and xyz-usr available on the node (let's call it distcp-node) from which the distcp job will be triggered (planning to trigger it on the dr_cluster, as the dr_cluster's YARN capacity is very lightly used). Below is the sequence of steps I have in mind, all run from distcp-node, which is part of the dr_cluster:
1. su - abc-usr
2. kinit -k -t /keytab-location abc-usr (at this point, abc-usr gets a TGT from AD's KDC)
3. hadoop distcp hdfs://primary_cluster_nn:8020/abc hdfs://dr_cluster_nn:8020/abc (using the TGT acquired above, Kerberos service tickets are obtained)
4. su - xyz-usr
5. kinit -k -t /keytab-location xyz-usr (at this point, xyz-usr gets a TGT from AD's KDC)
6. hadoop distcp hdfs://primary_cluster_nn:8020/xyz hdfs://dr_cluster_nn:8020/xyz (using the TGT acquired above, Kerberos service tickets are obtained)

My queries:
1. During steps 3 and 6, will service tickets be obtained for the NameNodes of both primary_cluster and dr_cluster, or only for dr_cluster's NameNode, given that the command is being run on a host that is part of dr_cluster?
2. Are the keytab files required anywhere else across the clusters apart from distcp-node?
3. Is there a need to configure Hadoop auth_to_local settings? If so, what rules are required? I presume these are required: as a bare minimum, auth_to_local rules for abc-usr and xyz-usr are needed on both clusters to translate the Kerberos principals to user short names.
4. Is there any need to configure proxy-user rules in Hadoop?
5. As both /abc and /xyz are encryption zones, how do I ensure the data is transferred properly? I presume that as the data is read by distcp on primary_cluster it is transparently decrypted by primary_cluster's KMS, sent over the wire, and re-encrypted on the DR side using dr_cluster's KMS.
6. If the above statement is incorrect, should I run the distcp command on the /abc/.reserved/raw and /xyz/.reserved/raw directories and securely transfer the appropriate KMS keys?
7. What would be the impact in this case if I intend to run distcp using HDFS snapshots?

PS: The purpose of using distcp-based replication instead of Falcon is to make use of HDFS snapshots; the version of Falcon that is part of HDP 2.4.2 doesn't yet support HDFS snapshot-based replication.
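On query 3, a minimal sketch of the kind of mapping rules involved, assuming the single AD.HADOOP.COM realm from the question. Because abc-usr@AD.HADOOP.COM and xyz-usr@AD.HADOOP.COM are plain user principals in the clusters' default realm, the built-in DEFAULT rule normally maps them to their short names already; explicit rules are only needed if the realm is not the default or the names have to be rewritten:

```
# hadoop.security.auth_to_local (core-site.xml on both clusters) - illustrative:
#   RULE:[1:$1@$0](abc-usr@AD.HADOOP.COM)s/@.*//
#   RULE:[1:$1@$0](xyz-usr@AD.HADOOP.COM)s/@.*//
#   DEFAULT

# Check how a principal will be mapped on a given cluster:
hadoop org.apache.hadoop.security.HadoopKerberosName abc-usr@AD.HADOOP.COM
```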
Labels:
- Apache Hadoop
07-26-2016
05:24 PM
2 Kudos
Hi, we have two secured clusters with NameNode HA set up; let's name them PRIMARY and DR. We are now implementing a DR solution between the clusters using HDFS snapshots and distcp to replicate the data from the PRIMARY to the DR cluster (we are on HDP 2.4.2 and Falcon doesn't support HDFS snapshots until HDP 2.5, so we had to use HDFS snapshots with distcp). All the Hadoop daemon accounts on the clusters are prefixed with the cluster name, for example PRIMARY-hdfs, DR-yarn etc. I have a few questions in this regard:

Q: On which node should the distcp job be running?
My understanding: for DR purposes, the distcp job should ideally be run on one of the machines in the DR cluster, as it has unused YARN capacity. The requirement for that node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.

Q: Which user should the distcp job run as? Someone with hdfs privileges (for example, DR-hdfs@HADOOP.COM), or another user created for this purpose, say replication-user (replication-user@HADOOP.COM)?
- If it is the hdfs user (DR-hdfs@HADOOP.COM), how do I ensure the user is allowed access on the PRIMARY cluster? Probably through auth_to_local settings like the one below?
  RULE:[1:$1@$0](.*-hdfs@HADOOP.COM)s/.*/PRIMARY-hdfs/
- If it is a non-standard user like replication-user, what considerations apply? Is it required / recommended to have the same replication-user on both clusters, with an auth_to_local setting similar to the above? As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do I make this work? The replication-user keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?

Q: What is the impact on the solution if the two clusters are part of separate Kerberos realms, like PRIMARY.HADOOP.COM and DR.HADOOP.COM?

Apologies if some of these are trivial. Hadoop security is still a grey area for me, hence the majority of these questions revolve around security. Thanks, Vijay
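Since both clusters run NameNode HA, a related sketch of how a DR-side client node is often set up so that one distcp command can address both HA nameservices by their logical names; all property values, hosts and snapshot names below are placeholders and should be checked against your HDP version:

```
# hdfs-site.xml on the DR-side client node - illustrative entries only:
#   dfs.nameservices                           = PRIMARY,DR
#   dfs.ha.namenodes.PRIMARY                   = nn1,nn2
#   dfs.namenode.rpc-address.PRIMARY.nn1       = primary-nn1:8020
#   dfs.namenode.rpc-address.PRIMARY.nn2       = primary-nn2:8020
#   dfs.client.failover.proxy.provider.PRIMARY = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
#   (plus the equivalent dfs.ha.namenodes.DR / dfs.namenode.rpc-address.DR.* entries)

# With both nameservices resolvable, a pull-style snapshot-diff copy run from the
# DR side as the hypothetical replication-user looks like:
kinit -kt /etc/security/keytabs/replication-user.keytab replication-user@HADOOP.COM
hadoop distcp -update -diff s1 s2 hdfs://PRIMARY/data/app hdfs://DR/data/app
```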
Labels:
- Apache Hadoop
06-23-2016
10:29 AM
@Predrag Monodic Thanks for your responses.
06-22-2016
11:07 PM
@Predrag Monodic, thanks for your response. Some follow-up questions:

"With distcp, you would have to mirror Hive metadata separately." If a distcp-based mechanism is chosen, is mirroring the metadata simply a matter of copying over the underlying metastore database, or are there further considerations?

"A caveat is that all Hive databases and tables existing before Falcon mirroring starts have to be mirrored separately." We are building a DR cluster for an existing production cluster, which uses Hive extensively. Does that mean the initial migration, once the DR cluster kicks off, needs to be done outside Falcon using the Hive export/import mechanism?

"However, some operations like ACID delete and update are not supported." Can you please shed more light on this? When and why are these not supported when Falcon is the solution of choice?

"Falcon will schedule Oozie jobs, and they will do mirroring. It's better to run those jobs on the DR cluster to spare resources on the source cluster." Are you saying to run Falcon (and consequently the underlying Oozie jobs) mainly on the DR cluster rather than on the source cluster?
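On the initial bootstrap point, a minimal sketch of the Hive EXPORT/IMPORT round trip that is typically used for one-off seeding of pre-existing tables; the JDBC URLs, principal, database, table and staging path are all placeholders:

```
# 1. Export the table definition and data to an HDFS staging directory on the source.
beeline -u "jdbc:hive2://primary-hs2:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
  -e "EXPORT TABLE db1.events TO '/apps/hive/staging/events_export'"

# 2. Copy the staging directory to the DR cluster.
hadoop distcp hdfs://primary-nn:8020/apps/hive/staging/events_export \
  hdfs://dr-nn:8020/apps/hive/staging/events_export

# 3. Import on the DR cluster, recreating both the metadata and the data.
beeline -u "jdbc:hive2://dr-hs2:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
  -e "IMPORT TABLE db1.events FROM '/apps/hive/staging/events_export'"
```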
06-22-2016
09:07 AM
1 Kudo
Hi, I am looking for views on which is the better strategy for replicating Hive and its metadata for DR purposes between two kerberised, highly secured clusters: (1) copying the underlying Hive data on HDFS using distcp2, preferably with HDFS snapshots enabled, versus (2) using Falcon mirroring / replication for Hive (which I presume covers both Hive data and metadata). Are there any caveats I need to be aware of with either approach, and which one is preferred over the other, specifically with HDP 2.4.2 in mind? Thanks, Vijay
Labels:
- Apache Falcon
- Apache Hadoop
- Apache Hive
06-20-2016
03:48 PM
@Silvio del Val, at present Ambari supports managing only one cluster per Ambari instance. So, in your case, you may need another Ambari deployment to manage the target cluster.
06-20-2016
03:06 PM
@Silvio del Val Either HBase replication (https://hbase.apache.org/0.94/replication.html) or HBase snapshots (http://hbase.apache.org/0.94/book/ops.snapshots.html) with the ExportSnapshot tool can help you get HBase data replicated to your secondary cluster. HBase uses HDFS as its underlying file system, so yes, if you replicate HBase, all the data stored in those HBase tables will be replicated to your secondary cluster.
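For illustration, a minimal sketch of the snapshot route; the table name, snapshot name, target NameNode and mapper count are placeholders, and the HBase root directory shown is the HDP default:

```
# Take a snapshot of the table on the source cluster.
echo "snapshot 'my_table', 'my_table_snap'" | hbase shell

# Ship the snapshot (metadata plus HFiles) to the secondary cluster's HBase root dir.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot my_table_snap \
  -copy-to hdfs://dr-nn:8020/apps/hbase/data \
  -mappers 16

# On the secondary cluster, materialise it as a table again.
echo "clone_snapshot 'my_table_snap', 'my_table'" | hbase shell
```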
06-20-2016
01:34 PM
@Silvio del Val HBase stores its data in HDFS underneath, but if you copy at the HDFS level you would only get the raw data and would miss all the HBase-level metadata, such as table information. As I mentioned above, you can use HDFS snapshots with distcp2 only for data stored directly in HDFS. For data stored in HBase, you can either use HBase snapshots or, if the cost is acceptable, HBase replication. distcp2 simply spawns MapReduce jobs under the hood, so it does consume vital cluster resources in the form of YARN containers; hence, ensure distcp jobs are run at off-peak hours when cluster utilisation is at a minimum.
06-20-2016
12:48 PM
Silvio, for backup and DR purposes you can use distcp / Falcon to cover HDFS data, and HBase replication to maintain a second backup cluster. The preferable approach to cover HDFS, Hive and HBase is as below:
1. Enable and use HDFS snapshots.
2. Using distcp2, replicate HDFS snapshots between clusters. The present version of Falcon doesn't support HDFS snapshots, hence distcp2 needs to be used; if the functionality becomes available in Falcon, the same can be leveraged.
3. For Hive metadata, Falcon can help replicate the metastore.
4. For HBase data, please see https://hbase.apache.org/0.94/replication.html
5. For Kafka, use Kafka's native MirrorMaker functionality (see the sketch below).
Hope this helps!!
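On point 5, a minimal sketch of a classic MirrorMaker invocation, assuming the consumer properties point at the primary cluster and the producer properties at the backup cluster; the property file names and topic pattern are placeholders, and the exact script location depends on your Kafka distribution:

```
# Mirror all topics matching the whitelist from the primary to the backup cluster.
# source-cluster.properties: consumer settings for the primary cluster.
# target-cluster.properties: producer/bootstrap settings for the backup cluster.
/usr/hdp/current/kafka-broker/bin/kafka-mirror-maker.sh \
  --consumer.config source-cluster.properties \
  --producer.config target-cluster.properties \
  --num.streams 4 \
  --whitelist ".*"
```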
06-11-2016
07:45 AM
Alongside configuring the NN disks for RAID, it is recommended to back up the following to safeguard your cluster from failures:
- HDFS data (can be done using Falcon)
- Hive data (can be done using Falcon)
- HBase data (set up HBase cluster replication)
- Hive metadata (can be done using Falcon between clusters; also set up the underlying metastore database in HA / active-active mode within the cluster)
- Regular backups of the databases used by Ambari, Oozie and Ranger
- Configurations:
  - Ambari Server and Agent configurations (Ambari folders under /etc and /var)
  - Configuration files for each application or service under the /etc directory
  - Binaries (/usr/hadoop/current)
  - Any OS-level configuration changes made at each node in the cluster
06-03-2016
03:49 PM
1 Kudo
Hi, using the importJCEKSKeys.sh script present under Ranger KMS, would it be possible to selectively import keys from Hadoop KMS rather than doing a complete import of all the keys present? Is there a better way to handle selective import of keys? Thanks, Vijay
Labels:
- Apache Ranger
05-26-2016
01:56 PM
2 Kudos
I have seen a few articles and questions on the community around Disaster Recovery. However, it is still not completely clear to me, so I am posting a new question around it. As I understand it, typically these entities need to be backed up / synced between the clusters:

Primary datasets
- HDFS data: teeing - Flume / Hortonworks Data Flow; copying / replication - distcp (invoked manually) or Falcon.
- Hive data: behind the scenes, Hive data is stored in HDFS, so I presume the teeing / copying techniques listed for HDFS above can be used here as well.
- HBase data: HBase's native DR replication mechanism - master-slave, master-master and cyclic (http://hbase.apache.org/book.html#_cluster_replication).
- Solr indexes: if the indexes are stored in HDFS, the HDFS techniques would cover the Solr datasets as well.

DB-backed services
- Hive metadata: periodic backup of the database from the primary to the DR cluster.
- Ambari: the Ambari DB contains configurations for the other ecosystem components; periodic backup of the database from the primary to the DR cluster.
- Oozie: the Oozie database contains job- and workflow-level information, so it needs to be backed up regularly to the DR cluster.
- Ranger: the Ranger policy DB contains information about the various policies impacting RBAC; it needs to be backed up to the DR cluster.

Configurations
- Periodic backup of Ambari Server and Agent configurations (Ambari folders under /etc and /var).
- Periodic backup of the configuration files for each application or service under the /etc directory.
- Periodic backup of binaries (/usr/hadoop/current).
- Periodic backup of any OS-specific changes made at the node level in the primary cluster.

Application / user data

Queries on DR strategy
1. Teeing vs copying: which one is preferred over the other? I understand it is scenario-dependent, but which is more adaptable and more widely used in the industry? Copying?
2. Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what are the things to consider when the same version is not possible?
3. Should the topology be like-for-like between the clusters in terms of component placement, including gateway nodes and ZooKeeper services?
4. How does security play out for DR? Should both clusters' nodes be part of the same Kerberos realm, or can they be part of different realms?
5. Can the replication factor be lower on the DR cluster, or is it recommended to keep it the same as on the primary cluster?
6. Are there any specific network requirements in terms of latency, speed etc. between the clusters?
7. Is there a need to run the balancer on the DR cluster periodically?
8. How does encryption play out between the primary and DR clusters? If encryption at rest is enabled on the primary one, how is it handled on the DR cluster? What are the implications of wire encryption while transferring the data between the clusters?
9. When HDFS snapshots are enabled on the primary cluster, how does it work when data is being synced to the DR cluster? Can snapshots be exported onto another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but still available in a snapshot, will it be synced to the snapshot directory on the DR cluster? (See the sketch after this list.)
10. For services which involve databases (Hive, Oozie, Ambari), instead of backing up periodically from the primary cluster to the DR cluster, is it recommended to set up an HA master in the DR cluster directly?
11. For configurations and application data, instead of backing up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?
12. What extra / different functionality would third-party solutions like WANdisco provide in comparison to Falcon? I am trying to understand the "active-active" working of WANdisco and why it is not possible with Falcon.
13. What is the recommendation for keeping gateway node services like Knox and the client libraries in sync between the clusters?
14. What is the recommendation for keeping application data, for example Spark / Sqoop job-level information, in sync?

Apologies for the lengthy post, but I want to cover all the areas around DR, hence posting it as a single question. Thanks, Vijay
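On the HDFS snapshot question (item 9), a minimal sketch of how snapshot deltas can be inspected and consumed; the path and snapshot names are placeholders. HDFS snapshots are not exported as a unit the way HBase snapshots are, but the diff between two snapshots is what drives an incremental distcp run, and snapshot contents stay readable under the .snapshot directory even after the live file is deleted:

```
# List what changed between two snapshots of a snapshottable directory.
hdfs snapshotDiff /data/app s-2016-05-25 s-2016-05-26

# A file deleted from the live tree is still readable from the snapshot,
# which is what an incremental sync can copy or reconcile from.
hdfs dfs -ls /data/app/.snapshot/s-2016-05-26
```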
Labels:
- Apache Falcon
- Apache Hadoop