Member since: 02-10-2016
Posts: 34
Kudos Received: 16
Solutions: 2

My Accepted Solutions
| Title | Views | Posted |
|---|---|---|
| | 3458 | 06-20-2016 12:48 PM |
| | 2175 | 06-11-2016 07:45 AM |
01-30-2017
10:54 PM
Thanks @Sriharsha Chintalapani for bringing out this article. A much-needed one, given the growing importance of Kafka in every data-centric organisation. It covers a lot of ground from a MirrorMaker perspective. Thanks
01-23-2017
06:12 PM
1 Kudo
Hi, I am trying to distcp data between two encryption zones located on two different clusters. The data was copied successfully; however, when I read it on the target cluster, I see gibberish printed on the terminal. The encryption zone on the source was created with a key (test-key). As it's a DR requirement, I created a key on the target cluster with the same key name, i.e. test-key; fundamentally, though, the two clusters are completely independent. I presume that when DistCp reads the data from the source cluster, it should read and transfer the data transparently using the source-side key material and then write to the target using the target's key material. Wondering where this has gone wrong. Any pointers?
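For reference, this is roughly how I understand the two ways such a copy can be run (NameNode addresses and paths below are only illustrative, not my actual ones):

# Copy through the normal paths: data is decrypted with the source zone key on read
# and re-encrypted with the target zone key on write (the two keys can be independent).
# Checksums differ after re-encryption, hence -update -skipcrccheck.
hadoop distcp -update -skipcrccheck hdfs://source-nn:8020/data/ez1 hdfs://target-nn:8020/data/ez1

# Copy the raw encrypted bytes instead: only readable on the target if both clusters
# hold the same key material; the raw.* xattrs are preserved only when both the source
# and destination paths are under /.reserved/raw.
hadoop distcp hdfs://source-nn:8020/.reserved/raw/data/ez1 hdfs://target-nn:8020/.reserved/raw/data/ez1

If the copy effectively went the raw-bytes route, I suppose the gibberish on the target would be expected, since the target key cannot decrypt ciphertext produced under the source key.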
Labels:
- Apache Hadoop
12-01-2016
05:09 PM
2 Kudos
Hi, I am trying to understand the benefits of using distcp -update on its own versus distcp -update combined with HDFS snapshot diffs. As I understand it, -update without any snapshot options replicates only the data modified at the source and doesn't touch files that already exist at the destination. What additional benefits would one realise by using HDFS snapshots in distcp-based replication? Thanks Vijay
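For context, this is the snapshot-diff variant I am asking about (directory and snapshot names are just illustrative): with -diff, distcp derives the change set from the snapshot difference report instead of walking and comparing the full directory trees on every run.

# take snapshots on the (snapshottable) source directory before and after changes
hdfs dfs -createSnapshot /data/app1 s1
# ... data changes happen ...
hdfs dfs -createSnapshot /data/app1 s2

# copy only what changed between s1 and s2; the target is expected to still match
# the state captured in snapshot s1
hadoop distcp -update -diff s1 s2 hdfs://source-nn:8020/data/app1 hdfs://target-nn:8020/data/app1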
Labels:
- Apache Hadoop
11-08-2016
04:22 PM
@Mike Garris Hi, LUKS is disk-level encryption and hence is independent of the encryption provided by HDFS. Please see the link below for an overview of the various levels of encryption and where TDE sits. https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html Hope that answers your query.
09-30-2016
10:19 AM
1 Kudo
Hi, We are running distcp / Falcon based replication between clusters. As depicted in the diagram above, we have edge nodes configured on both clusters and a dedicated private network link has been established between them. Normal cluster traffic, I presume, uses the usual network channelled through the firewall. My understanding of distcp is that it works as NameNode-to-NameNode communication and hence would probably also use the firewall route rather than the private link. Can anyone please guide me on how to make use of the private link so that all the replication traffic (which is expected to be huge and also has to adhere to SLAs) is directed through it? Also looking for alternative suggestions and ideas to make this more performant. Thanks Vijay
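On the performance side, these are the distcp knobs I am aware of (values below are only illustrative); as far as I understand they control parallelism and per-map throughput, but not which network path the traffic takes, which is the part I need guidance on.

# -m: number of map tasks (copy parallelism); -bandwidth: per-map limit in MB/s
hadoop distcp -m 50 -bandwidth 100 hdfs://source-nn:8020/data hdfs://target-nn:8020/data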
Labels:
- Apache Falcon
- Apache Hadoop
08-03-2016
04:35 PM
Thanks @Sunile Manjee. This is the approach I think I need to follow; I was just trying to understand whether there is any alternative. To answer your question about Falcon: we are not using it because we are on HDP 2.4.2 and need to leverage HDFS snapshots, which Falcon doesn't support until HDP 2.5. So I am going with this approach for now.
08-03-2016
02:56 PM
Hi, I am working on a distcp solution between two clusters. On cluster01's HDFS there are multiple directories, each owned by a different application team. The requirement is to distcp these directories onto cluster02 while preserving the access privileges. Both clusters are secured. I was thinking of having a service user, something like "distcp-user", with its own Kerberos principal, which can manage the distcp process and make auditing easy as well.
Would it be possible for distcp-user to complete the distcp process without having read access on cluster01 and write access on cluster02? Is this something impersonation can help with? For example, if dir1 on cluster01 is owned by appuser1 and dir2 by appuser2, can distcp-user impersonate both appuser1 and appuser2 and perform the distcp jobs on their behalf without being able to look at the actual underlying data? Or is it only possible if distcp-user has appropriate read access on cluster01 and write access on cluster02, something to be managed by Ranger / HDFS ACLs? Thanks Vijay
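For illustration, this is roughly how I picture the service-user driven copy if distcp-user is simply granted the required access (keytab path, realm and directories below are placeholders); the -p options ask distcp to preserve owner, group, permissions, ACLs and extended attributes on the copied files.

# authenticate as the dedicated service principal
kinit -kt /etc/security/keytabs/distcp-user.keytab distcp-user@EXAMPLE.COM

# copy dir1 while preserving user, group, permissions, ACLs and xattrs; note that
# setting owner/group on the target generally requires the copying user to have
# HDFS superuser rights there
hadoop distcp -pugpax -update hdfs://cluster01-nn:8020/data/dir1 hdfs://cluster02-nn:8020/data/dir1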
Labels:
- Apache Hadoop
07-28-2016
01:36 PM
Thanks @mqureshi for your response. In order to explain my case better, I have created another question with more detail. Request you to please have a look at it.
07-28-2016
01:34 PM
I am still getting familiar with the security aspects of Hadoop and hence need some guidance. I am trying to set up a distcp job between two secure clusters. Let's say the clusters are called primary_cluster and dr_cluster. Both clusters are connected to a single Active Directory instance and share the same Kerberos realm, AD.HADOOP.COM. On primary_cluster, assume there are two directories that need to be replicated to dr_cluster: /abc owned by abc-usr and /xyz owned by xyz-usr (the same ownership applies on both clusters). Also, /abc and /xyz are designated as encryption zones and hence are encrypted using KMS keys. My customer doesn't want a superuser such as the hdfs user running the distcp job and prefers it to be executed by the owner of the HDFS directory, i.e. abc-usr or xyz-usr in this case. So I am thinking of having keytab files for both abc-usr and xyz-usr made available on the node (let's call it distcp-node) from which the distcp job will be triggered (planning to trigger it on dr_cluster, as dr_cluster's YARN capacity is only lightly used). Below is the sequence of steps I have in mind, all run from distcp-node, which is part of dr_cluster:
1. su abc-usr
2. kinit -k -t /keytab-location abc-usr (at this point, abc-usr gets a TGT from AD's KDC)
3. hadoop distcp hdfs://primary_cluster_nn:50070/abc hdfs://dr_cluster_nn:8020/abc (using the TGT acquired above, Kerberos service tickets need to be acquired)
4. su xyz-usr
5. kinit -k -t /keytab-location xyz-usr (at this point, xyz-usr gets a TGT from AD's KDC)
6. hadoop distcp hdfs://primary_cluster_nn:50070/xyz hdfs://dr_cluster_nn:8020/xyz (using the TGT acquired above, Kerberos service tickets need to be acquired)

My Queries:
1. During steps 3 and 6, will it get service tickets for the NameNodes of both primary_cluster and dr_cluster, or only for dr_cluster's NameNode, since the command is being run on a host that is part of dr_cluster?
2. Are the keytab files required anywhere else across the clusters apart from distcp-node?
3. Is there a need to configure Hadoop auth_to_local settings? If so, what rules are required? I presume these are needed; as a bare minimum, auth_to_local rules for abc-usr and xyz-usr are required on both clusters to translate the Kerberos principals to user short names.
4. Is there any need to configure proxy-user rules in Hadoop?
5. As both /abc and /xyz are encryption zones, how do I ensure the data is transferred properly? I presume that as the data is read by distcp on primary_cluster, it is transparently decrypted by primary_cluster's KMS, sent over the wire, and re-encrypted on the DR side using dr_cluster's KMS.
6. If the above statement is incorrect, should I run the distcp command on the /abc/.reserved/raw and /xyz/.reserved/raw directories and securely transfer the appropriate KMS keys?
7. What would be the impact in this case if I intend to run distcp using HDFS snapshots?

PS: The purpose of using distcp-based replication instead of Falcon is to make use of HDFS snapshots. The Falcon version that ships with HDP 2.4.2 doesn't yet support HDFS snapshot-based replication.
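As a small aside on query 1, this is how I plan to check which service tickets actually end up in the credential cache after one of the distcp runs (principal names follow the ones above):

# list the tickets in the current credential cache after kinit and the distcp run
klist
# I expect to see the TGT (krbtgt/AD.HADOOP.COM@AD.HADOOP.COM) plus an
# nn/<namenode-host>@AD.HADOOP.COM service ticket for each NameNode that was contacted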
Labels:
- Apache Hadoop
07-26-2016
05:24 PM
2 Kudos
Hi, We have two secured clusters with NameNode HA set up. Let's name them PRIMARY and DR. We are now implementing a DR solution between the clusters using HDFS snapshots and distcp to replicate the data from the PRIMARY to the DR cluster (we are on HDP 2.4.2 and Falcon doesn't support HDFS snapshots until HDP 2.5, so we had to use HDFS snapshots with distcp). All the Hadoop daemon accounts on the clusters are prefixed with the cluster name, for example PRIMARY-hdfs, DR-yarn etc.

I have a few questions in this regard:

Q1: On which node should the distcp job be run? My understanding: for DR purposes, the distcp job should ideally run on one of the machines of the DR cluster, as it has unused YARN capacity. The requirement for the node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.

Q2: Which user should the distcp job run as? Is it someone with hdfs privileges (for example, DR-hdfs@HADOOP.COM) or some other user, for example a new user created for this purpose, replication-user (replication-user@HADOOP.COM)?

Q3: If it's the hdfs user (DR-hdfs@HADOOP.COM), how do I ensure the user is allowed access on the PRIMARY cluster? Probably through auth_to_local settings like the one below?

RULE:[1:$1@$0](.*-hdfs@HADOOP.COM)s/.*/PRIMARY-hdfs/

Q4: If it's a non-standard user like replication-user, what considerations need to be taken into account? Is it required / recommended to have the same replication-user on both clusters and an auth_to_local setting similar to the above? As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do I make this work? The replication-user's keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?

Q5: What is the impact on the solution if the two clusters are part of separate Kerberos realms, like PRIMARY.HADOOP.COM and DR.HADOOP.COM?

Apologies if some of these are trivial. Hadoop security is still a grey area for me, hence the majority of these questions surround security. Thanks Vijay
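On the auth_to_local part, this is how I intend to sanity-check a candidate rule from any node whose core-site.xml carries the proposed hadoop.security.auth_to_local value (principals are the ones from my questions):

# prints the local short name that the configured auth_to_local rules map each principal to
hadoop org.apache.hadoop.security.HadoopKerberosName DR-hdfs@HADOOP.COM
hadoop org.apache.hadoop.security.HadoopKerberosName replication-user@HADOOP.COM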
Labels:
- Apache Hadoop