Member since: 02-10-2016
Posts: 34
Kudos Received: 16
Solutions: 2

My Accepted Solutions
Title | Views | Posted
---|---|---
 | 1157 | 06-20-2016 12:48 PM
 | 838 | 06-11-2016 07:45 AM
05-17-2018
08:54 AM
Thanks @Felix Albani. I am aware that we can turn auditing on or off at the policy level. However, I am looking for granular control at the operation level, so that we can disable logging for operations that are perceived as unimportant.
05-16-2018
10:35 AM
Hi, is there a mechanism to choose particular operations / audit types for auditing in Ranger while ignoring other types? We are exploring the possibility of turning off the logging mechanism for a few categories, in anticipation of better search performance in the Ranger Solr audit store. Regards, Vijay
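For what it's worth, this level of control was not available in the Ranger versions shipped at the time, but more recent Apache Ranger releases (2.1 and later) added per-service audit filters that can drop audit events by operation, user or result. A minimal sketch, assuming a recent Ranger version; the config name (commonly ranger.plugin.audit.filters, set in the service's config in Ranger Admin), the field names, and the operation names below should all be verified against your release:

```
# JSON value for the per-service audit-filter config in Ranger Admin.
# Here: always audit denied requests, but stop auditing two hypothetical
# high-volume read operations considered unimportant.
cat > audit-filters.json <<'EOF'
[
  { "accessResult": "DENIED", "isAudited": true },
  { "actions": ["listStatus", "getfileinfo"], "isAudited": false }
]
EOF
```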
Labels:
- Apache Ranger
01-30-2017
10:54 PM
Thanks @Sriharsha Chintalapani for bringing out this article, a much-needed one given the growing importance of Kafka in every data-centric organisation. It covers a lot of ground from the MirrorMaker perspective. Thanks.
01-23-2017
06:12 PM
1 Kudo
Hi, I am trying to distcp data between two encryption zones located on two different clusters. The data is copied successfully; however, when I read it on the target cluster, I see gibberish printed on the terminal. The encryption zone on the source was created with a key (test-key). As it is a DR requirement, I created a key on the target cluster with the same key name, i.e. test-key; fundamentally, however, the two clusters are completely independent. I presumed that when DistCp reads the data from the source cluster, it would read and transfer it transparently using the source-side key and key material, and then write to the target using the target's key and material. Wondering where this has gone wrong. Any pointers?
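For reference, a minimal sketch of the pattern usually recommended when copying between encryption zones that do not share key material: copy through the normal file paths, so the data is decrypted by the source KMS on read and re-encrypted with the target key on write, and skip the CRC comparison because the encrypted blocks on the two sides can never have matching checksums. Host names and paths are placeholders:

```
# Decrypt-on-read / re-encrypt-on-write copy between independent encryption zones.
# -update -skipcrccheck: file checksums cannot be compared across the two zones.
# Avoid /.reserved/raw paths here - raw EDEKs copied from the source cannot be
# decrypted by the target cluster's KMS, which produces exactly this kind of
# gibberish on read.
hadoop distcp -update -skipcrccheck \
  hdfs://source-nn:8020/secure/data \
  hdfs://target-nn:8020/secure/data
```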
Labels:
- Apache Hadoop
12-01-2016
05:09 PM
2 Kudos
Hi, I am trying to understand the benefits of using plain distcp -update versus distcp -update with HDFS snapshot diffs. As I understand it, -update without any snapshot options only replicates data that was modified at the source and doesn't touch files that already exist at the destination. What additional benefits would one realise by using HDFS snapshots in distcp-based replication? Thanks, Vijay
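For context, a minimal sketch comparing the two forms; paths, snapshot names and NameNode addresses are placeholders. With -diff, DistCp only has to process the changes recorded between two snapshots instead of walking and comparing the full source and target listings on every run, and source-side renames and deletes are propagated to the target as well. The prerequisites are that the target already has the older snapshot (s1) and has not been modified since the previous sync:

```
# One-time: make the source directory snapshottable, then snapshot per sync cycle.
hdfs dfsadmin -allowSnapshot /data/app
hdfs dfs -createSnapshot /data/app s1     # taken at the previous sync
hdfs dfs -createSnapshot /data/app s2     # taken before the current sync

# Plain incremental copy: full listing comparison of source and target each run.
hadoop distcp -update hdfs://src-nn:8020/data/app hdfs://dst-nn:8020/data/app

# Snapshot-diff copy: applies only the s1 -> s2 delta (creates, deletes, renames).
hadoop distcp -update -diff s1 s2 \
  hdfs://src-nn:8020/data/app hdfs://dst-nn:8020/data/app
```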
Labels:
- Apache Hadoop
11-08-2016
04:22 PM
@Mike Garris Hi, LUKS is disk-level encryption and hence is independent of the encryption supported by HDFS. Please see the link below for an overview of the various levels of encryption and where TDE sits. https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html Hope that answers your query.
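To make the TDE layer concrete, here is a minimal sketch of how an encryption zone is created on top of a KMS-managed key (the key name and path are placeholders); this operates entirely at the HDFS level and is independent of any LUKS volume encryption underneath:

```
# Create a key in the configured KMS, then turn an empty directory into an
# encryption zone backed by that key. Files written under the zone are
# encrypted and decrypted transparently for authorised clients.
hadoop key create test-key
hdfs dfs -mkdir -p /secure/zone
hdfs crypto -createZone -keyName test-key -path /secure/zone
hdfs crypto -listZones
```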
09-30-2016
10:19 AM
1 Kudo
Hi, we are running distcp / Falcon-based replication between clusters. As depicted in the diagram above, we have edge nodes configured on both clusters, and a dedicated private network link has been established between them. For normal cluster traffic, I presume the regular, firewalled network is used. However, my understanding of distcp is that it works as NameNode-to-NameNode communication and hence would probably take the firewall route rather than the private link. Can anyone please guide me on how to make use of the private link so that all the replication traffic (which is expected to be huge and also has to adhere to SLAs) is directed through it? I am also looking for alternative suggestions and ideas to make this more performant. Thanks, Vijay
Labels:
- Apache Falcon
- Apache Hadoop
08-03-2016
04:35 PM
Thanks @Sunile Manjee. This is the approach I think I need to follow; I was just trying to understand whether there is any other alternative. To answer your question about Falcon: we are not using it because we are on HDP 2.4.2 and need to leverage HDFS snapshots, which Falcon doesn't support until HDP 2.5. So we are going with this approach for now.
08-03-2016
02:56 PM
Hi, I am working on a distcp solution between two clusters. On cluster01's HDFS there are multiple directories, each owned by a different application team. The requirement is to distcp these directories onto cluster02 while preserving the access privileges. Both clusters are secured. I was thinking of having a service user, something like "distcp-user", with its own Kerberos principal, which would manage the distcp process and also make auditing easy.
Would it be possible for distcp-user to complete the distcp process without having read access on cluster01 and write access on cluster02? Is this something impersonation can help with? For example, if dir1 on cluster01 is owned by appuser1 and dir2 by appuser2, can distcp-user impersonate both appuser1 and appuser2 and perform the distcp jobs on their behalf without being able to read the actual underlying data? Or is it only possible if distcp-user has the appropriate read access on cluster01 and write access on cluster02, managed through Ranger / HDFS ACLs? Thanks, Vijay
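For illustration, a minimal sketch of what the impersonation route could look like, assuming the hypothetical distcp-user principal from the question and that core-site.xml on both clusters allows it as a proxy user (hadoop.proxyuser.distcp-user.hosts / hadoop.proxyuser.distcp-user.groups). With this approach each copy runs as the owning application user, so distcp-user needs no direct read/write access of its own, and the audit trail shows the impersonated user:

```
# Authenticate as the service principal (keytab path is a placeholder).
kinit -kt /etc/security/keytabs/distcp-user.keytab distcp-user

# Impersonate the directory owner for each copy via the proxy-user mechanism
# honoured by the Hadoop client (requires the proxyuser entries noted above).
HADOOP_PROXY_USER=appuser1 hadoop distcp -update \
  hdfs://cluster01-nn:8020/data/dir1 hdfs://cluster02-nn:8020/data/dir1

HADOOP_PROXY_USER=appuser2 hadoop distcp -update \
  hdfs://cluster01-nn:8020/data/dir2 hdfs://cluster02-nn:8020/data/dir2
```

Whether direct Ranger / HDFS ACL grants to distcp-user are preferable is really an audit-versus-privilege trade-off: direct grants are simpler, but they mean the service account can read everyone's data.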
Labels:
- Apache Hadoop
07-28-2016
01:36 PM
Thanks @mqureshi for your response. In order to explain my case better, I have created another question with more detail; please have a look at it.
07-28-2016
01:34 PM
I am still getting familiar with the security aspects of Hadoop and hence need some guidance. I am trying to set up a distcp job between two secure clusters; let's call them primary_cluster and dr_cluster. Both clusters are connected to a single Active Directory instance and share the same Kerberos realm, AD.HADOOP.COM. On the primary_cluster, assume there are two directories that need to be replicated to the dr_cluster: /abc owned by abc-usr and /xyz owned by xyz-usr (the same ownership exists on both clusters). Both /abc and /xyz are designated as encryption zones and hence are encrypted using KMS keys. My customer doesn't want a superuser like hdfs to run the distcp job and prefers it to be executed by the owner of the HDFS directory, i.e. abc-usr or xyz-usr in this case. So I'm thinking of making keytab files for both abc-usr and xyz-usr available on the node (let's call it distcp-node) from which the distcp job will be triggered (planning to trigger it on the dr_cluster, as the dr_cluster's YARN capacity is very lightly used). Below is the sequence of steps I have in mind, all run from distcp-node, which is part of the dr_cluster:
1. su - abc-usr
2. kinit -k -t /keytab-location abc-usr (at this point, abc-usr gets a TGT from AD's KDC)
3. hadoop distcp hdfs://primary_cluster_nn:8020/abc hdfs://dr_cluster_nn:8020/abc (using the TGT acquired above, Kerberos service tickets are obtained)
4. su - xyz-usr
5. kinit -k -t /keytab-location xyz-usr (at this point, xyz-usr gets a TGT from AD's KDC)
6. hadoop distcp hdfs://primary_cluster_nn:8020/xyz hdfs://dr_cluster_nn:8020/xyz (using the TGT acquired above, Kerberos service tickets are obtained)

My queries:
1. During steps 3 and 6, will service tickets be obtained for the NameNodes of both primary_cluster and dr_cluster, or only for dr_cluster's NameNode, given that the command is being run on a host that is part of dr_cluster?
2. Are the keytab files required anywhere else across the clusters apart from distcp-node?
3. Is there a need to configure Hadoop auth_to_local settings? If so, what rules are required? I presume these are required: as a bare minimum, auth_to_local rules for abc-usr and xyz-usr are needed on both clusters to translate the Kerberos principals to user short names.
4. Is there any need to configure proxy-user rules in Hadoop?
5. As both /abc and /xyz are encryption zones, how do I ensure the data is transferred properly? I presume that as the data is read by distcp on primary_cluster it is transparently decrypted by primary_cluster's KMS, sent over the wire, and re-encrypted on the DR side using dr_cluster's KMS.
6. If the above statement is incorrect, should I run the distcp command on the /abc/.reserved/raw and /xyz/.reserved/raw directories and securely transfer the appropriate KMS keys?
7. What would be the impact in this case if I intend to run distcp using HDFS snapshots?

PS: The purpose of using distcp-based replication instead of Falcon is to make use of HDFS snapshots; the version of Falcon that is part of HDP 2.4.2 doesn't yet support HDFS snapshot-based replication.
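On query 3, a minimal sketch of the kind of mapping rules involved, assuming the single AD.HADOOP.COM realm from the question. Because abc-usr@AD.HADOOP.COM and xyz-usr@AD.HADOOP.COM are plain user principals in the clusters' default realm, the built-in DEFAULT rule normally maps them to their short names already; explicit rules are only needed if the realm is not the default or the names have to be rewritten:

```
# hadoop.security.auth_to_local (core-site.xml on both clusters) - illustrative:
#   RULE:[1:$1@$0](abc-usr@AD.HADOOP.COM)s/@.*//
#   RULE:[1:$1@$0](xyz-usr@AD.HADOOP.COM)s/@.*//
#   DEFAULT

# Check how a principal will be mapped on a given cluster:
hadoop org.apache.hadoop.security.HadoopKerberosName abc-usr@AD.HADOOP.COM
```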
Labels:
- Apache Hadoop
07-26-2016
05:24 PM
2 Kudos
Hi, we have two secured clusters with NameNode HA set up; let's name them PRIMARY and DR. We are now implementing a DR solution between the clusters using HDFS snapshots and distcp to replicate the data from the PRIMARY to the DR cluster (we are on HDP 2.4.2 and Falcon doesn't support HDFS snapshots until HDP 2.5, so we had to use HDFS snapshots with distcp). All the Hadoop daemon accounts on the clusters are prefixed with the cluster name, for example PRIMARY-hdfs, DR-yarn etc. I have a few questions in this regard:

Q: On which node should the distcp job be running?
My understanding: for DR purposes, the distcp job should ideally be run on one of the machines in the DR cluster, as it has unused YARN capacity. The requirement for that node is to have the Hadoop client libraries available so it can run distcp. For example, assume the node is dr-host1@HADOOP.COM.

Q: Which user should the distcp job run as? Someone with hdfs privileges (for example, DR-hdfs@HADOOP.COM), or another user created for this purpose, say replication-user (replication-user@HADOOP.COM)?
- If it is the hdfs user (DR-hdfs@HADOOP.COM), how do I ensure the user is allowed access on the PRIMARY cluster? Probably through auth_to_local settings like the one below?
  RULE:[1:$1@$0](.*-hdfs@HADOOP.COM)s/.*/PRIMARY-hdfs/
- If it is a non-standard user like replication-user, what considerations apply? Is it required / recommended to have the same replication-user on both clusters, with an auth_to_local setting similar to the above? As the clusters are secured by Kerberos and the principals are going to be different on the two clusters, how do I make this work? The replication-user keytab file is going to be different on the PRIMARY and DR clusters. What is the best approach to handle this?

Q: What is the impact on the solution if the two clusters are part of separate Kerberos realms, like PRIMARY.HADOOP.COM and DR.HADOOP.COM?

Apologies if some of these are trivial. Hadoop security is still a grey area for me, hence the majority of these questions revolve around security. Thanks, Vijay
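Since both clusters run NameNode HA, a related sketch of how a DR-side client node is often set up so that one distcp command can address both HA nameservices by their logical names; all property values, hosts and snapshot names below are placeholders and should be checked against your HDP version:

```
# hdfs-site.xml on the DR-side client node - illustrative entries only:
#   dfs.nameservices                           = PRIMARY,DR
#   dfs.ha.namenodes.PRIMARY                   = nn1,nn2
#   dfs.namenode.rpc-address.PRIMARY.nn1       = primary-nn1:8020
#   dfs.namenode.rpc-address.PRIMARY.nn2       = primary-nn2:8020
#   dfs.client.failover.proxy.provider.PRIMARY = org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
#   (plus the equivalent dfs.ha.namenodes.DR / dfs.namenode.rpc-address.DR.* entries)

# With both nameservices resolvable, a pull-style snapshot-diff copy run from the
# DR side as the hypothetical replication-user looks like:
kinit -kt /etc/security/keytabs/replication-user.keytab replication-user@HADOOP.COM
hadoop distcp -update -diff s1 s2 hdfs://PRIMARY/data/app hdfs://DR/data/app
```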
Labels:
- Apache Hadoop
06-23-2016
10:29 AM
@Predrag Monodic Thanks for your responses.
06-22-2016
11:07 PM
@Predrag Monodic, thanks for your response. Some follow-up questions:

"With distcp, you would have to mirror Hive metadata separately." If a distcp-based mechanism is chosen, is mirroring the metadata simply a matter of copying over the underlying metastore database, or are there further considerations?

"A caveat is that all Hive databases and tables existing before Falcon mirroring starts have to be mirrored separately." We are building a DR cluster for an existing production cluster, which uses Hive extensively. Does that mean the initial migration, once the DR cluster kicks off, needs to be done outside Falcon using the Hive export/import mechanism?

"However, some operations like ACID delete and update are not supported." Can you please shed more light on this? When and why are these not supported when Falcon is the solution of choice?

"Falcon will schedule Oozie jobs, and they will do mirroring. It's better to run those jobs on the DR cluster to spare resources on the source cluster." Are you saying to run Falcon (and consequently the underlying Oozie jobs) mainly on the DR cluster rather than on the source cluster?
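On the initial bootstrap point, a minimal sketch of the Hive EXPORT/IMPORT round trip that is typically used for one-off seeding of pre-existing tables; the JDBC URLs, principal, database, table and staging path are all placeholders:

```
# 1. Export the table definition and data to an HDFS staging directory on the source.
beeline -u "jdbc:hive2://primary-hs2:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
  -e "EXPORT TABLE db1.events TO '/apps/hive/staging/events_export'"

# 2. Copy the staging directory to the DR cluster.
hadoop distcp hdfs://primary-nn:8020/apps/hive/staging/events_export \
  hdfs://dr-nn:8020/apps/hive/staging/events_export

# 3. Import on the DR cluster, recreating both the metadata and the data.
beeline -u "jdbc:hive2://dr-hs2:10000/default;principal=hive/_HOST@EXAMPLE.COM" \
  -e "IMPORT TABLE db1.events FROM '/apps/hive/staging/events_export'"
```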
06-22-2016
09:07 AM
1 Kudo
Hi, I am looking for views on which is the better strategy for replicating Hive and its metadata for DR purposes between two kerberised, highly secured clusters: (1) copying the underlying Hive data on HDFS using distcp2, preferably with HDFS snapshots enabled, versus (2) using Falcon mirroring / replication for Hive (which I presume covers both Hive data and metadata). Are there any caveats I need to be aware of with either approach, and which one is preferred over the other, specifically with HDP 2.4.2 in mind? Thanks, Vijay
Labels:
- Apache Falcon
- Apache Hadoop
- Apache Hive
06-20-2016
03:48 PM
@Silvio del Val, at present Ambari supports managing only one cluster per Ambari instance. So, in your case, you may need another Ambari deployment to manage the target cluster.
06-20-2016
03:06 PM
@Silvio del Val Either HBase replication (https://hbase.apache.org/0.94/replication.html) or HBase snapshots (http://hbase.apache.org/0.94/book/ops.snapshots.html) with the ExportSnapshot tool can help you get HBase data replicated to your secondary cluster. HBase uses HDFS as its underlying file system, so yes, if you replicate HBase, all the data stored in those HBase tables will be replicated to your secondary cluster.
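For illustration, a minimal sketch of the snapshot route; the table name, snapshot name, target NameNode and mapper count are placeholders, and the HBase root directory shown is the HDP default:

```
# Take a snapshot of the table on the source cluster.
echo "snapshot 'my_table', 'my_table_snap'" | hbase shell

# Ship the snapshot (metadata plus HFiles) to the secondary cluster's HBase root dir.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot my_table_snap \
  -copy-to hdfs://dr-nn:8020/apps/hbase/data \
  -mappers 16

# On the secondary cluster, materialise it as a table again.
echo "clone_snapshot 'my_table_snap', 'my_table'" | hbase shell
```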
06-20-2016
01:34 PM
@Silvio del Val HBase stores its data in HDFS underneath, but if you copy at the HDFS level you would only get the raw data and would miss all the HBase-level metadata, such as table information. As I mentioned above, you can use HDFS snapshots with distcp2 only for data stored directly in HDFS. For data stored in HBase, you can either use HBase snapshots or, if the cost is acceptable, HBase replication. distcp2 simply spawns MapReduce jobs under the hood, so it does consume vital cluster resources in the form of YARN containers; hence, ensure distcp jobs are run at off-peak hours when cluster utilisation is at a minimum.
06-20-2016
12:48 PM
Silvio, for backup and DR purposes you can use distcp / Falcon to cover HDFS data, and HBase replication to maintain a second backup cluster. The preferable approach to cover HDFS, Hive and HBase is as below:
1. Enable and use HDFS snapshots.
2. Using distcp2, replicate HDFS snapshots between clusters. The present version of Falcon doesn't support HDFS snapshots, hence distcp2 needs to be used; if the functionality becomes available in Falcon, the same can be leveraged.
3. For Hive metadata, Falcon can help replicate the metastore.
4. For HBase data, please see https://hbase.apache.org/0.94/replication.html
5. For Kafka, use Kafka's native MirrorMaker functionality (see the sketch below).
Hope this helps!!
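On point 5, a minimal sketch of a classic MirrorMaker invocation, assuming the consumer properties point at the primary cluster and the producer properties at the backup cluster; the property file names and topic pattern are placeholders, and the exact script location depends on your Kafka distribution:

```
# Mirror all topics matching the whitelist from the primary to the backup cluster.
# source-cluster.properties: consumer settings for the primary cluster.
# target-cluster.properties: producer/bootstrap settings for the backup cluster.
/usr/hdp/current/kafka-broker/bin/kafka-mirror-maker.sh \
  --consumer.config source-cluster.properties \
  --producer.config target-cluster.properties \
  --num.streams 4 \
  --whitelist ".*"
```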
06-11-2016
07:45 AM
Alongside configuring the NN disks for RAID, it is recommended to back up the following to safeguard your cluster from failures:
- HDFS data (can be done using Falcon)
- Hive data (can be done using Falcon)
- HBase data (set up HBase cluster replication)
- Hive metadata (can be done using Falcon between clusters; also set up the underlying metastore database in HA / active-active mode within the cluster)
- Regular backups of the databases used by Ambari, Oozie and Ranger
- Configurations:
  - Ambari Server and Agent configurations (Ambari folders under /etc and /var)
  - Configuration files for each application or service under the /etc directory
  - Binaries (/usr/hadoop/current)
  - Any OS-level configuration changes made at each node in the cluster
06-03-2016
03:49 PM
1 Kudo
Hi, using the importJCEKSKeys.sh script present under Ranger KMS, would it be possible to selectively import keys from Hadoop KMS rather than doing a complete import of all the keys present? Is there a better way to handle selective import of keys? Thanks, Vijay
Labels:
- Apache Ranger
05-26-2016
01:56 PM
2 Kudos
I have seen a few articles and questions on the community around Disaster Recovery. However, it is still not completely clear to me, so I am posting a new question around it. As I understand it, typically these entities need to be backed up / synced between the clusters:

Primary datasets
- HDFS data: teeing - Flume / Hortonworks Data Flow; copying / replication - distcp (invoked manually) or Falcon.
- Hive data: behind the scenes, Hive data is stored in HDFS, so I presume the teeing / copying techniques listed for HDFS above can be used here as well.
- HBase data: HBase's native DR replication mechanism - master-slave, master-master and cyclic (http://hbase.apache.org/book.html#_cluster_replication).
- Solr indexes: if the indexes are stored in HDFS, the HDFS techniques would cover the Solr datasets as well.

DB-backed services
- Hive metadata: periodic backup of the database from the primary to the DR cluster.
- Ambari: the Ambari DB contains configurations for the other ecosystem components; periodic backup of the database from the primary to the DR cluster.
- Oozie: the Oozie database contains job- and workflow-level information, so it needs to be backed up regularly to the DR cluster.
- Ranger: the Ranger policy DB contains information about the various policies impacting RBAC; it needs to be backed up to the DR cluster.

Configurations
- Periodic backup of Ambari Server and Agent configurations (Ambari folders under /etc and /var).
- Periodic backup of the configuration files for each application or service under the /etc directory.
- Periodic backup of binaries (/usr/hadoop/current).
- Periodic backup of any OS-specific changes made at the node level in the primary cluster.

Application / user data

Queries on DR strategy
1. Teeing vs copying: which one is preferred over the other? I understand it is scenario-dependent, but which is more adaptable and more widely used in the industry? Copying?
2. Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what are the things to consider when the same version is not possible?
3. Should the topology be like-for-like between the clusters in terms of component placement, including gateway nodes and ZooKeeper services?
4. How does security play out for DR? Should both clusters' nodes be part of the same Kerberos realm, or can they be part of different realms?
5. Can the replication factor be lower on the DR cluster, or is it recommended to keep it the same as on the primary cluster?
6. Are there any specific network requirements in terms of latency, speed etc. between the clusters?
7. Is there a need to run the balancer on the DR cluster periodically?
8. How does encryption play out between the primary and DR clusters? If encryption at rest is enabled on the primary one, how is it handled on the DR cluster? What are the implications of wire encryption while transferring the data between the clusters?
9. When HDFS snapshots are enabled on the primary cluster, how does it work when data is being synced to the DR cluster? Can snapshots be exported onto another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but still available in a snapshot, will it be synced to the snapshot directory on the DR cluster? (See the sketch after this list.)
10. For services which involve databases (Hive, Oozie, Ambari), instead of backing up periodically from the primary cluster to the DR cluster, is it recommended to set up an HA master in the DR cluster directly?
11. For configurations and application data, instead of backing up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?
12. What extra / different functionality would third-party solutions like WANdisco provide in comparison to Falcon? I am trying to understand the "active-active" working of WANdisco and why it is not possible with Falcon.
13. What is the recommendation for keeping gateway node services like Knox and the client libraries in sync between the clusters?
14. What is the recommendation for keeping application data, for example Spark / Sqoop job-level information, in sync?

Apologies for the lengthy post, but I want to cover all the areas around DR, hence posting it as a single question. Thanks, Vijay
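On the HDFS snapshot question (item 9), a minimal sketch of how snapshot deltas can be inspected and consumed; the path and snapshot names are placeholders. HDFS snapshots are not exported as a unit the way HBase snapshots are, but the diff between two snapshots is what drives an incremental distcp run, and snapshot contents stay readable under the .snapshot directory even after the live file is deleted:

```
# List what changed between two snapshots of a snapshottable directory.
hdfs snapshotDiff /data/app s-2016-05-25 s-2016-05-26

# A file deleted from the live tree is still readable from the snapshot,
# which is what an incremental sync can copy or reconcile from.
hdfs dfs -ls /data/app/.snapshot/s-2016-05-26
```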
Labels:
- Apache Falcon
- Apache Hadoop