Member since: 02-10-2016
Posts: 34
Kudos Received: 16
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
 | 3544 | 06-20-2016 12:48 PM
 | 2195 | 06-11-2016 07:45 AM
07-18-2017
04:24 PM
1 Kudo
@Vijaya Narayana Reddy Bhoomi Reddy You have to export and import the key as well. Just creating a key with the same name does not make it the same key; that is why you are seeing gibberish values. I wrote an article that automates this task with a script, linked below. You can just change the cluster inside the script and adjust the directory locations (if any) to make it work. https://community.hortonworks.com/content/kbentry/110144/hdfs-encrypted-zone-intra-cluster-transfer-automat-1.html
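As a quick sanity check after the export/import, the key metadata on both clusters should match. A minimal sketch using the standard `hadoop key` CLI (the KMS provider URIs and ports below are placeholders for your environments):

```shell
# Compare key metadata between the source and target KMS instances
# (hostnames/ports are placeholders for your clusters).
hadoop key list -metadata -provider kms://http@source-kms.example.com:9292/kms
hadoop key list -metadata -provider kms://http@target-kms.example.com:9292/kms

# The cipher and key length must match on both sides; if the key material
# was not actually imported, decryption on the target produces gibberish.
```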
12-02-2016
09:09 PM
3 Kudos
Mainly two benefits: 1. It avoids unnecessary copies of renamed files/directories. If we rename a large directory on the source side, "distcp -update" cannot detect the rename and will copy the whole renamed directory as a new one. 2. More efficient copy-list generation. "distcp -update" needs to scan the whole directory and detect identical files during the copy process, so copy-list generation may take a long time for a big directory. Using the snapshot-diff-based approach can greatly reduce this workload in an incremental-sync scenario. However, snapshot-based distcp requires very careful snapshot management on both the source and target clusters. E.g., the target cluster must not have any modifications between two copies; otherwise the diff may not apply correctly.
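A minimal sketch of the snapshot-diff flow (paths, snapshot names, and the DR NameNode address are placeholders):

```shell
# One-time setup: allow snapshots and do a full baseline copy.
hdfs dfsadmin -allowSnapshot /data/src
hdfs dfs -createSnapshot /data/src s1
hadoop distcp /data/src/.snapshot/s1 hdfs://dr-nn.example.com:8020/data/dst
# Mark the target as in sync (snapshots must be allowed there too).
hdfs dfs -createSnapshot hdfs://dr-nn.example.com:8020/data/dst s1

# Incremental sync: take a new snapshot and copy only the s1 -> s2 diff.
hdfs dfs -createSnapshot /data/src s2
hadoop distcp -update -diff s1 s2 /data/src hdfs://dr-nn.example.com:8020/data/dst
hdfs dfs -createSnapshot hdfs://dr-nn.example.com:8020/data/dst s2
```

Note the requirement mentioned above: the target must not be modified between copies, or the diff application fails and distcp falls back to a regular sync.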
09-30-2016
05:28 PM
5 Kudos
@Vijaya Narayana Reddy Bhoomi Reddy Edge nodes, while they may be in the same subnet as your HDP clusters, are not really part of the clusters themselves, so there is no HDP configuration trick to redirect traffic through edge nodes and the Private Link. If you wish to use the 10 GB Private Link, it is just a matter of working with your network team to have those HDP clusters communicate over that Private Link instead of the firewall-channeled network (though I doubt they will want to do it). You did not put a number next to that "Firewall" line, but I assume it is much smaller, since you want to use the other one. Maybe the network team needs to upgrade the firewall-channeled network to meet the SLA. That is the correct approach, rather than using some trick to route over the Private Link between edge nodes; it would meet your SLA and also keep the network team happy by leaving the firewall function in place. The network team may be able to peer up those clusters to redirect the traffic through the Private Link without going through the edge nodes, bypassing the firewall-channeled network, but I am pretty sure they would be breaking their network design principles by going that way. The best approach is to upgrade the firewall-channeled network to meet your needs.
08-03-2016
04:35 PM
Thanks @Sunile Manjee This is the approach I think I need to follow; I was just trying to understand if there is any other alternative. To answer your question about Falcon, we are not using it because we are on HDP 2.4.2 and need to leverage HDFS snapshots, and Falcon doesn't support snapshots until 2.5. So we are going with this approach for now.
07-28-2016
03:46 PM
@Vijaya Narayana Reddy Bhoomi Reddy I believe the property you need to check is hadoop.security.auth_to_local, in core-site.xml. More about securing DistCp: read here
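For illustration, a rule in that property might look like this (the realm name is a placeholder; adapt the RULE patterns to your own realms):

```xml
<!-- core-site.xml: map principals from the remote realm to local short names -->
<property>
  <name>hadoop.security.auth_to_local</name>
  <value>
    RULE:[1:$1@$0](.*@DR.EXAMPLE.COM)s/@.*//
    RULE:[2:$1@$0](.*@DR.EXAMPLE.COM)s/@.*//
    DEFAULT
  </value>
</property>
```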
07-28-2016
01:36 PM
Thanks @mqureshi for your response. To explain my case better, I have created another question with more detail. Could you please have a look at it?
06-23-2016
10:29 AM
@Predrag Monodic Thanks for your responses.
12-16-2016
08:04 AM
@Vijaya Narayana Reddy Bhoomi Reddy At this point Ranger KMS import/export scripts do not allow for copy of a subset of keys. It's all or nothing.
05-26-2016
02:46 PM
2 Kudos
Teeing vs. copying: which one is preferred over the other? I understand it is scenario-dependent, but which has better adaptability and is more widely used in the industry? Copying?
With teeing, you can split primary tasks between the two clusters and use the other cluster as DR for that task. For example, with clusters C1 and C2, you can use C1 as the primary cluster and C2 as DR for some teams/tasks, and C2 as the primary cluster and C1 as DR for other users/tasks.
Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what are things to consider if same version is not possible?
It is convenient to have them both on the same version. This is especially the case if you want to fail over to the DR cluster with almost no code changes when the primary cluster is down.
Should it be a like-for-like topology between clusters in terms of component placement, including gateway nodes and ZooKeeper services?
This is not required.
How does security play out for DR? Should both the cluster nodes be part of the same Kerberos realm or can they be part of different realms?
As a DR, same realm is a lot easier to manage than cross realm. But cross realm is possible.
Can the replication factor be lower, or is it recommended to keep it the same as on the primary cluster?
I have seen replication factor 2 used on DR clusters, but if the DR cluster becomes primary after a disaster, you may have to change the replication factor back to 3 on all data sets.
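If that failover happens, existing files can be re-replicated with `hdfs dfs -setrep` (the path below is an example; new files pick up the cluster-wide `dfs.replication` setting instead):

```shell
# Raise the replication factor of existing data back to 3,
# waiting (-w) for re-replication to complete.
hdfs dfs -setrep -w 3 /data
```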
Any specific network requirements in terms of latency, speed etc. between the clusters
For distcp, each node in one cluster should be able to communicate with each of the nodes in the second cluster.
Is there a need to run balancer on the DR cluster periodically?
Yes. It is always good to run the balancer to keep a similar number of blocks across nodes.
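For example (the threshold value is just a common starting point, not a recommendation from this thread):

```shell
# Rebalance until every DataNode's utilization is within 10%
# of the cluster average.
hdfs balancer -threshold 10
```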
How does encryption play out between the primary and DR clusters? If encryption at rest is enabled on the primary one, how is it handled on the DR cluster? What are the implications of wire encryption while transferring the data between the clusters?
Wire encryption will slow down transfers a little bit.
When HDFS snapshots are enabled on the primary cluster, how does it work when data is being synced to the DR cluster? Can snapshots be exported to another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but still available in a snapshot, will it be synced to the snapshot directory on the DR cluster?
If you are using snapshots, you can simply run distcp on the snapshots instead of the actual data set.
For services which involve databases (Hive, Oozie, Ambari), instead of backing up periodically from the primary cluster to the DR cluster, is it recommended to set up an HA master in the DR cluster directly?
I don't think automating Ambari is a good idea. Configs don't change that much, so a simple process of duplicating them might be better. Backing up would mean you need the same hostnames and the same topology.
For Hive, instead of a complete backup, Falcon can take care of table-level replication.
For configurations and application data, instead of backing up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?
Not sure where your application data resides, but for configuration, since everything is managed by Ambari, you only need to keep the Ambari configuration in sync.