Member since: 02-10-2016
Posts: 34
Kudos Received: 16
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
| 3542 | 06-20-2016 12:48 PM
| 2195 | 06-11-2016 07:45 AM
06-23-2016
10:29 AM
@Predrag Monodic Thanks for your responses.
06-22-2016
11:07 PM
@Predrag Monodic Thanks for your response. Some follow-up questions:
"With distcp, you would have to mirror Hive metadata separately." - If the distcp-based mechanism is chosen, is mirroring the metadata simply a matter of copying over the underlying metastore database, or are there further considerations?
"A caveat is that all Hive databases and tables existing before Falcon mirroring starts have to be mirrored separately." - We are building a DR cluster for an existing production cluster that uses Hive extensively. Does that mean the initial migration, once the DR cluster comes online, has to be done outside Falcon using the Hive export/import mechanism?
"However, some operations like ACID delete and update are not supported." - Can you please shed more light on this? When and why are these not supported when Falcon is the chosen solution?
"Falcon will schedule Oozie jobs, and they will do mirroring. It's better to run those jobs on the DR cluster to spare resources on the source cluster." - Are you saying to run Falcon (and consequently the underlying Oozie jobs) mainly on the DR cluster rather than on the source cluster?
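For reference, a minimal sketch of what that Hive export/import path for the initial migration might look like; the table name, staging paths and NameNode addresses below are placeholders, not a prescribed procedure:

```bash
# On the production cluster: export the table definition plus data to an HDFS staging directory
hive -e "EXPORT TABLE transactions TO '/apps/hive/staging/transactions_export';"

# Copy the exported directory to the DR cluster (distcp runs as a MapReduce job)
hadoop distcp \
  hdfs://prod-nn:8020/apps/hive/staging/transactions_export \
  hdfs://dr-nn:8020/apps/hive/staging/transactions_export

# On the DR cluster: import it, which recreates the table metadata and loads the data
hive -e "IMPORT TABLE transactions FROM '/apps/hive/staging/transactions_export';"
```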
06-22-2016
09:07 AM
1 Kudo
Hi, I am looking for views on which is the better strategy for replicating Hive data and its metadata for DR purposes between two Kerberized, highly secured clusters:
- Copy the underlying Hive data on HDFS using distcp2, preferably with HDFS snapshots enabled, or
- Use Falcon mirroring / replication for Hive (which I presume covers both Hive data and metadata).
Are there any caveats I need to be aware of with either approach, and which one is preferred over the other, specifically with HDP 2.4.2 in mind? Thanks, Vijay
Labels:
- Apache Falcon
- Apache Hadoop
- Apache Hive
06-20-2016
03:48 PM
@Silvio del Val, at present Ambari supports managing only one cluster per Ambari instance. So, in your case, you may need a separate Ambari deployment to manage the target cluster.
06-20-2016
03:06 PM
@Silvio del Val Either HBase replication (https://hbase.apache.org/0.94/replication.html) or HBase snapshots (http://hbase.apache.org/0.94/book/ops.snapshots.html) with the ExportSnapshot tool can help you replicate HBase data to your secondary cluster. HBase uses HDFS as its underlying file system, so yes, if you replicate HBase, all the data stored in those HBase tables will be replicated to your secondary cluster.
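For reference, a rough sketch of the snapshot/ExportSnapshot route; the table name, snapshot name and destination path are placeholders, and the exact syntax should be checked against your HBase version:

```bash
# On the source cluster: take a snapshot of the table via the hbase shell
echo "snapshot 'usertable', 'usertable_snap_20160620'" | hbase shell

# Ship the snapshot (metadata plus HFiles) to the secondary cluster using MapReduce
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot usertable_snap_20160620 \
  -copy-to hdfs://dr-nn:8020/apps/hbase/data \
  -mappers 16

# On the secondary cluster: materialise the snapshot as a table via the hbase shell
echo "clone_snapshot 'usertable_snap_20160620', 'usertable'" | hbase shell
```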
06-20-2016
01:34 PM
@Silvio del Val HBase ultimately stores its data in HDFS. But if you copy at the HDFS level, you only get the raw data files and miss all the HBase-level metadata such as table definitions. As I mentioned above, you can use HDFS snapshots with distcp2 only for data stored directly in HDFS. For data stored in HBase, you can use either HBase snapshots or, if the additional resources are affordable, HBase replication. distcp2 simply spawns MapReduce jobs under the hood, so it consumes cluster resources in the form of YARN containers; hence ensure distcp jobs run at off-peak hours when cluster utilisation is at a minimum.
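As a side note on the resource point, distcp has knobs to cap its footprint when a fully idle window is not available; a quick sketch (paths, hosts and values are placeholders):

```bash
# Limit the copy to 20 map tasks at roughly 50 MB/s each so the DR copy
# does not starve production workloads of YARN containers
hadoop distcp -m 20 -bandwidth 50 \
  hdfs://prod-nn:8020/data/projects \
  hdfs://dr-nn:8020/data/projects
```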
06-20-2016
12:48 PM
Silvio, for backup and DR purposes you can use distcp / Falcon to cover HDFS data, and HBase replication to maintain a second backup cluster. The preferred approach to cover HDFS, Hive and HBase is as follows (example commands for steps 1, 2 and 5 are sketched below):
1. Enable and use HDFS snapshots.
2. Using distcp2, replicate HDFS snapshots between clusters. The present version of Falcon doesn't support HDFS snapshots, hence distcp2 needs to be used; if the functionality becomes available in Falcon, it can be leveraged instead.
3. For Hive metadata, Falcon can help replicate the metastore.
4. For HBase data, please see https://hbase.apache.org/0.94/replication.html
5. For Kafka, use Kafka's native MirrorMaker functionality.
Hope this helps!!
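To make steps 1, 2 and 5 concrete, here is a rough sketch; the paths, hosts and snapshot name are placeholders, and the flags should be verified against the distcp and Kafka versions shipped with your HDP release:

```bash
# Step 1: allow and take an HDFS snapshot on the source directory
hdfs dfsadmin -allowSnapshot /data/projects
hdfs dfs -createSnapshot /data/projects s20160620

# Step 2: replicate the snapshot contents to the DR cluster with distcp
hadoop distcp -update -delete \
  hdfs://prod-nn:8020/data/projects/.snapshot/s20160620 \
  hdfs://dr-nn:8020/data/projects

# Step 5: mirror Kafka topics to the DR cluster with MirrorMaker
# (consumer.properties points at the source cluster, producer.properties at DR)
/usr/hdp/current/kafka-broker/bin/kafka-mirror-maker.sh \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist ".*"
```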
06-11-2016
07:45 AM
Alongside configuring the NameNode disks for RAID, it is recommended to back up the following to safeguard your cluster from failures (a rough sketch of the configuration and database backup steps is shown below):
- HDFS data (can be done using Falcon)
- Hive data (can be done using Falcon)
- HBase data (set up HBase cluster replication)
- Hive metadata (can be done using Falcon between clusters; also set up the underlying metastore database in HA / active-active mode within the cluster)
- Regular backups of the databases used by Ambari, Oozie and Ranger
- Configurations: Ambari Server and Agent configurations (Ambari folders under /etc and /var), configuration files for each application or service under /etc, binaries (/usr/hadoop/current), and any OS-level configuration changes made at each node in the cluster
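As a rough illustration of the configuration and database backup items above; the paths assume a default HDP layout and a PostgreSQL-backed Ambari, and the backup destination host is a placeholder:

```bash
# Back up the Ambari database (a default install uses PostgreSQL with a database named "ambari")
pg_dump -U ambari ambari | gzip > /backups/ambari_db_$(date +%F).sql.gz

# Archive Ambari Server/Agent configuration plus the service configs under /etc
tar czf /backups/configs_$(date +%F).tar.gz \
  /etc/ambari-server /etc/ambari-agent \
  /etc/hadoop /etc/hive /etc/hbase

# Ship the archives to a backup host in the DR cluster (remote directory must already exist)
scp /backups/*_$(date +%F).* dr-backup-host:/backups/$(hostname)/
```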
06-03-2016
03:49 PM
1 Kudo
Hi, using the importJCEKSKeys.sh script present under Ranger KMS, would it be possible to selectively import keys from Hadoop KMS rather than doing a complete import of all the keys present? Is there a better way to handle selective import of keys? Thanks, Vijay
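For reference, the keys currently held by the source Hadoop KMS can at least be enumerated before deciding what to import; the KMS host and port below are placeholders for your endpoint:

```bash
# List key names and their metadata from the source Hadoop KMS
hadoop key list -metadata -provider kms://http@kms-host:16000/kms
```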
Labels:
- Apache Ranger
05-26-2016
01:56 PM
2 Kudos
I have seen a few articles and questions on the community around Disaster Recovery. However, it is still not completely clear to me, hence this new question. As I understand it, typically the following entities need to be backed up / synced between the clusters:

Primary datasets
- HDFS data: teeing via Flume / Hortonworks Data Flow, or copying / replication via distcp (invoked manually) or Falcon
- Hive data: behind the scenes, Hive data is stored in HDFS, so I presume the same teeing / copying techniques as for HDFS can be used here as well
- HBase data: HBase's native replication mechanism - master-slave, master-master and cyclic (http://hbase.apache.org/book.html#_cluster_replication)
- Solr indexes: if the indexes are stored in HDFS, the HDFS techniques would cover Solr datasets as well

DB-backed services
- Hive metadata: periodic backup of the metastore database from the primary to the DR cluster (a minimal backup sketch follows below this list)
- Ambari: the Ambari DB contains configurations for the other ecosystem components; periodic backup of the database from the primary to the DR cluster
- Oozie: the Oozie database contains job and workflow level information, so it needs to be backed up regularly to the DR cluster
- Ranger: the Ranger policy DB contains information about the various policies driving RBAC; needs to be backed up to the DR cluster

Configurations
- Periodic backup of Ambari Server and Agent configurations (Ambari folders under /etc and /var)
- Periodic backup of configuration files for each application or service under /etc
- Periodic backup of binaries (/usr/hadoop/current)
- Periodic backup of any OS-level changes made at the node level in the primary cluster

Application / user data
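To illustrate the "periodic backup of the database" idea for the DB-backed services above, a minimal sketch assuming a MySQL-backed Hive metastore; the database name, credentials and host names are placeholders:

```bash
# Dump the Hive metastore database (assumes a MySQL-backed metastore named "hive")
mysqldump -u hive -p"${HIVE_METASTORE_DB_PASSWORD}" --single-transaction hive \
  | gzip > /backups/hive_metastore_$(date +%F).sql.gz

# Copy the dump to the DR cluster, where it can be restored into the standby metastore database
scp /backups/hive_metastore_$(date +%F).sql.gz dr-metastore-host:/backups/hive/
```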
Queries on DR strategy
1. Teeing vs copying: which one is preferred over the other? I understand it is scenario dependent, but which is more adaptable and more widely used in the industry? Copying?
2. Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what should be considered when the same version is not possible?
3. Should the topology be like-for-like between the clusters in terms of component placement, including gateway nodes and ZooKeeper services?
4. How does security play out for DR? Should the nodes of both clusters be part of the same Kerberos realm, or can they be part of different realms?
5. Can the replication factor be lower on the DR cluster, or is it recommended to keep it the same as on the primary cluster?
6. Are there any specific network requirements between the clusters in terms of latency, bandwidth, etc.?
7. Is there a need to run the balancer on the DR cluster periodically?
8. How does encryption play out between the primary and DR clusters? If encryption at rest is enabled on the primary, how is it handled on the DR cluster? What are the implications of wire encryption while transferring data between the clusters?
9. When HDFS snapshots are enabled on the primary cluster, how does that work when data is synced to the DR cluster? Can snapshots be exported to another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but still present in a snapshot, will it be synced to the snapshot directory on the DR cluster?
10. For services backed by databases (Hive, Oozie, Ambari), instead of backing them up periodically from the primary cluster to the DR cluster, is it recommended to set up an HA master directly in the DR cluster?
11. For configurations and application data, instead of backing them up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?
12. What extra / different functionality do third-party solutions like WANDisco provide in comparison to Falcon? I am trying to understand the "active-active" working of WANDisco and why it is not possible with Falcon.
13. What is the recommendation for keeping gateway node services like Knox and client libraries in sync between the clusters?
14. What is the recommendation for keeping application data, for example Spark / Sqoop job level information?
Apologies for the lengthy post, but I want to cover all the areas around DR, hence posting it as a single question. Thanks, Vijay
Labels:
- Apache Falcon
- Apache Hadoop