Member since: 02-10-2016
Posts: 34
Kudos Received: 16
Solutions: 2
My Accepted Solutions
Title | Views | Posted
---|---|---
| 3542 | 06-20-2016 12:48 PM
| 2195 | 06-11-2016 07:45 AM
06-23-2016
10:29 AM
@Predrag Monodic Thanks for your responses.
06-22-2016
11:07 PM
@Predrag Monodic Thanks for your response. Some follow-up questions:
"With distcp, you would have to mirror Hive metadata separately." - If the distcp-based mechanism is chosen, is mirroring the metadata simply a matter of copying over the underlying metastore database, or are there further considerations?
"A caveat is that all Hive databases and tables existing before Falcon mirroring starts have to be mirrored separately." - We are building a DR cluster for an existing production cluster that uses Hive extensively. Does that mean the initial migration, once the DR cluster comes online, has to be done outside Falcon using the Hive export/import mechanism?
"However, some operations like ACID delete and update are not supported." - Can you please shed more light on this? When and why are these not supported when Falcon is the chosen solution?
"Falcon will schedule Oozie jobs, and they will do mirroring. It's better to run those jobs on the DR cluster to spare resources on the source cluster." - Are you saying to run Falcon (and consequently the underlying Oozie jobs) mainly on the DR cluster rather than on the source cluster?
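For reference, a minimal sketch of what that Hive export/import path for the initial migration might look like; the table name, staging paths and NameNode addresses below are placeholders, not a prescribed procedure:

```bash
# On the production cluster: export the table definition plus data to an HDFS staging directory
hive -e "EXPORT TABLE transactions TO '/apps/hive/staging/transactions_export';"

# Copy the exported directory to the DR cluster (distcp runs as a MapReduce job)
hadoop distcp \
  hdfs://prod-nn:8020/apps/hive/staging/transactions_export \
  hdfs://dr-nn:8020/apps/hive/staging/transactions_export

# On the DR cluster: import it, which recreates the table metadata and loads the data
hive -e "IMPORT TABLE transactions FROM '/apps/hive/staging/transactions_export';"
```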
06-22-2016
09:07 AM
1 Kudo
Hi, I am looking for views on which is the better strategy for replicating Hive data and its metadata for DR purposes between two Kerberized, highly secured clusters:
- Copy the underlying Hive data on HDFS using distcp2, preferably with HDFS snapshots enabled, or
- Use Falcon mirroring / replication for Hive (which I presume covers both Hive data and metadata).
Are there any caveats I need to be aware of with either approach, and which one is preferred over the other, specifically with HDP 2.4.2 in mind? Thanks, Vijay
Labels:
- Apache Falcon
- Apache Hadoop
- Apache Hive
06-20-2016
03:48 PM
@Silvio del Val, at present Ambari supports managing only one cluster per Ambari instance. So, in your case, you may need a separate Ambari deployment to manage the target cluster.
06-20-2016
03:06 PM
@Silvio del Val Either HBase replication (https://hbase.apache.org/0.94/replication.html) or HBase snapshots (http://hbase.apache.org/0.94/book/ops.snapshots.html) with the ExportSnapshot tool can help you replicate HBase data to your secondary cluster. HBase uses HDFS as its underlying file system, so yes, if you replicate HBase, all the data stored in those HBase tables will be replicated to your secondary cluster.
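For reference, a rough sketch of the snapshot/ExportSnapshot route; the table name, snapshot name and destination path are placeholders, and the exact syntax should be checked against your HBase version:

```bash
# On the source cluster: take a snapshot of the table via the hbase shell
echo "snapshot 'usertable', 'usertable_snap_20160620'" | hbase shell

# Ship the snapshot (metadata plus HFiles) to the secondary cluster using MapReduce
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -snapshot usertable_snap_20160620 \
  -copy-to hdfs://dr-nn:8020/apps/hbase/data \
  -mappers 16

# On the secondary cluster: materialise the snapshot as a table via the hbase shell
echo "clone_snapshot 'usertable_snap_20160620', 'usertable'" | hbase shell
```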
06-20-2016
01:34 PM
@Silvio del Val HBase ultimately stores its data in HDFS. But if you copy at the HDFS level, you only get the raw data files and miss all the HBase-level metadata such as table definitions. As I mentioned above, you can use HDFS snapshots with distcp2 only for data stored directly in HDFS. For data stored in HBase, you can use either HBase snapshots or, if the additional resources are affordable, HBase replication. distcp2 simply spawns MapReduce jobs under the hood, so it consumes cluster resources in the form of YARN containers; hence ensure distcp jobs run at off-peak hours when cluster utilisation is at a minimum.
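As a side note on the resource point, distcp has knobs to cap its footprint when a fully idle window is not available; a quick sketch (paths, hosts and values are placeholders):

```bash
# Limit the copy to 20 map tasks at roughly 50 MB/s each so the DR copy
# does not starve production workloads of YARN containers
hadoop distcp -m 20 -bandwidth 50 \
  hdfs://prod-nn:8020/data/projects \
  hdfs://dr-nn:8020/data/projects
```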
06-20-2016
12:48 PM
Silvio, for backup and DR purposes you can use distcp / Falcon to cover HDFS data, and HBase replication to maintain a second backup cluster. The preferred approach to cover HDFS, Hive and HBase is as follows (example commands for steps 1, 2 and 5 are sketched below):
1. Enable and use HDFS snapshots.
2. Using distcp2, replicate HDFS snapshots between clusters. The present version of Falcon doesn't support HDFS snapshots, hence distcp2 needs to be used; if the functionality becomes available in Falcon, it can be leveraged instead.
3. For Hive metadata, Falcon can help replicate the metastore.
4. For HBase data, please see https://hbase.apache.org/0.94/replication.html
5. For Kafka, use Kafka's native MirrorMaker functionality.
Hope this helps!!
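To make steps 1, 2 and 5 concrete, here is a rough sketch; the paths, hosts and snapshot name are placeholders, and the flags should be verified against the distcp and Kafka versions shipped with your HDP release:

```bash
# Step 1: allow and take an HDFS snapshot on the source directory
hdfs dfsadmin -allowSnapshot /data/projects
hdfs dfs -createSnapshot /data/projects s20160620

# Step 2: replicate the snapshot contents to the DR cluster with distcp
hadoop distcp -update -delete \
  hdfs://prod-nn:8020/data/projects/.snapshot/s20160620 \
  hdfs://dr-nn:8020/data/projects

# Step 5: mirror Kafka topics to the DR cluster with MirrorMaker
# (consumer.properties points at the source cluster, producer.properties at DR)
/usr/hdp/current/kafka-broker/bin/kafka-mirror-maker.sh \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist ".*"
```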
06-11-2016
07:45 AM
Alongside configuring the NameNode disks for RAID, it is recommended to back up the following to safeguard your cluster from failures (a rough sketch of the configuration and database backup steps is shown below):
- HDFS data (can be done using Falcon)
- Hive data (can be done using Falcon)
- HBase data (set up HBase cluster replication)
- Hive metadata (can be done using Falcon between clusters; also set up the underlying metastore database in HA / active-active mode within the cluster)
- Regular backups of the databases used by Ambari, Oozie and Ranger
- Configurations: Ambari Server and Agent configurations (Ambari folders under /etc and /var), configuration files for each application or service under /etc, binaries (/usr/hadoop/current), and any OS-level configuration changes made at each node in the cluster
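As a rough illustration of the configuration and database backup items above; the paths assume a default HDP layout and a PostgreSQL-backed Ambari, and the backup destination host is a placeholder:

```bash
# Back up the Ambari database (a default install uses PostgreSQL with a database named "ambari")
pg_dump -U ambari ambari | gzip > /backups/ambari_db_$(date +%F).sql.gz

# Archive Ambari Server/Agent configuration plus the service configs under /etc
tar czf /backups/configs_$(date +%F).tar.gz \
  /etc/ambari-server /etc/ambari-agent \
  /etc/hadoop /etc/hive /etc/hbase

# Ship the archives to a backup host in the DR cluster (remote directory must already exist)
scp /backups/*_$(date +%F).* dr-backup-host:/backups/$(hostname)/
```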
06-03-2016
03:49 PM
1 Kudo
Hi, using the importJCEKSKeys.sh script present under Ranger KMS, would it be possible to selectively import keys from Hadoop KMS rather than doing a complete import of all the keys present? Is there a better way to handle selective import of keys? Thanks, Vijay
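For reference, the keys currently held by the source Hadoop KMS can at least be enumerated before deciding what to import; the KMS host and port below are placeholders for your endpoint:

```bash
# List key names and their metadata from the source Hadoop KMS
hadoop key list -metadata -provider kms://http@kms-host:16000/kms
```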
Labels:
- Apache Ranger
05-26-2016
01:56 PM
2 Kudos
I have seen a few articles and questions on the community around Disaster Recovery. However, it is still not completely clear to me, hence this new question. As I understand it, typically the following entities need to be backed up / synced between the clusters:

Primary datasets
- HDFS data: teeing via Flume / Hortonworks Data Flow, or copying / replication via distcp (invoked manually) or Falcon
- Hive data: behind the scenes, Hive data is stored in HDFS, so I presume the same teeing / copying techniques as for HDFS can be used here as well
- HBase data: HBase's native replication mechanism - master-slave, master-master and cyclic (http://hbase.apache.org/book.html#_cluster_replication)
- Solr indexes: if the indexes are stored in HDFS, the HDFS techniques would cover Solr datasets as well

DB-backed services
- Hive metadata: periodic backup of the metastore database from the primary to the DR cluster (a minimal backup sketch follows below this list)
- Ambari: the Ambari DB contains configurations for the other ecosystem components; periodic backup of the database from the primary to the DR cluster
- Oozie: the Oozie database contains job and workflow level information, so it needs to be backed up regularly to the DR cluster
- Ranger: the Ranger policy DB contains information about the various policies driving RBAC; needs to be backed up to the DR cluster

Configurations
- Periodic backup of Ambari Server and Agent configurations (Ambari folders under /etc and /var)
- Periodic backup of configuration files for each application or service under /etc
- Periodic backup of binaries (/usr/hadoop/current)
- Periodic backup of any OS-level changes made at the node level in the primary cluster

Application / user data
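To illustrate the "periodic backup of the database" idea for the DB-backed services above, a minimal sketch assuming a MySQL-backed Hive metastore; the database name, credentials and host names are placeholders:

```bash
# Dump the Hive metastore database (assumes a MySQL-backed metastore named "hive")
mysqldump -u hive -p"${HIVE_METASTORE_DB_PASSWORD}" --single-transaction hive \
  | gzip > /backups/hive_metastore_$(date +%F).sql.gz

# Copy the dump to the DR cluster, where it can be restored into the standby metastore database
scp /backups/hive_metastore_$(date +%F).sql.gz dr-metastore-host:/backups/hive/
```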
Queries on DR strategy
1. Teeing vs copying: which one is preferred over the other? I understand it is scenario dependent, but which is more adaptable and more widely used in the industry? Copying?
2. Is it necessary to have both the main and the DR cluster on the same version of HDP? If not, what should be considered when the same version is not possible?
3. Should the topology be like-for-like between the clusters in terms of component placement, including gateway nodes and ZooKeeper services?
4. How does security play out for DR? Should the nodes of both clusters be part of the same Kerberos realm, or can they be part of different realms?
5. Can the replication factor be lower on the DR cluster, or is it recommended to keep it the same as on the primary cluster?
6. Are there any specific network requirements between the clusters in terms of latency, bandwidth, etc.?
7. Is there a need to run the balancer on the DR cluster periodically?
8. How does encryption play out between the primary and DR clusters? If encryption at rest is enabled on the primary, how is it handled on the DR cluster? What are the implications of wire encryption while transferring data between the clusters?
9. When HDFS snapshots are enabled on the primary cluster, how does that work when data is synced to the DR cluster? Can snapshots be exported to another cluster? I understand this is possible for HBase snapshots, but is it allowed in the HDFS case? For example, if a file is deleted on the primary cluster but still present in a snapshot, will it be synced to the snapshot directory on the DR cluster?
10. For services backed by databases (Hive, Oozie, Ambari), instead of backing them up periodically from the primary cluster to the DR cluster, is it recommended to set up an HA master directly in the DR cluster?
11. For configurations and application data, instead of backing them up at regular intervals, is there a way to keep them in sync between the primary and DR clusters?
12. What extra / different functionality do third-party solutions like WANDisco provide in comparison to Falcon? I am trying to understand the "active-active" working of WANDisco and why it is not possible with Falcon.
13. What is the recommendation for keeping gateway node services like Knox and client libraries in sync between the clusters?
14. What is the recommendation for keeping application data, for example Spark / Sqoop job level information?
Apologies for the lengthy post, but I want to cover all the areas around DR, hence posting it as a single question. Thanks, Vijay
Labels:
- Apache Falcon
- Apache Hadoop