Created on 07-05-2016 07:26 PM - edited 09-16-2022 01:35 AM
A disaster recovery plan, or business process contingency plan, is a set of
well-defined processes or procedures that must be executed so that the effects
of a disaster are minimized and the organization can either maintain or quickly
resume mission-critical operations.
Disasters usually come in several forms, and recovery needs to be planned
accordingly:
Catastrophic failure at the data center level, requiring failover to a backup location
Needing to restore a previous copy of your data due to user error or accidental deletion
The ability to restore a point-in-time copy of your data for auditing purposes
Disclaimer: 1. This article is solely my
personal take on disaster recovery in a Hadoop cluster.
2. Disaster recovery is a specialized subject in
itself. Do not implement anything based on this article in production until
you have a good understanding of what you are implementing.
Key objectives:
Minimal or no downtime for production cluster
Ensure High Availability of HDP Services
Ensure Backup and recovery of Databases, configurations and binaries
No Data Loss
Recover from hardware failure
Recover from user error or accidental deletes
Business Continuity
Failover to DR cluster in case of Catastrophic failure or disaster
This is a good point to introduce RTO/RPO.
RTO/RPO Drill Down
RTO, or Recovery Time Objective, is the target
time you set for the recovery of your IT and business activities after a
disaster has struck. The goal here is to calculate how quickly you need to
recover, which can then dictate the type of preparations you need to implement
and the overall budget you should assign to business continuity.
RPO, or Recovery Point Objective, is focused on
data and your company’s loss tolerance in relation to your data. RPO is
determined by looking at the time between data backups and the amount of data
that could be lost in between backups.
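To make the link between backup interval and RPO concrete, here is a minimal Python sketch. It assumes backups run on a fixed interval, that the worst case is a failure just before the next backup would run, and that data arrives at a constant rate. The function names and figures are illustrative assumptions only and are not taken from any particular tool.

```python
from datetime import timedelta


def worst_case_loss_window(backup_interval: timedelta) -> timedelta:
    """Worst case: the disaster hits just before the next backup would run,
    so everything written since the previous backup is lost."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A schedule satisfies the RPO only if the worst-case loss window
    is no longer than the RPO."""
    return worst_case_loss_window(backup_interval) <= rpo


def worst_case_loss_gb(backup_interval: timedelta, ingest_gb_per_hour: float) -> float:
    """Rough upper bound on lost volume, assuming a constant ingest rate."""
    hours = backup_interval.total_seconds() / 3600
    return hours * ingest_gb_per_hour
```

For example, under these assumptions a daily backup cannot satisfy a 1-hour RPO, while an hourly backup can. The sketch ignores how long the backup itself takes to complete, which would widen the real loss window.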
The major difference between these two metrics
is their purpose. The RTO is usually large scale, and looks at your whole
business and systems involved. RPO focuses just on data and your company’s
overall resilience to the loss of it.
Qs: What is your RTO/RPO?
Ans: For a large, complex production system,
this answer takes some time to figure out and will be defined progressively.
Ideally, there should also be multiple values for this answer.
What are you talking about?
A 1-hour/1-hour RTO/RPO is very different (cost and architecture wise) from a
2-week/1-day RTO/RPO. When you choose the RTO/RPO requirements you are also
choosing the required cost & architecture.
By having well-defined RTO/RPO requirements you will avoid having an
over-engineered solution (which may be far too expensive) and will also avoid
having an under-engineered solution (which may fail precisely when you need it
most - during a Disaster event).
So ‘Band’ your data assets into different categories for RTO/RPO purposes.
Example: Band 1 = 1 hour RTO, Band 2 = 1 day
RTO, Band 3 = 1 week RTO, Band 4 = 1 month RTO, Band 5 = Not required in the
event of a disaster. You would be surprised how much data can wait
in the event of a SEVERE crash.
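The banding idea can be sketched as a simple lookup in Python. The band-to-RTO values mirror the example above, with "1 month" approximated as 30 days. The dataset paths and their band assignments are made up for illustration, and `recovery_order` is a hypothetical helper, not part of any Hadoop tooling.

```python
from datetime import timedelta
from typing import Optional

# RTO per band, mirroring the example bands in the text.
# None means the data is not required in the event of a disaster.
BAND_RTO: dict[int, Optional[timedelta]] = {
    1: timedelta(hours=1),
    2: timedelta(days=1),
    3: timedelta(weeks=1),
    4: timedelta(days=30),  # "1 month", approximated as 30 days
    5: None,
}

# Hypothetical dataset-to-band assignments; these paths are invented.
DATASET_BAND: dict[str, int] = {
    "/data/reports/monthly_summary": 4,
    "/data/payments/transactions": 1,
    "/tmp/scratch": 5,
}


def recovery_order(dataset_band: dict[str, int]) -> list[str]:
    """Return the datasets that must be restored, most urgent band first.
    Datasets in a band with no RTO (band 5) are skipped entirely."""
    needed = [
        (band, path)
        for path, band in dataset_band.items()
        if BAND_RTO[band] is not None
    ]
    return [path for band, path in sorted(needed)]
```

Calling `recovery_order(DATASET_BAND)` lists the band 1 dataset ahead of the band 4 dataset and leaves the band 5 scratch data out, which is the point of banding: only a small part of the data needs the expensive, fast recovery path.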
For instance, datasets that are used to produce
a report that is distributed once per month should never require a 1-hour
RTO. Even if they do, it will only be for the last day of the month. For the rest
of the month, which is 29/30 = 97% of the time, they should require at most a
1-day RTO, even with maximum availability requirements.
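The 29/30 figure can be checked with a few lines of Python. This sketch assumes a 30-day month and that the report is produced on the last day of the month; both function names are illustrative.

```python
from datetime import timedelta


def required_rto(day_of_month: int, days_in_month: int = 30) -> timedelta:
    """1-hour RTO only on the day the report is produced (assumed to be
    the last day of the month); 1-day RTO on every other day."""
    if day_of_month == days_in_month:
        return timedelta(hours=1)
    return timedelta(days=1)


def relaxed_share(days_in_month: int = 30) -> float:
    """Fraction of days on which the relaxed 1-day RTO is sufficient."""
    relaxed_days = sum(
        1
        for day in range(1, days_in_month + 1)
        if required_rto(day, days_in_month) == timedelta(days=1)
    )
    return relaxed_days / days_in_month


print(f"{relaxed_share():.0%}")  # prints 97%
```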
So the recommendation is to drill down into your
datasets and categorize them by RTO/RPO objectives. You will eventually arrive at
a solution/architecture that is more adaptive and more available
without increasing your budget. This will be more of a journey than
getting it 100% right the first time.
Qs: Who will decide the RTO/RPO of the wildly
varying sets of data in my data lake?
Ans: Ideally, the data/business line owners will make the decision.
For log/troubleshooting/configuration data, the admins and data
engineers should make the decision, taking feedback from the data/business
line owners.
At this point we have not yet introduced
any tools or low-level strategy for disaster recovery and backup.