Created on 07-05-2016 07:26 PM
A disaster recovery plan, or business process contingency plan, is a set of well-defined processes or procedures that need to be executed so that the effects of a disaster are minimized and the organization can either maintain or quickly resume mission-critical operations.
Disasters usually come in several forms, and recovery needs to be planned for each of them accordingly.
Disclaimer: 1. This article is solely my personal take on disaster recovery in a Hadoop cluster.
2. Disaster recovery is a specialized subject in itself. Do not implement anything in production based on this article until you have a good understanding of what you are implementing.
This is a good time to introduce RTO/RPO.
RTO/RPO Drill Down
RTO, or Recovery Time Objective, is the target time you set for the recovery of your IT and business activities after a disaster has struck. The goal here is to calculate how quickly you need to recover, which can then dictate the type of preparations you need to implement and the overall budget you should assign to business continuity.
RPO, or Recovery Point Objective, is focused on data and your company’s loss tolerance in relation to your data. RPO is determined by looking at the time between data backups and the amount of data that could be lost in between backups.
The major difference between these two metrics is their purpose. The RTO is usually large scale, and looks at your whole business and systems involved. RPO focuses just on data and your company’s overall resilience to the loss of it.
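As a rough illustration of the two metrics, here is a minimal Python sketch. It assumes the worst-case data loss equals the full interval between backups (a disaster striking just before the next backup) and that you already have an estimate of how long a restore takes. The function names and sample values are hypothetical and not tied to any Hadoop tool.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Assumption: the disaster hits just before the next backup,
    so everything written since the previous backup is lost."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """RPO check: is the worst-case data loss within the tolerated loss?"""
    return worst_case_data_loss(backup_interval) <= rpo


def meets_rto(estimated_recovery_time: timedelta, rto: timedelta) -> bool:
    """RTO check: can we be back in operation within the target time?"""
    return estimated_recovery_time <= rto


# Hypothetical example: backups every 6 hours, restore takes about 3 hours.
print(meets_rpo(timedelta(hours=6), rpo=timedelta(hours=24)))  # loss tolerance of 1 day
print(meets_rto(timedelta(hours=3), rto=timedelta(hours=1)))   # must be back within 1 hour
```

In this made-up case the backup schedule satisfies the RPO, but the restore is too slow for the RTO, which shows why the two objectives have to be evaluated separately.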
Qs: What is your RTO/RPO?
Ans: For a large, complex production system, this answer takes some time to figure out and will be defined progressively. Ideally, there should also be multiple values for this answer.
What are you talking about?
Example: Band 1 = 1 hour RTO, Band 2 = 1 day RTO, Band 3 = 1 week RTO, Band 4 = 1 month RTO, Band 5 = not required in the event of a disaster. You would be surprised how much data can wait in the event of a SEVERE crash.
For instance, datasets that are used only to produce a report distributed once per month should never require a 1-hour RTO. Even if they do, it will only be on the last day of the month. For the rest of the month, which is 29/30 ≈ 97% of the time, they should require at most a 1-day RTO, even with maximum availability requirements.
So the recommendation is to drill down into your datasets and categorize them by RTO/RPO objectives. You will eventually arrive at a solution/architecture that is more adaptive and more available without increasing your budget. This is more of a journey than something you get 100% right the first time.
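The banding idea can be sketched in a few lines of Python. This is only an illustration: the dataset paths and their band assignments are hypothetical, "1 month" is approximated as 30 days, and the rule for the monthly report (Band 1 on the last day of the month, Band 2 otherwise) is one possible reading of the example above.

```python
import calendar
from datetime import date, timedelta

# RTO bands from the example; None means "not required after a disaster".
RTO_BANDS = {
    1: timedelta(hours=1),
    2: timedelta(days=1),
    3: timedelta(weeks=1),
    4: timedelta(days=30),  # assumption: "1 month" approximated as 30 days
    5: None,
}

# Hypothetical datasets and band assignments, for illustration only.
DATASET_BANDS = {
    "/data/payments/transactions": 1,
    "/data/reports/monthly_sales": 2,
    "/data/logs/application": 3,
    "/data/archive/2014": 4,
    "/tmp/scratch": 5,
}


def recovery_order(dataset_bands):
    """Return the datasets that must be recovered, most urgent band first."""
    needed = [
        (band, name)
        for name, band in dataset_bands.items()
        if RTO_BANDS[band] is not None
    ]
    return [name for band, name in sorted(needed)]


def monthly_report_band(today: date) -> int:
    """Band 1 only on the last day of the month, Band 2 the rest of the time."""
    last_day = calendar.monthrange(today.year, today.month)[1]
    return 1 if today.day == last_day else 2


for path in recovery_order(DATASET_BANDS):
    print(path, RTO_BANDS[DATASET_BANDS[path]])
```

Keeping the band table separate from the dataset assignments makes it easy to revise either one as the data owners refine their requirements over time.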
Qs: Who will decide the RTO/RPO of the wildly varying sets of data in my data lake?
Ans: Ideally, the data/business line owners should be the people making this decision.
For log/troubleshooting/configuration data, the admins and data engineers should make the decision, taking feedback from the data/business line owners.
At this point we have not yet introduced any tools or low-level strategy for disaster recovery and backup.
More to come in part 2 of this series...
A very special note of thanks to @bpreachuk, who pretty much penned the RTO/RPO explanation. It was written so well that I almost copied it :)
I also want to thank @Ravi Mutyala, from whom I have learnt (and am still learning :)) a lot in this subject area.