Time is money!
Your boss keeps talking about RPO (Recovery Point Objective) and RTO (Recovery Time Objective). Do you just nod your head like you know what he/she is talking about? Maybe that scenario just happened and you are searching the internet for what these terms mean. If so, welcome! No one likes to think about disasters, but they happen all too often. Planning for the worst and hoping for the best will keep your data safe and your job even safer. Let’s take some time and explore what RPO and RTO mean, why these things are important, and what you need to do next to be a hero DBA!
RTO (Recovery Time Objective)
Recovery Time Objective is the amount of time within which your company expects you to have the database fully restored after a disaster. In other words, it is how much downtime is acceptable, whether from a disaster or a planned outage. Every company is different, and most express this target in terms of “nines” of availability.
For a company that measures downtime 24 hours a day, 365 days a year, the nines translate as follows:
5 9’s – 99.999% (about 5.26 minutes of acceptable downtime per year)
4 9’s – 99.99% (about 52.6 minutes per year, and much easier to achieve)
3 9’s – 99.9% (about 8.76 hours per year)
2 9’s – 99% (about 3.65 days per year)
To decide which RTO is best for your company, you need to consider your data needs. Not all companies run 24 hours a day, 365 days a year. Some only measure downtime between 8am and 6pm Monday through Friday, or only on weekends. That drastically changes what the nines translate to: a shorter measured window means a smaller downtime budget in absolute terms, but any downtime outside the window does not count against it. Another thing to think about is whether measured downtime includes maintenance or patching, when the database must be offline. If maintenance time is excluded from the calculation, meeting the higher nines is much easier.
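The nines are straightforward arithmetic: the yearly downtime budget is the number of measured hours multiplied by the unavailable fraction (0.001 for three nines). The minimal Python sketch below shows how the measured window and the treatment of maintenance change that budget. It assumes a hypothetical 8am-6pm weekday window of 2,600 hours a year and a hypothetical one-hour monthly patch window; the names `downtime_budget`, `HOURS_24X7`, and `HOURS_BUSINESS` are illustrative only.

```python
from datetime import timedelta

HOURS_24X7 = 24 * 365         # 8,760 hours a year
# Hypothetical business-hours window: 8am-6pm (10 h), Monday-Friday, 52 weeks.
HOURS_BUSINESS = 10 * 5 * 52  # 2,600 hours a year


def downtime_budget(nines: int, measured_hours_per_year: float) -> timedelta:
    """Yearly downtime allowed by an availability target of `nines` nines.

    3 nines means 99.9% available, so 10**-3 (0.1%) of the measured
    hours may be spent down.
    """
    return timedelta(hours=measured_hours_per_year * 10 ** -nines)


for n in (5, 4, 3, 2):
    print(f"{n} nines: 24x7 allows {downtime_budget(n, HOURS_24X7)}, "
          f"business hours allow {downtime_budget(n, HOURS_BUSINESS)}")

# If maintenance counts as downtime, it spends the same budget.
# Hypothetical: one 1-hour patch window per month.
patching = 12 * timedelta(hours=1)
remaining = downtime_budget(3, HOURS_24X7) - patching
print("3 nines, 24x7, after patching:", remaining)  # negative: patching alone exceeds the budget
```

With three nines on a 24/7 clock (about 8.76 hours a year), twelve hours of counted patching already exceeds the budget, which is why excluding maintenance makes the higher nines so much easier to reach.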
If your company insists on an RTO of five nines and does not exclude maintenance or patching, then you must talk with the people in charge about the RPO. It is possible to hold to roughly 5 minutes of downtime per year, but the point to which you are able to recover will definitely be restricted.
RPO (Recovery Point Objective)
Recovery Point Objective is the amount of data or work that is acceptable to lose in the event of a disaster. Ideally, companies want ZERO data or work loss. That IS achievable, but it depends on having valid backups and on the extent of the damage the database suffered at the point of disaster.
An RPO of 15 minutes means that data and work must be recoverable to a point within 15 minutes of the disaster; in other words, no more than 15 minutes of work or data may be lost. Stop right here and think about your backup plans and recovery models. Restoring a database in the simple recovery model should not take as long as restoring one in the full recovery model, because there are no transaction log backups to apply. As covered in previous blog posts, the recovery model dictates how much data you can recover. And remember: the ability to recover ANY data at all depends entirely on having valid backups.
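To sanity-check a backup plan against an RPO, a rough rule is that the worst-case loss equals the gap between the last restorable backup and the disaster. The Python sketch below is a simplification, assuming valid backups, only the simple and full recovery models (bulk-logged is left out), and no tail-log backup after the failure. The function names and the nightly/15-minute schedule are hypothetical.

```python
from datetime import timedelta
from typing import Optional


def worst_case_data_loss(recovery_model: str,
                         data_backup_interval: timedelta,
                         log_backup_interval: Optional[timedelta] = None) -> timedelta:
    """Rough upper bound on lost work if disaster strikes just before the next backup.

    Assumes every backup is valid and that no tail-log backup can be taken
    after the failure.
    """
    model = recovery_model.lower()
    if model not in ("simple", "full"):
        raise ValueError(f"recovery model not covered by this sketch: {recovery_model}")
    if model == "simple" or log_backup_interval is None:
        # Only full/differential backups to restore from: you go back to the last one.
        return data_backup_interval
    # Full recovery with log backups: roll forward to the most recent log backup.
    return min(log_backup_interval, data_backup_interval)


def meets_rpo(worst_loss: timedelta, rpo: timedelta) -> bool:
    return worst_loss <= rpo


rpo = timedelta(minutes=15)
nightly = timedelta(hours=24)  # hypothetical nightly full backup
simple_loss = worst_case_data_loss("simple", nightly)
full_loss = worst_case_data_loss("full", nightly, timedelta(minutes=15))
print(meets_rpo(simple_loss, rpo))  # False: up to a day of work could be lost
print(meets_rpo(full_loss, rpo))    # True
```

The same nightly full backup fails a 15-minute RPO on its own; it is the 15-minute log backups under the full recovery model that bring the worst case within the objective.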
Another term you might hear is “run book.” A run book is a physical or digital collection of the information needed to bring the database back up after a disaster. It can hold many items; some of the essential ones to consider including are:
- Server level info, configuration, purpose, etc.
- List of all databases and applications using them
- List of agent jobs and proper response to a failure
- Disaster Recovery process with all contacts, RPO/RTO, etc. required to bring it back (based on level of issue)
- Backup schedules
When putting together a run book, think about what someone would need if they were new to the company and the only person available to restart the database. What information would that person need? Review and update your run book regularly so it stays accurate.
Preparing for disaster
Keep in mind that if you prepare for the worst, you are less likely to be caught off guard with a manager breathing down your neck asking “WHEN WILL WE BE BACK UP AND RUNNING?!?!” Do you have any idea how long it will take to restore your database? If your answer is “no,” I suggest doing a restore to see how long it takes. Further, I suggest making a habit of running drills so that you and your team know what to do in the event of a disaster, and exactly how long it takes to get your company back up and running. With a solid backup schedule, validated backups, and your company’s expectations in mind, you will be ready to handle any data disaster that comes your way.
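Timing a drill is easiest when the restore is scripted. Here is a minimal Python sketch, assuming each restore step is wrapped in a function you supply. It times each step and compares the total against the RTO. The step names and the sleeps below are hypothetical placeholders, not real restore commands, and `run_restore_drill` is an illustrative name. A real drill should also count human time, such as finding the run book and reaching contacts, which this sketch does not measure.

```python
import time
from datetime import timedelta
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]


def run_restore_drill(steps: List[Step], rto: timedelta) -> Tuple[timedelta, bool]:
    """Run each restore step in order, time it, and compare the total with the RTO."""
    drill_start = time.perf_counter()
    for name, step in steps:
        step_start = time.perf_counter()
        step()  # placeholder for a real restore action
        print(f"{name}: {time.perf_counter() - step_start:.1f}s")
    total = timedelta(seconds=time.perf_counter() - drill_start)
    return total, total <= rto


# Hypothetical placeholder steps: swap the sleeps for calls that really restore your backups.
drill = [
    ("restore full backup", lambda: time.sleep(0.2)),
    ("restore differential backup", lambda: time.sleep(0.1)),
    ("apply log backups", lambda: time.sleep(0.1)),
    ("verify the database", lambda: time.sleep(0.1)),
]
took, within_rto = run_restore_drill(drill, rto=timedelta(minutes=5))
print(f"drill took {took}; within RTO: {within_rto}")
```

Recording the per-step times from each drill shows which part of the restore dominates, which is usually the first place to look when the total does not fit the RTO.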
*Originally posted at Procure SQL: