In this article, I will explain what a disaster recovery plan is, who is responsible for it, and why it’s a crucial part of any SaaS offering.
Kontent.ai is a multi-tenant solution, so a single resource failure can cause an outage for many customers. We are well aware of these consequences, and they were the primary motivation for developing a solid disaster recovery plan that keeps both our business and yours running.
The disaster recovery plan
To create an effective disaster recovery plan, you first need to establish a team and every member’s responsibilities. At Kontent.ai, the team looks like this:
- Chief Information Security Officer (CISO) - accountable for the whole disaster recovery process
- VP of engineering - in charge of development processes
- 3rd level support - responsible for taking appropriate steps in case of a disaster event
- DevOps - responsible for supporting developers with infrastructure issues
With the right group of people, you can start thinking about:
- Inventory of assets
  - What assets do we own?
- Risk assessment
  - How important are they?
  - What threats can cause their failures?
  - What to do if it really happens?
Let’s take a closer look at each of these questions.
Inventory of assets
What assets do we own? The answer should identify the most critical data. In our case, it includes:
- Content items
- Content models
- Subscription information
- Billing information
- And other data
From that, we know which resources need to be restored in a disaster recovery event:
- Cosmos DB
- Azure SQL
- Azure Storage, etc.
It’s crucial to review the list regularly, because any implementation change may affect the disaster recovery plan. This applies strongly to Kontent.ai, as we release new functionality every two weeks.
Once the asset inventory is completed, you move on to the risk assessment. For each resource, identify potential threats and analyze them before they occur. Always ask the same question: “What’s the worst scenario that our business might have to deal with?”
Our list contains, among others, the following very common scenarios:
- User error
Everyone makes mistakes, and their impact varies wildly from a minor inconvenience to a major problem that affects multiple users.
- Resource failure
Kontent.ai relies on MS Azure infrastructure. What happens if any resource fails? Important services could be interrupted.
- A bug in the production code
No code is perfect. A bug in production code may affect a customer’s data, create inconsistencies in it, or permanently delete it.
- Natural disaster
Natural occurrences like wildfires, earthquakes, hurricanes, and so on could cause serious damage to data centers.
Once you know what assets you have and what may happen to them, the next reasonable question is: “What should be restored first?”
It all depends on the asset’s criticality. After speaking to each data owner, you need to determine how critical each resource, and the data stored on it, really is to your business. Each resource then gets its own set of disaster recovery controls, based on its RTO and RPO.
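As a rough illustration of "what should be restored first", here is a minimal sketch that sorts an inventory by criticality. The resource names come from the list above, but the `Asset` type, the `restore_order` helper, and the criticality ranks are hypothetical and chosen only for the example; they are not our actual ratings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    name: str
    criticality: int  # 1 = most critical; the ranks below are made up


def restore_order(assets):
    """Return the assets sorted so the most critical one is restored first."""
    # Ties are broken by name to keep the order deterministic.
    return sorted(assets, key=lambda a: (a.criticality, a.name))


# Hypothetical inventory; the ranks would come from talking to each data owner.
inventory = [
    Asset("Azure Storage", 3),
    Asset("Cosmos DB", 1),
    Asset("Azure SQL", 2),
]

for asset in restore_order(inventory):
    print(asset.criticality, asset.name)
```

In practice the rank would be one field among several (owner, RTO, RPO), but even a simple ordering like this makes the restore sequence explicit before anything goes wrong.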
Define RTO and RPO
There are two important metrics defining your business continuity and disaster recovery strategy:
- Recovery Time Objective (RTO)
RTO is the maximum time the business can tolerate between a disaster and the restoration of normal operations. For example, if Amazon goes down, how long can it stay offline before customers start placing their orders elsewhere? One minute, maybe?
- Recovery Point Objective (RPO)
RPO is the maximum amount of data the business can tolerate losing, usually expressed as a period of time. Simply put, what you stand to lose is the data created since the last backup. If Amazon did backups every 24 hours, a disaster could wipe out up to a full day of new orders.
In an ideal world, the RTO and RPO should both be as short as possible. In reality, you need to take your assets and prioritize the recovery according to their criticality and your budget.
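The backup example above can be turned into a small check. This is only a sketch under a simplifying assumption: backups are the sole protection, so the worst case is losing one full backup interval of data. The function names and the target values are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """If disaster strikes just before the next backup, one full interval is lost."""
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """True if losing one full backup interval of data stays within the RPO."""
    return worst_case_data_loss(backup_interval) <= rpo


# Hypothetical target, for illustration only.
one_hour_rpo = timedelta(hours=1)

print(meets_rpo(timedelta(hours=24), one_hour_rpo))    # False: daily backups are too sparse
print(meets_rpo(timedelta(minutes=15), one_hour_rpo))  # True
```

Read the other way around, the RPO you agree on with a data owner puts an upper bound on how far apart backups of that asset can be.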
Test and practice DR plan
Just as pilots, doctors, and firefighters need regular training, a good disaster recovery plan needs to be verified and tested regularly. Every year, we take the most critical assets from the asset inventory described above, combine them with the most likely threats, and see how quickly we can recover from the disaster. We simulate:
- Partial data corruption when only a small subset of customers’ data is affected
- Full data corruption when most of our customers’ data is affected
- Resource failure when, for example, Cosmos DB is unavailable
- Datacenter failure when some Azure data centers are down
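One way to summarize such a drill is to compare the measured recovery time of each scenario with the agreed RTO. The sketch below assumes a single RTO for all scenarios; the `evaluate_drill` helper and every duration in it are invented for illustration and are not our real results.

```python
from datetime import timedelta


def evaluate_drill(results, rto):
    """Return the scenarios whose measured recovery time exceeded the RTO."""
    return [name for name, took in results.items() if took > rto]


# Hypothetical measurements from one drill.
drill = {
    "partial data corruption": timedelta(minutes=30),
    "full data corruption": timedelta(hours=5),
    "resource failure": timedelta(hours=1),
    "datacenter failure": timedelta(hours=3),
}

print(evaluate_drill(drill, rto=timedelta(hours=4)))
# ['full data corruption']
```

Any scenario that shows up in the result is a signal to revisit either the recovery procedure or the target itself.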
But it’s not all about hard skills. Successful disaster recovery is also about communication, cooperation with other teams, and making the right decisions under pressure. These soft skills are an integral part of any disaster recovery plan and an important component of our employee training.
Being a SaaS vendor is a huge responsibility: we have to keep all services running and limit the impact of any disaster to an acceptable level. All the activities described here help us continuously provide a great service and protect our clients from unwanted disruptions.