How Kontent.ai handles disaster recovery

Things can go wrong anywhere—even in headless CMSs. How can you be sure that your website will be always-on if you use a SaaS?

Written by Matej Zachar

Published on Feb 10, 2021

In this article, I will explain what a disaster recovery plan is, who is responsible for it, and why it’s a crucial part of any SaaS service.

Kontent.ai is a multi-tenant solution, so a single resource failure can cause an outage for many customers. We are well aware of these consequences, and they were the primary motivation for developing a solid disaster recovery plan that keeps both our business and yours running.

The disaster recovery plan

To create an effective disaster recovery plan, you first need to establish a team and define every member’s responsibilities. At Kontent.ai, the team looks like this:

Chief Information Security Officer (CISO) - accountable for the whole disaster recovery process
VP of engineering - in charge of development processes
3rd level support - responsible for taking appropriate steps in case of a disaster event
DevOps - responsible for supporting developers with infrastructure issues

With the right group of people, you can start thinking about:

Inventory of assets
- What assets do we own?
Risk assessment
- How important are they?
- What threats can cause their failures?
- What to do if it really happens?

Let’s take a closer look at each of these questions.

Inventory of assets

What assets do we own? The answer should identify the most critical data. In our case, it includes:

Content items
Content models
Subscription information
Billing information
And other data

Therefore, we know which resources need to be restored in a disaster recovery event:

Cosmos DB
Azure SQL
Azure Storage, etc.

It’s crucial to review the list regularly, because any implementation change may affect the disaster recovery plan. This applies strongly to Kontent.ai, as we release new functionality every two weeks.

Risk assessment

Once the asset inventory is completed, you move on to the risk assessment. For each resource, identify potential threats and analyze them before they occur. Always ask the same question: “What’s the worst scenario that our business might have to deal with?”

Our list contains, among others, the following very common scenarios:

User error
Everyone makes mistakes, and their impact varies wildly from a minor inconvenience to a major problem that affects multiple users.
Resource failure
Kontent.ai relies on Microsoft Azure infrastructure. What happens if a resource fails? Important services could be interrupted.
A bug in the production code
No code is perfect. A bug in the production code may affect a customer’s data, create inconsistencies in it, or permanently delete it.
Natural disaster
Natural occurrences like wildfires, earthquakes, hurricanes, and so on could cause serious damage to data centers.

Once you know what assets you have and what may happen to them, the next reasonable question is: “What should be restored first?”

It all depends on the asset’s criticality. After speaking to each data owner, you need to determine how critical each resource and the data stored on it really are to your business. A different set of disaster recovery controls is applied to each resource based on its RTO and RPO, the two metrics defined in the next section.

Define RTO and RPO

There are two important metrics defining your business continuity and disaster recovery strategy:

Recovery Time Objective (RTO)
RTO is how long the business can survive after a disaster before operations must be restored to normal. For example, if Amazon goes down, how long can it stay offline before customers start placing orders elsewhere? One minute, maybe?
Recovery Point Objective (RPO)
RPO is the maximum tolerable amount of data loss. Simply put, it’s the amount of new data created since the last backup. If Amazon made backups every 24 hours, a disaster could wipe out up to a full day of new orders.
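
To make the RPO definition concrete, here is a minimal Python sketch. The function names and all timestamps are hypothetical and chosen only for illustration; the idea is simply that the data lost in a restore is everything written since the last backup, so the worst case equals the backup interval.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Return how much recent data is lost when restoring from last_backup."""
    if disaster < last_backup:
        raise ValueError("disaster time precedes the last backup")
    return disaster - last_backup


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """In the worst case a disaster strikes just before the next backup,
    so the largest possible loss equals the backup interval."""
    return backup_interval <= rpo


# Hypothetical numbers for illustration only.
last_backup = datetime(2021, 2, 9, 0, 0)
disaster = datetime(2021, 2, 9, 23, 30)

lost = data_loss_window(last_backup, disaster)
print(lost)  # 23:30:00

# Daily backups cannot satisfy a one-hour RPO.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(hours=1)))  # False
```

With daily backups, the only way to meet a tighter RPO is to back up more often or to capture changes continuously.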

In an ideal world, the RTO and RPO should both be as short as possible. In reality, you need to take your assets and prioritize the recovery according to their criticality and your budget.
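
One way to picture this prioritization is as a simple sort. The sketch below is an assumption, not a description of Kontent.ai's actual tooling: the `Asset` class, the criticality scale, and the per-asset RTO values are all made up, and only the resource names come from the asset inventory above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Asset:
    name: str
    criticality: int  # hypothetical scale: 1 = most critical
    rto: timedelta    # hypothetical per-asset recovery target


def recovery_order(assets):
    """Sort assets so the most critical ones with the tightest RTO come first."""
    return sorted(assets, key=lambda a: (a.criticality, a.rto))


# Resource names come from the article; scores and targets are invented.
inventory = [
    Asset("Azure Storage", criticality=2, rto=timedelta(hours=12)),
    Asset("Cosmos DB", criticality=1, rto=timedelta(hours=4)),
    Asset("Azure SQL", criticality=1, rto=timedelta(hours=8)),
]

for asset in recovery_order(inventory):
    print(asset.name)
# Cosmos DB
# Azure SQL
# Azure Storage
```

In practice, the criticality score would come from the conversations with data owners, and budget would decide how tight each target can realistically be.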

When it comes to Kontent.ai, the RPO is zero minutes. This is achieved through incremental backups, which continuously capture and store every change made since the last backup. In the event of a disaster, data can be recovered up to the last recorded change, which keeps data loss and disruption to a minimum. In addition, thanks to regular disaster recovery drills, Kontent.ai has reduced its RTO to 12 hours. The RTO represents the maximum downtime; most incidents are resolved much faster and are documented on our status page. This disaster recovery strategy reflects Kontent.ai’s commitment to data protection and operational resilience.

Test and practice DR plan

Just as pilots, doctors, and firefighters need regular training, a good disaster recovery plan needs to be verified and tested regularly. Every year, we take the most critical assets from the asset inventory described above and the most likely threats, and see how quickly we can recover from the disaster. We simulate:

Partial data corruption, when only a small subset of customers’ data is affected
Full data corruption, when most of our customers’ data is affected
Resource failure, when, for example, Cosmos DB is unavailable
Data center failure, when some Azure data centers are down
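
The outcome of such a drill can be checked against the recovery target with a few lines of code. The following Python sketch uses the 12-hour RTO mentioned earlier; the `drill_results` timings and the `scenarios_over_rto` helper are hypothetical and are not real measurements.

```python
from datetime import timedelta

RTO = timedelta(hours=12)  # the RTO stated earlier in the article

# Hypothetical drill timings for illustration; not real measurements.
drill_results = {
    "partial data corruption": timedelta(hours=2),
    "full data corruption": timedelta(hours=9),
    "resource failure": timedelta(hours=5),
    "data center failure": timedelta(hours=13),
}


def scenarios_over_rto(results, rto):
    """Return the scenarios whose measured recovery time exceeded the RTO."""
    return sorted(name for name, took in results.items() if took > rto)


print(scenarios_over_rto(drill_results, RTO))  # ['data center failure']
```

Any scenario that shows up in this list points to a part of the plan that needs more work before the next drill.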

But it’s not all about hard skills. Successful disaster recovery is also about communication, cooperation with other teams, and making the right decisions under pressure. These soft skills are an integral part of any disaster recovery plan and an important component of our employee training.

Being a SaaS vendor is a huge responsibility: keeping all services running and limiting the impact of a disaster to an acceptable level. All the activities described here help us continuously provide a great service and protect all our clients from unwanted disruptions.
