Building a cloud-first headless CMS, which is a multitenant SaaS platform based on a microservices architecture that can scale as needed, is not a trivial task. A lot of planning and strategizing goes into the process of architecting, designing, and implementing the service to make sure it works for all the users, no matter where they are coming from and where they are heading. In this blog post, we are going to have a look at different parts of the Kentico Cloud architecture to show you what is under the hood of Kentico Cloud.
Building Something New
I had the privilege and luxury of being part of the original Kentico Cloud team. For me, as a Software Architect and Developer, it was always a dream to build an interesting project from scratch. The excitement of the initial "Yay, let’s do it!" moment was followed by the "But how?" phase. And that's when our journey to a cloud CMS started.
First, it's important to understand the different roles and critical components of a cloud-first headless CMS.
As the graphic above illustrates, a cloud CMS has to account for many different touchpoints and scenarios. From project management through content production and management to publishing and delivery, the system has to provide services and capabilities that support these scenarios.
We knew we needed to build a product that would do all of that in a scalable, secure, and highly available fashion. We knew we needed a product that would be easy to maintain and extensible. These were key requirements that drove the discussion about technology and architectural decisions we made along the way.
The first decision we made was to go with Microsoft Azure as our cloud platform. At Kentico, we have plenty of experience with .NET, Azure, and Microsoft technologies in general. We've always found the platform to be stable and reliable, and we have been pretty happy with it.
Initially, we decided to split the system into two main services or deployment units, if you will:
- Content Management—create and maintain content with a content-first approach.
- Content Delivery—deliver content to target presentation code.
At first, a shared content repository was very appealing.
But we soon realized that having a common storage has its disadvantages:
- The management service needs storage optimized for writing while delivery service needs storage optimized for reading and querying.
- Not all data are needed by both sides.
- Each service has different storage scalability targets.
- A shared-data structure induces tight coupling between services.
- Maintaining and updating the DB would make the delivery service unavailable.
- The delivery service has higher availability needs than the management service.
As the list of ‘cons’ grew, we began to lean towards an architecture of independent, loosely coupled services. We leveraged one of the most common practices for decoupling services: queue-based messaging.
In short, the management service publishes events representing changes of the data state in a defined contract. The delivery service receives and processes such messages and saves them in its storage.
Advantages of this architecture:
- Queues support multiple consumers (subscribers), and other possible consumers can process the same messages for other purposes (audit log, full-text indexing, machine learning, …).
- Developer teams are more independent.
- Services can be deployed independently.
- No cascade failing of services—downtime of one service does not affect the other service.
- All of the disadvantages listed previously are resolved.
We adopted this architecture hoping it would help us if things got more complicated. And surprise, surprise, they did.
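The publish/consume flow described above can be sketched in a few lines. This is a minimal in-memory stand-in, not the Azure Event Hubs SDK: the `ContentEvent` contract, the `EventLog` class, and the consumer names are all hypothetical and exist only to illustrate how several consumers can read the same stream independently.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ContentEvent:
    """Hypothetical message contract shared by publisher and consumers."""
    kind: str        # "item_upserted" or "item_deleted"
    item_id: str
    payload: dict


@dataclass
class EventLog:
    """In-memory stand-in for an append-only event stream."""
    events: List[ContentEvent] = field(default_factory=list)
    offsets: Dict[str, int] = field(default_factory=dict)

    def publish(self, event: ContentEvent) -> None:
        self.events.append(event)

    def consume(self, consumer: str, handler: Callable[[ContentEvent], None]) -> int:
        """Pass unread events to `handler`; every consumer keeps its own offset."""
        start = self.offsets.get(consumer, 0)
        for event in self.events[start:]:
            handler(event)
        self.offsets[consumer] = len(self.events)
        return len(self.events) - start


delivery_store: Dict[str, dict] = {}   # read-optimized copy owned by delivery
audit_log: List[str] = []              # a second, unrelated consumer


def apply_to_delivery(event: ContentEvent) -> None:
    if event.kind == "item_upserted":
        delivery_store[event.item_id] = event.payload
    elif event.kind == "item_deleted":
        delivery_store.pop(event.item_id, None)


def record_audit(event: ContentEvent) -> None:
    audit_log.append(f"{event.kind}:{event.item_id}")


log = EventLog()
# The management side only publishes; it knows nothing about the consumers.
log.publish(ContentEvent("item_upserted", "home", {"title": "Home"}))
log.publish(ContentEvent("item_upserted", "about", {"title": "About"}))
log.publish(ContentEvent("item_deleted", "about", {}))

log.consume("delivery", apply_to_delivery)
log.consume("audit", record_audit)
```

Because each consumer tracks its own position in the stream, adding a new one (for example full-text indexing) would not require any change on the publishing side.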
Our events queue is based on the Azure Event Hubs service offered by Microsoft Azure. We were considering the Azure Service Bus message broker that offers many advanced enterprise features, but, in the end, we decided not to go with it since our needs were slightly different and Service Bus would be an overly complex solution for our needs. Azure Event Hubs gives us great performance for a more reasonable price, which allows us to keep the price of the service lower. We really consider Event Hubs (along with Apache Kafka) as one of the basic services that should be at the core of any distributed system.
Preview vs. Live
Shortly after we happily immersed ourselves in coding, we realized the importance of preview. Users really need at least these two environments:
- Preview—bleeding edge of content state.
- Live—data for general audience.
To achieve high availability of the live environment, we had to isolate it from the preview environment, where developers could, and inevitably will, introduce bugs in their code. Such bugs could degrade the performance of the live environment.
This led to an architecture where the whole Content Delivery Service is duplicated (live and preview), with each copy isolated in its own Azure service.
Making Things Fast
The next challenge we had to solve was how to make Kentico Cloud respond fast, no matter where in the world users work with it. Although our delivery content storage is optimized for read operations, there are still complex queries that can be pretty resource intensive.
For example, the Delivery API allows you to specify a depth of referenced items (modular items) and get the whole graph of items with one call. As you can imagine, retrieving such content can be an immense workload to handle.
However, the performance of the system doesn't depend only on how fast the system retrieves the requested content. It also depends on how far that data has to travel to reach its destination. Right now, Kentico Cloud administration instances are hosted in the Microsoft Azure US West (California) data center. Now, let's say the web app displaying the content stored in Kentico Cloud runs in Sydney. Light can travel through optical fiber from US West to Sydney and back in about 130 ms, theoretically. If you add the latency caused by IP infrastructure, you get to about 165 ms, which is an average ping time to mainland Australia.
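The round-trip figure above can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below rests on two assumptions that are not from the original measurements: a one-way path of roughly 12,000 km (about the great-circle distance between California and Sydney) and a fiber refractive index of 1.5. Real cable routes are longer than the great circle, so the result is a lower bound in the same ballpark as the figure quoted above.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.5        # assumed typical value for optical fiber


def round_trip_ms(one_way_km: float,
                  refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """Theoretical round-trip time over fiber, ignoring routing and switching."""
    speed_in_fiber = SPEED_OF_LIGHT_KM_S / refractive_index
    return 2 * one_way_km / speed_in_fiber * 1000


# Assumed path length: roughly the great-circle distance, California to Sydney.
rtt = round_trip_ms(12_000)
print(f"theoretical round trip: {rtt:.0f} ms")
```

With these assumptions the propagation delay alone comes out at around 120 ms, before a single byte is processed by a server, which is why moving content closer to the consumer matters more than shaving milliseconds off a query.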
To provide Kentico Cloud customers with the highest level of performance possible, we decided to make a Content Delivery Network (CDN) part of the infrastructure.
Taking It to the Next Level with the CDN
Although we could address both issues by deploying an instance of Kentico Cloud in each region and using local caching to bump up the performance, it would take much more effort, add operational complexity, and, frankly, most likely be much less effective than leveraging a third-party CDN service.
We decided to use Fastly CDN, which provides us with many points of presence and extremely fast cache invalidation. Another benefit of using the CDN is that it allows us to serve stale content, which comes in very handy when the origin server is down or slow. In those scenarios, the CDN returns a slightly older version of the content to ensure uninterrupted operation of your digital properties connected to Kentico Cloud.
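The idea of serving stale content when the origin is unavailable can be illustrated with a small cache. This is a sketch of the general technique only, not Fastly's implementation or configuration: the `StaleOnErrorCache` class, the fake origin, and the injected clock are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve fresh entries from cache; fall back to stale ones if the origin fails."""

    def __init__(self, fetch_origin: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_origin
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}  # key -> (stored_at, value)

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]              # still fresh, no origin call needed
        try:
            value = self._fetch(key)     # expired or missing: ask the origin
        except Exception:
            if entry is not None:
                return entry[1]          # origin failed: serve the stale copy
            raise                        # nothing cached, so the error surfaces
        self._entries[key] = (now, value)
        return value


# Fake origin and clock so the behavior can be demonstrated deterministically.
origin_state = {"up": True}
fake_now = [0.0]


def fetch_from_origin(key: str) -> str:
    if not origin_state["up"]:
        raise ConnectionError("origin unavailable")
    return f"content of {key}"


cache = StaleOnErrorCache(fetch_from_origin, ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get("home")       # fetched from the origin and cached

fake_now[0] = 120.0             # the entry is now older than its TTL
origin_state["up"] = False      # and the origin has gone down
fallback = cache.get("home")    # the stale copy is served instead of an error
```

The trade-off is explicit: readers may briefly see slightly older content, but they do not see an outage.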
Performance Testing Results
The above-mentioned architecture allowed us to assume that Kentico Cloud is a highly scalable and robust solution with sufficient performance for our current customer base and the near future (at least the next year).
To validate the assumption, we stress- and load-tested the whole environment using Microsoft Azure Performance testing tools. Comprehensive tests simulating user behavior included content creation, content update, and content delivery under the worst-case scenario.
- Mean response time was 100 ms.
- With the current DocumentDB performance setting (1000 RU/s), we have more than sufficient performance for our expected customer base in the near future.
- DocumentDB is the bottleneck; luckily, it is easy to scale up by simply buying more RU/s.
- The real-world scenario (not the worst case) had a mean response time of 40 ms.
We are very happy with these results. Not only do they allow us to sleep better, but they also let us focus on delivering valuable functionality to our customers as opposed to fighting performance issues.
Monitoring and Troubleshooting
Hosting applications in a public cloud requires a systematic and pedantic approach to monitoring and troubleshooting. Because of the limited access to hardware and operating systems that a cloud service such as Kentico Cloud runs on, proper instrumentation has to be in place. Although Azure, by default, does it well for most of its services, one still needs instrumentation on the application layer.
Most of our needs have been addressed by Azure Application Insights, which I would recommend almost everybody consider. We use Application Insights in combination with NLog. NLog allows us to write logs to files and is configurable at runtime, so engineers can temporarily raise the log verbosity for particular code areas while troubleshooting. Such logs end up in Application Insights, along with standard Application Insights data (requests, exceptions, dependencies), for easy lookup and analysis. We also configured alerts and deployed other measurement tools (for example, Pingdom) to enhance our proactive and reactive monitoring and issue escalation.
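Raising log verbosity for one code area at runtime looks roughly like the sketch below. It uses Python's standard `logging` module as an analogue, not NLog itself, and the logger names (`cms`, `cms.delivery`, `cms.management`) and the in-memory handler are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted records in memory so the effect is easy to inspect."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(f"{record.name}:{record.levelname}:{record.getMessage()}")


handler = ListHandler()
app_logger = logging.getLogger("cms")        # hypothetical application root logger
app_logger.setLevel(logging.WARNING)         # default: warnings and above only
app_logger.addHandler(handler)
app_logger.propagate = False                 # keep the demo isolated from the root logger

delivery = logging.getLogger("cms.delivery")
management = logging.getLogger("cms.management")

delivery.debug("cache miss for item 'home'")     # dropped: effective level is WARNING

delivery.setLevel(logging.DEBUG)                 # turn up one area while troubleshooting
delivery.debug("cache miss for item 'home'")     # now recorded
management.debug("saving draft")                 # other areas stay quiet

delivery.setLevel(logging.NOTSET)                # revert to the inherited level
delivery.debug("cache miss for item 'about'")    # dropped again
```

Only the area under investigation becomes noisy, which keeps log volume (and cost) under control during an incident.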
Your Data Is Safe
Azure SQL Server stores our relational data (users, projects, roles, subscriptions). It is backed up continuously (transaction log), with the ability to restore to any point in time in the last 14 days.
Azure Table storage serves as the content management storage, with a daily backup and 14-day retention. Although we have experience with MongoDB, we came to the conclusion that Azure Table is cheaper and sufficient for our rather basic storage needs.
Assets (files) have their own storage, also with a daily backup and 14-day retention. Content management uses private storage, while the Content Delivery service uses public storage (so files can be cached by the CDN).
We use DocumentDB as the Content Delivery storage. There is no need to back up our DocumentDB storage. In an emergency, all of its data could be recreated from the management content store. Storing content items as JSON documents has its merits, and with DocumentDB's functionality and scalability, we have our bases covered here.
DocumentDB has just recently become part of Azure Cosmos DB.
Building the Experience
We started developing our back end with .NET 4.5 MVC Web API 2. We are now looking at transitioning to .NET Core and will migrate once its tooling and adoption become sufficient for our needs. We are running background tasks with WebJobs.
The front end of Kentico Cloud is built as a Single Page Application using React.js. The SPA provides the flexibility we need and gives us the ability to create a UI delivering the best possible user experience. We originally started building the Kentico Cloud administration UI using React.js plus Flux, but as the product grew, we decided to refactor using React.js plus Redux instead. This has been an interesting experience, which deserves a blog post of its own.
Also, our administration front end is integrated with the Intercom customer messaging platform, which allows end users to communicate with our Customer Success and Development teams. Moreover, tracking how users interact with the application allows us to better understand which areas of the product provide the highest value to customers and who the potential early adopters for validating new features are. We take customer success seriously, and Intercom is helping us with it a lot.
As you can see, building a distributed system is not an easy task and, depending on the requirements, the way to go about it can differ significantly. There is no silver bullet for it. However, applying best practices like code instrumentation and decoupling services into independent deployment units (or microservices) can set you up for a great start.