Searching and indexing modular content with Kontent and Algolia
Search: you never imagine there's much to implement until you have to do it yourself. What are the known challenges of search functionality, and how can you resolve them when using a headless CMS?
Rosta Striz · Published on Jul 27, 2021
Every content-focused website or application needs to offer search functionality. With the introduction of headless content management systems, implementing search introduces new challenges you have to be prepared for. In this article, we’ll explore what these challenges are and how they can be addressed.
A headless CMS allows content to be published in many places. The content can also be modularized, so that a content item does not need to be duplicated in each place it appears. Our goal is to provide a robust search that can:
- Find and show where the content appears
- Be up-to-date everywhere
We’ll be using Algolia as an external search index that will take care of the actual searching and indexing of your content. The search index is also going to be autonomously synchronized with any content changes editors make in Kontent, so the visitor is always presented with up-to-date content.
Using an external service “just” for searching may seem like an unnecessary extra step. I’ll explain why it should actually be the preferred architecture for an enterprise-level search solution that needs to handle a complex content structure, support any number of languages, and deliver great performance and stability.
Indexing content stored in a headless CMS
When working with headless content and its structure, the main difference lies in the way you index your content in the search index, that is, in the transformation between your content model in Kontent and the structure you decide to create inside the search index. This transformation is not exclusive to search: you can run into a very similar challenge when implementing content recommendations or even personalization (basically, whenever you want to synchronize your content between your content inventory and a third-party data storage service).
The content structure of larger projects tends to be complicated. The content is often heavily modular, i.e., hierarchical, with different content pieces being reused across the inventory. It is also multilingual or regionally specific, which means you’ll be working with a wide array of different language variants, as well as collections. Moreover, regular content production has become an industry standard, so some parts of the content will be highly dynamic, and editors will most likely be producing new content frequently.
Following all of the recommended content modeling best practices, you will soon discover that the content model you’ve created in your headless CMS is not very well adapted for the search scenario. Don’t worry, it really shouldn’t be. A typical content model is usually more sophisticated than conventional search solutions are designed to handle. The following paragraphs will help you understand how to approach the indexing challenge and what to watch out for when implementing the search functionality for a complex content model.
What is modeled vs. what is presented?
Traditional CMSs are built to work with pages. Pages have their own structure and are displayed at a specific URL in a predefined way. Even if the pages are modular and composed of different things, you still have the presentation layer (i.e., the “head”) coupled with the rest of the system, so it’s pretty easy to establish what is rendered where and how. In a headless CMS, the presentation layer is decoupled from the content structure, so the CMS itself has no way of telling what content is displayed where, despite having information about the complete content structure. This lack of presentation information is one of the basic challenges of indexing your content, because you don’t really care in what part of your content model the information is actually found. What you really need is to find all the places where the information is displayed on your website or in your application, so that you can navigate your visitors to the content they are trying to find.
In reality, there is rarely a complete disconnect between the model and the presentation layer. Different methods are often used to make development and content editing easier, so a lot of projects will still contain content types specifically referencing the presentation or implementation structure, like pages. Alternatively, some types will contain special elements that are used for navigation, like URL slugs or special taxonomies. We’ll be leveraging this structural information in our indexing efforts.
The “simple” approach
There’s a simple (naive) way to implement search in headless by mirroring the content structure of the headless CMS. Some of our competitors present this as an out-of-the-box feature with their provided search endpoints. In some simple scenarios, this approach can actually work fine and produce a working search. The content structure in those scenarios has to be flat (e.g., searching simple blog posts) and, going back to the previous section, has to be easily transformed into a rendered page. Let’s quickly go through a simple example.
Let’s say we are working with an Article content type, visualized in the image above. We only want our visitors to be able to search through a list of such articles. Although the content model itself (on the left) might also contain some linked content properties (e.g., details about the author), we can very well decide we don’t want to include such content in the search functionality. This will work great, as whenever a visitor searches for a phrase, the search index can list articles based on the received query.
Now, since you always want to have your search index up to date, we have to talk about content synchronization as well. The simple approach described above works well with the usual change notification systems (like webhooks). Whenever you are informed a new article was published by the CMS, you can just simply add it to the search index basically as-is, or with a minimal transformation effort. The same goes for unpublishing or making changes to existing content—it’s always clear what has to change and how to achieve that.
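The flat case can be sketched in a few lines. This is only an illustration under stated assumptions: the index is a plain in-memory dictionary rather than a real Algolia index, and the notification shape, the operation names, and the `handle_notification` helper are hypothetical, not the actual Kontent webhook payload.

```python
from typing import Any

# Hypothetical in-memory stand-in for a search index, keyed by record ID.
index: dict = {}


def record_id(codename: str, language: str) -> str:
    """One record per article per language variant."""
    return f"{codename}_{language}"


def handle_notification(operation: str, article: dict) -> None:
    """Apply a single change notification for a flat article to the index."""
    key = record_id(article["codename"], article["language"])
    if operation in ("publish", "update"):
        # Flat content needs almost no transformation: copy the searchable fields.
        index[key] = {
            "objectID": key,
            "title": article["title"],
            "body": article["body"],
            "slug": article["slug"],
        }
    elif operation == "unpublish":
        # Removing an article is a single delete; nothing else references it.
        index.pop(key, None)
    else:
        raise ValueError(f"unknown operation: {operation}")


article: dict[str, Any] = {
    "codename": "on_roasts",
    "language": "en-US",
    "title": "On Roasts",
    "body": "Roasting coffee at home",
    "slug": "on-roasts",
}
handle_notification("publish", article)
handle_notification("update", {**article, "title": "On Roasts, revisited"})
```

Every notification maps to exactly one record, which is why this approach stays simple as long as the content is flat.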
Search can be straightforward for simple, flat content. But when content gets more sophisticated, the search approach needs to become more sophisticated as well.
What about modular content?
Once we want to include modular content, i.e., linking multiple items together to compose the final content displayed on the page, the simple approach described above is no longer sufficient. Let’s assume you simply transfer all of your content items as separate entities into your search index. You’ll still be able to search all of the items, but the results will consist of whichever individual content items matched the query. Since we’re working with modular content now, you might not know where exactly those items are displayed on your website, so you might not be able to navigate your visitors to the found content. A search result like that is usually pretty useless.
To demonstrate this issue, let’s take a look at another, more complicated example.
The Feature content type is just like a typical landing page and contains links to various other content types, like case study and partner. If we mirrored the structure in the search index and, as a result of a search query, got the partner content item, we wouldn’t be able to say where this content is actually displayed on the website since it can be rendered practically anywhere.
The simplest solution is to index only the items you can navigate to (as the simple approach did) and embed all of the linked modular content into them. This will work for the searching itself. However, another challenge arises when you try to keep a structure like that up to date and synchronized with your CMS.
Synchronizing modular content
With modular content, the content can be reused in multiple places. How can we know where this content is being used?
The basic notification principles apply to modular content as well. Whenever there’s a change, you’ll get notified which modular content changed and how. However, you won’t receive any information about where a specific content item is being linked from. In order to be able to efficiently synchronize your search index, you have to introduce a structure that will allow you to find all the places the changed modular content resides in and partially update them all at once. That way, your index stays up to date all the time.
Some projects leverage a content inventory map that tells you from where every item is linked. However, creating, maintaining, and updating such a map for a really large content inventory is a complex task.
Our proposed solution is to model the search index in such a way that it can act as the aforementioned map. Whenever a change notification comes, the indexed structure itself will decide which parts of the index should be updated. This will resolve all of the presented issues, as well as offload the heavy lifting onto the search index.
The suggested search index structure
Just to sum up, we want to model our search index to satisfy these key requirements:
- Every indexed item has to embed all of the content we want the visitor to be able to search for in the context of the given item (e.g., all content from components and linked items).
- We need to be able to navigate to every indexed item (e.g., map the item to a URL).
- We need to be able to partially update any piece of content, even if it’s embedded in multiple different items, without the information about its actual placement.
This is how such a search index can be modeled:
You can still decide to omit certain parts of the content model (for example, we are not using the linked case study here).
The most important thing is to keep the meta-information about each included embedded item so when there’s a request for an update, we are able to search for all of the occurrences of the given content item.
You can see that when there’s a request for a modular content update, the search index itself can handle it and update all copies of the embedded content item as Algolia allows you to search for exact facet values.
The next step in the transformation process is flattening the content hierarchy so the indexed structure is uniformly searchable. Simply put—we don’t really care at what depth the content lives in our content model, the only thing that matters is that it is presented somewhere on the top-level page.
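As a rough sketch of that flattening step, the snippet below walks the links from a navigable top-level item and collects every reachable piece into one flat list that keeps each piece's codename, language, and type. The sample items, the `testimonial` type, and the `flatten` and `build_record` helpers are hypothetical. The property names `slug`, `content`, and `contents` follow the index item described in this article, while `objectID` is Algolia's record identifier.

```python
from typing import Optional

# Hypothetical, simplified content items keyed by codename.
# "linked" holds the codenames of linked items; only the top-level item has a slug.
ITEMS = {
    "landing_page": {
        "codename": "landing_page", "language": "en-US", "type": "feature",
        "slug": "features/search", "text": "Search that scales",
        "linked": ["acme_partner"],
    },
    "acme_partner": {
        "codename": "acme_partner", "language": "en-US", "type": "partner",
        "text": "Acme builds integrations", "linked": ["acme_testimonial"],
    },
    "acme_testimonial": {
        "codename": "acme_testimonial", "language": "en-US", "type": "testimonial",
        "text": "It just works", "linked": [],
    },
}


def flatten(codename: str, items: dict, seen: Optional[set] = None) -> list:
    """Depth-first walk returning the item and everything linked from it as a flat list."""
    seen = set() if seen is None else seen
    if codename in seen:  # guard against circular links
        return []
    seen.add(codename)
    item = items[codename]
    # Keep the metadata of every embedded piece so it can be located later.
    entries = [{
        "codename": item["codename"],
        "language": item["language"],
        "type": item["type"],
        "contents": item["text"],
    }]
    for child in item["linked"]:
        entries.extend(flatten(child, items, seen))
    return entries


def build_record(codename: str, items: dict) -> dict:
    """Build a two-level index record for a navigable top-level item."""
    top = items[codename]
    return {
        "objectID": f"{top['codename']}_{top['language']}",
        "codename": top["codename"],
        "language": top["language"],
        "type": top["type"],
        "slug": top["slug"],
        "content": flatten(codename, items),
    }


record = build_record("landing_page", ITEMS)
```

However deep the testimonial sits in the content model, it ends up on the same level as the partner inside `record["content"]`, which is the point of flattening.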
Here’s a screenshot of an actual search index item that is modeled according to the described principle:
The original content type of this particular content item is actually very hierarchical. Here, we are using only a two-level hierarchy to model it in the search index. On the top level, we have the item that contains a URL (through the slug property) and all of the standard content item metadata, so we are able to work with specific languages or collections. Additionally, under the content property, you can see the actual contents of the item, as well as all of the linked items and their content. The content itself (in our case a concatenation of rich text and text properties of each item) is put into the contents property, which is also the one property that is actually being searched by our visitors. On top of that, we still have all the other metadata we can use to locate a concrete content piece (through matching
Searching and synchronization
Now, when you are using this structure for your full-text search, you are just searching the content.contents properties of each item: only the text contents of the top-level content item together with all of its embedded modular content pieces.
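To illustrate what searching only `content.contents` means for the results, here is a deliberately naive sketch. It uses a case-insensitive substring match over an in-memory list, which is an assumption made for illustration and not how Algolia ranks or matches. The sample records and the `search` helper are hypothetical.

```python
# Hypothetical records: a navigable top-level item with a flat "content" list.
RECORDS = [
    {
        "objectID": "landing_page_en-US", "slug": "features/search", "language": "en-US",
        "content": [
            {"codename": "landing_page", "contents": "Search that scales"},
            {"codename": "acme_partner", "contents": "Acme builds integrations"},
        ],
    },
    {
        "objectID": "pricing_en-US", "slug": "pricing", "language": "en-US",
        "content": [
            {"codename": "pricing", "contents": "Plans for every team"},
        ],
    },
]


def search(query: str, records: list, language: str) -> list:
    """Return slugs of top-level records whose content.contents contain the query."""
    needle = query.lower()
    return [
        record["slug"]
        for record in records
        if record["language"] == language
        and any(needle in piece["contents"].lower() for piece in record["content"])
    ]


hits = search("acme", RECORDS, "en-US")
```

The query matches text that lives in the embedded partner item, yet the result is the slug of the page that displays it, so the visitor always lands on something navigable.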
This approach also addresses the issue we had with modular content and its synchronization. In a model like this, whenever a new change request comes in, finding the top-level items that contain the changed content item becomes very easy. You are just searching for items that have the codename and language of the updated item in the content array, and you’ll immediately get a list of parent items you need to change. Algolia also offers partial updates, so it’s possible to change just the part that has changed, and your index will always be up to date.
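The update flow can be sketched as follows, again against an in-memory list instead of a real index. With Algolia, the lookup would be a filter on the embedded codename and language and the write would be a partial update. The sample records and the `find_parents` and `apply_update` helpers are hypothetical.

```python
# Hypothetical records; the same partner item is embedded in two pages.
RECORDS = [
    {"objectID": "landing_page_en-US", "content": [
        {"codename": "landing_page", "language": "en-US", "contents": "Search that scales"},
        {"codename": "acme_partner", "language": "en-US", "contents": "Acme builds integrations"},
    ]},
    {"objectID": "partners_en-US", "content": [
        {"codename": "partners", "language": "en-US", "contents": "Our partners"},
        {"codename": "acme_partner", "language": "en-US", "contents": "Acme builds integrations"},
    ]},
    {"objectID": "pricing_en-US", "content": [
        {"codename": "pricing", "language": "en-US", "contents": "Plans for every team"},
    ]},
]


def is_match(piece: dict, codename: str, language: str) -> bool:
    """True if the embedded piece is the given item in the given language."""
    return piece["codename"] == codename and piece["language"] == language


def find_parents(records: list, codename: str, language: str) -> list:
    """Find every top-level record that embeds the given item."""
    return [
        record for record in records
        if any(is_match(piece, codename, language) for piece in record["content"])
    ]


def apply_update(records: list, codename: str, language: str, new_contents: str) -> list:
    """Rewrite every embedded copy of the item; return the IDs of the touched records."""
    touched = []
    for record in find_parents(records, codename, language):
        for piece in record["content"]:
            if is_match(piece, codename, language):
                piece["contents"] = new_contents
        touched.append(record["objectID"])
    return touched


touched = apply_update(RECORDS, "acme_partner", "en-US", "Acme builds search integrations")
```

No separate inventory map is consulted here: the index records themselves say which pages embed the changed item.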
Let’s try it out
We actually built an example integration between Kontent and Algolia—you can check out the repository on GitHub. The example contains:
- simple documentation
- serverless function for the initial index synchronization
- serverless function for partial content updates
- a custom element that lets you try out the newly built search directly from the Kontent UI
To test it out, you’ll only need an Algolia account, which you can get for free.
The integration lets you build and immediately test out your search, which will support multiple languages and modular content out of the box. Give it a try and let us know on our Discord if you have any questions or if you just want to chat with other Kontent developers.