Machine learning: How to ensure labels are smart, not lucky
Without labels and categories organizing our lives, the world would be chaos. However, what you consider to be a “sensible label” for something may not be the term others would use, and choosing the “correct” terms turns into a game of chance. So how can you guarantee that your project’s content is organized with appropriate and consistent categories?
In this article, I’ll demonstrate how to use ML.NET and the Kentico Kontent .NET SDKs to automate adding taxonomy terms to content items. Using data-driven categorization through machine learning, we will ensure that our taxonomy terms are consistent, while also taking the burden of choosing a sensible label off of our content editors’ shoulders.
Let’s turn that game of “lucky labeling” into a smarter solution!
No data science or machine learning degree required
I was initially intimidated by the algorithms, statistics, and other mathematical complexities surrounding creating machine learning models, but the fears were quickly dispelled when I discovered Microsoft’s ML.NET Model Builder Extension for Visual Studio. This extension uses automated machine learning to find the best machine learning algorithm for your scenario and data set. Once it chooses the best algorithm, you’re able to select that algorithm, train your model, and evaluate it all within a graphical user interface. The big takeaway: It requires zero machine learning/data science experience.
When could ML.NET categorization be helpful in a Kontent project?
Applying machine learning for content categorization could be helpful if:
- You have similar taxonomy terms in your project and editors are struggling to decide which term best fits their content.
- There are a lot of taxonomy terms in the project, and you need to identify which ones can be removed.
- You have a large data set and need to identify what terms you should add to your project.
There are likely many more scenarios where machine learning could be helpful, but these were the ideas that came to mind when I started this proof of concept. In the demonstration below, and for the remainder of the article, we are going to focus on scenario #1: You have similar taxonomy terms in your project, and editors are struggling to decide which term best fits their content.
To demonstrate how ML.NET can automate content categorization in Kentico Kontent, I created a .NET Core 3.1 console application that suggests which Netflix category a movie should be listed in based upon its title, description, and rating. For the sake of simplicity, I’ve focused on a non-hierarchical labeling structure where one taxonomy term is assigned to a movie. The steps for creating the project are broken down into the following stages and assume you know how to create a console application in Visual Studio:
- Establishing the ML.NET model
- Preparing Kentico Kontent
- Running content against the model
- Importing the suggested taxonomy term
Establishing the ML.NET model
The first step for creating my ML.NET model was to find a suitable dataset in the supported SQL Database, CSV, or TSV formats. I chose to use a CSV file but had to sanitize the values to make it a valid CSV and remove some ASCII characters from the data. I then downloaded and installed Microsoft’s ML.NET Model Builder Extension for Visual Studio. Upon completion, I opened my console application in Visual Studio, right-clicked the project, and used “Add” to add “Machine Learning” to the project.
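The cleanup itself depends on the dataset, and the article does not list which characters were removed. As a minimal sketch, assuming the problem characters were ASCII control characters and that fields containing commas or quotes needed RFC 4180 style quoting, a helper could look like this (`CsvSanitizer` and its methods are hypothetical names, not part of any library):

```csharp
using System;
using System.Linq;
using System.Text;

public static class CsvSanitizer
{
    // Replace line breaks and tabs with a space and drop other control characters.
    public static string CleanField(string value)
    {
        var sb = new StringBuilder(value.Length);
        foreach (char c in value)
        {
            if (c == '\n' || c == '\r' || c == '\t') sb.Append(' ');
            else if (!char.IsControl(c)) sb.Append(c);
        }
        return sb.ToString().Trim();
    }

    // Wrap the field in double quotes when it contains a comma or a quote,
    // doubling any embedded quotes.
    public static string QuoteField(string value)
    {
        string cleaned = CleanField(value);
        bool needsQuotes = cleaned.Contains(',') || cleaned.Contains('"');
        return needsQuotes ? "\"" + cleaned.Replace("\"", "\"\"") + "\"" : cleaned;
    }

    public static string ToCsvLine(params string[] fields) =>
        string.Join(",", fields.Select(QuoteField));
}

public static class Program
{
    public static void Main()
    {
        // Made-up sample values, only to show the quoting behaviour.
        Console.WriteLine(CsvSanitizer.ToCsvLine("Zoo, The", "TV-G", "A \"wild\"\u0007 family film"));
        // prints: "Zoo, The",TV-G,"A ""wild"" family film"
    }
}
```

Quoting matters here because movie titles and descriptions routinely contain commas, which would otherwise shift values into the wrong columns.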
After machine learning has been added, the extension provides six machine learning templates to choose from, including a “Custom Scenario” option. Per Microsoft’s scenario examples and descriptions, “Issue Classification” best fit my scenario, so I selected it.
The next steps were for training the model, so I selected my sanitized CSV, identified the column I wanted my program to predict, and chose which columns the model should base its suggestion upon.
I clicked the “Train” button at the bottom of the screen, which opened a screen where I could specify how long to train the model. Microsoft has some training recommendations here: https://docs.microsoft.com/en-us/dotnet/machine-learning/automate-training-with-model-builder#how-long-should-i-train-for. In 120 seconds, it managed a 57.44% accuracy. Clicking “Evaluate” allowed me to test the model against some of the data in the CSV and broke down the top 5 prediction results.
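Model Builder drives training from the UI, but a comparable time-boxed experiment can be sketched in code with the Microsoft.ML.AutoML NuGet package. The following is only an illustration under assumptions: the `MovieRow` class, the column order, and the file names `movies.csv` and `MLModel.zip` are hypothetical, and the metric printed here (micro accuracy) is not necessarily the figure Model Builder reports.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.AutoML;
using Microsoft.ML.Data;

// Hypothetical row shape; the column indices depend on your CSV layout.
public class MovieRow
{
    [LoadColumn(0)] public string Title { get; set; }
    [LoadColumn(1)] public string Rating { get; set; }
    [LoadColumn(2)] public string Description { get; set; }
    [LoadColumn(3)] public string ListedIn { get; set; }
}

public static class Trainer
{
    // Loads a comma-separated file with a header row and quoted fields.
    public static IDataView Load(MLContext ml, string csvPath) =>
        ml.Data.LoadFromTextFile<MovieRow>(
            csvPath, separatorChar: ',', hasHeader: true, allowQuoting: true);

    public static void Main(string[] args)
    {
        string csvPath = args.Length > 0 ? args[0] : "movies.csv";   // hypothetical file name
        uint seconds = args.Length > 1 ? uint.Parse(args[1]) : 120;  // training time budget

        var ml = new MLContext(seed: 0);
        IDataView data = Load(ml, csvPath);

        // Let AutoML try multiclass trainers until the time budget runs out.
        var result = ml.Auto()
            .CreateMulticlassClassificationExperiment(seconds)
            .Execute(data, labelColumnName: nameof(MovieRow.ListedIn));

        Console.WriteLine($"Best trainer: {result.BestRun.TrainerName}");
        Console.WriteLine($"Micro accuracy: {result.BestRun.ValidationMetrics.MicroAccuracy:P2}");

        // Persist the best model so another program can load it later.
        ml.Model.Save(result.BestRun.Model, data.Schema, "MLModel.zip");
    }
}
```

Passing a larger second argument (for example `3600`) gives the search more time to try different trainers and settings.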
After a few tests, I wasn’t happy with the 57% accuracy results, suggesting that the Microsoft training time recommendation was too short for my seventeen different classification categories. That inspired me to re-run the training for 1 hour, which resulted in an improved ~66% accuracy. I then clicked the “Code” button that generated and added two machine learning projects to my solution, and I was able to consume the model using:
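The generated code is specific to each project, so what follows is only a minimal sketch of the same idea using ML.NET directly: load a saved model, create a prediction engine, and ask it for a label. The column names (`title`, `rating`, `description`, `listed_in`), the file name `MLModel.zip`, the `ModelConsumer` helper, and the sample movie are all assumptions for illustration.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Assumed input shape; column names must match the ones used during training.
public class ModelInput
{
    [ColumnName("title")] public string Title { get; set; }
    [ColumnName("rating")] public string Rating { get; set; }
    [ColumnName("description")] public string Description { get; set; }
    [ColumnName("listed_in")] public string ListedIn { get; set; } // label; left empty when predicting
}

public class ModelOutput
{
    [ColumnName("PredictedLabel")] public string Prediction { get; set; }
    public float[] Score { get; set; } // one score per known category
}

public static class ModelConsumer
{
    public static ModelOutput Predict(string modelPath, ModelInput input)
    {
        var ml = new MLContext();
        ITransformer model = ml.Model.Load(modelPath, out DataViewSchema _);

        // PredictionEngine is not thread-safe; in a long-running app, create it
        // once per thread (or pool it) instead of once per call.
        using (var engine = ml.Model.CreatePredictionEngine<ModelInput, ModelOutput>(model))
        {
            return engine.Predict(input);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        // Hypothetical sample movie.
        var input = new ModelInput
        {
            Title = "A Sample Family Film",
            Rating = "TV-G",
            Description = "Two siblings and their dog set out on a summer adventure."
        };

        ModelOutput result = ModelConsumer.Predict("MLModel.zip", input);
        Console.WriteLine(result.Prediction);
    }
}
```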
I ran the console application, and it correctly produced the output “Children & Family Movies,” demonstrating that my ML.NET model was ready to be linked to a Kontent project.
Preparing Kentico Kontent
Next, I needed to prepare a Kentico Kontent project that would use my console application. In a real-world scenario, the project may exist before the need for “Smart” labeling arose, but in my proof of concept the demand for machine learning was driven by my dataset, not my Kontent project.
To set up Kontent, I created a new project in my subscription that would contain a single content type called Movie and a single taxonomy group called Listed in. The content type and taxonomy group consisted of elements and terms from the CSV I used to build my machine learning model:
- Title: a simple text element
- Rating: a multiple-choice element containing all 11 content ratings (rated TV-G to rated R) in my CSV
- Release Date: a date & time element
- Description: a rich text element
- Listed in: a taxonomy group containing all 17 of the “Listed in” terms in my CSV
Once the content type and taxonomy terms were present, I started making some sample content item variants using not-yet-released movie titles, ratings, and descriptions. I left them in the “Draft” workflow step to ensure I was able to upsert data to the items from my console application.
Running content against the model
Now it was time to pull that content from Kentico Kontent and feed it into my generated ML.NET model code so it could make a prediction. For this, I installed the Kentico Kontent .NET SDK NuGet package and created a MovieListing class that returns a strongly typed Movie:
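As a rough sketch of such a class, assuming the 12.x line of the Kentico.Kontent.Delivery package (where these types live in the `Kentico.Kontent.Delivery` namespace; other versions organize namespaces and response types differently), a content type codename of `movie`, and a trimmed-down `Movie` model, it could look like this. The preview API is used because items in the “Draft” step are not published.

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;
using Kentico.Kontent.Delivery;

// Assumed model: properties are matched to element codenames by name.
public class Movie
{
    public string Title { get; set; }
    public string Description { get; set; }
    public IEnumerable<MultipleChoiceOption> Rating { get; set; }
    public ContentItemSystemAttributes System { get; set; }
}

public class MovieListing
{
    private readonly IDeliveryClient _client;

    public MovieListing(string projectId, string previewApiKey)
    {
        // Draft content is only served by the preview API, which needs its own key.
        _client = DeliveryClientBuilder
            .WithOptions(builder => builder
                .WithProjectId(projectId)
                .UsePreviewApi(previewApiKey)
                .Build())
            .Build();
    }

    // Returns every item of the (assumed) "movie" content type.
    public Task<DeliveryItemListingResponse<Movie>> GetMovies() =>
        _client.GetItemsAsync<Movie>(new EqualsFilter("system.type", "movie"));
}
```

The caller can then await `GetMovies()` and iterate over the response's `Items` collection.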
I instantiated this class from my main Program.cs, passing my API keys to it, and returned a DeliveryItemListingResponse that I could loop through. I set up my ML.NET consumption logic in a separate class for easier maintainability and a cleaner main program:
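One way to structure such a class, shown here only as a sketch, is to keep the mapping from a movie to the model's input in one place and inject the prediction function. Everything below is a simplified stand-in: `Movie`, `ModelInput`, and `ModelOutput` are plain classes without SDK or ML.NET attributes, and the predictor passed in `Main` is a fake rule so the example runs without a trained model. In a real application you would pass the `Predict` method of an ML.NET `PredictionEngine<ModelInput, ModelOutput>` instead.

```csharp
using System;
using System.Collections.Generic;

// Simplified stand-ins for the strongly typed movie and the model classes.
public class Movie
{
    public string Codename { get; set; }
    public string Title { get; set; }
    public string Rating { get; set; }
    public string Description { get; set; }
}

public class ModelInput
{
    public string Title { get; set; }
    public string Rating { get; set; }
    public string Description { get; set; }
}

public class ModelOutput
{
    public string Prediction { get; set; }
}

public class TaxonomyPredictor
{
    private readonly Func<ModelInput, ModelOutput> _predict;

    public TaxonomyPredictor(Func<ModelInput, ModelOutput> predict) =>
        _predict = predict ?? throw new ArgumentNullException(nameof(predict));

    // Maps a movie onto the model's input columns and returns the best-match label.
    public string GetTaxonomy(Movie movie)
    {
        var input = new ModelInput
        {
            Title = movie.Title,
            Rating = movie.Rating,
            Description = movie.Description
        };
        return _predict(input).Prediction;
    }
}

public static class Program
{
    public static void Main()
    {
        // Fake predictor for demonstration only; not a trained model.
        var predictor = new TaxonomyPredictor(input => new ModelOutput
        {
            Prediction = input.Rating == "TV-G" ? "Children & Family Movies" : "Dramas"
        });

        var movies = new List<Movie>
        {
            new Movie { Codename = "sample_movie", Title = "A Sample Family Film", Rating = "TV-G", Description = "..." }
        };

        foreach (Movie movie in movies)
        {
            Console.WriteLine($"{movie.Title}: {predictor.GetTaxonomy(movie)}");
        }
    }
}
```

Injecting the prediction function keeps the class easy to test and leaves the model loading in a single place.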
And I instantiated it from Program.cs. The TaxonomyPredictor.GetTaxonomy(Movie) method could then be used to suggest “listed in” terms when looping through my list of movies returned by the MovieListing.GetMovies() method.
This produced the “best match” taxonomy term in the console when I ran the application.
Importing the suggested taxonomy term
Finally, it was time to automate upserting the suggested taxonomy terms to the content item variants sitting in my Kentico Kontent project. I achieved this using the Kentico Kontent .NET Management SDK in a separate class called TaxonomyImporter:
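The outline below sketches the shape such an importer could take. It deliberately hides the actual Management SDK call behind a hypothetical delegate, because the exact SDK types depend on the package version. What it does illustrate is one step that is easy to overlook: the model predicts a display name such as “Children & Family Movies”, while the Management API refers to taxonomy terms by codename, ID, or external ID, so a lookup is needed in between. The codenames shown are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public class TaxonomyImporter
{
    private readonly IReadOnlyDictionary<string, string> _termCodenamesByName;
    private readonly Func<string, string, Task> _upsertTerm; // (itemCodename, termCodename)

    public TaxonomyImporter(
        IReadOnlyDictionary<string, string> termCodenamesByName,
        Func<string, string, Task> upsertTerm)
    {
        _termCodenamesByName = termCodenamesByName ?? throw new ArgumentNullException(nameof(termCodenamesByName));
        _upsertTerm = upsertTerm ?? throw new ArgumentNullException(nameof(upsertTerm));
    }

    // Returns false (and changes nothing) when the predicted label has no matching term.
    public async Task<bool> UpsertTaxonomy(string itemCodename, string predictedLabel)
    {
        if (!_termCodenamesByName.TryGetValue(predictedLabel, out string termCodename))
        {
            return false;
        }

        await _upsertTerm(itemCodename, termCodename);
        return true;
    }
}

public static class Program
{
    public static async Task Main()
    {
        var terms = new Dictionary<string, string>
        {
            ["Children & Family Movies"] = "children_family_movies" // hypothetical codename
        };

        // Stand-in for the real upsert; replace with a Management SDK call.
        var importer = new TaxonomyImporter(terms, (item, term) =>
        {
            Console.WriteLine($"Would set listed_in of '{item}' to '{term}'");
            return Task.CompletedTask;
        });

        bool updated = await importer.UpsertTaxonomy("sample_movie", "Children & Family Movies");
        Console.WriteLine(updated ? "Updated" : "Skipped");
    }
}
```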
I instantiated the importer in Program.cs so I could use the UpsertTaxonomy method when looping through my list of movies. I then had to create a strongly typed MovieImport model that inherits from Movie in order to accommodate what the Kentico Kontent Management SDK expects when upserting content item variants:
Once this was done, I added logic to read settings from environment variables and appsettings.json using .NET Core configuration binding, and the application was complete! Below are the completed Program.cs file, the configuration files, and the result of running the program:
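For the configuration piece specifically, a minimal sketch might look like the following. It assumes the Microsoft.Extensions.Configuration packages for JSON files, environment variables, and binding are installed; the `Kontent` section and the `KontentSettings` property names are hypothetical.

```csharp
using System;
using Microsoft.Extensions.Configuration;

// Hypothetical settings shape bound from the "Kontent" configuration section.
public class KontentSettings
{
    public string ProjectId { get; set; }
    public string PreviewApiKey { get; set; }
    public string ManagementApiKey { get; set; }
}

public static class Settings
{
    public static KontentSettings Load()
    {
        // Later sources win, so environment variables override appsettings.json.
        IConfiguration config = new ConfigurationBuilder()
            .AddJsonFile("appsettings.json", optional: true)
            .AddEnvironmentVariables()
            .Build();

        // Get<T>() returns null when the section is missing entirely.
        return config.GetSection("Kontent").Get<KontentSettings>() ?? new KontentSettings();
    }
}

public static class Program
{
    public static void Main()
    {
        KontentSettings settings = Settings.Load();
        Console.WriteLine(string.IsNullOrEmpty(settings.ProjectId)
            ? "No project ID configured"
            : $"Using project {settings.ProjectId}");
    }
}
```

With this layout, an environment variable named `Kontent__ProjectId` (double underscore as the section separator) supplies or overrides the `ProjectId` value, which keeps API keys out of source control.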
Imagine the possibilities
In this article, I showed you how to use the ML.NET Model Builder with a .NET Core console application and a Kentico Kontent project in order to automate assigning taxonomy terms to your content. In my demonstration, I focused on a flat taxonomy structure and straightforward classification scenario, but this is just one of almost endless machine learning possibilities. What if you took this to the next level and attached it to Kentico Kontent webhooks? How about using image classification to automatically write descriptions for your assets? Maybe you want to dive into machine learning and create your own custom model to perform more complex categorization specific to your industry? With the right tools and mindset, I think anything is possible.
You can find the full source code for my application, supporting data files, and instructions on how to test your own copy of the project here: