
Machine learning: How to ensure labels are smart, not lucky

By Michael Berry, May 27, 2020

Without labels and categories organizing our lives, the world would be chaos. However, what you consider to be a “sensible label” for something may not be the term others would use, and choosing the “correct” terms turns into a game of chance. So how can you guarantee that your project’s content is organized with appropriate and consistent categories?

In this article, I’ll demonstrate how to use ML.NET and the Kentico Kontent .NET SDKs to automate adding taxonomy terms to content items. Using data-driven categorization through machine learning, we will keep our taxonomy terms consistent while also taking the burden of choosing a sensible label off our content editors’ shoulders.

Let’s turn that game of “lucky labeling” into a smarter solution!

No data science or machine learning degree required

I was initially intimidated by the algorithms, statistics, and other mathematical complexities surrounding creating machine learning models, but the fears were quickly dispelled when I discovered Microsoft’s ML.NET Model Builder Extension for Visual Studio. This extension uses automated machine learning to find the best machine learning algorithm for your scenario and data set. Once it chooses the best algorithm, you’re able to select that algorithm, train your model, and evaluate it all within a graphical user interface. The big takeaway: It requires zero machine learning/data science experience.

When could ML.NET categorization be helpful in a Kontent project?

Applying machine learning for content categorization could be helpful if:

  1. You have similar taxonomy terms in your project and editors are struggling to decide which term best fits their content.
  2. There are a lot of taxonomy terms in the project, and you need to identify which ones can be removed.
  3. You have a large data set and need to identify what terms you should add to your project.

There are likely many more scenarios where machine learning could be helpful, but these were the ideas that came to mind when I started this proof of concept. In the demonstration below, and for the remainder of the article, we are going to focus on scenario #1: You have similar taxonomy terms in your project, and editors are struggling to decide which term best fits their content.

Demonstration

To demonstrate how ML.NET can automate content categorization in Kentico Kontent, I created a .NET Core 3.1 console application that suggests which Netflix category a movie should be listed in based upon its title, description, and rating. For the sake of simplicity, I’ve focused on a non-hierarchical labeling structure where one taxonomy term is assigned to a movie. The steps for creating the project assume you know how to create a console application in Visual Studio, and are broken down into the following stages:

  1. Establishing the ML.NET model
  2. Preparing Kentico Kontent
  3. Running content against the model
  4. Importing the suggested taxonomy term


Establishing the ML.NET model

The first step for creating my ML.NET model was to find a suitable dataset in the supported SQL Database, CSV, or TSV formats. I chose to use a CSV file but had to sanitize the values to make it a valid CSV and remove some ASCII characters from the data. I then downloaded and installed Microsoft’s ML.NET Model Builder Extension for Visual Studio. Upon completion, I opened my console application in Visual Studio, right-clicked the project, and used “Add” to add “Machine Learning” to the project.

After machine learning has been added, the extension provides six machine learning templates to choose from, including a “Custom Scenario” option. Per Microsoft’s scenario examples and descriptions, “Issue Classification” best fit my scenario, so I selected it.

The next steps were for training the model, so I selected my sanitized CSV, identified the column I wanted my program to predict, and chose which columns the model should base its suggestion upon.

I clicked the “Train” button at the bottom of the screen, which opened a screen where I could specify how long to train the model. Microsoft has some training recommendations here: https://docs.microsoft.com/en-us/dotnet/machine-learning/automate-training-with-model-builder#how-long-should-i-train-for. In 120 seconds, it managed a 57.44% accuracy. Clicking “Evaluate” allowed me to test the model against some of the data in the CSV and broke down the top 5 prediction results. 
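
For context on what a figure like 57.44% means, here is a minimal sketch of plain classification accuracy: the share of rows where the predicted label equals the actual one. It assumes that this is the metric behind the number above (the article doesn't say which metric Model Builder reported), and the class name, method name, and sample labels are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class AccuracySketch
{
    // Fraction of rows whose predicted label equals the actual label.
    public static double Accuracy(IReadOnlyList<string> actual, IReadOnlyList<string> predicted)
    {
        if (actual == null) throw new ArgumentNullException(nameof(actual));
        if (predicted == null) throw new ArgumentNullException(nameof(predicted));
        if (actual.Count != predicted.Count)
            throw new ArgumentException("Both lists must have the same length.");
        if (actual.Count == 0)
            throw new ArgumentException("At least one row is required.");

        int correct = 0;
        for (int i = 0; i < actual.Count; i++)
        {
            // Labels are compared exactly, including case.
            if (string.Equals(actual[i], predicted[i], StringComparison.Ordinal))
            {
                correct++;
            }
        }

        return (double)correct / actual.Count;
    }
}
```

With four hypothetical rows where three predictions match, AccuracySketch.Accuracy(new[] { "A", "B", "A", "C" }, new[] { "A", "A", "A", "C" }) returns 0.75. With seventeen possible categories, a random guess would land far below that, which is worth keeping in mind when judging the percentages in this section.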

After a few tests, I wasn’t happy with the 57% accuracy results, suggesting that the Microsoft training time recommendation was too short for my seventeen different classification categories. That inspired me to re-run the training for 1 hour, which resulted in an improved ~66% accuracy. I then clicked the “Code” button that generated and added two machine learning projects to my solution, and I was able to consume the model using:

// Add input data
var input = new ModelInput();
input.Title = "The Nightmare Before Christmas";
input.Description = "Jack Skellington, king of Halloween Town, " +
					"discovers Christmas Town, but his attempts to bring Christmas to his home causes confusion.";
input.Rating = "PG";

// Load model and predict output of sample data
ModelOutput result = ConsumeModel.Predict(input);
Console.WriteLine(result.Prediction);

I ran the console application, and it correctly produced the output “Children & Family Movies,” demonstrating that my ML.NET model was ready to be linked to a Kontent project. 
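
The “Evaluate” screen listed the top 5 predictions rather than a single answer. A similar ranking could be sketched in code as shown below. This is only an illustration under two assumptions: that the generated ModelOutput class exposes a Score array alongside Prediction, and that you already have the category names in the same order as those scores (getting that ordering out of the trained model is not shown). TopPredictions and Top are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class TopPredictions
{
    // Pairs each label with its score and returns the highest-scoring entries first.
    public static List<KeyValuePair<string, float>> Top(
        IReadOnlyList<string> labels, IReadOnlyList<float> scores, int count)
    {
        if (labels == null) throw new ArgumentNullException(nameof(labels));
        if (scores == null) throw new ArgumentNullException(nameof(scores));
        if (labels.Count != scores.Count)
            throw new ArgumentException("Labels and scores must have the same length.");
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));

        return labels
            .Select((label, index) => new KeyValuePair<string, float>(label, scores[index]))
            .OrderByDescending(pair => pair.Value)
            .Take(count)
            .ToList();
    }
}
```

Showing editors the top few candidates with their scores, instead of only the best match, may be more useful when the model’s accuracy is in the 60% range.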

Preparing Kentico Kontent

Next, I needed to prepare a Kentico Kontent project that would use my console application. In a real-world scenario, the project may exist before the need for “Smart” labeling arose, but in my proof of concept the demand for machine learning was driven by my dataset, not my Kontent project. 

To set up Kontent, I created a new project in my subscription that would contain a single content type called Movie and a single taxonomy group called Listed in. The content type and taxonomy group consisted of elements and terms from the CSV I used to build my machine learning model:

  • Title: a simple text element
  • Rating: a multiple-choice element containing all 11 content ratings (rated TV-G to rated R) in my CSV
  • Release Date: a date & time element
  • Description: a rich text element
  • Listed in: a taxonomy group containing all 17 “Listed in” terms in my CSV

Once the content type and taxonomy terms were present, I started making some sample content item variants using not-yet-released movie titles, ratings, and descriptions. I left them in the “Draft” workflow step to ensure I was able to upsert data to the items from my console application.

Running content against the model

Now it was time to pull that content from Kentico Kontent and feed it into my generated ML.NET model code so it could make a prediction. For this, I installed the Kentico Kontent .NET SDK NuGet package and created a MovieListing class that returns strongly typed Movie items:

//file location: MLNET-kontent-taxonomy-app/MovieListing.cs
class MovieListing
    {
        IDeliveryClient client;

        public MovieListing(KontentKeys keys)
        {
            client = DeliveryClientBuilder
                .WithOptions(builder => builder
                .WithProjectId(keys.ProjectId)
                .UsePreviewApi(keys.PreviewApiKey)
                .Build())
                .Build();
        }        

        public async Task<DeliveryItemListingResponse<Movie>> GetMovies()
        {
            DeliveryItemListingResponse<Movie> response = await client.GetItemsAsync<Movie>(
                new EqualsFilter("system.type", "movie"),
                new ElementsParameter("title", "rating", "description", "listed_in")
                );

            return response;
        }        
    }

//file location: MLNET-kontent-taxonomy-app/Models/Movie.cs
//generated using: https://github.com/Kentico/kontent-generators-net
public partial class Movie
    {
        public const string Codename = "movie";
        public const string DescriptionCodename = "description";
        public const string ListedInCodename = "listed_in";
        public const string RatingCodename = "rating";
        public const string ReleaseDateCodename = "release_date";
        public const string TitleCodename = "title";

        public string Description { get; set; }
        [JsonProperty("listed_in")]
        public IEnumerable<TaxonomyTerm> ListedIn { get; set; }
        [JsonProperty("rating")]
        public IEnumerable<MultipleChoiceOption> Rating { get; set; }
        [JsonProperty("release_date")]
        public DateTime? ReleaseDate { get; set; }
        public ContentItemSystemAttributes System { get; set; }
        [JsonProperty("title")]
        public string Title { get; set; }
    }

I instantiated the MovieListing class from my main Program.cs, passing my API keys to it, and got back a DeliveryItemListingResponse that I could loop through. I set up my ML.NET consumption logic in a separate class for easier maintainability and a cleaner main program:

//file location: MLNET-kontent-taxonomy-app/TaxonomyPredictor.cs
class TaxonomyPredictor
    {
        public string GetTaxonomy(Movie movie)
        {
            // Add input data
            var input = new ModelInput();

            input.Title = movie.Title;
            input.Rating = movie.Rating.ToList().First().Name;
            input.Description = movie.Description;

            // Load model and predict output of sample data
            ModelOutput result = ConsumeModel.Predict(input);
            Console.WriteLine("Listing best match: " + result.Prediction);

            //formatting to meet Kontent codename requirements 
            //ex: Children & Family Movies => children___family_movies
            var formatted_prediction = result.Prediction.Replace(" ", "_").Replace("&", "_").ToLower();

            return formatted_prediction;
        }
    }

And I instantiated it from Program.cs. The TaxonomyPredictor.GetTaxonomy(Movie) method could then be used to suggest “listed in” terms when looping through my list of movies returned by the MovieListing.GetMovies() method.

//file location: MLNET-kontent-taxonomy-app/Program.cs

MovieListing movieListing = new MovieListing(keys);
TaxonomyPredictor predictor = new TaxonomyPredictor();

var movies = movieListing.GetMovies();

foreach (Movie movie in movies.Result.Items)
{
	if (movie.ListedIn.Count() < 1)
	{
		string formatted_prediction = predictor.GetTaxonomy(movie);

		Console.WriteLine(formatted_prediction);
	}
}

This produced the “best match” taxonomy term in the console when I ran the application.
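
The chained Replace calls in GetTaxonomy only handle spaces and ampersands. If other term names contain characters such as hyphens or apostrophes, a more general formatter might be needed. Below is a minimal sketch that turns every character other than an ASCII letter or digit into an underscore. It assumes Kontent derives codenames that way, which matches the “children___family_movies” example but is not confirmed here for other characters, so compare the output against the real codenames in your project. CodenameFormatter and ToCodename are hypothetical names.

```csharp
using System;
using System.Text;

public static class CodenameFormatter
{
    // Lowercases ASCII letters, keeps digits, and replaces everything else with '_'.
    // Assumption: this mirrors how the taxonomy term codenames were generated.
    public static string ToCodename(string termName)
    {
        if (termName == null) throw new ArgumentNullException(nameof(termName));

        var builder = new StringBuilder(termName.Length);
        foreach (char c in termName)
        {
            bool isAsciiLetterOrDigit =
                (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9');

            builder.Append(isAsciiLetterOrDigit ? char.ToLowerInvariant(c) : '_');
        }

        return builder.ToString();
    }
}
```

CodenameFormatter.ToCodename("Children & Family Movies") returns "children___family_movies", the same value the original Replace chain produces, while a hypothetical term such as "Sci-Fi & Fantasy" becomes "sci_fi___fantasy".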

Importing the suggested taxonomy term

Finally, it was time to automate upserting the suggested taxonomy terms to the content item variants sitting in my Kentico Kontent project. I achieved this using the Kentico Kontent .NET Management SDK in a separate class called TaxonomyImporter:

//file location: MLNET-kontent-taxonomy-app/TaxonomyImporter.cs
class TaxonomyImporter  
    {
        ManagementClient client;

        public TaxonomyImporter(KontentKeys keys)
        {
            ManagementOptions options = new ManagementOptions
            {
                ProjectId = keys.ProjectId,
                ApiKey = keys.ManagementApiKey,
            };

            // Initializes an instance of the ManagementClient client
            client = new ManagementClient(options);
        }

        public async Task<string> UpsertTaxonomy(Movie movie, string listing_prediction)
        {

            MovieImport stronglyTypedElements = new MovieImport
            {
                ListedIn = new[] { TaxonomyTermIdentifier.ByCodename(listing_prediction) }
            };

            // Specifies the content item and the language variant
            ContentItemIdentifier itemIdentifier = ContentItemIdentifier.ByCodename(movie.System.Codename);
            LanguageIdentifier languageIdentifier = LanguageIdentifier.ByCodename(movie.System.Language);
            ContentItemVariantIdentifier identifier = new ContentItemVariantIdentifier(itemIdentifier, languageIdentifier);

            // Upserts a language variant of your content item
            ContentItemVariantModel<MovieImport> response = await client.UpsertContentItemVariantAsync(identifier, stronglyTypedElements);

            return response.Elements.Title + " updated.";
        }
    }

I instantiated the importer in Program.cs so I could use the UpsertTaxonomy method when looping through my list of movies. I then had to create a strongly typed MovieImport model that inherits from Movie in order to accommodate what the Kentico Kontent Management SDK expects when upserting content item variants:

//file location: MLNET-kontent-taxonomy-app/Models/MovieImport.cs
public partial class MovieImport : Movie
    {
        [JsonProperty("listed_in")]
        public new IEnumerable<TaxonomyTermIdentifier> ListedIn { get; set; }
        [JsonProperty("rating")]
        public new IEnumerable<MultipleChoiceOptionIdentifier> Rating { get; set; }
    }

Once this was done, and I added logic to use environment variables, appsettings.json, and configuration binding using a combination of .NET Core configuration options, the application was complete! Below are the completed Program.cs file and the configuration files:

//file location: MLNET-kontent-taxonomy-app/Program.cs
using System;
using System.Linq;
using Microsoft.Extensions.Configuration;

class Program
{
	static void Main(string[] args)
	{
		Console.WriteLine("Starting program...");

		var environmentName = Environment.GetEnvironmentVariable("ASPNETCORE_ENVIRONMENT");

		var configuration = new ConfigurationBuilder()
			.AddJsonFile("appsettings.json", optional: true, reloadOnChange: true)
			.AddJsonFile($"appsettings.{environmentName}.json", optional: true, reloadOnChange: true)
			.Build();

		// KontentKeys is defined in MLNET-kontent-taxonomy-app/Configuration/KontentKeys.cs
		var keys = new KontentKeys();

		ConfigurationBinder.Bind(configuration.GetSection("KontentKeys"), keys);

		MovieListing movieListing = new MovieListing(keys);
		TaxonomyPredictor predictor = new TaxonomyPredictor();
		TaxonomyImporter importer = new TaxonomyImporter(keys);

		var movies = movieListing.GetMovies();

		foreach (Movie movie in movies.Result.Items)
		{
			if (movie.ListedIn.Count() < 1)
			{
				string formatted_prediction = predictor.GetTaxonomy(movie);
				var upsertResponse = importer.UpsertTaxonomy(movie, formatted_prediction).Result;

				Console.WriteLine(upsertResponse);
			}
		}

		Console.WriteLine("Program finished.");
	}        
}

//file location: MLNET-kontent-taxonomy-app/Configuration/KontentKeys.cs
public class KontentKeys
    {
        public string ProjectId { get; set; }
        public string PreviewApiKey { get; set; }
        public string ManagementApiKey { get; set; }
    }

//file location: MLNET-kontent-taxonomy-app/appsettings.json
{
  "KontentKeys": {
    "ProjectId": "<YOUR PROJECT ID>",
    "PreviewApiKey": "<YOUR PREVIEW API KEY>",
    "ManagementApiKey": "<YOUR MANAGEMENT API KEY>"
  }
}
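
Because both JSON files are registered with optional: true, a missing or unedited appsettings.json leaves the bound keys empty, and the failure only shows up later when the Kontent clients are called. A small guard could catch that earlier. The sketch below is one possible approach, not part of the original project: KontentKeysValidator and FindMissing are hypothetical names, and the KontentKeys class is repeated only so the example compiles on its own.

```csharp
using System;
using System.Collections.Generic;

// Stand-in copy of the KontentKeys class above, so the sketch is self-contained.
public class KontentKeys
{
    public string ProjectId { get; set; }
    public string PreviewApiKey { get; set; }
    public string ManagementApiKey { get; set; }
}

public static class KontentKeysValidator
{
    // Returns the names of settings that are empty or still hold a "<YOUR ...>" placeholder.
    public static List<string> FindMissing(KontentKeys keys)
    {
        if (keys == null) throw new ArgumentNullException(nameof(keys));

        var missing = new List<string>();
        AddIfMissing(missing, nameof(KontentKeys.ProjectId), keys.ProjectId);
        AddIfMissing(missing, nameof(KontentKeys.PreviewApiKey), keys.PreviewApiKey);
        AddIfMissing(missing, nameof(KontentKeys.ManagementApiKey), keys.ManagementApiKey);
        return missing;
    }

    private static void AddIfMissing(List<string> missing, string name, string value)
    {
        bool isPlaceholder = value != null
            && value.StartsWith("<", StringComparison.Ordinal)
            && value.EndsWith(">", StringComparison.Ordinal);

        if (string.IsNullOrWhiteSpace(value) || isPlaceholder)
        {
            missing.Add(name);
        }
    }
}
```

Calling KontentKeysValidator.FindMissing(keys) right after the configuration binding, and stopping with a clear message when the returned list is not empty, would make a configuration mistake easier to spot than an error from the API.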

Imagine the possibilities

In this article, I showed you how to use the ML.NET Model Builder with a .NET Core console application and a Kentico Kontent project in order to automate assigning taxonomy terms to your content. In my demonstration, I focused on a flat taxonomy structure and straightforward classification scenario, but this is just one of almost endless machine learning possibilities. What if you took this to the next level and attached it to Kentico Kontent webhooks? How about using image classification to automatically write descriptions for your assets? Maybe you want to dive into machine learning and create your own custom model to perform more complex categorization specific to your industry? With the right tools and mindset, I think anything is possible.

You can find the full source code for my application, supporting data files, and instructions on how to test your own copy of the project here:

Get the code

Written by
Michael Berry

I’m the US Support Lead at Kentico. In between managing an all-star team of support engineers and helping customers, I like to challenge myself with technical Kentico Kontent projects.
