Machine Learning with ML.NET in UWP: Binary Classification

In this article we will use ML.NET to build and compare four Machine Learning Binary Classification pipelines. Each model uses a different algorithm to predict the quality of wine from 11 physicochemical features. The characteristics of the prediction models are visualized using OxyPlot. All the code is in C# (“Look mom, no Python!”) and hosted in a UWP app, together with some other ML.NET use cases.

Here’s what the Binary Classification sample page looks like. It displays the quality metrics of each model, and a selection of wines on which the models disagree:

SamplePageWithDisagreements

This Binary Classification sample evolved from a copy of the code and the datasets from this article on Rubik’s Code.

In the first part of this article we try to provide some relevant technical background (“Machine Learning for Developers”).

In the second part we will focus on model evaluation and pipeline customization, and move rapidly through the basic steps to implement a typical Machine Learning use case in ML.NET. If you want more details on each of these steps, please (re-)visit the previous articles in this series.

Binary Classification

Binary Classification uses a classification rule to place the elements of a given set into two groups, or to predict which group each element belongs to. In Machine Learning, Binary Classification is a form of supervised learning: the classifier requires labeled (rated) samples for training and evaluation.

Math, science, decision trees, unraveling the mysteries

Binary Classification boils down to the universal problem of separating the good from the bad. It should not come as a surprise that very diverse algorithms exist, originating from very diverse domains (mathematics, probability theory, biology, operations research). Here’s a small list of Binary Classification algorithms:

  • Decision trees
  • Random forests
  • Bayesian networks
  • Support vector machines
  • Neural networks
  • Logistic regression
  • Probit model

These algorithms make different assumptions about your data and its distribution, have different performance in training and inferencing, and have different configuration options (model parameters). Fortunately there are some good resources available that help you determine the appropriate model for your Machine Learning problem, like ‘How to choose algorithms for Azure Machine Learning Studio’. Here’s a relevant table from that article. It compares some Binary Classification models on the number of parameters to tune, plus some notes (the article itself also rates their accuracy, training time, and linearity):

Two-class classification algorithm      Parameters   Notes
logistic regression                     5
decision forest                         6
decision jungle                         6            Low memory footprint
boosted decision tree                   6            Large memory footprint
neural network                          9            Additional customization is possible
averaged perceptron                     4
support vector machine                  5            Good for large feature sets
locally deep support vector machine     8            Good for large feature sets
Bayes’ point machine                    3

If you’re more into diagrams, there’s an excellent graphical cheat sheet right here. Here’s its binary classification overview:

CheatSheet

ML.NET has implementations for most binary classification algorithms, but recommends the following trainers:

  • AveragedPerceptronTrainer
  • StochasticGradientDescentClassificationTrainer
  • LightGbmBinaryTrainer
  • FastTreeBinaryClassificationTrainer
  • SymSgdClassificationTrainer

The BinaryClassificationCatalog has more members than this. Here are all the current trainer classes:

  • AveragedPerceptronTrainer
  • BinaryClassificationGamTrainer
  • FastForestClassification
  • FastTreeBinaryClassificationTrainer
  • FieldAwareFactorizationMachineTrainer
  • LightGbmBinaryTrainer
  • LinearSvmTrainer
  • LogisticRegression
  • PriorTrainer
  • RandomTrainer (removed: https://github.com/dotnet/machinelearning/pull/2849)
  • StochasticGradientDescentClassificationTrainer
  • SymSgdClassificationTrainer

We will not compare all of these algorithms in this article, just a representative set. The choice was limited by the hosting technology: some algorithms still have known compatibility glitches with UWP.

Here’s some background on the contestants in our comparison:

Linear SVM

Linear SVM has its roots in mathematics. A Support Vector Machine (SVM) places each training sample as a p-dimensional vector (a list of p numbers) in space, and then calculates a (p-1)-dimensional hyperplane so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.

In two-dimensional space the hyperplane is just a line dividing the plane into two parts – one for each class – like in this illustration from Chris Albon:

SupportVectorClassifier

Linear SVM tries to create a hyperplane with all positive samples on one side and all negative samples on the other. Whether that works, and how hard it is, depends on the data set. When the data are not linearly separable, a hinge loss function is introduced to represent the price paid for inaccurate predictions. Training the model then boils down to minimizing this function.

Linear SVM is a workhorse, simple and fast, but it may be overly simplistic for some problems. For a deeper dive into SVM, check this article.
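
Here’s a minimal sketch of that hinge loss for a linear model, just to make the mechanics concrete (the class and method names are our own, not ML.NET’s):

public static class LinearSvmSketch
{
    // The raw score of a linear model: w·x + b.
    public static double Score(double[] weights, double bias, double[] features)
    {
        double sum = bias;
        for (int i = 0; i < weights.Length; i++)
        {
            sum += weights[i] * features[i];
        }

        return sum;
    }

    // Hinge loss with labels encoded as -1 (bad) and +1 (good): zero for a
    // confidently correct prediction, growing linearly on the wrong side of the margin.
    public static double HingeLoss(double label, double score)
    {
        return Math.Max(0, 1 - label * score);
    }
}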

Perceptron

The perceptron algorithm is inspired by how a brain works, and hence has its origin in biology. A perceptron can be considered a single artificial neuron that gets an input signal for each feature, with the strength of the signal being the feature value. During training, the perceptron learns the weights for the features and feeds the weighted sum of the inputs to its activation function. If that weighted sum passes a threshold value, the neuron ‘fires’ and the predicted result is positive. Here’s how this looks in an image, from “What the Hell is Perceptron”:

Perceptron

The learning happens one example at a time, with multiple iterations over the dataset, so you may expect long training times on larger datasets. Just like SVM, the perceptron is a linear classifier, so it will make mistakes if the data is not linearly separable. There’s an excellent deeper dive into perceptron right here.
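
To make that training loop concrete, here’s a toy version of the classic perceptron update rule – a simplification of ML.NET’s averaged variant, with all names invented for the sketch:

public class PerceptronSketch
{
    private readonly double[] _weights;
    private double _bias;

    public PerceptronSketch(int featureCount) => _weights = new double[featureCount];

    // The neuron 'fires' when the weighted sum of the features passes the threshold.
    public bool Predict(double[] features)
    {
        double sum = _bias;
        for (int i = 0; i < _weights.Length; i++)
        {
            sum += _weights[i] * features[i];
        }

        return sum > 0;
    }

    // One example at a time, multiple iterations over the dataset.
    public void Train(double[][] samples, bool[] labels, int iterations, double learningRate)
    {
        for (int iteration = 0; iteration < iterations; iteration++)
        {
            for (int s = 0; s < samples.Length; s++)
            {
                double error = (labels[s] ? 1.0 : 0.0) - (Predict(samples[s]) ? 1.0 : 0.0);
                if (error == 0) continue; // only misclassified samples move the weights

                for (int i = 0; i < _weights.Length; i++)
                {
                    _weights[i] += learningRate * error * samples[s][i];
                }
                _bias += learningRate * error;
            }
        }
    }
}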

Logistic Regression

Logistic Regression has its origin in statistics. Despite the ‘regression’ in its name (regression is predicting a continuous value) it is actually a powerful tool for two-class and multiclass classification. That is because Logistic Regression predicts a probability (“the likelihood”) – a continuous value between 0 and 1 that can easily be mapped to a class or a Boolean.

‘Logistic’ refers to a specific S-shaped curve that is nearly flat at both extremes and almost linear in the middle. It maps the weighted sum of the features to a probability, which makes it a natural fit for dividing data into two groups (image from “Understanding Logistic Regression in Python”):

linear_vs_logistic_regression

Logistic Regression does not work well on data with many outliers, or when there is high correlation between the features. Check our articles on correlation analysis and distribution analysis to learn how to detect and avoid these. For more details on Logistic Regression, check these course notes.
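
Here’s the logistic curve in code – a sketch of how a linear score becomes a probability, and how that probability becomes a Boolean (the names are ours, not ML.NET’s):

public static class LogisticSketch
{
    // The logistic (sigmoid) function squashes any score into the (0, 1) range.
    public static double Sigmoid(double score) => 1.0 / (1.0 + Math.Exp(-score));

    // The probability can then be mapped to a class with a threshold.
    public static bool ToClass(double probability, double threshold = 0.5) =>
        probability >= threshold;
}

// E.g. a raw score of 2.0 yields a probability of about 0.88, so: 'good'.
// bool isGood = LogisticSketch.ToClass(LogisticSketch.Sigmoid(2.0));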

Stochastic Dual Coordinate Ascent

Stochastic Dual Coordinate Ascent (SDCA) is an algorithm for large-scale supervised learning that has its origins in mathematical programming – the science of finding the best element, with regard to some criteria, from a set of available alternatives. Most Machine Learning algorithms learn their model’s parameters by minimizing some kind of loss function, such as ordinary least squares in linear regression or the already mentioned hinge loss. When the training set becomes bigger and no simple formulas exist, solving the parameters analytically (with closed-form equations) becomes economically impractical. In those cases it makes more sense to use an optimization algorithm like Coordinate Descent or Gradient Descent. Such algorithms bring you ‘close enough’ to the mathematical solution in a number of steps that each make small adjustments to the parameters. When even these calculations are too expensive, you can try reversing the problem by exploiting the duality gap (the minimum of the original problem is close enough to the maximum of its dual). In a nutshell, that’s what SDCA does: it iteratively reads one random (stochastic) sample from the training set and updates the model parameters, until the duality gap is sufficiently small.

Microsoft’s version of SDCA is optimized for large out-of-memory datasets and parallelism.
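
In ML.NET this trainer can be tuned through an options class. Here’s a sketch with arbitrary values – treat the option names as assumptions and verify them against the API version you’re using:

// Assumes 'using Microsoft.ML.Trainers;' for the options class.
var sdca = MLContext.BinaryClassification.Trainers.SdcaLogisticRegression(
    new SdcaLogisticRegressionBinaryTrainer.Options
    {
        LabelColumnName = "Label",
        FeatureColumnName = "Features",
        ConvergenceTolerance = 0.01f,    // how small the duality gap should get
        MaximumNumberOfIterations = 20,  // passes over the (cached) training data
        NumberOfThreads = 4              // SDCA is optimized for parallelism
    });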

Decision trees

We regret that there is no Decision Tree based algorithm in our sample app. The implementations of this family of algorithms in ML.NET are currently broken for UWP.

Metrics Reloaded

Evaluating the prediction model is an essential part of any Machine Learning project. There are many metrics available to assess the quality of a model and/or compare it to another model. Most of these metrics are built around the confusion matrix, which describes the performance of the model. For a two-class problem this matrix looks like this (image from Yann Dubois’ awesome Machine Learning glossary):

confusion-matrix


The confusion matrix holds the number of

  • True Positives: The cases in which the model predicted YES and the actual output was also YES,
  • True Negatives: The cases in which the model predicted NO and the actual output was NO,
  • False Positives: The cases in which the model predicted YES and the actual output was NO, and
  • False Negatives: The cases in which the model predicted NO and the actual output was YES.

Here are some common metrics for the evaluation of binary classifiers (illustrations by Chris Albon):

Accuracy

Accuracy is the ratio of the number of correct predictions to the total number of input samples.

Formula: (TP + TN) / (TP + FP + TN + FN)

Accuracy is the most common model quality metric, but it’s not always useful. For a highly unbalanced distribution and/or when the cost of making a mistake is high, accuracy is not the metric you are looking for.

Let’s say there is a one percent chance of finding rare elements for car batteries somewhere in the underground, or a one percent chance of discovering a rare disease in a patient, and you want to create a prediction model for this. A model that always returns false (“computer says no”) has an accuracy of 99% in that scenario, but it would still be worthless.

Accuracy

Recall

Recall (a.k.a. Sensitivity) is the fraction of positive observations that have been correctly predicted.

Formula: TP / (TP + FN)

If missing a positive example is important/expensive, you should focus on maximizing recall.

Recall

Specificity

Specificity is the recall for the negatives (the ability to find true negatives).

Formula: TN / (TN + FP)

Precision

Precision is the fraction of positive predictions that were actually positive.

Formula: TP / (TP + FP)

If false positives are expensive, maximize precision.

F1 Score

F1 score is the harmonic mean between precision and recall. If one of these two values decreases dramatically, the F1 score also does. F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0.

Formula: 2 * (Precision * Recall) / (Precision + Recall)

F1 score tells you how precise your classifier is (how many instances it classifies correctly) as well as how robust it is (it does not miss significant groups of instances).

F1Score
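
All of the metrics above derive from the four confusion matrix counts, so they fit in a few lines of code. Here’s an illustrative sketch (our own helper, not the ML.NET implementation):

public class ConfusionCounts
{
    public double TP, TN, FP, FN;

    public double Accuracy => (TP + TN) / (TP + FP + TN + FN);

    public double Recall => TP / (TP + FN); // a.k.a. Sensitivity

    public double Specificity => TN / (TN + FP); // recall for the negatives

    public double Precision => TP / (TP + FP);

    public double F1Score => 2 * (Precision * Recall) / (Precision + Recall);
}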

Area under the ROC curve

Area Under Curve (AUC) is one of the most widely used metrics for the evaluation of binary classifiers. AUC is the probability that the classifier will rank a randomly chosen positive example higher than a randomly chosen negative example. The referenced curve is called the Receiver Operating Characteristic (ROC) and plots the True Positive Rate (TP / (TP + FN)) against the False Positive Rate (FP / (FP + TN)) for all probability thresholds.

AreaUnderTheCurve

Plotting the curve is an expensive operation, but calculating the area under it is not. AUC measures the entire two-dimensional area underneath the ROC curve, from (0,0) to (1,1). The result is a value between 0.5 (which means that your model behaves like a random classifier) and 1 (too good to be true: you’re probably overfitting).
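
That probabilistic definition translates directly into code: compare the score of every positive example against every negative one. Here’s a naive O(n²) sketch that is fine for small test sets:

public static double AreaUnderRocCurve(IList<double> scores, IList<bool> labels)
{
    double pairs = 0, wins = 0;
    for (int p = 0; p < scores.Count; p++)
    {
        if (!labels[p]) continue; // outer loop: positives only
        for (int n = 0; n < scores.Count; n++)
        {
            if (labels[n]) continue; // inner loop: negatives only
            pairs++;
            if (scores[p] > scores[n]) wins += 1;         // positive ranked higher
            else if (scores[p] == scores[n]) wins += 0.5; // ties count half
        }
    }

    return wins / pairs; // 0.5 = random ranking, 1 = perfect ranking
}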

More Metrics

There are more metrics to evaluate and compare binary classifiers, especially for the classifiers that return a probability: Logarithmic Loss, Cross-Entropy, and more. Not all algorithms in our sample app fall into this category, so we decided to ignore these metrics for the moment.

Now is the time to dive into the code.

Battle of the Binary Stars

We’re looking at a traditional Machine Learning use case: reading and preparing the data, training and evaluating the model, and finally using the model for predictions:

MachineLearningModel

Getting the Raw Data

Here’s what the training and testing data sets look like. They list 11 physicochemical characteristics of Portuguese Vinho Verde white wine, together with a sensory quality score on a scale from 0 to 10:

WineQualityDataSet

Here’s the corresponding input structure. We decorated the fields with the LoadColumn attribute to facilitate reading the file with one of the built-in ML.NET text readers:

public class BinaryClassificationData
{
    [LoadColumn(0)]
    public float FixedAcidity;

    [LoadColumn(1)]
    public float VolatileAcidity;

    [LoadColumn(2)]
    public float CitricAcid;

    [LoadColumn(3)]
    public float ResidualSugar;

    [LoadColumn(4)]
    public float Chlorides;

    [LoadColumn(5)]
    public float FreeSulfurDioxide;

    [LoadColumn(6)]
    public float TotalSulfurDioxide;

    [LoadColumn(7)]
    public float Density;

    [LoadColumn(8)]
    public float Ph;

    [LoadColumn(9)]
    public float Sulphates;

    [LoadColumn(10)]
    public float Alcohol;

    [LoadColumn(11), ColumnName("Label")]
    public float Label;
}

The models will predict the quality of the wine as a Boolean – good or bad. Here’s what the output structure looks like:

public class BinaryClassificationPrediction
{
    [ColumnName("PredictedLabel")]
    public bool PredictedLabel;

    public int LabelAsNumber => PredictedLabel ? 1 : 0;
}

The file is read with LoadFromTextFile() and stored in memory with Cache(). The latter improves the training speed for the online learners – the ones that iteratively read samples – such as SDCA:

var trainData = MLContext.Data.LoadFromTextFile<BinaryClassificationData>(
        path: trainingDataPath,
        separatorChar: ';',
        hasHeader: true);

trainData = MLContext.Data.Cache(trainData);

When this code runs, the training data becomes available as an IDataView.

Preparing the data

In ML.NET, the model is defined as a pipeline of components that each apply a transformation to the data. Some transformations make the data more representative and compatible with the classifier, others skip training samples with missing or out-of-range values (outliers).

Here are the pipeline steps for our sample:

  • fill in missing values for Fixed Acidity,
  • translate the numeric quality score to a Boolean,
  • create a vector with all feature values, and
  • add one of the binary classifiers.

Here’s what that pipeline looks like in C#:

public PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> BuildAndTrain(
    string trainingDataPath,
    IEstimator<ITransformer> algorithm)
{
    IEstimator<ITransformer> pipeline =
        MLContext.Transforms.ReplaceMissingValues(
            outputColumnName: "FixedAcidity",
            replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean)
        .Append(MLContext.FloatToBoolLabelNormalizer())
        .Append(MLContext.Transforms.Concatenate("Features",
            new[]
            {
                "FixedAcidity",
                "VolatileAcidity",
                "CitricAcid",
                "ResidualSugar",
                "Chlorides",
                "FreeSulfurDioxide",
                "TotalSulfurDioxide",
                "Density",
                "Ph",
                "Sulphates",
                "Alcohol"}))
        .Append(algorithm);

    // ...

}

Let’s get into the details of some of the preparation steps.

Dealing with missing values

Input data that has missing values for one or more features can ruin the training of your prediction model. ML.NET comes with some useful transforms to deal with this, such as IndicateMissingValues() and ReplaceMissingValues(). Here’s a sketch of both, using the FixedAcidity column from our wine dataset (the indicator column name is our own invention):
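
// Flag the rows where FixedAcidity is missing (adds a Boolean indicator column) ...
var flagMissing = MLContext.Transforms.IndicateMissingValues(
    outputColumnName: "FixedAcidityIsMissing",
    inputColumnName: "FixedAcidity");

// ... and/or replace the missing values, here with the mean of the column.
var replaceMissing = MLContext.Transforms.ReplaceMissingValues(
    outputColumnName: "FixedAcidity",
    replacementMode: MissingValueReplacingEstimator.ReplacementMode.Mean);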

Building a Custom Data Transformation Component

The FloatToBoolLabelNormalizer in the previous code snippet is a transformation component that we wrote ourselves. It transforms the numeric quality column from the dataset into a Boolean, without relying on hard-coded thresholds. Boolean is the required data type for the binary classifier’s Label column.

The custom component is a mini pipeline on its own: an IEstimator&lt;ITransformer&gt;. We start by dividing all the values into two bins or buckets (feel free to pronounce it as “bouquets”) of the same size, using NormalizeBinning() from the NormalizationCatalog. After this first transformation the Label column holds 0 or 1, while the algorithm expects a Boolean.

To transform the 0-or-1 into a Boolean, we created a CustomMappingFactory with some helper input and output classes, and the appropriate CustomMappingFactory attribute. It is appended to the mini pipeline with CustomMapping().

The whole component is made reusable in multiple ML.NET pipelines by exposing it as an extension method to MLContext:

public static class MLContextExtensions
{
    /// <summary>
    /// Divides the numeric Label in two 'buckets' and transforms it to a Boolean.
    /// </summary>
    public static IEstimator<ITransformer> FloatToBoolLabelNormalizer(this MLContext mLContext)
    {
        var normalizer = mLContext.Transforms.NormalizeBinning(
            outputColumnName: "Label", maximumBinCount: 2);

        return normalizer.Append(mLContext.Transforms.CustomMapping(new MapFloatToBool().GetMapping(), "MapFloatToBool"));
    }

    private class LabelInput
    {
        public float Label { get; set; }
    }

    private class LabelOutput
    {
        public bool Label { get; set; }

        public static LabelOutput True = new LabelOutput() { Label = true };
        public static LabelOutput False = new LabelOutput() { Label = false };
    }

    [CustomMappingFactoryAttribute("MapFloatToBool")]
    private class MapFloatToBool : CustomMappingFactory<LabelInput, LabelOutput>
    {
        public override Action<LabelInput, LabelOutput> GetMapping()
        {
            return (input, output) =>
            {
                if (input.Label > 0)
                    output.Label = true;
                else
                    output.Label = false;
            };
        }
    }
}

For another example and more details, check the excellent ML.NET Cook Book.

Training the Models

We now have a training data set in memory, and a pipeline with all necessary transformation components. All we need to do to create the model is call Fit() and turn it into a PredictionEngine:

ITransformer model = pipeline.Fit(trainData);
return new PredictionModel<BinaryClassificationData, BinaryClassificationPrediction>(
    MLContext,
    model);

The PredictionModel is a reconstruction of a class that disappeared from the API while we were building the sample app. We kept it in the code as a little helper that keeps the algorithm and its prediction engine together:

public PredictionModel(MLContext mlContext, ITransformer transformer)
{
    Transformer = transformer;
    Engine = mlContext.Model.CreatePredictionEngine<TSrc, TDst>(Transformer);
}

All that’s left to build the models is to instantiate each of the four algorithms that we’re going to compare.

Here are the private fields to hold all of these:

private PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> _perceptronBinaryModel;
private PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> _linearSvmModel;
private PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> _logisticRegressionModel;
private PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> _sdcabModel;

And these are the calls to build and train all four models:

_perceptronBinaryModel = await ViewModel.BuildAndTrain(trainingDataLocation, ViewModel.MLContext.BinaryClassification.Trainers.AveragedPerceptron());
_linearSvmModel = await ViewModel.BuildAndTrain(trainingDataLocation, ViewModel.MLContext.BinaryClassification.Trainers.LinearSvm());
_logisticRegressionModel = await ViewModel.BuildAndTrain(trainingDataLocation, ViewModel.MLContext.BinaryClassification.Trainers.LbfgsLogisticRegression());
_sdcabModel = await ViewModel.BuildAndTrain(trainingDataLocation, ViewModel.MLContext.BinaryClassification.Trainers.SdcaLogisticRegression());

We confess that this code is implemented in the MVVM View, and we probably deserve a Walk of Atonement for this (“shame … shame … shame”). We just did not know upfront which and how many of the algorithms would work in a UWP context; hence the exploratory code patterns.

Testing the Models

Binary classifiers fall into two categories (pun intended): the ones that return a category, and the ones that return a probability. In ML.NET each type has its own metrics class: BinaryClassificationMetrics versus CalibratedBinaryClassificationMetrics. Fortunately there is an inheritance relationship between the two: the calibrated class extends the list of standard binary metrics (such as Accuracy, F1Score, the whole ConfusionMatrix, and the AreaUnderRocCurve) with probability-related ones (such as LogLoss and Entropy).

The ML.NET API somewhat confusingly requires two different calls to trigger the evaluation: Evaluate() versus EvaluateNonCalibrated(). Here are the wrapper methods in the sample app that load and prepare the test data set, generate predictions with Transform(), compare the predicted results with the actual ones, and finally return the metrics:

public CalibratedBinaryClassificationMetrics Evaluate(
    PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> model,
    string testDataLocation)
{
    var testData = MLContext.Data.LoadFromTextFile<BinaryClassificationData>(
        path: testDataLocation,
        separatorChar: ';',
        hasHeader: true);

    var scoredData = model.Transformer.Transform(testData);
    return MLContext.BinaryClassification.Evaluate(scoredData);
}

public BinaryClassificationMetrics EvaluateNonCalibrated(
    PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> model,
    string testDataLocation)
{
    var testData = MLContext.Data.LoadFromTextFile<BinaryClassificationData>(
        path: testDataLocation,
        separatorChar: ';',
        hasHeader: true);

    var scoredData = model.Transformer.Transform(testData);
    return MLContext.BinaryClassification.EvaluateNonCalibrated(scoredData);
}

Here’s the call in the View that fetches these metrics:

BinaryClassificationMetrics metrics = await ViewModel.EvaluateNonCalibrated(_perceptronBinaryModel, _testDataPath);

Again, there are some violations of the MVVM pattern here, but we did not know which and how many of these metrics would end up in the diagram.

Here’s how the metrics are visualized with a bar chart in the sample app:

SamplePage

The lack of quality in the two models on the left is clearly noticeable: the good wines do not seem to be linearly separable from the bad ones. Logistic regression looks like the way to go in this particular scenario. The dataset is also not big enough to justify SDCA: the last algorithm is not better than the third, but it is clearly a lot slower.

Saving the Models

The ML.NET API could not be more straightforward: use Model.Save() to … save the model:

public void Save(
    PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> model,
    string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    string modelPath = Path.Combine(storageFolder.Path, modelName);

    MLContext.Model.Save(
        model: model.Transformer,
        inputSchema: null,
        filePath: modelPath);
}
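
Loading a persisted model back is the mirror operation. The sample app doesn’t need it on this page, but a matching Load() helper could look like this sketch:

public ITransformer Load(string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    string modelPath = Path.Combine(storageFolder.Path, modelName);

    // Deserializes the pipeline; the input schema comes out as a second result.
    return MLContext.Model.Load(modelPath, out DataViewSchema inputSchema);
}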

Consuming the Models

We assume that in most cases you’ll only be interested in assessing the quality of a limited number of wines at runtime, so we built the prediction method around PredictionEngine.Predict() – the call that generates a single prediction:

public IEnumerable<BinaryClassificationPrediction> Predict(
    PredictionModel<BinaryClassificationData, BinaryClassificationPrediction> model,
    IEnumerable<BinaryClassificationData> data)
{
    foreach (BinaryClassificationData datum in data)
        yield return model.Engine.Predict(datum);
}

The sample app uses this method to get predictions from each of the four models for the first 50 wines in the test data set:

var testDataLocation = await MlDotNet.FilePath(@"ms-appx:///Data/winequality_white_test.csv");
var tests = await ViewModel.GetSample(testDataLocation);
var size = 50;
var data = tests.ToList().Take(size);

var perceptronPrediction = (await ViewModel.Predict(_perceptronBinaryModel, data)).ToList();
var linearSvmPrediction = (await ViewModel.Predict(_linearSvmModel, data)).ToList();
var logisticRegressionPrediction = (await ViewModel.Predict(_logisticRegressionModel, data)).ToList();
var sdcabPrediction = (await ViewModel.Predict(_sdcabModel, data)).ToList();

This code gets called when you hit the ‘View Disagreements’ button. Observe that the models – even those from the same family (linear or logistic) – disagree on quite a few samples.

The table in the center of the page lists the differences:

SamplePageWithDisagreements

It is clear that the linear models are the drama queens in this setup: they’re too easily overoptimistic or overpessimistic about the quality of the wines.

All the code and more

The UWP sample app lives here on GitHub. It hosts a lot more ML.NET scenarios than what we covered in this article. Look at the code, play with the code. If it sparks joy, please return the favor and give the repo a star!

Enjoy!

Machine Learning with ML.NET in UWP: Field-Aware Factorization Machine

In this article we demonstrate how to use a Field-Aware Factorization Machine to recommend hotels on the Las Vegas Strip, based on the traveler type (solo, family, business, …) and the season. We’ll use ML.NET for the Machine Learning part, OxyPlot for visualization, and a UWP app to host it all. This is what the sample page looks like:

FfmRecommendation

This page looks pretty much like the one in our previous article on recommendation in Machine Learning. That article also recommended hotels – but since we were using Matrix Factorization, we could only base the recommendation on one feature (in this case, the traveler type). In the previous article, the prediction model was defined by preparing the two feature columns (TravelerType and Hotel) and passing these to a recommendation algorithm, like this:

var pipeline = _mlContext.Transforms.Conversion.MapValueToKey("Hotel")
                .Append(_mlContext.Transforms.Conversion.MapValueToKey("TravelerType"))
                .Append(_mlContext.Recommendation().Trainers.MatrixFactorization(
                                    labelColumnName: "Label",
                                    matrixColumnIndexColumnName: "Hotel",
                                    matrixRowIndexColumnName: "TravelerType"))
                .Append(_mlContext.Transforms.Conversion.MapKeyToValue("Hotel"))
                .Append(_mlContext.Transforms.Conversion.MapKeyToValue("TravelerType"));

The corresponding pipeline in this article would look very similar:

var pipeline = _mlContext.Transforms.Categorical.OneHotEncoding("TravelerTypeOneHot", "TravelerType")
                .Append(_mlContext.Transforms.Categorical.OneHotEncoding("HotelOneHot", "Hotel"))
                .Append(_mlContext.Transforms.Concatenate("Features", "TravelerTypeOneHot", "HotelOneHot"))
                .Append(_mlContext.BinaryClassification.Trainers.FieldAwareFactorizationMachine(new string[] { "Features" }));

The advantage of a Field-Aware Factorization Machine over Matrix Factorization is that you’re not limited to two features, which allows you to provide much better recommendations. Hotels can be recommended on a combination of traveler type, season, country of origin, and so on.

Field-Aware Factorization in Machine Learning

A Field-Aware Factorization Machine (FFM) is a recommendation algorithm that specializes in deriving knowledge from large and sparse datasets. It recognizes feature conjunctions in feature vectors, which is particularly useful in Click-Through Rate (CTR) prediction. Check this article for an introduction and a comparison to other algorithms.

The FFM algorithm takes a vector of numerical features as input. When we’re dealing with categorical data (countries, months, traveler types, …) we need to transform it into numbers. In the context of FFM the right approach is One-Hot Encoding, which boils down to pivoting feature values into separate columns. There’s an excellent beginner’s guide to one-hot encoding right here, but this illustration from Chris Albon’s Machine Learning Flashcards says it all:

One-Hot_Encoding

The one-hot encoding transformation extends the schema with a lot of new, sparsely filled features. That’s exactly how FFM likes it…
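
As a minimal illustration of the principle – not of ML.NET’s actual output format – here’s one-hot encoding by hand (hypothetical category values, uses System.Linq):

// Each category value becomes its own 0/1 column:
// Season = "Summer" with categories { Winter, Spring, Summer, Fall } -> [ 0, 0, 1, 0 ]
string[] categories = { "Winter", "Spring", "Summer", "Fall" };

float[] OneHot(string value) =>
    categories.Select(c => c == value ? 1f : 0f).ToArray();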

Field-Aware Factorization in ML.NET

With the FieldAwareFactorizationMachineTrainer and the OneHotEncodingTransformer, ML.NET has all the ingredients to implement a Field-Aware Factorization Machine scenario.

Let’s write some code

Here’s what a typical Machine Learning use case looks like:

MachineLearningModel

Raw data is collected, then cleaned up, completed, and transformed to fit the algorithm. Part of the data is used to train the model, another part is used to evaluate it. Once the model passes all tests, it is persisted for consumption.

The sample app uses a lightweight MVVM architecture. The beef of the ML.NET code is in this Model class.

Getting the Raw Data

The raw data in the sample app comes from a comma-separated flat file with Trip Advisor reviews from 2015 for hotels on the Las Vegas Strip:

RecommendationDataSet

Preparing the Data

While we read the data, we already start the preparation: FFM is a binary classification algorithm, so we need to transform the rating (0–5) into a Boolean value (“recommended or not”). Here’s the input structure:

public class FfmRecommendationData
{
    public bool Label;

    public string TravelerType;

    public string Season;

    public string Hotel;
}

The predicted label will also be a Boolean, and the algorithm provides the corresponding probability. Here’s the model’s output structure:

public class FfmRecommendationPrediction
{
    public bool PredictedLabel;

    public float Probability;

    public string TravelerType;

    public string Season;

    public string Hotel;
}

Before we read all the data into an IDataView, we need to create an MLContext instance:

private MLContext _mlContext = new MLContext(seed: null);

While we read the data, we transform the score (0 to 5) into a Boolean by comparing it to a threshold value. We used 3 as the threshold, since the ratings in the dataset are generally pretty high – which is probably why they were allowed to be made public.

With the LoadFromEnumerable() method we then transform the result into an IDataView:

private IDataView _allData;
private ITransformer _model;

public IEnumerable<FfmRecommendationData> Load(string trainingDataPath)
{
    // Populating an IDataView from an IEnumerable.
    var data = File.ReadAllLines(trainingDataPath)
        .Skip(1)
        .Select(x => x.Split(';'))
        .Select(x => new FfmRecommendationData
        {
            Label = double.Parse(x[4]) > _ratingThreshold,
            Season = x[5],
            TravelerType = x[6],
            Hotel = x[13]
        });

    _allData = _mlContext.Data.LoadFromEnumerable(data);

    // Just 'return data;' would also do the trick...
    return _mlContext.Data.CreateEnumerable<FfmRecommendationData>(_allData, reuseRowObject: false);
}

For the next step in preparing the data we rely on some ML.NET transformations. First we apply a OneHotEncoding transformation to each of the features, to transform the categorical data into numbers. Then all features are combined into a vector with a call to Concatenate(). The model building pipeline is completed with a FieldAwareFactorizationMachine:

var pipeline = _mlContext.Transforms.Categorical.OneHotEncoding("TravelerTypeOneHot", "TravelerType")
    .Append(_mlContext.Transforms.Categorical.OneHotEncoding("SeasonOneHot", "Season"))
    .Append(_mlContext.Transforms.Categorical.OneHotEncoding("HotelOneHot", "Hotel"))
    .Append(_mlContext.Transforms.Concatenate("Features", "TravelerTypeOneHot", "SeasonOneHot", "HotelOneHot"))
    .Append(_mlContext.BinaryClassification.Trainers.FieldAwareFactorizationMachine(new string[] { "Features" }));

Training the Model

To train the model, we send it 450 randomly chosen rows from the dataset. The rows are selected with some methods from the DataOperationsCatalog: first we rearrange the dataset with ShuffleRows(), then we pick some rows with TakeRows():

var trainingData = _mlContext.Data.ShuffleRows(_allData);
trainingData = _mlContext.Data.TakeRows(trainingData, 450);

After training the model, we’ll create a strongly typed PredictionEngine from it for individual recommendations, so we declared a field to host this:

private PredictionEngine<FfmRecommendationData, FfmRecommendationPrediction> _predictionEngine;

The model is created and trained with a call to Fit(), and then we create the prediction engine from it with CreatePredictionEngine():

_model = pipeline.Fit(trainingData);
_predictionEngine = _mlContext.Model.CreatePredictionEngine<FfmRecommendationData, FfmRecommendationPrediction>(_model);

Testing the Model

To test the model we send it another 100 random rows from the dataset and call Transform() to generate the predictions. A call to the Evaluate() method in the BinaryClassificationCatalog compares the predicted labels to the original ones:

public CalibratedBinaryClassificationMetrics Evaluate(string testDataPath)
{
    var testData = _mlContext.Data.ShuffleRows(_allData);
    testData = _mlContext.Data.TakeRows(testData, 100);

    var scoredData = _model.Transform(testData);
    var metrics = _mlContext.BinaryClassification.Evaluate(
        data: scoredData, 
        labelColumnName: "Label", 
        scoreColumnName: "Probability", 
        predictedLabelColumnName: "PredictedLabel");

    // Place a breakpoint here to inspect the quality metrics.
    return metrics;
}

The result of the evaluation is an instance of CalibratedBinaryClassificationMetrics, with useful statistics such as accuracy, entropy, recall, and F1 score:

CalibratedMetrics

Persisting the Model

When you’re happy with the model’s quality, you can serialize and persist it for later use with a call to the Save() method from the ModelOperationsCatalog:

public void Save(string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    string modelPath = Path.Combine(storageFolder.Path, modelName);

    _mlContext.Model.Save(_model, inputSchema: null, filePath: modelPath);
}

Consuming the Model

There are two ways to consume the model. With a call to Transform() you can generate predictions or recommendations for a group of input records. The resulting IDataView can be transformed to a list of prediction records with CreateEnumerable():

public IEnumerable<FfmRecommendationPrediction> Predict(IEnumerable<FfmRecommendationData> recommendationData)
{
    // Group prediction
    var data = _mlContext.Data.LoadFromEnumerable(recommendationData);
    var predictions = _model.Transform(data);
    return _mlContext.Data.CreateEnumerable<FfmRecommendationPrediction>(predictions, reuseRowObject: false);
}

The strongly typed PredictionEngine that we created after training the model can be used for single recommendations. Its Predict() method runs the prediction pipeline for a single input record:

public FfmRecommendationPrediction Predict(FfmRecommendationData recommendationData)
{
    // Single prediction
    return _predictionEngine.Predict(recommendationData);
}

When the predicted label is false, the model does not recommend the hotel/traveler type/season combination. In that case we negate the displayed probability (the score), so that the values form a range from –1 (strongly discouraged) to +1 (strongly recommended). This is done in the MVVM View (the code-behind of the page):

var result = await ViewModel.Predict(recommendationData);
if (!result.PredictedLabel)
{
    // Bring to a range from -1 (highly discouraged) to +1 (highly recommended).
    result.Probability = -result.Probability;
}


The Model in Action

Here’s the FFM Recommendation page from the sample app again:

FfmRecommendation

When the model is ready for operation, the combo boxes are populated and unlocked. When you change the traveler type or the season, a group prediction is made for all the hotels in the data set. The result is displayed on the left in a horizontal bar chart. We won’t dive into its details, since it’s basically the same diagram as the one in the previous article (we’re just displaying the probability (-1 to +1) instead of the predicted rating (0 to 5)).

When you select a hotel in the combo box in the bottom left corner, a single prediction is made and the result is displayed next to it. The diagram for the group predictions only displays recommended hotels, but for the single prediction you can pick any hotel. That’s why we decided to display a negative probability for negative advice.

If you want to run this scenario yourself, feel free to download the sample app. Its source lives here on GitHub.

Enjoy!

Machine Learning with ML.NET in UWP: Recommendation

In this article we describe how to define, train, evaluate, persist, and use an ML.NET Recommendation model in a UWP app. The blog post is part of a series on implementing different Machine Learning scenarios with .NET Open Source frameworks and components such as ML.NET, Math.NET, and OxyPlot.

All articles in the series are supported by the same UWP sample app that lives here on GitHub. Since the previous article was published, this sample app was upgraded to the latest prereleases of ML.NET, thanks to Pull Requests from the Microsoft ML.NET Team itself (thanks Eric!). This means that the syntax in the code snippets differs quite a bit from the previous articles, but is much closer to the imminent official release.

Here’s what the Recommendation page in the sample app looks like:

Recommendation

It builds a model to generate recommendations for hotels on the Las Vegas Strip for a selected traveler type (single, family, business, …):

  • when you select a traveler type in the combo box, the top 10 recommended hotels appear in the diagram, and
  • when you select a hotel in the second combo box, a predicted rating will appear next to it.

Recommendation in Machine Learning

Machine Learning recommender systems are highly popular in e-commerce and social networks. They’re used for recommending books, TV series, music, events, products, friends, dating profiles, and a lot more.

There are two approaches for generating recommendations:

  • Content-Based Filtering recommends items to a user that are similar to items the same user previously rated highly. The advantage of this is transparency (the model can explain why it recommends the item). Unfortunately this approach does not scale well with large data.
  • Collaborative Filtering recommends items to a user that were highly rated by other – but similar – users. In most real-world scenarios not every user has rated every item, so the base data can be very sparse. This makes the approach unsuitable in some scenarios.

Matrix Factorization in Machine Learning

Matrix Factorization is a common technique to solve the sparsity problem of Collaborative Filtering that we just mentioned. In a nutshell, its goal is to mass-predict the missing ratings. Matrix Factorization is entirely based on linear algebra, which is something that your CPUs, GPUs, and/or AI Accelerators are pretty good at. If you want to dive into the mathematical details, allow me to recommend (pun intended) the article with the very appropriate name A Gentle Introduction to Matrix Factorization for Machine Learning.

Major advantages of this algorithm are that it scales very well with large data and that it is very fast. You don’t have to take my word for it, but there must be a reason why Amazon and Netflix rely on it. The algorithm has the disadvantage that it cannot always easily explain why it recommends an item. You must have stumbled upon recommendations like this before:

netflix

Before we dive into the code, allow us to clarify something: Matrix Factorization does NOT answer the question “What items would you recommend for this user?”. Instead it solves the “here’s a list of products and a (list of) user(s), please predict their ratings” problem. So when you use it in your apps, there is some preprocessing (selecting the products to evaluate) and some postprocessing (filtering the relevant recommendations) to do. Basically, the algorithm always has to deal with too much data. Don’t worry about that: Matrix Factorization is a real Weapon of Mass Prediction.

Matrix Factorization in ML.NET

For Matrix Factorization in ML.NET you’ll need the MatrixFactorizationTrainer class. It comes in a separate NuGet package (Microsoft.ML.Recommender):

RecommenderNuGet

Model Input and Output

For training and testing the model, we’ll use a 2015 dataset with 510 Las Vegas hotel ratings from TripAdvisor. Here’s what it looks like:

RecommendationDataSet

Matrix Factorization predicts the rating (“Label”) between exactly two fields (“Features”). If you need to deal with more fields, you’ll need the FieldAwareFactorizationMachine instead.

In the sample app we chose TravelerType and Hotel as features, to respectively play the roles of ‘similar user’ and ‘recommended item’. The Score column contains the rating and plays the role of ‘label’ (the thing to predict). Since the prediction engine’s output column is also called Score, we renamed the input column to Label.

Here’s the structure of input samples that we will feed the model with:

public class RecommendationData
{
    public float Label;

    public string TravelerType;

    public string Hotel;
}

The prediction looks like this:

public class RecommendationPrediction
{
    public float Score;

    public string TravelerType;

    public string Hotel;
}

Observe the lack of LoadColumn and ColumnName attributes on the fields – we had these in all the previous posts in this series. We don’t need the attributes here because we’re not using a TextLoader to read the training and testing data sets. Instead we create our IDataView with a call to the LoadFromEnumerable() method. This same method allows you to populate the model with records from a database:

private IDataView trainingData;

public IEnumerable<RecommendationData> Load(string trainingDataPath)
{
    var data = File.ReadAllLines(trainingDataPath)
        .Skip(1)
        .Select(x => x.Split(';'))
        .Select(x => new RecommendationData
        {
            Label = uint.Parse(x[4]),
            TravelerType = x[6],
            Hotel = x[13]
        })
        .OrderBy(x => (x.GetHashCode())) // Cheap Randomization.
        .Take(400);

    // Populating an IDataView from an IEnumerable.
    trainingData = _mlContext.Data.LoadFromEnumerable(data);

    // Keep DataView in memory.
    trainingData = _mlContext.Data.Cache(trainingData);

    // Populating an IEnumerable from an IDataView.
    return _mlContext.Data.CreateEnumerable<RecommendationData>(trainingData, reuseRowObject: false);
}

Part of the data set is used for training, and another part for evaluating the model. Since the original data set is sorted on hotel name, we applied some cheap randomization logic by sorting the rows on their GetHashCode() value.

The Cache() method keeps the selected columns (in our case: all columns) in memory after they’re accessed for the first time. For iterative algorithms this really is a time saver – at least if the data fits into memory.

Defining and Building the Model

The recommendation model is an ITransformer that is created from an EstimatorChain with a MatrixFactorization at its heart. You have to specify the label (labelColumn), the two features (matrixRowIndexColumnName and matrixColumnIndexColumnName), and optionally some options to fine-tune the algorithm. Before the feature values are sent to the trainer, they’re added to a dictionary with MapValueToKey(). The reverse function, MapKeyToValue(), ensures that the original values are returned together with the predicted score.

Here’s the whole pipeline:

private ITransformer _model;

public void Build()
{
    var pipeline = _mlContext.Transforms.Conversion.MapValueToKey("Hotel")
                    .Append(_mlContext.Transforms.Conversion.MapValueToKey("TravelerType"))
                    .Append(_mlContext.Recommendation().Trainers.MatrixFactorization(
                                        labelColumn: DefaultColumnNames.Label,
                                        matrixColumnIndexColumnName: "Hotel",
                                        matrixRowIndexColumnName: "TravelerType",
                                        // Optional fine tuning:
                                        numberOfIterations: 20,
                                        approximationRank: 8,
                                        learningRate: 0.4))
                    .Append(_mlContext.Transforms.Conversion.MapKeyToValue("Hotel"))
                    .Append(_mlContext.Transforms.Conversion.MapKeyToValue("TravelerType"));

    // Place a breakpoint here to peek the training data.
    var preview = pipeline.Preview(trainingData, maxRows: 10);

    _model = pipeline.Fit(trainingData);
}

The extremely useful Preview() method was recently added to the API. It allows you to inspect the content and schema of the pipeline while debugging – it feels a bit like the old SSIS Data Viewer:

PreviewSchema

PreviewRowContent

The prediction model is trained with a Fit() call.

Evaluating the Model

It’s always a good idea to evaluate your freshly trained model. Typically this is done by sending it a set of previously unseen – but labeled – data rows. The Transform() call generates the predictions, while Evaluate() compares these with the original labels:

public RegressionMetrics Evaluate(string testDataPath)
{
    //var testData = _mlContext.Data.LoadFromTextFile<RecommendationData>(testDataPath);
    var data = File.ReadAllLines(testDataPath)
        .Skip(1)
        .Select(x => x.Split(';'))
        .Select(x => new RecommendationData
        {
            Label = uint.Parse(x[4]),
            TravelerType = x[6],
            Hotel = x[13]
        })
        .OrderBy(x => (x.GetHashCode())) // Cheap Randomization.
        .TakeLast(200);

    var testData = _mlContext.Data.LoadFromEnumerable(data);
    var scoredData = _model.Transform(testData);
    var metrics = _mlContext.Recommendation().Evaluate(scoredData);

    // Place a breakpoint here to inspect the quality metrics.
    return metrics;
}

The evaluation returns a RegressionMetrics instance with useful information on the quality of the model – such as the coefficient of determination and the relative squared error:

RegressionMetrics

If you notice that your model lacks accuracy, you need to fine-tune its parameters, provide more representative training data, and/or select another algorithm.

Persisting the Model

The model can be serialized and persisted with a call to Save():

public void Save(string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    string modelPath = Path.Combine(storageFolder.Path, modelName);

    _mlContext.Model.Save(_model, inputSchema: null, filePath: modelPath);
}

Inferencing with the model

There are two ways to create recommendation scores. The first one generates a prediction for a single feature combination: a score for one specific traveler type/hotel combination. The API for this scenario could not be more straightforward: you create a prediction engine with CreatePredictionEngine() and then you call Predict() to … predict:

public RecommendationPrediction Predict(RecommendationData recommendationData)
{
    // Single prediction
    var predictionEngine = _model.CreatePredictionEngine<RecommendationData, RecommendationPrediction>(_mlContext);
    return predictionEngine.Predict(recommendationData);
}

This code is triggered when you select a hotel from the lower left combo box on the page:

Recommendation

The second way to generate recommendations takes a list of feature pairs instead of a single one. When you select an entry in the traveler type combo box, we first create a list of RecommendationData records – one for each hotel in the original data set. Then we call the Predict() method in the ViewModel – the sample app uses a lightweight MVVM architecture:

// Group Prediction
var recommendations = new List<RecommendationData>();
foreach (var hotel in ViewModel.Hotels)
{
    recommendations.Add(new RecommendationData
    {
        Hotel = hotel,
        TravelerType = TravelerTypesCombo.SelectedValue.ToString()
    });
}
var predictions = await ViewModel.Predict(recommendations);

This list is changed into an IDataView with the same LoadFromEnumerable() call that we encountered when loading the training data. The recommendation model transforms it, through the Transform() method, into an IDataView that adheres to the output schema. Finally, the CreateEnumerable() method translates this structure into a list of prediction entities:

public IEnumerable<RecommendationPrediction> Predict(IEnumerable<RecommendationData> recommendationData)
{
    // Group prediction
    var data = _mlContext.Data.LoadFromEnumerable(recommendationData);
    var predictions = _model.Transform(data);
    return _mlContext.Data.CreateEnumerable<RecommendationPrediction>(predictions, reuseRowObject: false);
}

There are 21 hotels in the data set, so this method returns 21 ratings. The end user is of course not interested in all of these. With a little LINQ query you get the 10 most appropriate recommendations:

var recommendationsResult = predictions
        .OrderByDescending(p => p.Score)
        .Take(10)
        .Reverse()
        .ToList();

[Note: the Reverse() is only there because we build up the bar chart from bottom to top.]

A word of warning

The current NuGet package for Microsoft.ML carries the v1.0.0-preview tag, so we may be close to an official release. This is not the case for Microsoft.ML.Recommender: this one seems to need some extra stabilization sprints. In its current version, Matrix Factorization yields different types of exceptions when you’re running in x86 mode. With a little luck you only get weird results like these:

Recommendation_x86

Don’t worry, it’s a known issue, the team is working on it…

Let there be XAML

Let’s jump to the visualization of the predictions. For the horizontal bar chart on the sample page, we borrowed the diagram from the Multiclass Classification sample. XAML-wise we declared a PlotView with its PlotModel. The model has a CategoryAxis for the hotel names and a LinearAxis for the predicted score (0–5). The values are represented in a BarSeries:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0"
                Margin="0 0 40 60"
                Grid.Column="1">
    <oxy:PlotView.Model>
        <oxyplot:PlotModel Subtitle="Recommended Hotels"
                            PlotAreaBorderColor="{x:Bind OxyForeground}"
                            TextColor="{x:Bind OxyForeground}"
                            TitleColor="{x:Bind OxyForeground}"
                            SubtitleColor="{x:Bind OxyForeground}">
            <oxyplot:PlotModel.Axes>
                <axes:CategoryAxis Position="Left"
                                    TextColor="{x:Bind OxyForeground}"
                                    TicklineColor="{x:Bind OxyForeground}"
                                    TitleColor="{x:Bind OxyForeground}" />
                <axes:LinearAxis Position="Bottom"
                                    Title="Predicted Score (higher is better)"
                                    TextColor="{x:Bind OxyForeground}"
                                    TicklineColor="{x:Bind OxyForeground}"
                                    TitleColor="{x:Bind OxyForeground}" />
            </oxyplot:PlotModel.Axes>
            <oxyplot:PlotModel.Series>
                <series:BarSeries LabelPlacement="Inside"
                                    LabelFormatString="{}{0:0.00}"
                                    TextColor="{x:Bind OxyText}"
                                    FillColor="{x:Bind OxyFill}" />
            </oxyplot:PlotModel.Series>
        </oxyplot:PlotModel>
    </oxy:PlotView.Model>
</oxy:PlotView>

When the predictions come in, we add a category (with the hotel name) and a BarItem (with the score) for each of the – at most 10 – hotels. These are added to their respective series, and the plot is refreshed:

// Update diagram
var categories = new List<string>();
var bars = new List<BarItem>();
foreach (var prediction in recommendationsResult)
{
    categories.Add(prediction.Hotel);
    bars.Add(new BarItem { Value = prediction.Score });
}

var plotModel = Diagram.Model;

(plotModel.Axes[0] as CategoryAxis).ItemsSource = categories;
(plotModel.Series[0] as BarSeries).ItemsSource = bars;
plotModel.InvalidatePlot(true);

That’s it for today. The UWP sample app – which is featured on the ML.NET Community Samples page – lives here on GitHub.

Enjoy!

Machine Learning with ML.NET in UWP: Feature Correlation Analysis

In this article we show how to perform Feature Correlation Analysis and display the results in a heat map, in the context of Machine Learning in UWP. It’s the fourth in a series that started here, on implementing Machine Learning scenarios in UWP using Open Source frameworks and components such as ML.NET, Math.NET, and OxyPlot.

All articles in the series revolve around a single UWP sample app that lives here on GitHub. Here’s what the Feature Correlation Analysis page looks like:

HeatMap

It displays the correlation between different properties in the popular Titanic passengers dataset: age, fare, ticket class, whether the passenger was accompanied by siblings, spouses, parents, or children, and whether he or she survived the trip.

The darker red or blue squares on the heat map indicate that the corresponding properties on the X and Y axes have a higher correlation with each other. Higher correlation is a warning sign for a possible negative impact on the classification model if both features were added to the training data.

Feature Correlation Analysis in Machine Learning

The topic of this article is Feature Correlation Analysis. Just like in the previous article – Feature Distribution Analysis – we are in the ‘data preparation’ phase of a Machine Learning scenario. We’re not training or even defining models yet; we’re selecting the features to train them with. An ideal feature set contains features that are highly correlated with the classification (in ML.NET terminology: the Label column), yet uncorrelated with each other.

Identifying the right feature set highly impacts the quality and performance of the subsequent learning and generalization steps. Here are two important reasons not to keep adding feature columns to a training data set:

  • While the predictive power of a classifier first increases with the number of dimensions/features used, there comes a break point where it decreases again (the so-called Curse of Dimensionality). The training data set is always a finite set of samples with discrete values, while the prediction space may be infinite and continuous in all dimensions. So the more features a data set with a fixed number of samples has, the less representative it may become.
  • There’s also the cost incurred by adding features. Two features that are highly correlated with each other don’t add much value to a classifier, but they sure add cost at training, persisting, and/or inference time. Machine Learning activities can be pretty resource intensive on CPU, memory, and elapsed time, so it makes sense to limit the number of features.

The process of converting a set of observations of possibly correlated variables into a smaller set of values of linearly uncorrelated variables is called Principal Component Analysis. This technique was invented by Karl Pearson, the same person who defined its main instrument: the Pearson correlation coefficient – a measure of the linear correlation between two variables X and Y. Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. It has a value between +1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation.
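
Written as a formula, the sample estimate of the coefficient is:

r_{XY} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}

where \bar{x} and \bar{y} are the sample means. This is exactly what the copied code further down computes, in a single pass.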

During Principal Component Analysis a matrix is calculated with the correlation between each pair of features. Highly correlated feature pairs are then ‘sanitized’ by removing one or combining both (e.g. by creating a new feature by multiplying the variables’ values). At the same time you also calculate the correlation between each attribute and the output variable (the Label), to select only those attributes that have a moderate-to-high positive or negative correlation (close to -1 or 1) and drop those attributes with a low correlation (value close to zero).

Feature Correlation Analysis with ML.NET and Math.NET

Data Preparation is outside the core business of ML.NET itself, but for retrieving and manipulating the candidate training data we can count on one of its most important spin-off components: the DataView API.

DataViewCentric

We can fetch the samples, optionally filter them and fill in missing data, and then pivot the result into arrays of feature values (exactly what we did in the previous article). All we need is a TextLoader to create an IDataView from the data set; we then get the column names from its Schema, call GetColumn() to fetch each array, and ‘upgrade’ the data type to double:

var reader = new TextLoader(_mlContext,
                            new TextLoader.Arguments()
                            {
                                Separator = ",",
                                HasHeader = true,
                                Column = new[]
                                    {
                                    new TextLoader.Column("Survived", DataKind.R4, 1),
                                    new TextLoader.Column("PClass", DataKind.R4, 2),
                                    new TextLoader.Column("Age", DataKind.R4, 5),
                                    new TextLoader.Column("SibSp", DataKind.R4, 6),
                                    new TextLoader.Column("Parch", DataKind.R4, 7),
                                    new TextLoader.Column("Fare", DataKind.R4, 9)
                                    }
                            });
var dataView = reader.Read(src);
var result = new List<List<double>>();
for (int i = 0; i < dataView.Schema.ColumnCount; i++)
{
    var columnName = dataView.Schema.GetColumnName(i);
    result.Add(dataView.GetColumn<float>(_mlContext, columnName).Select(f => (double)f).ToList());
}

return result;

Now that we have all the feature values in arrays, it’s time to calculate the correlations. The MathNet.Numerics.Statistics.Correlation class from Math.NET hosts implementations for several Pearson, Spearman, and other correlation calculations.
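
If you do take a dependency on the MathNet.Numerics NuGet package, a single coefficient is a one-liner. Here’s a minimal sketch, assuming two columns were already extracted as IEnumerable<double> (the variable names are illustrative):

using MathNet.Numerics.Statistics;

// Pearson correlation between two feature columns, e.g. Age and Fare.
var coefficient = Correlation.Pearson(ageColumn, fareColumn);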

We decided to make a copy of the code that calculates the Pearson correlation between two IEnumerable<double> instances:

/// <summary>
/// Computes the Pearson Product-Moment Correlation coefficient.
/// </summary>
/// <param name="dataA">Sample data A.</param>
/// <param name="dataB">Sample data B.</param>
/// <returns>The Pearson product-moment correlation coefficient.</returns>
/// <remarks>Original Source: https://github.com/mathnet/mathnet-numerics/blob/master/src/Numerics/Statistics/Correlation.cs </remarks>
public static double Pearson(IEnumerable<double> dataA, IEnumerable<double> dataB)
{
    var n = 0;
    var r = 0.0;

    var meanA = 0d;
    var meanB = 0d;
    var varA = 0d;
    var varB = 0d;

    using (IEnumerator<double> ieA = dataA.GetEnumerator())
    using (IEnumerator<double> ieB = dataB.GetEnumerator())
    {
        while (ieA.MoveNext())
        {
            if (!ieB.MoveNext())
            {
                throw new ArgumentOutOfRangeException(nameof(dataB), "Array too short.");
            }

            var currentA = ieA.Current;
            var currentB = ieB.Current;

            var deltaA = currentA - meanA;
            var scaleDeltaA = deltaA / ++n;

            var deltaB = currentB - meanB;
            var scaleDeltaB = deltaB / n;

            meanA += scaleDeltaA;
            meanB += scaleDeltaB;

            varA += scaleDeltaA * deltaA * (n - 1);
            varB += scaleDeltaB * deltaB * (n - 1);
            r += (deltaA * deltaB * (n - 1)) / n;
        }

        if (ieB.MoveNext())
        {
            throw new ArgumentOutOfRangeException(nameof(dataA), "Array too short.");
        }
    }

    return r / Math.Sqrt(varA * varB);
}

For the sake of completeness: that same Math.NET class also hosts code to calculate the whole matrix. For the sample app this would require importing a lot more code (linear algebra classes such as Matrix) or adding the whole NuGet package to the project.
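
For reference, with the full package on board, the entire matrix would also be a one-liner. A sketch, assuming the MathNet.Numerics NuGet package is referenced and ‘columns’ is the List<List<double>> from above:

using System.Linq;
using MathNet.Numerics.Statistics;

// Full 6x6 Pearson correlation matrix in one call (returns a Matrix<double>).
var correlationMatrix = Correlation.PearsonMatrix(
    columns.Select(c => c.ToArray()).ToArray());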

Here’s how the sample app calculates the whole correlation matrix:

// Read data
var matrix = await ViewModel.LoadCorrelationData();

// Populate diagram
var data = new double[6, 6];
for (int x = 0; x < 6; ++x)
{
    for (int y = 0; y < 5 - x; ++y)
    {
        // The vertical axis lists the features in reverse order,
        // so cell (x, y) pairs feature x with feature 5 - y.
        var seriesA = matrix[x];
        var seriesB = matrix[5 - y];

        var value = Statistics.Pearson(seriesA, seriesB);

        // Correlation is commutative, so fill the mirrored cell too.
        data[x, y] = value;
        data[5 - y, 5 - x] = value;
    }

    // Each feature correlates perfectly with itself: fill the diagonal.
    data[x, 5 - x] = 1;
}

All we now need is a way to properly visualize it.

Correlation Heat Maps

A heat map is a representation of data in which the values are represented by colors. Heat maps are ideal for highlighting patterns and extreme values in rectangular data such as matrices.

Correlation Heat Maps in Machine Learning

In Machine Learning, heat maps are used to display correlations between feature values. The typical (“Pearson”) color scheme is a gradient that goes from

  • red for high positive correlation (value +1), over
  • white for no correlation (value 0), to
  • blue for high negative correlation (value –1).

Sometimes the values are normalized (brought to a range from 0 to 1) like in the next image. When there are a lot of features, the value labels are omitted in the diagram. Since correlation is commutative (the correlation between A and B is the same as the correlation between B and A) it suffices to only display half of the matrix, like this:

26821073-15081119795115542

This diagram also omitted the correlations on the diagonal: the red squares that indicate the full positive correlation between each feature and itself.

Here’s an example of a diagram (from here) showing fewer features. It’s common to display the whole matrix and the value labels:

heatmap_2.994292b9

The above diagram was created with Python (Pandas and Seaborn) and shows the correlation between all the numerical values in the already mentioned Titanic Passengers dataset.

Here’s the UWP sample app version of the very same diagram, calculated with ML.NET and Math.NET and visualized with OxyPlot:

HeatMapZoom

The small differences in correlations for the Age feature are caused by the sample app not compensating for missing values. We have the matrix, so let’s plot the diagram.

Correlation Heat Maps with OxyPlot

To draw an OxyPlot diagram, you start with placing a PlotView element in your XAML:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0" />

Then you can declaratively or programmatically decorate it with a PlotModel and different Axis instances. A correlation heat map uses a CategoryAxis in both dimensions:

plotModel.Axes.Add(new CategoryAxis
{
    Position = AxisPosition.Bottom,
    Key = "HorizontalAxis",
    ItemsSource = new[]
    {
        "Survived",
        "Class",
        "Age",
        "Sib / Sp",
        "Par / Chi",
        "Fare"
    },
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
});

plotModel.Axes.Add(new CategoryAxis
{
    Position = AxisPosition.Left,
    Key = "VerticalAxis",
    ItemsSource = new[]
    {
        "Fare",
        "Parents / Children",
        "Siblings / Spouses",
        "Age",
        "Class",
        "Survived"
    },
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
});

The legend on top of the diagram is an extra LinearColorAxis in the appropriate OxyPalette:

plotModel.Axes.Add(new LinearColorAxis
{
    // Pearson color scheme from blue over white to red.
    Palette = OxyPalettes.BlueWhiteRed31,
    Position = AxisPosition.Top,
    Minimum = -1,
    Maximum = 1,
    TicklineColor = OxyColors.Transparent
});

If you’re not entirely satisfied with the color scheme, feel free to create your own custom OxyPalette instance: it’s just a 3-color gradient.
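
Here’s a minimal sketch of such a custom palette, built with OxyPalette.Interpolate (the number of steps and the colors are up to you):

// A custom Pearson-style palette: blue over white to red, in 64 steps.
var palette = OxyPalette.Interpolate(64, OxyColors.Blue, OxyColors.White, OxyColors.Red);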

The matrix itself is a HeatMapSeries with 6 values in each dimension, rendered as rectangles:

var heatMapSeries = new HeatMapSeries
{
    X0 = 0,
    X1 = 5,
    Y0 = 0,
    Y1 = 5,
    XAxisKey = "HorizontalAxis",
    YAxisKey = "VerticalAxis",
    RenderMethod = HeatMapRenderMethod.Rectangles,
    LabelFontSize = 0.12,
    LabelFormatString = ".00"
};

plotModel.Series.Add(heatMapSeries);

Diagram.Model = plotModel;

To display the label in the square you need to set the LabelFormatString. This will only be applied if you also set a value to LabelFontSize.

OxyPlot does not support the triangular version of the heat map. Missing values always get the default value and you can’t make them transparent:

HalfMap

To populate the diagram, assign the Data to the series, and refresh the plot:

(plotModel.Series[0] as HeatMapSeries).Data = data;

// Update diagram
Diagram.InvalidatePlot();

Again, here’s the resulting heat map in the sample app:

HeatMap

Interpretation

The dark blue square on the diagram reveals a relatively high negative correlation between the Passenger Class and Ticket Fare. This means that the value for the one can easily be derived from the other – think “first class tickets cost more than second class tickets”. Adding both as a feature would not add more value than adding only one of them.

Data scientists would probably extract a new feature from these two (something like “Luxury”) or would break up “Passenger Class” and “Ticket Fare” in more basic components, like locations on the ship that passengers had access to. Anyway, the heat map clearly highlights the feature combinations that need further analysis.

Source

In this article we used components from ML.NET, Math.NET and OxyPlot to calculate and visualize the correlation heat map on candidate training data for a classification model.

The UWP sample app hosts more Machine Learning scenarios. It lives here on GitHub.

Enjoy!

Machine Learning with ML.NET in UWP: Feature Distribution Analysis

Welcome to the third article in this series on implementing Machine Learning scenarios with Open Source technologies in UWP apps. So far we have been using ML.NET for modeling and OxyPlot for data visualization, and we will continue to do so. For this article we added Math.NET to our shopping basket, to calculate some statistics that are important when analyzing the input data. We will analyze the distribution of values for columns in the candidate model training data to detect whether these columns would be useful as features, whether they need filtering, or whether they should be ignored altogether.

Feature Analysis in Machine Learning

When it comes to the ‘Garbage in – Garbage out’ principle, Machine Learning is no different from any other process. Before training a model, data scientists perform Feature Analysis to identify the features that are most useful for solving the problem. This includes steps such as:

  • analyzing the distribution of values,
  • looking for missing data and deciding whether to ignore, replace, or reject it,
  • analyzing correlation between columns, and
  • combining columns into new candidate features (so called Feature Engineering)

To illustrate why Feature Analysis is important, just take a look at the following diagram that shows the predicted NBA player salary (red) versus the real salary (blue) in the Regression page of our sample app:

Regression

It’s pretty clear that the trained algorithm is not very useful: it does not look significantly better than a random prediction. Next to the vertical axis you even see the model predicting negative salaries. This bad performance is partly because we did not do any analysis before just feeding the model with raw data.

Feature Analysis in ML.NET

Although the data preparation step is not the core business of ML.NET, we still have good news. The announcement of ML.NET 0.10.0 revealed that IDataView will become a shared type across libraries in the .NET ecosystem. If you dive into the IDataView Design Principles you’ll observe that the DataView can be a core component in analyzing raw data sets, since it allows

  • cursoring,
  • large data,
  • caching,
  • peeking,
  • and so on.

In this article we will use IDataView properties and (extension) methods to stay as close as possible to the following schema:

DataViewCentric

Introducing the Box Plot

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data. It is based on the five number summary:

  • minimum,
  • first quartile,
  • median,
  • third quartile, and
  • maximum.

In the simplest box plot the central rectangle spans the first quartile to the third quartile: the interquartile range or IQR – the likely range of variation. A segment inside the rectangle shows the median (the typical value) and “whiskers” above and below the box show the locations of the minimum and maximum. The so-called Tukey boxplot uses as minimum and maximum the lowest value still within 1.5 IQR of the lower quartile and the highest value still within 1.5 IQR of the upper quartile, respectively. Values outside those boundaries (the extreme values) are represented as ‘outlier’ dots:

BoxPlotParts

The box plot can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

Box plots may reveal whether or not you should use all of the values for training the model, and which algorithm you should prefer (some algorithms assume a symmetrical distribution). Here’s an article with more details on how to interpret box plots.

Box plot in OxyPlot

Most of the data visualization frameworks for UWP support box plots, and OxyPlot is no exception. All you need to do is insert a PlotView control in your XAML:

<oxy:PlotView 
	x:Name="RegressionDiagram"
	Background="Transparent"
	BorderThickness="0" />

The control’s PlotModel is populated with a BoxPlotSeries that is displayed against a CategoryAxis for the property names and a LinearAxis for the values. Check the previous articles in this blog series on how to define a model and its axes in XAML and C#.
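
As a refresher, here’s a minimal sketch of such an axis pair in C#; the category names match the columns we load further on, the axis positions are illustrative:

var plotModel = new PlotModel();

// One category per analyzed column.
plotModel.Axes.Add(new CategoryAxis
{
    Position = AxisPosition.Bottom,
    ItemsSource = new[] { "Age", "AnnualIncome", "SpendingScore" }
});

// A linear axis for the raw values.
plotModel.Axes.Add(new LinearAxis
{
    Position = AxisPosition.Left
});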

In our sample app we wanted to highlight the distributions having outliers. We added two series to the model – each with a different color:

var cleanSeries = new BoxPlotSeries
{
    Stroke = foreground,
    Fill = OxyColors.DarkOrange
};
plotModel.Series.Add(cleanSeries);

var outlierSeries = new BoxPlotSeries
{
    Stroke = foreground,
    Fill = OxyColors.Firebrick,
    OutlierSize = 4
};
plotModel.Series.Add(outlierSeries);

The next step is to provide BoxPlotItem instances for the series, but first we need to calculate these.

Enter Math.NET.

Boxplot in Math.NET

Math.NET is an Open Source C# library covering fundamental mathematics. Its code is distributed over several NuGet packages covering domains such as numerical computing, algebra, signal processing, and geometry. Math.NET Numerics is the package that contains the functions from descriptive statistics that we’re interested in. It targets .NET 4.0 and higher, including Mono and .NET Standard 1.3 and higher. The sample app does not use the NuGet package however. Because the source code is effective and very well structured (what else did you expect from mathematicians?) it was easy to identify and grab the source code for the calculations that we wanted, so that’s what we did.

The SortedArrayStatistics class contains all the functions we need (Median, Quartiles, Quantiles) and more.

Boxplot in the Sample App

To draw a box plot, we first need to get the data. ML.NET uses a TextLoader for this:

var trainingDataPath = await MlDotNet.FilePath(@"ms-appx:///Data/Mall_Customers.csv");
var reader = new TextLoader(_mlContext,
                            new TextLoader.Arguments()
                            {
                                Separator = ",",
                                HasHeader = true,
                                Column = new[]
                                    {
                                    new TextLoader.Column("Age", DataKind.R4, 2),
                                    new TextLoader.Column("AnnualIncome", DataKind.R4, 3),
                                    new TextLoader.Column("SpendingScore", DataKind.R4, 4),
                                    }
                            });

var file = _mlContext.OpenInputFile(trainingDataPath);
var src = new FileHandleSource(file);
var dataView = reader.Read(src);

The result of the Read() method is an IDataView tabular structure. We can query its Schema to find out column names, and with the GetColumn() extension method we can fetch all values for the specified column.

This is how the sample app basically pivots the data view from a list of rows to a list of columns:

var result = new List<List<double>>();
for (int i = 0; i < dataView.Schema.ColumnCount; i++)
{
    var columnName = dataView.Schema.GetColumnName(i);
    result.Add(dataView
	.GetColumn<float>(_mlContext, columnName)
	.Select(f => (double)f)
	.ToList());
}

return result;

Notice that we switched from float (the low-memory favorite type in ML.NET) to double (the high-precision favorite type in Math.NET).

The array of column values is used to build a BoxPlotItem to be added to the PlotModel:

// Read data
var regressionData = await ViewModel.LoadRegressionData();

// Populate diagram
for (int i = 0; i < regressionData.Count; i++)
{
    AddItem(plotModel, regressionData[i], i);
}

Here’s the code to calculate all the box plot constituents. Remember to sort the array first, since we rely on Math.NET’s Sorted Array Statistics here:

values.Sort();
var sorted = values.ToArray();

// Box: Q1, Q2, Q3
var median = sorted.Median();
var firstQuartile = sorted.LowerQuartile();
var thirdQuartile = sorted.UpperQuartile();

// Whiskers
var interQuartileRange = thirdQuartile - firstQuartile;
var step = interQuartileRange * 1.5;
var upperWhisker = thirdQuartile + step;
upperWhisker = sorted.Where(v => v <= upperWhisker).Max();
var lowerWhisker = firstQuartile - step;
lowerWhisker = sorted.Where(v => v >= lowerWhisker).Min();

// Outliers
var outliers = sorted.Where(v => v < lowerWhisker || v > upperWhisker).ToList();

Here’s the creation of the OxyPlot box plot item itself. The first parameter refers to the category index:

var item = new BoxPlotItem(
    x: slot,
    lowerWhisker: lowerWhisker,
    boxBottom: firstQuartile,
    median: median,
    boxTop: thirdQuartile,
    upperWhisker: upperWhisker)
{
    Outliers = outliers
};

In the following code snippet we assign the new item to one of the two series (with and without outliers) to obtain the wanted color scheme:

if (outliers.Any())
{
    (plotModel.Series[1] as BoxPlotSeries).Items.Add(item);
}
else
{
    (plotModel.Series[0] as BoxPlotSeries).Items.Add(item);
}

This is what the final result looks like. The diagram on the left shows the raw data for the Regression sample in the app. Notice that all properties are distributed asymmetrically and come with lots of outliers. That’s not a solid base for creating a prediction model:

BoxPlot

The diagram on the right shows the data for the Clustering sample. This was our very first ML.NET project so we decided to use a proven and cleaned data set, and this shows in the box plot. For the sake of completeness, here’s that Clustering sample again. Its prediction model works very well:

Clustering

One more thing about the box plot. When you right click a shape on the diagram, OxyPlot shows its tracker with the details:

FeatureDistributionAnalysis_Tracker

When outliers are identified in the analysis, you may decide to skip them when training the model, using FilterByColumn(). Check this sample code for more details.
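
As a sketch of what that could look like in the newer (1.x) API, where the method is called FilterRowsByColumn and an MLContext instance is assumed (the column name and bounds are illustrative):

// Keep only the rows whose Age lies within the whisker boundaries.
var filtered = mlContext.Data.FilterRowsByColumn(
    dataView,
    columnName: "Age",
    lowerBound: lowerWhisker,
    upperBound: upperWhisker);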

Source

In this article we demonstrated how to build box plot diagrams in UWP, using ML.NET, Math.NET and OxyPlot. Even when ML.NET is not targeting Feature Analysis, its IDataView API is very helpful in getting the column data.

The sample app lives here on GitHub.

Enjoy!

Machine Learning with ML.NET in UWP: Multiclass Classification

This is the second in a series of articles on implementing Machine Learning scenarios with ML.NET and OxyPlot in UWP apps. If you’re looking for an introduction to these technologies, please check part one of this series. In this article we will build, train, evaluate, and consume a multiclass classification model to detect the language of a piece of text.

All blog posts in this series are based on a single sample app that lives here on GitHub.

Classification

Classification in Machine Learning

Classification is a technique from supervised learning to categorize data into a desired number of labeled classes. In binary classification the prediction yields one of two possible outcomes (basically solving ‘true or false’ problems). This article however focuses on multiclass classification, where the prediction model has two or more possible outcomes.

Here are some real-world classification scenarios:

  • road sign detection in self-driving cars,
  • spoken language understanding,
  • market segmentation (predict if a customer will respond to marketing campaign), and
  • classification of proteins according to their function.

There’s a wide range of multiclass classification algorithms available. Here are the most used ones:

  • k-Nearest Neighbors learns by example. The model is a so-called lazy one: it just stores the training data, all computation is deferred. At prediction time it looks up the k closest training samples. It’s very effective on small training sets, like in the face recognition on your mobile phone.
  • Naive Bayes is a family of algorithms that use principles from the field of probability theory and statistics. It is popular in text categorization and medical diagnosis.
  • Regression involves fitting a curve to numeric data. When used for classification, the resulting numerical value must be transformed back into a label. Regression algorithms have been used to identify future risks for patients, and to predict voting intent.
  • Classification Trees and Forests use flowchart-like structures to make decisions. This family of algorithms is particularly useful when transparency is needed, e.g. in loan approval or fraud detection.
  • A set of Binary Classification algorithms can be made to work together to form a multiclass classifier using a technique called ‘One-versus-All’ (OVA).
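
To make the OVA idea concrete, here’s a minimal, framework-free sketch: one binary scorer is trained per class, and the class whose scorer is most confident wins. The Func-based scorers are an assumption for illustration:

// One-versus-All: each scorer answers "class i, or anything else?".
// The predicted class is the one with the highest confidence.
public static int PredictOneVersusAll(Func<double[], double>[] binaryScorers, double[] features)
{
    var bestClass = 0;
    var bestScore = double.MinValue;
    for (int i = 0; i < binaryScorers.Length; i++)
    {
        var score = binaryScorers[i](features);
        if (score > bestScore)
        {
            bestScore = score;
            bestClass = i;
        }
    }
    return bestClass;
}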

If you want to know more about classification then check this straightforward article. It is richly illustrated with Chris Albon’s awesome flash cards like this one:

1_OqOzEP0pLmqvEBirKAjPXQ

Classification in ML.NET

ML.NET covers all the major algorithm families with multiclass classification learners such as StochasticDualCoordinateAscentClassifier, LogisticRegressionClassifier, and NaiveBayesClassifier (the three that show up in the code below).

The API allows you to implement the following flow:

MachineLearningSteps

When we dive into the code, you’ll recognize that same pattern.

Building a Language Recognizer

The case

In this article we’ll build and use a model to detect the language of a piece of text from a set of languages. The model will be trained to recognize English, German, French, Italian, Spanish, and Romanian. The training and evaluation datasets (and a lot of the code) are borrowed from this project by Dirk Bahle.

Safety Instructions

In the previous article, we explained why we’re using v0.6 of the ML.NET API instead of the current one (v0.9). There is some work to be done by different Microsoft teams to adjust the UWP/.NET Core/ML.NET components to one another. 

The sample app works pretty well, as long as you comply with the following safety instructions:

  • don’t upgrade the ML.NET NuGet package,
  • don’t run the app in Release mode, and
  • always bend your knees not your back when lifting heavy stuff.

In the last couple of iterations the ML.NET team has been upgrading its API from the original Microsoft internal .NET 1.0 code to one that is on par with other Machine Learning frameworks. The difference is huge! A lot of the v0.6 classes that you encounter in this sample are now living in the Legacy namespace or were even removed from the package.

As far as possible we’ll try to point the hyperlinks in this article to the corresponding members in the newer API. The documentation on older versions is continuously being cleaned up, and we don’t want you to end up on this page:

DeletedDocs

If you want to know what multiclass classification looks like in the newest API, then check this official sample.

Alternative Strategy

We can imagine that some of you don’t want to wait for all pieces of the technical puzzle to come together, or are reluctant to use ML.NET in UWP. Allow us to promote an alternative approach. WinML is an inference engine to use trained local ONNX machine learning models in your Windows apps. Not all end user (UWP) apps are interested in model training – they only want to use a model for running predictions. You can build, train, and evaluate a Machine Learning model in a C# console app with ML.NET, then save it as ONNX with this converter, then load and consume it in a UWP app with WinML:

onnx-diagram-v03

The ML.NET console app can be packaged, deployed and executed as part of your UWP app by including it as a full trust desktop extension. In this configuration the whole solution can even be shipped to the store.
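
For the consumption side, here’s a minimal, hypothetical WinML sketch. The file name and tensor names are placeholders, and the exact binding types depend on how the ONNX model was exported:

// Load the converted ONNX model and run a single prediction with WinML.
var file = await StorageFile.GetFileFromApplicationUriAsync(
    new Uri("ms-appx:///Assets/LanguageDetection.onnx"));
var model = await LearningModel.LoadFromStorageFileAsync(file);

var session = new LearningModelSession(model);
var binding = new LearningModelBinding(session);
binding.Bind("input", inputTensor);                       // placeholder: e.g. a TensorFloat built from your features
var result = await session.EvaluateAsync(binding, "run"); // correlation id
var prediction = result.Outputs["output"];                // placeholder output name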

The Code

A Lottie-driven busy indicator

Depending on the algorithm family, training and using a machine learning model can be CPU intensive and time consuming. To entertain the end user during these processes and to verify that they do not block the UI, we added an extra element to the page: a UWP Lottie animation will play the role of busy indicator:

<lottie:LottieAnimationView 
	x:Name="BusyIndicator"
	FileName="Assets/loading.json"
	Visibility="Collapsed" />

When the load-build-train-test-save-consume scenario starts, the image will become visible and we start the animation:

BusyIndicator.Visibility = Windows.UI.Xaml.Visibility.Visible;
BusyIndicator.PlayAnimation();

Here’s how this looks:

Lottie

When the action stops, we hide the control and pause the animation:

BusyIndicator.Visibility = Windows.UI.Xaml.Visibility.Collapsed;
BusyIndicator.PauseAnimation();

As explained in the previous article, we moved all machine learning model processing off the main UI thread by making it awaitable:

public Task Train()
{
    return Task.Run(() =>
    {
        _model.Train();
    });
}

Load data

The training dataset is a tab-separated value file with the labeled input data: an integer corresponding to the language, and some text:

RawData

The input data is modeled through a small class. We use the Column attribute to indicate the column’s sequence number in the file, and to assign the special names that the algorithm expects. Supervised learning algorithms always expect a “Label” column in the input:

public class MulticlassClassificationData
{
    [Column(ordinal: "0", name: "Label")]
    public float LanguageClass;

    [Column(ordinal: "1")]
    public string Text;

    public MulticlassClassificationData(string text)
    {
        Text = text;
    }
}

The output of the classification model is a prediction that contains the predicted language (as a float, just like the input) and the confidence percentages for all languages. We used the ColumnName attribute to link the class members to these output columns:

public class MulticlassClassificationPrediction
{
    private readonly string[] classNames = { "German", "English", "French", "Italian", "Romanian", "Spanish" };

    [ColumnName("PredictedLabel")]
    public float Class;

    [ColumnName("Score")]
    public float[] Distances;

    public string PredictedLanguage => classNames[(int)Class];

    public int Confidence => (int)(Distances[(int)Class] * 100);
}

The MVVM Model has properties to store the untrained model and the trained model, respectively a LearningPipeline and a PredictionModel:

public LearningPipeline Pipeline { get; private set; }

public PredictionModel<MulticlassClassificationData, MulticlassClassificationPrediction> Model { get; private set; }

We used the ‘classic’ text loader from the Legacy namespace to load the data sets, so watch the using statement:

using TextLoader = Microsoft.ML.Legacy.Data.TextLoader;

The first step in the learning pipeline is loading the raw data:

Pipeline = new LearningPipeline();
Pipeline.Add(new TextLoader(trainingDataPath).CreateFrom<MulticlassClassificationData>());

Extract features

To prepare the data for the classifier, we need to manipulate both incoming fields. The label does not represent a numerical series but a language. So with a Dictionarizer we create a ‘bucket’ for each language to hold the texts. The TextFeaturizer populates the Features column with a numeric vector that represents the text:

// Create a dictionary for the languages. (no pun intended)
Pipeline.Add(new Dictionarizer("Label"));

// Transform the text into a feature vector.
Pipeline.Add(new TextFeaturizer("Features", "Text"));

Train model

Now that the data is prepared, we can hook the classifier into the pipeline. As already mentioned, there are multiple candidate algorithms here:

// Main algorithm
Pipeline.Add(new StochasticDualCoordinateAscentClassifier());
// or
// Pipeline.Add(new LogisticRegressionClassifier());
// or
// Pipeline.Add(new NaiveBayesClassifier()); // yields weird metrics...

The predicted label is a vector, but we want one of our original input labels back, to map it to a language. The PredictedLabelColumnOriginalValueConverter does this:

// Convert the predicted value back into a language.
Pipeline.Add(new PredictedLabelColumnOriginalValueConverter()
    {
        PredictedLabelColumn = "PredictedLabel"
    }
);

The learning pipeline is complete now. We can train the model:

public void Train()
{
    Model = Pipeline.Train<MulticlassClassificationData, MulticlassClassificationPrediction>();
}

The trained machine learning model can be saved now:

public async Task Save(string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    using (var fs = new FileStream(
        Path.Combine(storageFolder.Path, modelName),
        FileMode.Create,
        FileAccess.Write,
        FileShare.Write))
    {
        // Await the write: a fire-and-forget call could dispose
        // the stream before the model was fully written.
        await Model.WriteAsync(fs);
    }
}

Evaluate model

In supervised learning you can evaluate a trained model by providing a labeled input test data set and see how the predictions compare against it. This gives you an idea of the accuracy of the model and indicates whether you need to retrain it with other parameters or another algorithm.

We create a ClassificationEvaluator for this, and inspect the ClassificationMetrics returned by the Evaluate() call:

public ClassificationMetrics Evaluate(string testDataPath)
{
    var testData = new TextLoader(testDataPath).CreateFrom<MulticlassClassificationData>();

    var evaluator = new ClassificationEvaluator();
    return evaluator.Evaluate(Model, testData);
}

Some of the returned metrics apply to the whole model, some are calculated per label (language). The following diagram presents the Logarithmic Loss of the classifier per language (the PerClassLogLoss field). Loss represents a degree of uncertainty, so lower values are better:

MulticlassClassificationStart

Observe that some languages are harder to detect than others.
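
For reference, the logarithmic loss over N test samples is the negative average log-probability assigned to the correct label:

\text{LogLoss} = -\frac{1}{N}\sum_{n=1}^{N}\ln p_n

where p_n is the probability the model predicted for the true language of sample n. A perfect classifier scores 0; the per-class variant averages only over the samples of one language.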

Model consumption

The Predict() call takes a piece of text and returns a prediction:

public MulticlassClassificationPrediction Predict(string text)
{
    return Model.Predict(new MulticlassClassificationData(text));
}

The prediction contains the predicted language and a set of scores for each language. Here’s what we do with this information in the sample app:

MulticlassClassification

We are pretty impressed to see how easy it is to build a reliable detector for 6 languages. The trained model would definitely make sense in a lot of .NET applications that we developed in the last couple of years.

Visualizing the results

We decided to use OxyPlot for visualizing the data in the sample app, because it’s lightweight and it does all the graphs we need. In the previous article in this series we created all the elements programmatically, so this time we’ll focus on the XAML.

Axes and Series

Here’s the declaration of the PlotView with its PlotModel. The model has a CategoryAxis for the languages and a LinearAxis for the log-loss values. The values are represented in a BarSeries:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0"
                Margin="0 0 40 60"
                Grid.Column="1">
    <oxy:PlotView.Model>
        <oxyplot:PlotModel Subtitle="Model Quality"
                            PlotAreaBorderColor="{x:Bind OxyForeground}"
                            TextColor="{x:Bind OxyForeground}"
                            TitleColor="{x:Bind OxyForeground}"
                            SubtitleColor="{x:Bind OxyForeground}">
            <oxyplot:PlotModel.Axes>
                <axes:CategoryAxis Position="Left"
                                    ItemsSource="{x:Bind Languages}"
                                    TextColor="{x:Bind OxyForeground}"
                                    TicklineColor="{x:Bind OxyForeground}"
                                    TitleColor="{x:Bind OxyForeground}" />
                <axes:LinearAxis Position="Bottom"
                                    Title="Logarithmic loss per class (lower is better)"
                                    TextColor="{x:Bind OxyForeground}"
                                    TicklineColor="{x:Bind OxyForeground}"
                                    TitleColor="{x:Bind OxyForeground}" />
            </oxyplot:PlotModel.Axes>
            <oxyplot:PlotModel.Series>
                <series:BarSeries LabelPlacement="Inside"
                                    LabelFormatString="{}{0:0.00}"
                                    TextColor="{x:Bind OxyText}"
                                    FillColor="{x:Bind OxyFill}" />
            </oxyplot:PlotModel.Series>
        </oxyplot:PlotModel>
    </oxy:PlotView.Model>
</oxy:PlotView>

Apart from the OxyColor and OxyThickness values, we were able to define the whole diagram in XAML. That’s not too bad for a prerelease NuGet package…

When the page is loaded in the sample app, we fill out the missing declarations, and update the diagram’s UI:

var plotModel = Diagram.Model;
plotModel.PlotAreaBorderThickness = new OxyThickness(1, 0, 0, 1);
Diagram.InvalidatePlot();

Adding the data

After the evaluation of the classification model, we iterate through the quality metrics. We create a BarItem for each language. All items are then added to the series:

var bars = new List<BarItem>();
foreach (var logloss in metrics.PerClassLogLoss)
{
    bars.Add(new BarItem { Value = logloss });
}

(plotModel.Series[0] as BarSeries).ItemsSource = bars;
plotModel.InvalidatePlot(true);

The sample app

The sample app lives here on GitHub. We take the opportunity here to proudly mention that it is featured in the ML.NET Machine Learning Community gallery.

Enjoy!

Machine Learning with ML.NET in UWP: Clustering

This is the first in a series of articles on implementing Machine Learning scenarios in UWP apps. We will use

  • ML.NET for defining, training, evaluating and running Machine Learning models,
  • OxyPlot for visualizing the data, and
  • we’re planning to bring in Math.NET for number crunching – if needed.

All of these are cross platform Open Source technologies, all of these are written in C#, all of these are free, and all of these can be used on the UWP platform, albeit with some (hopefully temporary) restrictions.

Currently the large majority of the online samples on ML.NET are straightforward console apps. That’s fine if you want to learn the API, but we want to figure out how ML.NET behaves in a more hostile enterprise-ish environment – where calculations should not block the UI, data should be visualized in sexy graphs, and architectural constraints may apply. With that in mind, we created a sample UWP MVVM app on GitHub, with pages covering different Machine Learning use cases, like

  • building and using models for clustering, classification, and regression, and
  • analyzing the input data that is used to train these models (that’s what data scientists call feature engineering).

Here’s a screenshot from that app:

MulticlassClassification

Machine Learning

Machine learning is a data science technique that allows computers to use existing data to forecast future behaviors, outcomes, and trends. Using machine learning, computers learn without being explicitly programmed. The forecasts and predictions are produced by so-called ‘models’ that are defined upfront, trained with test data, evaluated, persisted, and then called upon in client apps.

Typical Machine Learning implementations include (source: Princeton):

  • optical character recognition: categorize images of handwritten characters by the letters represented
  • face detection: find faces in images (or indicate if a face is present)
  • spam filtering: identify email messages as spam or non-spam
  • topic spotting: categorize news articles (say) as to whether they are about politics, sports, entertainment, etc.
  • spoken language understanding: within the context of a limited domain, determine the meaning of something uttered by a speaker to the extent that it can be classified into one of a fixed set of categories
  • medical diagnosis: diagnose a patient as a sufferer or non-sufferer of some disease
  • customer segmentation: predict, for instance, which customers will respond to a particular promotion
  • fraud detection: identify credit card transactions (for instance) which may be fraudulent in nature
  • weather prediction: predict, for instance, whether or not it will rain tomorrow

Well, this is exactly what ML.NET covers, and here are the steps to build all of these scenarios:

MachineLearningSteps

The domain of machine learning (and data science in general) is currently dominated by two programming languages: Python (with tools like scikit-learn) and R. These are great environments for data scientists, but are not targeted to developers. Enter ML.NET.

ML.NET

ML.NET is not the first Machine Learning environment from Microsoft (think of the Data Mining components of SQL Server Analysis Services, Cognitive Toolkit, and Azure Machine Learning services), but it is the first framework that targets application developers. ML.NET brings a large set of model-based Machine Learning analytic and prediction capabilities into the .NET world. The framework is built upon .NET Core to run cross-platform on Linux, Windows and MacOS. Developers can define and train a Machine Learning model or reuse an existing model from a third party, and run it in any environment, offline.

ML.NET is mature …

The origins of the ML.NET library go back many years. Shortly after the introduction of the Microsoft .NET Framework in 2002, Microsoft Research began a project called TMSN (“text mining search and navigation”) to enable developers to include ML code in Microsoft products and technologies. The project was very successful, and over the years grew in size and usage internally at Microsoft. Somewhere around 2011 the library was renamed to TLC (“the learning code”). TLC is widely used within Microsoft.

The ML.NET library started as a descendant of TLC, with the Microsoft-specific features removed.

… but not finished yet

ML.NET 0.1 was announced at //Build 2018. The core functionality has existed since 2002, so the first iterations probably focused on modernizing the C# code and hiding/removing the Microsoft-specific stuff. In version 0.5 a large portion of the original pipeline API with concepts such as Estimators, Transforms and DataView was moved to the Legacy namespace and replaced by one that is more consistent with well-known frameworks like Scikit-Learn, TensorFlow and Spark. Version 0.7 removed the framework’s dependency on x64 devices. ML.NET is evolving rapidly and is currently at v0.9, which was extended with feature engineering, further reducing the functionality gap with Python and R environments.

You find the ML.NET source code here on GitHub.

What about UWP?

UWP apps are safe to run and easy to install and uninstall in a clean way. To achieve this, they run in a sandbox

  • with a reduced API surface where some Win32 and COM calls are not accessible or even available, and
  • within a security context that restricts access to the external environment (file system, processes, external devices).

When experimenting with ML.NET in UWP we observed the following handful of issues:

  • ML.NET uses Reflection.Emit, which is not allowed in the native compilation step that occurs when compiling UWP apps in release mode,
  • ML.NET uses the MEF composition container from v0.7 onwards, and not all MEF classes are exposed to UWP yet (a known issue), and
  • ML.NET does file and process manipulations through classic APIs that are not available in UWP because of the sandbox, or because they live in a different namespace.

These issues are known by the different teams at Microsoft and we believe that they will be tackled sooner or later (the issues, not the teams). In the meantime we’ll stick to version v0.6. Don’t worry: the v0.6 NuGet package has most of the core functionality (load and transform training data into memory, create a model, train the model, evaluate and save the model, use the model) and algorithms, but not the latest API.

Clustering

Clustering is a set of Machine Learning algorithms that help to identify the meaningful groups in a larger population. It is used by social networks and search engines for targeting ads and determining relevance rates. Clustering is a subdomain of unsupervised learning. Its algorithms learn from test data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data.

These diagrams show what clustering does:

CA40FFF2-7AF4-4EF5-9899-ADE67C8EB2A7

Most courses and handbooks on Machine Learning start with clustering, because it’s easy compared to the other algorithms (no labeling, no evaluation, easy visualization if you stick to less than 4D).

So it makes sense to start this ML.NET-OxyPlot-UWP series with an article on clustering.

ML.NET algorithms

There are a few clustering algorithms available, but the K-Means family is the most used, and that’s what ML.NET provides. K-means is often referred to as Lloyd’s algorithm. It takes the number of wanted clusters (‘k’) as its primary input. The algorithm has three steps. It executes the first step and then starts looping between the other two:

  • The first step chooses the initial centroids, with the most basic method being to choose k samples from the dataset.
  • The second step assigns each sample to its nearest centroid.
  • The third step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid.

The algorithm iterates through steps 2 and 3 until the position of the centroids stabilizes.
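
Here’s a minimal, illustrative version of that loop in plain C# for 2-D points. It’s a sketch of the idea, not the ML.NET implementation: step 1 (picking the initial centroids) is left to the caller, and a fixed iteration count replaces the convergence test:

// Naive 2-D K-Means: repeatedly assign points to their nearest
// centroid (step 2) and move each centroid to the mean of its points (step 3).
public static int[] KMeans(double[][] points, double[][] centroids, int iterations)
{
    var assignment = new int[points.Length];
    for (int iteration = 0; iteration < iterations; iteration++)
    {
        // Step 2: assign each sample to its nearest centroid (squared distance).
        for (int i = 0; i < points.Length; i++)
        {
            var bestCluster = 0;
            var bestDistance = double.MaxValue;
            for (int c = 0; c < centroids.Length; c++)
            {
                var dx = points[i][0] - centroids[c][0];
                var dy = points[i][1] - centroids[c][1];
                var distance = dx * dx + dy * dy;
                if (distance < bestDistance)
                {
                    bestDistance = distance;
                    bestCluster = c;
                }
            }
            assignment[i] = bestCluster;
        }

        // Step 3: move each centroid to the mean of its assigned samples.
        for (int c = 0; c < centroids.Length; c++)
        {
            double sumX = 0, sumY = 0;
            int count = 0;
            for (int i = 0; i < points.Length; i++)
            {
                if (assignment[i] != c) continue;
                sumX += points[i][0];
                sumY += points[i][1];
                count++;
            }
            if (count > 0)
            {
                centroids[c][0] = sumX / count;
                centroids[c][1] = sumY / count;
            }
        }
    }
    return assignment;
}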

Enough talking, let’s get our hands dirty

Prepare the Solution

The sample app started as a new UWP app, with the following NuGet packages:

  • ML.NET v0.6 (because higher versions don’t work for UWP)
  • OxyPlot latest UWP prerelease
  • .NET Core latest prerelease (because we want to see if and when bugs are fixed)
  • LottieUWP (because we want to show a fancy animation while calculating)

ProjectConfig

As already mentioned we stay in debug mode to avoid Reflection issues, and we target x64 architecture only since we use a pre-v0.7 version of ML.NET.

Solution Architecture

The Visual Studio solution uses a lightweight MVVM-architecture where the

  • Models contain all business logic, the
  • ViewModels make the models accessible to the Views, the
  • Views focus on UI only, and the
  • Services take care of everything that doesn’t fit the previous categories.

As a result, the MVVM Model classes contain code that is pretty close to the existing ML.NET Console app samples. Here’s for example how a Machine Learning Model is trained:

public void Train(IDataView trainingDataView)
{
    Model = Pipeline.Fit(trainingDataView);
}

In general, MVVM ViewModels make the models accessible to the UI by enabling data binding and change propagation. Machine Learning scenarios typically work with large data files and they run CPU-intensive processes. In the sample app the ViewModels are responsible for pushing all that heavy lifting off the UI thread to keep the app responsive while calculating.

Here’s the pattern: all the model’s CPU-bound actions are made awaitable by wrapping them in a Task. Here’s how such a call looks like:

public Task Train(IDataView trainingDataView)
{
    return Task.Run(() =>
    {
        _model.Train(trainingDataView);
    });
}

This allows the Views to remain responsive: they can update controls and start animations while calculations are in progress. Here’s how a XAML page updates its UI and then starts training a model:

TrainingBox.IsChecked = true;
await ViewModel.Train(trainingDataView);

Nothing fancy, right? Well: if you were to train the model synchronously, the checkbox would only update after the calculation.

Let’s start with a hack

UWP has its own file API with limited access to the drives and its own logical URI schemes to refer to the files. ML.NET is based on the .NET Standard specification, and uses the classic APIs with physical file paths. So we decided to add a helper method that takes a UWP file reference (think ‘ms-appx:///something’), copies the file to local app storage, and then returns the physical path that classic APIs know and love. Here’s that helper:

public static async Task<string> FilePath(string uwpPath)
{
    var originalFile = await StorageFile.GetFileFromApplicationUriAsync(new Uri(uwpPath));
    var storageFolder = ApplicationData.Current.LocalFolder;
    var localFile = await originalFile.CopyAsync(storageFolder, originalFile.Name, NameCollisionOption.ReplaceExisting);
    return localFile.Path;
}

Let’s now dive into ML.NET.

Load the data

In this first scenario we will divide a group of shopping mall customers (borrowed from this Python sample) into 5 clusters. The raw data contains records with a customer id, gender, age, annual income, and a spending score. Notice that we’re in an unsupervised learning environment, so the data is ‘unlabeled’ – there is no cluster id column to learn from:

Clustering_data

The MLContext class is central in the recent ML.NET APIs (much like DbContext in Entity Framework). In v0.6 it was still called LocalEnvironment. We need an instance of it to share across components:

private LocalEnvironment _mlContext = new LocalEnvironment(seed: null); // v0.6;

The following code snippet from the MVVM Model shows how a TextLoader is used to describe the content of the file, read it, and return the result as an IDataView:

public IDataView Load(string trainingDataPath)
{
    var reader = new TextLoader(_mlContext,
                                new TextLoader.Arguments()
                                {
                                    Separator = ",",
                                    HasHeader = true,
                                    Column = new[]
                                        {
                                            new TextLoader.Column("CustomerId", DataKind.I4, 0),
                                            new TextLoader.Column("Gender", DataKind.Text, 1),
                                            new TextLoader.Column("Age", DataKind.I4, 2),
                                            new TextLoader.Column("AnnualIncome", DataKind.R4, 3),
                                            new TextLoader.Column("SpendingScore", DataKind.R4, 4),
                                        }
                                });

    var file = _mlContext.OpenInputFile(trainingDataPath);
    var src = new FileHandleSource(file);
    return reader.Read(src);
}

The ML.NET algorithms only support the data types that are listed in the DataKind enumeration. The properties that we’re interested in (AnnualIncome and SpendingScore) are defined as float. In general, Machine Learning doesn’t like double because its values take more memory and the extra precision does not compensate for that.

Here’s the call from the MVVM View. We read a file from the app’s Data folder, make a copy of it to the local storage, and pass the full path to the ML.NET code above:

// Prepare the files.
var trainingDataPath = await MlDotNet.FilePath(@"ms-appx:///Data/Mall_Customers.csv");

// Read training data.
var trainingDataView = await ViewModel.Load(trainingDataPath);

The returned IDataView does deferred execution, similar to IEnumerable in LINQ. The data is only physically read when it is consumed – in this case when training the model.

Here’s the raw data in a diagram:

Clustering_Raw_Plot

Extract the features

In the second step of this Machine Learning scenario, we prepare the input data and define the algorithm that will be used. These steps are combined into a ‘Learning Pipeline’ that represents the untrained model.

First we plug in a ConcatEstimator to ‘featurize’ the data: we enrich the input data with a column that has the name and the type the algorithm expects. The Clustering trainer looks for a single column called “Features”, which we create from the columns that we want to cluster on: AnnualIncome and SpendingScore.

The last step in the pipeline represents the algorithm and defines the type of pipeline – the T in EstimatorChain<T>. In a Clustering scenario we use a KMeansPlusPlusTrainer. Its most important parameter is the number of clusters.

Here’s how the pipeline is created:

public void Build()
{
    Pipeline = new ConcatEstimator(_mlContext, "Features", "AnnualIncome", "SpendingScore")
        .Append(new KMeansPlusPlusTrainer(
            env: _mlContext,
            featureColumn: "Features",
            clustersCount: 5,
            advancedSettings: (a) =>
                {
                    // a.AccelMemBudgetMb = 1;
                    // a.MaxIterations = 1;
                    // a. ...
                }
            ));
}

The advanced settings parameter allows you to fine-tune the algorithm, e.g. to optimize it for speed or memory consumption. Of course this comes at a price: the quality of the model will decrease. Here’s what happens when you restrict the number of iterations to just one:

Clustering_bad_config

The initial centroids were used and never moved, so we ended up with green dots within the red cluster, and a dark blue cluster with only two dots.

In a production environment you would iterate through different algorithms and different configurations to create a model that fits your need and does not kill the hardware. It’s good to see that ML.NET supports this.

Train the model

The Machine Learning model is created and trained by feeding the pipeline with the training data using a Fit() call. In the ML.NET API that’s just a one-liner, but behind it is a very complex and CPU-intensive task:

public void Train(IDataView trainingDataView)
{
    Model = Pipeline.Fit(trainingDataView);
}

The resulting model (a TransformerChain<T>) can be used immediately for prediction, or it can be serialized for later use.

Evaluate the model

In common Machine Learning scenarios we would use two data sets: one to train the model, and another for evaluation. Since clustering belongs to unsupervised learning, there’s no reference data set to evaluate the model against.

That’s why we used a data set that allows easy visual inspection. When looking at the diagram, it is pretty easy to figure out where the five clusters should be. Observe that the KMeans algorithm with standard settings did a good job:

Clustering

Persist the model

The trained model can be persisted to a .zip file for later use or for use on another device with a simple call to the (oddly synchronous) SaveTo() method:

public void Save(string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    using (var fs = new FileStream(
            Path.Combine(storageFolder.Path, modelName), 
            FileMode.Create, 
            FileAccess.Write, 
            FileShare.Write))
    {
        Model.SaveTo(_mlContext, fs);
    }
}

Use the model

A trained model’s main purpose is to create predictions based on its input, so we need to define some classes that describe that input and output.

The class that describes the predicted result may contain all input fields (in our sample “AnnualIncome” and “SpendingScore”) and the algorithm-specific output. The KMeans algorithm spawns records with a “PredictedLabel” holding the most relevant cluster id, and a “Score” array with the distances to the respective centroids.

You can use the ColumnName attribute to shape your own data types to these expectations:

public class ClusteringPrediction
{
    [ColumnName("PredictedLabel")]
    public uint PredictedCluster;

    [ColumnName("Score")]
    public float[] Distances;

    public float AnnualIncome;

    public float SpendingScore;
}

Here’s the class that we use to store a single input point:

public class ClusteringData
{
    public float AnnualIncome;

    public float SpendingScore;
}

There are at least two ways to use the model for assigning clusters to input data:

  • the Transform(IDataView) method applies the transformation to a whole set of data, and
  • the MakePredictionFunction<Tsrc, Tdst> extension method (now called CreatePredictionEngine<Tsrc, Tdst>) is a strongly typed way to create a prediction from a single input.

Here’s how both calls are used in the sample app:

public IEnumerable<ClusteringPrediction> Predict(IDataView dataView)
{
    var result = Model.Transform(dataView).AsEnumerable<ClusteringPrediction>(_mlContext, false);
    return result;
}

public ClusteringPrediction Predict(ClusteringData clusteringData)
{
    var predictionFunc = Model.MakePredictionFunction<ClusteringData, ClusteringPrediction>(_mlContext);
    return predictionFunc.Predict(clusteringData);
}

To color the graph in the sample, we asked the model to assign a cluster to each record in the training dataset:

var predictions = await ViewModel.Predict(trainingDataView);

This creates a list of predictions with the input fields and the assigned cluster. That’s what we used to populate the diagram.

The sample page also comes with two textboxes where you can fill out values for annual income and spending score. The trained model will assign a cluster to them. Here’s how that call looks:

int.TryParse(AnnualIncomeInput.Text, out int annualIncome);
int.TryParse(SpendingScoreInput.Text, out int spendingScore);
var output = await ViewModel.Predict(
	new ClusteringData 
	{ 
		AnnualIncome = annualIncome, 
		SpendingScore = spendingScore 
	});

The result is a single prediction.

At this point in the scenario, most of the ML.NET sample apps do a Console.WriteLine of the predictions and then stop. But we’re in a UWP app, so we can have a lot more fun. Let’s add some sexy diagrams!

Visualizing the results

Introducing OxyPlot

OxyPlot is an open source plot generation library. It started in 2010 as a simple WPF plotting component, focusing on simplicity, performance and visual appearance. Today it targets multiple .NET platforms through a portable library, making it easy to re-use plotting code on different platforms. The library has implementations for WPF, Windows 8, Windows Phone, Windows Phone Silverlight, Windows Forms, Silverlight, GTK#, Xwt, Xamarin.iOS, Xamarin.Android, Xamarin.Forms, Xamarin.Mac … aaaand … UWP.

OxyPlot is primarily focused on two-dimensional coordinate systems; that’s the reason for the ‘xy’ in the name! Some of its restrictions are

  • no support for 3D plots,
  • no animations, and
  • data binding without automatic refresh: you must manually invalidate the plot when your data changes.

The documentation is a work in progress that stalled somewhere back in 2015, but since then the OxyPlot team has published more than enough example source code for every diagram type, and there are ‘getting started’ guides for all targeted technology stacks, including UWP and Xamarin.

OxyPlot’s API comes with all the types of axes (linear, logarithmic, radial, date, time, category, color), series (area, bar, boxplot, candlestick, function, heat map, contour and many more), and annotations (point, rectangle, image, line, arrow, text, etc.) that you can think of. On top of that, it handles the Batman Equation correctly:

FunctionSeries
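
If you want to try a function plot yourself, a FunctionSeries only needs a delegate, a range, and a step size. A minimal sketch, assuming an existing PlotModel like the one we build below:

// Plot y = sin(x) between -10 and 10, sampled every 0.1.
plotModel.Series.Add(new FunctionSeries(Math.Sin, -10, 10, 0.1, "sin(x)"));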

PlotView

The only UI control in the package is PlotView. It’s not really documented, but apart from the PlotModel that hosts the content, it doesn’t have many properties anyway – just look at the source code.

Since we will create everything in the sample app programmatically, the XAML is rather lean:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0" />
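
The oxy prefix has to be mapped on the page element; it points to the UWP implementation of the library. A sketch, assuming the OxyPlot.Windows NuGet package:

<!-- Attribute on the Page element: -->
xmlns:oxy="using:OxyPlot.Windows"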

Axes

To represent the clusters, we will use two linear axes: a horizontal one for the SpendingScore and a vertical one for the AnnualIncome. Here’s how these are defined:

var foreground = OxyColors.SteelBlue;
var plotModel = new PlotModel
{
    PlotAreaBorderThickness = new OxyThickness(1, 0, 0, 1),
    PlotAreaBorderColor = foreground
};

var linearAxisX = new LinearAxis
{
    Position = AxisPosition.Bottom,
    Title = "Spending Score",
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
};

plotModel.Axes.Add(linearAxisX);
var linearAxisY = new LinearAxis
{
    // LinearAxis defaults to AxisPosition.Left, so no Position is needed here.
    Maximum = 140,
    Title = "Annual Income",
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
};
plotModel.Axes.Add(linearAxisY);

Please note that we could have defined this in XAML too.

Series

The dots that represent the training data are assigned to 5 ScatterSeries instances (one per cluster). Each series comes with its own color:

for (int i = 0; i < 5; i++)
{
    var series = new ScatterSeries
    {
        MarkerType = MarkerType.Circle,
        MarkerFill = _colors[i]
    };

    plotModel.Series.Add(series);
}
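
The _colors array is defined elsewhere in the sample; any five OxyColor values will do. A hypothetical palette:

// One OxyColor per cluster; these particular values are just an example.
private readonly OxyColor[] _colors =
{
    OxyColors.OrangeRed,
    OxyColors.Gold,
    OxyColors.MediumSeaGreen,
    OxyColors.SteelBlue,
    OxyColors.MediumPurple
};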

All we need to do now to prepare the diagram is hook the PlotModel into the PlotView:

Diagram.Model = plotModel;

To (re)plot the diagram, we loop through the list of predictions on the training dataset, create a new ScatterPoint for each prediction, and add it to the appropriate cluster series:

foreach (var prediction in predictions)
{
    (Diagram.Model.Series[(int)prediction.PredictedCluster - 1] as ScatterSeries).Points.Add(
        new ScatterPoint
        (
            prediction.SpendingScore,
            prediction.AnnualIncome
        ));
}

When this code is executed you’ll see … nothing. You need to tell the diagram that its content was updated:

Diagram.InvalidatePlot();
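
InvalidatePlot also accepts an optional flag, so you can tell OxyPlot whether the data itself changed or only the visual properties did:

// updateData: true (the default) re-reads the data,
// false merely redraws the visual elements.
Diagram.InvalidatePlot(updateData: true);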

Now you end up with the 5 colored clusters that we’ve seen before:

Clustering

Annotations

The extra dot for the single prediction is visualized as a PointAnnotation. We gave it the color of the assigned cluster, but a different shape and an extra label to easily spot it in the diagram:

var annotation = new PointAnnotation
{
    Shape = MarkerType.Diamond,
    X = output.SpendingScore,
    Y = output.AnnualIncome,
    Fill = _colors[(int)output.PredictedCluster - 1],
    TextColor = OxyColors.SteelBlue,
    Text = "Here"
};
Diagram.Model.Annotations.Add(annotation);
Diagram.InvalidatePlot();

Here’s what this looks like:

Cluster_Annotation

To finish this article, let’s promote some extra OxyPlot features.

Tracker

OxyPlot comes with a templatable tracker. By default it appears when you left-click on an item in a series, but you can also make it permanently visible. It shows a crosshair and the details of the underlying item – very convenient:

Clustering_bad_config_tracker
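
Making the tracker permanently visible boils down to rebinding it from a left-click to mouse-hover. A sketch, assuming the default PlotController bindings:

// Show the tracker on hover instead of on left-click.
var controller = new PlotController();
controller.UnbindMouseDown(OxyMouseButton.Left);
controller.BindMouseEnter(PlotCommands.HoverSnapTrack);
Diagram.Controller = controller;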

Zooming and Panning

By scrolling the mouse wheel you can zoom in and out, and with the right mouse button down you can pan the diagram:

Clustering_MouseActions

You want more?

This was just the first in a series of articles on ML.NET in UWP. So far, we are very impressed!

If you want to play with the sample app, it lives here on GitHub.

Enjoy!