Category Archives: ML.NET

Machine Learning with ML.NET in UWP: Recommendation

In this article we describe how to define, train, evaluate, persist, and use an ML.NET Recommendation model in a UWP app. The blog post is part of a series on implementing different Machine Learning scenarios with .NET Open Source frameworks and components.

All articles in the series are supported by the same UWP sample app that lives here on GitHub. Since the previous article was published, this sample app was upgraded to the latest prereleases of ML.NET thanks to Pull Requests from the Microsoft ML.NET Team itself (thanks Eric!). This means that the syntax in the code snippets is quite different from the previous articles, but much closer to the imminent official release.

Here’s what the Recommendation page in the sample app looks like:

[Image: the Recommendation page of the sample app]

It builds a model to generate recommendations for hotels on the Las Vegas Strip for a selected traveler type (single, family, business, …):

  • when you select a traveler type in the combo box, the top 10 recommended hotels appear in the diagram, and
  • when you select a hotel in the second combo box, a predicted rating will appear next to it.

Recommendation in Machine Learning

Machine Learning recommender systems are highly popular in e-commerce and social networks. They’re used for recommending books, TV series, music, events, products, friends, dating profiles, and a lot more.

There are two approaches for generating recommendations:

  • Content-Based Filtering recommends items to a user that are similar to items that the same user previously rated highly. The advantage of this is transparency (the model can explain why it recommends the item). Unfortunately this approach does not scale well with large data.
  • Collaborative Filtering recommends the items that were highly rated by other (but similar) users. In most real world scenarios not every user has rated every item, so the base data can be very sparse. This makes the approach unsuitable in some scenarios.

Matrix Factorization in Machine Learning

Matrix Factorization is a common technique to solve the sparsity problem with Collaborative Filtering that we just mentioned. In a nutshell, its goal is to mass-predict the missing ratings. Matrix Factorization is entirely based on linear algebra, which is something that your CPUs, GPUs, and/or AI Accelerators are pretty good at. If you want to dive into the mathematical details, allow me to recommend (pun intended) the article with the very appropriate name A Gentle Introduction to Matrix Factorization for Machine Learning.
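For readers who prefer a formula: the following is the generic textbook formulation of Matrix Factorization, not necessarily the exact objective that ML.NET optimizes. The symbols are our own notation: R is the sparse rating matrix, K the set of known (user, item) ratings, p_u and q_i are vectors of length k for user u and item i, and λ is a regularization weight. The rank k is presumably what the approximationRank option in the pipeline further down controls.

```latex
R \approx P\,Q^{\top}, \qquad \hat{r}_{ui} = p_u^{\top} q_i

\min_{P,\,Q} \sum_{(u,i) \in K} \Big[ \left( r_{ui} - p_u^{\top} q_i \right)^2
    + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right) \Big]
```

Once P and Q are learned from the known ratings, every missing cell of R can be filled in with a simple dot product, which is the mass-prediction mentioned above.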

Major advantages of this algorithm are that it scales very well with large data and it is very fast. You don’t have to take my word for it, but there must be a reason why Amazon and Netflix are relying on it. The algorithm has the disadvantage that it cannot always easily explain why it recommends an item. You must have stumbled upon recommendations like this before:

[Image: example of a Netflix recommendation]

Before we dive into the code, allow us to clarify something: Matrix Factorization does NOT answer the question “What items would you recommend for this user?”. Instead it solves the “Here’s a list of products and a (list of) user(s), please predict their ratings” problem. So when you use it in your apps, there is some preprocessing (selecting the products to evaluate) and some postprocessing (filtering relevant recommendations) to do. Basically the algorithm always has to deal with too much data. Don’t worry about that: Matrix Factorization is a real Weapon of Mass Prediction.
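The pre- and postprocessing around the prediction step can be sketched in a few lines of plain C#. This is a minimal illustration and not code from the sample app: RecommendationHelper, TopN and the predictRating delegate are hypothetical names, and the delegate merely stands in for whatever trained model produces the ratings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RecommendationHelper
{
    // predictRating stands in for the trained model: (user, item) -> predicted rating.
    public static List<KeyValuePair<string, float>> TopN(
        string user,
        IEnumerable<string> candidateItems,
        Func<string, string, float> predictRating,
        int count)
    {
        return candidateItems
            .Distinct()                              // preprocessing: the items to evaluate
            .Select(item => new KeyValuePair<string, float>(item, predictRating(user, item)))
            .OrderByDescending(pair => pair.Value)   // postprocessing: rank by predicted rating
            .Take(count)                             // postprocessing: keep the best ones
            .ToList();
    }
}
```

The model itself only ever sees (user, item) pairs and returns a number for each; deciding which pairs to ask for and which answers to show is left to the app.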

Matrix Factorization in ML.NET

For Matrix Factorization in ML.NET you’ll need the MatrixFactorizationTrainer class. It comes in a separate NuGet package (Microsoft.ML.Recommender):

[Image: the Microsoft.ML.Recommender NuGet package]

Model Input and Output

For training and testing the model, we’ll use a 2015 dataset with 510 Las Vegas hotel ratings from TripAdvisor. Here’s what it looks like:

[Image: sample rows from the recommendation data set]

Matrix Factorization predicts the rating (“Label”) from only two fields (“Features”). If you have to deal with more fields, then you’ll need the FieldAwareFactorizationMachine instead.

In the sample app we choose TravelerType and Hotel as features to respectively play the roles of ‘similar user’ and ‘recommended item’. The Score column contains the rating and will play the role of ‘label’ (the thing to predict). Since the prediction engine’s output column is also called Score, we renamed it to Label for the input.

Here’s the structure of input samples that we will feed the model with:

public class RecommendationData
{
    public float Label;

    public string TravelerType;

    public string Hotel;
}

The prediction looks like this:

public class RecommendationPrediction
{
    public float Score;

    public string TravelerType;

    public string Hotel;
}

Observe the lack of LoadColumn and ColumnName attributes on top of the fields – we had these in all the previous posts in this article series. We don’t need the attributes here because we’re not using a TextLoader to read the training and testing data sets. Instead we’ll create our IDataView with a call to the LoadFromEnumerable() method. This same method allows you to populate the model with records from a database:

private IDataView trainingData;

public IEnumerable<RecommendationData> Load(string trainingDataPath)
{
    var data = File.ReadAllLines(trainingDataPath)
        .Skip(1)
        .Select(x => x.Split(';'))
        .Select(x => new RecommendationData
        {
            Label = uint.Parse(x[4]),
            TravelerType = x[6],
            Hotel = x[13]
        })
        .OrderBy(x => (x.GetHashCode())) // Cheap Randomization.
        .Take(400);

    // Populating an IDataView from an IEnumerable.
    trainingData = _mlContext.Data.LoadFromEnumerable(data);

    // Keep DataView in memory.
    trainingData = _mlContext.Data.Cache(trainingData);

    // Populating an IEnumerable from an IDataView.
    return _mlContext.Data.CreateEnumerable<RecommendationData>(trainingData, reuseRowObject: false);
}

Part of the data set will be used for training, and another for evaluating the model. Since the original data set is sorted on Hotel name, we applied cheap randomization logic to the rows by sorting them on their GetHashCode() value.
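A note of caution on this shortcut: RecommendationData does not override GetHashCode(), so the sort order is tied to object identity and need not be the same in Load() and Evaluate(). On top of that, Take(400) plus TakeLast(200) adds up to more than the 510 rows, so at least 90 rows would end up in both sets even with identical ordering. The following is a minimal sketch of a repeatable, non-overlapping split; DataSplitter and its parameters are hypothetical names and not part of the sample app or of ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DataSplitter
{
    // Shuffles the rows with a seeded Fisher-Yates pass and splits them into
    // two non-overlapping parts. The fixed seed makes the split repeatable.
    public static (List<T> Training, List<T> Test) Split<T>(
        IEnumerable<T> rows, int trainingCount, int seed = 42)
    {
        var shuffled = rows.ToList();
        var random = new Random(seed);

        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = random.Next(i + 1); // 0 <= j <= i
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        int cut = Math.Min(trainingCount, shuffled.Count);
        return (shuffled.Take(cut).ToList(), shuffled.Skip(cut).ToList());
    }
}
```

With such a helper the file would be read once, split once, and the two parts handed to the training and evaluation steps respectively.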

The Cache() method keeps the selected columns (in our case: all columns) in memory after they’re accessed for the first time. For iterative algorithms this really is a time saver – at least if the data fits into memory.

Defining and Building the Model

The recommendation model is an ITransformer that is created from an EstimatorChain with a MatrixFactorizationTrainer at its heart. You have to specify the label (labelColumn), the two features (matrixRowIndexColumnName and matrixColumnIndexColumnName), and some options to fine-tune the algorithm. Before sending the feature values to the trainer, they’re added to a dictionary with MapValueToKey(). The reverse function of that is MapKeyToValue(). It ensures that the original values are returned with the predicted score.

Here’s the whole pipeline:

private ITransformer _model;

public void Build()
{
    var pipeline = _mlContext.Transforms.Conversion.MapValueToKey("Hotel")
                    .Append(_mlContext.Transforms.Conversion.MapValueToKey("TravelerType"))
                    .Append(_mlContext.Recommendation().Trainers.MatrixFactorization(
                                        labelColumn: DefaultColumnNames.Label,
                                        matrixColumnIndexColumnName: "Hotel",
                                        matrixRowIndexColumnName: "TravelerType",
                                        // Optional fine tuning:
                                        numberOfIterations: 20,
                                        approximationRank: 8,
                                        learningRate: 0.4))
                    .Append(_mlContext.Transforms.Conversion.MapKeyToValue("Hotel"))
                    .Append(_mlContext.Transforms.Conversion.MapKeyToValue("TravelerType"));

    // Place a breakpoint here to peek the training data.
    var preview = pipeline.Preview(trainingData, maxRows: 10);

    _model = pipeline.Fit(trainingData);
}

The extremely useful Preview() method was recently added to the API. It allows you to inspect the content and schema of the pipeline while debugging – it feels a bit like the old SSIS Data Viewer:

[Image: the schema in the pipeline preview]

[Image: the row content in the pipeline preview]

The prediction model is trained with a Fit() call.

Evaluating the Model

It’s always a good idea to evaluate your freshly trained model. Typically this is done by sending it a set of labeled rows that it has not seen before. The Transform() call generates the predictions, while Evaluate() compares these with the original labels:

public RegressionMetrics Evaluate(string testDataPath)
{
    //var testData = _mlContext.Data.LoadFromTextFile<RecommendationData>(testDataPath);
    var data = File.ReadAllLines(testDataPath)
        .Skip(1)
        .Select(x => x.Split(';'))
        .Select(x => new RecommendationData
        {
            Label = uint.Parse(x[4]),
            TravelerType = x[6],
            Hotel = x[13]
        })
        .OrderBy(x => (x.GetHashCode())) // Cheap Randomization.
        .TakeLast(200);

    var testData = _mlContext.Data.LoadFromEnumerable(data);
    var scoredData = _model.Transform(testData);
    var metrics = _mlContext.Recommendation().Evaluate(scoredData);

    // Place a breakpoint here to inspect the quality metrics.
    return metrics;
}

The evaluation returns a RegressionMetrics instance with useful information on the quality of the model – such as the coefficient of determination, and the relative squared error:

[Image: the values in the RegressionMetrics instance]

If you notice that your model lacks accuracy, then you need to fine tune its parameters and/or provide more representative training data and/or select another algorithm.
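To make the two metrics mentioned above less of a black box, here is a minimal sketch that computes them by hand from a list of labels and a list of predicted scores. It uses the textbook definitions; we assume, but have not verified, that the values reported by ML.NET are computed equivalently. RegressionQuality is a hypothetical helper and not part of ML.NET.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RegressionQuality
{
    // Relative squared error: the squared error of the model divided by the
    // squared error of always predicting the mean label.
    // Not meaningful when all labels are identical (division by zero).
    public static double RelativeSquaredError(
        IReadOnlyList<double> labels, IReadOnlyList<double> scores)
    {
        if (labels.Count != scores.Count || labels.Count == 0)
            throw new ArgumentException("Both lists need the same, non-zero length.");

        double mean = labels.Average();
        double modelError = 0, baselineError = 0;
        for (int i = 0; i < labels.Count; i++)
        {
            modelError += Math.Pow(labels[i] - scores[i], 2);
            baselineError += Math.Pow(labels[i] - mean, 2);
        }

        return modelError / baselineError;
    }

    // Coefficient of determination (R squared): 1 is a perfect fit, 0 is no
    // better than predicting the mean, and a negative value is worse than that.
    public static double RSquared(
        IReadOnlyList<double> labels, IReadOnlyList<double> scores)
        => 1 - RelativeSquaredError(labels, scores);
}
```

A coefficient of determination near zero (or below) is a strong hint that one of the remedies listed above is needed.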

Persisting the Model

The model can be serialized and persisted with a call to Save():

public void Save(string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    string modelPath = Path.Combine(storageFolder.Path, modelName);

    _mlContext.Model.Save(_model, inputSchema: null, filePath: modelPath);
}

Inferencing with the model

There are two ways for creating recommendation scores. The first one generates a prediction for a single feature combination: a score for one specific traveler type/hotel combination. The API for this scenario cannot be more straightforward: you create a prediction engine with CreatePredictionEngine() and then you call Predict() to … predict:

public RecommendationPrediction Predict(RecommendationData recommendationData)
{
    // Single prediction
    var predictionEngine = _model.CreatePredictionEngine<RecommendationData, RecommendationPrediction>(_mlContext);
    return predictionEngine.Predict(recommendationData);
}

This code is triggered when you select a hotel from the lower left combo box on the page:

[Image: the Recommendation page of the sample app]

The second way to generate recommendations takes a list of feature pairs instead of a single one. When you select an entry in the traveler type combo box, we first create a list of RecommendationData records – one for each hotel in the original data set. Then we call the Predict() method in the ViewModel – the sample app uses a lightweight MVVM architecture:

// Group Prediction
var recommendations = new List<RecommendationData>();
foreach (var hotel in ViewModel.Hotels)
{
    recommendations.Add(new RecommendationData
    {
        Hotel = hotel,
        TravelerType = TravelerTypesCombo.SelectedValue.ToString()
    });
}
var predictions = await ViewModel.Predict(recommendations);

This list is changed into an IDataView with the same LoadFromEnumerable() call that we encountered when loading the training data. The recommendation model transforms it into an IDataView that adheres to the output schema through the Transform() method. Finally, with the CreateEnumerable() method this structure is translated to a list of prediction entities:

public IEnumerable<RecommendationPrediction> Predict(IEnumerable<RecommendationData> recommendationData)
{
    // Group prediction
    var data = _mlContext.Data.LoadFromEnumerable(recommendationData);
    var predictions = _model.Transform(data);
    return _mlContext.Data.CreateEnumerable<RecommendationPrediction>(predictions, reuseRowObject: false);
}

There are 21 hotels in the data set, so this method returns 21 ratings. The end user is of course not interested in all of these. With a little LINQ query you can get the 10 most appropriate recommendations:

var recommendationsResult = predictions
        .OrderByDescending(p => p.Score)
        .Take(10)
        .Reverse();

[Note: The Reverse() call is only there because we build up the bar chart from bottom to top.]

A word of warning

The current NuGet package for Microsoft.ML carries the v1.0.0-preview tag, so we may be close to an official release. This is not the case for the Microsoft.ML.Recommender. This one seems to need some extra stabilization sprints. In its current version, Matrix Factorization yields different types of exceptions when you’re running in x86 mode. With a little luck you only get weird results like these:

[Image: the Recommendation page with weird results in x86 mode]

Don’t worry, it’s a known issue, the team is working on it…

Let there be XAML

Let’s jump to the visualization of the predictions. For the horizontal bar chart on the sample page, we borrowed the diagram from the MultiClass Classification sample. XAML-wise we declared a PlotView with its PlotModel. The model has a CategoryAxis for the hotel names and a LinearAxis for the predicted score (0-5). The values are represented in a BarSeries:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0"
                Margin="0 0 40 60"
                Grid.Column="1">
    <oxy:PlotView.Model>
        <oxyplot:PlotModel Subtitle="Recommended Hotels"
                            PlotAreaBorderColor="{x:Bind OxyForeground}"
                            TextColor="{x:Bind OxyForeground}"
                            TitleColor="{x:Bind OxyForeground}"
                            SubtitleColor="{x:Bind OxyForeground}">
            <oxyplot:PlotModel.Axes>
                <axes:CategoryAxis Position="Left"
                                    TextColor="{x:Bind OxyForeground}"
                                    TicklineColor="{x:Bind OxyForeground}"
                                    TitleColor="{x:Bind OxyForeground}" />
                <axes:LinearAxis Position="Bottom"
                                    Title="Predicted Score (higher is better)"
                                    TextColor="{x:Bind OxyForeground}"
                                    TicklineColor="{x:Bind OxyForeground}"
                                    TitleColor="{x:Bind OxyForeground}" />
            </oxyplot:PlotModel.Axes>
            <oxyplot:PlotModel.Series>
                <series:BarSeries LabelPlacement="Inside"
                                    LabelFormatString="{}{0:0.00}"
                                    TextColor="{x:Bind OxyText}"
                                    FillColor="{x:Bind OxyFill}" />
            </oxyplot:PlotModel.Series>
        </oxyplot:PlotModel>
    </oxy:PlotView.Model>
</oxy:PlotView>

When the predictions come in, we add a category (with the hotel name) and a BarItem (with the score) for each of the at most 10 hotels. The categories are assigned to the axis and the bars to the series, and then the plot is refreshed:

// Update diagram
var categories = new List<string>();
var bars = new List<BarItem>();
foreach (var prediction in recommendationsResult)
{
    categories.Add(prediction.Hotel);
    bars.Add(new BarItem { Value = prediction.Score });
}

var plotModel = Diagram.Model;

(plotModel.Axes[0] as CategoryAxis).ItemsSource = categories;
(plotModel.Series[0] as BarSeries).ItemsSource = bars;
plotModel.InvalidatePlot(true);

That’s it for today. The UWP sample app, which is featured on the ML.NET Community Samples page, lives here on GitHub.

Enjoy!


Machine Learning with ML.NET in UWP: Feature Correlation Analysis

In this article we show how to perform Feature Correlation Analysis and display the results in a Heat Map in the context of Machine Learning in UWP. It’s the fourth in a series that started here, on implementing Machine Learning scenarios in UWP using Open Source frameworks and components.

All articles in the series revolve around a single UWP sample app that lives here on GitHub. Here’s what the Feature Correlation Analysis page looks like:

[Image: the correlation heat map in the sample app]

It displays the correlation between different properties in the popular Titanic passengers dataset: age, fare, ticket class, whether the passenger was accompanied by siblings, spouses, parents or children, and whether he or she survived the trip.

The darker red or blue squares on the heat map indicate that the corresponding properties on the X and Y axes have a higher correlation with each other. Higher correlation is a warning sign for a possible negative impact on the classification model when both features are added to the training data.

Feature Correlation Analysis

Feature Correlation Analysis in Machine Learning

The topic of this article is Feature Correlation Analysis. Just like in the previous article -Feature Distribution Analysis- we are in the “data preparation” phase of a Machine Learning scenario. We’re not training or even defining models yet, we’re selecting the features to train them with. An ideal feature set contains features that are highly correlated with the classification (in ML.NET terminology: the Label Column), yet uncorrelated to each other.

Identifying the right feature set highly impacts the quality and performance of the subsequent learning and generalization steps. Here are two important reasons not to keep on adding feature columns to a training data set:

  • While the predictive power of a classifier first increases with the number of dimensions/features used, there comes a break point where it decreases again (the so-called Curse of Dimensionality). The training data set is always a finite set of samples with discrete values, while the prediction space may be infinite and continuous in all dimensions. So the more features a data set with a fixed number of samples has, the less representative it may become.
  • There’s also the cost incurred by adding features. Two features that are highly correlated with each other don’t add much value to a classifier, but they sure add cost at training, persisting and/or inference time. Machine Learning activities can be pretty resource intensive on CPU, memory, and elapsed time, so it sure makes sense to limit the number of features.

The process of converting a set of observations of possibly correlated variables into a smaller set of values of linearly uncorrelated variables is called Principal Component Analysis. This technique was invented by Karl Pearson, the same person that defined its main instrument: the Pearson correlation coefficient – a measure of the linear correlation between two variables X and Y. Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. It has a value between +1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation.
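Written out for n paired samples, the definition above looks as follows. The symbols are our own notation: x_i and y_i are the sample values, and the barred symbols are their means.

```latex
\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}
\qquad\text{and, for a sample,}\qquad
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The normalization factors of covariance and standard deviations cancel out, which is why the sample formula only contains sums.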

During Feature Correlation Analysis a matrix is calculated with the correlation between each pair of features. Highly correlated feature pairs are then ‘sanitized’ by removing one or combining both (e.g. by creating a new feature by multiplying the variables’ values). At the same time you also calculate the correlation between each attribute and the output variable (the Label), to select only those attributes that have a moderate-to-high positive or negative correlation (close to -1 or 1) and drop those attributes with a low correlation (value close to zero).

Feature Correlation Analysis with ML.NET and Math.NET

Data Preparation is outside the core business of ML.NET itself, but for retrieving and manipulating the candidate training data we can count on one of its most important spin-off components: the DataView API.

[Image: the DataView-centric data preparation schema]

We can fetch the samples, optionally filter and add missing data, and then pivot it into arrays of feature values (exactly what we did in the previous article). All we need is a TextLoader to create an IDataView from the data set, get the column names from its Schema, call GetColumn() to get the array, and ‘upgrade’ the data type to double:

var reader = new TextLoader(_mlContext,
                            new TextLoader.Arguments()
                            {
                                Separator = ",",
                                HasHeader = true,
                                Column = new[]
                                    {
                                    new TextLoader.Column("Survived", DataKind.R4, 1),
                                    new TextLoader.Column("PClass", DataKind.R4, 2),
                                    new TextLoader.Column("Age", DataKind.R4, 5),
                                    new TextLoader.Column("SibSp", DataKind.R4, 6),
                                    new TextLoader.Column("Parch", DataKind.R4, 7),
                                    new TextLoader.Column("Fare", DataKind.R4, 9)
                                    }
                            });
var dataView = reader.Read(src);
var result = new List<List<double>>();
for (int i = 0; i < dataView.Schema.ColumnCount; i++)
{
    var columnName = dataView.Schema.GetColumnName(i);
    result.Add(dataView.GetColumn<float>(_mlContext, columnName).Select(f => (double)f).ToList());
}

return result;

Now that we have all the feature values in arrays, it’s time to calculate the correlations. The MathNet.Numerics.Statistics.Correlation class from Math.NET hosts implementations for several Pearson, Spearman, and other correlation calculations.

We decided to make a copy of the code that calculates the Pearson correlation between two IEnumerable<double> instances:

/// <summary>
/// Computes the Pearson Product-Moment Correlation coefficient.
/// </summary>
/// <param name="dataA">Sample data A.</param>
/// <param name="dataB">Sample data B.</param>
/// <returns>The Pearson product-moment correlation coefficient.</returns>
/// <remarks>Original Source: https://github.com/mathnet/mathnet-numerics/blob/master/src/Numerics/Statistics/Correlation.cs </remarks>
public static double Pearson(IEnumerable<double> dataA, IEnumerable<double> dataB)
{
    var n = 0;
    var r = 0.0;

    var meanA = 0d;
    var meanB = 0d;
    var varA = 0d;
    var varB = 0d;

    using (IEnumerator<double> ieA = dataA.GetEnumerator())
    using (IEnumerator<double> ieB = dataB.GetEnumerator())
    {
        while (ieA.MoveNext())
        {
            if (!ieB.MoveNext())
            {
                throw new ArgumentOutOfRangeException(nameof(dataB), "Array too short.");
            }

            var currentA = ieA.Current;
            var currentB = ieB.Current;

            var deltaA = currentA - meanA;
            var scaleDeltaA = deltaA / ++n;

            var deltaB = currentB - meanB;
            var scaleDeltaB = deltaB / n;

            meanA += scaleDeltaA;
            meanB += scaleDeltaB;

            varA += scaleDeltaA * deltaA * (n - 1);
            varB += scaleDeltaB * deltaB * (n - 1);
            r += (deltaA * deltaB * (n - 1)) / n;
        }

        if (ieB.MoveNext())
        {
            throw new ArgumentOutOfRangeException(nameof(dataA), "Array too short.");
        }
    }

    return r / Math.Sqrt(varA * varB);
}

For the sake of completeness: that same Math.NET class also hosts code to calculate the whole matrix. For the sample app this would require importing a lot more code (linear algebra classes such as Matrix) or adding the whole NuGet package to the project.

Here’s how the sample app calculates the whole correlation matrix:

// Read data
var matrix = await ViewModel.LoadCorrelationData();

// Populate diagram
var data = new double[6, 6];
for (int x = 0; x < 6; ++x)
{
    for (int y = 0; y < 5 - x; ++y)
    {
        var seriesA = matrix[x];
        var seriesB = matrix[5 - y];

        var value = Statistics.Pearson(seriesA, seriesB);

        data[x, y] = value;
        data[5 - y, 5 - x] = value;
    }

    data[x, 5 - x] = 1;
}
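The loop above is tailored to the diagram: it flips the vertical index because the Y axis of the heat map lists the features in reverse order. Without that flip the same calculation becomes a lot easier to read. Here is a minimal standalone sketch; CorrelationMatrix is a hypothetical class that is not part of the sample app, and it uses a plain two-pass Pearson calculation instead of the single-pass Math.NET code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CorrelationMatrix
{
    // Two-pass Pearson correlation: first the means, then the sums of products.
    public static double Pearson(IReadOnlyList<double> a, IReadOnlyList<double> b)
    {
        if (a.Count != b.Count)
            throw new ArgumentException("Both series need the same length.");

        double meanA = a.Average(), meanB = b.Average();
        double covariance = 0, varA = 0, varB = 0;
        for (int i = 0; i < a.Count; i++)
        {
            double deltaA = a[i] - meanA, deltaB = b[i] - meanB;
            covariance += deltaA * deltaB;
            varA += deltaA * deltaA;
            varB += deltaB * deltaB;
        }

        return covariance / Math.Sqrt(varA * varB);
    }

    // Fills a symmetric matrix: 1 on the diagonal, and each pair computed once.
    public static double[,] Compute(IReadOnlyList<IReadOnlyList<double>> columns)
    {
        int n = columns.Count;
        var result = new double[n, n];
        for (int i = 0; i < n; i++)
        {
            result[i, i] = 1;
            for (int j = i + 1; j < n; j++)
            {
                double r = Pearson(columns[i], columns[j]);
                result[i, j] = r;
                result[j, i] = r;
            }
        }

        return result;
    }
}
```

For n features this takes n(n-1)/2 Pearson calculations, which is 15 for the 6 columns of the sample app.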

All we now need is a way to properly visualize it.

Correlation Heat Maps

A heat map is a representation of data in which the values are represented by colors. Heat maps are ideal for highlighting patterns and extreme values in rectangular data such as matrices.

Correlation Heat Maps in Machine Learning

In Machine Learning, heat maps are used to display correlations between feature values. The typical (“Pearson”) color scheme is a gradient that goes from

  • red for high positive correlation (value +1), over
  • white for no correlation (value 0), to
  • blue for high negative correlation (value –1).

Sometimes the values are normalized (brought to a range from 0 to 1) like in the next image. When there are a lot of features, the value labels are omitted in the diagram. Since correlation is symmetric (the correlation between A and B is the same as the correlation between B and A) it suffices to only display half of the matrix, like this:

[Image: triangular correlation heat map with normalized values and without labels]

This diagram also omitted the correlations on the diagonal: the red squares that indicate the full positive correlation between each feature and itself.

Here’s an example of a diagram (from here) showing fewer features. In that case it’s common to display the whole matrix and the value labels:

[Image: full correlation heat map with value labels for the Titanic dataset]

The above diagram was created with Python (Pandas and Seaborn) and shows the correlation between all the numerical values in the already mentioned Titanic Passengers dataset.

Here’s the UWP sample app version of the very same diagram, calculated with ML.NET and Math.NET and visualized with OxyPlot:

[Image: close-up of the correlation heat map in the sample app]

The small differences in correlations for the Age feature are caused by the sample app not compensating for missing values. We have the matrix, let’s plot this diagram.

Correlation Heat Maps with OxyPlot

To draw an OxyPlot diagram, you start with placing a PlotView element in your XAML:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0" />

Then you can declaratively or programmatically decorate it with a PlotModel and different Axis instances. A correlation heat map uses a CategoryAxis in both dimensions:

plotModel.Axes.Add(new CategoryAxis
{
    Position = AxisPosition.Bottom,
    Key = "HorizontalAxis",
    ItemsSource = new[]
    {
        "Survived",
        "Class",
        "Age",
        "Sib / Sp",
        "Par / Chi",
        "Fare"
    },
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
});

plotModel.Axes.Add(new CategoryAxis
{
    Position = AxisPosition.Left,
    Key = "VerticalAxis",
    ItemsSource = new[]
    {
        "Fare",
        "Parents / Children",
        "Siblings / Spouses",
        "Age",
        "Class",
        "Survived"
    },
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
});

The legend on top of the diagram is an extra LinearColorAxis in the appropriate OxyPalette:

plotModel.Axes.Add(new LinearColorAxis
{
    // Pearson color scheme from blue over white to red.
    Palette = OxyPalettes.BlueWhiteRed31,
    Position = AxisPosition.Top,
    Minimum = -1,
    Maximum = 1,
    TicklineColor = OxyColors.Transparent
});

If you’re not entirely satisfied with the color scheme, feel free to create your own custom OxyPalette instance: it’s just a 3-color gradient.
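To show what such a 3-color gradient boils down to, here is a minimal sketch that maps a correlation value to an RGB color by linear interpolation. It is deliberately independent of OxyPlot: PearsonGradient is a hypothetical helper, and we assume pure blue, white, and red as the three anchor colors.

```csharp
using System;

public static class PearsonGradient
{
    // Maps a correlation in [-1, 1] to an RGB color:
    // blue (-1) over white (0) to red (+1). Values outside the range are clamped.
    public static (byte R, byte G, byte B) ToColor(double correlation)
    {
        double value = Math.Max(-1, Math.Min(1, correlation));

        // The fading channels: 255 at zero correlation, 0 at either extreme.
        byte fade = (byte)Math.Round(255 * (1 - Math.Abs(value)));

        return value >= 0
            ? ((byte)255, fade, fade)    // white to red
            : (fade, fade, (byte)255);   // white to blue
    }
}
```

A palette is then nothing more than this function sampled at a fixed number of evenly spaced values between -1 and 1.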

The matrix itself is a HeatMapSeries with 6 values in each dimension, rendered as rectangles:

var heatMapSeries = new HeatMapSeries
{
    X0 = 0,
    X1 = 5,
    Y0 = 0,
    Y1 = 5,
    XAxisKey = "HorizontalAxis",
    YAxisKey = "VerticalAxis",
    RenderMethod = HeatMapRenderMethod.Rectangles,
    LabelFontSize = 0.12,
    LabelFormatString = ".00"
};

plotModel.Series.Add(heatMapSeries);

Diagram.Model = plotModel;

To display the label in the square you need to set the LabelFormatString. This will only be applied if you also set a value to LabelFontSize.

OxyPlot does not support the triangular version of the heat map. Missing values always get the default value and you can’t make them transparent:

[Image: half-populated heat map in OxyPlot]

To populate the diagram, assign the Data to the series, and refresh the plot:

(plotModel.Series[0] as HeatMapSeries).Data = data;

// Update diagram
Diagram.InvalidatePlot();

Again, here’s the resulting heat map in the sample app:

[Image: the correlation heat map in the sample app]

Interpretation

The dark blue square on the diagram reveals a relatively high negative correlation between the Passenger Class and Ticket Fare. This means that the value for the one can to a large extent be derived from the other: think “first class tickets cost more than second class tickets”. Adding both as a feature would not add much more value than adding only one of them.

Data scientists would probably extract a new feature from these two (something like “Luxury”) or would break up “Passenger Class” and “Ticket Fare” in more basic components, like locations on the ship that passengers had access to. Anyway, the heat map clearly highlights the feature combinations that need further analysis.

Source

In this article we used components from ML.NET, Math.NET and OxyPlot to calculate and visualize the correlation heat map on candidate training data for a classification model.

The UWP sample app hosts more Machine Learning scenarios. It lives here on GitHub.

Enjoy!

Machine Learning with ML.NET in UWP: Feature Distribution Analysis

Welcome to the third article in this series on implementing Machine Learning scenarios with Open Source technologies in UWP apps. We were using ML.NET for modeling and OxyPlot for data visualization, and we will continue to do so. For this article we added Math.NET to our shopping basket, to calculate some statistics that are important when analyzing the input data. We will analyze the distribution of values for columns in the candidate model training data to detect whether these columns would be useful as features, whether they need filtering, or whether they should be ignored altogether.

Feature Analysis in Machine Learning

When it comes to the ‘Garbage in – Garbage out’ principle, Machine Learning is no different from any other process. Before training a model, data scientists will perform Feature Analysis to identify the features that are most useful in solving the problem. This includes steps such as:

  • analyzing the distribution of values,
  • looking for missing data and deciding whether to ignore, replace, or reject it,
  • analyzing correlation between columns, and
  • combining columns into new candidate features (so called Feature Engineering)

To illustrate why Feature Analysis is important, just take a look at the following diagram that shows the predicted NBA player salary (red) versus the real salary (blue) in the Regression page of our sample app:

[Image: predicted versus real NBA player salaries on the Regression page]

It’s pretty clear that the trained algorithm is not very useful: it does not look significantly better than a random prediction. Next to the vertical axis you even see the model predicting negative salaries. This bad performance is partly because we did not do any analysis before just feeding the model with raw data.

Feature Analysis in ML.NET

Although the data preparation step is not the core business of ML.NET, we still have good news. The announcement of ML.NET 0.10.0 revealed that IDataView would become a shared type across libraries in the .NET ecosystem. If you dive into the IDataView Design Principles you’ll observe that the DataView can be a core component in analyzing raw data sets, since it allows

  • cursoring,
  • large data,
  • caching,
  • peeking,
  • and so on.

You’ll observe that we will use IDataView properties and (extension) methods in this article to stay as close as possible to the following schema:

DataViewCentric

Introducing the Box Plot

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data. It is based on the five number summary:

  • minimum,
  • first quartile,
  • median,
  • third quartile, and
  • maximum.

In the simplest box plot the central rectangle spans the first quartile to the third quartile: the interquartile range or IQR – the likely range of variation. A segment inside the rectangle shows the median (the typical value) and “whiskers” above and below the box show the locations of the minimum and maximum. The so-called Tukey boxplot uses the lowest value still within 1.5 IQR of the lower quartile, and the highest value still within 1.5 IQR of the upper quartile respectively as minimum and maximum. It represents the values outside that boundary (the extreme values) as ‘outlier’ dots:

BoxPlotParts

The box plot can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

Box plots may reveal whether or not you should use all of the values for training the model, and which algorithm you should prefer (some algorithms assume a symmetrical distribution). Here’s an article with more details on how to interpret box plots.

Box plot in OxyPlot

Most of the data visualization frameworks for UWP support box plots, and OxyPlot is no exception. All you need to do is insert a PlotView control in your XAML:

<oxy:PlotView 
	x:Name="RegressionDiagram"
	Background="Transparent"
	BorderThickness="0" />

The control’s PlotModel is populated with a BoxPlotSeries that is displayed against a CategoryAxis for the property names and a LinearAxis for the values. Check the previous articles in this blog series on how to define a model and its axes in XAML and C#.

In our sample app we wanted to highlight the distributions having outliers. We added two series to the model – each with a different color:

var cleanSeries = new BoxPlotSeries
{
    Stroke = foreground,
    Fill = OxyColors.DarkOrange
};
plotModel.Series.Add(cleanSeries);

var outlinerSeries = new BoxPlotSeries
{
    Stroke = foreground,
    Fill = OxyColors.Firebrick,
    OutlierSize = 4
};
plotModel.Series.Add(outlinerSeries);

The next step is to provide BoxPlotItem instances for the series, but first we need to calculate these.

Enter Math.NET.

Boxplot in Math.NET

Math.NET is an Open Source C# library covering fundamental mathematics. Its code is distributed over several NuGet packages covering domains such as numerical computing, algebra, signal processing, and geometry. Math.NET Numerics is the package that contains the functions from descriptive statistics that we’re interested in. It targets .NET 4.0 and higher, including Mono and .NET Standard 1.3 and higher. The sample app does not use the NuGet package however. Because the source code is effective and very well structured -what else did you expect from mathematicians- it was easy to identify and grab the source code for the calculations that we wanted, so that’s what we did.

The SortedArrayStatistics class contains all the functions we need (Median, Quartiles, Quantiles) and more.

Boxplot in the Sample App

To draw a box plot, we first need to get the data. ML.NET uses a TextLoader for this:

var trainingDataPath = await MlDotNet.FilePath(@"ms-appx:///Data/Mall_Customers.csv");
var reader = new TextLoader(_mlContext,
                            new TextLoader.Arguments()
                            {
                                Separator = ",",
                                HasHeader = true,
                                Column = new[]
                                    {
                                    new TextLoader.Column("Age", DataKind.R4, 2),
                                    new TextLoader.Column("AnnualIncome", DataKind.R4, 3),
                                    new TextLoader.Column("SpendingScore", DataKind.R4, 4),
                                    }
                            });

var file = _mlContext.OpenInputFile(trainingDataPath);
var src = new FileHandleSource(file);
var dataView = reader.Read(src);

The result of the Read() method is an IDataView tabular structure. We can query its Schema to find out column names, and with the GetColumn() extension method we can fetch all values for the specified column.

This is how the sample app basically pivots the data view from a list of rows to a list of columns:

var result = new List<List<double>>();
for (int i = 0; i < dataView.Schema.ColumnCount; i++)
{
    var columnName = dataView.Schema.GetColumnName(i);
    result.Add(dataView
	.GetColumn<float>(_mlContext, columnName)
	.Select(f => (double)f)
	.ToList());
}

return result;

Notice that we switched from float (the low-memory favorite type in ML.NET) to double (the high-precision favorite type in Math.NET).

Each list of column values is used to build a BoxPlotItem that is added to the PlotModel:

// Read data
var regressionData = await ViewModel.LoadRegressionData();

// Populate diagram
for (int i = 0; i < regressionData.Count; i++)
{
    AddItem(plotModel, regressionData[i], i);
}

Here’s the code to calculate all the box plot constituents. Remember to sort the array first, since we rely on Math.NET’s SortedArrayStatistics here:

values.Sort();
var sorted = values.ToArray();

// Box: Q1, Q2, Q3
var median = sorted.Median();
var firstQuartile = sorted.LowerQuartile();
var thirdQuartile = sorted.UpperQuartile();

// Whiskers
var interQuartileRange = thirdQuartile - firstQuartile;
var step = interQuartileRange * 1.5;
var upperWhisker = thirdQuartile + step;
upperWhisker = sorted.Where(v => v <= upperWhisker).Max();
var lowerWhisker = firstQuartile - step;
lowerWhisker = sorted.Where(v => v >= lowerWhisker).Min();

// Outliers
var outliers = sorted.Where(v => v < lowerWhisker || v > upperWhisker).ToList();
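
If you want to experiment with this calculation outside of the app, the following self-contained sketch computes the same constituents without Math.NET. The BoxPlotSketch class is hypothetical, and its Quantile helper uses plain linear interpolation between the closest ranks. That is an assumption: it is only one of several common quantile definitions, so Math.NET may return slightly different quartiles on small samples:

```csharp
using System;
using System.Linq;

public static class BoxPlotSketch
{
    // Quantile by linear interpolation between the two closest ranks.
    // Assumption: one of several common definitions, not necessarily Math.NET's.
    public static double Quantile(double[] sorted, double p)
    {
        var position = (sorted.Length - 1) * p;
        var lower = (int)Math.Floor(position);
        var upper = (int)Math.Ceiling(position);
        return sorted[lower] + (sorted[upper] - sorted[lower]) * (position - lower);
    }

    public static (double LowerWhisker, double Q1, double Median, double Q3, double UpperWhisker, double[] Outliers)
        Summarize(double[] values)
    {
        var sorted = values.OrderBy(v => v).ToArray();
        var q1 = Quantile(sorted, 0.25);
        var median = Quantile(sorted, 0.5);
        var q3 = Quantile(sorted, 0.75);

        // Tukey fences: 1.5 times the interquartile range beyond the box.
        var step = 1.5 * (q3 - q1);
        var lowerFence = q1 - step;
        var upperFence = q3 + step;

        // Whiskers end at the most extreme values that are still inside the fences.
        var lowerWhisker = sorted.First(v => v >= lowerFence);
        var upperWhisker = sorted.Last(v => v <= upperFence);
        var outliers = sorted.Where(v => v < lowerFence || v > upperFence).ToArray();

        return (lowerWhisker, q1, median, q3, upperWhisker, outliers);
    }
}
```

For the values 1 to 9 plus a single 100, this sketch places the box between 3.25 and 7.75, ends the whiskers at 1 and 9, and reports 100 as the only outlier.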

Here’s the creation of the OxyPlot box plot item itself. The first parameter refers to the category index:

var item = new BoxPlotItem(
    x: slot,
    lowerWhisker: lowerWhisker,
    boxBottom: firstQuartile,
    median: median,
    boxTop: thirdQuartile,
    upperWhisker: upperWhisker)
{
    Outliers = outliers
};

In the following code snippet we assign the new item to one of the two series (with and without outliers) to obtain the wanted color scheme:

if (outliers.Any())
{
    (plotModel.Series[1] as BoxPlotSeries).Items.Add(item);
}
else
{
    (plotModel.Series[0] as BoxPlotSeries).Items.Add(item);
}

This is what the final result looks like. The diagram on the left shows the raw data for the Regression sample in the app. Notice that all properties are distributed asymmetrically and come with lots of outliers. That’s not a solid base for creating a prediction model:

BoxPlot

The diagram on the right shows the data for the Clustering sample. This was our very first ML.NET project so we decided to use a proven and cleaned data set, and this shows in the box plot. For the sake of completeness, here’s that Clustering sample again. Its prediction model works very well:

Clustering

One more thing about the box plot. When you right click a shape on the diagram, OxyPlot shows its tracker with the details:

FeatureDistributionAnalysis_Tracker

When outliers are identified in the analysis, you may decide to skip these when training the model, using FilterByColumn(). Check this sample code for more details.

Source

In this article we demonstrated how to build box plot diagrams in UWP, using ML.NET, Math.NET and OxyPlot. Even though ML.NET does not target Feature Analysis, its IDataView API is very helpful in getting the column data.

The sample app lives here on GitHub.

Enjoy!

Machine Learning with ML.NET in UWP: Multiclass Classification

This is the second in a series of articles on implementing Machine Learning scenarios with ML.NET and OxyPlot in UWP apps. If you’re looking for an introduction to these technologies, please check part one of this series. In this article we will build, train, evaluate, and consume a multiclass classification model to detect the language of a piece of text.

All blog posts in this series are based on a single sample app that lives here on GitHub.

Classification

Classification in Machine Learning

Classification is a technique from supervised learning to categorize data into a desired number of labeled classes. In binary classification the prediction yields one of two possible outcomes (basically solving ‘true or false’ problems). This article however focuses on multiclass classification, where the prediction model has more than two possible outcomes.

Here are some real-world classification scenarios:

  • road sign detection in self-driving cars,
  • spoken language understanding,
  • market segmentation (predict if a customer will respond to a marketing campaign), and
  • classification of proteins according to their function.

There’s a wide range of multiclass classification algorithms available. Here are the most used ones:

  • k-Nearest Neighbors learns by example. The model is a so-called lazy one: it just stores the training data, all computation is deferred. At prediction time it looks up the k closest training samples. It’s very effective on small training sets, like in the face recognition on your mobile phone.
  • Naive Bayes is a family of algorithms that use principles from the field of probability theory and statistics. It is popular in text categorization and medical diagnosis.
  • Regression involves fitting a curve to numeric data. When used for classification, the resulting numerical value must be transformed back into a label. Regression algorithms have been used to identify future risks for patients, and to predict voting intent.
  • Classification Trees and Forests use flowchart-like structures to make decisions. This family of algorithms is particularly useful when transparency is needed, e.g. in loan approval or fraud detection.
  • A set of Binary Classification algorithms can be made to work together to form a multiclass classifier using a technique called ‘One-versus-All’ (OVA).
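
To make the last item in this list more concrete, here is a minimal sketch of the One-versus-All idea: one binary scorer per class, and the class whose scorer is the most confident wins. The OneVersusAllSketch class and the delegate-based scorers are hypothetical stand-ins for trained binary classifiers, not an ML.NET API:

```csharp
using System;

public static class OneVersusAllSketch
{
    // Each binary scorer answers "how strongly does this sample belong to my class?".
    // The multiclass prediction is the index of the scorer with the highest score.
    public static int Predict(Func<double[], double>[] binaryScorers, double[] features)
    {
        var best = 0;
        var bestScore = double.NegativeInfinity;
        for (int i = 0; i < binaryScorers.Length; i++)
        {
            var score = binaryScorers[i](features);
            if (score > bestScore)
            {
                bestScore = score;
                best = i;
            }
        }

        return best;
    }
}
```

In a real implementation each scorer would be a binary model trained on "this class versus all the others"; the selection of the winning class stays the same.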

If you want to know more about classification then check this straightforward article. It is richly illustrated with Chris Albon’s awesome flash cards like this one:

1_OqOzEP0pLmqvEBirKAjPXQ

Classification in ML.NET

ML.NET covers all the major algorithm families and more with the following multiclass classification learners:

The API allows you to implement the following flow:

MachineLearningSteps

When we dive into the code, you’ll recognize that same pattern.

Building a Language Recognizer

The case

In this article we’ll build and use a model to detect the language of a piece of text from a set of languages. The model will be trained to recognize English, German, French, Italian, Spanish, and Romanian. The training and evaluation datasets (and a lot of the code) are borrowed from this project by Dirk Bahle.

Safety Instructions

In the previous article, we explained why we’re using v0.6 of the ML.NET API instead of the current one (v0.9). There is some work to be done by different Microsoft teams to adjust the UWP/.NET Core/ML.NET components to one another. 

The sample app works pretty well, as long as you comply with the following safety instructions:

  • don’t upgrade the ML.NET NuGet package,
  • don’t run the app in Release mode, and
  • always bend your knees not your back when lifting heavy stuff.

In the last couple of iterations the ML.NET team has been upgrading its API from the original Microsoft internal .NET 1.0 code to one that is on par with other Machine Learning frameworks. The difference is huge! A lot of the v0.6 classes that you encounter in this sample are now living in the Legacy namespace or were even removed from the package.

As far as possible we’ll try to point the hyperlinks in this article to the corresponding members in the newer API. The documentation on older versions is continuously cleaned up and we don’t want you to end up on this page:

DeletedDocs

If you want to know what multiclass classification looks like in the newest API, then check this official sample.

Alternative Strategy

We can imagine that some of you don’t want to wait for all pieces of the technical puzzle to come together, or are reluctant to use ML.NET in UWP. Allow us to promote an alternative approach. WinML is an inference engine to use trained local ONNX machine learning models in your Windows apps. Not all end user (UWP) apps are interested in model training: they only want to use a model for running predictions. You can build, train, and evaluate a Machine Learning model in a C# console app with ML.NET, then save it as ONNX with this converter, then load and consume it in a UWP app with WinML:

onnx-diagram-v03

The ML.NET console app can be packaged, deployed and executed as part of your UWP app by including it as a full trust desktop extension. In this configuration the whole solution can even be shipped to the store.

The Code

A Lottie-driven busy indicator

Depending on the algorithm family, training and using a machine learning model can be CPU intensive and time consuming. To entertain the end user during these processes and to verify that they do not block the UI, we added an extra element to the page. A UWP Lottie animation will play the role of a busy indicator:

<lottie:LottieAnimationView 
	x:Name="BusyIndicator"
	FileName="Assets/loading.json"
	Visibility="Collapsed" />

When the load-build-train-test-save-consume scenario starts, the image will become visible and we start the animation:

BusyIndicator.Visibility = Windows.UI.Xaml.Visibility.Visible;
BusyIndicator.PlayAnimation();

Here’s what this looks like:

Lottie

When the action stops, we hide the control and pause the animation:

BusyIndicator.Visibility = Windows.UI.Xaml.Visibility.Collapsed;
BusyIndicator.PauseAnimation();

As explained in the previous article, we moved all machine learning model processing off the main UI thread by making it awaitable:

public Task Train()
{
    return Task.Run(() =>
    {
        _model.Train();
    });
}

Load data

The training dataset is a TAB separated value file with the labeled input data: an integer corresponding to the language, and some text:

RawData

The input data is modeled through a small class. We use the Column attribute to indicate column sequence number in the file, and special names for the algorithm. Supervised learning algorithms always expect a “Label” column in the input:

public class MulticlassClassificationData
{
    [Column(ordinal: "0", name: "Label")]
    public float LanguageClass;

    [Column(ordinal: "1")]
    public string Text;

    public MulticlassClassificationData(string text)
    {
        Text = text;
    }
}

The output of the classification model is a prediction that contains the predicted language (as a float – just like the input) and the confidence percentages for all languages. We used the ColumnName attribute to link the class members to these output columns:

public class MulticlassClassificationPrediction
{
    private readonly string[] classNames = { "German", "English", "French", "Italian", "Romanian", "Spanish" };

    [ColumnName("PredictedLabel")]
    public float Class;

    [ColumnName("Score")]
    public float[] Distances;

    public string PredictedLanguage => classNames[(int)Class];

    public int Confidence => (int)(Distances[(int)Class] * 100);
}

The MVVM Model has properties to store the untrained model and the trained model, respectively a LearningPipeline and a PredictionModel:

public LearningPipeline Pipeline { get; private set; }

public PredictionModel<MulticlassClassificationData, MulticlassClassificationPrediction> Model { get; private set; }

We used the ‘classic’ text loader from the Legacy namespace to load the data sets, so watch the using statement:

using TextLoader = Microsoft.ML.Legacy.Data.TextLoader;

The first step in the learning pipeline is loading the raw data:

Pipeline = new LearningPipeline();
Pipeline.Add(new TextLoader(trainingDataPath).CreateFrom<MulticlassClassificationData>());

Extract features

To prepare the data for the classifier, we need to manipulate both incoming fields. The label does not represent a numerical series but a language. So with a Dictionarizer we create a ‘bucket’ for each language to hold the texts. The TextFeaturizer populates the Features column with a numeric vector that represents the text:

// Create a dictionary for the languages. (no pun intended)
Pipeline.Add(new Dictionarizer("Label"));

// Transform the text into a feature vector.
Pipeline.Add(new TextFeaturizer("Features", "Text"));

Train model

Now that the data is prepared, we can hook the classifier into the pipeline. As already mentioned, there are multiple candidate algorithms here:

// Main algorithm
Pipeline.Add(new StochasticDualCoordinateAscentClassifier());
// or
// Pipeline.Add(new LogisticRegressionClassifier());
// or
// Pipeline.Add(new NaiveBayesClassifier()); // yields weird metrics...

The predicted label is a vector, but we want one of our original input labels back, to map it to a language. The PredictedLabelColumnOriginalValueConverter does this:

// Convert the predicted value back into a language.
Pipeline.Add(new PredictedLabelColumnOriginalValueConverter()
    {
        PredictedLabelColumn = "PredictedLabel"
    }
);
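
Conceptually, the Dictionarizer and the converter form a round trip: distinct label values are mapped to keys before training, and a predicted key is mapped back to the original label value afterwards. The following sketch illustrates that idea only. The LabelDictionary class is hypothetical and does not reflect how ML.NET implements these transforms internally:

```csharp
using System;
using System.Collections.Generic;

public class LabelDictionary
{
    private readonly Dictionary<float, int> _keys = new Dictionary<float, int>();
    private readonly List<float> _values = new List<float>();

    // Forward direction: assign each distinct label the next free key.
    public int ToKey(float label)
    {
        if (!_keys.TryGetValue(label, out var key))
        {
            key = _values.Count;
            _keys[label] = key;
            _values.Add(label);
        }

        return key;
    }

    // Reverse direction: translate a predicted key back into the original label.
    public float ToLabel(int key) => _values[key];
}
```

The keys depend on the order in which labels are first seen, which is why the reverse mapping is needed to get a meaningful language value out of a prediction.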

The learning pipeline is complete now. We can train the model:

public void Train()
{
    Model = Pipeline.Train<MulticlassClassificationData, MulticlassClassificationPrediction>();
}

The trained machine learning model can be saved now:

public void Save(string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    using (var fs = new FileStream(
        Path.Combine(storageFolder.Path, modelName),
        FileMode.Create,
        FileAccess.Write,
        FileShare.Write))
    {
        // Wait for the write to complete before the stream is disposed.
        Model.WriteAsync(fs).GetAwaiter().GetResult();
    }
}

Evaluate model

In supervised learning you can evaluate a trained model by providing a labeled input test data set and see how the predictions compare against it. This gives you an idea of the accuracy of the model and indicates whether you need to retrain it with other parameters or another algorithm.

We create a ClassificationEvaluator for this, and inspect the ClassificationMetrics that return from the Evaluate() call:

public ClassificationMetrics Evaluate(string testDataPath)
{
    var testData = new TextLoader(testDataPath).CreateFrom<MulticlassClassificationData>();

    var evaluator = new ClassificationEvaluator();
    return evaluator.Evaluate(Model, testData);
}

Some of the returned metrics apply to the whole model, some are calculated per label (language). The following diagram presents the Logarithmic Loss of the classifier per language (the PerClassLogLoss field). Loss represents a degree of uncertainty, so lower values are better:

MulticlassClassificationStart

Observe that some languages are harder to detect than others.
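
To get a feeling for the numbers in this diagram, here is a sketch of how a per-class logarithmic loss can be calculated: for every test sample, take the negative natural logarithm of the probability that the model assigned to the correct class, and average these values per class. The LogLossSketch class is hypothetical, and it is an assumption that ML.NET computes its PerClassLogLoss metric in exactly this way:

```csharp
using System;
using System.Linq;

public static class LogLossSketch
{
    // trueClasses[i] is the index of the correct class for sample i;
    // probabilities[i][c] is the predicted probability of class c for sample i.
    public static double[] PerClass(int[] trueClasses, double[][] probabilities, int classCount)
    {
        var sums = new double[classCount];
        var counts = new int[classCount];

        for (int i = 0; i < trueClasses.Length; i++)
        {
            var c = trueClasses[i];
            // Clamp to avoid log(0) when the correct class received zero probability.
            var p = Math.Max(probabilities[i][c], 1e-15);
            sums[c] += -Math.Log(p);
            counts[c]++;
        }

        // Average per class; NaN for classes without any test samples.
        return sums.Select((sum, c) => counts[c] == 0 ? double.NaN : sum / counts[c]).ToArray();
    }
}
```

A model that always assigns probability 1 to the correct class scores 0, and the loss grows quickly when the model is confident about the wrong class.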

Model consumption

The Predict() call takes a piece of text and returns a prediction:

public MulticlassClassificationPrediction Predict(string text)
{
    return Model.Predict(new MulticlassClassificationData(text));
}

The prediction contains the predicted language and a set of scores for each language. Here’s what we do with this information in the sample app:

MulticlassClassification

We are pretty impressed to see how easy it is to build a reliable detector for 6 languages. The trained model would definitely make sense in a lot of .NET applications that we developed in the last couple of years.

Visualizing the results

We decided to use OxyPlot for visualizing the data in the sample app, because it’s light-weight and it does all the graphs we needed. In the previous article in this series we created all the elements programmatically. So this time we’ll focus on the XAML.

Axes and Series

Here’s the declaration of the PlotView with its PlotModel. The model has a CategoryAxis for the languages and a LinearAxis for the log-loss values. The values are represented in a BarSeries:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0"
                Margin="0 0 40 60"
                Grid.Column="1">
    <oxy:PlotView.Model>
        <oxyplot:PlotModel Subtitle="Model Quality"
                            PlotAreaBorderColor="{x:Bind OxyForeground}"
                            TextColor="{x:Bind OxyForeground}"
                            TitleColor="{x:Bind OxyForeground}"
                            SubtitleColor="{x:Bind OxyForeground}">
            <oxyplot:PlotModel.Axes>
                <axes:CategoryAxis Position="Left"
                                    ItemsSource="{x:Bind Languages}"
                                    TextColor="{x:Bind OxyForeground}"
                                    TicklineColor="{x:Bind OxyForeground}"
                                    TitleColor="{x:Bind OxyForeground}" />
                <axes:LinearAxis Position="Bottom"
                                    Title="Logarithmic loss per class (lower is better)"
                                    TextColor="{x:Bind OxyForeground}"
                                    TicklineColor="{x:Bind OxyForeground}"
                                    TitleColor="{x:Bind OxyForeground}" />
            </oxyplot:PlotModel.Axes>
            <oxyplot:PlotModel.Series>
                <series:BarSeries LabelPlacement="Inside"
                                    LabelFormatString="{}{0:0.00}"
                                    TextColor="{x:Bind OxyText}"
                                    FillColor="{x:Bind OxyFill}" />
            </oxyplot:PlotModel.Series>
        </oxyplot:PlotModel>
    </oxy:PlotView.Model>
</oxy:PlotView>

Apart from the OxyColor and OxyThickness values we were able to define the whole diagram in XAML. That’s not too bad for a prerelease NuGet package…

When the page is loaded in the sample app, we fill out the missing declarations, and update the diagram’s UI:

var plotModel = Diagram.Model;
plotModel.PlotAreaBorderThickness = new OxyThickness(1, 0, 0, 1);
Diagram.InvalidatePlot();

Adding the data

After the evaluation of the classification model, we iterate through the quality metrics. We create a BarItem for each language. All items are then added to the series:

var bars = new List<BarItem>();
foreach (var logloss in metrics.PerClassLogLoss)
{
    bars.Add(new BarItem { Value = logloss });
}

(plotModel.Series[0] as BarSeries).ItemsSource = bars;
plotModel.InvalidatePlot(true);

The sample app

The sample app lives here on GitHub. We take the opportunity here to proudly mention that it is featured in the ML.NET Machine Learning Community gallery.

Enjoy!

Machine Learning with ML.NET in UWP: Clustering

This is the first in a series of articles on implementing Machine Learning scenarios in UWP apps. We will use

  • ML.NET for defining, training, evaluating and running Machine Learning models,
  • OxyPlot for visualizing the data, and
  • we’re planning to bring in Math.NET for number crunching – if needed.

All of these are cross platform Open Source technologies, all of these are written in C#, all of these are free, and all of these can be used on the UWP platform, albeit with some -hopefully temporary- restrictions.

Currently the large majority of the online samples on ML.NET are straightforward console apps. That’s fine if you want to learn the API, but we want to figure out how ML.NET behaves in a more hostile enterprise-ish environment, where calculations should not block the UI, data should be visualized in sexy graphs, and architectural constraints may apply. With that in mind, we created a sample UWP MVVM app on GitHub, with pages covering different Machine Learning use cases, like

  • building and using models for clustering, classification, and regression, and
  • analyzing the input data that is used to train these models (that’s what data scientists call feature engineering).

Here’s a screenshot from that app:

MulticlassClassification

Machine Learning

Machine learning is a data science technique that allows computers to use existing data to forecast future behaviors, outcomes, and trends. Using machine learning, computers learn without being explicitly programmed. The forecasts and predictions are produced by so-called ‘models’ that are defined upfront, trained with training data, evaluated, persisted, and then called upon in client apps.

Typical Machine Learning implementations include (source: Princeton):

  • optical character recognition: categorize images of handwritten characters by the letters represented
  • face detection: find faces in images (or indicate if a face is present)
  • spam filtering: identify email messages as spam or non-spam
  • topic spotting: categorize news articles (say) as to whether they are about politics, sports, entertainment, etc.
  • spoken language understanding: within the context of a limited domain, determine the meaning of something uttered by a speaker to the extent that it can be classified into one of a fixed set of categories
  • medical diagnosis: diagnose a patient as a sufferer or non-sufferer of some disease
  • customer segmentation: predict, for instance, which customers will respond to a particular promotion
  • fraud detection: identify credit card transactions (for instance) which may be fraudulent in nature
  • weather prediction: predict, for instance, whether or not it will rain tomorrow

Well, this is exactly what ML.NET covers, and here are the steps to build all of these scenarios:

MachineLearningSteps

The domain of machine learning (and data science in general) is currently dominated by two programming languages: Python (with tools like scikit-learn) and R. These are great environments for data scientists, but are not targeted to developers. Enter ML.NET.

ML.NET

ML.NET is not the first Machine Learning environment from Microsoft (think of the Data Mining components of SQL Server Analysis Services, Cognitive Toolkit, and Azure Machine Learning services), but it is the first framework that targets application developers. ML.NET brings a large set of model-based Machine Learning analytic and prediction capabilities into the .NET world. The framework is built upon .NET Core to run cross-platform on Linux, Windows and MacOS. Developers can define and train a Machine Learning model or reuse an existing model from a 3rd party, and run it in any environment, offline.

ML.NET is mature …

The origins of the ML.NET library go back many years. Shortly after the introduction of the Microsoft .NET Framework in 2002, Microsoft Research began a project called TMSN (“text mining search and navigation”) to enable developers to include ML code in Microsoft products and technologies. The project was very successful, and over the years grew in size and usage internally at Microsoft. Somewhere around 2011 the library was renamed to TLC (“the learning code”). TLC is widely used within Microsoft.

The ML.NET library started as a descendant of TLC, with the Microsoft-specific features removed.

… but not finished yet

ML.NET 0.1 was announced at //Build 2018. The core functionality has existed since 2002, so the first iterations probably focused on modernizing the C# code and hiding or removing the Microsoft-specific stuff. In version 0.5 a large portion of the original pipeline API was moved to the Legacy namespace and replaced by one built on concepts such as Estimators, Transforms and DataView, which is more consistent with well-known frameworks like Scikit-Learn, TensorFlow and Spark. Version 0.7 removed the framework’s dependency on x64 devices. ML.NET is currently at v0.9. It was extended with feature engineering, further reducing the functionality gap with Python and R environments.

You find the ML.NET source code here on GitHub.

What about UWP?

UWP apps are safe to run and easy to install and uninstall in a clean way. To achieve this, they run in a sandbox

  • with a reduced API surface where some Win32 and COM calls are not accessible or even available, and
  • within a security context that restricts access to the external environment (file system, processes, external devices).

When experimenting with ML.NET in UWP we observed the following handful of issues:

  • ML.NET uses Reflection.Emit, which is not allowed in the native compilation step that occurs when compiling UWP apps in release mode,
  • ML.NET uses the MEF composition container from v0.7 onwards, and not all MEF classes are exposed to UWP yet (a known issue), and
  • ML.NET does file and process manipulations on classic API’s that are not available in UWP because of the sandbox, or because they live in a different namespace.

These issues are known by the different teams at Microsoft and we believe that they will be tackled sooner or later (the issues, not the teams :-)). In the meantime we’ll stick to v0.6. Don’t worry: the v0.6 NuGet package has most of the core functionality (load and transform training data into memory, create a model, train the model, evaluate and save the model, use the model) and algorithms, but not the latest API.

Clustering

Clustering is a set of Machine Learning algorithms that help to identify the meaningful groups in a larger population. It is used by social networks and search engines for targeting ads and determining relevance rates. Clustering is a subdomain of unsupervised learning. Its algorithms learn from test data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning algorithms identify commonalities in the data and react based on the presence or absence of such commonalities in each new piece of data.

These diagrams show what clustering does:

CA40FFF2-7AF4-4EF5-9899-ADE67C8EB2A7

Most courses and handbooks on Machine Learning start with clustering, because it’s easy compared to the other algorithms (no labeling, no evaluation, easy visualization if you stick to less than 4D).

So it makes sense to start this ML.NET-OxyPlot-UWP series with an article on clustering.

ML.NET algorithms

There are a few clustering algorithms available, but the K-Means family is the most used, and that’s what ML.NET provides. K-means is often referred to as Lloyd’s algorithm. It takes the number of wanted clusters (‘k’) as its primary input. The algorithm has three steps. It executes the first step and then starts looping between the other two:

  • The first step chooses the initial centroids, with the most basic method being to choose k samples from the dataset.
  • The second step assigns each sample to its nearest centroid.
  • The third step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid.

The algorithm iterates through steps 2 and 3 until the position of the centroids stabilizes.
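
The three steps can be sketched in a few lines of C#. This is a minimal illustration under simplifying assumptions, not the ML.NET implementation: the KMeansSketch class is hypothetical, it uses the first k samples as initial centroids, and it measures distance with the squared Euclidean distance:

```csharp
using System;
using System.Linq;

public static class KMeansSketch
{
    // Returns the index of the cluster assigned to each sample.
    public static int[] Cluster(double[][] samples, int k, int maxIterations = 100)
    {
        // Step 1: use the first k samples as initial centroids (the most basic choice).
        var centroids = samples.Take(k).Select(s => (double[])s.Clone()).ToArray();
        var assignments = new int[samples.Length];

        for (int iteration = 0; iteration < maxIterations; iteration++)
        {
            // Step 2: assign each sample to its nearest centroid.
            var changed = false;
            for (int i = 0; i < samples.Length; i++)
            {
                var nearest = Enumerable.Range(0, k)
                    .OrderBy(c => SquaredDistance(samples[i], centroids[c]))
                    .First();
                if (nearest != assignments[i])
                {
                    assignments[i] = nearest;
                    changed = true;
                }
            }

            // Stop when no sample switched clusters: the centroids have stabilized.
            if (!changed && iteration > 0) break;

            // Step 3: move each centroid to the mean of the samples assigned to it.
            for (int c = 0; c < k; c++)
            {
                var members = samples.Where((s, i) => assignments[i] == c).ToArray();
                if (members.Length == 0) continue; // keep the old centroid of an empty cluster
                for (int d = 0; d < centroids[c].Length; d++)
                {
                    centroids[c][d] = members.Average(m => m[d]);
                }
            }
        }

        return assignments;
    }

    private static double SquaredDistance(double[] a, double[] b)
    {
        return a.Zip(b, (x, y) => (x - y) * (x - y)).Sum();
    }
}
```

The result depends on the initial centroids, which is why production implementations usually apply a smarter initialization than simply taking the first k samples.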

Enough talking, let’s get our hands dirty

Prepare the Solution

The sample app started as a new UWP app, with the following NuGet packages:

  • ML.NET v0.6 (because higher versions don’t work for UWP)
  • OxyPlot latest UWP prerelease
  • .NET Core latest prerelease (because we want to see if and when bugs are fixed)
  • LottieUWP (because we want to show a fancy animation while calculating)

ProjectConfig

As already mentioned we stay in debug mode to avoid Reflection issues, and we target x64 architecture only since we use a pre-v0.7 version of ML.NET.

Solution Architecture

The Visual Studio solution uses a lightweight MVVM-architecture where the

  • Models contain all business logic, the
  • ViewModels make the models accessible to the Views, the
  • Views focus on UI only, and the
  • Services take care of everything that doesn’t fit the previous categories.

As a result, the MVVM Model classes contain code that is pretty close to the existing ML.NET Console app samples. Here is, for example, how a Machine Learning Model is trained:

public void Train(IDataView trainingDataView)
{
    Model = Pipeline.Fit(trainingDataView);
}

In general, MVVM ViewModels make the models accessible to the UI by enabling data binding and change propagation. Machine Learning scenarios typically work with large data files and they run CPU-intensive processes. In the sample app the ViewModels are responsible for pushing all that heavy lifting off the UI thread to keep the app responsive while calculating.

Here’s the pattern: all the model’s CPU-bound actions are made awaitable by wrapping them in a Task. Here’s what such a call looks like:

public Task Train(IDataView trainingDataView)
{
    return Task.Run(() =>
    {
        _model.Train(trainingDataView);
    });
}

This allows the Views to remain responsive. They can update controls and start animations while calculations are in progress. Here’s how a XAML page updates its UI and then starts training a model:

TrainingBox.IsChecked = true;
await ViewModel.Train(trainingDataView);

Nothing fancy, right? Well: if you trained the model synchronously, the checkbox would only update after the calculation.
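If you want to try the pattern outside of UWP, here is a minimal console sketch. The SumOfSquares method is a hypothetical stand-in for a CPU-bound call such as the model’s Train(), and the class and method names are ours, not part of the sample app:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BackgroundWorkSketch
{
    // Hypothetical stand-in for a CPU-bound call such as the model's Train().
    public static long SumOfSquares(int count)
    {
        long total = 0;
        for (int i = 1; i <= count; i++)
        {
            total += (long)i * i;
        }
        return total;
    }

    // Wraps the synchronous work in a Task so that callers can await it.
    public static Task<long> SumOfSquaresAsync(int count)
    {
        return Task.Run(() => SumOfSquares(count));
    }

    // An async Main requires C# 7.1 or later.
    public static async Task Main()
    {
        Console.WriteLine($"Caller thread: {Thread.CurrentThread.ManagedThreadId}");

        // Start the work on a thread pool thread; the caller is not blocked here.
        Task<long> work = SumOfSquaresAsync(1000);
        Console.WriteLine("The caller continues while the work runs.");

        long result = await work;
        Console.WriteLine($"Result: {result}");
    }
}
```

In a UWP page the code after the await resumes on the UI thread, which is why controls can be updated safely right after the call returns.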

Let’s start with a hack

UWP has its own file API with limited access to the drives and its own logical URI schemes to refer to the files. ML.NET is based on the .NET Standard specification, and uses the classic APIs with physical file paths. So we decided to add a helper method that takes a UWP file reference (think ‘ms-appx:///something’), copies the file to local app storage, and then returns the physical path that classic APIs know and love. Here’s that helper:

public static async Task<string> FilePath(string uwpPath)
{
    var originalFile = await StorageFile.GetFileFromApplicationUriAsync(new Uri(uwpPath));
    var storageFolder = ApplicationData.Current.LocalFolder;
    var localFile = await originalFile.CopyAsync(storageFolder, originalFile.Name, NameCollisionOption.ReplaceExisting);
    return localFile.Path;
}

Let’s now dive into ML.NET.

Load the data

In this first scenario we will divide a group of shopping mall customers (borrowed from this Python sample) into 5 clusters. The raw data contains records with a customer id, gender, age, annual income, and a spending score. Notice that we’re in an unsupervised learning environment, so the data is ‘unlabeled’ – there is no cluster id column to learn from:

Clustering_data

The MLContext class is central in the recent ML.NET APIs (somewhat like DbContext in Entity Framework). In v0.6 it was still called LocalEnvironment. We need an instance of it to share across components:

private LocalEnvironment _mlContext = new LocalEnvironment(seed: null); // v0.6

The following code snippet from the MVVM Model shows how a TextLoader is used to describe the content of the file, read it, and return its schema as an IDataView:

public IDataView Load(string trainingDataPath)
{
    var reader = new TextLoader(_mlContext,
                                new TextLoader.Arguments()
                                {
                                    Separator = ",",
                                    HasHeader = true,
                                    Column = new[]
                                        {
                                            new TextLoader.Column("CustomerId", DataKind.I4, 0),
                                            new TextLoader.Column("Gender", DataKind.Text, 1),
                                            new TextLoader.Column("Age", DataKind.I4, 2),
                                            new TextLoader.Column("AnnualIncome", DataKind.R4, 3),
                                            new TextLoader.Column("SpendingScore", DataKind.R4, 4),
                                        }
                                });

    var file = _mlContext.OpenInputFile(trainingDataPath);
    var src = new FileHandleSource(file);
    return reader.Read(src);
}

The ML.NET algorithms only support the data types that are listed in the DataKind enumeration. The properties that we’re interested in (AnnualIncome and SpendingScore) are defined as float. In general, Machine Learning doesn’t like double because its values take more memory and the extra precision does not compensate for that.
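The memory argument is easy to quantify. The sketch below compares the raw storage for the same number of values as float and as double. The row count is a hypothetical example and the helper names are ours:

```csharp
using System;

public static class PrecisionSketch
{
    // Bytes needed to hold the given number of values as float (System.Single).
    public static long BytesAsFloat(long valueCount) => valueCount * sizeof(float);

    // Bytes needed to hold the same number of values as double (System.Double).
    public static long BytesAsDouble(long valueCount) => valueCount * sizeof(double);

    public static void Main()
    {
        // Hypothetical data set: one million rows with two feature columns.
        long valueCount = 1000000L * 2;

        Console.WriteLine($"As float:  {BytesAsFloat(valueCount)} bytes");
        Console.WriteLine($"As double: {BytesAsDouble(valueCount)} bytes");
    }
}
```

A float takes 4 bytes and a double takes 8, so switching to double doubles the raw footprint of the feature data.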

Here’s the call from the MVVM View. We read a file from the app’s Data folder, make a copy of it to the local storage, and pass the full path to the ML.NET code above:

// Prepare the files.
var trainingDataPath = await MlDotNet.FilePath(@"ms-appx:///Data/Mall_Customers.csv");

// Read training data.
var trainingDataView = await ViewModel.Load(trainingDataPath);

The returned IDataView does deferred execution, similar to IEnumerable in LINQ. The data is only physically read when it is consumed, in this case when the model is trained.
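If deferred execution is new to you, this plain LINQ sketch shows the idea. It does not use IDataView; the ReadRows method is a hypothetical row source that counts how many rows were actually produced:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredExecutionSketch
{
    // Counts how many rows were actually read from the (simulated) source.
    public static int RowsRead;

    // Hypothetical row source: a row is only produced when a consumer asks for it.
    public static IEnumerable<int> ReadRows(int count)
    {
        for (int i = 0; i < count; i++)
        {
            RowsRead++;
            yield return i;
        }
    }

    public static void Main()
    {
        // Building the query reads nothing yet.
        IEnumerable<int> query = ReadRows(5).Select(row => row * 10);
        Console.WriteLine($"After defining the query: {RowsRead} rows read");

        // Consuming the query triggers the actual reads.
        List<int> rows = query.ToList();
        Console.WriteLine($"After consuming {rows.Count} rows: {RowsRead} rows read");
    }
}
```

The first line reports 0 rows read and the second reports 5: the source is only touched when the query is consumed.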

Here’s the raw data in a diagram:

Clustering_Raw_Plot

Extract the features

In the second step of this Machine Learning scenario, we prepare the input data and define the algorithm that will be used. These steps are combined into a ‘Learning Pipeline’ that represents the untrained model.

First we plug in a ConcatEstimator to ‘featurize’ the data. We enrich the input data with properties of the name and the type that the algorithm expects. The Clustering model looks for a single column called “Features” with an expected data type. We create this from the columns that we want to cluster on: AnnualIncome and SpendingScore.

The last step in the pipeline represents the algorithm and defines the type of pipeline: the T in EstimatorChain<T>. In a Clustering scenario we will use a KMeansPlusPlusTrainer. The most important parameter is the number of clusters.

Here’s how the pipeline is created:

public void Build()
{
    Pipeline = new ConcatEstimator(_mlContext, "Features", "AnnualIncome", "SpendingScore")
        .Append(new KMeansPlusPlusTrainer(
            env: _mlContext,
            featureColumn: "Features",
            clustersCount: 5,
            advancedSettings: (a) =>
                {
                    // a.AccelMemBudgetMb = 1;
                    // a.MaxIterations = 1;
                    // a. ...
                }
            ));
}

The advanced settings parameter allows you to fine-tune the algorithm, e.g. to optimize it for speed or memory consumption. Of course this comes with a price: the quality of the model will decrease. Here’s what happens when you restrict the number of iterations to just one:

Clustering_bad_config

The initial centroids were used and never moved, so we ended up with green dots within the red cluster, and a dark blue cluster with only two dots.

In a production environment you would iterate through different algorithms and different configurations to create a model that fits your need and does not kill the hardware. It’s good to see that ML.NET supports this.

Train the model

The Machine Learning model is created and trained by feeding the pipeline with the training data using a Fit() call. In the ML.NET API that’s just a one-liner, but behind it is a very complex and CPU-intensive task:

public void Train(IDataView trainingDataView)
{
    Model = Pipeline.Fit(trainingDataView);
}

The resulting model (a TransformerChain<T>) can be immediately used for prediction, or it can be serialized for later use.

Evaluate the model

In the common Machine Learning scenarios we would use two data sets: one to train the model, and another for evaluation. Since clustering belongs to unsupervised learning, there’s no reference data set to evaluate the model.

That’s why we have used a data set that allows easy visual inspection. When looking at the diagram, it is pretty easy to figure out where the five clusters should be. Observe that the KMeans algorithm with standard settings did a good job:

Clustering

Persist the model

The trained model can be persisted to a .zip file for later use or for use on another device with a simple call to the (oddly synchronous) SaveTo() method:

public void Save(string modelName)
{
    var storageFolder = ApplicationData.Current.LocalFolder;
    using (var fs = new FileStream(
            Path.Combine(storageFolder.Path, modelName), 
            FileMode.Create, 
            FileAccess.Write, 
            FileShare.Write))
    {
        Model.SaveTo(_mlContext, fs);
    }
}

Use the model

A trained model’s main purpose is to create predictions based on its input, so we need to define some classes to describe that input and output.

The class that describes the predicted result may contain all input fields (in our sample “AnnualIncome” and “SpendingScore”) and the algorithm-specific output. The KMeans algorithm spawns records with a “PredictedLabel” holding the most relevant cluster id and “Score”, an array with the distances to the respective centroids.

You can use the ColumnName attribute to shape your own data types to these expectations:

    public class ClusteringPrediction
    {
        [ColumnName("PredictedLabel")]
        public uint PredictedCluster;

        [ColumnName("Score")]
        public float[] Distances;

        public float AnnualIncome;

        public float SpendingScore;
    }
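To illustrate how the two output columns relate, here is a small sketch that derives a cluster id from an array of distances. It assumes that the most relevant cluster is the one with the smallest distance and that cluster ids are 1-based, which matches how the sample app subtracts 1 from PredictedCluster before using it as an index. The class name and the distance values are hypothetical:

```csharp
using System;

public static class ScoreSketch
{
    // Returns the 1-based id of the cluster with the smallest distance.
    public static uint NearestCluster(float[] distances)
    {
        int best = 0;
        for (int i = 1; i < distances.Length; i++)
        {
            if (distances[i] < distances[best]) best = i;
        }
        return (uint)(best + 1); // assumption: cluster ids start at 1
    }

    public static void Main()
    {
        // Hypothetical distances from one customer to five centroids.
        var distances = new[] { 310.5f, 12.25f, 980f, 455.75f, 77f };
        Console.WriteLine($"Nearest cluster: {NearestCluster(distances)}");
    }
}
```

For these distances the sketch reports cluster 2, since the second centroid is the closest one.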

Here’s the class that we use to store a single input point:

public class ClusteringData
{
    public float AnnualIncome;

    public float SpendingScore;
}

There are at least two ways to use the model for assigning clusters to input data:

  • the Transform(IDataView) applies the transformation to a whole set of data, and there is also
  • the MakePredictionFunction<Tsrc, Tdst> (an extension method that is now called CreatePredictionEngine<Tsrc, Tdst>). This is a strongly typed method to create a prediction from a single input.

Here’s how both calls are used in the sample app:

public IEnumerable<ClusteringPrediction> Predict(IDataView dataView)
{
    var result = Model.Transform(dataView).AsEnumerable<ClusteringPrediction>(_mlContext, false);
    return result;
}

public ClusteringPrediction Predict(ClusteringData clusteringData)
{
    var predictionFunc = Model.MakePredictionFunction<ClusteringData, ClusteringPrediction>(_mlContext);
    return predictionFunc.Predict(clusteringData);
}

To color the graph in the sample, we asked the model to assign a cluster to each record in the training dataset:

var predictions = await ViewModel.Predict(trainingDataView);

This creates a list of predictions with the input fields and the assigned cluster. That’s what we used to populate the diagram.

The sample page also comes with two textboxes where you can fill out values for annual income and spending score. The trained model will assign the cluster to it. Here’s what that call looks like:

int.TryParse(AnnualIncomeInput.Text, out int annualIncome);
int.TryParse(SpendingScoreInput.Text, out int spendingScore);
var output = await ViewModel.Predict(
	new ClusteringData 
	{ 
		AnnualIncome = annualIncome, 
		SpendingScore = spendingScore 
	});

The result is a single prediction.
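The snippet above ignores the return value of TryParse, so invalid text silently becomes 0. If you prefer to reject bad input before calling the model, a helper along these lines could do the job. This is a sketch only: the TryReadInput name is hypothetical, and parsing with the invariant culture is our assumption, not something the sample app does:

```csharp
using System;
using System.Globalization;

public static class InputSketch
{
    // Reads both text values; returns false when either one is not a valid number.
    public static bool TryReadInput(string incomeText, string scoreText,
                                    out float income, out float score)
    {
        bool incomeOk = float.TryParse(incomeText, NumberStyles.Float,
                                       CultureInfo.InvariantCulture, out income);
        bool scoreOk = float.TryParse(scoreText, NumberStyles.Float,
                                      CultureInfo.InvariantCulture, out score);
        return incomeOk && scoreOk;
    }

    public static void Main()
    {
        if (TryReadInput("60", "42.5", out float income, out float score))
        {
            Console.WriteLine($"Valid input: {income} / {score}");
        }

        if (!TryReadInput("sixty", "42.5", out _, out _))
        {
            Console.WriteLine("Invalid input: no prediction requested.");
        }
    }
}
```

Both values come back as float, so they can be assigned to the ClusteringData fields without a conversion.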

At this point in the scenario, most of the ML.NET sample apps do a Console.WriteLine of the predictions and then stop. But we’re in a UWP app, so we can have a lot more fun. Let’s add some sexy diagrams!

Visualizing the results

Introducing OxyPlot

OxyPlot is an open source plot generation library. It started in 2010 as a simple WPF plotting component, focusing on simplicity, performance and visual appearance. Today it targets multiple .NET platforms through a portable library, making it easy to re-use plotting code on different platforms. The library has implementations for WPF, Windows 8, Windows Phone, Windows Phone Silverlight, Windows Forms, Silverlight, GTK#, Xwt, Xamarin.iOS, Xamarin.Android, Xamarin.Forms, Xamarin.Mac … aaaand … UWP.

OxyPlot is primarily focused on two-dimensional coordinate systems, that’s the reason for the ‘xy’ in the name! Some of its restrictions are

  • lack of support for 3D plots,
  • no animations, and
  • data binding that requires a manual refresh of the plots when your data changes.

The documentation is a work in progress that stalled somewhere back in 2015, but since then the OxyPlot team published more than enough example source code on any diagram type, and there are ‘getting started’ guides for all targeted technology stacks, including UWP and Xamarin.

OxyPlot’s API comes with all types of axes (linear, logarithmic, radial, date, time, category, color), series (area, bar, boxplot, candlestick, function, heat map, contour and many more) and annotations (point, rectangle, image, line, arrow, text, etc.) that you can think of. On top of that, it handles the Batman Equation correctly:

FunctionSeries

PlotView

The only UI control in the package is PlotView. It’s not really documented, but apart from the PlotModel which hosts the content it doesn’t have many properties after all: just look at the source code.

Since we will create everything in the sample app programmatically, the XAML is rather lean:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0" />

Axes

To represent the clusters, we will use two linear axes: a horizontal one for the SpendingScore and a vertical one for the AnnualIncome. Here’s how these are defined:

var foreground = OxyColors.SteelBlue;
var plotModel = new PlotModel
{
    PlotAreaBorderThickness = new OxyThickness(1, 0, 0, 1),
    PlotAreaBorderColor = foreground
};

var linearAxisX = new LinearAxis
{
    Position = AxisPosition.Bottom,
    Title = "Spending Score",
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
};

plotModel.Axes.Add(linearAxisX);
var linearAxisY = new LinearAxis
{
    Maximum = 140,
    Title = "Annual Income",
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
};
plotModel.Axes.Add(linearAxisY);

Please note that we could have defined this in XAML too.

Series

The dots that represent the training data are assigned to 5 ScatterSeries instances (one per cluster). Each series comes with its own color:

for (int i = 0; i < 5; i++)
{
    var series = new ScatterSeries
    {
        MarkerType = MarkerType.Circle,
        MarkerFill = _colors[i]
    };

    plotModel.Series.Add(series);
}

All we need to do now to prepare the diagram, is hook the PlotModel into the PlotView:

Diagram.Model = plotModel;

To (re)plot the diagram, we loop through the list of predictions on the training dataset, create a new ScatterPoint for each prediction, and add it to the appropriate cluster series:

foreach (var prediction in predictions)
{
    (Diagram.Model.Series[(int)prediction.PredictedCluster - 1] as ScatterSeries).Points.Add(
        new ScatterPoint
        (
            prediction.SpendingScore,
            prediction.AnnualIncome
        ));
}

When this code is executed you’ll see … nothing. You need to tell the diagram that its content was updated:

Diagram.InvalidatePlot();

Now you end up with the 5 colored clusters that we’ve seen before:

Clustering

Annotations

The extra dot for the single prediction is visualized as a PointAnnotation. We gave it the color of the assigned cluster, but a different shape and an extra label to easily spot it in the diagram:

var annotation = new PointAnnotation
    {
        Shape = MarkerType.Diamond,
        X = output.SpendingScore,
        Y = output.AnnualIncome,
        Fill = _colors[(int)output.PredictedCluster - 1],
        TextColor = OxyColors.SteelBlue,
        Text = "Here"
    };
Diagram.Model.Annotations.Add(annotation);
Diagram.InvalidatePlot();

Here’s what this looks like:

Cluster_Annotation

To finish this article, let’s promote some extra OxyPlot features.

Tracker

OxyPlot comes with a templatable tracker. By default it appears when you left-click on an item in a series, but you can also make it permanently visible. It shows a crosshair and the details of the underlying item, very convenient:

Clustering_bad_config_tracker

Zooming and Panning

By scrolling the mouse wheel you can zoom in and out, and with the right mouse button down you can pan the diagram:

Clustering_MouseActions

You want more?

This was just the first of a list of articles on ML.NET in UWP. So far, we are very impressed!

If you want to play with the sample app, it lives here on GitHub.

Enjoy!