Category Archives: Math.NET

Machine Learning with ML.NET in UWP: Feature Correlation Analysis

In this article we show how to perform Feature Correlation Analysis and display the results in a Heat Map in the context of Machine Learning in UWP. It’s the fourth in a series that started here, on implementing Machine Learning scenarios in UWP using Open Source frameworks and components such as

All articles in the series revolve around a single UWP sample app that lives here on GitHub. Here’s how the Feature Correlation Analysis page looks like:

HeatMap

It displays the correlation between different properties in the popular Titanic passengers dataset: age, fare, ticket class, whether the passenger was accompanied with siblings, spouses, parents or children, and whether he or she survived the trip.

The darker red or blue squares on the heat map indicate that the corresponding properties on X and Y axis have a higher correlation with each other. Higher correlation is a warning sign for possible negative impact on the classification model when both features would be added to the training data.

Feature Correlation Analysis

Feature Correlation Analysis in Machine Learning

The topic of this article is Feature Correlation Analysis. Just like in the previous article -Feature Distribution Analysis- we are in the “data preparation” phase of a Machine Learning scenario. We’re not training or even defining models yet, we’re selecting the features to train them with. An ideal feature set contains features that are highly correlated with the classification (in ML.NET terminology: the Label Column), yet uncorrelated to each other.

Identifying the right feature set highly impacts the quality and performance of the subsequent learning and generalization steps. Here are two important reasons not to keep on adding feature columns to a training data set:

  • while the predictive power of a classifier first increases with the number of dimensions/features used, there comes a break point where it decreases again (the so-called Curse of Dimensionality). The training data set is always a finite set of samples with discrete values, while the prediction space may be infinite and continuous in all dimensions. So the more features a data set with fixed a number of samples has, the less representative it may become. Secondly,
  • there’s also the cost incurred by adding features. Two features that are highly correlated with each other don’t add much value to a classifier, but they sure add cost at training, persisting and/or inference time. Machine Learning activities can be pretty resource intensive on CPU, memory, and elapsed time, so it sure makes sense to limit the number or features.

The process of converting a set of observations of possibly correlated variables into a smaller set of values of linearly uncorrelated variables is called Principal Component Analysis. This technique was invented by Karl Pearson, the same person that defined its main instrument: the Pearson correlation coefficient – a measure of the linear correlation between two variables X and Y. Pearson’s correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. It has a value between +1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation.

During Principal Component Analysis a matrix is calculated with the correlation between each pair of features. Highly correlated feature pairs are then ‘sanitized’ by removing one or combining both (e.g. by creating a new feature by multiplying the variables’ values). At the same time you also calculate the correlation between each attribute and the output variable (the Label), to select only those attributes that have a moderate-to-high positive or negative correlation (close to -1 or 1) and drop those attributes with a low correlation (value close to zero).

Feature Correlation Analysis with ML.NET and Math.NET

Data Preparation is outside the core business of ML.NET itself, but for retrieving and manipulating the candidate training data we can count one on its most important spin-off components: the DataView API.

DataViewCentric

We can fetch the samples, optionally filter and add missing data, and then pivot it into arrays of feature values (exactly what we did in the previous article). Al we need is TextLoader to create an IDataView from the data set, get the column names from its Schema, call GetColumn() to get the array, and ‘upgrade’ the data type to double:

var reader = new TextLoader(_mlContext,
                            new TextLoader.Arguments()
                            {
                                Separator = ",",
                                HasHeader = true,
                                Column = new[]
                                    {
                                    new TextLoader.Column("Survived", DataKind.R4, 1),
                                    new TextLoader.Column("PClass", DataKind.R4, 2),
                                    new TextLoader.Column("Age", DataKind.R4, 5),
                                    new TextLoader.Column("SibSp", DataKind.R4, 6),
                                    new TextLoader.Column("Parch", DataKind.R4, 7),
                                    new TextLoader.Column("Fare", DataKind.R4, 9)
                                    }
                            });
var dataView = reader.Read(src);
var result = new List<List<double>>();
for (int i = 0; i < dataView.Schema.ColumnCount; i++)
{
    var columnName = dataView.Schema.GetColumnName(i);
    result.Add(dataView.GetColumn<float>(_mlContext, columnName).Select(f => (double)f).ToList());
}

return result;

Now that we have all the feature values in arrays, it’s time to calculate the correlations. The MathNet.Numerics.Statistics.Correlation class from Math.NET hosts implementations for several Pearson, Spearman, and other correlation calculations.

We decided to make a copy of the code that calculates the Pearson correlation between two IEnumerable<double> instances:

/// <summary>
/// Computes the Pearson Product-Moment Correlation coefficient.
/// </summary>
/// <param name="dataA">Sample data A.</param>
/// <param name="dataB">Sample data B.</param>
/// <returns>The Pearson product-moment correlation coefficient.</returns>
/// <remarks>Original Source: https://github.com/mathnet/mathnet-numerics/blob/master/src/Numerics/Statistics/Correlation.cs </remarks>
public static double Pearson(IEnumerable<double> dataA, IEnumerable<double> dataB)
{
    var n = 0;
    var r = 0.0;

    var meanA = 0d;
    var meanB = 0d;
    var varA = 0d;
    var varB = 0d;

    using (IEnumerator<double> ieA = dataA.GetEnumerator())
    using (IEnumerator<double> ieB = dataB.GetEnumerator())
    {
        while (ieA.MoveNext())
        {
            if (!ieB.MoveNext())
            {
                throw new ArgumentOutOfRangeException(nameof(dataB), "Array too short.");
            }

            var currentA = ieA.Current;
            var currentB = ieB.Current;

            var deltaA = currentA - meanA;
            var scaleDeltaA = deltaA / ++n;

            var deltaB = currentB - meanB;
            var scaleDeltaB = deltaB / n;

            meanA += scaleDeltaA;
            meanB += scaleDeltaB;

            varA += scaleDeltaA * deltaA * (n - 1);
            varB += scaleDeltaB * deltaB * (n - 1);
            r += (deltaA * deltaB * (n - 1)) / n;
        }

        if (ieB.MoveNext())
        {
            throw new ArgumentOutOfRangeException(nameof(dataA), "Array too short.");
        }
    }

    return r / Math.Sqrt(varA * varB);
}

For the sake of completeness: that same Math.NET class also hosts code to calculate the whole matrix. For the sample app this would require importing a lot more code (linear algebra classes such as Matrix) or adding the whole NuGet package to the project.

Here’s how the sample app calculates the whole correlation matrix:

// Read data
var matrix = await ViewModel.LoadCorrelationData();

// Populate diagram
var data = new double[6, 6];
for (int x = 0; x < 6; ++x)
{
    for (int y = 0; y < 5 - x; ++y)
    {
        var seriesA = matrix[x];
        var seriesB = matrix[5 - y];

        var value = Statistics.Pearson(seriesA, seriesB);

        data[x, y] = value;
        data[5 - y, 5 - x] = value;
    }

    data[x, 5 - x] = 1;
}

All we now need is a way to properly visualize it.

Correlation Heat Maps

A heat map is a representation of data in which the values are represented by colors. They are ideal to highlight patterns and extreme values in rectangular data such as matrixes.

Correlation Heat Maps in Machine Learning

In Machine Learning, heat maps are used to display correlations between feature values. The typical (“Pearson”) color scheme is a gradient that goes from

  • red for high positive correlation (value +1), over
  • white for no correlation (value 0), to
  • blue for high negative correlation (value –1).

Sometimes the values are normalized (brought to a range from 0 to 1) like in the next image. When there are a lot of features, the value labels are omitted in the the diagram. Since correlation is commutative (the correlation between A and B is the same as the correlation between B and A) it suffices to only display half of the matrix, like this:

26821073-15081119795115542

This diagram also omitted the correlations on the diagonal: the red squares that indicate the full positive correlation between each feature and itself.

Here’s an example of a diagram (from here) showing less features. It’s common to display the whole matrix and the value labels:

heatmap_2.994292b9

The above diagram was created with Python (Pandas and Seaborn) and shows the correlation between all the numerical values in the already mentioned Titanic Passengers dataset.

Here’s the UWP sample app version of the very same diagram, calculated with ML.NET and Math.NET and visualized with OxyPlot:

HeatMapZoom

The small differences in correlations for the Age feature are caused by the sample app not compensating missing values. We have the matrix, let’s plot this diagram.

Correlation Heat Maps with OxyPlot

To draw an OxyPlot diagram, you start with placing a PlotView element in your XAML:

<oxy:PlotView x:Name="Diagram"
                Background="Transparent"
                BorderThickness="0" />

Then you can declaratively or programmatically decorate it with a PlotModel and different Axis instances. A correlation heat map uses a CategoryAxis in both dimensions:

plotModel.Axes.Add(new CategoryAxis
{
    Position = AxisPosition.Bottom,
    Key = "HorizontalAxis",
    ItemsSource = new[]
    {
        "Survived",
        "Class",
        "Age",
        "Sib / Sp",
        "Par / Chi",
        "Fare"
    },
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
});

plotModel.Axes.Add(new CategoryAxis
{
    Position = AxisPosition.Left,
    Key = "VerticalAxis",
    ItemsSource = new[]
    {
        "Fare",
        "Parents / Children",
        "Siblings / Spouses",
        "Age",
        "Class",
        "Survived"
    },
    TextColor = foreground,
    TicklineColor = foreground,
    TitleColor = foreground
});

The legend on top of the diagram is an extra LinearColorAxis in the appropriate OxyPalette:

plotModel.Axes.Add(new LinearColorAxis
{
    // Pearson color scheme from blue over white to red.
    Palette = OxyPalettes.BlueWhiteRed31,
    Position = AxisPosition.Top,
    Minimum = -1,
    Maximum = 1,
    TicklineColor = OxyColors.Transparent
});

If you’re not entirely satisfied with the color scheme, feel free to create your own custom OxyPalette instance: it’s just a 3-color gradient.

The matrix itself is a HeatMapSeries with 6 values in each dimension, rendered as rectangles:

var heatMapSeries = new HeatMapSeries
{
    X0 = 0,
    X1 = 5,
    Y0 = 0,
    Y1 = 5,
    XAxisKey = "HorizontalAxis",
    YAxisKey = "VerticalAxis",
    RenderMethod = HeatMapRenderMethod.Rectangles,
    LabelFontSize = 0.12,
    LabelFormatString = ".00"
};

plotModel.Series.Add(heatMapSeries);

Diagram.Model = plotModel;

To display the label in the square you need to set the LabelFormatString. This will only be applied if you also set a value to LabelFontSize.

OxyPlot does not support the triangular version of the heat map. Missing values always get the default value and you can’t make them transparent:

HalfMap

To populate the diagram, assign the Data to the series, and refresh the plot:

(plotModel.Series[0] as HeatMapSeries).Data = data;

// Update diagram
Diagram.InvalidatePlot();

Again, here’s the resulting heat map in the sample app:

HeatMap

Interpretation

The dark blue square on the diagram reveals a relatively high negative correlation between the Passenger Class and Ticket Fare. This means that the value for the one can easily be derived from the other – think “first class tickets cost more than second class tickets”. Adding both as a feature would not add more value than adding only one of them.

Data scientists would probably extract a new feature from these two (something like “Luxury”) or would break up “Passenger Class” and “Ticket Fare” in more basic components, like locations on the ship that passengers had access to. Anyway, the heat map clearly highlights the feature combinations that need further analysis.

Source

In this article we used components from ML.NET, Math.NET and OxyPlot to calculate and visualize the correlation heat map on candidate training data for a classification model.

The UWP sample app host more Machine Learning scenarios. It lives here on GitHub.

Enjoy!

Advertisements

Machine Learning with ML.NET in UWP: Feature Distribution Analysis

Welcome to the third article in this series on implementing Machine Learning scenarios with Open Source technologies in UWP apps. We were using ML.NET for modeling and OxyPlot for data visualization, and we will continue to do so. For this article we added Math.NET to our shopping basket: to calculate some statistics that are important when analyzing the input data. We will analyze the distribution of values for columns in the candidate model training data to detect whether these columns would be a useful as feature, or whether they need filtering, or whether they should be ignored at all.

Feature Analysis in Machine Learning

When it comes to the ‘Garbage in – Garbage out’ principle, Machine Learning is not different from any other process. Before training a model data scientists will perform Feature Analysis to identify the features that are most useful in solving the problem. This includes steps such as:

  • analyzing the distribution of values,
  • looking for missing data and deciding whether to ignore, replace, or reject it,
  • analyzing correlation between columns, and
  • combining columns into new candidate features (so called Feature Engineering)

To illustrate why Feature Analysis is important, just take a look at the following diagram that shows the predicted NBA player salary (red) versus the real salary (blue) in the Regression page of our sample app:

Regression

It’s pretty clear that the trained algorithm is not very useful: it does not look significantly better than a random prediction. Next to the vertical axis you even see the model predicting negative salaries. This bad performance is partly because we did not do any analysis before just feeding the model with raw data.

Feature Analysis in ML.NET

Although the data preparation step is not the core business of ML.NET, we still have good news. The announcement of ML.NET 0.10.0 revealed IDataView to become a shared type across libraries in the .NET ecosystem. If you dive into the IDataView Design Principles you’ll observe that the DataView can be a core component in analyzing raw data sets, since it allows

  • cursoring,
  • large data,
  • caching,
  • peeking,
  • and so on.

You’ll observe that we will use IDataView properties and (extension) methods in this article to stay as close as possible to the following schema:

DataViewCentric

Introducing the Box Plot

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data. It is based on the five number summary:

  • minimum,
  • first quartile,
  • median,
  • third quartile, and
  • maximum.

In the simplest box plot the central rectangle spans the first quartile to the third quartile: the interquartile range or IQR – the likely range of variation. A segment inside the rectangle shows the median (the typical value) and “whiskers” above and below the box show the locations of the minimum and maximum. The so-called Tukey boxplot uses the lowest value still within 1.5 IQR of the lower quartile, and the highest value still within 1.5 IQR of the upper quartile respectively as minimum and maximum. It represents the values outside that boundary (the extreme values) as ‘outlier’ dots:

BoxPlotParts

The box plot can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

Box plots may reveal whether or not you should use all of the values for training the model, and which algorithm you should prefer (some algorithms assume a symmetrical distribution). Here’s an article with more details on how to interpret box plots.

Box plot in OxyPlot

Most of the data visualization frameworks for UWP support box plots, and OxyPlot is not an exception. All you need to do is insert a PlotView control in your XAML:

<oxy:PlotView 
	x:Name="RegressionDiagram"
	Background="Transparent"
	BorderThickness="0" />

The control’s PlotModel is populated with a BoxPlotSeries that is displayed against a CategoryAxis for the property names and a LinearAxis for the values. Check the previous articles in this blog series on how to do define a model and its axes in XAML and C#.

In our sample app we wanted to highlight the distributions having outliers. We added two series to the model – each with a different color:

var cleanSeries = new BoxPlotSeries
{
    Stroke = foreground,
    Fill = OxyColors.DarkOrange
};
plotModel.Series.Add(cleanSeries);

var outlinerSeries = new BoxPlotSeries
{
    Stroke = foreground,
    Fill = OxyColors.Firebrick,
    OutlierSize = 4
};
plotModel.Series.Add(outlinerSeries);

The next step is to provide BoxPlotItem instances for the series, but first we need to calculate these.

Enter Math.NET.

Boxplot in Math.NET

Math.NET is an Open Source C# library covering fundamental mathematics. Its code is distributed over several NuGet packages covering domains such as numerical computing, algebra, signal processing, and geometry. Math.NET Numerics is the package that contains the functions from descriptive statistics that we’re interested in. It targets .NET 4.0 and higher, including Mono and .NET Standard 1.3 and higher. The sample app does not use the NuGet package however. Because the source code is effective and very well structured -what else did you expect from mathematicians- it was easy to identify and grab the source code for the calculations that we wanted, so that’s what we did.

The SortedArrayStatistics class contains all the functions we need (Median, Quartiles, Quantiles) and more.

Boxplot in the Sample App

To draw a box plot, we first need to get the data. ML.NET uses a TextLoader for this:

var trainingDataPath = await MlDotNet.FilePath(@"ms-appx:///Data/Mall_Customers.csv");
var reader = new TextLoader(_mlContext,
                            new TextLoader.Arguments()
                            {
                                Separator = ",",
                                HasHeader = true,
                                Column = new[]
                                    {
                                    new TextLoader.Column("Age", DataKind.R4, 2),
                                    new TextLoader.Column("AnnualIncome", DataKind.R4, 3),
                                    new TextLoader.Column("SpendingScore", DataKind.R4, 4),
                                    }
                            });

var file = _mlContext.OpenInputFile(trainingDataPath);
var src = new FileHandleSource(file);
var dataView = reader.Read(src);

The result of the Read() method is an IDataView tabular structure. We can query its Schema to find out column names, and with the GetColumn() extension method we can fetch all values for the specified column.

This is how the sample app basically pivots the data view from a list of rows to a list of columns:

var result = new List<List<double>>();
for (int i = 0; i < dataView.Schema.ColumnCount; i++)
{
    var columnName = dataView.Schema.GetColumnName(i);
    result.Add(dataView
	.GetColumn<float>(_mlContext, columnName)
	.Select(f => (double)f)
	.ToList());
}

return result;

Notice that we switched from float (the low-memory favorite type in ML.NET) to double (the high-precision favorite type in Math.NET).

The array of column values is used to build a BoxPlotItem to be added to the PlotModel:

// Read data
var regressionData = await ViewModel.LoadRegressionData();

// Populate diagram
for (int i = 0; i < regressionData.Count; i++)
{
    AddItem(plotModel, regressionData[i], i);
}

Here’s the code to calculate all the box plot constituents. Remember to sort the array first, since we rely on Math.ML’s Sorted Array Statistics here:

values.Sort();
var sorted = values.ToArray();

// Box: Q1, Q2, Q3
var median = sorted.Median();
var firstQuartile = sorted.LowerQuartile();
var thirdQuartile = sorted.UpperQuartile();

// Whiskers
var interQuartileRange = thirdQuartile - firstQuartile;
var step = interQuartileRange * 1.5;
var upperWhisker = thirdQuartile + step;
upperWhisker = sorted.Where(v => v <= upperWhisker).Max();
var lowerWhisker = firstQuartile - step;
lowerWhisker = sorted.Where(v => v >= lowerWhisker).Min();

// Outliers
var outliers = sorted.Where(v => v < lowerWhisker || v > upperWhisker).ToList();

Here’s the creation of the OxyPlot box plot item itself. The first parameter refers to the category index:

var item = new BoxPlotItem(
    x: slot,
    lowerWhisker: lowerWhisker,
    boxBottom: firstQuartile,
    median: median,
    boxTop: thirdQuartile,
    upperWhisker: upperWhisker)
{
    Outliers = outliers
};

In the following code snippet we assign the new item to one of the two series (with and without outliers) to obtain the wanted color scheme:

if (outliers.Any())
{
    (plotModel.Series[1] as BoxPlotSeries).Items.Add(item);
}
else
{
    (plotModel.Series[0] as BoxPlotSeries).Items.Add(item);
}

This is how the final result looks like. The diagram on the left shows the raw data for the Regression sample in the app. Notice that all properties are distributed asymmetrically and come with lots of outliers. That’s not a solid base for creating a prediction model:

BoxPlot

The diagram on the right shows the data for the Clustering sample. This was our very first ML.NET project so we decided to use a proven and cleaned data set, and this shows in the box plot. For the sake of completeness, here’s that Clustering sample again. Its prediction model works very well:

Clustering

One more thing about the box plot. When you right click a shape on the diagram, OxyPlot shows its tracker with the details:

FeatureDistributionAnalysis_Tracker

When outliers are identified in the analysis, you may decide to skip these when training the model, using FilterByColumn(). Check this sample code for more details.

Source

In this article we demonstrated how to build box plot diagrams in UWP, using ML.NET, Math.NET and OxyPlot. Even when ML.NET is not targeting Feature Analysis, its IDataView API is very helpful in getting the column data.

The sample app lives here on GitHub.

Enjoy!