Welcome to the third article in this series on implementing Machine Learning scenarios with Open Source technologies in UWP apps. We have been using ML.NET for modeling and OxyPlot for data visualization, and we will continue to do so. For this article we added Math.NET to our shopping basket, to calculate some statistics that are important when analyzing the input data. We will analyze the distribution of values in the columns of the candidate training data to detect whether these columns would be useful as features, whether they need filtering, or whether they should be ignored altogether.

# Feature Analysis in Machine Learning

When it comes to the ‘Garbage in – Garbage out’ principle, Machine Learning is no different from any other process. Before training a model, data scientists perform Feature Analysis to identify the features that are most useful in solving the problem. This includes steps such as:

- analyzing the distribution of values,
- looking for missing data and deciding whether to ignore, replace, or reject it,
- analyzing correlation between columns, and
- combining columns into new candidate features (so-called Feature Engineering).

To illustrate why Feature Analysis is important, just take a look at the following diagram that shows the predicted NBA player salary (red) versus the real salary (blue) in the Regression page of our sample app:

It’s pretty clear that the trained algorithm is not very useful: it does not look significantly better than a random prediction. Next to the vertical axis you even see the model predicting negative salaries. This bad performance is partly because we did not do any analysis before just feeding the model with raw data.

# Feature Analysis in ML.NET

Although data preparation is not the core business of ML.NET, we still have good news. The announcement of ML.NET 0.10.0 revealed that IDataView is to become a shared type across libraries in the .NET ecosystem. If you dive into the IDataView Design Principles you’ll observe that the DataView can be a core component in analyzing raw data sets, since it allows

- cursoring,
- large data,
- caching,
- peeking,
- and so on.

You’ll observe that we will use IDataView properties and (extension) methods in this article to stay as close as possible to the following schema:

# Introducing the Box Plot

The box plot (a.k.a. box and whisker diagram) is a standardized way of displaying the distribution of data. It is based on the five number summary:

- minimum,
- first quartile,
- median,
- third quartile, and
- maximum.

In the simplest box plot the central rectangle spans the first quartile to the third quartile: the interquartile range or IQR – **the likely range of variation**. A segment inside the rectangle shows the median (**the typical value**) and “whiskers” above and below the box show the locations of the minimum and maximum. The so-called Tukey boxplot uses the lowest value still within 1.5 IQR of the lower quartile, and the highest value still within 1.5 IQR of the upper quartile respectively as minimum and maximum. It represents the values outside that boundary (**the extreme values**) as ‘outlier’ dots:

The box plot can tell you about your outliers and what their values are. It can also tell you if your data is symmetrical, how tightly your data is grouped, and if and how your data is skewed.

Box plots may reveal whether or not you should use all of the values for training the model, and which algorithm you should prefer (some algorithms assume a symmetrical distribution). Here’s an article with more details on how to interpret box plots.
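To make the Tukey rule described above concrete, here is a small worked example. The quartile values (20 and 30) are made up for the illustration:

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 = 30 - 20 = 10 \\
\text{lower fence} &= Q_1 - 1.5 \cdot \mathrm{IQR} = 20 - 15 = 5 \\
\text{upper fence} &= Q_3 + 1.5 \cdot \mathrm{IQR} = 30 + 15 = 45
\end{aligned}
```

The whiskers then end at the smallest and largest observed values inside the interval from 5 to 45, and a value such as 52 would be drawn as an outlier dot.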

## Box plot in OxyPlot

Most of the data visualization frameworks for UWP support box plots, and OxyPlot is no exception. All you need to do is insert a PlotView control in your XAML:

<oxy:PlotView x:Name="RegressionDiagram" Background="Transparent" BorderThickness="0" />

The control’s PlotModel is populated with a BoxPlotSeries that is displayed against a CategoryAxis for the property names and a LinearAxis for the values. Check the previous articles in this blog series on how to define a model and its axes in XAML and C#.
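For readers who have not seen the earlier articles, the model and its two axes could be set up roughly as follows. This is a minimal sketch and not the sample app's code: `BoxPlotModelFactory` and `Create` are hypothetical names, and placing the categories along the bottom is an assumption.

```csharp
using OxyPlot;
using OxyPlot.Axes;

public static class BoxPlotModelFactory
{
    // Hypothetical helper: builds an empty model with one category per column name.
    public static PlotModel Create(params string[] columnNames)
    {
        var plotModel = new PlotModel();

        // Category axis: one slot per column name.
        // A BoxPlotItem's x value is the index of its slot.
        var categories = new CategoryAxis { Position = AxisPosition.Bottom };
        foreach (var name in columnNames)
        {
            categories.Labels.Add(name);
        }
        plotModel.Axes.Add(categories);

        // Value axis: the column values themselves.
        plotModel.Axes.Add(new LinearAxis { Position = AxisPosition.Left });

        return plotModel;
    }
}
```

A call such as `BoxPlotModelFactory.Create("Age", "AnnualIncome", "SpendingScore")` would yield a model with three category slots, ready to receive box plot series.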

In our sample app we wanted to highlight the distributions having outliers. We added two series to the model – each with a different color:

    // Series for distributions without outliers
    var cleanSeries = new BoxPlotSeries
    {
        Stroke = foreground,
        Fill = OxyColors.DarkOrange
    };
    plotModel.Series.Add(cleanSeries);

    // Series for distributions with outliers
    var outlierSeries = new BoxPlotSeries
    {
        Stroke = foreground,
        Fill = OxyColors.Firebrick,
        OutlierSize = 4
    };
    plotModel.Series.Add(outlierSeries);

The next step is to provide BoxPlotItem instances for the series, but first we need to calculate these.

Enter Math.NET.

## Boxplot in Math.NET

Math.NET is an Open Source C# library covering fundamental mathematics. Its code is distributed over several NuGet packages covering domains such as numerical computing, algebra, signal processing, and geometry. Math.NET Numerics is the package that contains the functions from descriptive statistics that we’re interested in. It targets .NET 4.0 and higher (including Mono) and .NET Standard 1.3 and higher. The sample app does not use the NuGet package, however. Because the source code is effective and very well structured (what else did you expect from mathematicians?) it was easy to identify and grab the source code for the calculations that we wanted, so that’s what we did.

The SortedArrayStatistics class contains all the functions we need (Median, Quartiles, Quantiles) and more.
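To give an idea of what such functions do, here is a minimal, self-contained sketch of quartiles on a sorted array. The `SortedQuantiles` class and its linear-interpolation rule are assumptions chosen for illustration and are not Math.NET's code; Math.NET's own quantile definition may differ, so results on small samples need not match exactly.

```csharp
using System;

public static class SortedQuantiles
{
    // Quantile of an ascending-sorted array, using linear interpolation
    // between the two closest ranks (position = tau * (n - 1)).
    public static double Quantile(double[] sorted, double tau)
    {
        if (sorted == null || sorted.Length == 0)
            throw new ArgumentException("At least one value is required.", nameof(sorted));
        if (tau < 0.0 || tau > 1.0)
            throw new ArgumentOutOfRangeException(nameof(tau));

        double position = tau * (sorted.Length - 1);
        int lower = (int)Math.Floor(position);
        int upper = (int)Math.Ceiling(position);
        double fraction = position - lower;

        return sorted[lower] + fraction * (sorted[upper] - sorted[lower]);
    }

    public static double LowerQuartile(double[] sorted) => Quantile(sorted, 0.25);
    public static double Median(double[] sorted) => Quantile(sorted, 0.5);
    public static double UpperQuartile(double[] sorted) => Quantile(sorted, 0.75);
}
```

For the sorted values 1, 2, 3, 4, 5 this sketch yields 2, 3 and 4 for the lower quartile, the median and the upper quartile.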

## Boxplot in the Sample App

To draw a box plot, we first need to get the data. ML.NET uses a TextLoader for this:

    var trainingDataPath = await MlDotNet.FilePath(@"ms-appx:///Data/Mall_Customers.csv");
    var reader = new TextLoader(_mlContext,
        new TextLoader.Arguments()
        {
            Separator = ",",
            HasHeader = true,
            Column = new[]
            {
                new TextLoader.Column("Age", DataKind.R4, 2),
                new TextLoader.Column("AnnualIncome", DataKind.R4, 3),
                new TextLoader.Column("SpendingScore", DataKind.R4, 4),
            }
        });

    var file = _mlContext.OpenInputFile(trainingDataPath);
    var src = new FileHandleSource(file);
    var dataView = reader.Read(src);

The result of the Read() method is an IDataView tabular structure. We can query its Schema to find out column names, and with the GetColumn() extension method we can fetch all values for the specified column.

This is how the sample app basically pivots the data view from a list of rows to a list of columns:

    var result = new List<List<double>>();
    for (int i = 0; i < dataView.Schema.ColumnCount; i++)
    {
        var columnName = dataView.Schema.GetColumnName(i);
        result.Add(dataView
            .GetColumn<float>(_mlContext, columnName)
            .Select(f => (double)f)
            .ToList());
    }

    return result;

Notice that we switched from float (the low-memory favorite type in ML.NET) to double (the high-precision favorite type in Math.NET).

Each list of column values is used to build a BoxPlotItem that is added to the PlotModel:

    // Read data
    var regressionData = await ViewModel.LoadRegressionData();

    // Populate diagram
    for (int i = 0; i < regressionData.Count; i++)
    {
        AddItem(plotModel, regressionData[i], i);
    }

Here’s the code to calculate all the box plot constituents. Remember to sort the values first, since we rely on Math.NET’s SortedArrayStatistics here:

    values.Sort();
    var sorted = values.ToArray();

    // Box: Q1, Q2, Q3
    var median = sorted.Median();
    var firstQuartile = sorted.LowerQuartile();
    var thirdQuartile = sorted.UpperQuartile();

    // Whiskers
    var interQuartileRange = thirdQuartile - firstQuartile;
    var step = interQuartileRange * 1.5;
    var upperWhisker = thirdQuartile + step;
    upperWhisker = sorted.Where(v => v <= upperWhisker).Max();
    var lowerWhisker = firstQuartile - step;
    lowerWhisker = sorted.Where(v => v >= lowerWhisker).Min();

    // Outliers
    var outliers = sorted.Where(v => v < lowerWhisker || v > upperWhisker).ToList();
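To see the whisker and outlier logic in isolation, here is a self-contained sketch that takes the quartiles as inputs and keeps the fences (the 1.5 IQR boundaries) separate from the whiskers (the most extreme observed values inside those boundaries). `TukeyFences` and `Compute` are hypothetical names and not part of the sample app.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class TukeyFences
{
    // Returns the whisker ends and the outliers for values that are already
    // sorted ascending, given precomputed first and third quartiles.
    public static (double LowerWhisker, double UpperWhisker, List<double> Outliers) Compute(
        IReadOnlyList<double> sorted, double firstQuartile, double thirdQuartile)
    {
        double step = 1.5 * (thirdQuartile - firstQuartile);
        double lowerFence = firstQuartile - step;
        double upperFence = thirdQuartile + step;

        // Whiskers end at the most extreme observed values inside the fences.
        double lowerWhisker = sorted.Where(v => v >= lowerFence).Min();
        double upperWhisker = sorted.Where(v => v <= upperFence).Max();

        // Everything outside the fences is reported as an outlier.
        var outliers = sorted.Where(v => v < lowerFence || v > upperFence).ToList();

        return (lowerWhisker, upperWhisker, outliers);
    }
}
```

For the values 1, 10, 11, 12, 13, 14, 15, 30 with quartiles 10 and 15 (picked by hand for the example), the fences are 2.5 and 22.5, the whiskers end at 10 and 15, and the values 1 and 30 are outliers.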

Here’s the creation of the OxyPlot box plot item itself. The first parameter refers to the category index:

    var item = new BoxPlotItem(
        x: slot,
        lowerWhisker: lowerWhisker,
        boxBottom: firstQuartile,
        median: median,
        boxTop: thirdQuartile,
        upperWhisker: upperWhisker)
    {
        Outliers = outliers
    };

In the following code snippet we assign the new item to one of the two series (with and without outliers) to obtain the wanted color scheme:

    if (outliers.Any())
    {
        (plotModel.Series[1] as BoxPlotSeries).Items.Add(item);
    }
    else
    {
        (plotModel.Series[0] as BoxPlotSeries).Items.Add(item);
    }

This is what the final result looks like. The diagram on the left shows the raw data for the Regression sample in the app. Notice that all properties are distributed asymmetrically and come with lots of outliers. That’s not a solid base for creating a prediction model:

The diagram on the right shows the data for the Clustering sample. This was our very first ML.NET project so we decided to use a proven and cleaned data set, and this shows in the box plot. For the sake of completeness, here’s that Clustering sample again. Its prediction model works very well:

One more thing about the box plot. When you right-click a shape on the diagram, OxyPlot shows its tracker with the details:

When outliers are identified in the analysis, you may decide to skip these when training the model, using FilterByColumn(). Check this sample code for more details.
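The bounds for such a filter can be taken straight from the whisker ends. The sketch below shows the idea on an in-memory list only. `OutlierFilter` is a hypothetical helper and not the ML.NET API, and whether the bounds of FilterByColumn() are inclusive or exclusive should be checked against the ML.NET version you use.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class OutlierFilter
{
    // Keeps only the values inside [lowerBound, upperBound],
    // for example the lower and upper whisker of a box plot.
    public static List<double> KeepWithin(
        IEnumerable<double> values, double lowerBound, double upperBound)
    {
        return values.Where(v => v >= lowerBound && v <= upperBound).ToList();
    }
}
```

Applied to the values 1, 10, 11, 12, 13, 14, 15, 30 with bounds 10 and 15, this keeps the six values from 10 to 15 and drops 1 and 30.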

# Source

In this article we demonstrated how to build box plot diagrams in UWP, using ML.NET, Math.NET and OxyPlot. Even though ML.NET does not target Feature Analysis, its IDataView API is very helpful in getting the column data.

The sample app lives here on GitHub.

Enjoy!