Accuracy measures for machine learning outputs.

Introduction

This article is a brief introduction to some of the ideas behind measuring the accuracy of supervised machine learning tools. The first half deals with classification algorithms: those that decide which of several classes a sample belongs to. The chief measure of accuracy there is the confusion matrix, from which a whole host of other statistics may be extracted. The second half of this article deals with regression algorithms, which predict a numerical output value from known inputs.

Confusion matrix

A confusion matrix is a method for measuring the accuracy of a classification algorithm. It is most easily understood through an example: imagine that a classification algorithm has been trained to distinguish between carrots, bananas, and apples. We can then draw a table of the results it obtained in testing:

                    Actual Carrot   Actual Banana   Actual Apple
  Predicted Carrot        5               2               0
  Predicted Banana        3               3               2
  Predicted Apple         0               1              11

The numbers in the matrix represent the number of tests that returned a particular result: for example, for the 8 test cases that were actually carrots, the classifier correctly thought that 5 of them were carrots, and incorrectly thought 3 of them were bananas.

The entries on the diagonal of the matrix are correctly classified, and the entries off the diagonal are incorrectly classified. The matrix provides more information than simply the proportion of results that are correctly classified: in the above example, it can be inferred that the classifier is good at identifying apples (only one false positive apple, and two false negative apples, against 11 correctly identified apples), whilst the classification for bananas is much less accurate. The classifier especially had trouble telling apart carrots and bananas: the corresponding sub-matrix does not appear especially strongly peaked around the diagonal.

A whole range of statistics may be extracted directly from the confusion matrix. These statistics generally refer to the accuracy of classification of one class (binary classification): in this example, bananas.

The condition positive (P) is the number of actual bananas: here, 6. The condition negative (N) is the number of actual not-bananas: here, 21. True positives (TP, 3 here), true negatives (TN, 16 here), false positives (FP, 5 here) and false negatives (FN, 3 here) may be combined to yield information about the accuracy of the classifier. Some particular examples of important statistics are:

  • Sensitivity or true positive rate \frac{TP}{P} (0.5)
  • Specificity or true negative rate \frac{TN}{N} (0.762)
  • Precision \frac{TP}{TP+FP} (0.375)
  • False discovery rate \frac{FP}{FP+TP} (0.625)
  • False omission rate \frac{FN}{TN+FN} (0.158)
  • Accuracy \frac{TP+TN}{P+N} (0.704)
  • F1 score, the harmonic mean of precision and sensitivity \frac{2TP}{2TP+FP+FN} (0.429)

All these statistics may be compared between classifiers. They may also be judged on their own merits: all have a possible range of [0,1], and a required standard may be specified ahead of testing.
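As a sketch, all of the statistics listed above can be computed mechanically from the confusion matrix; the `binary_stats` helper below is illustrative, not taken from any particular library, and assumes the rows-predicted, columns-actual layout used in the table above:

```python
import numpy as np

# Confusion matrix from the example: rows = predicted, columns = actual,
# classes ordered carrot, banana, apple.
cm = np.array([[5, 2, 0],
               [3, 3, 2],
               [0, 1, 11]])

def binary_stats(cm, cls):
    """Binary-classification statistics for one class of a confusion matrix."""
    tp = cm[cls, cls]
    fp = cm[cls].sum() - tp        # predicted as cls, actually something else
    fn = cm[:, cls].sum() - tp     # actually cls, predicted as something else
    tn = cm.sum() - tp - fp - fn
    p, n = tp + fn, tn + fp        # condition positive / condition negative
    return {
        "sensitivity": tp / p,
        "specificity": tn / n,
        "precision": tp / (tp + fp),
        "false_discovery_rate": fp / (fp + tp),
        "false_omission_rate": fn / (tn + fn),
        "accuracy": (tp + tn) / (p + n),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

stats = binary_stats(cm, cls=1)  # class 1 = banana
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

Running this reproduces the bracketed values in the list above (sensitivity 0.500, specificity 0.762, and so on).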

In problems where there are very imbalanced numbers of data points in the different classes (here, we have many more not-bananas than we do bananas), none of the statistics above are entirely reliable. A modified form of the F1 score, known as the F\beta score, can be used to weight precision and sensitivity differently, prioritising either the classifier identifying every banana (high sensitivity, \beta > 1), or only identifying things it is certain are bananas (high precision, \beta < 1). The F\beta score can be expressed as

F_\beta = (1+\beta^2)\,\frac{\mathrm{precision} \cdot \mathrm{sensitivity}}{\beta^2\,\mathrm{precision} + \mathrm{sensitivity}} = \frac{(1+\beta^2)\,TP}{(1+\beta^2)\,TP + \beta^2\,FN + FP},

but the value of \beta (and hence the relative importance of sensitivity and precision) needs to be set ahead of time, and hence knowledge about the relative sizes of the classes is required.
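As a sketch, the F\beta score can be computed directly from the confusion-matrix counts; the `f_beta` helper below is illustrative rather than from any particular library:

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score computed directly from confusion-matrix counts."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

# Banana counts from the example: TP=3, FP=5, FN=3.
print(round(f_beta(3, 5, 3, beta=1.0), 3))  # → 0.429, the F1 score
print(round(f_beta(3, 5, 3, beta=2.0), 3))  # beta > 1: pulled towards sensitivity (0.5)
print(round(f_beta(3, 5, 3, beta=0.5), 3))  # beta < 1: pulled towards precision (0.375)
```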

Matthews correlation coefficient

One further statistic that may be extracted from confusion matrices is the Matthews correlation coefficient.  Again, this measure is primarily designed for examining binary classifications, although it has been extended to cover the multiclass case.  For binary classification, the Matthews correlation coefficient is given by

MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}},

and determines a correlation coefficient that is the geometric mean of the regression coefficients of the problem and its dual.  It has been described as one of the best ways of encapsulating the full confusion matrix in a single number.

The Matthews correlation coefficient takes values in the range [-1,1], with 1 being perfect prediction and -1 completely incorrect prediction: 0 indicates the prediction is no better than random.  For the banana example above, the Matthews correlation coefficient gives 0.238, confirming that, consistent with its low precision and sensitivity, the classifier identifies bananas only slightly better than chance.  Again the primary use case for the Matthews correlation coefficient is in comparisons between classifiers.
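A minimal sketch of the binary calculation, using the banana counts from the example (TP=3, TN=16, FP=5, FN=3); the `mcc` helper is illustrative:

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Banana counts from the confusion matrix above.
print(round(mcc(tp=3, tn=16, fp=5, fn=3), 3))  # → 0.238
```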

ROC curve

Another method of analysing the accuracy of a binary classification algorithm is through a Receiver Operating Characteristic (ROC) curve.  This curve plots the true positive rate \frac{TP}{P} against the false positive rate \frac{FP}{N}, with the curve being plotted parametrically as a function of the classifier's decision threshold.  Example ROC curves from a biological context are shown in the figure: at any point on one of the curves, the sensitivity and (1-specificity) can be read off.  A perfect classifier's ROC curve would pass through the top left corner of the plot, and a classifier that's no better than random would have a curve along the diagonal; most classifiers, naturally, are somewhere in between.  The Area Under a ROC Curve (AUC) is another measure of the accuracy of a classifier: a value of 1 indicates that the classifier correctly identifies every sample, a value of 0.5 indicates it cannot distinguish between the two classes, and a value of 0 indicates it gets every sample incorrect (and your classifier should be inverted).
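The AUC also has an equivalent rank interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (with ties counting half).  A minimal sketch under that interpretation; the `auc` helper and the example scores are purely illustrative:

```python
def auc(scores_pos, scores_neg):
    """Area under the ROC curve via its rank interpretation: the probability
    that a random positive sample outscores a random negative one."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count half
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical classifier scores (higher = more likely positive).
positives = [0.9, 0.8, 0.6, 0.55]
negatives = [0.7, 0.5, 0.4, 0.3]
print(auc(positives, negatives))  # → 0.875
```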

Pearson correlation coefficient

The Pearson correlation coefficient is a metric designed for use comparing two variables.  In a machine-learning context, these can be the target values for a supervised regression algorithm and the predicted values.  Plotting one against the other in a scatter graph, a perfect machine-learning algorithm would give a straight line through the origin (y = x), with inaccuracies in the result giving scatter around this line.  The Pearson correlation coefficient gives a measure of this scatter.  The Pearson correlation coefficient for the relationship between X and Y can be expressed as

\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y},

where \sigma_X is the standard deviation of the variable X and \mathrm{cov}(X,Y) is the covariance of the variables X and Y, defined as \mathrm{cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)], with \mu_X = E[X] being the expectation value (mean) of X, and similarly for Y.  The Pearson correlation coefficient takes values in the range [-1,1], with 1 and -1 being perfect correlation and anti-correlation, and 0 indicating the variables are uncorrelated.

The Pearson correlation coefficient is only designed for comparisons where the relationship between variables is expected to be linear.  This is perfectly adequate for examining the accuracy of a machine learning regression algorithm, but the concept behind the Pearson correlation coefficient may be easily extended to include non-linear relationships.
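As a brief sketch of computing the coefficient for a regression model's predictions, using NumPy's `corrcoef`; the simulated targets and noise model below are purely illustrative stand-ins for a real algorithm's output:

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.normal(size=200)
# Simulated predictions: the targets plus noise, standing in for the
# output of a regression model.
predictions = targets + 0.3 * rng.normal(size=200)

# corrcoef returns the 2x2 correlation matrix; the off-diagonal entry
# is the Pearson coefficient between the two variables.
r = np.corrcoef(targets, predictions)[0, 1]
print(round(r, 3))  # close to 1 for this low-noise model
```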

r2 coefficient

In the case of a linear relationship, as expected between the targets and predictions of a regression algorithm, the r2 coefficient is just the square of the Pearson correlation coefficient.  More generally, we can define

r^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},

where y_i are the target values, \hat{y}_i are the predicted values, and \bar{y} is the mean of the target values.  This measure of least-squares correlation takes values between 0 and 1 for any model that predicts at least as well as the mean of the targets (it can become negative for a worse one), as before with 0 indicating no correlation and 1 perfect correlation.  This correlation coefficient is also referred to as the coefficient of determination.
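A minimal sketch of this general definition; the `r2` helper below is ours, though scikit-learn's `r2_score` computes the same quantity:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(r2([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]))  # → 1.0 (perfect prediction)
print(r2([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5]))  # → 0.0 (just predicting the mean)
```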

There are multiple variations on the r2 coefficient, including modifications to remove the unwelcome property of the original coefficient spuriously increasing when extra explanatory variables are introduced.  However, the base r2 coefficient remains a popular choice for analysing the accuracy of models of data.

The choice between the Pearson correlation coefficient and the r2 coefficient when analysing the accuracy of a machine learning regression algorithm is to some extent a matter of personal preference.  All the information in the r2 coefficient is contained in the Pearson correlation coefficient, although the converse is not true; but for analysing the accuracy of a machine learning algorithm, where a linear relationship is expected and large changes in the evaluated gradient are unlikely, both measures give equivalent information.

Mean squared error

The Mean Squared Error (MSE) is similar to the r2 coefficient, and is used for analysing a supervised regression algorithm's accuracy.  Using the same notation as above, the MSE is expressed as

\mathrm{MSE} = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2,

i.e. the sum of the squared errors, divided by the number of samples to obtain the mean.  For a set of target values with unit variance, the MSE tends to 1 - r^2; but for general (dimensionful) data, the MSE is dimensionful (whilst r2 is dimensionless), and the magnitude of the MSE depends on the magnitude of the data.  This makes the MSE less transferable, and more difficult to interpret, than the r2 coefficient, without prior knowledge of the data.
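The unit dependence is easy to demonstrate: rescaling the data (say, metres to millimetres) multiplies the MSE by the square of the factor, while r2 would be unchanged.  A short sketch, with the `mse` helper and example values purely illustrative:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between targets and predictions."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

y = np.array([1.0, 2.0, 3.0, 4.0])
pred = np.array([1.1, 1.9, 3.2, 3.8])

print(mse(y, pred))                # MSE in the original units
# Same data expressed in units 1000x smaller: the MSE grows by 10^6.
print(mse(1000 * y, 1000 * pred))
```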

Relative error

The relative error is another measure of accuracy for a regression algorithm, which combines features of both the r2 coefficient and the MSE.  It is expressed as

\mathrm{relative\ error} = \frac{1}{n} \sum_i \frac{|y_i - \hat{y}_i|}{|y_i|},

where |y_i| is the absolute value of y_i.  This measure is dimensionless, like the r2 coefficient, but suffers from problems when the expected outcome is 0 (as this appears in the denominator), and only makes sense for measurements in units of a ratio scale (one where zero is a definite lower bound on the possible values), as otherwise shifting every output value will change the measured relative error.  This makes the relative error a much worse measure of accuracy than r2 in most cases.
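The ratio-scale caveat can be shown directly: adding a constant offset to both targets and predictions leaves every absolute error untouched, yet changes the relative error.  A sketch with illustrative values:

```python
import numpy as np

def relative_error(y_true, y_pred):
    """Mean relative error; undefined when any target is exactly zero."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

y = np.array([10.0, 20.0, 40.0])
pred = np.array([11.0, 19.0, 42.0])

print(relative_error(y, pred))
# Shifting every value by a constant changes the result, even though the
# absolute errors are identical: only ratio-scale data makes this meaningful.
print(relative_error(y + 100, pred + 100))
```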

Multi-dimensional regression

The basic expressions for the Pearson correlation coefficient, r2 coefficient, and MSE above assume that there is only one target variable being optimised in the regression.  However, some more advanced machine learning algorithms are capable of mapping inputs to multiple outputs simultaneously.  The most obvious way to measure the accuracy of multiple outputs together is by simply summing over the MSE of each individually; however, this only makes sense for outputs that share the same physical dimensions (units).  Similarly, summing over the relative error is only suitable when all variables are measured on ratio scales.

However, there are alternative measures more suited to multi-dimensional regression.  For d dimensions of data, the average relative root mean square error (aRRMSE) takes the form

\mathrm{aRRMSE} = \frac{1}{d} \sum_{j=1}^{d} \sqrt{\frac{\sum_i (y_{ij} - \hat{y}_{ij})^2}{\sum_i (y_{ij} - \bar{y}_j)^2}}.

Similarly, a multi-dimensional version of the Pearson correlation coefficient can be written which contains information about the correlation vs anticorrelation of the result as well.  Either of these expressions could be squared, to obtain a measure similar to a multi-dimensional r2, which could more directly be written as

r^2 = 1 - \frac{1}{d} \sum_{j=1}^{d} \frac{\sum_i (y_{ij} - \hat{y}_{ij})^2}{\sum_i (y_{ij} - \bar{y}_j)^2}.

Note that squaring the previous measures does not give the same result as this version of multi-dimensional r2, and they cannot be compared directly, although either separately would serve as a good measure of accuracy.  The multi-dimensional r2, Pearson correlation coefficient, and aRRMSE weight each dimension of the output identically; the sum over dimensions could instead be weighted, but this would need to be justified before the analysis was begun.
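Both multi-dimensional measures can be sketched in a few lines; the helpers and example data below are illustrative, with each column of the arrays treated as one output dimension contributing with equal weight regardless of its scale:

```python
import numpy as np

def arrmse(Y_true, Y_pred):
    """Average relative root mean square error over output dimensions
    (columns of Y)."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    ss_res = np.sum((Y_true - Y_pred) ** 2, axis=0)
    ss_tot = np.sum((Y_true - Y_true.mean(axis=0)) ** 2, axis=0)
    return np.mean(np.sqrt(ss_res / ss_tot))

def multi_r2(Y_true, Y_pred):
    """Multi-dimensional r2: per-dimension coefficients of determination,
    averaged with equal weight per output dimension."""
    Y_true, Y_pred = np.asarray(Y_true), np.asarray(Y_pred)
    ss_res = np.sum((Y_true - Y_pred) ** 2, axis=0)
    ss_tot = np.sum((Y_true - Y_true.mean(axis=0)) ** 2, axis=0)
    return np.mean(1.0 - ss_res / ss_tot)

# Two output dimensions on very different scales: the per-dimension
# normalisation means each still contributes equally.
Y = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
P = np.array([[1.1, 110.0], [2.0, 190.0], [2.9, 310.0]])
print(arrmse(Y, P), multi_r2(Y, P))
```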

Conclusions

The measures used for the accuracy of machine learning algorithms can be split into two classes; those for classification and those for regression, just as with the algorithms themselves.  The confusion matrix is the fundamental object for analysing the accuracy of classification algorithms; even a simple binary classification can be analysed in depth using a confusion matrix.  Extracting statistics from the confusion matrix is not difficult, although the choice of statistic to use is not straightforward: simple measures like the sensitivity and precision lose a lot of information from the confusion matrix, whilst the F1 score is biased in cases with different class sizes.  The Matthews correlation coefficient is a balanced measure that gives a good idea of the accuracy of a classification algorithm, although a single value can never capture the full detail of a matrix of accuracies in the confusion matrix.

For regression algorithms, the choice of statistic to use is more straightforward; a scatter plot of predicted value vs actual value can be interpreted in terms of the equivalent Pearson or r2 correlation coefficients, and these values have absolute scales that can be used a priori to set a required accuracy.  The mean square error, whilst similar to the r2 coefficient, suffers from a lack of transferability between problems, and relative error depends on the measurement units, meaning that a preference choice between the Pearson and r2 coefficients is the main decision to make when it comes to choosing an accuracy measure for machine learning regression algorithms.

For multi-dimensional regression problems, the Pearson correlation coefficient may be extended to capture information about the accuracy of the regression in all output dimensions.  This (or perhaps its square) is probably the most effective measure of multi-dimensional regression accuracy, although as in a single dimension the choice between Pearson and r2 coefficients is mostly personal preference.
