How accurate is your accuracy?

When evaluating binary classification models, we often work with proportions such as accuracy, precision and recall. But how can we calculate the error on these estimates? Are two models with 95% accuracy actually equivalent?

Well, not necessarily: it depends on how many records each accuracy was measured on. Let’s see why.

The standard error

Any measurement should be accompanied by an error estimate that represents its precision. I have a degree in Physics, and physicists are always hated because they insist on having an error estimate after each measurement result. For example, I can say I’m 1.93 meters tall, but this number carries little information unless it’s followed by an estimate of the error. If I say 1.93 meters with an error of 3% and another guy says 1.93 meters with an error of 30%, which one would you trust more?

That’s why we need to estimate the error of our measurements by calculating what is called the standard error.

The standard error for a sample of N points is defined as follows:

\mathrm{S.E.} = \frac{\sigma}{\sqrt N}

where σ is the standard deviation calculated on the sample. As you can see, the larger the sample, the lower the standard error and the more precise our measure. This is in line with the law of large numbers.
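To make the definition concrete, here is a minimal Python sketch of the formula above. The function name `standard_error` and the height values are made up for illustration, and the sketch assumes the sample standard deviation with the usual N - 1 denominator.

```python
import math
import statistics


def standard_error(sample):
    """Standard error of the mean: standard deviation divided by sqrt(N)."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two points to estimate a standard deviation")
    # statistics.stdev uses the N - 1 denominator (an assumption of this sketch)
    sigma = statistics.stdev(sample)
    return sigma / math.sqrt(n)


# Hypothetical repeated height measurements, in meters
heights = [1.93, 1.91, 1.95, 1.92, 1.94]
print(f"mean = {statistics.mean(heights):.3f} m, S.E. = {standard_error(heights):.4f} m")
```

With the same spread in the data, a sample four times larger would give roughly half the standard error.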

Standard error on proportions

Let’s consider a dataset of N points, n of which are related to a successful event (for example, a correct prediction of our model). The proportion of success over the entire sample is, simply:

p = \frac{n}{N}

This can be the accuracy (where N is the sum of the values of the confusion matrix and n is its trace), the precision (where N is the number of events that the model has predicted as 1 and n is the number of true positives) or another proportion.

Now, we have to calculate the standard error. We could use a resampling technique like the bootstrap to calculate it, but for proportions we can use a simple, closed-form formula.

First, we need to calculate the standard deviation. Our event can be modeled as a random variable x whose value is 1 with probability p and 0 with probability 1-p.

Its expected value is, then,

E[x] = p\cdot 1 + (1-p) \cdot 0 = p

Since x can only be 0 or 1, x² = x and so E[x²] = p. Its variance is, then:

\sigma^2 = E[x^2] - E[x]^2 = p - p^2 = p(1-p)

So, the standard error becomes:

\mathrm{S.E.} = \frac{\sigma}{\sqrt N} = \sqrt{\frac{p(1-p)}{N}}
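As a sanity check, the closed-form result can be compared with a bootstrap estimate, the alternative mentioned earlier. The following is a rough sketch rather than a reference implementation: the function names, the number of resamples and the 70-out-of-100 data are all assumptions chosen for illustration.

```python
import math
import random


def proportion_standard_error(n_success, n_total):
    """Closed-form standard error of a proportion: sqrt(p(1-p)/N)."""
    p = n_success / n_total
    return math.sqrt(p * (1 - p) / n_total)


def bootstrap_standard_error(outcomes, n_resamples=2000, seed=0):
    """Bootstrap estimate: spread of the proportion over resampled datasets."""
    rng = random.Random(seed)
    n_total = len(outcomes)
    proportions = []
    for _ in range(n_resamples):
        # Resample the 0/1 outcomes with replacement, keeping the same size
        resample = rng.choices(outcomes, k=n_total)
        proportions.append(sum(resample) / n_total)
    mean = sum(proportions) / n_resamples
    variance = sum((q - mean) ** 2 for q in proportions) / (n_resamples - 1)
    return math.sqrt(variance)


# Hypothetical test set: 70 correct predictions (1) out of 100 records
outcomes = [1] * 70 + [0] * 30
closed_form = proportion_standard_error(70, 100)
bootstrap = bootstrap_standard_error(outcomes)
print(f"closed form: {closed_form:.4f}, bootstrap: {bootstrap:.4f}")
```

The two numbers should agree closely, which is why the closed-form formula is usually enough for proportions.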

Confidence intervals

The standard error can be used to calculate confidence intervals, that is, intervals in which we can expect the real value to lie with a certain level of confidence.

How can we calculate confidence intervals for proportions?

Let’s first calculate the following z variable:

z = \frac{x - \mu}{\sqrt{\sigma^2/N}}

where x is the proportion measured on the sample, μ is its expected value and σ is the standard deviation of a single event, as calculated above. It can be proven (it’s the central limit theorem) that, for reasonably high values of N, this variable is approximately normally distributed with zero mean and unit variance. Such a variable falls in the interval (-1.96, 1.96) with 95% probability. It’s a property of the normal distribution.

So, switching back to proportions, we can define a 95% confidence interval as:

\left(p-1.96\sqrt{\frac{p(1-p)}{N}} , p+1.96\sqrt{\frac{p(1-p)}{N}} \right)
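The interval above can be sketched in a few lines of Python. The function name `proportion_confidence_interval` and the record counts are illustrative assumptions; the 1.96 factor is recovered from the standard library's `statistics.NormalDist` instead of being hard-coded, so other confidence levels work too.

```python
import math
from statistics import NormalDist


def proportion_confidence_interval(n_success, n_total, confidence=0.95):
    """Normal-approximation confidence interval for a proportion."""
    p = n_success / n_total
    # Two-sided critical value of the standard normal (about 1.96 for 95%)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * math.sqrt(p * (1 - p) / n_total)
    return p - margin, p + margin


# Same 95% accuracy measured on two hypothetical test sets of different size
small = proportion_confidence_interval(95, 100)
large = proportion_confidence_interval(9500, 10000)
print(f"N = 100:   [{small[0]:.4f}, {small[1]:.4f}]")
print(f"N = 10000: [{large[0]:.4f}, {large[1]:.4f}]")
```

The two intervals are centered on the same accuracy, but the one computed on 10,000 records is ten times narrower. Keep in mind that this is the normal approximation, so it should only be trusted for reasonably high values of N.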

As usual, the higher the value of N, the tighter the interval, due to the law of large numbers.

Let’s now see how to apply these concepts in practice.

A simple example

Let’s say we have two models. One of them is tested on 100 records and gives us an accuracy score of 70%. The other model is tested on 400 records and gives us an accuracy score of 67%. Which model is better? Many would say the former, because its accuracy score is higher. But let’s see what happens if we calculate the standard error of both accuracies:

\textrm{S.E. model 1} = \sqrt{\frac{70\%(1-70\%)}{100}}=4.58\%
\textrm{S.E. model 2} = \sqrt{\frac{67\%(1-67\%)}{400}}=2.35\%

The second model gives us a more precise estimation of the accuracy because the standard error is lower. If we calculate the confidence intervals, we get:

\textrm{C.I. for model 1}=[70\%-1.96\cdot4.58\%,70\%+1.96\cdot4.58\%] = [61.02\%,78.98\%]
\textrm{C.I. for model 2}=[67\%-1.96\cdot2.35\%,67\%+1.96\cdot2.35\%] = [62.39\%,71.61\%]

As you can see, the model with the highest lower bound of the confidence interval is the second one, not the first one. If we take a conservative approach and look at the statistically worst case, we compare the lower bounds of the confidence intervals and, in this case, select the second model and not the first one. That’s why we calculate the standard error. An estimate alone tells us little without its error estimate and, as we can see, a larger sample gives us more information than a smaller one, so by calculating the standard error we can reach a better decision.
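The comparison can be reproduced with a short Python sketch. The helper `accuracy_interval` and the dictionary layout are assumptions made for this illustration; the accuracies and record counts are the ones from the example.

```python
import math

Z_95 = 1.96  # critical value for a 95% confidence interval


def accuracy_interval(accuracy, n_records, z=Z_95):
    """Confidence interval for an accuracy measured on n_records."""
    se = math.sqrt(accuracy * (1 - accuracy) / n_records)
    return accuracy - z * se, accuracy + z * se


# (accuracy, number of test records) for each model in the example
models = {"model 1": (0.70, 100), "model 2": (0.67, 400)}
intervals = {name: accuracy_interval(acc, n) for name, (acc, n) in models.items()}

for name, (low, high) in intervals.items():
    print(f"{name}: [{low:.2%}, {high:.2%}]")

# Conservative choice: pick the model with the highest lower bound
best = max(intervals, key=lambda name: intervals[name][0])
print("highest lower bound:", best)
```

Ranking by the lower bound is only one possible decision rule, but it makes the effect of the sample size explicit.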


Calculating error estimates is often neglected, but not knowing how accurate our measures are can lead to wrong conclusions. For proportions, the calculation of the standard error and the confidence intervals is very straightforward and can be very useful to extract as much information as possible from our dataset.
