How is your data distributed? A practical introduction to the Kolmogorov-Smirnov test

Data Scientists often need to work out which distribution their data follows. We have already seen the Shapiro-Wilk test for normality, but what about non-normal distributions? Another test can help us here: the Kolmogorov-Smirnov test.

The need to check the distribution

Data Scientists often need to check which distribution their data comes from. They work with samples and need to check whether these come from a normal distribution, a lognormal distribution, or whether two datasets come from the same distribution. The last case is pretty common when you perform a train-test split in Machine Learning.

For example, you may want to see if a sample you take from a population is statistically similar to the population itself. Or, if you take several samples from the same population, you may want to see if they are similar to each other.

All these problems have a common factor: comparing the distribution of the sample with the distribution of another sample or a known probability distribution.

This is where the Kolmogorov-Smirnov tests come in.

Kolmogorov-Smirnov tests

The KS test comes in two versions, each with its own null hypothesis:

• The sample has been generated from a given probability distribution
• Two samples have been generated from the same probability distribution

The former is the 1-sample KS test, the latter is the 2-sample KS test.

Both tests work on cumulative distribution functions (CDFs). The 1-sample test compares the empirical CDF of the sample with the CDF of the given distribution, while the 2-sample test compares the empirical CDFs of the two samples. Mathematically speaking, the test statistic is the maximum absolute distance between the two CDFs, and the p-value is calculated from this statistic.
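
To make the idea of a distance between cumulative distribution functions concrete, here is a minimal sketch that computes the 1-sample statistic by hand as the largest vertical gap between the empirical CDF of a sample and a reference CDF, and then compares it with SciPy's result. The helper name “ks_statistic_1samp”, the seed and the sample size are assumptions made for this illustration only.

```python
import numpy as np
from scipy.stats import ks_1samp, norm


def ks_statistic_1samp(sample, cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf` (illustrative helper)."""
    data = np.sort(np.asarray(sample, dtype=float))
    n = data.size
    theoretical = cdf(data)
    # The empirical CDF jumps from (i - 1)/n to i/n at the i-th sorted point,
    # so the gap has to be checked on both sides of every jump.
    d_plus = np.max(np.arange(1, n + 1) / n - theoretical)
    d_minus = np.max(theoretical - np.arange(0, n) / n)
    return max(d_plus, d_minus)


rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
sample = rng.normal(size=100)

manual = ks_statistic_1samp(sample, norm.cdf)
reference = ks_1samp(sample, norm.cdf).statistic
print(manual, reference)  # the two values should agree
```

The hand-written version only returns the statistic. Turning it into a p-value requires the distribution of the statistic under the null hypothesis, which is the part SciPy takes care of.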

These tests are pretty powerful, although they suffer from some approximations in the calculation of the p-values and can be affected by the presence of outliers. However, they are a very useful tool that should be present in a Data Scientist’s toolbox.

Let’s see how they work in Python.

An example in Python

Let’s create two datasets from a normal and uniform distribution respectively. They don’t need to have the same size.

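import numpy as np

# 100 values from a standard normal distribution (mean 0, standard deviation 1)
x = np.random.normal(size=100)
# 200 values from a uniform distribution on [0, 1)
y = np.random.uniform(size=200)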

Now, we can perform a KS test to assess whether the first sample comes from a normal or a uniform distribution. To perform this test, we need to import the “norm” and “uniform” distribution objects from SciPy, whose “cdf” methods give the cumulative distribution functions we want to check against, together with the function that performs the test (the “ks_1samp” function).

from scipy.stats import ks_1samp, norm, uniform

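Now, we can run the test that compares the distribution of the “x” dataset with the cumulative distribution function of a standard normal distribution, which is what “norm.cdf” gives with its default parameters.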

ks_1samp(x, norm.cdf)
# KstestResult(statistic=0.05164007841056789, pvalue=0.9398483559210086)


As expected, the p-value is pretty large, so we cannot reject the null hypothesis that the dataset has been generated from a standard normal distribution.

If we perform the same check with a uniform distribution, the result is very different:

ks_1samp(x, uniform.cdf)
# KstestResult(statistic=0.5340516556530323, pvalue=3.580965283851709e-27)


A very small p-value lets us reject the null hypothesis that states that the sample has been generated from a uniform distribution, which is actually what we expected.
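
Both checks above rely on the default parameters of the distributions: “norm.cdf” on its own is the standard normal (mean 0, standard deviation 1) and “uniform.cdf” is the uniform distribution on the interval from 0 to 1. If the hypothesised distribution has other parameters, they can be passed to the CDF through the “args” argument of “ks_1samp”. The sketch below assumes a made-up sample with mean 5 and standard deviation 2; the seed, the sample size and the parameter values are illustrative choices, not part of the original example.

```python
import numpy as np
from scipy.stats import ks_1samp, norm

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
# Hypothetical sample drawn from a normal with mean 5 and standard deviation 2.
shifted = rng.normal(loc=5.0, scale=2.0, size=200)

# norm.cdf alone means the standard normal, so this comparison should fail badly.
against_standard = ks_1samp(shifted, norm.cdf)

# Extra positional arguments for the CDF (here loc and scale) go through `args`.
against_true = ks_1samp(shifted, norm.cdf, args=(5.0, 2.0))

print(against_standard.pvalue, against_true.pvalue)
```

One caveat: the parameters should be fixed in advance. If they are estimated from the same sample that is being tested, the p-value of the plain KS test is no longer reliable.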

The two-sample version of the test is very straightforward. We have to import the “ks_2samp” function from SciPy and pass the two samples as arguments.

from scipy.stats import ks_2samp

ks_2samp(x, y)
# KstestResult(statistic=0.53, pvalue=9.992007221626409e-16)


As expected, the p-value is very low because we have artificially built these datasets from very different distributions.

So, Kolmogorov-Smirnov tests are confirming our assumptions.
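
The same function can be used for the train-test scenario mentioned earlier. The following is a minimal sketch, assuming a single made-up lognormal feature and an 80/20 split done by hand with NumPy. It compares a random split with a deliberately biased one, in which the smallest values all end up in the training set. The variable names, the seed and the sizes are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
# Hypothetical feature column: 1,000 lognormal values.
feature = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Random 80/20 split done by shuffling the indices.
indices = rng.permutation(feature.size)
train, test = feature[indices[:800]], feature[indices[800:]]
random_split = ks_2samp(train, test)

# A split by value instead of at random: the smallest 800 values go to train.
ordered = np.sort(feature)
biased_split = ks_2samp(ordered[:800], ordered[800:])

print(random_split.statistic, random_split.pvalue)
print(biased_split.statistic, biased_split.pvalue)
```

In the biased split the two sets do not overlap at all, so the statistic reaches its maximum value of 1 and the p-value is practically zero. For the random split we would normally expect a much smaller statistic, although any single random split can still produce a low p-value by chance.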

Conclusion

Kolmogorov-Smirnov tests are very powerful tools in a Data Scientist’s toolbox and are worth using each time we want to see if our data comes from a given distribution or if two datasets share the same distribution. However, the calculation of the p-value may rely on some approximations, so it must be interpreted with care, just like any p-value.