Data scientists often have to check whether data is normally distributed. One example is checking the normality of the residuals of a linear regression so that the F-test can be used correctly. Let’s see how we can check the normality of a dataset.
What is normality?
Normality means that a particular sample has been generated from a Gaussian distribution. It doesn’t necessarily have to be a standard normal distribution (with mean equal to 0 and variance equal to 1).
There are several situations in which data scientists may need normally distributed data:
- To compare the residuals of linear regression in the training set with the residuals in the test set using an F-test
- To compare the mean value of a variable across different groups using a One-Way ANOVA test or a Student’s t-test
- To assess the linear correlation between two variables using a proper test on their Pearson’s correlation coefficient
- To assess whether the distribution of a feature given the target class is close enough to Gaussian to justify a Gaussian Naive Bayes classification model
These are all different examples that may occur frequently in a data scientist’s everyday job.
Unfortunately, data is not always normally distributed, although we can apply some particular transformation to make a distribution more symmetrical (for example, a power transformation).
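As an illustration of the power transformation mentioned above, here is a minimal sketch that uses the Box-Cox transformation from scipy.stats. The log-normal sample, the seed and the variable names are assumptions made only for this example, and Box-Cox is just one possible power transformation (it requires strictly positive data).

```python
import numpy as np
from scipy.stats import boxcox, skew

# Hypothetical right-skewed sample (log-normal), used only for illustration.
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Box-Cox requires strictly positive data. With the default lmbda=None it
# returns the transformed values together with the fitted lambda.
transformed, fitted_lambda = boxcox(skewed)

skew_before = skew(skewed)
skew_after = skew(transformed)
print(f"fitted lambda = {fitted_lambda:.3f}")
print(f"skewness before = {skew_before:.3f}, after = {skew_after:.3f}")
```

A skewness closer to 0 after the transformation suggests a more symmetrical distribution, although symmetry alone does not guarantee normality.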
A good way to assess the normality of a dataset is a Q-Q plot, which gives us a graphical visualization of normality. But we often need a quantitative result, and a chart alone may not be enough.
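A Q-Q plot like the one described above could be drawn with the probplot function of scipy.stats. This is only a sketch: the helper name qq_plot, the seed and the simulated sample are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot


def qq_plot(sample):
    """Draw a normal Q-Q plot; return the figure and the correlation r of the fit."""
    fig, ax = plt.subplots()
    # probplot sorts the sample, pairs it with theoretical normal quantiles
    # and draws a least-squares reference line on the given axes.
    (theoretical_q, ordered_values), (slope, intercept, r) = probplot(
        sample, dist="norm", plot=ax
    )
    ax.set_title("Normal Q-Q plot")
    return fig, r


# Hypothetical sample, simulated only to have something to plot.
rng = np.random.default_rng(0)
fig, r = qq_plot(rng.normal(size=300))
print(f"correlation between the Q-Q points and the reference line: {r:.4f}")
```

Calling plt.show() displays the figure. The closer the points lie to the straight line, the more plausible normality is, but reading the plot remains a visual judgment.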
That’s why we can use a hypothesis test to assess the normality of a sample.
The Shapiro-Wilk test is a hypothesis test that is applied to a sample and whose null hypothesis is that the sample has been generated from a normal distribution. If the p-value is low, we can reject such a null hypothesis and say that the sample has not been generated from a normal distribution.
It’s a pretty easy-to-use statistical tool that can give us an answer to the normality check we need, but it has a flaw: it doesn’t work well with large datasets. The maximum allowed size for a dataset depends on the implementation, but in Python (using SciPy), a sample size larger than 5000 produces a p-value that may not be accurate.
However, this test is still a very powerful tool we can use. Let’s see a practical example in Python.
An example in Python
First of all, let’s import NumPy and matplotlib.
import numpy as np
import matplotlib.pyplot as plt
Now, we have to import the function that calculates the p-value of a Shapiro-Wilk test. It’s the “shapiro” function in scipy.stats.
from scipy.stats import shapiro
Let’s now simulate two datasets: one generated from a normal distribution and another one generated from a uniform distribution.
x = np.random.normal(size=300)
y = np.random.uniform(size=300)
This is the histogram for “x”:
We can clearly see that the distribution is very similar to a normal distribution.
And this is the histogram for “y”:
As expected, the distribution is very far from a normal one.
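The two histograms described above could be reproduced with a sketch like the following. The seed, the number of bins and the figure layout are assumptions, so the exact shapes will differ from the original charts.

```python
import numpy as np
import matplotlib.pyplot as plt

# Same kind of samples as in the text; the seed is added for reproducibility.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = rng.uniform(size=300)

# One panel per sample, side by side.
fig, (ax_x, ax_y) = plt.subplots(1, 2, figsize=(10, 4))
ax_x.hist(x, bins=30)
ax_x.set_title("x: normal sample")
ax_y.hist(y, bins=30)
ax_y.set_title("y: uniform sample")
fig.tight_layout()
```

Calling plt.show() displays both panels. The bell shape of “x” and the flat shape of “y” should be visible even with only 300 points.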
So, we expect a Shapiro-Wilk test to give us a pretty large p-value for the “x” sample and a small p-value for the “y” sample (because it’s not normally distributed).
Let’s calculate such p-values:
shapiro(x)
# ShapiroResult(statistic=0.9944895505905151, pvalue=0.35326337814331055)
As we can see, the p-value for the “x” sample is not low enough to reject the null hypothesis, so we have no evidence against normality.
If we calculate the p-value on “y”, we get a quite different result.
shapiro(y)
# ShapiroResult(statistic=0.9485685229301453, pvalue=9.571677672681744e-09)
The p-value is lower than 5%, so we can reject the null hypothesis of the normality of the dataset.
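The decision rule used above (reject normality when the p-value is below 5%) can be wrapped in a small helper. This is a minimal sketch: the function name rejects_normality and the default alpha are assumptions, not part of SciPy.

```python
import numpy as np
from scipy.stats import shapiro


def rejects_normality(sample, alpha=0.05):
    """Return True when the Shapiro-Wilk p-value is below alpha."""
    # shapiro returns the W statistic and the p-value, in that order.
    statistic, p_value = shapiro(sample)
    return bool(p_value < alpha)


# Samples simulated as in the text; the seed is added for reproducibility.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = rng.uniform(size=300)

print("normality rejected for x:", rejects_normality(x))
print("normality rejected for y:", rejects_normality(y))
```

Keep in mind that a result of False only means the test found no evidence against normality at the chosen level. It does not prove that the sample is normally distributed.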
If we try to calculate the p-value on a sample larger than 5000 points, we get a warning:
shapiro(np.random.uniform(size=6000))
# /usr/local/lib/python3.7/dist-packages/scipy/stats/morestats.py:1760:
# UserWarning: p-value may not be accurate for N > 5000.
# warnings.warn("p-value may not be accurate for N > 5000.")
# ShapiroResult(statistic=0.9526152014732361, pvalue=2.6791145079733313e-40)
So, that’s how we can perform the Shapiro-Wilk test for normality in Python. Just make sure to use a sample of at most 5000 points in order not to work with approximated p-values.
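One possible way to stay within that limit is to run the test on a random subsample. The sketch below is an assumption on my part rather than an established recipe: the helper shapiro_subsampled, the constant MAX_SHAPIRO_SIZE and the seeds are hypothetical, and subsampling discards data, so the result depends on which points are drawn.

```python
import numpy as np
from scipy.stats import shapiro

MAX_SHAPIRO_SIZE = 5000  # threshold mentioned in SciPy's warning


def shapiro_subsampled(sample, max_size=MAX_SHAPIRO_SIZE, seed=0):
    """Run Shapiro-Wilk on a random subsample when the sample is too large."""
    sample = np.asarray(sample).ravel()
    if sample.size > max_size:
        rng = np.random.default_rng(seed)
        # Draw without replacement so no observation is counted twice.
        sample = rng.choice(sample, size=max_size, replace=False)
    statistic, p_value = shapiro(sample)
    return statistic, p_value, sample.size


# Hypothetical large sample, simulated only for illustration.
big = np.random.default_rng(1).uniform(size=6000)
statistic, p_value, used = shapiro_subsampled(big)
print(f"used {used} points, W = {statistic:.4f}, p-value = {p_value:.3g}")
```

Samples that already have 5000 points or fewer are passed to shapiro unchanged.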
The Shapiro-Wilk test for normality is a very simple-to-use statistical tool to assess the normality of a dataset. I usually apply it after a proper data visualization made with a histogram and/or a Q-Q plot. It’s a very useful way to check that a normality requirement is satisfied whenever we need it, and it belongs in every data scientist’s toolbox.