Data scientists often need to check whether data is normally distributed. One example is the normality check on the residuals of a linear regression, which is required to correctly use the F-test. Let’s see how we can check the normality of a dataset.

## What is normality?

Normality means that a particular sample has been generated from a Gaussian distribution. It doesn’t necessarily have to be a standard normal distribution (with mean 0 and variance 1).

There are several situations in which data scientists may need normally distributed data:

- To compare the residuals of linear regression on the training set with the residuals on the test set using an F-test
- To compare the mean value of a variable across different groups using a one-way ANOVA test or a Student’s t-test
- To assess the linear correlation between two variables using a proper test on their Pearson’s correlation coefficient
- To assess whether the likelihood of a feature given the target is Gaussian, so that we can use a Gaussian Naive Bayes classification model

These are all different examples that may occur frequently in a data scientist’s everyday job.

Unfortunately, data is not always normally distributed, although we can sometimes apply a transformation (for example, a power transformation) to make a distribution more symmetrical.
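For instance, a Box-Cox power transformation (available in scipy.stats) can make a skewed, positive-valued sample more symmetrical. Here is a minimal sketch; the exponential sample is just an arbitrary example of right-skewed data:

```
import numpy as np
from scipy.stats import boxcox

# Box-Cox requires strictly positive data; an exponential sample
# is a classic example of right-skewed data.
skewed = np.random.exponential(size=300)

# With no lambda given, boxcox estimates it via maximum likelihood.
transformed, best_lambda = boxcox(skewed)
```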

A good way to assess the normality of a dataset is a Q-Q plot, which gives us a graphical visualization of normality. But we often need a quantitative result, and a chart may not be enough.
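For reference, here is a minimal sketch of a Q-Q plot against a normal distribution, using the probplot function in scipy.stats (the sample here is an arbitrary one):

```
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import probplot

sample = np.random.normal(size=300)

# If the sample is normal, the points lie close to the straight line.
probplot(sample, dist="norm", plot=plt)
plt.show()
```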

That’s why we can use a hypothesis test to assess the normality of a sample.

## Shapiro-Wilk test

The Shapiro-Wilk test is a hypothesis test that is applied to a sample and whose null hypothesis is that the sample has been generated from a normal distribution. If the p-value is low, we can reject the null hypothesis and say that the sample has not been generated from a normal distribution.

It’s a pretty easy-to-use statistical tool that can give us the answer to the normality check we need, but it has a flaw: it doesn’t work well with large datasets. The maximum reliable sample size depends on the implementation; in Python’s SciPy, a sample larger than 5000 points gives us only an approximate calculation of the p-value.

However, this test is still a very powerful tool we can use. Let’s see a practical example in Python.

## An example in Python

First of all, let’s import NumPy and matplotlib.

```
import numpy as np
import matplotlib.pyplot as plt
```

Now, we have to import the function that calculates the p-value of a Shapiro-Wilk test. It’s the “shapiro” function in scipy.stats:

```
from scipy.stats import shapiro
```

Let’s now simulate two datasets: one generated from a normal distribution and another one generated from a uniform distribution.

```
x = np.random.normal(size=300)   # 300 points from a standard normal distribution
y = np.random.uniform(size=300)  # 300 points from a uniform distribution on [0, 1)
```

This is the histogram for “x”, which we can draw with a couple of lines of matplotlib (the number of bins is an arbitrary choice):
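```
plt.hist(x, bins=30, edgecolor="black")
plt.title("Histogram of x")
plt.show()
```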

We can clearly see that its shape is very similar to that of a normal distribution.

And this is the histogram for “y”, drawn the same way:
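```
plt.hist(y, bins=30, edgecolor="black")
plt.title("Histogram of y")
plt.show()
```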

As expected, the distribution is very far from a normal one.

So, we expect a Shapiro-Wilk test to give us a pretty large p-value for the “x” sample and a small p-value for the “y” sample (because it’s not normally distributed).

Let’s calculate such p-values:

```
shapiro(x)
# ShapiroResult(statistic=0.9944895505905151, pvalue=0.35326337814331055)
```

As we can see, the p-value for the “x” sample is not low enough to allow us to reject the null hypothesis.

If we calculate the p-value on “y”, we get quite a different result.

```
shapiro(y)
# ShapiroResult(statistic=0.9485685229301453, pvalue=9.571677672681744e-09)
```

The p-value is lower than the usual 5% significance threshold, so we can reject the null hypothesis and conclude that the dataset is not normally distributed.
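In practice, we may want to wrap this logic into a small helper. The name is_normal and the default 5% threshold below are arbitrary choices, not part of SciPy:

```
def is_normal(sample, alpha=0.05):
    # True if the Shapiro-Wilk test does NOT reject normality
    # at the given significance level.
    return shapiro(sample).pvalue > alpha

is_normal(x)  # True: we cannot reject normality
is_normal(y)  # False: we reject normality
```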

If we try to calculate the p-value on a sample larger than 5000 points, we get a warning:

```
shapiro(np.random.uniform(size=6000))
# /usr/local/lib/python3.7/dist-packages/scipy/stats/morestats.py:1760:
# UserWarning: p-value may not be accurate for N > 5000.
# warnings.warn("p-value may not be accurate for N > 5000.")
# ShapiroResult(statistic=0.9526152014732361, pvalue=2.6791145079733313e-40)
```

So, here’s how we can perform the Shapiro-Wilk test for normality in Python. Just make sure to use a properly sized dataset in order not to work with approximate p-values.
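If we really have to check a larger sample, one pragmatic workaround (an arbitrary choice, not something SciPy prescribes) is to run the test on a random subsample of at most 5000 points:

```
big_sample = np.random.uniform(size=6000)

# The p-value is only approximate above 5000 points,
# so we test a random subsample instead.
subsample = np.random.choice(big_sample, size=5000, replace=False)
shapiro(subsample)
```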

## Conclusion

The Shapiro-Wilk test for normality is a very simple-to-use statistical tool to assess the normality of a dataset. I usually apply it after a proper data visualization with a histogram and/or a Q-Q plot. It’s a very useful tool to ensure that a normality requirement is satisfied whenever we need it, and it should be present in every data scientist’s toolbox.