Data is everything and models are, sometimes, lazy in front of data they don’t understand. That’s why we need to apply a particular set of transformations that fall under the “pre-processing” name.
Feature scaling is one such transformation. Let’s see how it works and why we should use it.
What is scaling?
Scaling is a set of linear transformations that make all the features comparable. Imagine you have a feature A whose values are around 10 and a feature B whose values are around 1000. We don’t want our model to consider B more important than A only because it has a higher order of magnitude. That’s why we need to change the units of our features in order to make them comparable.
Scaling applies a linear transformation that performs this change of units. A linear transformation with a positive slope doesn’t change the correlation between a feature and the target, so the intrinsic predictive power of the feature remains the same.
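To make the point about correlation concrete, here is a minimal sketch in plain Python. The numbers in `feature` and `target` are made up for illustration, and `pearson` is a small helper written for this example, not a library function.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical toy data: one feature and a target.
feature = [1200.0, 850.0, 1430.0, 990.0, 1710.0, 1100.0]
target = [3.1, 2.2, 3.9, 2.4, 4.6, 2.7]

# A linear rescaling with a positive slope (here, mapping onto 0-1).
lo, hi = min(feature), max(feature)
scaled = [(v - lo) / (hi - lo) for v in feature]

r_before = pearson(feature, target)
r_after = pearson(scaled, target)
print(r_before, r_after)  # the two values agree up to floating-point error
```

The units of the feature change, but its linear relationship with the target does not.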
Scaling is a very important part of pre-processing and it must be performed carefully.
Why bother about scaling?
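Because some models won’t work properly if the features have different orders of magnitude. This is the case for all distance-based models, such as K-Nearest Neighbors, where the feature with the largest scale dominates the distance.
Other models train poorly with unscaled features. It’s the case of logistic regression and neural networks, for example, whose gradient-based training can converge very slowly, or not at all, when the features have very different scales.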
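A small sketch shows how this can happen with a distance-based model. The three points and the feature ranges below are invented for illustration: feature A is assumed to live in 0-10 and feature B in 0-2000.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def min_max(point, ranges):
    """Rescale each coordinate onto 0-1 using an assumed (min, max) per feature."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(point, ranges))

# Assumed ranges for (feature A, feature B); hypothetical values.
ranges = [(0.0, 10.0), (0.0, 2000.0)]

p = (2.0, 1000.0)
q = (9.0, 1010.0)  # very different from p in A, almost identical in B
r = (2.5, 1200.0)  # almost identical to p in A, moderately different in B

# Raw distances: B dominates, so q looks like the nearest neighbor of p.
raw_pq = euclidean(p, q)
raw_pr = euclidean(p, r)

# Scaled distances: both features count, and r becomes the nearest neighbor.
sp, sq, sr = (min_max(pt, ranges) for pt in (p, q, r))
scaled_pq = euclidean(sp, sq)
scaled_pr = euclidean(sp, sr)

print(raw_pq < raw_pr, scaled_pr < scaled_pq)
```

On the raw data, a difference of 7 in A (most of its range) barely matters next to a difference of 200 in B. After scaling, the nearest neighbor of `p` changes.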
Let’s see some scaling techniques.
The simplest kind of scaling is normalization, also known as min-max scaling. With this algorithm, every feature is scaled into a 0-1 interval by subtracting its minimum value and dividing by its range (maximum minus minimum).
Although this kind of scaling seems simple and powerful, it has some flaws. It’s very sensitive to outliers and, if the features have different variances, the transformed features will generally have different variances as well. This is a drawback, because a model could consider features with low variance as less relevant. Moreover, if a feature has several outliers, the normalized feature will have a compressed distribution that will make our model think that it has an almost constant value, so we fall again into the low-variance trap.
However, normalization works well for features that have a small number of outliers, and it’s a good first choice when we start working on a new model.
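Here is a minimal sketch of normalization in plain Python, together with the outlier problem described above. The data is made up, and returning zeros for a constant feature is a choice made for this example only.

```python
def min_max_scale(values):
    """Map values linearly onto 0-1: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: there is no range to divide by.
        # Returning zeros is an assumption made for this sketch.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

clean = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
with_outlier = clean + [1000.0]  # same data plus one extreme value

scaled_clean = min_max_scale(clean)
scaled_outlier = min_max_scale(with_outlier)

print(scaled_clean)         # evenly spread between 0 and 1
print(scaled_outlier[:-1])  # the original points, now squeezed close to 0
```

With the outlier included, the six original points all end up below 0.02, which is the compressed distribution mentioned above. In scikit-learn, the same transformation is provided by `MinMaxScaler`.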
Standardization is one of the most important scaling algorithms. It makes every feature have zero mean and unit variance, by subtracting the mean and dividing by the standard deviation.
It’s actually the most used scaling technique because it’s simple and makes the statistics of the features comparable. Setting the variance to 1 and the mean to 0 makes every feature have an order of magnitude equal to 1, and that’s very important for models like logistic regression and neural networks, which can suffer from the vanishing gradient problem.
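As a minimal sketch, standardization can be written in a few lines of plain Python. The two features below are invented, and the population standard deviation is used, which is the same convention scikit-learn’s `StandardScaler` follows.

```python
from statistics import mean, pstdev

def standardize(values):
    """Return (x - mean) / std for every value, using the population std."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("a constant feature cannot be standardized")
    return [(v - mu) / sigma for v in values]

# Hypothetical features with very different orders of magnitude.
feature_a = [8.0, 9.0, 10.0, 11.0, 12.0]
feature_b = [800.0, 950.0, 1000.0, 1100.0, 1150.0]

z_a = standardize(feature_a)
z_b = standardize(feature_b)

# Both features now have mean 0 and standard deviation 1
# (up to floating-point error).
print(mean(z_a), pstdev(z_a))
print(mean(z_b), pstdev(z_b))
```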
Robust scaling is less used than standardization but can be very useful when we have to deal with features that have several outliers. It transforms every feature in order to make it have a median of 0 and an Interquartile Range (IQR) equal to 1. Remember, the median is the 50th percentile, while the Interquartile Range is the difference between the 75th and the 25th percentile.
Outliers are commonly defined as those values that are greater than the 75th percentile plus 1.5 multiplied by the IQR, or lower than the 25th percentile minus 1.5 multiplied by the IQR. Since the median and the IQR are barely affected by such extreme values, robust scaling doesn’t suffer from the presence of outliers.
Honestly, I haven’t seen robust scaling used very much, maybe because it’s useful only when we have several outliers (that we can detect, for example, using a boxplot).
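The sketch below applies robust scaling to a small made-up feature with one extreme value. The quartiles are computed with the `inclusive` method of Python’s `statistics.quantiles`; this is an assumption of the example, and other quantile conventions give slightly different numbers.

```python
from statistics import median, quantiles

def robust_scale(values):
    """Return (x - median) / IQR for every value."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero, the feature cannot be robust-scaled")
    med = median(values)
    return [(v - med) / iqr for v in values]

# Hypothetical feature: six ordinary values and one outlier.
values = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 1000.0]
scaled = robust_scale(values)

# The ordinary points stay well spread around 0;
# only the outlier ends up far away.
print(scaled)
```

Unlike the 0-1 scaling of the same kind of data, the ordinary points keep a usable spread, because the median and the IQR ignore how extreme the outlier is. In scikit-learn, this transformation is provided by `RobustScaler`.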
Which scaler is best?
There’s no clear answer. I suggest starting with normalization, then trying standardization if the features have very different variances. If there are several outliers, we can use robust scaling. If you don’t have much time and want a good result quickly, a grid search with cross-validation can help you choose the scaler.
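A grid search over scalers could look like the following sketch, which uses scikit-learn. The wine dataset, the K-Nearest Neighbors model, the 5 folds and the accuracy metric are all assumptions chosen for illustration; swap in your own data and model.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Example dataset (an assumption of this sketch).
X, y = load_wine(return_X_y=True)

# The scaler is a pipeline step, so the grid search can replace it.
pipeline = Pipeline([
    ("scaler", StandardScaler()),  # placeholder, overridden by the grid
    ("model", KNeighborsClassifier()),
])

# Candidate scalers to compare with cross-validation.
param_grid = {"scaler": [MinMaxScaler(), StandardScaler(), RobustScaler()]}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_scaler = search.best_params_["scaler"]
print(type(best_scaler).__name__, search.best_score_)
```

Keeping the scaler inside the pipeline means it is fitted only on the training folds of each split, so no information from the validation fold leaks into the scaling parameters.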
Generally speaking, don’t forget that each feature should be managed independently and might have scaling needs that are different from the other features.
In the following video, I explain how to perform Normalization and Standardization in Python.
If you are interested in feature scaling, register for my upcoming webinars on data pre-processing.