Data pre-processing for machine learning in Python

Online course

How to transform a dataset before using a machine learning model


After exploring a dataset, it is tempting to think that we’re ready to feed it to our model. That is a mistake: if we don’t transform the dataset properly, our models may perform poorly or fail to work at all.

Each model requires a dataset with the proper shape, and the features must be transformed so that the model can extract information from them properly. Moreover, if we don’t select the best features in advance, training becomes slow and the model may not learn anything useful.

So, before giving our dataset to a model, we must first pre-process it to make it suitable for our algorithm. That’s the purpose of the Data pre-processing phase.

Pre-processing is an important step in data science and machine learning. It helps us give the right shape to our dataset and select the best features for our model.

Moreover, it lets us figure out whether our features have predictive power, which tells us whether the machine learning project we are working on has a chance of succeeding. Without proper pre-processing, the training procedure may become slow or, in the worst case, may not even converge. That’s why I’ve created an entire course about this topic.

Sometimes, aspiring Data Scientists start studying neural networks and other complex models and forget to study how to manipulate a dataset so that their models can use it. As a result, they fail to create good models, and only at the end do they realize that good pre-processing would have saved them a lot of time and improved the performance of their models. Mastering pre-processing techniques is therefore a very important skill.

Pre-processing is often included in wider courses on machine learning. I think that placing these lessons inside a larger machine learning course reduces the perceived value of the topic. Some people think that pre-processing is boring and useless, and start with machine learning without caring about how to manage the data for their model. That’s a big mistake, because they miss how pre-processing can make their models produce better results. That’s why I have created an entire course that focuses only on data pre-processing.

Advantages of pre-processing

  • Clean the dataset

    Many models cannot handle missing values, so unless we fill the blanks they may not work at all.

  • Encode the categorical variables

    Not every model is able to deal with categorical variables directly, so we must first encode them.

  • Give the same order of magnitude to all the features

    Several models (like neural networks) require the features to share the same order of magnitude.

  • Dimensionality reduction

    Removing the useless features will make our model work faster and better.

My online course

In my online course, I outline all the main topics of data pre-processing. Every lesson is covered with practical examples in the Python programming language.

These are the topics of the course:

Data cleaning

Some techniques to fill the blanks in our dataset.
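
As a minimal sketch of what filling the blanks can look like (not taken from the course material), the snippet below uses scikit-learn's SimpleImputer to replace each missing value with the mean of its column. The toy numbers are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with blanks (np.nan); the values are invented.
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

# Replace each blank with the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

Other strategies, such as "median" or "most_frequent", may suit skewed or categorical columns better.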

One-hot encoding and ordinal encoding

The two main techniques for dealing with categorical data.
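
A short illustration of both encodings, assuming scikit-learn's OneHotEncoder and OrdinalEncoder. The category names and their order are hypothetical examples, not course data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# One-hot encoding: one 0/1 column per category, for unordered categories.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
one_hot = OneHotEncoder().fit_transform(colors).toarray()

# Ordinal encoding: one integer per category, for categories with a
# natural order. The order is stated explicitly here.
sizes = np.array([["small"], ["large"], ["medium"]])
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal = encoder.fit_transform(sizes)
```

One-hot encoding avoids implying an order that does not exist, while ordinal encoding keeps the dataset narrow when the order is meaningful.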

Power transform and other numerical transformations

Power transform, binning, binarizing and arbitrary transformations can give the right shape to our dataset.
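
The four transformations can be sketched with scikit-learn as follows. This is an illustrative example on synthetic skewed data; the number of bins, the threshold and the log function are arbitrary choices made for the sketch.

```python
import numpy as np
from sklearn.preprocessing import (Binarizer, FunctionTransformer,
                                   KBinsDiscretizer, PowerTransformer)

# Synthetic, right-skewed, strictly positive feature.
rng = np.random.default_rng(0)
X = rng.exponential(scale=2.0, size=(200, 1))

# Power transform (Yeo-Johnson by default): makes the distribution more
# symmetric and, by default, standardizes the result.
X_power = PowerTransformer().fit_transform(X)

# Binning: replace each value with the index of its equal-width bin.
binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="uniform")
X_binned = binner.fit_transform(X)

# Binarizing: 1 if the value is above the threshold, 0 otherwise.
X_binary = Binarizer(threshold=2.0).fit_transform(X)

# Arbitrary transformation: wrap any function, here log(1 + x).
X_log = FunctionTransformer(np.log1p).fit_transform(X)
```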

Normalization, standardization and robust scaling

The three main techniques to give the same order of magnitude to all the features.
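
A compact comparison of the three scalers, assuming that "normalization" here means min-max scaling to the [0, 1] range (an assumption about the course's terminology). The data is a made-up example whose second column contains an outlier.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Two features on very different scales; the last row holds an outlier.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 10000.0]])

# Normalization: rescale each feature to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: zero mean and unit variance for each feature.
X_std = StandardScaler().fit_transform(X)

# Robust scaling: center on the median and divide by the interquartile
# range, which makes the result less sensitive to outliers.
X_robust = RobustScaler().fit_transform(X)
```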

Principal Component Analysis (PCA)

A very powerful algorithm for dimensionality reduction.
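
As a rough sketch, scikit-learn's PCA can compress correlated features into a few components. The data below is synthetic: five features built from only two underlying signals, so two components should capture almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 correlated features derived from 2 hidden signals.
rng = np.random.default_rng(0)
signals = rng.normal(size=(100, 2))
X = signals @ rng.normal(size=(2, 5))

# Project the 5 features onto the 2 directions of largest variance.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Fraction of the total variance retained by the 2 components.
retained = pca.explained_variance_ratio_.sum()
```

On real data, features are usually scaled before PCA, because the algorithm is sensitive to the order of magnitude of each feature.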

Univariate feature selection

The most common technique for reducing dimensionality according to the predictive power of each feature.
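
One possible way to do this in scikit-learn is SelectKBest, which scores each feature individually against the target and keeps the top k. The sketch assumes a classification problem scored with the ANOVA F-test (f_classif); the dataset is synthetic and k=3 is an arbitrary choice.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 10 features, only 3 of them informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Score every feature on its own and keep the 3 with the highest score.
selector = SelectKBest(score_func=f_classif, k=3)
X_selected = selector.fit_transform(X, y)

# Boolean mask marking which of the original columns were kept.
kept = selector.get_support()
```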


An oversampling technique to apply when one target class of a classification problem has too few records.
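
The specific technique is not named in this section, so the sketch below uses plain random oversampling with replacement as a stand-in, built on scikit-learn's resample utility. It only illustrates the general idea of balancing classes and is not necessarily the method taught in the course. The toy data is made up.

```python
import numpy as np
from sklearn.utils import resample

# Imbalanced toy dataset: 8 records of class 0, only 2 of class 1.
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

X_majority, X_minority = X[y == 0], X[y == 1]

# Draw minority records with replacement until both classes match in size.
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=0)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.array([0] * len(X_majority) + [1] * len(X_minority_up))
```

Oversampling should be applied to the training set only, so that the test set keeps the real class distribution.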

Python libraries 

Everything will be done using the scikit-learn, imblearn and matplotlib libraries.

Sklearn’s ColumnTransformer and Pipeline objects

Two of the main objects in the sklearn library, very often used for pre-processing. The course covers them in depth.
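
To give an idea of how the two objects fit together, here is a small sketch: a ColumnTransformer applies different pre-processing to different columns, and a Pipeline chains it with a model. The toy data, the column roles and the choice of LogisticRegression are assumptions made for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data: columns 0 and 1 are numeric, column 2 is a category
# stored as an integer code. One numeric value is missing.
X = np.array([[25.0, 50000.0, 0],
              [32.0, np.nan, 1],
              [47.0, 82000.0, 2],
              [51.0, 91000.0, 1],
              [38.0, 60000.0, 0],
              [29.0, 52000.0, 2]])
y = np.array([0, 0, 1, 1, 1, 0])

# Numeric columns: fill the blanks, then standardize.
numeric_steps = Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())])

# Route each group of columns to its own transformer.
preprocess = ColumnTransformer([
    ("num", numeric_steps, [0, 1]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), [2]),
])

# Chain pre-processing and model so they are fitted together.
model = Pipeline([("preprocess", preprocess),
                  ("classifier", LogisticRegression())])
model.fit(X, y)
predictions = model.predict(X)
```

Keeping the pre-processing inside the pipeline means the same transformations, fitted on the training data, are applied automatically to any new data passed to predict.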

What's inside the course?

The course contains:

5 hours of video lessons

Python code for each example

Discussion area to interact with the teacher and the students

Certificate of completion at the end


This course is excellent. I recommend it to anyone who wants to become a data scientist. The teacher covers the topics with clarity and synthesis, both in theory and in practice. I particularly liked the practical examples in Python.


So far, its amazing

Javed Shaikh

This is very good course


This course is really good.

Ojo Babalola

The teacher

My name is Gianluca Malato. I'm Italian and hold a Master's Degree cum laude in Theoretical Physics of disordered systems from "La Sapienza" University of Rome. I'm a Data Scientist who has been working for years in the banking and insurance sector. I have extensive experience in software programming and project management, and I have been dealing with data analysis and machine learning in the corporate environment for several years. I am also skilled in data analysis (e.g. relational databases and SQL), numerical algorithms (e.g. ODE integration, optimization algorithms) and simulation (e.g. Monte Carlo techniques). I've written many articles about Machine Learning, R and Python, and I've been a Top Writer in the Artificial Intelligence category.

Frequently Asked Questions

Does the course have a start and a finish date?

No. Once you enroll, you can follow the recorded video lessons whenever you want.

How can I pay for the course?

You can pay by PayPal or credit card.

How can I follow the lessons?

Once you pay for your enrollment, you can access the recorded video lessons of the course whenever you want from your computer using this website. The videos are streamed, so you’ll need an Internet connection and access to this website in order to watch them. After you create your account and log in, you can use the My Courses link at the top of every page to see all the courses you have enrolled in.

What language will be used?

During this course, the spoken and written language is English.

I can’t afford the whole price. What can I do?

You can subscribe to the school membership. By paying a monthly or yearly fee, you’ll get access to this course and all the other courses of the school.

What if I’m not satisfied?

If you are not satisfied with the course, we apply a 30-day refund policy. Just contact us within 30 days from the date of purchase to get a full refund.


You can join the course with a one-time payment or through the school membership.

30-day money back guarantee

If you are not satisfied with the course, we apply a 30-day refund policy. Just contact us within 30 days from the date of purchase to get a full refund.

One-time payment

$50 (+ VAT)
  • Complete access to the course lessons (5+ hours)
  • Access to the discussion area of each lesson
  • Python code for each example
  • 30-day refund policy

School membership

You can access all the courses of my school by subscribing to my school membership.

Go to the school membership

Local taxes (e.g. VAT) may apply
