Gianluca Malato, Author at Your Data Teacher

March 13, 2023

Are you still using the mean and median values to clean the blanks?

Working with missing values is a common task in machine learning. We can say that it’s the very first task to accomplish before starting a pre-processing pipeline. The most common approach to blank filling is to use the mean and the median values. Although this is a very common practice, maybe there’s a more data-driven approach we can use.
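As a point of reference, the mean and median approach mentioned above can be sketched in a few lines of plain Python. The function name `fill_blanks`, the use of `None` to mark a blank, and the `ages` column are assumptions made for this illustration, not something taken from the post.

```python
from statistics import mean, median

def fill_blanks(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    if strategy not in ("mean", "median"):
        raise ValueError(f"unknown strategy: {strategy!r}")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot fill a column that has no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical column with two blanks and one large value (90).
ages = [22, None, 35, 41, None, 90]
filled_mean = fill_blanks(ages, "mean")      # blanks become 47
filled_median = fill_blanks(ages, "median")  # blanks become 38.0
print(filled_mean)
print(filled_median)
```

The two strategies disagree here because the single large value pulls the mean upward while the median stays closer to the bulk of the data.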

February 15, 2023

A practical introduction to sequential feature selection

Feature selection is always a challenging task for data scientists. Identifying the right set of features is crucial for the success of a model. Several techniques pick features according to the performance that a given set of features gives to a model. One of them is sequential feature selection.
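A minimal sketch of the forward variant of this idea: start from an empty set and, at each step, add the feature that improves the score the most. The names `forward_selection`, `toy_score` and the feature weights below are hypothetical; in a real project the score would typically be a cross-validated model metric rather than a lookup table.

```python
def forward_selection(candidates, score, n_select):
    """Greedy forward selection: repeatedly add the feature that maximizes score."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < n_select:
        # Try each remaining feature on top of the current set and keep the best.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical stand-in for "performance of a model trained on these features".
WEIGHTS = {"age": 0.30, "income": 0.25, "income_copy": 0.24, "zip": 0.05}

def toy_score(features):
    total = sum(WEIGHTS[f] for f in features)
    # A redundant pair adds almost nothing once one of the two is already in.
    if "income" in features and "income_copy" in features:
        total -= 0.22
    return total

chosen = forward_selection(WEIGHTS, toy_score, n_select=3)
print(chosen)
```

Note that `income_copy` has a high score on its own but is skipped, because the procedure evaluates each candidate together with the features already selected.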

January 23, 2023

How to compare variances of non-normal samples

The Levene test for variance is a statistical test that is used to determine whether the variances of two or more groups are equal. This test is often used in experimental design to ensure that the groups being compared are similar in terms of their variability. In this post, I will discuss the benefits of the Levene test and provide an example of how to use it in Python to compare the variances of a sample drawn from a normal distribution and a sample drawn from a uniform distribution.
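To make the idea concrete, here is a rough sketch that computes the Levene W statistic by hand with the standard library only. It is not the code from the post: the function name `levene_statistic`, the seed and the sample sizes are assumptions, and deviations are taken from the median (the Brown-Forsythe variant).

```python
import random
from statistics import mean, median

def levene_statistic(*groups, center=median):
    """Levene's W statistic; center=median gives the Brown-Forsythe variant."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of each observation from its own group's center.
    deviations = []
    for g in groups:
        c = center(g)
        deviations.append([abs(x - c) for x in g])
    group_means = [mean(zs) for zs in deviations]
    grand_mean = mean(z for zs in deviations for z in zs)
    between = sum(len(zs) * (m - grand_mean) ** 2
                  for zs, m in zip(deviations, group_means))
    within = sum((z - m) ** 2
                 for zs, m in zip(deviations, group_means) for z in zs)
    return (n_total - k) / (k - 1) * between / within

# Assumed setup: 200 points from N(0, 1) and 200 points from U(0, 1).
rng = random.Random(42)
normal_sample = [rng.gauss(0, 1) for _ in range(200)]
uniform_sample = [rng.uniform(0, 1) for _ in range(200)]
w = levene_statistic(normal_sample, uniform_sample)
print(f"W = {w:.2f}")
```

Under the null hypothesis of equal variances, W is compared with an F distribution with k - 1 and N - k degrees of freedom, so a large W points towards unequal variances. In practice, `scipy.stats.levene` returns both the statistic and the p-value.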

November 14, 2022

How is your data distributed? A practical introduction to the Kolmogorov-Smirnov test

Data scientists often need to assess which distribution their data follows. We have already seen the Shapiro-Wilk test for normality, but what about non-normal distributions? There’s another test that can help us, which is the Kolmogorov-Smirnov test.
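The quantity behind the test is the largest gap between the empirical cumulative distribution of the sample and the reference cumulative distribution. A minimal sketch of the one-sample statistic follows; the helper name `ks_statistic`, the seed and the two samples are assumptions made for this illustration.

```python
import random
from statistics import NormalDist

def ks_statistic(sample, cdf):
    """One-sample Kolmogorov-Smirnov statistic against a reference CDF."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The empirical CDF jumps from (i - 1) / n to i / n at x.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

rng = random.Random(0)
normal_sample = [rng.gauss(0, 1) for _ in range(500)]
uniform_sample = [rng.uniform(-3, 3) for _ in range(500)]

reference = NormalDist(mu=0, sigma=1).cdf
d_normal = ks_statistic(normal_sample, reference)
d_uniform = ks_statistic(uniform_sample, reference)
print(f"D (normal sample)  = {d_normal:.3f}")
print(f"D (uniform sample) = {d_uniform:.3f}")
```

For large samples, a common rule of thumb rejects the reference distribution at the 5% level when D exceeds roughly 1.36 / sqrt(n). For proper p-values, `scipy.stats.kstest` is the usual choice.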

October 25, 2022

3 easy hypothesis tests for the mean value

Data scientists and analysts often have to work with mean values and need to compare the mean value of a …

October 17, 2022

A beginner’s guide to statistical hypothesis tests

Statistics is a data scientist’s best friend. Hypothesis tests are a family of very useful tools for any kind of analysis. They can really help us assess the statistical significance of some phenomena. However, we have to master such tools properly in order to benefit from their advantages.

September 14, 2022

How to create a voice expense manager using Make and AssemblyAI

I have long been looking for a simple expense manager app that I could use with my wife to track our daily expenses. It should be a voice app that transcribes what I say and adds a line to a spreadsheet or something similar. Since I could never find anything like that, I decided to build such an app myself.

August 1, 2022

How to create a voice diary with Telegram, Python and AssemblyAI

Recently, I stumbled upon a tool called AssemblyAI, which provides APIs for speech recognition and analysis. I decided to try it in order to create a small voice diary using Telegram, Python and Notion.

July 11, 2022

Why you shouldn’t use PCA in a supervised machine learning project

Principal Component Analysis is a very useful dimensionality reduction tool. It can really help you reduce the number of features of a model. Although it may seem like a powerful tool for a data scientist, there are some drawbacks that I think make it unsuitable for supervised machine learning projects.

July 4, 2022

Does your model beat the baseline?

Every time we train a model, we should check whether its performance beats a baseline, that is, a trivial model that doesn’t take the inputs into account. By comparing our model with a baseline model, we can figure out whether it actually learns anything.
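A small sketch of this check for a regression problem: the baseline always predicts the mean of the training target, and a least-squares line stands in for "our model". The helper names and the toy data are assumptions made for this illustration.

```python
from statistics import mean

def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def fit_mean_baseline(y_train):
    """Trivial model: ignores the input and always predicts the training mean."""
    constant = mean(y_train)
    return lambda x: constant

def fit_line(x_train, y_train):
    """Ordinary least-squares line, standing in for the model under test."""
    x_bar, y_bar = mean(x_train), mean(y_train)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_train, y_train))
    s_xx = sum((x - x_bar) ** 2 for x in x_train)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept

# Hypothetical data where y is roughly 2 * x.
x_train, y_train = [1, 3, 5, 7, 9], [2.1, 5.9, 10.2, 13.8, 18.1]
x_test, y_test = [2, 4, 6, 8], [4.0, 8.1, 11.9, 16.2]

baseline = fit_mean_baseline(y_train)
model = fit_line(x_train, y_train)

baseline_mse = mse(y_test, [baseline(x) for x in x_test])
model_mse = mse(y_test, [model(x) for x in x_test])
print(f"baseline MSE = {baseline_mse:.3f}, model MSE = {model_mse:.3f}")
print("model beats baseline:", model_mse < baseline_mse)
```

If the model's error on held-out data is not clearly lower than the baseline's, the model has probably not learned anything useful from the inputs. scikit-learn ships ready-made baselines for this purpose (`DummyRegressor` and `DummyClassifier`).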