Handling missing values is a common task in machine learning, and arguably the very first one to tackle in a pre-processing pipeline. The most common approach to filling blanks is to use the mean or the median value. Although this is a very common practice, there may be a more data-driven approach we can use.
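As a point of reference, the mean and median filling mentioned above could look like the following minimal sketch. It assumes scikit-learn's `SimpleImputer` and a tiny made-up array; the data is purely illustrative and is not taken from the post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: two numeric columns, each with one missing value
X = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [np.nan, 30.0],
    [4.0, 80.0],
])

# Replace each missing value with the mean of its column
mean_filled = SimpleImputer(strategy="mean").fit_transform(X)

# Replace each missing value with the median of its column
median_filled = SimpleImputer(strategy="median").fit_transform(X)

print(mean_filled)
print(median_filled)
```

The two strategies differ only in the statistic computed per column, which is why they are a natural baseline to compare any more data-driven choice against.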
Feature selection is always a challenging task for data scientists. Identifying the right set of features is crucial for the success of a model. Several techniques select features according to the performance a model achieves with them. One of them is sequential feature selection.
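A minimal sketch of the idea, assuming scikit-learn's `SequentialFeatureSelector` with a logistic regression on the iris dataset. The estimator, the dataset and the number of features to keep are illustrative choices, not necessarily the ones used in the post.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Forward selection: start from no features and, at each step, add the one
# that gives the best cross-validated score for the estimator
selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,  # assumed target size for this example
    direction="forward",
    cv=5,
)
selector.fit(X, y)

selected_mask = selector.get_support()  # boolean mask over the columns
X_selected = selector.transform(X)      # data restricted to the chosen columns

print(selected_mask)
```

Setting `direction="backward"` would instead start from all features and remove them one at a time.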
The Levene test for variance is a statistical test that is used to determine whether or not the variances of two or more groups are equal. This test is often used in experimental design to ensure that the groups being compared are similar in terms of their variability. In this post, I will discuss the benefits of the Levene test and provide an example of how to use it in Python to compare the variance of a sample drawn from a normal distribution and a sample drawn from a uniform distribution.
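The comparison described above can be sketched as follows, assuming `scipy.stats.levene` and NumPy's random generator. The sample sizes, distribution parameters, seed and significance level are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the example is reproducible

# One sample from a standard normal, one from a uniform on [0, 1)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=500)
uniform_sample = rng.uniform(low=0.0, high=1.0, size=500)

# Null hypothesis: the two groups have equal variances
statistic, p_value = stats.levene(normal_sample, uniform_sample)

alpha = 0.05  # assumed significance level
if p_value < alpha:
    print("Reject the null hypothesis: the variances differ")
else:
    print("No evidence that the variances differ")
```

With these parameters the theoretical variances are 1 and 1/12, so the test is expected to reject equality of variances.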
Data scientists often need to assess which distribution their data follows. We have already seen the Shapiro-Wilk test for normality, but what about non-normal distributions? There’s another test that can help us, which is the Kolmogorov-Smirnov test.
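A minimal sketch of a one-sample Kolmogorov-Smirnov test, assuming `scipy.stats.kstest` and a synthetic uniform sample. The sample, the seed and the candidate distributions are illustrative assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.uniform(low=0.0, high=1.0, size=500)

# Null hypothesis in each call: the sample follows the named distribution
against_normal = stats.kstest(sample, "norm")      # standard normal
against_uniform = stats.kstest(sample, "uniform")  # uniform on [0, 1]

print("vs normal: ", against_normal.statistic, against_normal.pvalue)
print("vs uniform:", against_uniform.statistic, against_uniform.pvalue)
```

A small p-value suggests that the sample does not follow the candidate distribution, so here the comparison against the standard normal should be rejected.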
Data scientists and analysts often have to work with mean values and need to compare the mean value of a …
Statistics is a data scientist’s best friend. Hypothesis tests are a family of very useful tools for any kind of analysis. They can really help us assess the statistical significance of some phenomena. However, we have to master such tools properly in order to benefit from their advantages.
I have long been looking for a simple expense manager app I could use with my wife to collect our daily expenses. It should be a voice app that transcribes what I say and adds a line to a spreadsheet or something similar. Since I could never find anything like that, I’ve decided to build such an app on my own.
Recently, I stumbled upon a tool called AssemblyAI, which implements some APIs for speech recognition and analysis. I’ve decided to try it in order to create a small voice diary using Telegram, Python and Notion.
Principal Component Analysis is a very useful dimensionality reduction tool. It can really help you reduce the number of features of a model. Although it may seem a powerful tool for a data scientist, there are some drawbacks that I think make it unsuitable for supervised machine learning projects.
Every time we train a model, we should check whether its performance beats some baseline, which is a trivial model that doesn’t take the inputs into account. By comparing our model with a baseline model, we can figure out whether it actually learns or not.
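One way to sketch such a comparison, assuming scikit-learn's `DummyClassifier` as the baseline and a logistic regression on the iris dataset as the model under test. The dataset, split and model are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Baseline: always predicts the most frequent training class, ignoring the inputs
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Actual model, which does use the inputs
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

baseline_accuracy = baseline.score(X_test, y_test)
model_accuracy = model.score(X_test, y_test)

print("baseline accuracy:", baseline_accuracy)
print("model accuracy:   ", model_accuracy)
```

If the model's score is not clearly above the baseline's, it is a sign that the model has learned little from the features.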