Data scientists usually search for the model with the highest possible accuracy. However, they should also pay attention to another property: stability. In this article, I explain what stability is and how to increase it using a technique called “bagging”.
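The article's own walkthrough isn't reproduced here, but as a rough sketch of the idea: bagging trains many models on bootstrap resamples and averages their predictions, which reduces variance. A minimal scikit-learn version (the synthetic dataset and parameters are purely illustrative) might look like this:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# A small synthetic dataset stands in for real data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# BaggingClassifier fits each estimator (a decision tree by default) on a
# bootstrap resample and averages their votes, lowering the variance
# (i.e. raising the stability) of the final model.
model = BaggingClassifier(n_estimators=50, random_state=0)

scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())
```

A low standard deviation of the cross-validation scores is one simple proxy for the stability the article discusses.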
Feature selection has always been a major challenge in machine learning.
In this article, I’ll show how to perform feature selection using a random forest model in Python.
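The article's exact procedure isn't shown here, but the common recipe it refers to relies on a fitted forest's impurity-based feature importances. A minimal sketch (dataset and the 0.1 cutoff are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only 3 of the 8 features carry signal.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances are normalized to sum to 1;
# keep the features whose importance exceeds a chosen cutoff.
importances = forest.feature_importances_
selected = [i for i, imp in enumerate(importances) if imp > 0.1]
print(selected)
```

In practice the cutoff is a judgment call; cross-validating the downstream model with the reduced feature set is the safer way to pick it.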
Data Science and machine learning are two wonderful, exciting disciplines that play a large part in our lives. People sometimes confuse them, but they are quite different things.
I’m very glad to announce that I’ve published my new free course! The topic is Exploratory Data Analysis using …
Data Science has entered the world of big companies, where the data is. Managers of such companies often ask for things they don’t actually need and forget to demand the only truly useful things.
I’ve just introduced monthly payments for my learning paths, and I have other plans and ideas for the last part of this year.
Collinearity is a very common problem in machine learning projects. It is the correlation between the features of a dataset, and it can reduce the performance of our models because it increases variance and the number of dimensions. It becomes even worse when you have to work with unsupervised models. To solve this problem, I’ve created a Python library that removes collinear features.
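The library itself isn't shown here, but a minimal version of the underlying idea is to drop any feature whose absolute pairwise correlation with an earlier feature exceeds a threshold. A sketch (the `drop_collinear` helper, the 0.9 threshold, and the toy data are all illustrative assumptions, not the library's API):

```python
import numpy as np
import pandas as pd

def drop_collinear(df, threshold=0.9):
    """Drop features whose absolute pairwise correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear with "a"
    "c": rng.normal(size=200),                      # independent
})
print(drop_collinear(df).columns.tolist())  # "b" is dropped, "a" and "c" remain
```

A correlation threshold only catches pairwise collinearity; variance inflation factors are the usual next step when several features are jointly collinear.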
Data scientists often need to check the statistics of their datasets, particularly to compare them against known distributions or against other datasets. There are several hypothesis tests we can run for this purpose, but I often prefer a simple graphical tool: the Q-Q plot.
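A Q-Q plot pairs the sorted sample values with the quantiles of a reference distribution; if the points fall near a straight line, the sample plausibly follows that distribution. A minimal sketch with SciPy (the normal sample is illustrative; in a real analysis you would plot the returned quantile pairs):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=5, scale=2, size=500)

# probplot returns the (theoretical, ordered-sample) quantile pairs plus the
# slope, intercept, and correlation r of the best-fit line through them.
(theoretical_q, ordered_sample), (slope, intercept, r) = stats.probplot(
    sample, dist="norm"
)
print(round(slope, 2), round(intercept, 2), round(r, 3))
```

For a truly normal sample, the fitted slope and intercept recover the scale and location, and r stays very close to 1; systematic curvature in the points signals heavy tails or skew.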
In binary classification problems, we usually convert the score given by a model into a predicted class by applying a threshold: if the score is greater than the threshold, we predict 1; otherwise, we predict 0. This threshold is usually set to 0.5, but is that always the right choice?
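One common alternative to the default 0.5 is to sweep candidate thresholds on held-out data and keep the one that maximizes the metric you care about. A sketch of that idea, assuming F1 as the target metric (the dataset, model, and threshold grid are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data, where 0.5 is often a poor threshold.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Sweep thresholds instead of assuming 0.5; keep the one maximizing F1.
thresholds = np.linspace(0.05, 0.95, 19)
f1s = [f1_score(y_te, proba >= t) for t in thresholds]
best = thresholds[int(np.argmax(f1s))]
print(best, max(f1s))
```

Strictly speaking, the threshold should be tuned on a validation split separate from the final test set, or the reported metric will be optimistically biased.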
When we work on a binary classification problem, we have to choose a performance metric that represents the generalization capability of our model. There is no universal metric, since the right choice strongly depends on the problem we are building the model for.
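A quick illustration of why no single metric fits every problem: on a heavily imbalanced dataset, a model that always predicts the majority class scores well on accuracy yet is useless by F1. The toy labels below are illustrative:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# 95 negatives, 5 positives: a degenerate "always predict 0" model.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros(100, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.95, looks great
print(f1_score(y_true, y_pred))        # 0.0, the model finds no positives
```

Which of the two numbers matters is exactly the problem-dependent choice the article is about.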