In binary classification, we often evaluate models with metrics that are proportions, such as accuracy, precision and recall. But how can we calculate the error on these estimates? Are two models with 95% accuracy actually equivalent? Well, not necessarily. Let’s see why.
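Since accuracy is a proportion of correct predictions, a standard normal-approximation confidence interval shows why the test-set size matters. A minimal sketch (the function name and the test-set sizes are illustrative, not from the article):

```python
import math

def accuracy_confidence_interval(accuracy, n, z=1.96):
    """Normal-approximation 95% confidence interval for a proportion
    such as accuracy, computed from n test samples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * se, accuracy + z * se

# Two models, both 95% accurate, evaluated on test sets of different sizes.
low_a, high_a = accuracy_confidence_interval(0.95, 100)    # small test set
low_b, high_b = accuracy_confidence_interval(0.95, 10000)  # large test set
```

The first interval spans roughly 0.91 to 0.99, the second roughly 0.946 to 0.954: the same 95% headline accuracy hides very different uncertainties.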
I often meet students who start their journey towards data science with Keras, TensorFlow and, generally speaking, Deep Learning. They build tons of neural networks like crazy, but in the end their models fail because they don’t know machine learning well enough, nor are they able to apply the pre-processing techniques needed to make neural networks work.
Hyperparameter tuning is one of the most important parts of a machine learning pipeline. A poor choice of hyperparameter values can lead to misleading results and a model with poor performance.
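As a minimal sketch of what tuning looks like in practice, here is a cross-validated grid search with scikit-learn; the model, the search space over the regularization strength `C`, and the synthetic dataset are all illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic dataset just for demonstration.
X, y = make_classification(n_samples=200, random_state=0)

# Try several values of the hyperparameter C with 5-fold cross-validation.
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)

print(grid.best_params_, grid.best_score_)
```

Each candidate value is scored on held-out folds, so the selected `C` reflects generalization rather than training fit.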
Neural networks are fascinating and very efficient tools for data scientists, but they have a major flaw: they are unexplainable black boxes. In fact, they don’t give us any information about feature importance. Fortunately, there is a powerful approach we can use to interpret every model, even neural networks: the SHAP approach.
Google Sheets is a very powerful (and free) tool for creating spreadsheets. I’ve almost replaced LibreOffice Calc with Sheets, because it’s very convenient to work with. Sometimes, a data scientist has to pull some data from a Google Sheet into a Python notebook. In this article, I’ll show you how to do it using just pandas.
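The core trick is that a shared sheet can be exported as CSV through a URL that pandas reads directly. A minimal sketch (the sheet ID below is a made-up placeholder; the sheet must be shared as "anyone with the link can view"):

```python
import pandas as pd

# Hypothetical sheet ID and tab name; replace with your own.
sheet_id = "1aBcD3fGhIjKlMnOpQrStUvWxYz"
sheet_name = "Sheet1"

# Google Sheets exposes each tab as CSV via this query endpoint.
url = (f"https://docs.google.com/spreadsheets/d/{sheet_id}"
       f"/gviz/tq?tqx=out:csv&sheet={sheet_name}")

# df = pd.read_csv(url)  # requires network access and a publicly shared sheet
```

No API keys or extra libraries are needed as long as the sheet is link-shared.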
Neural networks are a fascinating field of machine learning. Let’s see how to find the best number of neurons in a neural network for our dataset.
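One simple way to pick the number of neurons is to cross-validate a few candidate sizes and keep the best. A sketch with scikit-learn’s `MLPClassifier` on a synthetic dataset (the candidate sizes 2, 8 and 32 are illustrative assumptions):

```python
import warnings
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
warnings.filterwarnings("ignore")  # the MLP may not fully converge here

scores = {}
for n_neurons in (2, 8, 32):  # hypothetical candidate hidden-layer sizes
    model = MLPClassifier(hidden_layer_sizes=(n_neurons,),
                          max_iter=300, random_state=0)
    scores[n_neurons] = cross_val_score(model, X, y, cv=3).mean()

best = max(scores, key=scores.get)  # size with the highest mean CV accuracy
```

Cross-validation keeps the choice honest: a size that merely memorizes the training data will not score well on held-out folds.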
The first thing I learned as a data scientist is that feature selection is one of the most important steps of a machine learning pipeline. Fortunately, some models can help us accomplish this goal by giving us their own interpretation of feature importance. One such model is the Lasso regression.
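Because the L1 penalty drives uninformative coefficients to exactly zero, the surviving coefficients act as a feature selector. A minimal sketch on synthetic data where only the first two of five features matter (the `alpha` value is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features actually drive the target.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)

# Features whose coefficients survived the L1 penalty.
selected = np.flatnonzero(lasso.coef_)
```

The three noise features end up with coefficients of exactly zero, so `selected` contains only the two informative ones.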
In the machine learning world, data scientists are often told to train a supervised model on a large training dataset and test it on a smaller amount of data. The reason the training dataset is chosen to be larger than the test set is the common claim that the more data used for training, the better the model learns.
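That claim is easy to check empirically by fitting the same model on growing slices of the training set and scoring each fit on a fixed test set. A sketch on synthetic data (the 80/20 split and the slice sizes are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
# The usual convention: hold out a smaller share (here 20%) for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for n in (20, 100, len(X_train)):  # growing slices of the training set
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    scores[n] = model.score(X_test, y_test)  # accuracy on the fixed test set
```

Plotting `scores` against `n` gives a learning curve; test accuracy typically climbs with training size and then flattens out.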
Language detection (or identification) is a fascinating branch of Natural Language Processing. Its goal is to create a model that is able to detect the language a text is written in. Data Scientists usually employ neural network models to accomplish such a goal. In this article, I show how to create a simple language detection model in Python using a Naive Bayes model.
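A minimal sketch of the Naive Bayes approach, using character n-grams (which generally work better than whole words for language identification); the tiny six-sentence corpus is purely illustrative, and a real model needs far more text per language:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus: two sentences each in English, Italian and French.
texts = ["the cat is on the table", "where is the station",
         "il gatto è sul tavolo", "dove si trova la stazione",
         "le chat est sur la table", "où est la gare"]
labels = ["en", "en", "it", "it", "fr", "fr"]

# Character unigrams/bigrams capture language-specific letter patterns.
model = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(1, 2)),
    MultinomialNB())
model.fit(texts, labels)

pred = model.predict(["il tavolo è sul gatto"])[0]
```

The pipeline counts character n-grams and lets Multinomial Naive Bayes score how likely each language is to have produced them.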
I’m glad to introduce my new free online workshop about feature importance in machine learning. In this workshop, feature importance in supervised machine learning is presented both in theory and in practice, using the Python programming language and its powerful scikit-learn library.