Data scientists often need to assess which distribution their data follows. We have already seen the Shapiro-Wilk test for normality, but what about non-normal distributions? Another test can help us here: the Kolmogorov-Smirnov test.
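As a minimal sketch of how such a check might look in Python, using SciPy's `kstest` (the exponential sample and the fitted-parameter setup below are made-up assumptions for illustration, not data from the article):

```python
import numpy as np
from scipy import stats

# Simulated data: 500 draws from an exponential distribution.
rng = np.random.default_rng(42)
sample = rng.exponential(scale=2.0, size=500)

# Fit the candidate distribution's parameters, then run the KS test.
# Note: estimating parameters from the same sample makes the p-value
# slightly optimistic; this is just a quick sketch of the workflow.
loc, scale = stats.expon.fit(sample, floc=0)
statistic, p_value = stats.kstest(sample, "expon", args=(loc, scale))
```

A large p-value here means we cannot reject the hypothesis that the data comes from the candidate distribution.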
Data scientists usually have to check whether data is normally distributed. An example is the normality check on the residuals of a linear regression, which is required in order to use the F-test correctly. Let's see how we can check the normality of a dataset.
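A quick sketch of such a check with SciPy's `shapiro` function (the simulated residuals below are a stand-in assumption, not results from an actual regression):

```python
import numpy as np
from scipy import stats

# Simulated regression residuals: in a real workflow these would come
# from a fitted model, e.g. model.resid in statsmodels.
rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=200)

# Shapiro-Wilk test: a p-value above the chosen significance level
# (commonly 0.05) means we cannot reject normality.
stat, p = stats.shapiro(residuals)
```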
Data scientists and analysts often have to work with mean values and need to compare the mean value of a …
Statistics is a data scientist’s best friend. Hypothesis tests are a family of very useful tools for any kind of analysis. They can really help us assess the statistical significance of some phenomena. However, we have to master such tools properly in order to benefit from their advantages.
I have always looked for a simple expense manager app I could use with my wife to track our daily expenses: a voice app that parses my speech and adds a line to a spreadsheet or something similar. Since I could never find one, I've decided to build such an app on my own.
Recently, I stumbled upon a tool called AssemblyAI, which implements some APIs for speech recognition and analysis. I’ve decided to try it in order to create a small voice diary using Telegram, Python and Notion.
Principal Component Analysis is a very useful dimensionality reduction tool. It can really help you reduce the number of features of a model. Although it may seem like a powerful tool for a data scientist, there are some drawbacks that I think make it unsuitable for supervised machine learning projects.
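One classic drawback can be sketched in a few lines: PCA keeps the directions of highest variance, which are not necessarily the directions that predict the target. The toy dataset below is a made-up assumption built to exhibit this, not an example from the article:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Feature 0: high variance, but unrelated to the label.
# Feature 1: tiny variance, but it fully determines the label.
noise = rng.normal(0.0, 10.0, size=200)
signal = rng.normal(0.0, 0.1, size=200)
X = np.column_stack([noise, signal])
y = (signal > 0).astype(int)

# Keeping only the first principal component discards feature 1,
# because PCA never looks at y.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
```

Here the retained component is dominated by the high-variance, label-irrelevant feature, so most of the discriminative information is thrown away.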
Every time we train a model, we should check whether its performance beats some baseline, i.e. a trivial model that doesn't take the inputs into account. By comparing our model with a baseline, we can figure out whether it actually learns anything.
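As a minimal sketch of this comparison with scikit-learn (the synthetic dataset and the choice of logistic regression are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data, held out for an honest comparison.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Baseline: always predicts the most frequent class, ignoring the inputs.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

baseline_score = baseline.score(X_te, y_te)
model_score = model.score(X_te, y_te)
```

If `model_score` doesn't clearly beat `baseline_score`, the model isn't really learning from the inputs.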
Dealing with unbalanced datasets is always hard for a data scientist. Such datasets can create trouble for our machine learning models if we don't handle them properly, so measuring how unbalanced our dataset is matters before taking the proper precautions. In this article, I suggest some possible techniques.
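Two simple ways one might quantify imbalance are the ratio between the largest and smallest class counts, and the normalized Shannon entropy of the class distribution. A sketch (the helper names `imbalance_ratio` and `balance_entropy` are my own, made-up labels, not functions from any library):

```python
import numpy as np

def imbalance_ratio(labels):
    """Largest class count divided by smallest (1.0 = perfectly balanced)."""
    counts = np.bincount(labels)
    counts = counts[counts > 0]
    return counts.max() / counts.min()

def balance_entropy(labels):
    """Normalized Shannon entropy of class frequencies (1.0 = balanced)."""
    counts = np.bincount(labels)
    counts = counts[counts > 0]
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum() / np.log(len(p)))

# A 90/10 binary dataset: ratio 9.0, entropy well below 1.
y = np.array([0] * 90 + [1] * 10)
```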
Training a model is a complex process requiring much effort and analysis. Once a model is ready, we know that it won’t be valid forever and that we’ll need to train it again. How can we decide if a model needs to be retrained? There are some techniques that help us.
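One common technique is to watch for data drift: compare the distribution of incoming features against the training data, for example with a two-sample Kolmogorov-Smirnov test, and flag the model for retraining when they diverge. A sketch with simulated data (the mean shift below is a fabricated assumption used to stand in for real drift):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Feature values seen at training time vs. in production.
train_feature = rng.normal(0.0, 1.0, size=1000)
live_feature = rng.normal(0.5, 1.0, size=1000)  # simulated drift: shifted mean

# Two-sample KS test: a small p-value means the distributions differ,
# which is a signal that retraining may be needed.
stat, p = stats.ks_2samp(train_feature, live_feature)
drift_detected = p < 0.01
```

In practice you'd run such a check per feature on a schedule, alongside monitoring the model's live performance metrics.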