Exploratory Data Analysis in Python


Free online course

How to approach a dataset for the first time

Introduction

When we get our hands on a dataset for the first time, it is tempting to start testing models and algorithms right away. This is a mistake: if we don’t understand the data before feeding it to a model, the results are likely to be unreliable and the model may well fail. Moreover, if we don’t select informative features in advance, training can become slow and the model may learn little that is useful.

So the first thing to do is take a look at our dataset and visualize the information it contains. In other words, we have to explore it.

That’s the purpose of Exploratory Data Analysis (EDA).

EDA is an important step in data science and machine learning. It helps us explore the information hidden inside a dataset before applying any model or algorithm. It relies heavily on data visualization and does not depend on the assumptions of any particular model.

Moreover, it lets us figure out whether our features have predictive power, and so whether the machine learning project we are working on has a chance of success. Without EDA, we may feed the wrong data to a model and never get good results.

Advantages of Exploratory Data Analysis

  • Visualize information

    A good histogram or boxplot can often tell us more about the data than a model can.

  • Visualize collinearity

    Correlation between the features can be difficult to handle and must be identified in advance.

  • Visualize feature importance

    Correlation between the features and the target variable of a supervised machine learning problem can be made visible with the right plots.

  • Extract information from data

    Remember: a data scientist must extract information that is hidden behind data. Exploratory Data Analysis can help with this goal.

My online course

In my free online course, I cover the main topics of Exploratory Data Analysis. Every lesson includes practical examples in the Python programming language.

These are the topics of the course:

A first look at the dataset

The first steps to take when we open a dataset for the first time and want to see what it contains.
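As a taste of this lesson, here is a minimal sketch of a first look with pandas. The toy dataset, its column names, and the injected missing values are all made up for illustration; with real data you would load your own file instead.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, generated here only for illustration.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=100),
    "income": rng.normal(30000, 8000, size=100),
    "segment": rng.choice(["a", "b", "c"], size=100),
})
df.loc[::10, "income"] = np.nan  # inject a missing value every 10th row

print(df.head())    # the first rows, to see what the records look like
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # the type of each column

missing = df.isna().sum()  # number of missing values per column
print(missing)
```

These few calls already tell us how large the dataset is, which columns are numeric or categorical, and where values are missing.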

Summarization

Summarization computes high-level metrics that help a data scientist understand the overall statistics of the features.
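For example, a summary like this can be obtained with the pandas `describe` method. This is only a sketch on a tiny made-up table (the column names and values are assumptions), not the exact code used in the lessons.

```python
import pandas as pd

# Hypothetical measurements, used only to illustrate the idea.
df = pd.DataFrame({
    "height_cm": [160, 165, 170, 175, 180],
    "weight_kg": [55, 60, 68, 77, 90],
})

# count, mean, std, min, quartiles and max for every numeric column
summary = df.describe()
print(summary)

# Skewness hints at how asymmetric each distribution is
skewness = df.skew()
print(skewness)
```

Comparing the mean with the median (the `50%` row) is a quick way to spot asymmetric features before plotting anything.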

Histograms and kernel density estimation

With the use of histograms and KDE, we can easily take a look at the distribution of our features.
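The sketch below shows the idea on synthetic data. The sample is random and the KDE is written by hand with NumPy (Gaussian kernel, Scott's rule for the bandwidth) only to make the mechanics visible; in practice, ready-made implementations such as `scipy.stats.gaussian_kde` or seaborn's `kdeplot` are more convenient.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

# Synthetic feature, assumed here only for illustration.
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# Histogram normalized so that the total area of the bars is 1
density, edges = np.histogram(x, bins=20, density=True)

# Hand-written Gaussian KDE: average of one Gaussian bump per data point
h = x.std(ddof=1) * len(x) ** (-1 / 5)            # Scott's rule bandwidth
grid = np.linspace(x.min() - 3 * h, x.max() + 3 * h, 200)
z = (grid[:, None] - x[None, :]) / h
kde = np.exp(-0.5 * z**2).sum(axis=1) / (len(x) * h * np.sqrt(2 * np.pi))

# Overlay the two estimates of the distribution
fig, ax = plt.subplots()
ax.hist(x, bins=20, density=True, alpha=0.5, label="histogram")
ax.plot(grid, kde, label="KDE")
ax.legend()
# Use fig.savefig(...) or an interactive backend with plt.show() to view the plot.
```

The histogram depends on the choice of bins, while the KDE depends on the bandwidth `h`; looking at both together gives a more reliable picture of the shape.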

Boxplot

Boxplots are very useful to represent the shape of a distribution and check for the presence of outliers.
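To illustrate, this sketch computes the quantities a boxplot draws and flags outliers with the common 1.5 × IQR rule, which is also the default whisker rule in matplotlib. The values are made up for the example.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

# Hypothetical measurements with one suspicious value at the end.
values = np.array([12, 13, 13, 14, 15, 15, 16, 17, 18, 45])

# The box spans the first and third quartiles; the line inside is the median
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as individual outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers)

fig, ax = plt.subplots()
ax.boxplot(values)  # default whiskers use the same 1.5 * IQR rule
```

Here only the value 45 falls outside the fences, so it is the one point the plot would show as an outlier.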

Scatterplot

With a pair scatterplot made, for example, with the matplotlib or seaborn Python libraries, we can assess the correlation between the columns of a dataset, visualizing both collinearity and the importance of each feature with respect to a target variable.
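A pair scatterplot can be sketched as follows. This example uses `scatter_matrix` from pandas, which draws with matplotlib (seaborn's `pairplot` is a common alternative). The dataset is synthetic and its column names are assumptions: `x2` is built to be almost a copy of `x1`, and `target` depends on `x1` only.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
from pandas.plotting import scatter_matrix

# Synthetic data, built only to show what the plot reveals.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
df = pd.DataFrame({
    "x1": x1,
    "x2": x1 + rng.normal(scale=0.1, size=n),          # nearly collinear with x1
    "x3": rng.normal(size=n),                          # unrelated noise
    "target": 2 * x1 + rng.normal(scale=0.5, size=n),  # driven by x1
})

# One scatterplot per pair of columns, histograms on the diagonal
axes = scatter_matrix(df, figsize=(8, 8), diagonal="hist")
```

In the resulting grid, the `x1` versus `x2` panel shows a tight line (collinearity), the `x1` versus `target` panel shows a clear trend (a useful feature), and the panels involving `x3` show a shapeless cloud.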

Correlation matrix and heatmap

Correlation can be measured and visualized in several ways. A good choice is to use a seaborn heatmap, which gives us a good representation of the underlying phenomena.
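As a rough sketch, the correlation matrix can be computed with pandas and drawn as a colored grid. To keep the example dependent on matplotlib only, it uses `imshow`; the seaborn `heatmap` function mentioned above accepts the same matrix. The data and column names are synthetic assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt

# Synthetic data: b is strongly related to a, c is independent noise.
rng = np.random.default_rng(2)
n = 300
a = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.2, size=n),
    "c": rng.normal(size=n),
})

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))

# Draw the matrix as a heatmap with a fixed color scale from -1 to 1
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
```

Fixing the color scale to the range from -1 to 1 keeps the colors comparable across datasets. Note that Pearson correlation only captures linear relationships.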

Python libraries

Finally, the Sweetviz and Pandas Profiling libraries are introduced with some examples.

What's inside the course?

The course contains:

2 hours of video lessons

Python code for each example

Discussion area to interact with the teacher and the students

Certificate of completion at the end

The teacher

My name is Gianluca Malato. I'm Italian and have a Master's Degree cum laude in Theoretical Physics of disordered systems from "La Sapienza" University of Rome. I'm a Data Scientist who has been working for years in the banking and insurance sector. I have extensive experience in software programming and project management, and I have been dealing with data analysis and machine learning in the corporate environment for several years. I am also skilled in data analysis (e.g. relational databases and SQL language), numerical algorithms (e.g. ODE integration, optimization algorithms) and simulation (e.g. Monte Carlo techniques). I've written many articles about Machine Learning, R and Python and I've been a Top Writer on Medium.com in the Artificial Intelligence category.

Frequently Asked Questions

Does the course have a start and a finish date?

No. Once you enroll, you can follow the recorded video lessons when you want.

How can I pay for the course?

This course is free, so you don’t have to pay anything to join.

How can I follow the lessons?

Once you enroll, you can access the recorded video lessons whenever you want from your computer using this website. The videos are streamed, so you’ll need an Internet connection and to be logged in to this website in order to watch them. After you create your account and log in, you can use the My Courses link at the top of every page to see all the courses you have enrolled in.

What language will be used?

During this course, the spoken and written language is English.

Pricing

You can join the course for free.