Exploratory Data Analysis in Python
When we put our hands on a dataset for the first time, we can’t wait to test several models and algorithms. This is a mistake: if we don’t understand the data before feeding it to a model, the results are likely to be unreliable and the model may well fail. Moreover, if we don’t select the most relevant features in advance, training can become slow and the model may not learn anything useful.
So, the first thing to do is take a look at our dataset and visualize the information it contains. In other words, we have to explore it.
That’s the purpose of Exploratory Data Analysis (EDA).
EDA is an important step of data science and machine learning. It helps us explore the information hidden inside a dataset before applying any model or algorithm. It makes heavy use of data visualization and lets us look at the data before committing to any modeling assumptions.
Moreover, it lets us figure out whether our features have predictive power or not, which tells us whether the machine learning project we are working on has a chance of being successful. Without EDA, we risk feeding the wrong data to a model and getting nowhere.
Advantages of Exploratory Data Analysis
My online course
In my free online course, I cover the main topics of Exploratory Data Analysis. All the lessons include practical examples in the Python programming language.
These are the topics of the course:
A first look at the dataset
The very first steps to take when we work with a dataset for the first time and want to take a look at it.
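As a rough idea of what such a first look can involve, here is a minimal sketch using pandas. The toy DataFrame and its column names are made up for illustration; in practice the data would typically be loaded from a file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset, invented for this example
df = pd.DataFrame({
    "age": [25, 32, 47, 51, np.nan, 38],
    "income": [28000.0, 41000.0, 52000.0, np.nan, 36000.0, 45000.0],
    "segment": ["a", "b", "a", "c", "b", "a"],
})

print(df.head())    # first rows of the table
print(df.shape)     # (number of rows, number of columns)
print(df.dtypes)    # data type of each column

# Number of missing values in each column
missing = df.isna().sum()
print(missing)
```

These few calls are usually enough to spot wrong data types and missing values before going any further.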
Summarization computes high-level metrics that give a data scientist an overall statistical picture of the features.
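One possible way to compute such metrics is the `describe` method of a pandas DataFrame. The sketch below uses synthetic data with made-up column names, so the numbers themselves carry no meaning.

```python
import numpy as np
import pandas as pd

# Synthetic data, generated only for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "height": rng.normal(170, 10, size=200),
    "weight": rng.normal(70, 12, size=200),
})

# Count, mean, standard deviation, min, quartiles and max of each numeric column
summary = df.describe()
print(summary)
```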
Histograms and kernel density estimation
With the use of histograms and KDE, we can easily take a look at the distribution of our features.
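To make the idea concrete, here is a small sketch that computes a normalized histogram and a Gaussian kernel density estimate with NumPy only. The sample is synthetic, the helper `gaussian_kde_1d` is a hypothetical name written for this example, and the bandwidth follows Silverman's rule of thumb, which is just one common choice.

```python
import numpy as np

# Synthetic sample, generated only for illustration
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=500)

# Histogram normalized so that the total area of the bars is 1
density, edges = np.histogram(x, bins=20, density=True)

def gaussian_kde_1d(sample, grid):
    """Gaussian KDE evaluated on grid (hypothetical helper, not a library function)."""
    n = sample.size
    # Bandwidth from Silverman's rule of thumb
    iqr = np.percentile(sample, 75) - np.percentile(sample, 25)
    sigma = min(sample.std(ddof=1), iqr / 1.34)
    h = 0.9 * sigma * n ** (-1 / 5)
    # Average of Gaussian bumps centered on each observation
    z = (grid[:, None] - sample[None, :]) / h
    return np.exp(-0.5 * z ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

grid = np.linspace(x.min() - 1, x.max() + 1, 400)
kde = gaussian_kde_1d(x, grid)
```

For plotting, a call such as `seaborn.histplot(x, kde=True)` draws both the histogram and the KDE curve in one figure.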
Boxplots are very useful to represent the shape of a distribution and check for the presence of outliers.
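The numbers behind a boxplot can be computed directly. The following sketch, on a made-up array of values, calculates the quartiles and flags as outliers the points lying more than 1.5 times the interquartile range away from the box, which is the usual convention for the whiskers.

```python
import numpy as np

# Made-up values with one obviously extreme point
values = np.array([12, 13, 14, 15, 15, 16, 17, 18, 19, 45], dtype=float)

# Quartiles define the box; the median is the line inside it
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Conventional whisker limits at 1.5 * IQR from the box
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the limits are drawn as individual outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers)  # only 45 falls outside the limits here
```

A call such as `matplotlib.pyplot.boxplot(values)` draws the corresponding chart, using the same 1.5 factor by default.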
With a pair plot (a grid of pairwise scatterplots) built, for example, with the matplotlib or seaborn Python libraries, we can visually assess the correlation between the columns of a dataset, spotting collinearity and getting a sense of how each feature relates to a target variable.
Correlation matrix and heatmap
Correlation can be measured and visualized in several ways. A good choice is to use a seaborn heatmap, which shows at a glance which pairs of columns are strongly correlated.
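As a sketch of how the correlation matrix itself can be obtained, the example below uses pandas on synthetic data. The column names are invented, and Pearson correlation is only one of the available measures.

```python
import numpy as np
import pandas as pd

# Synthetic data: one column depends linearly on x, one is pure noise
rng = np.random.default_rng(1)
x = rng.normal(size=300)
df = pd.DataFrame({
    "x": x,
    "y_linear": 2 * x + rng.normal(scale=0.5, size=300),
    "noise": rng.normal(size=300),
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr(method="pearson")
print(corr.round(2))
```

The resulting matrix can then be passed to a plotting call such as `seaborn.heatmap(corr, annot=True, vmin=-1, vmax=1)` to get the heatmap.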
Finally, the Sweetviz and Pandas Profiling libraries are introduced with some examples.
What's inside the course?
The course contains:
My name is Gianluca Malato. I'm Italian and have a Master's Degree cum laude in Theoretical Physics of disordered systems from "La Sapienza" University of Rome. I'm a Data Scientist who has been working for years in the banking and insurance sector. I have extensive experience in software programming and project management, and I have been dealing with data analysis and machine learning in the corporate environment for several years. I am also skilled in data analysis (e.g. relational databases and SQL language), numerical algorithms (e.g. ODE integration, optimization algorithms) and simulation (e.g. Monte Carlo techniques). I've written many articles about Machine Learning, R and Python, and I've been a Top Writer on Medium.com in the Artificial Intelligence category.
Frequently Asked Questions
Does the course have a start and a finish date?
No. Once you enroll, you can follow the recorded video lessons when you want.
How can I pay for the course?
This course is free, so you don’t have to pay anything to join.
How can I follow the lessons?
Once you enroll, you can access the recorded video lessons of the course whenever you want from your computer using this website. The videos are streamed, so you’ll need an Internet connection and access to this website in order to watch them. After you create your account and log in, you can use the My Courses link at the top of every page to see all the courses you have enrolled in.
What language will be used?
During this course, the spoken and written language is English.
You can join the course for free.