JupyterCon 2023 - "Machine learning with dirty tables: encoding, joining and deduplicating"
Recording of my presentation:
Abstract:
Data scientists and analysts working with Jupyter are too often forced to deal with dirty data (with typos, abbreviations, duplicates, missing values…) that comes from various sources.
Let us step into the shoes of a data scientist and, with a Jupyter Notebook, try to perform a classification or regression task on data coming from a collection of raw tables.
In this tutorial, we will demonstrate how dirty_cat, an open-source Python package developed by our team, can help prepare tables for machine learning and improve prediction results in the presence of dirty data.
Common problems we will be tackling:
- joining groups of tables on inexact matches;
- de-duplicating values;
- encoding dirty categories with interpretable results.
And all of this on dirty categorical columns that will be transformed into numerical arrays ready for machine learning.
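To give a feel for the idea behind inexact matching and dirty-category encoding, here is a minimal sketch using only the Python standard library. It is not dirty_cat's API: the function names (`ngrams`, `ngram_similarity`, `fuzzy_match`, `similarity_encode`), the use of Jaccard similarity over character 3-grams, and the 0.5 threshold are all assumptions chosen for illustration.

```python
def ngrams(text, n=3):
    """Return the set of character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Jaccard similarity between the character n-gram sets of two strings."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty strings are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def fuzzy_match(value, candidates, threshold=0.5):
    """Return the most similar candidate, or None if none reaches the threshold.

    This is the building block of an inexact join or a de-duplication step:
    a dirty key is mapped to the closest clean key.
    """
    best = max(candidates, key=lambda c: ngram_similarity(value, c))
    return best if ngram_similarity(value, best) >= threshold else None


def similarity_encode(values, prototypes):
    """Encode each string as a row of similarities to reference categories.

    The result is a plain numerical array (list of lists of floats) that a
    machine learning model can consume, unlike the raw strings.
    """
    return [[ngram_similarity(v, p) for p in prototypes] for v in values]


# Hypothetical example data: a clean reference column and a dirty one.
reference = ["United States", "Germany", "France"]
dirty = ["Untied States", "germany ", "Zzz"]

matches = [fuzzy_match(value, reference) for value in dirty]
encoded = similarity_encode(dirty, reference)
```

In this sketch, a typo such as "Untied States" still shares most of its 3-grams with "United States", so it is matched to it, while a string with no overlap is left unmatched. The encoded rows keep that graded notion of similarity instead of treating every distinct spelling as a separate category.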
More details about the conference: