Sunday, June 8, 2025
News PouroverAI
Visit PourOver.AI
No Result
View All Result
  • Home
  • AI Tech
  • Business
  • Blockchain
  • Data Science & ML
  • Cloud & Programming
  • Automation
  • Front-Tech
  • Marketing
  • Home
  • AI Tech
  • Business
  • Blockchain
  • Data Science & ML
  • Cloud & Programming
  • Automation
  • Front-Tech
  • Marketing
News PouroverAI
No Result
View All Result

5 Simple Steps to Automate Data Cleaning with Python

May 3, 2024
in Data Science & ML
Reading Time: 4 mins read
0 0
A A
0
Share on FacebookShare on Twitter


Image by Author

It is a widely spread fact among Data Scientists that data cleaning makes up a big proportion of our working time. However, it is one of the least exciting parts as well. So this leads to a very natural question:

Is there a way to automate this process?

Automating any process is always easier said than done since the steps to perform depend mostly on the specific project and goal. But there are always ways to automate, at least, some of the parts.

This article aims to generate a pipeline with some steps to make sure our data is clean and ready to be used.

Data Cleaning Process

Before proceeding to generate the pipeline, we need to understand what parts of the processes can be automated.

Since we want to build a process that can be used for almost any data science project, we need to first determine what steps are performed over and over again.

So when working with a new data set, we usually ask the following questions:

What format does the data come in?
Does the data contain duplicates?
Does the data contain missing values?
What data types does the data contain?
Does the data contain outliers?

These 5 questions can easily be converted into 5 blocks of code to deal with each of the questions:

1.Data Format

Data can come in different formats, such as JSON, CSV, or even XML. Every format requires its own data parser. For instance, pandas provide read_csv for CSV files, and read_json for JSON files.

By identifying the format, you can choose the right tool to begin the cleaning process.

We can easily identify the format of the file we are dealing with using the path.plaintext function from the os library. Therefore, we can create a function that first determines what extension we have, and then applies directly to the corresponding parser.

2. Duplicates

It happens quite often that some rows of the data contain the same exact values as other rows, what we know as duplicates. Duplicated data can skew results and lead to inaccurate analyses, which is not good at all.

This is why we always need to make sure there are no duplicates.

Pandas got us covered with the drop_duplicated() method, which erases all duplicated rows of a dataframe.

We can create a straightforward function that utilizes this method to remove all duplicates. If necessary, we add a columns input variable that adapts the function to eliminate duplicates based on a specific list of column names.

3. Missing Values

Missing data is a common issue when working with data as well. Depending on the nature of your data, we can simply delete the observations containing missing values, or we can fill these gaps using methods like forward fill, backward fill, or substituting with the mean or median of the column.

Pandas offers us the .fillna() and .dropna() methods to handle these missing values effectively.

The choice of how we handle missing values depends on:

The type of values that are missing
The proportion of missing values relative to the number of total records we have.

Dealing with missing values is a quite complex task to perform – and usually one of the most important ones! – you can learn more about it in the following article.

For our pipeline, we will first check the total number of rows that present null values. If only 5% of them or less are affected, we will erase these records. In case more rows present missing values, we will check column by column and will proceed with either:

Imputing the median of the value.
Generate a warning to further investigate.

In this case, we are assessing the missing values with a hybrid human validation process. As you already know, assessing missing values is a crucial task that can not be overlooked.

When working with regular data types we can proceed to transform the columns directly with the pandas .astype() function, so you could actually modify the code to generate regular conversations.

Otherwise, it is usually too risky to assume that a transformation will be performed smoothly when working with new data.

5. Dealing with Outliers

Outliers can significantly affect the results of your data analysis. Techniques to handle outliers include setting thresholds, capping values, or using statistical methods like Z-score.

In order to determine if we have outliers in our dataset, we use a common rule and consider any record outside of the following range as an outlier. [Q1 — 1.5 * IQR , Q3 + 1.5 * IQR]

Where IQR stands for the interquartile range and Q1 and Q3 are the 1st and the 3rd quartiles. Below you can observe all the previous concepts displayed in a boxplot.

XXXImage by Author

To detect the presence of outliers, we can easily define a function that checks what columns present values that are out of the previous range and generate a warning.

Final Thoughts

Data Cleaning is a crucial part of any data project, however, it is usually the most boring and time-wasting phase as well. This is why this article effectively distills a comprehensive approach into a practical 5-step pipeline for automating data cleaning using Python and.

The pipeline is not just about implementing code. It integrates thoughtful decision-making criteria that guide the user through handling different data scenarios.

This blend of automation with human oversight ensures both efficiency and accuracy, making it a robust solution for data scientists aiming to optimize their workflow.

You can go check my whole code in the following GitHub repo.

Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and is currently working in the data science field applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the application of the ongoing explosion in the field.



Source link

Tags: AutomateCleaningdataPythonSimplesteps
Previous Post

Google DeepMind at ICLR 2024

Next Post

ZKasino Scam Suspect Arrested, $12.2M in Assets Confiscated

Related Posts

AI Compared: Which Assistant Is the Best?
Data Science & ML

AI Compared: Which Assistant Is the Best?

June 10, 2024
5 Machine Learning Models Explained in 5 Minutes
Data Science & ML

5 Machine Learning Models Explained in 5 Minutes

June 7, 2024
Cohere Picks Enterprise AI Needs Over ‘Abstract Concepts Like AGI’
Data Science & ML

Cohere Picks Enterprise AI Needs Over ‘Abstract Concepts Like AGI’

June 7, 2024
How to Learn Data Analytics – Dataquest
Data Science & ML

How to Learn Data Analytics – Dataquest

June 6, 2024
Adobe Terms Of Service Update Privacy Concerns
Data Science & ML

Adobe Terms Of Service Update Privacy Concerns

June 6, 2024
Build RAG applications using Jina Embeddings v2 on Amazon SageMaker JumpStart
Data Science & ML

Build RAG applications using Jina Embeddings v2 on Amazon SageMaker JumpStart

June 6, 2024
Next Post
ZKasino Scam Suspect Arrested, $12.2M in Assets Confiscated

ZKasino Scam Suspect Arrested, $12.2M in Assets Confiscated

Google DeepMind Introduces Med-Gemini: A Groundbreaking Family of AI Models Revolutionizing Medical Diagnosis and Clinical Reasoning

Google DeepMind Introduces Med-Gemini: A Groundbreaking Family of AI Models Revolutionizing Medical Diagnosis and Clinical Reasoning

Stretchable e-skin could give robots human-level touch sensitivity

Stretchable e-skin could give robots human-level touch sensitivity

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

  • Trending
  • Comments
  • Latest
23 Plagiarism Facts and Statistics to Analyze Latest Trends

23 Plagiarism Facts and Statistics to Analyze Latest Trends

June 4, 2024
Accenture creates a regulatory document authoring solution using AWS generative AI services

Accenture creates a regulatory document authoring solution using AWS generative AI services

February 6, 2024
Managing PDFs in Node.js with pdf-lib

Managing PDFs in Node.js with pdf-lib

November 16, 2023
Graph neural networks in TensorFlow – Google Research Blog

Graph neural networks in TensorFlow – Google Research Blog

February 6, 2024
13 Best Books, Courses and Communities for Learning React — SitePoint

13 Best Books, Courses and Communities for Learning React — SitePoint

February 4, 2024
From Low-Level to High-Level Tasks: Scaling Fine-Tuning with the ANDROIDCONTROL Dataset

From Low-Level to High-Level Tasks: Scaling Fine-Tuning with the ANDROIDCONTROL Dataset

June 10, 2024
Can You Guess What Percentage Of Their Wealth The Rich Keep In Cash?

Can You Guess What Percentage Of Their Wealth The Rich Keep In Cash?

June 10, 2024
AI Compared: Which Assistant Is the Best?

AI Compared: Which Assistant Is the Best?

June 10, 2024
How insurance companies can use synthetic data to fight bias

How insurance companies can use synthetic data to fight bias

June 10, 2024
5 SLA metrics you should be monitoring

5 SLA metrics you should be monitoring

June 10, 2024
From Low-Level to High-Level Tasks: Scaling Fine-Tuning with the ANDROIDCONTROL Dataset

From Low-Level to High-Level Tasks: Scaling Fine-Tuning with the ANDROIDCONTROL Dataset

June 10, 2024
UGRO Capital: Targeting to hit milestone of Rs 20,000 cr loan book in 8-10 quarters: Shachindra Nath

UGRO Capital: Targeting to hit milestone of Rs 20,000 cr loan book in 8-10 quarters: Shachindra Nath

June 10, 2024
Facebook Twitter LinkedIn Pinterest RSS
News PouroverAI

The latest news and updates about the AI Technology and Latest Tech Updates around the world... PouroverAI keeps you in the loop.

CATEGORIES

  • AI Technology
  • Automation
  • Blockchain
  • Business
  • Cloud & Programming
  • Data Science & ML
  • Digital Marketing
  • Front-Tech
  • Uncategorized

SITEMAP

  • Disclaimer
  • Privacy Policy
  • DMCA
  • Cookie Privacy Policy
  • Terms and Conditions
  • Contact us

Copyright © 2023 PouroverAI News.
PouroverAI News

No Result
View All Result
  • Home
  • AI Tech
  • Business
  • Blockchain
  • Data Science & ML
  • Cloud & Programming
  • Automation
  • Front-Tech
  • Marketing

Copyright © 2023 PouroverAI News.
PouroverAI News

Welcome Back!

Login to your account below

Forgotten Password? Sign Up

Create New Account!

Fill the forms bellow to register

All fields are required. Log In

Retrieve your password

Please enter your username or email address to reset your password.

Log In