Saturday, May 17, 2025
News PouroverAI

Consider these questions when managing data during AI development

March 19, 2024
in AI Technology
Reading Time: 4 mins read


Trustworthy AI is dependent on a solid foundation of data.

If you bake a cake with missing, expired or otherwise low-quality ingredients, the result will be a subpar dessert. The same holds for developing AI systems that handle large amounts of data.

Data is at the heart of every AI model. Using biased, sensitive or incorrect data in an AI system will produce results reflecting those issues. If your inputs are low quality, your results will be similar. These flaws can easily doom an AI project.

Responsible and trustworthy AI systems do not happen by accident. They are a product of thoughtful design and consideration. This post on managing data is the second in a series of blog posts detailing questions to ask at each of the five pivotal steps of the AI life cycle. These steps (questioning, managing data, developing the models, deploying insights and decisioning) represent the stages where thoughtful consideration paves the way for an AI ecosystem that aligns with ethical and societal expectations.

To ensure we are using the right data, we need to ask questions about the data used in an AI system. Is this the right data to be using? Are we using data that includes protected classes (e.g., race, gender) we are legally prohibited from using? Do we need to perform transformations or imputations on the data? These questions, along with the ones below, must be asked:

Does your data contain any sensitive or privileged information?

Not all data is created equal and not all data needs to be protected equally. Data classifications range from publicly available (freely available for anyone) to restricted or sensitive data where data stewards must protect it from improper use or dissemination. Some examples of this would be any data that specifies health conditions, personally identifiable information (PII), race, religion, government IDs, etc.

Just because your organization collects some of this information during its normal course of business does not mean you can freely use it in your AI system. Just because we can do something does not mean we should do it.

There may even be legal prohibitions against using some data. Determine if there is a valid reason to include all available sensitive information and consider minimizing the amount needed for your model. Also, the data should be anonymized by stripping out or masking all PII.
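As a rough illustration of stripping out or masking PII, the sketch below drops PII columns the model does not need and replaces a join key with a salted hash. The column names, the salt handling and the helper names (`pseudonymize`, `mask_record`) are assumptions made for this example, not part of any specific product. Note that salted hashing is pseudonymization rather than full anonymization, so it may not be sufficient on its own for regulatory purposes.

```python
import hashlib


def pseudonymize(value: str, salt: str) -> str:
    """Replace a value with a truncated, salted SHA-256 digest."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_record(record: dict, pii_columns: set, key_columns: set, salt: str) -> dict:
    """Return a copy of the record that is safer to pass to model training.

    - key_columns are kept, but pseudonymized, so rows can still be joined
    - pii_columns are dropped entirely
    - every other column is passed through unchanged
    """
    masked = {}
    for column, value in record.items():
        if column in key_columns:
            masked[column] = pseudonymize(str(value), salt)
        elif column in pii_columns:
            continue  # drop PII the model does not need
        else:
            masked[column] = value
    return masked


# Hypothetical record and column classification, for illustration only.
record = {"customer_id": "C-1001", "name": "Jane Doe",
          "email": "jane@example.com", "balance": 2500}
masked = mask_record(record, pii_columns={"name", "email"},
                     key_columns={"customer_id"}, salt="replace-with-a-secret")
print(masked)
```

In practice, the list of PII columns would come from a data catalog or a data steward rather than being hard-coded, and the salt would be stored as a secret.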

Fig 1: Learn how SAS Information Catalog indicates whether the column contains potentially private information that could be linked to an individual.

Have you checked for potential sources of bias?

AI systems can certainly make our lives more productive and convenient. However, this power and speed also mean that a biased model can cause harm at scale, continuing to disadvantage certain groups or individuals.

Over the years, there have been documented cases of biased models related to facial recognition, health care, policing, etc. On a lighter note, there was even a case where an AI-powered camera repeatedly tracked the referee’s bald head instead of the soccer ball.

Checking for bias should always be part of your early development process because many types of bias can creep into your AI system:

  • Measurement bias: Variables are inaccurately classified or measured.
  • Pre-processing bias: An operation such as missing value treatment, data cleansing, outlier treatment, encoding, scaling or data transformations for unstructured data causes or contributes to systematic disadvantage.
  • Exclusion bias: Certain groups are systematically excluded.
  • Availability bias: Overreliance on information that is easily accessed.

While this is not an exhaustive list, asking these questions can get you on the right path of checking for and eliminating bias.

Is your data hiding bias in proxy variables?

You may not realize that innocent-looking variables lurking in your data are proxies for sensitive variables. Consider an example of a lending organization making credit decisions. These organizations cannot consider certain sensitive variables (e.g., race, gender, religion) when making credit decisions.

However, certain other variables, like zip code, may seem benign but can inadvertently correlate with one or more sensitive variables, acting as a stand-in or proxy variable. Aggregating values, such as aggregating zip codes into larger geographic areas, may be necessary to avoid using proxy variables.
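One way to screen for proxy variables is to measure how strongly a candidate feature is associated with a sensitive attribute. The sketch below uses Cramér's V, a standard association measure for two categorical variables that ranges from 0 (no association) to 1 (perfect association). The toy data and any cut-off you might apply to the score are assumptions for illustration; a high score is a prompt for review, not proof of discrimination.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V between two equally long sequences of categorical values."""
    n = len(xs)
    x_counts = Counter(xs)
    y_counts = Counter(ys)
    k = min(len(x_counts), len(y_counts)) - 1
    if n == 0 or k <= 0:
        return 0.0  # no data, or one variable is constant
    pair_counts = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, x_n in x_counts.items():
        for y, y_n in y_counts.items():
            expected = x_n * y_n / n
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Toy example: zip code perfectly predicts the (hypothetical) group label.
zip_codes = ["10001", "10001", "20002", "20002"]
groups = ["g1", "g1", "g2", "g2"]
print(cramers_v(zip_codes, groups))
```

If the score drops after zip codes are aggregated into larger regions, that suggests the aggregation reduced how much the feature reveals about the sensitive attribute.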

Have you documented how your data moved and transformed from the source?

AI systems require clean, properly formatted data. Getting the right data in the correct format involves data preparation, which may require one or more of the following pre-processing steps:

  • Normalization: The process of transforming features in a data set to a common scale.
  • Dealing with outliers: Outliers are data points that fall outside the expected data range. They can be pre-processed by transforming or removing them.
  • Imputation: The technique of filling in missing data points within a data set.
  • Aggregation: The process of gathering and expressing raw data in a summary form for statistical analysis.
  • Data augmentation: A process of artificially increasing the amount of data by generating new data points from existing data.
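Several of the steps above can be sketched with the Python standard library alone. The function names, the choice of median imputation, min-max scaling and fixed clipping bounds are assumptions picked for brevity, and data augmentation is left out because it depends heavily on the data type.

```python
from statistics import mean, median


def impute_median(values):
    """Imputation: fill missing entries (None) with the median of the rest."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, lower, upper):
    """Outlier treatment: pull values outside [lower, upper] back to the bounds."""
    return [min(max(v, lower), upper) for v in values]


def min_max_normalize(values):
    """Normalization: rescale values to the common range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def aggregate_mean(rows, key, field):
    """Aggregation: summarize a numeric field as its mean per group."""
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row[field])
    return {group: mean(vals) for group, vals in groups.items()}


incomes = [30, None, 50, 900]                 # toy values, in thousands
incomes = impute_median(incomes)              # fill the gap
incomes = clip_outliers(incomes, 0, 200)      # cap the extreme value
print(min_max_normalize(incomes))
```

Each of these choices can itself introduce pre-processing bias, for example if values are missing more often for one group than another, so the parameters used are worth recording alongside the data.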

Don’t stop after you create your data set. Document the process. Documenting how your data transformed from source to AI system input is crucial for the transparency of the process. Understanding and documenting the original data sources and how the data transformed enables others to understand and even recreate the process. You should also document assumptions, rationale, constraints, and any legal or regulatory approval you received for using the data.
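A lightweight way to keep this documentation close to the data is to record each transformation as it happens. The sketch below is one possible shape for such a record; the `LineageLog` class, its field names and the example source name are hypothetical and only meant to show the idea.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class LineageLog:
    """Append-only record of how a data set was derived from its source."""
    source: str
    steps: list = field(default_factory=list)

    def record(self, operation, rationale, **params):
        """Log one transformation, why it was applied, and its parameters."""
        self.steps.append({
            "operation": operation,
            "rationale": rationale,
            "params": params,
            "recorded_at": datetime.now(timezone.utc).isoformat(),
        })

    def to_json(self):
        """Serialize the log so it can be stored next to the data set."""
        return json.dumps(asdict(self), indent=2)


log = LineageLog(source="crm_export")  # hypothetical source name
log.record("impute_median", "income is missing for some rows", column="income")
log.record("aggregate", "reduce proxy risk from zip code",
           column="zip_code", level="region")
print(log.to_json())
```

Assumptions, constraints and any legal or regulatory approvals could be captured the same way, as additional fields or as entries in the log.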

Have you checked if your data represents the population the system is being designed for?

Training AI systems on representative data is crucial for making them fair and effective. Just like you need a solid foundation to construct your house, representative data forms the bedrock for an AI system.

When we use training data that accurately reflects the characteristics of the population the AI system will be deployed on, it helps reduce bias, improve generalizability and foster fairness. It is also worth checking whether data quality issues are the source of any underrepresentation.
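A simple first check is to compare each group's share of the training data with its share of the target population. The sketch below flags groups whose share differs by more than a tolerance. The function name, the 5-percentage-point default and the toy numbers are assumptions for illustration, and the reference shares would have to come from a trusted source such as census or customer-base statistics.

```python
from collections import Counter


def representation_gaps(sample_groups, population_shares, tolerance=0.05):
    """Return {group: observed_share - expected_share} for groups whose
    share in the sample differs from the population by more than tolerance."""
    counts = Counter(sample_groups)
    total = len(sample_groups)
    gaps = {}
    for group, expected in population_shares.items():
        observed = counts.get(group, 0) / total if total else 0.0
        if abs(observed - expected) > tolerance:
            gaps[group] = round(observed - expected, 4)
    return gaps


# Toy example: group "b" makes up half the population but 20% of the sample.
sample = ["a"] * 80 + ["b"] * 20
print(representation_gaps(sample, {"a": 0.5, "b": 0.5}))
```

A negative gap means the group is underrepresented in the sample. Before collecting more data, it is worth checking whether records for that group were lost to missing values or filtering earlier in the pipeline.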

Building a solid data foundation to pave the way for trustworthy AI

As we go through the data management phase of the AI life cycle, care must be taken to use the right data at the right time and in the right way. We must be vigilant to provide transparency, root out bias and protect the privacy of individuals, especially those of vulnerable populations.

With the intense scrutiny given to AI systems and their results, a clear plan for managing all aspects of your data is essential.

Want more? Read our comprehensive approach to trustworthy AI governance

Vrushali Sawant contributed to this article


