Saturday, May 17, 2025
News PouroverAI

The importance of data ingestion and integration for enterprise AI

January 9, 2024
in Blockchain


The emergence of generative AI prompted several prominent companies to restrict its use because of the mishandling of sensitive internal data. According to CNN, some companies imposed internal bans on generative AI tools while they sought to better understand the technology, and many also blocked internal use of ChatGPT.

Companies still often accept the risk of using internal data when exploring large language models (LLMs) because this contextual data is what turns a general-purpose LLM into a domain-specific one. In both the generative AI and traditional AI development cycles, data ingestion serves as the entry point. Here, raw data tailored to a company’s requirements is gathered, preprocessed, masked and transformed into a format suitable for LLMs or other models. No standardized process currently exists for overcoming the challenges of data ingestion, yet a model’s accuracy depends on it.
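As a minimal sketch of the preprocessing and masking step described above, the following Python normalizes whitespace and replaces two kinds of sensitive values with placeholder tokens. The patterns, the placeholder tokens and the function names `mask_sensitive` and `preprocess` are assumptions made for illustration; they do not come from the article or from any particular product, and a production pipeline would need far broader, audited rules.

```python
import re

# Hypothetical patterns for illustration only; real masking rules must be
# broader and reviewed against the organization's own data.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def mask_sensitive(text: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def preprocess(record: str) -> str:
    """Collapse runs of whitespace, then mask sensitive values."""
    return mask_sensitive(" ".join(record.split()))


if __name__ == "__main__":
    raw = "Contact  Jane at jane.doe@example.com or 555-123-4567\nabout the renewal."
    print(preprocess(raw))
    # prints: Contact Jane at [EMAIL] or [PHONE] about the renewal.
```

Masking before the data reaches any model or external service keeps the decision about what counts as sensitive inside the ingestion step, where it can be reviewed.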

4 risks of poorly ingested data

Misinformation generation: When an LLM is trained on contaminated data (data that contains errors or inaccuracies), it can generate incorrect answers, leading to flawed decision-making and potential cascading issues.

Increased variance: Variance measures consistency. Insufficient data can lead to answers that vary over time or to misleading outliers, and the effect is strongest in smaller data sets. High variance may indicate that a model works with its training data but is inadequate for real-world industry use cases.

Limited data scope and non-representative answers: When data sources are restrictive, homogeneous or contain mistaken duplicates, statistical errors like sampling bias can skew all results. This may cause the model to exclude entire areas, departments, demographics, industries or sources from the conversation.

Challenges in rectifying biased data: If the data is biased from the beginning, “the only way to retroactively remove a portion of that data is by retraining the algorithm from scratch.” It is difficult for LLMs to unlearn answers derived from unrepresentative or contaminated data once that data has been vectorized. These models tend to reinforce their understanding based on previously assimilated answers.
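Two of the risks above, mistaken duplicates and non-representative sources, can be screened for before any training starts. The following is a rough sketch under stated assumptions: the helper names `duplicate_ratio` and `underrepresented`, the sample records and the 20% threshold are all hypothetical, and exact-match comparison after normalization is only the simplest possible duplicate test.

```python
from collections import Counter


def duplicate_ratio(records: list[str]) -> float:
    """Fraction of records that repeat another record after normalizing case and whitespace."""
    if not records:
        return 0.0
    normalized = [" ".join(r.lower().split()) for r in records]
    return 1 - len(set(normalized)) / len(normalized)


def underrepresented(labels: list[str], min_share: float) -> list[str]:
    """Return the groups whose share of the data falls below min_share, sorted by name."""
    counts = Counter(labels)
    total = len(labels)
    return sorted(group for group, n in counts.items() if n / total < min_share)


if __name__ == "__main__":
    docs = ["Refund policy v2", "refund  policy v2", "Onboarding guide", "Pricing FAQ"]
    departments = ["sales"] * 9 + ["legal"]  # hypothetical source labels
    print(duplicate_ratio(docs))                          # prints: 0.25
    print(underrepresented(departments, min_share=0.2))   # prints: ['legal']
```

A check like this does not remove bias on its own, but it flags skewed or repeated sources while they are still cheap to fix, before the data has been used for training.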

Data ingestion must be done properly from the start, as mishandling it can lead to a host of new issues. Laying the groundwork of training data for an AI model is comparable to piloting an airplane. If the takeoff angle is a single degree off, you might land on an entirely different continent than expected.

The entire generative AI pipeline hinges on the data pipelines that empower it, making it imperative to take the correct precautions.

4 key components to ensure reliable data ingestion

Data quality and governance: Data quality means ensuring the security of data sources, maintaining holistic data and providing clear metadata. This may also entail working with new data through methods like web scraping or uploading. Data governance is an ongoing process in the data lifecycle to help ensure compliance with laws and company best practices.

Data integration: Data integration tools enable companies to combine disparate data sources into one secure location. A popular method is extract, load, transform (ELT). In an ELT system, data sets are extracted from siloed sources, loaded into a target data pool and then transformed there. ELT tools such as IBM® DataStage® facilitate fast and secure transformations through parallel processing engines. In 2023, the average enterprise received hundreds of disparate data streams, making efficient and accurate data transformations crucial for traditional and new AI model development.

Data cleaning and preprocessing: This includes formatting data to meet the requirements of specific LLM training setups, orchestration tools or data types. Text data can be chunked or tokenized, while image data can be stored as embeddings. Comprehensive transformations can be carried out using data integration tools. There may also be a need to manipulate raw data directly by deleting duplicates or changing data types.

Data storage: After data is cleaned and processed, the challenge of data storage arises. Most data is hosted either in the cloud or on-premises, requiring companies to make decisions about where to store their data. It’s important to exercise caution when using external LLMs to handle sensitive information such as personal data, internal documents or customer data. However, LLMs play a critical role in fine-tuning or in implementing a retrieval-augmented generation (RAG)-based approach. To mitigate risks, it’s important to run as many data integration processes as possible on internal servers. One potential solution is to use a remote runtime option.
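The load-then-transform order of ELT, together with the duplicate removal and type changes mentioned under data cleaning, can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module and an in-memory database purely as a stand-in for an internal target store; it is not how DataStage works, and the table names, column names and the `run_elt` function are assumptions made for this example.

```python
import sqlite3


def run_elt(rows: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Load raw rows unchanged, then transform them with SQL inside the target store."""
    conn = sqlite3.connect(":memory:")  # stand-in for an internal data store
    try:
        # Load: raw values land as text, exactly as extracted.
        conn.execute("CREATE TABLE raw_orders (customer TEXT, amount TEXT)")
        conn.executemany("INSERT INTO raw_orders VALUES (?, ?)", rows)

        # Transform: trim and lowercase names, cast amounts to numbers,
        # skip rows with an empty amount, and drop duplicates.
        conn.execute(
            """
            CREATE TABLE clean_orders AS
            SELECT DISTINCT LOWER(TRIM(customer)) AS customer,
                            CAST(amount AS REAL) AS amount
            FROM raw_orders
            WHERE TRIM(amount) <> ''
            """
        )
        return conn.execute(
            "SELECT customer, amount FROM clean_orders ORDER BY customer, amount"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    extracted = [("Acme ", "10.50"), ("acme", "10.5"), ("Globex", "7"), ("Initech", "")]
    print(run_elt(extracted))
    # prints: [('acme', 10.5), ('globex', 7.0)]
```

Because the raw table is kept alongside the cleaned one, the transformation can be rerun or audited later without re-extracting from the source, and the whole process stays on infrastructure the company controls.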

Start your data ingestion with IBM

IBM DataStage streamlines data integration by combining various tools, allowing you to pull, organize, transform and store the data needed to train AI models in a hybrid cloud environment. Data practitioners of all skill levels can work with the tool through no-code GUIs or through APIs with guided custom code.

The new DataStage as a Service Anywhere remote runtime option provides flexibility to run your data transformations. It empowers you to use the parallel engine from anywhere, giving you unprecedented control over its location. DataStage as a Service Anywhere manifests as a lightweight container, allowing you to run all data transformation capabilities in any environment. This allows you to avoid many of the pitfalls of poor data ingestion as you run data integration, cleaning and preprocessing within your virtual private cloud. With DataStage, you maintain complete control over security, data quality and efficacy, addressing all your data needs for generative AI initiatives.

While there are virtually no limits to what can be achieved with generative AI, there are limits on the data a model uses, and that data may well make all the difference.

Book a meeting to learn more

Try DataStage with the data integration trial

Product Manager, Innovations Lead




Copyright © 2023 PouroverAI News.