Saturday, May 17, 2025
News PouroverAI

Researchers from the University of Toronto Unveil a Surprising Redundancy in Large Materials Datasets and the Power of Informative Data for Enhanced Machine Learning Performance

November 20, 2023
in Data Science & ML
Reading Time: 4 mins read


AI is now being applied in almost every sphere of our lives. But AI models must be trained, and their effectiveness relies heavily on the availability of training data.

Conventionally, accurate AI models have been linked to the availability of substantial amounts of training data. In materials science, meeting that need means covering an extensive search space of potential materials. For example, the Open Catalyst Project uses more than 200 million data points related to potential catalyst materials.

The computational resources required for analysis and model development on such datasets are a major obstacle. The Open Catalyst datasets used 16,000 GPU days for analyzing and developing models. Such training budgets are available to only some researchers, so model development is frequently restricted to smaller datasets or a fraction of the available data.

A study by University of Toronto Engineering researchers, published in Nature Communications, suggests that the belief that deep learning models require a lot of training data may not always be true.

The researchers said that we need a way to identify smaller datasets on which models can be trained. Dr. Kangming Li, a postdoctoral scholar in the Hattrick-Simpers lab, gave the example of a model that forecasts students’ final scores: it performs best on the kind of data it was trained on, such as Canadian students, but it might not be able to predict grades for students from other countries.
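The gap between in-distribution and out-of-distribution accuracy can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from the study: the "true" relationship y = x², the straight-line model, and the two input ranges standing in for familiar and unfamiliar populations.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a (slope, intercept) model."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def true_property(x):
    # Hypothetical ground truth, chosen only so that a line fits it
    # reasonably well locally but poorly far from the training range.
    return x * x


# Training range [0, 1]: the population the model has seen.
train_x = [i / 100 for i in range(101)]
train_y = [true_property(x) for x in train_x]
model = fit_line(train_x, train_y)

# In-distribution test: same range as training.
in_mae = mean_abs_error(model, train_x, train_y)

# Out-of-distribution test: range [2, 3], never seen during training.
ood_x = [2 + i / 100 for i in range(101)]
ood_y = [true_property(x) for x in ood_x]
ood_mae = mean_abs_error(model, ood_x, ood_y)

print(f"in-distribution MAE: {in_mae:.3f}")
print(f"out-of-distribution MAE: {ood_mae:.3f}")
```

The model looks accurate on inputs that resemble its training data and degrades sharply outside that range, which is why the study's accuracy claims are stated for in-distribution predictions.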

One possible solution is to find subsets of data within very large datasets. These subsets should retain the diversity and information of the original dataset but be easier to handle during processing.

Li developed methods for locating high-quality subsets within materials datasets that have already been made public, such as JARVIS, the Materials Project, and the Open Quantum Materials Database (OQMD). The goal was to gain more insight into how dataset properties affect the models they train.

To test this, he trained models on the original dataset and on a much smaller subset with 95% fewer data points. The model trained on 5% of the data performed comparably to the model trained on the entire dataset when predicting the properties of materials within the dataset’s domain. This suggests that machine learning training can safely exclude up to 95% of the data with little to no effect on the accuracy of in-distribution predictions. The redundant data mostly corresponds to overrepresented types of materials.
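The comparison described above can be sketched on synthetic data. This is a minimal illustration under stated assumptions, not the study's method: the data is a hypothetical toy relationship (y = 2x + 1 plus noise), the model is a straight-line fit, and the 5% subset is drawn at random rather than chosen by the subset-selection methods the researchers developed.

```python
import random


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute prediction error of a (slope, intercept) model."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def make_data(rng, n):
    # Hypothetical toy "property": y = 2x + 1 plus small Gaussian noise.
    xs = [rng.random() for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(rng, 2000)
test_x, test_y = make_data(rng, 500)  # same distribution as training data

# Keep a random 5% of the training points, i.e. drop 95% of them.
keep = rng.sample(range(len(train_x)), k=len(train_x) // 20)
small_x = [train_x[i] for i in keep]
small_y = [train_y[i] for i in keep]

mae_full = mean_abs_error(fit_line(train_x, train_y), test_x, test_y)
mae_small = mean_abs_error(fit_line(small_x, small_y), test_x, test_y)
print(f"full data MAE: {mae_full:.4f}   5% subset MAE: {mae_small:.4f}")
```

In this toy setting most of the 2,000 points repeat information the model already has, so the two errors end up close. Real materials datasets and deep learning models are far more complex, and how much can be dropped depends on the dataset.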

According to Li, the study’s conclusions provide a way to gauge how redundant a dataset is: if adding more data doesn’t improve model performance, the added data is redundant and gives the models no new information to learn.
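One way to turn this idea into a number is to read a learning curve: find the smallest fraction of the data whose test error is already close to the full-data error, and treat the rest as redundant. The sketch below is an assumption about how such a gauge might be written, not the study's definition. The function name `redundant_fraction`, the 10% tolerance, and the error values are all hypothetical.

```python
def redundant_fraction(curve, tolerance=0.10):
    """Estimate how much of a dataset is redundant from a learning curve.

    curve: list of (fraction_of_data_used, test_error) pairs; must include
           the full dataset as fraction 1.0.
    tolerance: relative error increase (vs. the full-data error) that is
               still considered "comparable" performance.
    Returns (smallest_sufficient_fraction, redundant_share).
    """
    errors = dict(curve)
    full_error = errors[1.0]
    threshold = full_error * (1.0 + tolerance)
    # Smallest fraction whose error stays within the tolerance band.
    sufficient = min(f for f, err in curve if err <= threshold)
    return sufficient, 1.0 - sufficient


# Hypothetical learning curve: illustrative numbers only, not study results.
example_curve = [
    (0.01, 0.300),
    (0.05, 0.105),
    (0.20, 0.102),
    (1.00, 0.100),
]

sufficient, redundant = redundant_fraction(example_curve)
print(f"sufficient fraction: {sufficient:.2f}, redundant share: {redundant:.2f}")
```

The tolerance is a design choice: a tighter band labels less of the data as redundant, while a looser one accepts a larger loss in accuracy in exchange for a smaller training set.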

The study supports a growing body of knowledge among experts in AI across multiple domains: models trained on relatively small datasets can perform well, provided the data quality is high.

In conclusion, the study stresses information richness over data volume alone: the quality of the information should be prioritized over gathering enormous volumes of data.

Check out the Paper. All credit for this research goes to the researchers of this project.

Rachit Ranjan is a consulting intern at MarktechPost. He is currently pursuing his B.Tech from the Indian Institute of Technology (IIT) Patna. He is actively shaping his career in the field of Artificial Intelligence and Data Science and is passionate about and dedicated to exploring these fields.

Copyright © 2023 PouroverAI News.