This AI Paper from Databricks and MIT Propose Perplexity-Based Data Pruning: Improving 3B Parameter Model Performance and Enhancing Language Models

June 5, 2024
in AI Technology


In machine learning, the focus is often on improving the performance of large language models (LLMs) while reducing the associated training costs. This frequently means improving the quality of the pretraining data, since data quality directly affects the efficiency and effectiveness of training. One prominent method is data pruning: selecting high-quality subsets from larger datasets so that models are not trained on noisy and irrelevant text, which streamlines training and improves overall model performance.

A central challenge in training LLMs is the scale and noisiness of the datasets involved. Poor-quality data can significantly degrade model performance, so it is crucial to filter out low-quality samples and retain only the most relevant, high-quality information. Effective data pruning optimizes training by ensuring that only the best data is used, improving the model's accuracy and efficiency.

Traditional data pruning methods include simple rules-based filtering and basic classifiers to identify high-quality samples. While useful, these methods are often limited in handling large-scale and diverse datasets. Advanced techniques have emerged, utilizing neural network-based heuristics to assess data quality based on various metrics such as feature similarity or sample difficulty. Despite their advantages, these methods can be computationally expensive and may not perform consistently across different data domains, necessitating the development of more efficient and universally applicable techniques.
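
For concreteness, a minimal rule-based filter of the kind described above might look like the sketch below. The thresholds and helper name are illustrative, not taken from any particular production pipeline.

```python
# Illustrative rule-based pretraining-data filter (thresholds are hypothetical).
def passes_basic_filters(text: str,
                         min_words: int = 50,
                         max_words: int = 100_000,
                         max_symbol_ratio: float = 0.1) -> bool:
    """Return True if a document passes simple quality heuristics."""
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # Reject documents dominated by non-alphanumeric symbols,
    # which are often markup or boilerplate.
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    if len(text) == 0 or symbols / len(text) > max_symbol_ratio:
        return False
    return True

corpus = ["A long, well-formed document about language models ...", "@@@###", "Too short."]
kept = [doc for doc in corpus if passes_basic_filters(doc, min_words=3)]
```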

Researchers from Databricks, MIT, and DatologyAI have introduced an innovative approach to data pruning using small reference models to compute the perplexity of text samples. This approach begins with training a small model on a random subset of the data, which then evaluates the perplexity of each sample. Perplexity, in this context, measures how well a probability model predicts a sample. Lower perplexity scores indicate higher-quality data. By focusing on samples with the lowest perplexity scores, researchers can prune the dataset to retain only the most relevant data, thus improving the performance of the larger models trained on this pruned data.
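
Perplexity itself is straightforward to compute with any causal language model: it is the exponential of the average next-token negative log-likelihood. The sketch below uses Hugging Face Transformers with GPT-2 as a stand-in for the small reference model (the paper trains its own 125M-parameter model on a subset of the corpus); the function name is illustrative.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small stand-in for the reference model; the paper uses its own 125M-parameter model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def perplexity(text: str) -> float:
    """Perplexity of `text` under the reference model:
    exp of the mean next-token negative log-likelihood."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
    # Passing labels=ids makes the model return the average
    # cross-entropy loss over next-token predictions.
    loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("Perplexity-based pruning scores every sample with a small reference model."))
```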

The proposed method involves splitting the dataset into training and validation sets for the small reference model. This model is trained on the standard next-token prediction objective, computing perplexity scores for each sample in the dataset. The dataset is then pruned based on these scores, selecting samples within a specific range of perplexities. For example, samples with the lowest perplexity are chosen using a low selection criterion. This pruned dataset is subsequently used to train the final, larger model, which benefits from the high-quality data. The effectiveness of this method is demonstrated across different dataset compositions, including the Pile, which is composed of diverse curated domains, and Dolma, a dataset derived mainly from web scrapes.
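
Once every sample has a perplexity score, the pruning step reduces to selecting a fraction of the data according to the chosen criterion. The sketch below illustrates low, medium, and high selection; the quantile boundaries and function name are illustrative rather than the paper's exact procedure.

```python
import numpy as np

def prune_by_perplexity(samples, ppl_scores, keep_fraction=0.5, selection="low"):
    """Keep `keep_fraction` of samples according to a perplexity selection criterion.

    selection="low"    -> keep the lowest-perplexity samples
    selection="high"   -> keep the highest-perplexity samples
    selection="medium" -> keep a window of samples around the median perplexity
    """
    scores = np.asarray(ppl_scores)
    order = np.argsort(scores)                 # indices sorted by ascending perplexity
    k = int(len(samples) * keep_fraction)
    if selection == "low":
        keep = order[:k]
    elif selection == "high":
        keep = order[-k:]
    else:  # "medium"
        mid = len(samples) // 2
        keep = order[mid - k // 2 : mid - k // 2 + k]
    return [samples[i] for i in keep]

docs = ["doc a", "doc b", "doc c", "doc d"]
ppls = [12.3, 55.1, 8.7, 30.2]
pruned = prune_by_perplexity(docs, ppls, keep_fraction=0.5, selection="low")
```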

Perplexity-based data pruning significantly improves the performance of LLMs on downstream tasks. For instance, pruning based on perplexity scores computed with a 125 million parameter reference model improved the average downstream performance of a 3 billion parameter model by up to 2.04%, and achieved up to a 1.45x reduction in the pretraining steps required to reach comparable baseline performance. The method also proved effective in over-trained and data-constrained regimes: in over-training scenarios, the absolute gain in average downstream normalized accuracy was similar for compute-optimal and over-trained models, demonstrating the method's robustness.

This research underscores the utility of small reference models in perplexity-based data pruning, offering a significant step forward in optimizing LLM training. By leveraging smaller models to filter out low-quality data, researchers can improve model performance and training efficiency: when training for a compute-optimal duration, the technique yielded a 1.89 improvement in downstream performance on the Pile and 1.51 on Dolma. It enhances the performance of large-scale language models while reducing the computational resources required, making it a valuable addition to the modern data researcher's toolkit.

In conclusion, the study presents a novel and effective method for data pruning that uses small reference models to compute perplexity. This approach improves the performance and efficiency of large language models by ensuring high-quality pretraining data. Its robustness across different data compositions and training regimes highlights its potential as a primary technique for modern data research. By optimizing data quality, researchers can achieve better model performance with fewer resources, making perplexity-based data pruning a valuable technique for future advances in machine learning.

Check out the Paper. All credit for this research goes to the researchers of this project.
