Friday, May 16, 2025
News PouroverAI

Self-Play Preference Optimization (SPPO): An Innovative Machine Learning Approach to Finetuning Large Language Models (LLMs) from Human/AI Feedback

May 7, 2024
in Data Science & ML
Reading Time: 4 mins read


Large Language Models (LLMs) have demonstrated remarkable abilities in generating human-like text, answering questions, and writing code. However, they still face hurdles in applications that demand high reliability, safety, and ethical adherence. Reinforcement Learning from Human Feedback (RLHF), also known as Preference-based Reinforcement Learning (PbRL), has emerged as a promising solution. This framework has shown significant success in fine-tuning LLMs to align with human preferences, making them more useful.

Existing RLHF approaches, such as the one used for InstructGPT, rely on explicit or implicit reward models, e.g., the Bradley-Terry model. Recent research instead works directly with preference probabilities to better represent human preferences. Some researchers formulate RLHF as finding Nash equilibria in constant-sum games, proposing mirror descent and Self-play Preference Optimization (SPO) methods. Direct Nash Optimization (DNO) was also introduced based on win rate gaps, yet its practical implementation still relies on an iterative Direct Preference Optimization (DPO) framework.
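To make the contrast between reward models and direct preference probabilities concrete, here is a minimal sketch of the Bradley-Terry model. The reward values and response names are hypothetical and chosen only for illustration. Under Bradley-Terry, the probability that one response is preferred over another depends only on the gap between their scalar rewards, so a cyclic preference (A over B, B over C, C over A) cannot be represented.

```python
import math


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """P(a is preferred over b) under Bradley-Terry: sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


# Hypothetical scalar rewards for three responses (illustrative values only).
rewards = {"A": 2.0, "B": 1.0, "C": 0.0}

p_ab = bradley_terry_prob(rewards["A"], rewards["B"])
p_bc = bradley_terry_prob(rewards["B"], rewards["C"])
p_ca = bradley_terry_prob(rewards["C"], rewards["A"])

# A cycle would need all three probabilities above 0.5, which would require
# r_A > r_B > r_C > r_A. No assignment of scalar rewards can satisfy that.
is_cyclic = p_ab > 0.5 and p_bc > 0.5 and p_ca > 0.5
```

A general preference model that returns a probability for each pair directly is not bound by this restriction, which is one motivation for the game-theoretic formulations discussed here.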

Researchers from the University of California, Los Angeles and Carnegie Mellon University introduce Self-Play Preference Optimization (SPPO), a self-play framework for language model alignment that addresses these RLHF challenges. It offers provable guarantees for solving two-player constant-sum games and scales to large language models. With RLHF formulated as such a game, the objective is to identify the Nash equilibrium policy, i.e., a policy whose responses are preferred at least half the time against any competing policy. The authors propose an adaptive algorithm based on multiplicative weights, employing a self-play mechanism in which the policy fine-tunes itself on its own synthetic data, annotated by the preference model.

The framework aims to solve two-player constant-sum games efficiently and at scale for large language models. It is iterative: each round applies a multiplicative weight update, with the current policy playing against itself. The algorithm asymptotically converges to the optimal policy, i.e., the Nash equilibrium, and the theoretical analysis backs this with provable guarantees. Compared to existing methods such as DPO and Identity Preference Optimization (IPO), SPPO demonstrates improved convergence and handles data sparsity more efficiently.
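The multiplicative weight update can be illustrated on a toy problem. The sketch below is not the SPPO training procedure for LLMs; it assumes a single prompt with three candidate responses and a hypothetical preference matrix `PREF`, where `PREF[i][j]` is the probability that response `i` beats response `j`. Each step reweights a response by the exponential of its win rate against the current policy, then renormalizes. The names `self_play`, `eta`, and `steps` are illustrative choices.

```python
import math


def win_rate_vs_policy(pref, policy, i):
    """Probability that response i beats a response sampled from the policy."""
    return sum(policy[j] * pref[i][j] for j in range(len(policy)))


def multiplicative_weights_step(pref, policy, eta):
    """One update: new weight is proportional to old weight * exp(eta * win rate)."""
    scores = [
        policy[i] * math.exp(eta * win_rate_vs_policy(pref, policy, i))
        for i in range(len(policy))
    ]
    total = sum(scores)
    return [s / total for s in scores]


def self_play(pref, eta=1.0, steps=100):
    """Start from the uniform policy and repeatedly play against itself."""
    n = len(pref)
    policy = [1.0 / n] * n
    for _ in range(steps):
        policy = multiplicative_weights_step(pref, policy, eta)
    return policy


# Hypothetical preference matrix: PREF[i][j] + PREF[j][i] == 1 (constant-sum).
# Response 0 beats each of the others with probability above 0.5.
PREF = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

final_policy = self_play(PREF)
```

In this toy game response 0 beats every other response, so the policy concentrates on it. In games without such a response, the standard guarantees for multiplicative weights are typically stated for averaged iterates, so the last iterate alone should not be assumed to be the equilibrium.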

The researchers use GPT-4 as an automatic judge and report results on AlpacaEval 2.0 and MT-Bench. SPPO models improve consistently across iterations, with SPPO Iter3 showing the highest win rate. Compared to DPO and IPO, SPPO achieves superior performance and effectively controls output length. Test-time reranking with the PairRM reward model consistently improves model performance without over-optimization. SPPO outperforms many state-of-the-art chatbots on AlpacaEval 2.0 and remains competitive with GPT-4 on MT-Bench.
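Test-time reranking with a pairwise preference model can be sketched as follows. This is an illustration under stated assumptions, not PairRM's actual interface: `prefer(a, b)` is a hypothetical callable returning the probability that response `a` is better than response `b`, and `toy_prefer` with its `QUALITY` table is a stand-in so the example runs without a model. The reranker picks the candidate with the highest mean win probability against the other candidates.

```python
from typing import Callable, List


def rerank_best_of_n(candidates: List[str], prefer: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest mean win probability vs. the others."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    if len(candidates) == 1:
        return candidates[0]

    def mean_win(i: int) -> float:
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(prefer(candidates[i], o) for o in others) / len(others)

    best_index = max(range(len(candidates)), key=mean_win)
    return candidates[best_index]


# Hypothetical quality scores standing in for a real preference model.
QUALITY = {"draft one": 0.2, "draft two": 0.9, "draft three": 0.5}


def toy_prefer(a: str, b: str) -> float:
    """Stand-in preference model (an assumption, not PairRM)."""
    return QUALITY[a] / (QUALITY[a] + QUALITY[b])


best = rerank_best_of_n(list(QUALITY), toy_prefer)
```

With a real pairwise model, `toy_prefer` would be replaced by a call that scores two responses to the same prompt.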

To conclude, the paper introduces Self-Play Preference Optimization (SPPO), a robust method for fine-tuning LLMs using human/AI feedback. By employing self-play in a two-player game and a preference-based learning objective, SPPO significantly improves over existing methods like DPO and IPO across various benchmarks. By integrating a preference model with batched estimation, SPPO aligns LLMs closely with human preferences and mitigates reward hacking issues such as “length bias”. These findings suggest SPPO’s potential for improving the alignment of generative AI systems and support its broader adoption in LLMs and beyond.
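The combination of batched estimation and a preference-based objective can be sketched in a few lines. This is a hedged illustration rather than the authors' implementation: it assumes that a response's win rate against the current policy is estimated by averaging pairwise preference probabilities over a batch of sampled responses, and that the objective is a squared error pulling the log-probability ratio between the new and old policy toward `eta * (win_rate - 0.5)`. The function names and the value of `eta` are hypothetical, and the exact form should be checked against the paper.

```python
def batched_win_rates(pairwise):
    """Estimate each sampled response's win rate against the current policy.

    pairwise[i][j] is the preference model's probability that sample i beats
    sample j; the diagonal is 0.5. The estimate for i is the mean over all j.
    """
    k = len(pairwise)
    return [sum(row) / k for row in pairwise]


def sppo_style_loss(logp_new, logp_old, win_rates, eta):
    """Mean squared error between the log-ratio and eta * (win_rate - 0.5)."""
    total = 0.0
    for lp_new, lp_old, rate in zip(logp_new, logp_old, win_rates):
        target = eta * (rate - 0.5)
        total += (lp_new - lp_old - target) ** 2
    return total / len(win_rates)


# Two sampled responses; the first beats the second with probability 0.8.
pairwise = [
    [0.5, 0.8],
    [0.2, 0.5],
]
rates = batched_win_rates(pairwise)  # approximately [0.65, 0.35]

# Loss when the new policy has not moved away from the old one.
loss_unchanged = sppo_style_loss([-1.0, -2.0], [-1.0, -2.0], rates, eta=2.0)

# Loss when the log-ratios exactly match the targets of +0.3 and -0.3.
loss_matched = sppo_style_loss([-0.7, -2.3], [-1.0, -2.0], rates, eta=2.0)
```

Under this assumed objective, responses that win more than half the time have their log-probability pushed up and the rest pushed down, which mirrors the multiplicative weight update in log space.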

Check out the Paper. All credit for this research goes to the researchers of this project.

Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.

Copyright © 2023 PouroverAI News.