Stanford and UT Austin Researchers Propose Contrastive Preference Learning (CPL): A Simple Reinforcement Learning (RL)-Free Method for RLHF that Works with Arbitrary MDPs and Off-Policy Data

October 31, 2023
in AI Technology


Aligning large pretrained models with human preferences has become an increasingly prominent research problem as these models have grown more capable. The alignment challenge is particularly acute when large datasets inevitably contain poor behaviors. Reinforcement learning from human feedback (RLHF) has become a popular approach to this problem: RLHF methods use human preferences to distinguish desirable from undesirable behaviors and thereby improve a learned policy. This approach has shown encouraging results for refining robot policies, enhancing image generation models, and fine-tuning large language models (LLMs) on less-than-ideal data. For most RLHF algorithms, the procedure has two phases.

First, human preference data is gathered and used to train a reward model. Second, an off-the-shelf reinforcement learning (RL) algorithm optimizes a policy against that reward model. Unfortunately, this two-phase paradigm rests on a questionable assumption. To learn reward models from preference data, these algorithms assume that human preferences are distributed according to the discounted sum of rewards, or partial return, of each behavior segment. Recent research challenges this assumption, suggesting instead that human preferences follow the regret of each action under the optimal policy for the expert's reward function. Intuitively, human evaluators likely judge optimality, not simply which states and behaviors yield greater reward.
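
To make the contrast concrete, the partial-return assumption is usually written as a Bradley-Terry style distribution over pairs of behavior segments, while the regret-based alternative replaces summed rewards with summed optimal advantages. The notation below is an illustrative sketch rather than the paper's exact formulation:

P(\sigma^{+} \succ \sigma^{-}) =
  \frac{\exp \sum_{t} \gamma^{t} r(s_{t}^{+}, a_{t}^{+})}
       {\exp \sum_{t} \gamma^{t} r(s_{t}^{+}, a_{t}^{+}) + \exp \sum_{t} \gamma^{t} r(s_{t}^{-}, a_{t}^{-})}
  \quad \text{(partial-return model)}

P(\sigma^{+} \succ \sigma^{-}) =
  \frac{\exp \sum_{t} A^{*}(s_{t}^{+}, a_{t}^{+})}
       {\exp \sum_{t} A^{*}(s_{t}^{+}, a_{t}^{+}) + \exp \sum_{t} A^{*}(s_{t}^{-}, a_{t}^{-})}
  \quad \text{(regret-based model)}

Here \sigma^{+} and \sigma^{-} denote the preferred and dispreferred segments, r is the reward being learned, and A^{*} is the advantage function of the optimal policy for the expert's reward, whose negation is the regret.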

The optimal advantage function, or equivalently the negated regret, may therefore be the ideal quantity to learn from feedback, rather than the reward. In their second phase, two-phase RLHF algorithms use RL to optimize the reward function learned in the first phase. In real-world applications, temporal credit assignment creates a variety of optimization difficulties for RL algorithms, including the instability of approximate dynamic programming and the high variance of policy gradients. As a result, earlier works restrict their scope to avoid these problems. For example, RLHF approaches for LLMs assume a contextual bandit formulation, where the policy receives a single reward value in response to a user query.

While this lessens the need for long-horizon credit assignment and, consequently, the high variance of policy gradients, the single-step bandit assumption is violated in practice because user interactions with LLMs are multi-step and sequential. Another example is the application of RLHF to low-dimensional, state-based robotics problems, where approximate dynamic programming works well; it has yet to scale to the more realistic setting of higher-dimensional continuous control with image inputs. In general, RLHF approaches ease the optimization burden of RL by making restrictive assumptions about the sequential nature or dimensionality of the problem, and they typically assume, mistakenly, that the reward function alone determines human preferences.

In this study, researchers from Stanford University, UMass Amherst, and UT Austin propose a novel family of RLHF algorithms that uses a regret-based model of preferences rather than the widely used partial-return model, which considers summed rewards. Unlike the partial-return model, the regret-based model carries exact information about the optimal course of action. This removes the need for RL, making it possible to tackle RLHF problems with high-dimensional state and action spaces in the general MDP framework. Their key insight is to combine the regret-based preference framework with the Maximum Entropy (MaxEnt) principle, which yields a bijection between advantage functions and policies.
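
A sketch of that bijection, assuming the standard MaxEnt identity with entropy temperature \alpha (illustrative notation, not copied from the paper):

A^{*}(s, a) = \alpha \log \pi^{*}(a \mid s)

Substituting \alpha \log \pi for A^{*} in the regret-based preference model turns learning advantages from preferences into a direct objective over policy log-probabilities, which is what makes the supervised formulation described next possible.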

By trading optimization over advantages for optimization over policies, they obtain a purely supervised learning objective whose optimum is the optimal policy under the expert's reward. Because the method resembles widely used contrastive learning objectives, they call it Contrastive Preference Learning (CPL). CPL has three main advantages over prior work. First, because CPL matches the optimal advantage using only supervised objectives, without dynamic programming or policy gradients, it can scale as well as supervised learning. Second, CPL is fully off-policy, so it can use any offline, even sub-optimal, data source. Third, CPL applies to preferences over sequential data, enabling learning on arbitrary Markov Decision Processes (MDPs).
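
A minimal PyTorch-style sketch of a contrastive objective of this kind, assuming the policy exposes per-step log-probabilities over each segment of a preference pair. The function and tensor names are my own, and the exact weighting used in the paper's CPL variants may differ:

import torch
import torch.nn.functional as F

def cpl_style_loss(logp_preferred: torch.Tensor,
                   logp_dispreferred: torch.Tensor,
                   alpha: float = 0.1) -> torch.Tensor:
    """Contrastive loss over a batch of preference-labeled segment pairs.

    logp_preferred, logp_dispreferred: (batch, T) tensors holding
    log pi(a_t | s_t) along the preferred and dispreferred segments.
    """
    # Score each segment by its summed, temperature-scaled log-probability;
    # under the MaxEnt bijection this stands in for the summed optimal advantage.
    score_pos = alpha * logp_preferred.sum(dim=-1)
    score_neg = alpha * logp_dispreferred.sum(dim=-1)
    # Logistic contrastive objective: push the preferred segment's score
    # above the dispreferred segment's score.
    return -F.logsigmoid(score_pos - score_neg).mean()

Because the loss touches only policy log-probabilities, no value function, dynamic programming, or on-policy rollouts are involved.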

To their knowledge, no previous RLHF technique satisfies all three of these requirements simultaneously. To show that CPL lives up to these three tenets, the researchers evaluate it on sequential decision-making problems with sub-optimal, high-dimensional, off-policy data. Notably, they show that CPL can learn temporally extended manipulation policies on the MetaWorld benchmark using essentially the same RLHF fine-tuning procedure used for dialogue models: policies are pre-trained with supervised learning from high-dimensional image observations and then fine-tuned with preferences. CPL matches the performance of earlier RL-based techniques without dynamic programming or policy gradients, while being four times more parameter-efficient and 1.6 times faster. With denser preference data, CPL outperforms RL baselines on five of six tasks. In short, by applying the maximum entropy principle, the researchers avoid the need for RL altogether: Contrastive Preference Learning (CPL) learns optimal policies from preferences without learning a reward function.
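
As a workflow illustration of the pretrain-then-finetune recipe described above, here is a sketch of how such a loss might be applied to offline preference data. The policy and data-loader objects are hypothetical placeholders, not the authors' training code:

import torch

def finetune_with_preferences(policy, preference_loader, epochs=10, lr=1e-4):
    """Fine-tune a pretrained policy on offline preference pairs (sketch).

    policy: hypothetical model with parameters() and a log_prob(obs, act)
        method returning per-step log pi(a_t | s_t) of shape (batch, T).
    preference_loader: hypothetical iterable yielding
        ((obs_pos, act_pos), (obs_neg, act_neg)) segment batches.
    """
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for (obs_pos, act_pos), (obs_neg, act_neg) in preference_loader:
            # Segments are stored offline, so no environment interaction
            # or on-policy data collection is needed.
            logp_pos = policy.log_prob(obs_pos, act_pos)
            logp_neg = policy.log_prob(obs_neg, act_neg)
            loss = cpl_style_loss(logp_pos, logp_neg)  # sketch defined above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy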

Check out the Paper. All credit for this research goes to the researchers on this project.



