The well-known Artificial Intelligence (AI)-based chatbot ChatGPT is built on GPT's transformer architecture and is fine-tuned with the Reinforcement Learning from Human Feedback (RLHF) technique. This method is crucial for steering pre-trained Large Language Models (LLMs) toward responses that are more accurate, more helpful, and better aligned with human preferences.
In RLHF, a reward model is first trained on human preferences over responses to given prompts; the language model is then fine-tuned with reinforcement learning to produce responses that maximize this learned reward. This approach simplifies data collection, since gathering human ratings is typically easier than collecting demonstrations for supervised fine-tuning.
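As a rough illustration (not the paper's exact implementation), the reward model is commonly fit with a pairwise Bradley-Terry objective over human preference pairs. The sketch below assumes the per-response rewards have already been computed as scalars:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the reward of the human-preferred
    response above the reward of the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scalar rewards for a batch of three preference pairs (illustrative values only).
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(reward_model_loss(chosen, rejected).item())
```

Once this reward model is trained, the RL stage optimizes the policy to maximize its scores, which is exactly where the hacking problem described next can arise.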
However, a key challenge with RLHF is reward hacking, where the policy earns a high reward without actually meeting the underlying objective. This happens because the reward model generalizes poorly Out-Of-Distribution (OOD) and only imperfectly represents human preferences, so the language model can exploit its flaws by drifting toward OOD responses that are scored too generously.
Human preference data adds to the problem: it is often skewed and inconsistent because of task subjectivity, flawed rating guidelines, and low rater quality. Verbosity is a common form of reward hacking, where models generate more tokens to appear thorough or well-formatted without any real improvement in answer quality.
To tackle these issues, recent research from NVIDIA and the University of Maryland focuses on mitigating reward hacking by studying how RL algorithms and reward models affect verbosity and performance. The researchers present an evaluation protocol that compares different training setups while accounting for biases in model-based evaluation, by measuring performance along the Pareto front of evaluation score versus response length.
By analyzing the trade-off between the LLM judge's score and response length, different training settings can be compared systematically, and sweeping the training hyperparameters reveals how each change shifts the balance between verbosity and answer quality.
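For intuition, the Pareto front here is simply the set of training runs that are not dominated on the (length, score) plane. The sketch below uses made-up numbers and a naive O(n²) check, purely to illustrate the idea:

```python
from typing import List, Tuple

def pareto_front(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (length, score) points not dominated by any other point,
    i.e. no other run achieves an equal-or-shorter length with an equal-or-higher score."""
    front = []
    for length, score in points:
        dominated = any(
            other_len <= length and other_score >= score and (other_len, other_score) != (length, score)
            for other_len, other_score in points
        )
        if not dominated:
            front.append((length, score))
    return sorted(front)

# Hypothetical (avg. response length, evaluation score) pairs from different training runs.
runs = [(180, 7.1), (260, 7.4), (320, 7.5), (400, 7.3), (250, 6.9)]
print(pareto_front(runs))  # keeps only the runs that trade length for score efficiently
```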
The study sweeps RL hyperparameters and applies techniques such as reward clipping and length penalties to reduce length-related reward hacking. The goal is to eliminate the misleading length signal from the reward; while careful tuning helps, the team ultimately proposes a two-head reward model that separates the length representation from the true preference signal, with the length head discarded during RL.
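The two-head idea can be sketched as follows. This is only a loose, hedged approximation of ODIN's design: the backbone, pooling, and the correlation-based penalty used here are assumptions for illustration, not the paper's exact losses or architecture.

```python
import torch
import torch.nn as nn

def pearson(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Pearson correlation between two 1-D tensors."""
    xm, ym = x - x.mean(), y - y.mean()
    return (xm * ym).sum() / (xm.norm() * ym.norm() + eps)

class TwoHeadRewardModel(nn.Module):
    """Sketch of a disentangled reward model in the spirit of ODIN: one head is
    encouraged to absorb the length-correlated part of the reward, the other to
    capture length-independent quality."""

    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                  # any LLM encoder producing a pooled hidden state
        self.quality_head = nn.Linear(hidden_size, 1)
        self.length_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)   # assumed shape: [batch, hidden_size]
        r_quality = self.quality_head(hidden).squeeze(-1)
        r_length = self.length_head(hidden).squeeze(-1)
        return r_quality, r_length

def disentangling_penalties(r_quality, r_length, response_lengths):
    """Illustrative auxiliary terms: make the length head track response length
    while keeping the quality head decorrelated from it."""
    lengths = response_lengths.float()
    length_term = -pearson(r_length, lengths)        # length head should correlate with length
    decorrelation = pearson(r_quality, lengths).abs()  # quality head should not
    return length_term, decorrelation

# During RL, only r_quality would be used as the reward; the length head is discarded.
```

The design intuition is that once the length-correlated component has its own dedicated head, removing that head at RL time leaves the policy optimizing a reward that no longer pays out for mere verbosity.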
The proposed reward-disentangling technique, ODIN, proves effective at expanding the policy's Pareto front over prior results, even with a larger tuning budget for the baselines. The technique also benefits other RL-tuning methods such as Proximal Policy Optimization (PPO) and ReMax, indicating its potential to improve performance and reduce length hacking.
Overall, the experimental results show a significant drop in the reward model's correlation with response length under this method. By prioritizing information quality over verbosity, the approach effectively addresses length-related reward hacking and improves the reliability and utility of LLMs trained with the RLHF paradigm.
Check out the Paper. All credit for this research goes to the researchers of this project.