Aligning large pretrained models with human preferences has become an increasingly prominent research problem as these models have grown in capability. Alignment is especially challenging because larger datasets inevitably contain undesirable behaviours. Reinforcement learning from human feedback (RLHF) has become a popular approach to this problem: RLHF methods use human preferences to distinguish desirable from undesirable behaviours and thereby refine a learned policy. The approach has shown encouraging results when used to fine-tune robot policies, improve image generation models, and fine-tune large language models (LLMs) from sub-optimal data. For most RLHF algorithms, the procedure has two phases.
First, preference data is collected from users and used to train a reward model. Second, an off-the-shelf reinforcement learning (RL) algorithm optimizes a policy against that reward model. Unfortunately, this two-phase paradigm rests on a flawed assumption. To learn reward models from preference data, algorithms assume that human preferences are distributed according to the discounted sum of rewards, or partial return, of each behaviour segment. Recent research challenges this assumption, suggesting instead that human preferences follow the regret of each action under the optimal policy for the expert's reward function. Intuitively, human judgement is likely focused on optimality rather than on which states and actions happen to yield higher reward.
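For readers who want the formula, the standard partial-return preference model used in most RLHF pipelines can be written roughly as follows; the notation below is illustrative and not quoted from the paper:

```latex
% Partial-return (Bradley-Terry style) preference model over two behaviour
% segments \sigma^+ and \sigma^-, with reward r and discount \gamma:
P\big[\sigma^+ \succ \sigma^-\big] =
  \frac{\exp\!\big(\sum_{t} \gamma^{t}\, r(s^{+}_{t}, a^{+}_{t})\big)}
       {\exp\!\big(\sum_{t} \gamma^{t}\, r(s^{+}_{t}, a^{+}_{t})\big)
        + \exp\!\big(\sum_{t} \gamma^{t}\, r(s^{-}_{t}, a^{-}_{t})\big)}
```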
The quantity to learn from feedback may therefore be the optimal advantage function, i.e. the negated regret, rather than the reward (see the sketch below). In their second phase, two-phase RLHF algorithms use RL to optimize the reward function learned in the first phase. In real-world settings, temporal credit assignment makes this RL step hard to optimize: approximate dynamic programming can be unstable, and policy gradients suffer from high variance. As a result, prior work restricts its scope to sidestep these problems. For example, RLHF approaches for LLMs assume a contextual bandit formulation, in which the policy receives a single reward value for its response to a user prompt.
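Concretely, the regret-based view replaces the reward sums above with sums of the optimal advantage, which is the negated regret. A sketch of this model, again with assumed notation rather than the paper's exact formulation:

```latex
% Optimal advantage and negated regret:
A^{*}(s, a) = Q^{*}(s, a) - V^{*}(s),
\qquad \text{regret}(s, a) = -A^{*}(s, a)

% Regret-based preference model over segments \sigma^+ and \sigma^-:
P\big[\sigma^+ \succ \sigma^-\big] =
  \frac{\exp\!\big(\sum_{t} \gamma^{t} A^{*}(s^{+}_{t}, a^{+}_{t})\big)}
       {\exp\!\big(\sum_{t} \gamma^{t} A^{*}(s^{+}_{t}, a^{+}_{t})\big)
        + \exp\!\big(\sum_{t} \gamma^{t} A^{*}(s^{-}_{t}, a^{-}_{t})\big)}
```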
While this reduces the need for long-horizon credit assignment, and therefore the high variance of policy gradients, the single-step bandit assumption does not actually hold: user interactions with LLMs are multi-step and sequential. Another example is the application of RLHF to low-dimensional, state-based robotics problems, where approximate dynamic programming works well but has yet to scale to more realistic high-dimensional continuous-control domains with image inputs. In general, RLHF approaches ease the optimization burden of RL by making restrictive assumptions about the sequential nature or dimensionality of the problem, and they commonly, and mistakenly, assume that the reward function alone determines human preferences.
In this study, researchers from Stanford University, UMass Amherst, and UT Austin introduce a new family of RLHF algorithms that uses a regret-based model of preferences instead of the widely used partial-return model, which only considers summed rewards. Unlike partial return, the regret-based model carries direct information about the optimal actions. Fortunately, this removes the need for RL, making it possible to tackle RLHF problems with high-dimensional state and action spaces in the general MDP framework. Their key insight is to combine the regret-based preference framework with the Maximum Entropy (MaxEnt) principle, which yields a bijection between advantage functions and policies.
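In maximum entropy RL, the optimal policy and the optimal advantage determine each other, which is the bijection the authors exploit. A brief sketch, with a temperature parameter alpha that is assumed here for illustration:

```latex
% MaxEnt optimal policy expressed through the optimal advantage (temperature \alpha):
\pi^{*}(a \mid s) = \exp\!\big(A^{*}(s, a) / \alpha\big)
\quad \Longleftrightarrow \quad
A^{*}(s, a) = \alpha \log \pi^{*}(a \mid s)
```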
By trading optimization over advantages for optimization over policies, they obtain a purely supervised learning objective whose optimum is the optimal policy under the expert's reward. Because the method resembles well-known contrastive learning objectives, they call it Contrastive Preference Learning (CPL). CPL has three main advantages over prior work. First, because CPL matches the optimal advantage using only supervised objectives, with no dynamic programming or policy gradients, it can scale as well as supervised learning. Second, CPL is fully off-policy, so it can use any offline, sub-optimal data source. Third, CPL applies to preferences over sequential data, enabling learning on arbitrary Markov Decision Processes (MDPs).
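Substituting alpha times log pi for the advantage in the regret-based preference model turns it into a contrastive, purely supervised loss over preferred and dispreferred segments. Below is a minimal PyTorch-style sketch of that objective; the function name cpl_loss, the tensor shapes, and the hyperparameter defaults are assumptions made here for illustration, not the paper's official code.

```python
import torch
import torch.nn.functional as F

def cpl_loss(logp_preferred: torch.Tensor,
             logp_dispreferred: torch.Tensor,
             alpha: float = 0.1,
             gamma: float = 1.0) -> torch.Tensor:
    """Contrastive preference loss for one pair of behaviour segments.

    logp_preferred / logp_dispreferred: shape (T,), the policy's log pi(a_t | s_t)
    at each step of the preferred / dispreferred segment (assumed equal length here).
    alpha: MaxEnt temperature; gamma: discount factor.
    Hypothetical sketch; the official implementation may differ in detail.
    """
    T = logp_preferred.shape[0]
    discounts = gamma ** torch.arange(T, dtype=logp_preferred.dtype)

    # Discounted sums of alpha * log pi stand in for sums of the optimal advantage.
    score_plus = alpha * (discounts * logp_preferred).sum()
    score_minus = alpha * (discounts * logp_dispreferred).sum()

    # Bradley-Terry style logistic objective: -log sigmoid(score_plus - score_minus).
    return -F.logsigmoid(score_plus - score_minus)
```

Because this loss depends only on the policy's log-probabilities along stored segments, it can be minimized with ordinary supervised gradient descent on off-policy data, which is where CPL's scalability and off-policy claims come from.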
To their knowledge, no previous RLHF technique satisfies all three of these properties at once. To demonstrate that CPL does, the researchers evaluate it on sequential decision-making problems with sub-optimal, high-dimensional, off-policy data. Notably, they show that CPL can learn temporally extended manipulation policies on the MetaWorld benchmark using the same RLHF fine-tuning procedure as dialogue models: policies are pre-trained with supervised learning from high-dimensional image observations and then fine-tuned with preferences. Without dynamic programming or policy gradients, CPL matches the performance of prior RL-based techniques while being four times more parameter-efficient and 1.6 times faster. With denser preference data, CPL outperforms RL baselines on five of six tasks. In summary, by applying the maximum entropy principle, the researchers arrive at Contrastive Preference Learning (CPL), an algorithm that learns optimal policies from preferences without learning a reward function and without RL.
Check out the Paper. All credit for this research goes to the researchers on this project.