Artificial intelligence has seen remarkable advances with the development of large language models (LLMs). Thanks to techniques such as reinforcement learning from human feedback (RLHF), these models have become significantly better at a wide range of tasks. The challenge, however, lies in how efficiently they can keep improving when the training signal comes solely from human feedback.
One of the core challenges in advancing LLMs is optimizing how they learn from human feedback. This feedback is gathered in a loop where models are presented with prompts and generate responses, and human raters indicate which response they prefer. The goal is to refine the models' responses to align more closely with human preferences. However, this approach typically requires a very large number of human interactions, which becomes a bottleneck for rapid model improvement.
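To make this loop concrete, below is a minimal sketch of preference-based reward learning with a Bradley-Terry style logistic loss. The linear reward model, feature dimensions, learning rate, and simulated rater are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: learn a reward model from pairwise preferences.
# Assumptions: responses are represented by fixed feature vectors and the
# reward model is linear; a hidden weight vector stands in for the rater.
import numpy as np

rng = np.random.default_rng(0)
dim = 16
w_true = rng.normal(size=dim)   # hidden preferences of the simulated rater
w = np.zeros(dim)               # reward model being learned

def rater_prefers_first(x_a, x_b):
    """Simulated rater: prefers the response with higher true reward."""
    return x_a @ w_true >= x_b @ w_true

for _ in range(2000):
    x_a, x_b = rng.normal(size=(2, dim))          # features of two responses
    x_pref, x_rej = (x_a, x_b) if rater_prefers_first(x_a, x_b) else (x_b, x_a)
    margin = (x_pref - x_rej) @ w
    p = 1.0 / (1.0 + np.exp(-margin))             # P(model agrees with rater)
    w += 0.05 * (1.0 - p) * (x_pref - x_rej)      # gradient step on -log p
```

Each human comparison contributes one such gradient step, which is why the number of queries, not compute, tends to dominate the cost of this loop.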
Current training pipelines for LLMs largely rely on passive exploration, where models generate responses to predefined prompts without actively choosing the queries that would make human feedback most informative. The alternative is to select queries with an explicit exploration scheme, such as Boltzmann exploration, infomax, or Thompson sampling, where queries are chosen using uncertainty estimates represented by an epistemic neural network (ENN). The choice of exploration scheme is critical, and double Thompson sampling has proven especially effective at generating high-performing queries. Passive exploration, while instrumental in the initial stages of LLM development, is far less efficient, often requiring an impractical number of human interactions to achieve notable improvements.
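As one concrete example of an exploration scheme, here is a minimal sketch of Boltzmann exploration over candidate responses, which samples candidates in proportion to exp(reward / temperature). The reward values and temperature are illustrative placeholders, and the scheme uses only point estimates of reward rather than epistemic uncertainty.

```python
# Minimal sketch: Boltzmann exploration over a pool of candidate responses.
import numpy as np

rng = np.random.default_rng(1)

def boltzmann_pick(rewards, temperature=0.5):
    """Sample a candidate index with probability proportional to exp(r / T)."""
    logits = np.asarray(rewards) / temperature
    probs = np.exp(logits - logits.max())      # subtract max for stability
    probs /= probs.sum()
    return int(rng.choice(len(rewards), p=probs))

candidate_rewards = rng.normal(size=100)       # point-estimate scores for 100 candidates
chosen = boltzmann_pick(candidate_rewards, temperature=0.25)
```

Lower temperatures concentrate probability on the highest-scoring candidates, which is why the temperature setting matters so much for this scheme.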
Researchers at Google DeepMind and Stanford University have introduced a novel approach to active exploration, utilizing double Thompson sampling and an ENN for query generation. This method allows the model to actively seek out the feedback that is most informative for its learning, significantly reducing the number of queries needed to reach high performance. The ENN provides uncertainty estimates that guide the exploration process, enabling the model to make more informed decisions about which queries to present for feedback.
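The following is a minimal sketch of how double Thompson sampling might select a query pair from ensemble-based uncertainty estimates: each response in the pair is the argmax under an independently sampled ensemble member, with resampling to keep the pair distinct. The function name, array shapes, and fallback logic are assumptions for illustration, not the authors' code.

```python
# Minimal sketch: double Thompson sampling over an ensemble-based ENN.
# candidate_scores[m, i] is ensemble member m's reward estimate for candidate i.
import numpy as np

rng = np.random.default_rng(2)

def double_thompson_pair(candidate_scores, max_tries=10):
    """Return indices of two distinct candidates, each the best response
    under an independently drawn posterior sample (ensemble member)."""
    num_models, num_candidates = candidate_scores.shape
    first = int(np.argmax(candidate_scores[rng.integers(num_models)]))
    for _ in range(max_tries):                 # resample until the pair is distinct
        second = int(np.argmax(candidate_scores[rng.integers(num_models)]))
        if second != first:
            return first, second
    # fall back to a random distinct candidate if the samples keep agreeing
    others = [i for i in range(num_candidates) if i != first]
    return first, int(rng.choice(others))

scores = rng.normal(size=(10, 100))            # 10 ensemble members, 100 candidates
query = double_thompson_pair(scores)           # pair of responses shown to the rater
```

Because each element of the pair reflects a different plausible reward function, the comparison tends to land where the ensemble disagrees, which is where a human label is most informative.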
In the experimental setup, agents generate responses to 32 prompts per epoch, forming queries that are evaluated by a preference simulator, and the resulting feedback is used to refine their reward models at the end of each epoch. Agents explore the response space by selecting the most informative pairs from a pool of 100 candidates. Reward models are represented either as a single multi-layer perceptron (MLP) with two hidden layers of 128 units each, or as an ensemble of 10 such MLPs serving as an epistemic neural network (ENN).
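Below is a minimal sketch of such an ensemble-style ENN reward model: 10 MLP heads, each with two hidden layers of 128 units, scoring a pool of candidate responses. The input features, initialization scale, and use of the ensemble standard deviation as the uncertainty signal are illustrative assumptions.

```python
# Minimal sketch: an ensemble of MLP reward heads as an epistemic neural network.
import numpy as np

rng = np.random.default_rng(3)
in_dim, hidden, num_models = 32, 128, 10       # hypothetical feature dim; 2x128 hidden

def init_mlp():
    shapes = [(in_dim, hidden), (hidden,), (hidden, hidden), (hidden,), (hidden, 1), (1,)]
    return [rng.normal(scale=0.1, size=s) for s in shapes]

def mlp_reward(params, x):
    w1, b1, w2, b2, w3, b3 = params
    h = np.tanh(x @ w1 + b1)
    h = np.tanh(h @ w2 + b2)
    return (h @ w3 + b3).squeeze(-1)

ensemble = [init_mlp() for _ in range(num_models)]        # 10 independent MLPs
responses = rng.normal(size=(100, in_dim))                # features of 100 candidates
scores = np.stack([mlp_reward(p, responses) for p in ensemble])

mean_reward = scores.mean(axis=0)   # point estimate per candidate
uncertainty = scores.std(axis=0)    # ensemble disagreement = epistemic signal
```

The resulting scores matrix is exactly the kind of per-member estimate that the double Thompson sampling sketch above could consume when forming a query pair.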
The results highlight the effectiveness of double Thompson sampling (TS) over other exploration methods such as Boltzmann exploration and infomax, particularly in how it uses uncertainty estimates for query selection. While Boltzmann exploration shows promise at lower temperatures, double TS consistently outperforms the alternatives by making better use of the uncertainty estimates from the ENN reward model. This approach accelerates the learning process and demonstrates that efficient exploration can dramatically reduce the volume of human feedback required, marking a significant advance in training large language models.
In conclusion, this research showcases the potential for efficient exploration to overcome the limitations of traditional training methods. The team has opened new avenues for rapid and effective model enhancement by leveraging advanced exploration algorithms and uncertainty estimates. This approach promises to accelerate innovation in LLMs and highlights the importance of optimizing the learning process for the broader advancement of artificial intelligence.
Check out the Paper. All credit for this research goes to the researchers of this project.