By further pre-training on image-text pairs or fine-tuning on specialized visual instruction tuning datasets, Large Language Models can be extended into the multimodal domain, giving rise to powerful Large Multimodal Models (LMMs). However, building LMMs faces obstacles, chief among them the gap in both quantity and quality between multimodal data and text-only datasets. Consider the LLaVA model, initialized from a pre-trained visual encoder and an instruction-tuned language model: it is trained on only 150K synthetic image-based conversations, far fewer examples than text-only models, which use over 100M examples across 1,800 tasks. Such data restrictions can leave the visual and language modalities poorly aligned.
As a result, LMMs can produce hallucinated outputs that are not accurately grounded in the context the images provide. Researchers from UC Berkeley, CMU, UIUC, UW–Madison, UMass Amherst, Microsoft Research, and MIT-IBM Watson AI Lab present LLaVA-RLHF, a vision-language model trained for enhanced multimodal alignment, to address the issues caused by the scarcity of high-quality visual instruction tuning data for LMM training. One of their major contributions is adapting multimodal alignment for LMMs to Reinforcement Learning from Human Feedback (RLHF), a universal and scalable alignment paradigm that has proven remarkably effective for text-based AI agents. Their approach collects human preferences that focus on identifying hallucinations and uses those preferences to fine-tune the LMM with reinforcement learning.
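The collected human preferences are typically used to train a reward model with a pairwise (Bradley-Terry) objective: the response annotators preferred, e.g. the less hallucinated one, should receive the higher score. A minimal sketch of that loss (not the authors' code; scalar scores stand in for a real reward model's outputs):

```python
import math

def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise Bradley-Terry loss for reward-model training:
    -log(sigmoid(margin)), small when the human-preferred (chosen)
    response already scores higher than the rejected one."""
    margin = score_chosen - score_rejected
    return math.log1p(math.exp(-margin))

# Hypothetical scalar reward scores for one preference pair.
loss_good = preference_loss(2.0, 0.5)  # preferred answer ranked higher -> small loss
loss_bad = preference_loss(0.5, 2.0)   # ranking inverted -> large loss
```

Minimizing this loss over many labeled pairs teaches the reward model to score hallucination-free responses above hallucinated ones.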
This strategy can improve multimodal alignment at a relatively low annotation cost: roughly $3,000 to gather 10K human preferences for image-based conversations. To their knowledge, this is the first effective use of RLHF for multimodal alignment. A known problem with the current RLHF paradigm is reward hacking: achieving high scores from the reward model does not always correspond to improvement in human judgment. Previous research proposed iteratively gathering "fresh" human feedback to prevent reward hacking, but this approach is typically expensive and cannot properly exploit existing human preference data. This study proposes a more data-efficient alternative, aiming to make the reward model capable of leveraging the knowledge already present in larger language models and existing human-annotated data.
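A standard guard against reward hacking in RLHF pipelines generally (not specific to this paper) is to shape the reward with a KL penalty that discourages the policy from drifting far from the SFT model, which limits how aggressively RL can exploit flaws in the reward model. A minimal per-token sketch, with all values hypothetical:

```python
def shaped_reward(rm_score: float,
                  logp_policy: float,
                  logp_sft: float,
                  kl_coef: float = 0.1) -> float:
    """Reward-model score minus a KL penalty toward the SFT policy.
    Uses the common sample-based estimate log p_policy - log p_sft
    for the KL contribution of one sampled token."""
    kl_estimate = logp_policy - logp_sft
    return rm_score - kl_coef * kl_estimate

# Same reward-model score, but the second policy has drifted far from
# the SFT model, so its shaped reward is lower.
r_close = shaped_reward(1.0, logp_policy=-2.0, logp_sft=-2.1)
r_drift = shaped_reward(1.0, logp_policy=-0.5, logp_sft=-4.0)
```

The coefficient `kl_coef` trades off reward maximization against staying close to the supervised starting point.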
Figure 1: A diagram illustrating how hallucinations can arise during the Supervised Fine-Tuning (SFT) phase of LMM training, and how Factually Augmented RLHF addresses the limited capacity of the reward model, which is initialized from the SFT model.
First, they use a superior visual encoder with higher resolution and a larger language model to enhance the reward model's overall capability. Second, they present the Factually Augmented RLHF algorithm, which, as shown in Fig. 1, calibrates the reward signals by supplementing them with additional ground-truth information such as image captions or the correct multiple-choice answer. To enhance the general capabilities of LMMs during the Supervised Fine-Tuning stage, they also augment the synthetic visual instruction tuning data with existing high-quality human-annotated multimodal data in conversation format: they convert Flickr30k into a Spotting Captioning task and VQA-v2 and A-OKVQA into multi-round QA tasks, and train the LLaVA-SFT+ models on the combined dataset.
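The core idea of Factually Augmented RLHF can be illustrated as input construction for the reward model: alongside the question and the candidate response, the reward model is given ground-truth facts (a human-written caption, and optionally the correct multiple-choice answer), so it can catch hallucinations it would otherwise miss. A hedged sketch; the function name and prompt layout are illustrative, not the authors' implementation:

```python
from typing import Optional

def build_fact_augmented_input(question: str,
                               response: str,
                               image_caption: str,
                               gt_answer: Optional[str] = None) -> str:
    """Assemble the reward model's input with factual grounding:
    the ground-truth caption (and answer, if available) precedes the
    question and the candidate response being scored."""
    facts = [f"Image caption: {image_caption}"]
    if gt_answer is not None:
        facts.append(f"Ground-truth answer: {gt_answer}")
    return "\n".join(facts + [f"Question: {question}",
                              f"Response: {response}"])

# A response hallucinating a cat is easy to flag once the reward
# model also sees the ground-truth caption mentioning a dog.
prompt = build_fact_augmented_input(
    question="What animal is on the bench?",
    response="A black cat is sitting on the bench.",
    image_caption="A small dog sleeping on a park bench.",
    gt_answer="dog",
)
```

In this scheme the reward model compares the response against the supplied facts rather than relying solely on its own (limited) visual grounding.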
Finally, they consider how to evaluate the multimodal alignment of LMMs in real-world generation settings, paying particular attention to penalizing hallucinations. Their benchmark, MMHAL-BENCH, contains questions covering all 12 of COCO's major object categories across eight question types. According to their analysis, this benchmark closely matches human assessments, especially when responses are scored for anti-hallucination. As the first LMM trained with RLHF, LLaVA-RLHF performs admirably in their experimental evaluation: it reaches a 94% performance level relative to GPT-4 on LLaVA-Bench, achieves a 60% improvement on MMHAL-BENCH, and sets new performance records for LLaVA with 52.4% on MMBench and 82.7% F1 on POPE. Their code, model, and data are publicly available on GitHub.
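POPE probes object hallucination with yes/no questions ("Is there a dog in the image?") and reports F1, treating "yes" (object present) as the positive class. A minimal scorer under that assumption, with made-up predictions and labels:

```python
def f1_yes_no(predictions, labels, positive="yes"):
    """F1 score for POPE-style binary hallucination probing.
    A model that hallucinates objects produces spurious 'yes'
    answers (false positives), which lowers precision and F1."""
    tp = sum(p == positive and l == positive for p, l in zip(predictions, labels))
    fp = sum(p == positive and l != positive for p, l in zip(predictions, labels))
    fn = sum(p != positive and l == positive for p, l in zip(predictions, labels))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy example: one hallucinated "yes" (third item) costs precision.
score = f1_yes_no(["yes", "no", "yes", "yes"],
                  ["yes", "no", "no", "yes"])
```

The reported 82.7% F1 on POPE corresponds to this kind of scoring over the full benchmark.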
Check out the Paper and Project. All credit for this research goes to the researchers on this project.