AI development is shifting from static, task-centric models to dynamic, adaptable agent-based systems suitable for various applications. AI systems aim to gather sensory data and effectively engage with environments, a longstanding research goal. Developing generalist AI offers advantages, including training a single neural model across multiple tasks and data types. This approach is highly scalable through data, computational resources, and model parameters.
Recent works highlight the advantages of developing generalist AI systems by training a single neural model across various tasks and data types, offering scalability through data, compute, and model parameters. However, challenges persist, as large foundation models often produce hallucinations and infer incorrect information due to insufficient grounding in training environments. Current multimodal system approaches, relying on frozen pre-trained models for each modality, may perpetuate errors without cross-modal pre-training.
Researchers from Stanford University, Microsoft Research, Redmond, and the University of California, Los Angeles, have proposed the Interactive Agent Foundation Model, which introduces a unified pre-training framework for processing text, visual data, and actions, treating each as separate tokens. It utilizes pre-trained language and visual-language models to predict masked tokens across all modalities. It enables interaction with humans and environments, incorporating visual-language understanding. With 277M parameters jointly pre-trained across diverse domains, it engages effectively in multi-modal settings across various virtual environments.
Evaluation across robotics, gaming, and healthcare tasks demonstrates promising results. Despite being outperformed in certain tasks by other models due to less data for pre-training, the method showcases competitive performance, especially in robotics, where it significantly surpasses a comparative model. Fine-tuning the pre-trained model proves notably effective in gaming tasks compared to training from scratch. In healthcare applications, the method outperforms several baselines leveraging CLIP and OPT for initialization, demonstrating the efficacy of its diverse pre-training approach.
In conclusion, Researchers proposed the Interactive Agent Foundation Model, which is adept at processing text, action, and visual inputs and demonstrates effectiveness across diverse domains. Pre-training on a mixture of robotics and gaming data enables the model to proficiently model actions, even exhibiting positive transfer to healthcare tasks during fine-tuning. Its broad applicability across decision-making contexts suggests potential for generalist agents in multimodal systems, unlocking new opportunities for AI advancement.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and Google News. Join our 37k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.
If you like our work, you will love our newsletter.
Don’t Forget to join our Telegram Channel
Asjad is an intern consultant at Marktechpost. He is pursuing B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a Machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.