Large Language Models (LLMs) have demonstrated remarkable capabilities in recent times. With the constant advances in Artificial Intelligence (AI), Natural Language Processing (NLP), and Natural Language Generation (NLG), these models have evolved and found their way into almost every industry. As the field grows, integrating text, images, and sound has become essential for building models that can handle and analyze a variety of input sources.
In response, Fireworks.ai has released FireLLaVA, the first commercially permissive open-source multi-modal model released under the Llama 2 Community License. The team has shared that FireLLaVA's approach to comprehending both text prompts and visual content will make Vision-Language Models (VLMs) far more versatile.
VLMs have already proven extremely useful in a variety of applications, from chatbots that can comprehend graphical data to tools that generate marketing descriptions from product photos. LLaVA, a well-known VLM, is notable for its remarkable performance across 11 benchmarks. However, because of its non-commercial license, the open-source version, LLaVA v1.5 13B, is restricted from commercial use.
FireLLaVA addresses this restriction: it is available for free download, experimentation, and project integration under a commercially permissive license. Building on LLaVA's potential, FireLLaVA adopts the same general architecture and training methodology, enabling the language model to understand and respond to textual and visual inputs with equal efficiency.
FireLLaVA has been developed with a wide range of real-world applications in mind, such as answering questions about photos and deciphering intricate data sources, which improves the precision and breadth of AI-driven insights.
Training data is a major obstacle in developing models that can be used commercially. Despite being open source, the original LLaVA model had limitations because it was released under non-commercial terms and was trained on data generated by GPT-4. For FireLLaVA, the team adopted a different strategy: generating the training data using solely Open-Source Software (OSS) models.
To balance the model's quality and efficiency, the team used the language-only OSS CodeLlama 34B Instruct model to recreate the training data. Upon evaluation, the team shared that the resulting FireLLaVA model performed comparably to the original LLaVA model on a number of benchmarks and surpassed it on four of the seven, demonstrating that a language-only model can be bootstrapped to produce high-quality training data for a VLM.
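The announcement does not detail the exact generation pipeline, but the LLaVA-style recipe it builds on works by describing an image to a language-only model in text (captions and object bounding boxes) and asking it to invent instruction-response pairs about that image. The sketch below illustrates that idea; the endpoint URL, model identifier, and prompt wording are assumptions for illustration, not Fireworks' actual pipeline.

```python
# Illustrative sketch of LLaVA-style data bootstrapping with a language-only
# model: the model never sees pixels, only textual image metadata (captions
# and bounding boxes), and is asked to write instruction/response pairs about
# the image. Endpoint URL, model identifier, and prompt wording are assumed
# for illustration and are not Fireworks' actual pipeline.
import requests

FIREWORKS_URL = "https://api.fireworks.ai/inference/v1/chat/completions"  # assumed endpoint
API_KEY = "YOUR_FIREWORKS_API_KEY"

def generate_vqa_pairs(captions, objects):
    """Ask a language-only model to invent Q&A pairs grounded only in text metadata."""
    context = (
        "Captions:\n" + "\n".join(captions) +
        "\n\nObjects (name: bounding box):\n" +
        "\n".join(f"{name}: {box}" for name, box in objects)
    )
    payload = {
        "model": "accounts/fireworks/models/codellama-34b-instruct",  # assumed identifier
        "messages": [
            {"role": "system",
             "content": ("You are generating visual instruction-tuning data. Using only "
                         "the captions and object boxes provided, write three question-"
                         "answer pairs a user might ask about the image.")},
            {"role": "user", "content": context},
        ],
        "temperature": 0.7,
    }
    resp = requests.post(FIREWORKS_URL, json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"})
    resp.raise_for_status()
    # Downstream, the generated pairs would be filtered and paired with the
    # actual image to form (image, instruction, response) training examples.
    return resp.json()["choices"][0]["message"]["content"]

pairs = generate_vqa_pairs(
    captions=["A red train crosses a stone bridge over a river."],
    objects=[("train", [120, 80, 460, 210]), ("bridge", [0, 190, 640, 320])],
)
print(pairs)
```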
The team has shared that FireLLaVA lets developers easily incorporate vision-capable features into their apps through its completions and chat completions APIs, whose interface is compatible with OpenAI's Vision models. The project's website includes several demo examples; in one, the model was given an image of a train traveling across a bridge and prompted to describe the scene, and it returned an accurate description of both the image and the scene.
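Given the stated OpenAI compatibility, a call to FireLLaVA could look like the minimal sketch below, which points the OpenAI Python client at a Fireworks base URL and sends an image alongside a text prompt. The base URL and model identifier are assumptions; the Fireworks documentation has the authoritative values.

```python
# Minimal sketch of calling FireLLaVA through an OpenAI-compatible chat
# completions API. The base URL and model identifier are assumptions based on
# the article's claim of OpenAI Vision compatibility; consult the Fireworks
# documentation for the exact values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.fireworks.ai/inference/v1",  # assumed OpenAI-compatible endpoint
    api_key="YOUR_FIREWORKS_API_KEY",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/firellava-13b",  # assumed model identifier
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the scene in this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/train-on-bridge.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```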
The release of FireLLaVA is a noteworthy advancement in multi-modal Artificial Intelligence. Its performance on benchmarks points to a bright future for flexible, commercially usable vision-language models.