
The field of Generative Artificial Intelligence is rapidly evolving with the introduction of large multimodal models (LMMs). These models are reshaping the way we interact with AI systems by enabling the use of both text and images as inputs. One prominent example is OpenAI’s GPT-4 Vision, but its closed-source and commercial nature can limit its application.
In response to this challenge, the open-source community has produced an alternative, LLaVA 1.5, which offers a promising blueprint for open-source counterparts to GPT-4 Vision. LLaVA 1.5 combines existing generative AI components into a computationally efficient model that performs strongly across a range of vision-language tasks.
Components of Large Multimodal Models
LMMs typically comprise several key components, including a pre-trained model for encoding visual features, a pre-trained large language model (LLM) for comprehending user instructions and generating responses, and a vision-language cross-modal connector to align visual encoders with language models.
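To make this architecture concrete, here is a minimal sketch in PyTorch of how the three pieces fit together. The ToyLMM class, its stand-in encoder and LLM, and the small dimensions are illustrative assumptions, not LLaVA's actual implementation.

```python
import torch
import torch.nn as nn

class ToyLMM(nn.Module):
    """Toy skeleton of an LMM: vision encoder + connector + language model.
    All modules and sizes are placeholders, not LLaVA's real components."""
    def __init__(self, vision_dim=256, llm_dim=512):
        super().__init__()
        # stand-in for a pre-trained visual encoder (e.g. a CLIP vision tower)
        self.vision_encoder = nn.Linear(3 * 224 * 224, vision_dim)
        # cross-modal connector: maps image features into the LLM's embedding space
        self.connector = nn.Linear(vision_dim, llm_dim)
        # stand-in for a pre-trained LLM that operates directly on embeddings
        self.language_model = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, images, text_embeddings):
        # images: (B, 3, 224, 224); text_embeddings: (B, T, llm_dim)
        feats = self.vision_encoder(images.flatten(1)).unsqueeze(1)  # (B, 1, vision_dim)
        image_tokens = self.connector(feats)                         # (B, 1, llm_dim)
        # prepend the projected image tokens so the LLM attends to image and text jointly
        return self.language_model(torch.cat([image_tokens, text_embeddings], dim=1))

model = ToyLMM()
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 16, 512))
print(out.shape)  # torch.Size([2, 17, 512])
```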
The training process for instruction-following LMMs generally consists of two stages. The first stage, known as vision-language alignment pre-training, involves using image-text pairs to align visual features with the language model’s word embedding space.
The second stage, visual instruction tuning, equips the model to understand and respond to prompts involving visual content. This stage often presents challenges due to its computational intensity and the need for an extensive dataset of well-curated examples.
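A rough way to picture the two-stage recipe is in terms of which parameters are trainable at each stage. The sketch below reuses the hypothetical ToyLMM from the previous snippet and omits the data loading and loss computation a real training loop would need.

```python
import torch

model = ToyLMM()

# Stage 1: vision-language alignment pre-training on image-text pairs.
# The pre-trained encoder and LLM stay frozen; only the connector is trained.
for p in model.vision_encoder.parameters():
    p.requires_grad = False
for p in model.language_model.parameters():
    p.requires_grad = False
opt_stage1 = torch.optim.AdamW(model.connector.parameters(), lr=1e-3)

# Stage 2: visual instruction tuning on curated instruction-response examples.
# The LLM is unfrozen so it learns to follow prompts that reference the image;
# the vision encoder remains frozen.
for p in model.language_model.parameters():
    p.requires_grad = True
opt_stage2 = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)
```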
The Unique Approach
What sets LLaVA 1.5 apart is its choice of a CLIP (Contrastive Language-Image Pre-training) model as its visual encoder. Developed by OpenAI in 2021, CLIP learns to associate images and text through training on a substantial dataset of image-description pairs. It’s also employed in advanced text-to-image models like DALL-E 2.
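For readers who want to see what a CLIP encoder does, the snippet below scores candidate captions against an image using the Hugging Face transformers library. The checkpoint name and the image file are assumptions for illustration and may not match the exact encoder variant LLaVA 1.5 uses.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("cat.jpg")  # any local image
inputs = processor(
    text=["a photo of a cat", "a photo of a dog"],
    images=image, return_tensors="pt", padding=True,
)

with torch.no_grad():
    outputs = model(**inputs)

# higher probability means the caption matches the image more closely
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```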
For its language model, LLaVA relies on Vicuna, a variant of Meta’s open-source LLaMA model fine-tuned for instruction-following tasks. The original LLaVA model used the text-only ChatGPT and GPT-4 to generate training data for visual fine-tuning, an approach that proved effective.
LLaVA 1.5 further improves on this design by connecting the language model and vision encoder through a multi-layer perceptron (MLP), a small stack of fully connected layers that replaces the original model’s single linear projection. The researchers also added several open-source visual question-answering datasets to the training data, scaled up the input image resolution, and incorporated data from ShareGPT, an online platform where users share their conversations with ChatGPT. The full training set comprised approximately 600,000 examples, and training took about a day on eight A100 GPUs, at a cost of only a few hundred dollars.
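As a rough sketch, such a connector can be as simple as two fully connected layers with a nonlinearity between them. The 1024-to-4096 dimensions below assume CLIP ViT-L/14 features and a Vicuna-7B hidden size; they are stated here as assumptions rather than taken from the released code.

```python
import torch.nn as nn

# Hypothetical two-layer MLP connector: projects 1024-dim visual features
# (assumed CLIP ViT-L/14 output) into a 4096-dim LLM embedding space
# (assumed Vicuna-7B hidden size).
mlp_connector = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.GELU(),
    nn.Linear(4096, 4096),
)
```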
Researchers have found that LLaVA 1.5 outperforms other open-source LMMs on 11 out of 12 multimodal benchmarks. However, it’s crucial to note that measuring LMM performance is complex, and benchmarks may not always reflect real-world application performance. An online demo of LLaVA 1.5 is available, showcasing impressive results from a budget-friendly, smaller model. Both the code and datasets are accessible, fostering further development and customization. Users are sharing examples demonstrating LLaVA 1.5’s ability to handle complex prompts.
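Because the code and weights are publicly available, the model can also be tried locally. The example below is a hedged sketch using the Hugging Face transformers integration; the llava-hf/llava-1.5-7b-hf checkpoint name, the LlavaForConditionalGeneration class, and the prompt template are assumptions based on community packaging of the model rather than the authors' own release.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# assumed community-hosted checkpoint; substitute whichever LLaVA 1.5 weights you use
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# assumed prompt template: the <image> token marks where visual features are inserted
prompt = "USER: <image>\nWhat is unusual about this picture? ASSISTANT:"
image = Image.open("example.jpg")  # any local image
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```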
The Limitations
However, LLaVA 1.5 comes with a caveat: because its instruction data was generated with ChatGPT, it cannot be used for commercial purposes under ChatGPT’s terms of use, which prohibit developers from using the service to train competing commercial models. Building an AI product also involves challenges beyond model training, and LLaVA is not yet a competitor to GPT-4 Vision in terms of convenience, ease of use, and integration with other OpenAI tools like DALL-E 3 and external plugins. Nonetheless, LLaVA 1.5 offers appealing features, such as its cost-effectiveness and its scalable recipe for generating visual instruction-tuning data with LLMs. Several open-source ChatGPT alternatives could fill that role, and it is only a matter of time before others replicate LLaVA 1.5’s success and explore new directions, including permissive licensing and application-specific models.
The Future
LLaVA 1.5 provides a glimpse of what to expect in the coming months within the open-source LMM landscape. Open-source LMMs like LLaVA 1.5 mark a significant advance in generative artificial intelligence, and more efficient and accessible models are likely to follow, further democratizing this new wave of AI technologies.
While commercial models like GPT-4 Vision have their strengths, the open-source community is making substantial progress toward accessible and efficient alternatives. Though not yet on par with commercial offerings, these models demonstrate the potential of open development to broaden access to generative AI and expand its applications, and continued innovation from the community should keep pushing those boundaries in the near future.
