Building Multimodal Autoregressive Models for Time-Aligned and Contextual Modalities
By Isaac Noble, Software Engineer, Google Research, and Anelia Angelova, Research Scientist, Google DeepMind
When developing machine learning models for real-life applications, it is important to consider inputs from multiple modalities in order to capture different aspects of the world around us. Modalities such as audio and text provide diverse, complementary information about visual inputs. However, building multimodal models is challenging due to the heterogeneity of these modalities: audio and video are time-synchronized and carry a large volume of data, while text is sequential but not necessarily aligned with them in time.
In our recent research paper, “Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities”, we introduce a multimodal autoregressive model (Mirasol3B) that addresses these challenges. Our model consists of separate autoregressive components for time-synchronized modalities (audio and video) and modalities that are not necessarily time-aligned but still sequential, such as text inputs. Additionally, the time-aligned modalities are partitioned in time to facilitate joint feature learning.
By decoupling the multimodal modeling into separate autoregressive models and leveraging partitioned time-aligned modalities, our approach allows for the efficient handling of longer videos compared to other multimodal models. Mirasol3B, with 3B parameters, is also more compact than prior models like Flamingo (80B) and PaLI-X (55B).
In our experiments, Mirasol3B outperformed state-of-the-art approaches on various benchmarks, including video question answering (video QA), long video QA, and audio-video-text benchmarks.
Model Architecture
The Mirasol3B architecture consists of separate autoregressive models for the time-aligned modalities (audio and video) and for the contextual modalities (text), together with a Combiner module that learns compact but informative audio-video features, which allows long video/audio inputs to be processed.
Coordinating Time-Aligned and Contextual Modalities
Because video, audio, and text have distinct characteristics, we coordinate learning between the time-aligned and contextual components with cross-attention. This lets the two components exchange information without requiring the text to be synchronized in time with the audio and video.
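The sketch below illustrates this coordination, assuming the time-aligned component exposes a sequence of latent audio-video features and the text side is a single small Transformer layer. Module names, dimensions, and shapes are illustrative placeholders, not the actual Mirasol3B configuration.

```python
import torch
import torch.nn as nn

class TextWithCrossAttention(nn.Module):
    """One text layer: causal self-attention, then cross-attention into audio-video latents."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text: torch.Tensor, av_latents: torch.Tensor) -> torch.Tensor:
        # Causal self-attention over the text tokens.
        length = text.size(1)
        causal = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        h, _ = self.self_attn(text, text, text, attn_mask=causal)
        text = self.norm1(text + h)
        # Cross-attention: text queries attend over the audio-video latents, so the
        # two components exchange information without being synchronized in time.
        h, _ = self.cross_attn(text, av_latents, av_latents)
        text = self.norm2(text + h)
        return self.norm3(text + self.ffn(text))

if __name__ == "__main__":
    text_tokens = torch.randn(2, 16, 256)  # (batch, text length, dim)
    av_latents = torch.randn(2, 48, 256)   # (batch, audio-video features, dim)
    layer = TextWithCrossAttention()
    print(layer(text_tokens, av_latents).shape)  # torch.Size([2, 16, 256])
```

Because the text tokens only query the audio-video latents through cross-attention, each stream can be processed at its own rate and length.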
Time-Aligned Autoregressive Modeling of Video and Audio
To preserve temporal information in long videos, we adopt an autoregressive modeling strategy. The video and audio inputs are partitioned into smaller chunks in time, each chunk is processed by the Combiner into a joint audio-video feature representation, and an autoregressive Transformer then learns the temporal relationships between chunks by conditioning each chunk's representation on the ones that came before it.
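As a rough illustration, the following sketch partitions a long audio-video token stream into chunks and runs a block-causal Transformer over the per-chunk features, so that each chunk attends only to itself and to earlier chunks. The chunk length, feature count, and the stand-in for the Combiner are assumptions made for the example, not the Mirasol3B configuration.

```python
import torch
import torch.nn as nn

def partition_into_chunks(tokens: torch.Tensor, chunk_len: int) -> torch.Tensor:
    """Split (batch, time, dim) tokens into (batch, num_chunks, chunk_len, dim)."""
    b, t, d = tokens.shape
    num_chunks = t // chunk_len
    return tokens[:, : num_chunks * chunk_len].reshape(b, num_chunks, chunk_len, d)

class ChunkAutoregressor(nn.Module):
    """Causal Transformer over per-chunk features: chunk t attends only to chunks <= t."""

    def __init__(self, dim: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, chunk_feats: torch.Tensor) -> torch.Tensor:
        # chunk_feats: (batch, num_chunks, feats_per_chunk, dim), e.g. from a combiner.
        b, n, m, d = chunk_feats.shape
        seq = chunk_feats.reshape(b, n * m, d)
        # Block-causal mask: a feature may attend within its own chunk and to all
        # earlier chunks, which preserves the temporal (autoregressive) ordering.
        chunk_index = torch.arange(n).repeat_interleave(m)
        blocked = chunk_index[None, :] > chunk_index[:, None]
        mask = torch.zeros(n * m, n * m).masked_fill(blocked, float("-inf"))
        out = self.backbone(seq, mask=mask)
        return out.reshape(b, n, m, d)

if __name__ == "__main__":
    av_tokens = torch.randn(2, 640, 256)            # long audio-video token stream
    chunks = partition_into_chunks(av_tokens, 128)  # (2, 5, 128, 256)
    # Stand-in for the Combiner: average each chunk's tokens and repeat to 8 features.
    feats = chunks.mean(dim=2, keepdim=True).repeat(1, 1, 8, 1)
    model = ChunkAutoregressor()
    print(model(feats).shape)  # torch.Size([2, 5, 8, 256])
```

Because the backbone only ever sees a few features per chunk, the sequence it processes grows slowly with video length, which is what makes longer videos tractable.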
Modeling Long Videos with a Modality Combiner
To combine the video and audio signals within each chunk, we introduce a learning module called the Combiner. The Combiner processes a chunk's video and audio inputs and produces a joint feature representation, reducing the large number of raw audio and video tokens to a compact set of features so that the high data volume of these signals remains manageable.
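A minimal sketch of such a combiner is shown below. It assumes the per-chunk video and audio tokens are simply concatenated and passed through a small causal Transformer, and that keeping only the last m output positions is how the fixed-size joint representation is obtained; sizes and layer counts are illustrative, not the values used in Mirasol3B.

```python
import torch
import torch.nn as nn

class TransformerCombiner(nn.Module):
    """Fuse one chunk's audio and video tokens into m joint features."""

    def __init__(self, dim: int = 256, num_out_feats: int = 8):
        super().__init__()
        self.num_out_feats = num_out_feats
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # Concatenate both modalities along the token dimension and encode them jointly.
        tokens = torch.cat([video_tokens, audio_tokens], dim=1)
        length = tokens.size(1)
        causal = torch.triu(torch.full((length, length), float("-inf")), diagonal=1)
        fused = self.encoder(tokens, mask=causal)
        # Keep only the last m outputs: a large chunk of raw tokens is reduced
        # to a small, fixed-size joint audio-video representation.
        return fused[:, -self.num_out_feats :]

if __name__ == "__main__":
    video = torch.randn(2, 128, 256)  # (batch, video tokens in one chunk, dim)
    audio = torch.randn(2, 32, 256)   # (batch, audio tokens in one chunk, dim)
    combiner = TransformerCombiner()
    print(combiner(video, audio).shape)  # torch.Size([2, 8, 256])
```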
Combiner Styles
In its simplest form, the Combiner is a shallow causal Transformer that maps a chunk's audio and video tokens to a small, fixed number of combined features. Alternatively, it can use a learnable memory component, such as the Token Turing Machine (TTM), which compresses the features of previous chunks into a fixed-size memory to reduce computation.
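The sketch below gives a rough idea of the memory-based variant: a fixed-size memory is read from and written to once per chunk, so per-chunk computation stays constant regardless of video length. The read and write here use simple learned-query attention as a stand-in for TTM's token summarization, so treat this as an illustration of the idea rather than the exact TTM mechanism; all sizes are assumed.

```python
import torch
import torch.nn as nn

class MemoryCombiner(nn.Module):
    """Memory-based combiner: read from memory + chunk, process, write back to memory."""

    def __init__(self, dim: int = 256, mem_size: int = 16, num_out_feats: int = 8):
        super().__init__()
        self.mem_size = mem_size
        self.dim = dim
        self.read_queries = nn.Parameter(torch.randn(num_out_feats, dim) * 0.02)
        self.write_queries = nn.Parameter(torch.randn(mem_size, dim) * 0.02)
        self.read_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.write_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.process = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=1
        )

    def init_memory(self, batch: int) -> torch.Tensor:
        return torch.zeros(batch, self.mem_size, self.dim)

    def forward(self, memory: torch.Tensor, chunk_tokens: torch.Tensor):
        b = chunk_tokens.size(0)
        context = torch.cat([memory, chunk_tokens], dim=1)
        # Read: a few learned queries summarize the memory plus the current chunk.
        q = self.read_queries.unsqueeze(0).expand(b, -1, -1)
        read, _ = self.read_attn(q, context, context)
        feats = self.process(read)  # compact features for this chunk
        # Write: compress the old memory and the new features back into a
        # fixed-size memory, so cost per chunk does not grow with video length.
        wq = self.write_queries.unsqueeze(0).expand(b, -1, -1)
        kv = torch.cat([memory, feats], dim=1)
        new_memory, _ = self.write_attn(wq, kv, kv)
        return feats, new_memory

if __name__ == "__main__":
    combiner = MemoryCombiner()
    memory = combiner.init_memory(batch=2)
    for _ in range(5):                     # five consecutive chunks
        chunk = torch.randn(2, 160, 256)   # audio-video tokens of one chunk
        feats, memory = combiner(memory, chunk)
    print(feats.shape, memory.shape)       # (2, 8, 256) (2, 16, 256)
```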
Results
We evaluated our approach on multiple benchmarks, including video QA and audio-video classification tasks. Our model outperforms state-of-the-art approaches while using far fewer parameters, and, notably, it can efficiently process longer videos without increasing the model size.
Conclusion
Our multimodal autoregressive model, Mirasol3B, offers an effective approach to capturing information from multiple modalities. By accounting for the distinct characteristics of the different modalities and partitioning the time-aligned inputs, we achieve improved performance on a range of tasks with a model that is compact, efficient, and capable of handling longer videos.