Srinivas Sunkara and Gilles Baechler, Software Engineers at Google Research, discuss how screen user interfaces (UIs) and infographics enable rich, interactive experiences in human communication and human-machine interaction. UIs and infographics share design principles and a common visual language, offering an opportunity to build a single model that can understand, reason about, and interact with these interfaces. However, because of their complexity and varied presentation formats, infographics and UIs pose a unique modeling challenge. To address this challenge, they introduce “ScreenAI: A Vision-Language Model for UI and Infographics Understanding”.
ScreenAI improves upon the PaLI architecture with the flexible patching strategy from pix2struct. The model is trained on a unique mixture of datasets and tasks, including a Screen Annotation task that requires the model to identify UI elements on a screen and describe their information. These textual screen descriptions are then provided to large language models (LLMs), enabling them to automatically generate question-answering, UI navigation, and summarization training datasets at scale. With only 5B parameters, ScreenAI achieves state-of-the-art results on UI- and infographic-based tasks and outperforms models of similar size on ChartQA, DocVQA, and InfographicVQA.
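To illustrate the pix2struct-style flexible patching idea, the sketch below (a simplified approximation; the function name, patch size, and patch budget are assumptions, not ScreenAI's exact implementation) picks a patch grid that preserves a screenshot's aspect ratio within a fixed patch budget instead of resizing every image to a square.

```python
import math

def flexible_patch_grid(img_h: int, img_w: int, patch: int = 16, max_patches: int = 1024):
    """Sketch of aspect-ratio-preserving patching: choose a rows x cols grid of
    patch x patch tiles whose total count stays within max_patches while
    keeping the image's original proportions (pix2struct-style idea)."""
    # Scale factor chosen so rows * cols roughly fills the patch budget.
    scale = math.sqrt(max_patches * (patch / img_h) * (patch / img_w))
    rows = max(1, math.floor(scale * img_h / patch))
    cols = max(1, math.floor(scale * img_w / patch))
    # The image would be resized to (rows * patch) x (cols * patch) before patching.
    return rows, cols, rows * patch, cols * patch

# A tall mobile screenshot keeps far more rows than columns instead of
# being squashed into a fixed square grid.
print(flexible_patch_grid(2340, 1080))  # -> (47, 21, 752, 336)
```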
The ScreenAI architecture is based on PaLI and is composed of a multimodal encoder block and an autoregressive decoder. The model is trained in two stages: a pre-training stage using self-supervised learning and a fine-tuning stage with manually labeled data. Data generation for ScreenAI involves compiling screenshots from various devices and applying a layout annotator, an icon classifier, and OCR text extraction. Prompt engineering is then combined with large language models to generate synthetic training data for diverse tasks.
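A minimal sketch of that synthetic-data step follows. The prompt wording, the screen-annotation format, and the `call_llm` helper are assumptions for illustration, not ScreenAI's actual pipeline; the point is simply that a textual screen description is placed into a task-specific prompt and an LLM returns question-answer pairs that become training examples.

```python
import json

# Hypothetical prompt template; the real ScreenAI prompts are not reproduced here.
QA_PROMPT = """You are given a textual annotation of a screen, listing its UI elements,
their text, and their bounding boxes.

Screen annotation:
{screen_annotation}

Write 3 question-answer pairs a user could ask about this screen.
Answer strictly as a JSON list of objects with "question" and "answer" keys."""

def generate_qa_examples(screen_annotation: str, call_llm) -> list[dict]:
    """Turn one screen annotation into synthetic QA training examples.

    `call_llm` is a placeholder for whatever text-generation API is available;
    it takes a prompt string and returns the model's text completion.
    """
    completion = call_llm(QA_PROMPT.format(screen_annotation=screen_annotation))
    # Each returned item pairs with the original screenshot to form a training example.
    return json.loads(completion)
```

The same pattern, with different prompts, would cover the UI navigation and summarization datasets mentioned above.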
Experiments and results show that the fine-tuned ScreenAI model achieves state-of-the-art performance on various UI and infographic-based tasks and competitive performance on screen summarization and OCR-VQA. The model’s performance improves with increasing size, indicating scalability. Future research is needed to further improve model performance and bridge the gap with larger models.
The authors acknowledge the contributions of various team members and collaborators in the project, as well as the insightful feedback and support received during the development of ScreenAI.