This paper has been accepted to the UniReps Workshop at NeurIPS 2023.
Contrastive language-image pretraining (CLIP) has become the standard approach for training vision-language models. Despite the utility of CLIP visual features as global image representations, they have limitations on tasks involving object localization, pixel-level image understanding, or 3D perception. Multi-task training is a popular solution to address this drawback, but collecting a large-scale annotated multi-task dataset incurs significant cost. Furthermore, training on separate task-specific datasets is challenging from an optimization standpoint, as it requires aligning gradients and knowledge coming from different input distributions and tasks. To overcome these shortcomings, we study pseudo-labeling with task-specific experts to improve CLIP features for more challenging downstream tasks. In our approach, we leverage multiple existing open-source pretrained models as experts and use them to pseudo-label an uncurated web-scale image-caption dataset. We then train CLIP with the contrastive loss together with task-specific losses on the pseudo-labels, computed through lightweight heads attached to the vision backbone.
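To make the training setup concrete, below is a minimal sketch of the idea in PyTorch. The class and head designs (`CLIPWithTaskHeads`, a depth head, a segmentation head), the assumption that the vision backbone exposes a dense feature map, and the specific task losses are all illustrative assumptions, not the paper's exact architecture or objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CLIPWithTaskHeads(nn.Module):
    """CLIP vision backbone augmented with lightweight task-specific heads.

    Assumes `clip_model(images, texts)` returns (image_embedding,
    text_embedding, dense_feature_map); head choices are hypothetical.
    """
    def __init__(self, clip_model, feat_dim=512, num_seg_classes=133):
        super().__init__()
        self.clip = clip_model
        # Lightweight 1x1-conv heads on the backbone's dense features.
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)
        self.seg_head = nn.Conv2d(feat_dim, num_seg_classes, kernel_size=1)

    def forward(self, images, texts):
        img_emb, txt_emb, feat_map = self.clip(images, texts)
        # feat_map: (B, C, H, W) dense features from the vision backbone.
        return img_emb, txt_emb, self.depth_head(feat_map), self.seg_head(feat_map)

def training_step(model, images, texts, depth_pl, seg_pl, w_depth=1.0, w_seg=1.0):
    """One step: contrastive loss plus task losses on expert pseudo-labels.

    depth_pl / seg_pl are pseudo-labels produced offline by pretrained
    expert models (e.g., a depth estimator and a segmenter).
    """
    img_emb, txt_emb, depth_pred, seg_pred = model(images, texts)

    # Standard symmetric CLIP contrastive loss over the batch
    # (learnable temperature omitted for brevity).
    logits = img_emb @ txt_emb.t()
    targets = torch.arange(len(images), device=logits.device)
    loss_clip = (F.cross_entropy(logits, targets)
                 + F.cross_entropy(logits.t(), targets)) / 2

    # Task-specific losses against the expert pseudo-labels.
    loss_depth = F.l1_loss(depth_pred, depth_pl)
    loss_seg = F.cross_entropy(seg_pred, seg_pl)

    return loss_clip + w_depth * loss_depth + w_seg * loss_seg
```

Because the pseudo-labels are computed once over a single image-caption corpus, every loss term sees the same input distribution in each batch, sidestepping the gradient-alignment issues of mixing separate task-specific datasets.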