Contrastive learning has become a standard approach to learning high-quality visual representations by aligning image and text embeddings. However, computing pairwise similarities between image and text pairs in the contrastive loss is computationally demanding. This work introduces a method for weakly supervised pre-training of vision models on large-scale image-text data from the web. It reframes pre-training on image-text data as a classification task, removing the need for pairwise similarity computation in the contrastive loss. The results show a significant improvement, with a remarkable 2.7…
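To make the computational difference concrete, here is a minimal pure-Python sketch, not the paper's implementation: a contrastive (InfoNCE-style) loss needs an N×N similarity matrix over a batch of N image-text pairs, while the classification reformulation only scores each image against a fixed label vocabulary (for example, labels derived from its caption). All function and variable names are illustrative assumptions.

```python
import math

def dot(a, b):
    # Inner product of two embedding vectors.
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def contrastive_loss(img_embs, txt_embs):
    # InfoNCE-style loss: each image is scored against EVERY text in the
    # batch, so the cost grows quadratically with the batch size N.
    n = len(img_embs)
    loss = 0.0
    for i in range(n):
        sims = [dot(img_embs[i], t) for t in txt_embs]  # row of N similarities
        probs = softmax(sims)
        loss += -math.log(probs[i])  # the matching pair sits on the diagonal
    return loss / n

def classification_loss(img_logits, labels):
    # Classification reformulation: each image is scored only against a
    # fixed label vocabulary; no pairwise image-text similarities needed.
    loss = 0.0
    for logits, label in zip(img_logits, labels):
        probs = softmax(logits)
        loss += -math.log(probs[label])
    return loss / len(img_logits)
```

For a batch of N pairs with embedding dimension d and a vocabulary of C labels, the contrastive loss costs O(N²·d) per batch, while the classification loss costs O(N·C) given precomputed logits, and it avoids gathering embeddings across devices to build the similarity matrix.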