Contrastive language-image pretraining (CLIP) is a standard method for training vision-language models. Although CLIP is scalable, promptable, and robust to distribution shifts on image classification tasks, it lacks object localization capabilities. This study investigates whether augmenting CLIP training with task-specific vision models from model zoos can enhance its visual representations. To this end, we use open-source task-specific vision models to generate pseudo-labels for an uncurated, noisy image-text dataset. We then train CLIP models on these pseudo-labels in addition to the contrastive training on image-text pairs. This straightforward approach yields improvements of up to 16.3% across vision tasks such as segmentation, detection, depth estimation, and surface normal estimation. Importantly, these gains come without compromising CLIP's existing strengths, including its proficiency in promptable zero-shot classification.
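To make the training recipe concrete, the sketch below shows one way the described objective could be wired up: the standard CLIP contrastive loss combined with auxiliary heads on the image encoder that are supervised by pseudo-labels produced offline by frozen task-specific expert models. All class, head, and loss names here are illustrative assumptions, not the paper's released code; dense per-pixel targets are reduced to toy image-level targets for brevity.

```python
# Minimal sketch (assumed, not the authors' implementation): CLIP contrastive
# loss plus auxiliary pseudo-label losses from model-zoo experts.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CLIPWithAuxHeads(nn.Module):
    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 embed_dim: int = 512, num_seg_classes: int = 21):
        super().__init__()
        self.image_encoder = image_encoder            # maps images -> (B, embed_dim)
        self.text_encoder = text_encoder              # maps tokens -> (B, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), as in CLIP
        # Hypothetical auxiliary heads supervised by expert pseudo-labels.
        self.seg_head = nn.Linear(embed_dim, num_seg_classes)  # image-level seg proxy
        self.depth_head = nn.Linear(embed_dim, 1)               # mean-depth proxy

    def forward(self, images, tokens, seg_pseudo, depth_pseudo):
        img = F.normalize(self.image_encoder(images), dim=-1)
        txt = F.normalize(self.text_encoder(tokens), dim=-1)

        # Standard CLIP contrastive (InfoNCE) loss over the in-batch pairs.
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(img.size(0), device=img.device)
        loss_clip = 0.5 * (F.cross_entropy(logits, targets) +
                           F.cross_entropy(logits.t(), targets))

        # Auxiliary losses against pseudo-labels generated offline by frozen
        # task-specific models (e.g. a segmenter and a depth estimator).
        loss_seg = F.cross_entropy(self.seg_head(img), seg_pseudo)       # (B,) long
        loss_depth = F.l1_loss(self.depth_head(img).squeeze(-1), depth_pseudo)  # (B,) float

        return loss_clip + loss_seg + loss_depth
```

A natural design choice under these assumptions is to run the expert models once over the dataset and cache their pseudo-labels, so the CLIP training loop itself only adds a few extra loss terms and heads.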