Developers promote their Apps by creating product pages featuring App images and bidding on search terms. It is therefore essential that the App images closely match the search terms. One way to ensure this alignment is to use an image-text matching model that scores how well a selected image matches a given search term. In this study, we introduce a new method for matching an App image with search terms by fine-tuning a pre-trained LXMERT model. Our results show a significant improvement in matching accuracy over a fine-tuned CLIP model and over a baseline that encodes search terms with a Transformer and images with a ResNet. We evaluate our method on two types of labels: (image, search term) pairs provided by advertisers for a given App, and human ratings of the relevance of (image, search term) pairs. On the advertiser-provided data, our method achieves a 0.96 AUC score, surpassing the Transformer+ResNet baseline and the fine-tuned CLIP model by 8% and 14%, respectively. On the human-rated data, it achieves a 0.95 AUC score, outperforming the Transformer+ResNet baseline and the fine-tuned CLIP model by 16% and 17%, respectively.
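
To make the setup concrete, the following is a minimal sketch of how a fine-tuned LXMERT matcher of this kind could be assembled with the HuggingFace transformers library. The single linear classification head, the public unc-nlp/lxmert-base-uncased checkpoint, the example search term, and the placeholder region features are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from transformers import LxmertModel, LxmertTokenizer


class ImageTermMatcher(nn.Module):
    """Binary (image, search term) matcher on top of a pre-trained LXMERT.

    A sketch only: the linear head and the unc-nlp/lxmert-base-uncased
    checkpoint are assumptions, not the paper's exact configuration.
    """

    def __init__(self):
        super().__init__()
        self.lxmert = LxmertModel.from_pretrained("unc-nlp/lxmert-base-uncased")
        self.classifier = nn.Linear(self.lxmert.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, visual_feats, visual_pos):
        out = self.lxmert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            visual_feats=visual_feats,  # (batch, regions, 2048) image region features
            visual_pos=visual_pos,      # (batch, regions, 4) normalized box coordinates
        )
        # Cross-modal pooled [CLS] representation -> single matching logit.
        return self.classifier(out.pooled_output).squeeze(-1)


tokenizer = LxmertTokenizer.from_pretrained("unc-nlp/lxmert-base-uncased")
model = ImageTermMatcher()

# Random placeholders stand in for Faster R-CNN region features extracted
# from an App image; a real pipeline would compute these from the image.
enc = tokenizer("photo editor app", return_tensors="pt")
visual_feats = torch.randn(1, 36, 2048)
visual_pos = torch.rand(1, 36, 4)

logit = model(enc.input_ids, enc.attention_mask, visual_feats, visual_pos)
prob = torch.sigmoid(logit)  # matching probability; train with BCEWithLogitsLoss
```

Training such a head with a binary cross-entropy loss on labeled (image, search term) pairs yields a score that can be evaluated with AUC, as in the experiments summarized above.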