Generating fashion product descriptions by fine-tuning a vision-language model with SageMaker and Amazon Bedrock

In the world of online retail, creating high-quality product descriptions for millions of products is a crucial, but time-consuming task. Using machine learning (ML) and natural language processing (NLP) to automate product description generation has the potential to save manual effort and transform the way ecommerce platforms operate.

One of the main advantages of high-quality product descriptions is the improvement in searchability. Customers can more easily locate products that have correct descriptions, because it allows the search engine to identify products that match not just the general category but also the specific attributes mentioned in the product description.

For example, a product that has a description that includes words such as “long sleeve” and “cotton neck” will be returned if a consumer is looking for a “long sleeve cotton shirt.” Furthermore, having factoid product descriptions can increase customer satisfaction by enabling a more personalized buying experience and improving the algorithms for recommending more relevant products to users, which raise the probability that users will make a purchase.

With the advancement of Generative AI, we can use vision-language models (VLMs) to predict product attributes directly from images. Pre-trained image captioning or visual question answering (VQA) models perform well on describing every-day images but can’t to capture the domain-specific nuances of ecommerce products needed to achieve satisfactory performance in all product categories.

To solve this problem, this post shows you how to predict domain-specific product attributes from product images by fine-tuning a VLM on a fashion dataset using Amazon SageMaker, and then using Amazon Bedrock to generate product descriptions using the predicted attributes as input. So you can follow along, we’re sharing the code in a GitHub repository.

Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities you need to build generative AI applications with security, privacy, and responsible AI.

You can use a managed service, such as Amazon Rekognition, to predict product attributes as explained in Automating product description generation with Amazon Bedrock. However, if you’re trying to extract specifics and detailed characteristics of your product or your domain (industry), fine-tuning a VLM on Amazon SageMaker is necessary.

Vision-language models Since 2021, there has been a rise in interest in vision-language models (VLMs), which led to the release of solutions such as Contrastive Language-Image Pre-training (CLIP) and Bootstrapping Language-Image Pre-training (BLIP). When it comes to tasks such as image captioning, text-guided image generation, and visual question-answering, VLMs have demonstrated state-of-the art performance.

In this post, we use BLIP-2, which was introduced in BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models, as our VLM. BLIP-2 consists of three models: a CLIP-like image encoder, a Querying Transformer (Q-Former) and a large language model (LLM). We use a version of BLIP-2, that contains Flan-T5-XL as the LLM. The following diagram illustrates the overview of BLIP-2: Figure 1: BLIP-2 overview

The pre-trained version of the BLIP-2 model has been demonstrated in Build an image-to-text generative AI application using multimodality models on Amazon SageMaker and Build a generative AI-based content moderation solution on Amazon SageMaker JumpStart. In this post, we demonstrate how to fine-tune BLIP-2 for a domain-specific use case.

Solution overview The following diagram illustrates the solution architecture. Figure 2: High-level solution architecture

The high-level overview of the solution is: An ML scientist uses Sagemaker notebooks to process and split the data into training and validation data. The datasets are uploaded to Amazon Simple Storage Service (Amazon S3) using the S3 client (a wrapper around an HTTP call). Then the Sagemaker client is used to launch a Sagemaker Training job, again a wrapper for an HTTP call. The training job manages the copying of the datasets from S3 to the training container, the training of the model, and the saving of its artifacts to S3. Then, through another call of the Sagemaker client, an endpoint is generated, copying the model artifacts into the endpoint hosting container. The inference workflow is then invoked through an AWS Lambda request, which first makes an HTTP request to the Sagemaker endpoint, and then uses that to make another request to Amazon Bedrock. In the following sections, we demonstrate how to: Set up the development environment Load and prepare the dataset Fine-tune the BLIP-2 model to learn product attributes using SageMaker Deploy the fine-tuned BLIP-2 model and predict product attributes using SageMaker Generate product descriptions from predicted product attributes using Amazon Bedrock

Set up the development environment An AWS account is needed with an AWS Identity and Access Management (IAM) role that has permissions to manage resources created as part of the solution. For details, see Creating an AWS account. We use Amazon SageMaker Studio with the ml.t3.medium instance and the Data Science 3.0 image. However, you can also use an Amazon SageMaker notebook instance or any integrated development environment (IDE) of your choice. Note: Be sure to set up your AWS Command Line Interface (AWS CLI) credentials correctly. For more information, see Configure the AWS CLI. An ml.g5.2xlarge instance is used for SageMaker Training jobs, and an ml.g5.2xlarge instance is used for SageMaker endpoints. Ensure sufficient capacity for this instance in your AWS account by requesting a quota increase if required. Also check the pricing of the on-demand instances. You need to clone this GitHub repository for replicating the solution demonstrated in this post. First, launch the notebook main.ipynb in SageMaker Studio by selecting the Image as Data Science and Kernel as Python 3. Install all the required libraries mentioned in the requirements.txt.

Load and prepare the dataset For this post, we use the Kaggle Fashion Images Dataset, which contain 44,000 products with multiple category labels, descriptions, and high resolution images. In this post we want to demonstrate how to fine-tune a model to learn attributes such as fabric, fit, collar, pattern, and sleeve length of a shirt using the image and a question as inputs. Each product is identified by an ID such as 38642, and there is a map to all the products in styles.csv. From here, we can fetch the image for this product from images/38642.jpg and the complete metadata from styles/38642.json. To fine-tune our model, we need to convert our structured examples into a collection of question and answer pairs. Our final dataset has the following format after processing for each attribute: Id | Question | Answer38642 | What is the fabric of the clothing in this picture? | Fabric: Cotton After we process the dataset, we split it into training and validation sets, create CSV files, and upload the dataset to Amazon S3.

Fine-tune the BLIP-2 model to learn product attributes using SageMaker To launch a SageMaker Training job, we need the HuggingFace Estimator. SageMaker starts and manages all of the necessary Amazon Elastic Compute Cloud (Amazon EC2) instances for us, supplies the appropriate Hugging Face container, uploads the specified scripts, and downloads data from our S3 bucket to /opt/ml/input/data. We fine-tune BLIP-2 using the Low-Rank Adaptation (LoRA) technique, which adds trainable rank decomposition matrices to every Transformer structure layer while keeping the pre-trained model weights in a static state. This technique can increase training throughput and reduce the amount of GPU RAM required by 3 times and the number of trainable parameters by 10,000 times. Despite using fewer trainable parameters, LoRA has been demonstrated to perform as well as or better than the full fine-tuning technique. We prepared entrypoint_vqa_finetuning.py which implements fine-tuning of BLIP-2 with the LoRA technique using Hugging Face Transformers, Accelerate, and Parameter-Efficient Fine-Tuning (PEFT). The script also merges the LoRA weights into the model weights after training. As a result, you can deploy the model as a normal model without any additional code.

“`python
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration

model = Blip2ForConditionalGeneration.from_pretrained(“Salesforce/blip2-flan-t5-xl”, device_map=”auto”, cache_dir=”/tmp”, load_in_8bit=True)

config = LoraConfig(
r=8, # Lora attention dimension.
lora_alpha=32, # the alpha parameter for Lora scaling.
lora_dropout=0.05, # the dropout probability for Lora layers.
bias=”none”, # the bias type for Lora.
target_modules=[“q”, “v”],
)

model = get_peft_model(model, config)
“`

We reference entrypoint_vqa_finetuning.py as the entry_point in the Hugging Face Estimator.

“`python
from sagemaker.huggingface import HuggingFace

hyperparameters = {
‘epochs’: 10,
‘file-name’: “vqa_train.csv”,
}

estimator = HuggingFace(
entry_point=”entrypoint_vqa_finetuning.py”,
source_dir=”../src”,
role=role,
instance_count=1,
instance_type=”ml.g5.2xlarge”,
transformers_version=’4.26′,
pytorch_version=’1.13′,
py_version=’py39′,
hyperparameters = hyperparameters,
base_job_name=”VQA”,
sagemaker_session=sagemaker_session,
output_path=f”{output_path}/models”,
code_location=f”{output_path}/code”,
volume_size=60,
metric_definitions=[
{‘Name’: ‘batch_loss’, ‘Regex’: ‘Loss: ([0-9\.]+)’},
{‘Name’: ‘epoch_loss’, ‘Regex’: ‘Epoch Loss: ([0-9\.]+)’}
],
…
“`

By following these steps, you can set up your development environment, load and prepare the dataset, fine-tune the BLIP-2 model, deploy the fine-tuned model, and generate product descriptions using Amazon Bedrock.

Source link