Today, we are thrilled to announce that Llama 2 inference and fine-tuning support is now available on AWS Trainium and AWS Inferentia instances in Amazon SageMaker JumpStart. By using AWS Trainium- and Inferentia-based instances through SageMaker, users can reduce fine-tuning costs by up to 50% and cut deployment costs by a factor of 4.7, all while lowering per-token latency.
Llama 2 is an auto-regressive generative text language model that uses an optimized transformer architecture. It is suited to a range of NLP tasks, including text classification, sentiment analysis, language translation, language modeling, text generation, and dialogue systems. However, fine-tuning and deploying large language models (LLMs) like Llama 2 is expensive, and achieving the real-time performance needed for a good customer experience is challenging.
AWS Trainium and AWS Inferentia, powered by the AWS Neuron software development kit (SDK), offer a high-performance, cost-effective option for training and inference of Llama 2 models. In this post, we demonstrate how to deploy and fine-tune Llama 2 on Trainium and AWS Inferentia instances in SageMaker JumpStart.
Solution Overview:
In this post, we cover the following scenarios:
1. Deploying Llama 2 on AWS Inferentia instances in both the Amazon SageMaker Studio UI and the SageMaker Python SDK.
2. Fine-tuning Llama 2 on Trainium instances in both the SageMaker Studio UI and the SageMaker Python SDK.
3. Comparing the performance of the fine-tuned Llama 2 model with the pre-trained model to showcase the effectiveness of fine-tuning.
To get hands-on experience, please refer to the example notebook on GitHub.
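As a preview of the Python SDK flow for fine-tuning on Trainium, the sketch below uses the SageMaker `JumpStartEstimator`. The model ID, instance type, and EULA-acknowledgment pattern are illustrative assumptions drawn from JumpStart conventions; consult the example notebook for the exact values.

```python
def build_training_inputs(train_s3_uri):
    """Map the S3 prefix holding the training data to the channel name
    the JumpStart training script reads from (assumed to be "train")."""
    return {"train": train_s3_uri}


def launch_finetuning(train_s3_uri):
    # Imported inside the function so the helper above can be used
    # without the SageMaker SDK installed or AWS credentials configured.
    from sagemaker.jumpstart.estimator import JumpStartEstimator

    # Illustrative model ID for a Neuron (Trainium/Inferentia) variant
    # of Llama 2; the exact ID may differ in your JumpStart catalog.
    estimator = JumpStartEstimator(
        model_id="meta-textgenerationneuron-llama-2-7b",
        environment={"accept_eula": "true"},  # acknowledge the Llama 2 EULA
        instance_type="ml.trn1.32xlarge",     # Trainium training instance
    )
    # Launch the fine-tuning job against data staged in S3.
    estimator.fit(build_training_inputs(train_s3_uri))
    return estimator
```

After the training job completes, calling `estimator.deploy()` hosts the fine-tuned model on an endpoint.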
Deploy Llama 2 on AWS Inferentia instances using the SageMaker Studio UI and the Python SDK:
In this section, we demonstrate how to deploy Llama 2 on AWS Inferentia instances in two ways: through the SageMaker Studio UI for a one-click deployment, and programmatically through the SageMaker Python SDK.
To access the Llama 2 foundation models, you can use SageMaker JumpStart in the SageMaker Studio UI and the SageMaker Python SDK. In SageMaker Studio, a web-based visual interface, you can perform all machine learning (ML) development steps, from data preparation to model building, training, and deployment.
After accessing SageMaker Studio, you can find SageMaker JumpStart, which provides pre-trained models, notebooks, and prebuilt solutions, under the section “Prebuilt and automated solutions.” If you don’t see the Llama 2 models, you may need to shut down and restart SageMaker Studio to update to the latest version.
To deploy the Llama-2-13b model with SageMaker JumpStart, you can select the model card to view detailed information about the model, including the license, training data, and instructions on how to use it. You will also find buttons to deploy or open a notebook for using the model with a no-code example. Before deploying, you will need to acknowledge the End User License Agreement and Acceptable Use Policy.
For a one-click deployment of the Llama 2 Neuron model, choose Deploy and acknowledge the terms. To deploy through the Python SDK instead, choose Open notebook and follow the instructions provided for deploying the model and cleaning up resources.
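The SDK flow in the example notebook boils down to a few calls, sketched below. Running it requires AWS credentials with SageMaker permissions; the JumpStart model ID is an illustrative assumption, and the payload helper shows the request format (an `inputs` string plus a `parameters` dict) that Llama 2 text generation endpoints typically expect.

```python
def build_payload(prompt, max_new_tokens=64, top_p=0.9, temperature=0.6):
    """Assemble a text-generation request: the prompt under "inputs"
    and sampling controls under "parameters"."""
    return {
        "inputs": prompt,
        "parameters": {
            "max_new_tokens": max_new_tokens,
            "top_p": top_p,
            "temperature": temperature,
        },
    }


def deploy_and_invoke(prompt):
    # Imported inside the function so the payload helper can be used
    # without the SageMaker SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    # Illustrative ID for a pre-compiled Neuron variant of Llama-2-13b.
    model = JumpStartModel(model_id="meta-textgenerationneuron-llama-2-13b")
    # accept_eula=True acknowledges the Llama 2 license agreement.
    predictor = model.deploy(accept_eula=True)
    try:
        return predictor.predict(build_payload(prompt))
    finally:
        # Delete the endpoint and model to stop incurring charges.
        predictor.delete_model()
        predictor.delete_endpoint()
```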
To deploy or fine-tune a model on Trainium or AWS Inferentia instances, the model must first be compiled with PyTorch Neuron (torch-neuronx) into a Neuron-specific graph optimized for the NeuronCores on those chips. SageMaker JumpStart ships pre-compiled Neuron graphs for a variety of configurations, enabling faster fine-tuning and deployment.
If you want more control over deployment configurations, such as context length, tensor parallel degree, and maximum rolling batch size, you can modify them via environment variables. The underlying Deep Learning Container (DLC) for deployment is the Large Model Inference (LMI) NeuronX DLC.
For the supported environment variables and their configurations, refer to the LMI NeuronX DLC documentation.
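As an illustration, such overrides can be passed when constructing the model. The variable names below follow the LMI NeuronX convention of upper-casing the container's `option.*` serving properties; treat the exact names and default values as assumptions to verify against the DLC documentation.

```python
def build_neuron_env(n_positions=2048, tensor_parallel_degree=24,
                     max_rolling_batch_size=4):
    """Build environment-variable overrides for the LMI NeuronX DLC.
    Values are passed as strings, as container env vars require."""
    return {
        "OPTION_N_POSITIONS": str(n_positions),  # context length
        "OPTION_TENSOR_PARALLEL_DEGREE": str(tensor_parallel_degree),
        "OPTION_MAX_ROLLING_BATCH_SIZE": str(max_rolling_batch_size),
    }


def deploy_with_overrides():
    # Imported inside the function so the env helper runs without
    # the SageMaker SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    # Illustrative model ID; the env dict overrides the defaults of
    # the pre-compiled JumpStart configuration.
    model = JumpStartModel(
        model_id="meta-textgenerationneuron-llama-2-13b",
        env=build_neuron_env(n_positions=4096),
    )
    return model.deploy(accept_eula=True)
```

Note that overriding these values may select a different pre-compiled Neuron graph, so not every combination is supported.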
By using AWS Trainium and Inferentia instances, users benefit from cost-effective, high-performance training and inference for Llama 2 models. Whether through the SageMaker Studio UI or the Python SDK, Llama 2 can be deployed and fine-tuned in just a few steps.