Today, we are thrilled to announce that Meta Llama 3 inference is now available on AWS Trainium and AWS Inferentia based instances in Amazon SageMaker JumpStart. The Meta Llama 3 models consist of pre-trained and fine-tuned generative text models. With Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances powered by AWS Trainium and AWS Inferentia2, deploying Llama 3 models on AWS has become more cost-effective. These instances offer up to 50% lower deployment costs compared to similar Amazon EC2 instances. Not only do they reduce the time and cost of training and deploying large language models (LLMs), but they also provide developers with easier access to high-performance accelerators for real-time applications like chatbots and AI assistants.
In this post, we will demonstrate how simple it is to deploy Llama 3 on AWS Trainium and AWS Inferentia based instances in SageMaker JumpStart.
**Meta Llama 3 model on SageMaker Studio**
SageMaker JumpStart gives access to both publicly available and proprietary foundation models (FMs). These FMs are onboarded and maintained from various third-party and proprietary providers, each released under different licenses as indicated by the model source. It is important to review the license of any FM used, ensuring compliance with applicable terms before downloading or using the content. Meta Llama 3 FMs can be accessed through SageMaker JumpStart on the Amazon SageMaker Studio console and the SageMaker Python SDK.
To discover the models in SageMaker Studio, navigate to the SageMaker Studio console and choose JumpStart in the navigation pane. If using SageMaker Studio Classic, refer to Open and use JumpStart in Studio Classic to access the SageMaker JumpStart models. By searching for “Meta” in the search box on the SageMaker JumpStart landing page, you can find the Meta model card listing all models from Meta. Additionally, relevant model variants can be found by searching for “neuron.” If Meta Llama 3 models are not visible, update the SageMaker Studio version by shutting down and restarting SageMaker Studio.
**No-code deployment of the Llama 3 Neuron model on SageMaker JumpStart**
By selecting the model card, users can view details such as license, training data, and usage instructions. The model card also provides two buttons, “Deploy” and “Preview notebooks,” to facilitate model deployment. Choosing “Deploy” prompts the user to acknowledge the end-user license agreement (EULA) and acceptable use policy before configuring endpoint settings and deploying the model. Alternatively, deployment can be done through the example notebook by choosing “Preview notebooks” and then “Open notebook”; the notebook walks through the deployment process and resource cleanup.
**Meta Llama 3 deployment on AWS Trainium and AWS Inferentia using the SageMaker JumpStart SDK**
In SageMaker JumpStart, the Meta Llama 3 model has been pre-compiled for various configurations to avoid runtime compilation during deployment and fine-tuning. Two deployment options are available using the SageMaker JumpStart SDK: a simple deployment with two lines of code for ease or a more customizable deployment for finer control over configurations.
In the simpler mode of deployment, the accept_eula argument must be set to True in the model.deploy() call to enable inference. Setting it signifies that the end user has read and accepted the EULA of the model. Additional Neuron model IDs are available for deployment, each pre-compiled with configurations tailored for different use cases.
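A minimal sketch of this simpler deployment path follows, using the SageMaker Python SDK's JumpStartModel class. The model ID shown is an assumed example; look up the exact Neuron variant IDs on the model cards in SageMaker JumpStart.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Assumed example ID for the Llama 3 8B Neuron variant; confirm the exact
# model ID on the SageMaker JumpStart model card before deploying.
model = JumpStartModel(model_id="meta-textgenerationneuron-llama-3-8b")

# accept_eula=True signifies that you have read and accepted the model's EULA.
predictor = model.deploy(accept_eula=True)
```

Deployment uses the pre-compiled default configuration for the model, so no runtime compilation is needed.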
For customization of deployment configurations such as sequence length, tensor parallel degree, and maximum rolling batch size, the second code snippet showcases how to set these parameters while deploying the model.
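A sketch of the customizable path is shown below. The environment variable names follow the serving-container convention used by SageMaker JumpStart Neuron models, and every value here is an illustrative assumption; valid combinations depend on the pre-compiled configurations available for the chosen instance type.

```python
from sagemaker.jumpstart.model import JumpStartModel

# All settings below are illustrative assumptions, not the only valid values.
model = JumpStartModel(
    model_id="meta-textgenerationneuron-llama-3-8b",  # assumed example ID
    env={
        "OPTION_DTYPE": "fp16",                 # weight/activation data type
        "OPTION_N_POSITIONS": "4096",           # maximum sequence length
        "OPTION_TENSOR_PARALLEL_DEGREE": "12",  # NeuronCores per model copy
        "OPTION_MAX_ROLLING_BATCH_SIZE": "4",   # maximum rolling batch size
    },
    instance_type="ml.inf2.24xlarge",  # assumed instance type
)
predictor = model.deploy(accept_eula=True)
```

Because the models are pre-compiled for specific configurations, choosing a combination that has no matching pre-compiled artifact may trigger compilation or fail, so stay within the configurations documented for each model ID.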
After deploying the Meta Llama 3 neuron model, inference can be performed by invoking the endpoint with the desired input payload. The output will provide the predicted text generated by the model based on the input parameters.
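As a sketch of invoking the endpoint, the payload shape below follows the common text-generation format (an `inputs` string plus a `parameters` dictionary); the specific parameter values are assumptions for illustration, and `predictor` is the object returned by model.deploy().

```python
# Assumes `predictor` was returned by model.deploy() in the previous step.
payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,   # cap on generated tokens
        "top_p": 0.9,           # nucleus sampling threshold
        "temperature": 0.6,     # sampling temperature
    },
}

response = predictor.predict(payload)
print(response)
```

The response contains the text generated by the model, shaped according to the input parameters.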
To avoid ongoing charges after you finish running inference, clean up resources by deleting the deployed model and its associated endpoint.
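The cleanup step can be sketched as follows, again assuming `predictor` is the object returned by model.deploy():

```python
# Assumes `predictor` was returned by model.deploy().
# Delete the model artifacts first, then the endpoint and its configuration.
predictor.delete_model()
predictor.delete_endpoint()
```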
In conclusion, deploying Meta Llama 3 models on AWS Trainium and AWS Inferentia through SageMaker JumpStart offers a cost-effective way to run large-scale generative AI models on AWS. With variants such as Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Meta-Llama-3-70B, and Meta-Llama-3-70B-Instruct, users can leverage AWS Neuron for inference on AWS Trainium and AWS Inferentia, ensuring efficient and scalable deployment. This guide demonstrates the simplicity and flexibility of deploying these models through both the SageMaker JumpStart console and the SageMaker Python SDK, and we encourage developers to explore building innovative generative AI applications with them.