Sunday, June 29, 2025
News PouroverAI
Enable faster training with Amazon SageMaker data parallel library

December 5, 2023
in Data Science & ML

Large language model (LLM) training has surged in popularity with the release of models such as Llama 2, Falcon, and StarCoder. Customers now train LLMs ranging from 1 billion to over 175 billion parameters. Training at this scale demands substantial compute and time: hundreds to thousands of GPUs are needed to handle the datasets and model sizes involved.

One bottleneck in distributed training is GPU communication, which is handled by the NVIDIA Collective Communications Library (NCCL). In some cases, more time is spent on inter-GPU communication than on actual GPU computation. To reduce this bottleneck and enable faster training, Amazon SageMaker has introduced an optimized AllGather collective operation as part of the SageMaker distributed data parallel library (SMDDP). AllGather is a commonly used collective operation in memory-efficient data parallelism solutions such as DeepSpeed Zero Redundancy Optimizer (ZeRO) and Fully Sharded Data Parallelism (FSDP), and it contributes to the GPU communication overhead.
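As a rough illustration of why AllGather traffic matters, the back-of-the-envelope sketch below estimates how many bytes each GPU receives from AllGather in one training step when parameters are fully sharded. It assumes an evenly sharded model, 2-byte (half precision) parameters, and one gather of every parameter in the forward pass plus one in the backward pass. The function name and the example sizes are hypothetical and are not taken from the post's benchmarks.

```python
def allgather_bytes_per_step(num_params, num_gpus, bytes_per_param=2, gathers_per_step=2):
    """Estimate the bytes one GPU receives from AllGather during one training step.

    Assumptions (illustrative only): parameters are sharded evenly, so each GPU
    already holds 1/num_gpus of them and must receive the remaining
    (num_gpus - 1)/num_gpus from its peers on every gather.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    total_bytes = num_params * bytes_per_param * gathers_per_step
    # Integer arithmetic keeps the estimate exact for large parameter counts.
    return total_bytes * (num_gpus - 1) // num_gpus


if __name__ == "__main__":
    # Hypothetical job: 7 billion parameters sharded across 64 GPUs.
    received = allgather_bytes_per_step(7_000_000_000, 64)
    print(f"{received / 1e9:.2f} GB received per GPU per step")
```

Under these assumptions the received volume barely shrinks as GPUs are added, which is one reason the speed of the AllGather implementation has a direct effect on step time.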

In this post, we give an overview of how SMDDP works, explain how to enable it in your Amazon SageMaker training scripts, and describe the performance improvements you can expect, including benchmark results and the use of SMDDP collectives in PyTorch.

Traditional data parallel training replicates the entire model on every GPU, with each GPU training on a different shard of the data. During the backward pass, gradients are averaged across GPU workers so that every model replica is updated with the same gradient values, which enables faster training on large datasets. However, some models are too large to fit in the memory of a single GPU, making traditional data parallelism impractical. Sharded data parallel solutions such as DeepSpeed ZeRO, PyTorch FSDP, and the Amazon SageMaker model parallelism library have emerged to overcome this limitation.
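The gradient averaging step described above can be sketched in plain Python. This is a single-process toy with made-up numbers, not how any framework implements it: real libraries perform the averaging with a collective operation across GPUs. The helper names `average_gradients` and `data_parallel_step` are hypothetical.

```python
def average_gradients(per_worker_grads):
    """Return the element-wise mean of the gradient vectors from all workers."""
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]


def data_parallel_step(weights, per_worker_grads, learning_rate):
    """Apply one SGD update using the averaged gradient.

    Every replica applies the same averaged gradient, so all replicas
    hold identical weights after the step.
    """
    averaged = average_gradients(per_worker_grads)
    return [w - learning_rate * g for w, g in zip(weights, averaged)]


if __name__ == "__main__":
    weights = [1.0, 2.0]
    # Each worker computed gradients on its own shard of the data.
    grads = [[0.5, 1.0], [1.5, 3.0]]
    print(data_parallel_step(weights, grads, learning_rate=0.5))
```

Note that every worker here holds the full `weights` list, which is exactly the memory cost that sharded approaches avoid.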

In sharded data parallelism, the model parameters, gradients, and optimizer states are distributed across GPUs. Forward and backward pass computations gather parameters from the shards held by other GPU workers to form one or more model layers; once the computation is done, the gathered layers are freed from memory to make room for the next set of layers. AllGather is the collective operation that performs this gathering of parameter shards. However, the standard implementation of AllGather in NCCL is not optimized for the networking infrastructure of Amazon EC2 instances, which slows down training. The SMDDP library provides an optimized implementation of AllGather for p4d/p4de instance types that leverages the Elastic Fabric Adapter (EFA) network, uses GDRCopy to coordinate local traffic, and reduces usage of GPU streaming multiprocessors.
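To make the role of AllGather concrete, here is a minimal single-process sketch of its semantics: each worker starts with one shard, and after the operation every worker holds all shards concatenated in rank order. Lists stand in for GPU buffers, and the helper names `shard` and `all_gather` are hypothetical. This is not the NCCL or SMDDP implementation, only an illustration of what the collective computes.

```python
def shard(params, num_workers):
    """Split a flat parameter list into contiguous shards, one per worker."""
    size = -(-len(params) // num_workers)  # ceiling division
    return [params[i * size:(i + 1) * size] for i in range(num_workers)]


def all_gather(shards):
    """Simulate AllGather: every worker receives every shard, in rank order."""
    full = [p for worker_shard in shards for p in worker_shard]
    # One independent copy per worker, as each GPU has its own buffer.
    return [list(full) for _ in shards]


if __name__ == "__main__":
    layer_params = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
    shards = shard(layer_params, num_workers=3)
    print("shards held at rest:", shards)

    gathered = all_gather(shards)          # gather the layer before using it
    print("worker 0 sees:", gathered[0])   # full layer is available for compute
    del gathered                           # free it before the next layer
```

In a sharded training loop this gather-compute-free cycle repeats for each group of layers, in both the forward and the backward pass.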

SMDDP collectives integrate with PyTorch through the process group abstraction, which lets users write generic distributed code and then choose the backend (nccl or smddp) that implements the collective operations for the compute device in use. In standalone AllGather benchmarks, SMDDP outperforms NCCL, achieving higher peak performance and reaching it at smaller buffer sizes. In large-scale training jobs, SMDDP can significantly improve training speed, as shown by benchmarks on Llama 2 and GPT-NeoX models.
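The backend choice might look like the sketch below. It only decides which backend name to pass to PyTorch and falls back when SMDDP is not installed; the helper names `module_available` and `choose_backend` are hypothetical. The module path and the "smddp" backend name reflect our understanding of the SMDDP documentation and should be checked against it. The PyTorch calls are shown as comments so that the sketch runs without PyTorch or a GPU.

```python
import importlib.util

# Assumed import path for the module that registers the "smddp" backend.
SMDDP_MODULE = "smdistributed.dataparallel.torch.torch_smddp"


def module_available(name):
    """Return True if the module can be located, without raising if it is absent."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return False


def choose_backend(has_gpu, smddp_available=None):
    """Pick a process group backend name: smddp, nccl, or gloo (CPU fallback)."""
    if smddp_available is None:
        smddp_available = module_available(SMDDP_MODULE)
    if has_gpu and smddp_available:
        return "smddp"
    return "nccl" if has_gpu else "gloo"


if __name__ == "__main__":
    print(choose_backend(has_gpu=True))
    # In a training script (requires PyTorch), the result would be used like this:
    #   import torch
    #   import torch.distributed as dist
    #   backend = choose_backend(torch.cuda.is_available())
    #   if backend == "smddp":
    #       import smdistributed.dataparallel.torch.torch_smddp  # registers the backend
    #   dist.init_process_group(backend=backend)
```

Because the rest of the training code talks only to the process group, switching between backends should not require changes elsewhere in the script.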

In conclusion, SMDDP can accelerate sharded data parallel training jobs on Amazon SageMaker with a change of just two lines of code. By reducing the GPU communication bottleneck, SMDDP enables faster training at scale, which in turn lowers training cost.



Copyright © 2023 PouroverAI News.