Large language model (LLM) training has gained popularity recently, with the introduction of models like Llama2, Falcon, and StarCoder. Customers are now training LLMs of unprecedented size, ranging from 1 billion to over 175 billion parameters. However, training these large models requires significant compute resources and time, as hundreds to thousands of GPUs are needed to handle the massive datasets and model sizes.
One bottleneck in distributed training is GPU communication, which is handled by the NVIDIA Collective Communication Library (NCCL). In some cases, more time is spent on inter-GPU communication than on actual GPU computation. To address this issue and enable faster training, Amazon SageMaker has introduced an optimized AllGather collective operation as part of the SageMaker distributed data parallel library (SMDDP). AllGather is a commonly used collective operation in memory-efficient data parallelism solutions like DeepSpeed Zero Redundancy Optimizer (ZeRO) and Fully Sharded Data Parallelism (FSDP), and it contributes to the GPU communication overhead.
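To make the operation concrete, the minimal sketch below shows what AllGather does using PyTorch's torch.distributed API; it assumes a torchrun launch and one GPU per process, and the shard contents are illustrative only.

```python
# A minimal, illustrative sketch of the AllGather collective with PyTorch's
# torch.distributed API, assuming the script is launched with torchrun.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
rank = dist.get_rank()
world_size = dist.get_world_size()

# Each GPU worker holds a different local shard.
local_shard = torch.full((4,), float(rank), device="cuda")

# After all_gather, every worker holds the shards from all workers.
gathered = [torch.empty_like(local_shard) for _ in range(world_size)]
dist.all_gather(gathered, local_shard)

print(f"rank {rank}: {[int(t[0].item()) for t in gathered]}")
dist.destroy_process_group()
```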
In this post, we provide an overview of how SMDDP works, show how to enable it in your Amazon SageMaker training scripts with a minimal code change, and walk through benchmarks and the performance improvements you can expect when using SMDDP collectives in PyTorch.
Traditional data parallel training involves replicating the entire model across multiple GPUs, with each GPU training on different shards of data. During the backward pass, gradients are averaged among GPU workers to update each model replica with the same gradient values, enabling faster training on large datasets. However, some models are too large to fit in GPU memory, making traditional data parallelism impractical. Sharded data parallel solutions like DeepSpeed ZeRO, PyTorch FSDP, and the Amazon SageMaker model parallelism library have emerged to overcome this limitation.
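For reference, the sketch below shows traditional data parallelism with PyTorch DistributedDataParallel (DDP); the model, batch, and optimizer settings are placeholders chosen only for illustration.

```python
# A minimal sketch of traditional data parallelism with PyTorch DDP,
# assuming a torchrun launch; every GPU holds a full model replica.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()      # full replica on every GPU
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

inputs = torch.randn(32, 1024, device="cuda")   # each rank trains on its own data shard
loss = ddp_model(inputs).sum()
loss.backward()    # gradients are averaged (AllReduce) across workers here
optimizer.step()   # every replica applies the same averaged gradients
```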
In sharded data parallelism, the model parameters, gradients, and optimizer states are distributed across GPUs. During the forward and backward passes, each worker gathers the parameter shards held by other GPU workers to materialize one or more full model layers, performs the computation, and then frees those parameters from memory before moving on to the next set of layers. AllGather is the collective used to assemble these sharded model parameters. However, the standard implementation of AllGather in NCCL is not optimized for the networking infrastructure of Amazon EC2 instances, slowing down training. The SMDDP library provides an optimized implementation of AllGather for p4d and p4de instance types that leverages the Elastic Fabric Adapter (EFA) network, uses GDRCopy to coordinate local traffic, and reduces usage of GPU streaming multiprocessors.
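The sketch below shows how a sharded data parallel script typically looks with PyTorch FSDP; the model, wrapping policy, and sizes are illustrative, and the backend choice that routes the AllGather calls through SMDDP is covered next.

```python
# A minimal sketch of sharded data parallelism with PyTorch FSDP; the model,
# wrap policy, and sizes are illustrative. Parameters are sharded across ranks
# and AllGather-ed layer by layer for the forward and backward passes, then freed.
import os
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group(backend="nccl")  # "smddp" can be used instead on supported SageMaker instances
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Transformer(d_model=1024, num_encoder_layers=12).cuda()
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=1_000_000)

# Each FSDP unit keeps only its shard of parameters, gradients, and optimizer
# state; the full parameters for a layer exist only transiently during compute.
sharded_model = FSDP(model, auto_wrap_policy=wrap_policy, device_id=local_rank)
```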
SMDDP collectives integrate with PyTorch through the process group abstraction, allowing users to write standard distributed code and choose the backend (nccl or smddp) based on the compute device used. SMDDP outperforms NCCL in standalone AllGather performance benchmarks, achieving higher peak throughput and faster execution at smaller buffer sizes. In large-scale training jobs, SMDDP can significantly improve training speeds, as shown by benchmarks on Llama2 and GPT-NeoX models.
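Concretely, the swap follows the usage pattern documented for the SMDDP library: importing the SMDDP PyTorch module registers the smddp backend, which is then selected when the process group is initialized.

```python
# The two-line change to route collectives through SMDDP instead of NCCL,
# following the documented SMDDP usage on SageMaker p4d/p4de instances.
import torch.distributed as dist
import smdistributed.dataparallel.torch.torch_smddp  # registers the "smddp" backend

dist.init_process_group(backend="smddp")  # instead of backend="nccl"
# The rest of the FSDP or DeepSpeed ZeRO training script stays unchanged.
```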
In conclusion, SMDDP can accelerate sharded data parallel training jobs on Amazon SageMaker with a code change of just two lines. By reducing the GPU communication bottleneck, SMDDP enables faster training at scale, resulting in cost savings and improved performance.