Large language models such as GPT-3 require substantial energy due to their computational needs during training and inference. The energy usage varies significantly based on factors like the model’s size, task complexity, hardware specifications, and operational duration.
Training these models demands extensive computational resources, often involving high-performance GPUs or TPUs, leading to substantial energy consumption over prolonged periods. Estimates indicate that training a large language model like GPT-3 can use electricity equivalent to what multiple households consume over several days or weeks.
Optimizing energy consumption is crucial, but it must be done without reducing training throughput. Researchers therefore aim to remove only the energy that does not contribute to throughput in large language model training. Deciding how much computation lands in each pipeline stage is a key problem for distributed execution planning: because DNNs are built from coarse-grained tensor operations with varying amounts of computation, perfectly balancing every stage is impossible, as the sketch below illustrates.
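As a minimal sketch of the balancing problem (all numbers are made up, not from the paper), consider splitting eight coarse-grained layers into four pipeline stages: no contiguous split yields exactly equal per-stage compute time.

```python
# Hypothetical per-layer compute times (ms); layers are coarse-grained blocks
# of tensor operations, so they cannot be subdivided to even things out.
layer_times_ms = [3.0, 7.0, 2.0, 6.0, 4.0, 5.0, 8.0, 1.0]
num_stages = 4
target = sum(layer_times_ms) / num_stages   # 9.0 ms of work per stage, ideally

# One possible contiguous split into 4 stages (layer indices per stage):
stages = [[0, 1], [2, 3], [4, 5], [6, 7]]
stage_times = [sum(layer_times_ms[i] for i in s) for s in stages]
print(target, stage_times)   # 9.0 [10.0, 8.0, 9.0, 9.0]
# No contiguous split of these layers hits 9.0 ms in every stage, so the
# slowest stage always paces the whole pipeline.
```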
Researchers at the University of Michigan and the University of Washington find that not all energy consumed during training directly contributes to end-to-end training throughput; a significant portion can be removed without slowing down training. They identify intrinsic and extrinsic sources of this energy bloat and propose a single optimization framework, Perseus, that minimizes both.
Intrinsic energy bloat is caused by computation imbalance within a pipeline, while extrinsic energy bloat arises when multiple pipelines run in parallel and synchronize with one another to scale out training on massive datasets. Pipelines that run faster than the straggler pipeline finish early and waste energy without improving overall training throughput.
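To make the two bloat sources concrete, here is a hedged toy calculation with invented timings (not measurements from the paper): slack inside one pipeline is intrinsic, slack across synchronized pipelines is extrinsic.

```python
# Intrinsic: within one pipeline, the slowest stage paces each iteration, so
# faster stages sit idle part of the time. (All numbers are hypothetical.)
stage_times_s = [0.041, 0.055, 0.047, 0.052]
iteration_s = max(stage_times_s)
intrinsic_idle_s = sum(iteration_s - t for t in stage_times_s)

# Extrinsic: data-parallel pipelines synchronize every iteration, so the
# straggler pipeline paces all the others, which finish early and wait.
pipeline_times_s = [0.055, 0.055, 0.061, 0.055]   # pipeline 2 is the straggler
sync_s = max(pipeline_times_s)
extrinsic_idle_s = sum(sync_s - t for t in pipeline_times_s)

print(f"intrinsic idle per iteration: {intrinsic_idle_s * 1000:.0f} ms")
print(f"extrinsic idle per iteration: {extrinsic_idle_s * 1000:.0f} ms")
# Work done at full speed only to sit in these idle windows is energy bloat.
```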
Perseus efficiently pre-characterizes the entire iteration time–energy tradeoff, minimizing intrinsic energy bloat under normal operating conditions. It mitigates extrinsic energy bloat by letting non-straggler pipelines give up speed they do not need: it finds the energy-optimal iteration time for each non-straggler pipeline by precisely slowing down its computations.
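The intuition behind slowing down non-critical computations can be shown with a simple cubic-power DVFS model; this is a toy model with hypothetical numbers, not Perseus's actual characterization of the iteration time–energy tradeoff.

```python
def energy_joules(work_s_at_full_speed: float, speed_fraction: float,
                  full_power_w: float = 400.0) -> float:
    """Energy to finish a fixed amount of work at a reduced speed.

    Toy model: time = work / speed, power ~ full_power * speed**3,
    so energy ~ speed**2. Real GPU power curves are measured, not assumed.
    """
    time_s = work_s_at_full_speed / speed_fraction
    power_w = full_power_w * speed_fraction ** 3
    return power_w * time_s

iteration_s = 0.055    # iteration time set by the slowest stage or the straggler
stage_work_s = 0.041   # this stage's work at full speed -> 14 ms of slack

full_speed = energy_joules(stage_work_s, 1.0)
stretched = energy_joules(stage_work_s, stage_work_s / iteration_s)
print(f"full speed: {full_speed:.1f} J, stretched to fill slack: {stretched:.1f} J")
# Same iteration time either way, but stretching the computation into its
# slack spends noticeably less energy.
```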
The researchers simulate stragglers while training large models with hybrid parallelism in various strong-scaling configurations and measure the amount of energy bloat and Perseus's extrinsic energy savings. Non-straggler pipelines finish their computation and then wait until the straggler completes, leading to extrinsic energy bloat. As the number of micro-batches decreases, the ratio of pipeline bubbles at the beginning and end of each pipeline iteration grows, and removing the resulting intrinsic energy bloat lowers energy consumption.
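A back-of-the-envelope simulation of that straggler setting, reusing the same toy power model and invented numbers as above, suggests where the extrinsic savings come from: non-stragglers that race at full speed and then wait draw power for no throughput benefit.

```python
FULL_POWER_W, IDLE_POWER_W = 400.0, 80.0      # hypothetical busy/idle GPU power

pipeline_work_s = [0.050] * 7 + [0.065]       # 8 pipelines; the last is a straggler
iteration_s = max(pipeline_work_s)            # everyone synchronizes on the straggler

# Baseline: every pipeline races at full speed, then idles until the sync point.
race_then_wait_j = sum(FULL_POWER_W * w + IDLE_POWER_W * (iteration_s - w)
                       for w in pipeline_work_s)

# Alternative: each non-straggler is slowed so its work just fills the iteration
# (same speed**3 power model as the sketch above).
slowed_j = sum(FULL_POWER_W * (w / iteration_s) ** 3 * iteration_s
               for w in pipeline_work_s)

print(f"race-then-wait: {race_then_wait_j:.0f} J per iteration")
print(f"slowed to match the straggler: {slowed_j:.0f} J per iteration")
```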
Integrating Perseus into the training workflow has strong implications for the future of AI development. Their work has the potential to greatly enhance the sustainability of distributed training amid the proliferation of LLMs and GenAI.
Check out the Paper. All credit for this research goes to the researchers of this project.
Arshad is an intern at MarktechPost. He is currently pursuing his Integrated MSc in Physics at the Indian Institute of Technology Kharagpur. He believes that understanding things at a fundamental level leads to new discoveries, which in turn drive technological advancement. He is passionate about understanding nature at its most fundamental level with the help of tools such as mathematical models, ML models, and AI.