Researchers from MIT and NVIDIA have developed two techniques that enhance the processing of sparse tensors, a type of data structure commonly used in high-performance computing tasks. These techniques have the potential to significantly improve the performance and energy efficiency of systems such as large machine-learning models used in generative artificial intelligence.
Tensors are the data structures that machine-learning models use to represent their data, and sparse tensors are those in which many of the values are zero. The two methods developed by the researchers aim to exploit this sparsity: by skipping over the zeros during processing, both computation and memory can be saved, and the tensors can be compressed so that less on-chip memory is needed to store them.
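To make the idea concrete, here is a minimal Python sketch, not the researchers' hardware design, of the two savings described above: storing only the nonzero values (compression) and skipping zeros during a computation.

```python
import numpy as np

def compress(matrix):
    """Store only the nonzero values of a matrix together with their
    (row, col) coordinates -- a simple coordinate (COO) representation."""
    coords = np.argwhere(matrix != 0)   # positions of the nonzeros
    values = matrix[matrix != 0]        # the nonzero values themselves
    return coords, values

def sparse_matvec(coords, values, x, num_rows):
    """Matrix-vector product that iterates over the stored nonzeros only,
    so every zero entry is skipped rather than multiplied."""
    y = np.zeros(num_rows)
    for (r, c), v in zip(coords, values):
        y[r] += v * x[c]
    return y

# A mostly zero matrix: compressing it saves memory, and the multiply
# above does work proportional to the nonzeros, not the full matrix size.
A = np.array([[0.0, 0.0, 3.0, 0.0],
              [0.0, 0.0, 0.0, 0.0],
              [5.0, 0.0, 0.0, 2.0]])
coords, values = compress(A)
x = np.array([1.0, 2.0, 3.0, 4.0])
print(sparse_matvec(coords, values, x, A.shape[0]))  # [ 9.  0. 13.]
print(A @ x)                                         # same result, computed densely
```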
However, exploiting sparsity poses several challenges. Locating the nonzero values in a large tensor is not a simple task. Existing approaches often restrict where nonzero values may appear by enforcing a fixed sparsity pattern, which simplifies the search but limits the variety of sparse tensors that can be processed efficiently. Moreover, the number of nonzero values can vary across different regions of a tensor, which makes it hard to predict how much memory each region will need. As a result, more space is often allocated than necessary, leaving the storage buffer underutilized and wasting energy.
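To illustrate what "enforcing a sparsity pattern" means, the sketch below checks one common fixed constraint, n:m structured sparsity (at most n nonzeros in every group of m values); this particular pattern is chosen only as an example, not as a description of any specific prior accelerator.

```python
import numpy as np

def satisfies_n_of_m(values, n=2, m=4):
    """True if every consecutive group of m entries has at most n nonzeros,
    i.e. the data matches a fixed n:m structured-sparsity pattern."""
    pad = (-len(values)) % m
    groups = np.concatenate([values, np.zeros(pad)]).reshape(-1, m)
    return bool(np.all((groups != 0).sum(axis=1) <= n))

# This tensor is 50% zero overall, but its zeros are clustered, so it
# violates a 2:4 pattern and a pattern-restricted design could not
# exploit its sparsity.
clustered = np.array([1., 2., 3., 4., 0., 0., 0., 0.])
print(satisfies_n_of_m(clustered))   # False

# Same density with the zeros spread evenly: the 2:4 constraint holds.
spread = np.array([1., 0., 2., 0., 3., 0., 4., 0.])
print(satisfies_n_of_m(spread))      # True
```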
To address these challenges, the MIT and NVIDIA researchers devised two solutions. The first, a technique called HighLight, allows hardware to efficiently locate the nonzero values for a wide variety of sparsity patterns. It relies on hierarchical structured sparsity, which represents complex sparsity patterns as compositions of simple ones: the tensor's values are divided into smaller blocks, each following its own simple sparsity pattern, and the blocks are then combined into a hierarchy. This lets the hardware efficiently identify and skip zeros, avoiding wasted computation. On average, the resulting accelerator design achieved about six times better energy efficiency than other approaches.
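As a loose software analogy of the hierarchical idea (assuming a simple two-level hierarchy; the actual HighLight hardware and its pattern definitions are more involved), the sketch below drops all-zero blocks at the first level and skips individual zeros inside the surviving blocks at the second level.

```python
import numpy as np

def hierarchical_compress(vector, block_size=4):
    """Two-level compression: level 1 records which blocks contain any
    nonzeros at all; level 2 records, inside each kept block, where the
    nonzero elements sit. All-zero blocks are dropped entirely."""
    pad = (-len(vector)) % block_size
    blocks = np.concatenate([vector, np.zeros(pad)]).reshape(-1, block_size)
    kept = []
    for block_idx, block in enumerate(blocks):
        if np.any(block != 0):                    # level-1 skip: whole block is zero
            positions = np.flatnonzero(block)     # level-2: nonzero slots inside the block
            kept.append((block_idx, positions, block[positions]))
    return kept, block_size

def hierarchical_dot(compressed, block_size, x):
    """Dot product that skips zero blocks wholesale, then skips
    zero elements inside the remaining blocks."""
    total = 0.0
    for block_idx, positions, values in compressed:
        base = block_idx * block_size
        for pos, v in zip(positions, values):
            total += v * x[base + pos]
    return total

v = np.array([0., 0., 0., 0., 7., 0., 0., 1., 0., 0., 0., 0., 0., 2., 0., 0.])
x = np.arange(16, dtype=float)
compressed, bs = hierarchical_compress(v)
print(hierarchical_dot(compressed, bs, x))   # 61.0
print(v @ x)                                 # 61.0, same answer computed densely
```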
Secondly, the researchers introduced a pair of techniques called Tailors and Swiftiles, which maximize the utilization of the memory buffer on a computer chip when processing sparse tensors. Because zero values do not need to be stored, sparsity lets the chip work on larger tiles (chunks of the tensor) at a time. But the number of zeros varies from region to region, so the ideal tile size is hard to determine in advance. To handle this uncertainty, the researchers borrow the idea of overbooking: the tile size is chosen somewhat larger than the buffer is guaranteed to hold, on the assumption that occasionally some nonzero values will not fit. When that happens, the excess data is bumped out of the buffer, and only the bumped data is re-fetched for processing. Combined with Swiftiles, which quickly estimates the ideal tile size, this approach significantly improves processing speed while reducing energy demands.
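The sketch below is a rough software analogy of this idea, not the Tailors and Swiftiles hardware: the tile size is "overbooked" from a quick sampled density estimate (the sampling approach and the buffer capacity here are assumptions made for illustration), and any nonzero values that overflow the buffer are counted as data that would have to be fetched again.

```python
import numpy as np

BUFFER_CAPACITY = 64   # nonzero values the on-chip buffer can hold (illustrative)

def estimate_tile_size(tensor, sample_fraction=0.1, rng=None):
    """Swiftiles-style quick estimate (sketch): sample a fraction of the tensor
    to guess its density, then size the tile so that the *expected* number of
    nonzeros roughly fills the buffer -- i.e. deliberately overbook it."""
    rng = rng or np.random.default_rng(0)
    flat = tensor.ravel()
    sample = rng.choice(flat, size=max(1, int(sample_fraction * flat.size)), replace=False)
    density = max(np.count_nonzero(sample) / sample.size, 1e-3)
    return int(BUFFER_CAPACITY / density)   # elements per tile

def process_with_overbooking(tensor, tile_size):
    """Stream tiles through the buffer. Most tiles' nonzeros fit; when one
    does not, only the overflowing values would be re-fetched later."""
    flat = tensor.ravel()
    refetched = 0
    for start in range(0, flat.size, tile_size):
        nonzeros = flat[start:start + tile_size]
        nonzeros = nonzeros[nonzeros != 0]
        if nonzeros.size > BUFFER_CAPACITY:
            refetched += nonzeros.size - BUFFER_CAPACITY   # bumped data
    return refetched

rng = np.random.default_rng(1)
tensor = rng.random((256, 256)) * (rng.random((256, 256)) < 0.1)   # ~90% zeros
tile = estimate_tile_size(tensor, rng=rng)
print(f"tile size: {tile} elements, re-fetched values: {process_with_overbooking(tensor, tile)}")
```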
Overall, both techniques enhance the performance and energy efficiency of hardware accelerators designed for processing sparse tensors, and the researchers emphasize that they do so while preserving the flexibility and adaptability that specialized, efficient hardware often gives up. In the future, the researchers aim to apply these techniques to other types of machine-learning models and tensors.