Posted by Phitchaya Mangpo Phothilimthana, Staff Research Scientist, Google DeepMind, and Bryan Perozzi, Senior Staff Research Scientist, Google Research
With the rapid advancements in machine learning (ML), machines can now understand natural language, hold conversations, create images, and generate videos. ML models are developed and trained with ML programming frameworks such as TensorFlow, JAX, and PyTorch. These frameworks provide high-level operations to ML practitioners, such as linear algebra primitives and neural network layers. How efficiently an ML workload runs ultimately depends on the compiler that the framework uses under the hood.
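To make this concrete, here is a minimal sketch, using JAX purely as an illustration (the same idea applies to TensorFlow and PyTorch): a neural-network layer is written with high-level operations, and the framework hands the resulting computation to its compiler (XLA, in this case) to produce optimized hardware instructions. The function and variable names are ours, not from any released code.

```python
# Minimal sketch: a high-level operation written against a framework API (JAX here),
# which the framework's compiler (XLA) lowers to optimized hardware instructions.
import jax
import jax.numpy as jnp

def dense_layer(x, w, b):
    # A neural-network layer expressed as high-level linear algebra ops.
    return jax.nn.relu(x @ w + b)

# jax.jit hands the whole computation to the XLA compiler, which decides how to
# fuse and schedule it for the target hardware.
dense_layer_compiled = jax.jit(dense_layer)

x = jnp.ones((8, 128))
w = jnp.ones((128, 64))
b = jnp.zeros((64,))
print(dense_layer_compiled(x, w, b).shape)  # (8, 64)
```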
In this blog post, we discuss exciting advancements in ML for ML, that is, how we use ML to improve the efficiency of ML workloads. Prior work has shown that ML can improve the performance of ML programs by guiding ML compiler decisions. However, existing datasets for program performance prediction focus primarily on small sub-programs, such as basic blocks or kernels. To address this gap, we introduce “TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs” and describe the Kaggle competition we hosted on this dataset.
ML compilers are software routines that convert user-written ML programs into executable instructions for the target hardware. To do so, they must solve complex optimization problems at both the graph level and the kernel level. A graph-level optimization requires analyzing the entire computation graph to make good decisions, whereas a kernel-level optimization considers each kernel independently. One important optimization in ML compilers is the assignment of memory layouts to the intermediate tensors in the program; the chosen layout can significantly affect the program's running time.
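As a rough analogy for why layout matters, the following sketch uses plain NumPy on a CPU, not an ML compiler's layout pass: the same logical matrix is stored in two memory layouts, and the same row-wise reduction is timed over both. The array sizes and helper names are illustrative.

```python
# Illustrative analogy only (plain NumPy on CPU, not an ML compiler's layout pass):
# the same logical matrix stored in two memory layouts, with the same row-wise
# reduction timed over both. The access pattern, and hence the runtime, depends on
# the layout the data is stored in.
import time
import numpy as np

a_row_major = np.ones((4000, 4000), order="C")  # rows are contiguous in memory
a_col_major = np.ones((4000, 4000), order="F")  # columns are contiguous in memory

def time_row_sums(a):
    # Summing each row touches contiguous memory in the row-major layout,
    # but strided (cache-unfriendly) memory in the column-major layout.
    start = time.perf_counter()
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i, :].sum()
    return time.perf_counter() - start

print("row-major layout:", time_row_sums(a_row_major), "s")
print("col-major layout:", time_row_sums(a_col_major), "s")
```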
To improve the efficiency of ML models, we aim to equip the ML compiler with a learned cost model: a model that takes a program and a compiler configuration as input and predicts the program's runtime. To facilitate the development of such cost models, we release the TpuGraphs dataset, which pairs computational graphs of ML workloads with compilation configurations and their measured execution times. The dataset contains a large number of graphs collected from open-source ML programs, enabling researchers to explore graph-level prediction tasks on large graphs.
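Concretely, the learning problem is to fit a function that maps (program graph, compiler configuration) to runtime. The sketch below shows one plausible way to represent a single training example and the cost-model interface; the field names are our own assumptions and do not necessarily match the released TpuGraphs file format.

```python
# Hedged sketch of one (graph, configuration, runtime) training example for a
# learned cost model. Field names are illustrative assumptions, not the exact
# TpuGraphs schema.
from dataclasses import dataclass
import numpy as np

@dataclass
class CostModelSample:
    node_opcodes: np.ndarray     # (num_nodes,) integer opcode of each tensor operation
    node_features: np.ndarray    # (num_nodes, feat_dim) per-node features (shapes, etc.)
    edge_index: np.ndarray       # (2, num_edges) data-flow edges between nodes
    config_features: np.ndarray  # (config_dim,) encoding of one compiler configuration
    runtime_seconds: float       # measured execution time of the compiled program

def predicted_runtime(cost_model, sample: CostModelSample) -> float:
    # A learned cost model maps (program graph, compiler configuration) -> runtime,
    # letting the compiler rank candidate configurations without executing them.
    return cost_model(sample.node_opcodes, sample.node_features,
                      sample.edge_index, sample.config_features)
```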
We provide baseline learned cost models for the TpuGraphs dataset based on graph neural networks (GNNs). These models combine per-node opcode embeddings with a graph pooling reduction to produce a fixed-size embedding of the whole graph. In addition, we introduce Graph Segment Training (GST), a method for scaling GNN training to large graphs on devices with limited memory capacity. GST partitions a large graph into smaller segments and updates the model using only a random subset of segments per training step, reducing the memory and compute needed per step.
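The following is a minimal NumPy sketch of these two ideas, not the released baseline code: per-node opcode embeddings, one round of message passing, a mean-pooling reduction to a fixed-size graph embedding with a linear runtime head, and a simplified version of GST that embeds each node segment separately. Cross-segment edges are dropped here for brevity, and in real training only a random subset of segments would receive gradient updates.

```python
# Simplified sketch of a GNN cost model and Graph Segment Training (illustration only).
import numpy as np

rng = np.random.default_rng(0)
NUM_OPCODES, EMBED_DIM = 128, 32

opcode_table = rng.normal(size=(NUM_OPCODES, EMBED_DIM))  # opcode embedding table
w_msg = rng.normal(size=(EMBED_DIM, EMBED_DIM))           # message-passing weights
w_out = rng.normal(size=(EMBED_DIM,))                     # linear runtime head

def embed_graph(node_opcodes, edge_index):
    """Opcode embedding + one GNN layer + mean pooling -> fixed-size graph embedding."""
    h = opcode_table[node_opcodes]                         # (num_nodes, EMBED_DIM)
    src, dst = edge_index
    agg = np.zeros_like(h)
    counts = np.zeros(len(h))
    np.add.at(agg, dst, h[src] @ w_msg)                    # aggregate incoming messages
    np.add.at(counts, dst, 1)
    h = np.maximum(h + agg / np.maximum(counts, 1)[:, None], 0.0)  # ReLU update
    return h.mean(axis=0)                                  # graph pooling reduction

def predict_runtime(node_opcodes, edge_index):
    return float(embed_graph(node_opcodes, edge_index) @ w_out)

def gst_embedding(node_opcodes, edge_index, num_segments=4):
    """GST sketch: embed each node segment separately (cross-segment edges dropped
    for simplicity) and combine the segment embeddings."""
    segments = np.array_split(np.arange(len(node_opcodes)), num_segments)
    src, dst = edge_index
    seg_embeds = []
    for seg in segments:
        remap = {int(n): i for i, n in enumerate(seg)}
        keep = np.array([(int(s) in remap) and (int(d) in remap)
                         for s, d in zip(src, dst)], dtype=bool)
        sub_edges = np.array([[remap[int(s)] for s in src[keep]],
                              [remap[int(d)] for d in dst[keep]]], dtype=int)
        seg_embeds.append(embed_graph(node_opcodes[seg], sub_edges))
    return np.mean(seg_embeds, axis=0)

# Toy usage: a 6-node chain graph with random opcodes.
opcodes = rng.integers(0, NUM_OPCODES, size=6)
edges = np.array([[0, 1, 2, 3, 4], [1, 2, 3, 4, 5]])
print(predict_runtime(opcodes, edges))
print(float(gst_embedding(opcodes, edges) @ w_out))
```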
To evaluate how well cost models perform on the TpuGraphs dataset, we hosted the “Fast or Slow? Predict AI Model Runtime” Kaggle competition. The competition attracted participants from around the world and showcased a variety of techniques for improving predictions, including graph pruning and compression.
Overall, our advancements in ML for ML and the release of the TpuGraphs dataset aim to enhance the efficiency of ML workloads and encourage further research in ML program optimization.