Motivated by the success of transformer architectures in natural language processing, machine learning researchers introduced the concept of a vision transformer (ViT) in 2021. This approach provides an alternative to convolutional neural networks (CNNs) for computer vision applications, as described in the paper “An Image Is Worth 16×16 Words: Transformers for Image Recognition at Scale”. Vision transformers have shown excellent performance on public benchmarks and are commonly used in image classification and object segmentation tasks. These applications enable user experiences such as photo search, room measurement, and ARKit semantic features.
In our research highlight “Deploying Transformers on the Apple Neural Engine”, we present efficient transformer deployment on the Apple Neural Engine (ANE) and introduce new techniques to support and enhance vision transformers on the ANE. One key challenge is the quadratic complexity of the attention module, which makes global attention inefficient at the large token lengths that come with high-resolution image inputs. To address this, state-of-the-art vision transformers use local attention blocks, which significantly improve efficiency: attention is computed within rectangular windows that partition the image, and cross-window information propagation is recovered either by shifting the windows between blocks or by adding depth-wise convolution layers that compensate for the lost global context.
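To make the cost argument concrete, the following is a minimal PyTorch sketch, not the released implementation, of window-based self-attention: the feature map is split into non-overlapping windows, attention runs independently inside each window, and the partition is then reversed. Global attention over an H×W token grid scales with (H·W)², whereas windowed attention scales with H·W times the window area. Note that this textbook formulation materializes a six-dimensional intermediate tensor, which is exactly what the first optimization described below avoids.

```python
import torch

def window_attention(x, ws, num_heads=4):
    """Self-attention restricted to non-overlapping ws x ws windows.

    x: (B, H, W, C) feature map, with H and W divisible by ws.
    Projections and biases are omitted for brevity (q = k = v = input).
    """
    B, H, W, C = x.shape
    head_dim = C // num_heads

    # Partition into windows: (B * num_windows, ws * ws, C).
    # This intermediate is a 6-D tensor.
    xw = x.view(B, H // ws, ws, W // ws, ws, C)
    xw = xw.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

    # Multi-head attention within each window: the attention matrix is only
    # (ws*ws) x (ws*ws) instead of (H*W) x (H*W).
    q = k = v = xw.view(-1, ws * ws, num_heads, head_dim).transpose(1, 2)
    attn = (q @ k.transpose(-2, -1)) * head_dim ** -0.5
    out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(-1, ws * ws, C)

    # Reverse the partition back to (B, H, W, C).
    out = out.view(B, H // ws, W // ws, ws, ws, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

# Example: a 56x56 grid attends within 7x7 windows (49 tokens per window)
# rather than globally over 3,136 tokens.
y = window_attention(torch.randn(2, 56, 56, 96), ws=7)
```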
To further optimize the performance of vision transformers on the ANE, we propose three key optimizations. Firstly, we implement the six-dimensional tensor window partition as a five-dimensional relayed partition, which enables efficient window partitioning and reversal with an NHWC tensor layout and improves memory access efficiency. Secondly, we introduce alternative position embedding techniques to reduce file size and latency: by replacing relative position embedding (RPE) with single-head RPE or locally enhanced position embedding (LePE), we significantly reduce the overhead associated with large token lengths. Lastly, we recap the principles of split_softmax, replacing linear layers with Conv2d 1×1, and chunking large query, key, and value tensors.
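The sketch below shows one way to read the first optimization, with the standard partition from the previous sketch as the baseline: the same window partitioning and reversal is expressed as a relay of reshapes whose intermediates never exceed five dimensions, instead of materializing a six-dimensional tensor. Function names here are illustrative and do not come from the released repository; the remaining optimizations (split_softmax, Conv2d 1×1 projections, and query/key/value chunking) follow the recipes described in the earlier ANE transformers research highlight.

```python
import torch

def window_partition_relayed(x, ws):
    """(B, H, W, C) -> (B * num_windows, ws, ws, C) with at most 5-D intermediates."""
    B, H, W, C = x.shape
    # Step 1: split the height axis only (a pure reshape, stays 4-D).
    x = x.reshape(B * (H // ws), ws, W, C)
    # Step 2: split the width axis and swap it with the row axis (peaks at 5-D).
    x = x.reshape(B * (H // ws), ws, W // ws, ws, C).transpose(1, 2)
    return x.reshape(-1, ws, ws, C)

def window_reverse_relayed(windows, ws, B, H, W):
    """Inverse of window_partition_relayed, also capped at 5-D intermediates."""
    C = windows.shape[-1]
    x = windows.reshape(B * (H // ws), W // ws, ws, ws, C).transpose(1, 2)
    return x.reshape(B, H, W, C)

# Equivalence check against the common 6-D formulation.
x = torch.randn(2, 56, 56, 96)
ref = (x.view(2, 8, 7, 8, 7, 96)
        .permute(0, 1, 3, 2, 4, 5)
        .reshape(-1, 7, 7, 96))
assert torch.equal(window_partition_relayed(x, 7), ref)
assert torch.equal(window_reverse_relayed(ref, 7, 2, 56, 56), x)
```

As noted above, keeping every intermediate at rank five or below is what allows the partition and its reversal to run efficiently in the NHWC layout; the attention itself can then operate on the resulting (num_windows, ws·ws, C) view exactly as in the earlier sketch.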
We apply these optimizations to two vision transformer architectures, DeiT and MOAT, and observe that MOAT achieves significantly better efficiency at higher input resolutions. We provide the optimized code and efficient visual attention components in an open-source repository on GitHub, so that researchers can apply these techniques and implement new transformer architectures. Our optimized Tiny-MOAT-1 model runs faster on the ANE than third-party open-source implementations.
In conclusion, the introduction of vision transformers has revolutionized computer vision applications, and with the optimizations discussed in our research, their performance and efficiency can be further enhanced. These advancements contribute to the development of more accurate and faster vision models for various tasks.