A Complete Guide to Transformers in PyTorch
Since the advent of ChatGPT at the latest, Large Language Models (LLMs) have created enormous hype and are known even to those outside the AI community. Even though one should understand that LLMs are inherently “just” sequence prediction models without any form of intelligence or reasoning, the achieved results are certainly impressive, with some even speaking of another step in the “AI revolution”.
Essential to the success of LLMs are their core building blocks: transformers. In this post, we will give a complete guide to using them in PyTorch, with a particular focus on time series prediction. Thanks for stopping by, and I hope you enjoy the ride!
One could argue that all problems solved via transformers are essentially time series problems. While that is true, here we will put special focus on continuous series and data, such as predicting the spread of diseases or forecasting the weather. The difference to the prominent application of Natural Language Processing (NLP) is simply the continuous input space, whereas NLP works with discrete tokens (with “simply” to be taken with a grain of salt: developing a model like ChatGPT and making it work well naturally requires a multitude of further optimization steps and tricks). Apart from this, however, the basic building blocks are identical.
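In practice, this difference shows up mainly in the model's input layer: NLP models map discrete token IDs to vectors via an embedding lookup, while continuous series can be projected into the model dimension with a linear layer. A minimal sketch of this contrast (all sizes here are illustrative, not taken from the article):

```python
import torch
import torch.nn as nn

d_model = 64  # model (embedding) dimension, chosen for illustration

# NLP: discrete token IDs -> vectors via an embedding lookup
vocab_size = 10_000
token_embedding = nn.Embedding(vocab_size, d_model)
tokens = torch.randint(0, vocab_size, (8, 20))  # (batch, seq_len)
nlp_input = token_embedding(tokens)             # (8, 20, 64)

# Time series: continuous values -> vectors via a linear projection
n_features = 1
value_projection = nn.Linear(n_features, d_model)
series = torch.randn(8, 20, n_features)         # (batch, seq_len, features)
ts_input = value_projection(series)             # (8, 20, 64)

# Both produce the same shape, so the transformer blocks after
# this point are identical for the two settings.
assert nlp_input.shape == ts_input.shape == (8, 20, d_model)
```

From here on, both inputs feed into the same stack of attention and feed-forward layers, which is why the transformer machinery transfers so directly from NLP to continuous forecasting.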
In this post, we will start with a short theoretical introduction to transformers and then move on to applying them in PyTorch. For this, we will discuss a selected example, namely predicting the sine function. We will show how to generate and correctly pre-process data for this task, and then use transformers to learn to predict this function. Later, we will discuss how to do inference when future tokens are not available, and conclude the post by extending the example to multi-dimensional data.
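To make the data-generation step concrete, here is a minimal sketch of sampling a sine wave and slicing it into (source, target) windows for sequence-to-sequence training; the sampling step and window lengths are my own illustrative choices, not necessarily those used later in the article:

```python
import torch

# Sample a noiseless sine wave at regular intervals
t = torch.arange(0, 100, 0.1)   # 1000 time steps
series = torch.sin(t)

# Slide a window over the series: the model sees `src_len` past
# values and learns to predict the following `tgt_len` values.
src_len, tgt_len = 20, 5
windows = series.unfold(0, src_len + tgt_len, 1)  # (976, 25)
src = windows[:, :src_len].unsqueeze(-1)          # (976, 20, 1)
tgt = windows[:, src_len:].unsqueeze(-1)          # (976, 5, 1)
```

The trailing feature dimension of size 1 matches the continuous, one-dimensional input a transformer for time series expects; the multi-dimensional extension mentioned above would simply enlarge it.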
The goal of this post is to provide a complete, hands-on tutorial on how to use transformers for real-world use cases, not to theoretically introduce and explain these interesting models. For that, I’d instead like to refer to this amazing article and the original paper [1] (whose architecture we will follow throughout this…