Exploring an enhanced version of the attention mechanism in Transformers
In recent years, BERT has become a go-to tool for many natural language processing tasks. Its ability to process and understand text and to construct highly accurate word embeddings has led to state-of-the-art performance on a wide range of benchmarks.
As is well known, BERT is built on the attention mechanism of the Transformer architecture, which remains the key component of most large language models today.
Nevertheless, new ideas and approaches appear regularly in the machine learning world. One of the most notable innovations for BERT-like models arrived in 2021 with an enhanced form of attention called “disentangled attention”. This concept gave rise to DeBERTa, the model built around it. Although DeBERTa introduces only a pair of new architectural principles, its improvements over other large models are clearly visible on top NLP benchmarks.
In this article, we will refer to the original DeBERTa paper and cover all the necessary details to understand how it works.
In the original Transformer block, each token is represented by a single vector that mixes information about the token's content and its position, in the form of an element-wise sum of the two embeddings. The disadvantage of this approach is potential information loss: the model cannot tell whether the word itself or its position contributes more to a given component of the resulting vector.
Embedding construction in BERT and DeBERTa. Instead of storing all the information in a single vector, DeBERTa uses two separate vectors to store word and position embeddings.
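Schematically, writing Hᵢ for the content vector and Pᵢ for the position vector of the i-th token (a notational convenience used in this article, not the paper's exact symbols):

BERT:     tokenᵢ → Hᵢ + Pᵢ   (a single summed vector)
DeBERTa:  tokenᵢ → {Hᵢ, Pᵢ}  (two separate vectors)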
DeBERTa proposes a novel mechanism in which this information is stored in two separate vectors. Furthermore, the attention computation algorithm is modified to explicitly take into account the relations between the content and positions of tokens. For instance, the words “research” and “paper” are much more strongly related when they appear next to each other than when they appear in distant parts of a text. This example clearly justifies why content-to-position relations need to be considered as well.
The introduction of disentangled attention requires modification in attention score computation. As it turns out, this process is very simple. The calculation of cross-attention scores between two embeddings, each consisting of two vectors, can be easily decomposed into the sum of four pairwise multiplications of their subvectors.
Computation of cross-attention score between two embedding vectors.
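Using the same Hᵢ / Pᵢ notation as above, the decomposition is simply:

(Hᵢ + Pᵢ) · (Hⱼ + Pⱼ)ᵀ = HᵢHⱼᵀ + HᵢPⱼᵀ + PᵢHⱼᵀ + PᵢPⱼᵀ

where the four terms correspond to content-to-content, content-to-position, position-to-content and position-to-position scores respectively.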
The same methodology can be generalized to matrix form, where we obtain four different types of matrices, each representing a certain combination of content and position information: a content-to-content matrix, a content-to-position matrix, a position-to-content matrix, and a position-to-position matrix.
Note that the position-to-position matrix does not store any valuable information, as it carries no details about the words' content. This is why this term is discarded in disentangled attention.
For the remaining three terms, the final output attention matrix is calculated similarly to the original Transformer.
Output disentangled attention calculation in DeBERTa
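In plain notation, the computation referred to by this caption can be reconstructed roughly as follows (Qc, Kc and Vc denote the content query, key and value matrices, Qr and Kr the relative-position query and key matrices; the exact layout of the original diagram is not preserved):

Ã = Qc × Kcᵀ + Qc * Krᵀ + Kc * Qrᵀ
output = softmax(Ã / √(3d)) × Vc

Here × is ordinary matrix multiplication and * is the modified multiplication discussed next.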
Even though the calculation looks similar, there are a couple of subtleties that need to be taken into account.
In the formula above, notice that the multiplication symbol * used between the query-content matrix Qc and the key-position matrix Kr, and between the key-content matrix Kc and the query-position matrix Qr, differs from the normal matrix multiplication symbol ×. This is no accident: in DeBERTa, these pairs of matrices are multiplied in a slightly different way to take the relative positioning of tokens into account.
According to the normal matrix multiplication rules, if C = A × B, then the element C[i][j] is computed as the dot product of the i-th row of A with the j-th column of B.
In the special case of DeBERTa, if C = A * B, then C[i][j] is calculated as the dot product of the i-th row of A with the δ(i, j)-th column of B, where δ denotes a relative distance function between indexes i and j, defined by the formula below:
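As given in the DeBERTa paper, δ maps each pair of indexes to a value in the range [0, 2k):

δ(i, j) = 0          if i − j ≤ −k
δ(i, j) = 2k − 1     if i − j ≥ k
δ(i, j) = i − j + k  otherwise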
Relative distance definition between indexes i and j. k is a hyperparameter controlling the maximum possible relative distance. Image adapted by the author.
k can be thought of as a hyperparameter controlling the maximum possible relative distance between indexes i and j. In DeBERTa, k is set to 512. To get a better sense of the formula, let us plot a heatmap visualizing relative distances (k = 6) for different values of i and j.
For example, if k = 6, i = 15 and j = 13, then the relative distance δ between i and j is equal to 8. To obtain the content-to-position score for i = 15 and j = 13 during the multiplication of the query-content matrix Qc and the key-position matrix Kr, the 15-th row of Qc is multiplied by the 8-th column of Kr.
Content-to-position score computation for tokens i and j
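This bookkeeping is easy to reproduce in code. Below is a minimal sketch, assuming NumPy; the function name relative_distance is an illustrative choice rather than something taken from the DeBERTa implementation.

```python
import numpy as np

def relative_distance(i: int, j: int, k: int) -> int:
    """Map an index pair (i, j) to a relative-distance bucket in [0, 2k)."""
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k

# Worked example from the text: k = 6, i = 15, j = 13
print(relative_distance(15, 13, k=6))  # 8
# delta is not symmetric: swapping i and j gives a different bucket
print(relative_distance(13, 15, k=6))  # 4

# The matrix behind the k = 6 heatmap mentioned above
distances = np.array([[relative_distance(i, j, 6) for j in range(10)] for i in range(10)])
print(distances)
```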
However, for position-to-content scores, the algorithm works a bit differently: instead of the relative distance being δ(i, j), this time the algorithm uses the value of δ(j, i) in matrix multiplication. As the authors of the paper explain: “this is because for a given position i, position-to-content computes the attention weight of the key content at j with respect to the query position at i, thus the relative distance is δ(j, i)”.
Position-to-content score computation for tokens i and j
Note that δ(i, j) ≠ δ(j, i); in other words, δ is not a symmetric function: the distance from i to j is not the same as the distance from j to i.
Before the softmax transformation, the attention scores are divided by the constant √(3d) for more stable training. This scaling factor differs from the one used in the original Transformer (√d): the extra factor of √3 accounts for the larger magnitudes produced by summing three matrices in the DeBERTa attention mechanism instead of a single one.
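Putting the pieces together, here is a minimal, self-contained sketch of disentangled attention for a single head. It assumes NumPy, random toy weights and an explicit double loop for readability (real implementations gather the relevant rows of Kr and Qr in a vectorized way); apart from the names Qc, Kc, Vc, Qr and Kr, everything here is illustrative and not taken from the official DeBERTa code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def relative_distance(i, j, k):
    # Same bucketing function as in the previous sketch
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k

def disentangled_attention(H, P, Wq, Wk, Wv, Wqr, Wkr, k):
    """H: (L, d) content states, P: (2k, d) relative position embeddings."""
    L, d = H.shape
    Qc, Kc, Vc = H @ Wq, H @ Wk, H @ Wv   # content projections
    Qr, Kr = P @ Wqr, P @ Wkr             # relative-position projections

    A = np.zeros((L, L))
    for i in range(L):
        for j in range(L):
            c2c = Qc[i] @ Kc[j]                           # content-to-content
            c2p = Qc[i] @ Kr[relative_distance(i, j, k)]  # content-to-position
            p2c = Kc[j] @ Qr[relative_distance(j, i, k)]  # position-to-content
            A[i, j] = c2c + c2p + p2c

    A /= np.sqrt(3 * d)                   # scaling by sqrt(3d) before the softmax
    return softmax(A, axis=-1) @ Vc

# Toy usage with random weights
rng = np.random.default_rng(0)
L, d, k = 10, 16, 6
rand = lambda *shape: rng.normal(size=shape)
out = disentangled_attention(rand(L, d), rand(2 * k, d),
                             rand(d, d), rand(d, d), rand(d, d),
                             rand(d, d), rand(d, d), k)
print(out.shape)  # (10, 16)
```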
Disentangled attention takes into account only content and relative positioning. However, no information about absolute positions is considered, even though it may play an important role in the final prediction. The authors of the DeBERTa paper give a concrete example of such a situation: the sentence “a new store opened beside the new mall” is fed to BERT with the words “store” and “mall” masked for prediction. Although the masked words have similar meanings and local contexts (both are preceded by the adjective “new”), they have different linguistic contexts, such as their syntactic roles, which are not captured by disentangled attention. Since language is full of analogous situations, it is crucial to incorporate absolute positions into the model as well.
In BERT, absolute positions are incorporated into the input embeddings. DeBERTa, in contrast, incorporates them after all the Transformer layers but before the softmax layer used for masked token prediction. Experiments showed that capturing relative positions in all Transformer layers and introducing absolute positions only afterwards improves the model's performance. According to the researchers, doing it the other way around could prevent the model from learning sufficient information about relative positioning.
Architecture
According to the paper, the enhanced mask decoder (EMD) has two input blocks:
H – the hidden states from the previous Transformer layer.
I – any necessary information for decoding (e.g. hidden states H, absolute position embedding, or output from the previous EMD layer).
Enhanced mask decoder in DeBERTa. Image adapted by the author.
In general, there can be n EMD blocks stacked inside a model. If so, they are constructed according to the following rules:
the output of each EMD layer is the input I for the next EMD layer;
the output of the last EMD layer is fed to the language model head.
In the case of DeBERTa, the number of EMD layers is set to n = 2 with the position embedding used for I in the first EMD layer.
Another frequently used technique in NLP is weight sharing across different layers to reduce model complexity (e.g. ALBERT). This idea is also implemented in the EMD blocks of DeBERTa.
When I = H and n = 1, EMD becomes the equivalent of the BERT decoder layer.
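A compact sketch of how these rules could be wired together is shown below. The callables transformer_layer (which builds queries from its first argument and keys/values from its second) and lm_head are placeholders introduced for illustration, not names from the official implementation.

```python
def enhanced_mask_decoder(H, position_embeddings, transformer_layer, lm_head, n=2):
    """H: hidden states from the last Transformer layer."""
    I = position_embeddings          # first EMD layer: the absolute position embedding is used as I
    for _ in range(n):               # the same (weight-shared) layer is applied n times
        I = transformer_layer(I, H)  # the output of each EMD layer is the input I of the next
    return lm_head(I)                # the output of the last EMD layer goes to the language model head
```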
Ablation studies
Experiments demonstrated that all introduced components in DeBERTa (position-to-content attention, content-to-position attention, and enhanced mask decoder) boost performance. Removing any of them would result in inferior metrics.
Scale-invariant fine-tuning
Additionally, the authors proposed a new adversarial algorithm called “Scale-Invariant Fine-Tuning” (SiFT) to improve the model's generalization. The idea is to apply small perturbations to input sequences, making the model more resilient to adversarial examples. In DeBERTa, the perturbations are applied to normalized input word embeddings. This technique proves especially effective for larger fine-tuned DeBERTa models.
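The central step, perturbing normalized rather than raw word embeddings, can be sketched roughly as follows; the layer_norm helper and the gradient-based perturbation are simplified stand-ins for the actual adversarial procedure described in the paper.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def sift_perturb(word_embeddings, grad, eps=1e-2):
    """Normalize the word embeddings first, then add a small adversarial perturbation."""
    normalized = layer_norm(word_embeddings)
    # Perturbation in the direction of the loss gradient w.r.t. the embeddings, scaled to eps
    direction = grad / (np.linalg.norm(grad, axis=-1, keepdims=True) + 1e-12)
    return normalized + eps * direction
```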
DeBERTa variants
The DeBERTa paper presents three model variants. A comparison between them is shown in the diagram below.
DeBERTa variants
Data
For pre-training, the base and…