Exploring the advanced version of the attention mechanism in Transformers
In recent years, BERT has become the go-to tool for many natural language processing tasks. Its outstanding ability to process and understand text and to construct accurate word embeddings has delivered state-of-the-art performance.
As is well known, BERT is built on the attention mechanism of the Transformer architecture. Attention is the key component of most large language models today.
Nevertheless, new ideas and approaches evolve regularly in the machine learning world. One of the most innovative techniques for BERT-like models appeared in 2021 and introduced an enhanced version of attention called “disentangled attention”. Implementing this concept gave rise to DeBERTa, a model built around disentangled attention. Though DeBERTa introduces only a pair of new architectural principles, its improvements over other large models are prominent on top NLP benchmarks.
In this article, we will refer to the original DeBERTa paper and cover all the necessary details to understand how it works.
In the original Transformer block, each token is represented by a single vector that contains information about both the token's content and its position, in the form of an element-wise sum of two embeddings. The disadvantage of this approach is potential information loss: the model may be unable to tell whether the word itself or its position contributes more to a given component of the embedding vector.
DeBERTa proposes a novel mechanism in which the same information is stored in two separate vectors, one for content and one for position. Furthermore, the algorithm for attention computation is modified to explicitly take into account the relations between the content and positions of tokens. For instance, the words “research” and “paper” depend on each other much more when they appear next to each other than when they occur in different parts of a text. This example illustrates why it is necessary to consider content-to-position relations as well.
The introduction of disentangled attention requires a modification of the attention score computation. As it turns out, this process is very simple. The attention score between two embeddings, each consisting of two vectors, can be decomposed into the sum of four pairwise products of their subvectors:
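The decomposition can be checked numerically. The sketch below assumes that each token embedding is the sum of a content vector and a position vector; the names `h_i`, `p_i`, `h_j`, `p_j` and the size `d` are illustrative and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding size, chosen arbitrarily for this example

# Content (h) and position (p) vectors of two tokens; random stand-ins.
h_i, p_i = rng.normal(size=d), rng.normal(size=d)
h_j, p_j = rng.normal(size=d), rng.normal(size=d)

# Score computed on the summed embeddings.
full_score = (h_i + p_i) @ (h_j + p_j)

# The same score written as a sum of four pairwise dot products.
content_to_content = h_i @ h_j
content_to_position = h_i @ p_j
position_to_content = p_i @ h_j
position_to_position = p_i @ p_j
decomposed = (content_to_content + content_to_position
              + position_to_content + position_to_position)

print(np.isclose(full_score, decomposed))
```

Keeping the four terms separate is what makes it possible to treat each of them differently, or to drop one of them entirely.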
The same methodology can be generalized in the matrix form. From the diagram, we can observe four different types of matrices (vectors) each representing a certain combination of content and position information:
- content-to-content matrix;
- content-to-position matrix;
- position-to-content matrix;
- position-to-position matrix.
The position-to-position matrix carries no information about the words’ content, and because the positions are relative it adds little useful signal. This is the reason why this term is discarded in disentangled attention.
For the remaining three terms, the final output attention matrix is calculated in the same way as in the original Transformer.
Even though the calculation process looks similar, there is a pair of subtleties that need to be taken into consideration.
In the diagram above, the multiplication symbol * used between the query-content Qc and key-position Krᵀ matrices, and between the key-content Kc and query-position Qrᵀ matrices, differs from the normal matrix multiplication symbol x. This is deliberate: in DeBERTa, these pairs of matrices are multiplied in a slightly different way to take into account the relative positioning of tokens.
According to the normal rules of matrix multiplication, if C = A x B, then the element C[i][j] is computed as the dot product of the i-th row of A and the j-th column of B.
In the special case of DeBERTa, if C = A * B, then C[i][j] is calculated as the dot product of the i-th row of A and the δ(i, j)-th column of B, where δ denotes a relative distance function between indexes i and j, defined by the formula below:
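Written out in LaTeX, the relative distance function defined in the DeBERTa paper is:

```latex
\delta(i, j) =
\begin{cases}
0         & \text{if } i - j \le -k \\
2k - 1    & \text{if } i - j \ge k \\
i - j + k & \text{otherwise}
\end{cases}
```

The function therefore always returns a value in the range [0, 2k − 1], which means 2k relative position vectors are enough to cover every pair of tokens.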
k can be thought of as a hyperparameter controlling the maximum possible relative distance between indexes i and j. In DeBERTa, k is set to 512. To get a better sense of the formula, let us plot a heatmap visualising relative distances (k = 6) for different values of i and j.
For example, if k = 6, i = 15 and j = 13, then the relative distance δ between i and j is equal to 8. To obtain a content-to-position score for indexes i = 15 and j = 13, during the multiplication of query-content Qc and key-position Kr matrices, the 15-th row of Qc should be multiplied by the 8-th column of Krᵀ.
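The worked example can be reproduced with a few lines of Python. This is a minimal sketch of the relative distance function; the function name `relative_distance` and the sequence length of 16 are choices made for this illustration.

```python
import numpy as np

def relative_distance(i: int, j: int, k: int) -> int:
    """Relative distance delta(i, j), limited to the range [0, 2k - 1]."""
    diff = i - j
    if diff <= -k:
        return 0
    if diff >= k:
        return 2 * k - 1
    return diff + k

k = 6
print(relative_distance(15, 13, k))  # 8, matching the example above

# Matrix of relative distances for a sequence of 16 tokens (indexes 0..15).
n = 16
delta = np.array([[relative_distance(i, j, k) for j in range(n)]
                  for i in range(n)])
print(delta)
```

In the printed matrix, the main diagonal equals k, and the values saturate at 0 and 2k − 1 = 11 once two tokens are k or more positions apart.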
However, for position-to-content scores, the algorithm works a bit differently: instead of the relative distance being δ(i, j), this time the algorithm uses the value of δ(j, i) in matrix multiplication. As the authors of the paper explain: “this is because for a given position i, position-to-content computes the attention weight of the key content at j with respect to the query position at i, thus the relative distance is δ(j, i)”.
In general, δ(i, j) ≠ δ(j, i) for i ≠ j: δ is not a symmetric function, meaning that the distance from i to j is not the same as the distance from j to i.
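The asymmetry is easy to see numerically. The sketch below uses a vectorized form of the relative distance, `clip(i - j + k, 0, 2k - 1)`, which is equivalent to the piecewise definition; the sequence length of 16 is an arbitrary choice for the example.

```python
import numpy as np

k = 6
idx = np.arange(16)

# delta[i, j] = clip(i - j + k, 0, 2k - 1)
delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)

print(delta[15, 13], delta[13, 15])    # 8 4
print(delta[15, 2], delta[2, 15])      # 11 0
print(np.array_equal(delta, delta.T))  # False
```

For tokens less than k positions apart, the two directions mirror each other around k, so δ(i, j) + δ(j, i) = 2k (here 8 + 4 = 12).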
Before applying the softmax transformation, attention scores are divided by a constant √(3d) for more stable training. This scaling factor differs from the one used in the original Transformer (√d). The extra factor of √3 is justified by the larger magnitudes that result from summing 3 matrices in the DeBERTa attention mechanism (instead of a single matrix in the Transformer).
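Putting the pieces together, the following is a minimal single-head NumPy sketch of disentangled attention, not the official implementation. It assumes one sequence without batching or masking; the function name, the dictionary keys in `W` and the random inputs are hypothetical and only serve the illustration.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row maximum for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def disentangled_attention(H, P, W, k):
    """H: (n, d) content vectors, P: (2k, d) relative position embeddings,
    W: dict of (d, d) projection matrices (hypothetical key names)."""
    n, d = H.shape
    Qc, Kc, Vc = H @ W["qc"], H @ W["kc"], H @ W["vc"]  # content projections
    Qr, Kr = P @ W["qr"], P @ W["kr"]                   # position projections

    idx = np.arange(n)
    delta = np.clip(idx[:, None] - idx[None, :] + k, 0, 2 * k - 1)  # delta[i, j]

    c2c = Qc @ Kc.T                                       # Qc[i] . Kc[j]
    c2p = np.take_along_axis(Qc @ Kr.T, delta, axis=1)    # Qc[i] . Kr[delta(i, j)]
    p2c = np.take_along_axis(Kc @ Qr.T, delta, axis=1).T  # Kc[j] . Qr[delta(j, i)]

    scores = (c2c + c2p + p2c) / np.sqrt(3 * d)  # sum of three terms, scaled by sqrt(3d)
    weights = softmax(scores)
    return weights @ Vc, weights

rng = np.random.default_rng(0)
n, d, k = 5, 4, 3
H = rng.normal(size=(n, d))
P = rng.normal(size=(2 * k, d))
W = {name: rng.normal(size=(d, d)) for name in ("qc", "kc", "vc", "qr", "kr")}

output, weights = disentangled_attention(H, P, W, k)
print(output.shape, weights.sum(axis=1))
```

The transpose in the position-to-content term is what swaps δ(i, j) for δ(j, i): the gather is done per key index j, and the result is then flipped so that rows correspond to queries again.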
Disentangled attention takes into account only content and relative positioning. However, it does not consider absolute positioning, which can play an important role in the final prediction. The authors of the DeBERTa paper give a concrete example of such a situation: the sentence “a new store opened beside the new mall” is fed to the model with the words “store” and “mall” masked for prediction. Though the masked words have a similar meaning and the same local context (the adjective “new”), they play different syntactic roles in the sentence (the subject is “store”, not “mall”), and this difference is not captured by disentangled attention. Analogous situations are common in language, which is why it is crucial to incorporate absolute positioning into the model.
In BERT, absolute positioning is taken into account in the input embeddings. DeBERTa, in contrast, incorporates absolute positioning after all Transformer layers but before applying the softmax layer. Experiments showed that capturing relative positioning in all Transformer layers and introducing absolute positioning only afterwards improves the model’s performance. According to the researchers, doing it the other way round could prevent the model from learning sufficient information about relative positioning.
According to the paper, the enhanced mask decoder (EMD) has two input blocks:
- H: the hidden states from the previous Transformer layer;
- I: any necessary information for decoding (e.g. hidden states H, absolute position embedding or output from the previous EMD layer).
In general, a model can contain n stacked EMD blocks. If so, they are constructed according to the following rules:
- the output of each EMD layer is the input I for the next EMD layer;
- the output of the last EMD layer is fed to the language model head.
In the case of DeBERTa, the number of EMD layers is set to n = 2 with the position embedding used for I in the first EMD layer.
Another frequently used technique in NLP is weight sharing across different layers, with the objective of reducing model complexity (e.g. ALBERT). This idea is also implemented in the EMD blocks of DeBERTa.
When I = H and n = 1, EMD becomes the equivalent of the BERT decoder layer.
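The stacking rules can be sketched in a few lines of NumPy. This is a heavily simplified illustration under stated assumptions, not the paper's implementation: `emd_layer` is a hypothetical stand-in that uses plain single-head attention with queries taken from I and keys and values taken from H, and all inputs are random.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def emd_layer(I, H, W):
    """Hypothetical stand-in for one EMD layer (assumption: queries come
    from I, keys and values come from H)."""
    Q, K, V = I @ W["q"], H @ W["k"], H @ W["v"]
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def enhanced_mask_decoder(H, I, W, n_layers=2):
    out = I
    for _ in range(n_layers):       # the same W is reused: weights are shared
        out = emd_layer(out, H, W)  # each output becomes the next layer's input I
    return out                      # this would be fed to the language model head

rng = np.random.default_rng(0)
seq_len, d = 6, 8
H = rng.normal(size=(seq_len, d))        # hidden states from the last Transformer layer
abs_pos = rng.normal(size=(seq_len, d))  # absolute position embeddings (random stand-ins)
W = {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("q", "k", "v")}

decoded = enhanced_mask_decoder(H, abs_pos, W, n_layers=2)
print(decoded.shape)  # (6, 8)
```

Calling `enhanced_mask_decoder(H, H, W, n_layers=1)` corresponds to the I = H, n = 1 case, where the decoder reduces to a single layer applied to the hidden states.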
Experiments demonstrated that all introduced components in DeBERTa (position-to-content attention, content-to-position attention and enhanced mask decoder) boost performance. Removing any of them would result in inferior metrics.
Additionally, the authors proposed a new adversarial training algorithm called “Scale Invariant Fine-Tuning” (SiFT) to improve the model’s generalization. The idea is to apply small perturbations to input sequences, making the model more resilient to adversarial examples. In DeBERTa, perturbations are applied to normalized input word embeddings. According to the paper, this technique works even better for larger fine-tuned DeBERTa models.
DeBERTa’s paper presents three models. The comparison between them is shown in the diagram below.
For pre-training, the base and large versions of DeBERTa use a combination of the following datasets:
- English Wikipedia + BookCorpus (16 GB)
- OpenWebText (public Reddit content: 38 GB)
- Stories (31 GB)
After data deduplication, the resulting dataset size is reduced from 85 GB to 78 GB. For DeBERTa 1.5B, the authors used more than twice as much data (160 GB), together with an impressive vocabulary size of 128K.
In comparison, other large models like RoBERTa, XLNet and ELECTRA are pre-trained on 160 GB of data. Even with roughly half the data, the base and large versions of DeBERTa show comparable or better performance than these models on a variety of NLP tasks.
Speaking of training, DeBERTa is pre-trained for one million steps with 2K samples in each step, which amounts to roughly two billion training samples in total.
We have walked through the main aspects of the DeBERTa architecture. With disentangled attention and the enhanced mask decoder at its core, DeBERTa has become an extremely popular choice in NLP pipelines for many data scientists and a winning ingredient in many Kaggle competitions. Another remarkable fact about DeBERTa is that it is one of the first NLP models to outperform humans on the SuperGLUE benchmark. This result alone suggests that DeBERTa has earned a lasting place in the history of LLMs.