Guide to LLM, Part 1: BERT

Understand how BERT constructs state-of-the-art embeddings

Vyacheslav Efimov
Towards Data Science

2017 was a historic year in machine learning, when the Transformer model made its first appearance on the scene. It has performed amazingly well on many benchmarks and has become suitable for lots of problems in Data Science. Thanks to its efficient architecture, many other Transformer-based models have since been developed, each specialising in particular tasks.

One such model is BERT. It is primarily known for being able to construct embeddings which can very accurately represent text information and store the semantic meanings of long text sequences. As a result, BERT embeddings became widely used in machine learning. Understanding how BERT builds text representations is crucial because it opens the door to tackling a large range of tasks in NLP.

In this article, we will refer to the original BERT paper, take a look at the BERT architecture and understand the core mechanisms behind it. In the first sections, we will give a high-level overview of BERT. After that, we will gradually dive into its internal workflow and how information is passed through the model. Finally, we will learn how BERT can be fine-tuned for solving particular problems in NLP.

The Transformer’s architecture consists of two primary parts: encoders and decoders. The goal of the stacked encoders is to construct a meaningful embedding of the input which preserves its main context. The output of the last encoder is passed to the inputs of all decoders, which try to generate new information.

BERT is a successor of the Transformer and inherits its stacked bidirectional encoders. Most of the architectural principles in BERT are the same as in the original Transformer.

Transformer architecture

There exist two main versions of BERT: Base and Large. Their architectures are identical except for their size: BERT Large uses more encoder layers, a larger hidden dimension and more attention heads. Overall, BERT Large has 3.09 times more parameters to tune than BERT Base.

Comparison of BERT Base and BERT Large
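As a rough illustration, the sketch below (assuming the Hugging Face transformers library is installed) instantiates both configurations without downloading any pre-trained weights and compares their sizes:

```python
from transformers import BertConfig, BertModel

for name in ["bert-base-uncased", "bert-large-uncased"]:
    config = BertConfig.from_pretrained(name)  # fetches only the small config file
    model = BertModel(config)                  # randomly initialised, no weights downloaded
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {config.num_hidden_layers} layers, "
          f"hidden size {config.hidden_size}, "
          f"{config.num_attention_heads} attention heads, "
          f"~{n_params / 1e6:.0f}M parameters")
```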

As the letter “B” in its name suggests, it is important to remember that BERT is a bidirectional model, meaning that it can better capture word connections because information is passed in both directions (left-to-right and right-to-left). Obviously, this requires more training resources than unidirectional models, but at the same time it leads to better prediction accuracy.

For a better understanding, we can visualise BERT architecture in comparison with other popular NLP models.

Comparison of BERT, OpenAI GPT and ELMo architectures from the original paper. Adapted by the author.

Before diving into how BERT is trained, it is necessary to understand in what format it accepts data. For the input, BERT takes a single sentence or a pair of sentences. Each sentence is split into tokens. Additionally, two special tokens are passed to the input:

- [CLS] — passed before the first sentence, indicating the beginning of the sequence. At the same time, [CLS] is also used for a classification objective during training (discussed in the sections below).
- [SEP] — passed between sentences to indicate the end of the first sentence and the beginning of the second.

Passing two sentences makes it possible for BERT to handle a large variety of tasks where the input contains two sentences (e.g. question and answer, hypothesis and premise, etc.).
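As a minimal sketch (assuming the Hugging Face transformers tokenizer, which reproduces BERT's WordPiece tokenisation), here is how a pair of sentences is converted into tokens with [CLS] and [SEP]:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoded = tokenizer("The man went to the store.", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# ['[CLS]', 'the', 'man', ..., '[SEP]', 'he', 'bought', ..., '[SEP]']
print(encoded["token_type_ids"])
# 0 for tokens of the first sentence, 1 for tokens of the second
```

Note that the tokenizer also appends a closing [SEP] token after the second sentence.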

After tokenisation, an embedding is built for each token. To make input embeddings more representative, BERT constructs three types of embeddings for each token:

- Token embeddings capture the semantic meaning of tokens.
- Segment embeddings have one of two possible values and indicate to which sentence a token belongs.
- Position embeddings contain information about the relative position of a token in the sequence.

Input processing

These embeddings are summed up and the result is passed to the first encoder of the BERT model.
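A sketch of this summation (assuming PyTorch and the Hugging Face transformers implementation, whose internal attribute names are used below) could look as follows:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("A short example.", return_tensors="pt")
input_ids, token_type_ids = enc["input_ids"], enc["token_type_ids"]
position_ids = torch.arange(input_ids.size(1)).unsqueeze(0)

emb = model.embeddings
summed = (emb.word_embeddings(input_ids)               # token embeddings
          + emb.token_type_embeddings(token_type_ids)  # segment embeddings
          + emb.position_embeddings(position_ids))     # position embeddings

# In practice, LayerNorm and dropout are also applied on top of this sum
# before it enters the first encoder.
print(summed.shape)  # (1, sequence_length, 768) for BERT Base
```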

Each encoder takes n embeddings as input and then outputs the same number of processed embeddings of the same dimensionality. Ultimately, the whole BERT output also contains n embeddings, each of which corresponds to its initial token.

BERT training consists of two stages:

- Pre-training. BERT is trained on unlabeled pairs of sentences over two prediction tasks: masked language modeling (MLM) and next sentence prediction (NSP). For each pair of sentences, the model makes predictions for these two tasks and, based on the loss values, performs backpropagation to update the weights.
- Fine-tuning. BERT is initialised with the pre-trained weights, which are then optimised for a particular problem on labeled data.

Compared to fine-tuning, pre-training usually takes a significant proportion of the time because the model is trained on a large corpus of data. That is why there exist a lot of online repositories of pre-trained models which can then be fine-tuned relatively fast to solve a particular task.
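For example, one such repository is the Hugging Face Hub, from which a pre-trained checkpoint can be loaded in a couple of lines (a sketch, assuming the transformers library):

```python
from transformers import BertModel, BertTokenizer

# Downloads the pre-trained BERT Base weights and the matching tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
# From here, the model can be fine-tuned on labeled data for a downstream task.
```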

We are going to look in detail at both problems solved by BERT during pre-training.

Masked Language Modeling

The authors propose training BERT by masking a certain proportion of tokens in the initial text and predicting them. This gives BERT the ability to construct resilient embeddings that use the surrounding context to guess a masked word, which also leads to an appropriate embedding being built for the masked word itself. This process works in the following way:

1. After tokenisation, 15% of the tokens are randomly chosen to be masked. The chosen tokens will then be predicted at the end of the iteration.
2. The chosen tokens are replaced in one of three ways:
- 80% of the tokens are replaced by the [MASK] token. Example: I bought a book → I bought a [MASK]
- 10% of the tokens are replaced by a random token. Example: He is eating a fruit → He is drawing a fruit
- 10% of the tokens remain unchanged. Example: A house is near me → A house is near me
3. All tokens are passed to the BERT model, which outputs an embedding for each token it received as input.

4. Output embeddings corresponding to the tokens processed at step 2 are independently used to predict the masked tokens. The result of each prediction is a probability distribution across all the tokens in the vocabulary.

5. The cross-entropy loss is calculated by comparing probability distributions with the true masked tokens.

6. The model weights are updated by using backpropagation.
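The 80% / 10% / 10% replacement rule from step 2 can be sketched in a few lines of PyTorch; this is a simplified illustration modelled on common open-source data collators, not the authors' original code:

```python
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def mask_tokens(input_ids: torch.Tensor, mlm_probability: float = 0.15):
    """Apply the 80/10/10 masking rule to a 1-D tensor of token ids."""
    labels = input_ids.clone()

    # Choose 15% of the (non-special) tokens to be predicted.
    probability_matrix = torch.full(labels.shape, mlm_probability)
    special = tokenizer.get_special_tokens_mask(labels.tolist(),
                                                already_has_special_tokens=True)
    probability_matrix.masked_fill_(torch.tensor(special, dtype=torch.bool), 0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100        # only chosen tokens contribute to the loss

    # 80% of the chosen tokens are replaced by [MASK].
    replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    input_ids[replaced] = tokenizer.mask_token_id

    # 10% are replaced by a random token (half of the remaining 20%).
    random_idx = (torch.bernoulli(torch.full(labels.shape, 0.5)).bool()
                  & masked_indices & ~replaced)
    random_tokens = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    input_ids[random_idx] = random_tokens[random_idx]

    # The remaining 10% stay unchanged.
    return input_ids, labels

ids = tokenizer("I bought a book about machine learning.",
                return_tensors="pt")["input_ids"][0]
print(mask_tokens(ids))
```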

Next Sentence Prediction

For this classification task, BERT tries to predict whether the second sentence follows the first. The whole prediction is made by using only the embedding from the final hidden state of the [CLS] token, which is supposed to contain aggregated information from both sentences.

Similarly to MLM, a constructed probability distribution (binary in this case) is used to calculate the model’s loss and update the weights of the model through backpropagation.

For NSP, the authors construct the training data so that 50% of the sentence pairs follow each other in the corpus (positive pairs), while in the other 50% the second sentence is taken randomly from the corpus (negative pairs).
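A sketch of this binary prediction with the transformers library, which exposes a pre-trained NSP head on top of the [CLS] representation:

```python
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

inputs = tokenizer("The man went to the store.", "He bought a gallon of milk.",
                   return_tensors="pt")
logits = model(**inputs).logits               # shape (1, 2)
# Index 0: the second sentence follows the first; index 1: it is a random sentence.
print(torch.softmax(logits, dim=-1))
```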

BERT pre-training

Training details

According to the paper, BERT is pre-trained on BooksCorpus (800M words) and English Wikipedia (2,500M words). To extract long contiguous texts, the authors took only text passages from Wikipedia, ignoring tables, headers and lists.

BERT is trained on one million batches of 256 sequences each, which is equivalent to 40 epochs over the 3.3 billion words. Each sequence contains up to 128 tokens (90% of the time) or 512 tokens (10% of the time).

According to the original paper, the training parameters are the following:

- Optimiser: Adam (learning rate l = 1e-4, weight decay L₂ = 0.01, β₁ = 0.9, β₂ = 0.999, ε = 1e-6).
- Learning rate warmup is performed over the first 10,000 steps, after which the learning rate is decayed linearly.
- A dropout layer (p = 0.1) is used on all layers.
- Activation function: GELU.
- The training loss is the sum of the mean MLM likelihood and the mean next sentence prediction likelihood.
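As a sketch (assuming PyTorch and transformers), the optimisation setup described above could be reproduced roughly like this; AdamW is used here as a close stand-in for Adam with weight decay, and the hyperparameter values are taken from the list:

```python
import torch
from transformers import BertConfig, BertForPreTraining, get_linear_schedule_with_warmup

model = BertForPreTraining(BertConfig())       # randomly initialised BERT Base

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01,
                              betas=(0.9, 0.999), eps=1e-6)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=10_000,
                                            num_training_steps=1_000_000)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step()
```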

Once pre-training is completed, BERT can capture the semantic meanings of words and construct embeddings which represent them almost fully. The goal of fine-tuning is then to gradually modify the BERT weights for solving a particular downstream task.

Data format

Thanks to the robustness of the self-attention mechanism, BERT can be easily fine-tuned for a particular downstream task. Another advantage of BERT is the ability to build bidirectional text representations. This gives a higher chance of discovering correct relations between two sentences when working with pairs. Previous approaches consisted of independently encoding both sentences and then applying bidirectional cross-attention to them. BERT unifies these two stages.

Depending on the problem, BERT accepts several input formats. The framework for solving all downstream tasks with BERT is the same: taking a sequence of text as input, BERT outputs a set of token embeddings which are then fed to a small task-specific model. Most of the time, not all of the output embeddings are used.

Let us have a look at common problems and the ways they are solved by fine-tuning BERT.

Sentence pair classification

The goal of sentence pair classification is to understand the relationship between a given pair of sentences. The most common types of tasks are:

- Natural language inference: determining whether the second sentence (the hypothesis) logically follows from the first (the premise).
- Similarity analysis: finding the degree of similarity between sentences.

Sentence pair classification

For fine-tuning, both sentences are passed to BERT. As a rule of thumb, the output embedding of the [CLS] token is then used for the classification task. According to the researchers, the [CLS] token is supposed to contain the main information about sentence relationships.

Of course, other output embeddings can also be used but they are usually omitted in practice.
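As an illustrative sketch (assuming the transformers library, whose sequence classification head sits on top of the [CLS] output), fine-tuning on a sentence pair could look like this; the sentences and label are hypothetical:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# num_labels=2 for a binary task such as "entails / does not entail".
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("A man is playing a guitar.", "A person is making music.",
                   return_tensors="pt")
labels = torch.tensor([1])                    # hypothetical gold label

outputs = model(**inputs, labels=labels)      # classification head on the [CLS] output
outputs.loss.backward()                       # fine-tuning step: backpropagate and update
```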

Question answering task

The objective of question answering is to find the answer to a particular question in a text paragraph. Most of the time, the answer is given in the form of two numbers: the positions of the start and end tokens of the answer within the paragraph.

Question answering task

For the input, BERT takes the question and the paragraph and outputs a set of embeddings for them. Since the answer is contained within the paragraph, we are only interested in output embeddings corresponding to paragraph tokens.

To find the position of the start token of the answer in the paragraph, the scalar product between every output embedding and a special trainable vector Tₛₜₐᵣₜ is calculated. Once the model and the vector Tₛₜₐᵣₜ are trained accordingly, this scalar product is proportional to the likelihood that the corresponding token is in reality the start token of the answer. To normalise the scalar products, they are then passed to the softmax function and can be thought of as probabilities. The token corresponding to the highest probability is predicted as the start token of the answer. Based on the true probability distribution, the loss value is calculated and backpropagation is performed. The analogous process is performed with the vector Tₑₙ𝒹 for predicting the end token.
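In the transformers library, the two trainable vectors Tₛₜₐᵣₜ and Tₑₙ𝒹 correspond to a linear layer producing one start logit and one end logit per token. A sketch of inference follows; note that the QA head of the base checkpoint is randomly initialised, so meaningful answers require fine-tuning (e.g. on SQuAD):

```python
import torch
from transformers import BertTokenizer, BertForQuestionAnswering

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")

question = "Where did he buy the milk?"
paragraph = "The man went to the store. He bought a gallon of milk there."
inputs = tokenizer(question, paragraph, return_tensors="pt")

outputs = model(**inputs)
# One start logit and one end logit per token (the scalar products with T_start / T_end).
start = torch.softmax(outputs.start_logits, dim=-1).argmax()
end = torch.softmax(outputs.end_logits, dim=-1).argmax()
answer_ids = inputs["input_ids"][0][start:end + 1]
print(tokenizer.decode(answer_ids))
```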

Single sentence classification

The difference, compared to the previous downstream tasks, is that here only a single sentence is passed to BERT. Typical problems solved by this configuration are the following:

- Sentiment analysis: understanding whether a sentence has a positive or negative attitude.
- Topic classification: classifying a sentence into one of several categories based on its contents.

Single sentence classification

The prediction workflow is the same as for sentence pair classification: the output embedding for the [CLS] token is used as the input for the classification model.
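The setup mirrors the sentence pair sketch above, except that only one sentence is tokenised; for instance, a hypothetical binary sentiment classifier:

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
loss = model(**inputs, labels=torch.tensor([1])).loss   # 1 = positive (hypothetical)
loss.backward()                                         # fine-tuning step
```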

Single sentence tagging

Named entity recognition (NER) is a machine learning problem which aims to map every token of a sequence to one of a set of entity classes.

Single sentence tagging

For this objective, embeddings are computed for the tokens of an input sentence, as usual. Then every embedding (except for [CLS] and [SEP]) is passed independently to a model which maps each of them to one of the NER classes (or to a special class if the token does not belong to any entity).
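A sketch of this configuration with the transformers token classification head; the size of the tag set and the label meanings are hypothetical:

```python
import torch
from transformers import BertTokenizer, BertForTokenClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Hypothetical tag set: O (no entity), B-PER, I-PER, B-LOC, I-LOC.
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=5)

inputs = tokenizer("Mark lives in London.", return_tensors="pt")
logits = model(**inputs).logits            # one score per token and per NER class
predictions = logits.argmax(dim=-1)[0]

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, tag_id in zip(tokens, predictions):
    print(token, int(tag_id))              # [CLS] and [SEP] predictions are simply ignored
```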

Sometimes we deal not only with text but also with other features, for example numerical ones. It is naturally desirable to build embeddings that can incorporate information from both the text and the other, non-text features. Here are the recommended strategies to apply:

- Concatenation of text with non-text features. For instance, if we work with profile descriptions of people in the form of text and there are other separate features like their name or age, then a new text description can be obtained in the form: “My name is <name>. <profile description>. I am <age> years old”. Finally, such a text description can be fed into the BERT model.
- Concatenation of embeddings with features. It is possible to build BERT embeddings, as discussed above, and then concatenate them with the other features (see the sketch after this list). The only thing that changes in this configuration is that the classification model for the downstream task now has to accept input vectors of higher dimensionality.
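A sketch of the second strategy (assuming PyTorch and transformers; the non-text features and classifier dimensions are hypothetical):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

text = "Likes hiking and photography."
numeric_features = torch.tensor([[34.0, 1.0]])              # hypothetical: age, has_pets

inputs = tokenizer(text, return_tensors="pt")
cls_embedding = bert(**inputs).last_hidden_state[:, 0]      # [CLS] embedding, shape (1, 768)

combined = torch.cat([cls_embedding, numeric_features], dim=-1)   # shape (1, 770)

# The downstream classifier now accepts the higher-dimensional input.
classifier = torch.nn.Linear(combined.size(-1), 2)
print(classifier(combined))
```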

In this article, we have dived into the processes of BERT training and fine-tuning. As a matter of fact, this knowledge is enough to solve the majority of tasks in NLP, thanks to the fact that BERT can incorporate text data into embeddings almost fully.

In recent times, other BERT-based models have appeared, such as SBERT and RoBERTa. There even exists a special field of study called “BERTology” which analyses BERT’s capabilities in depth to derive new high-performing models. These facts reinforce the idea that BERT marked a revolution in machine learning and made it possible to significantly advance NLP.

All images unless otherwise noted are by the author

