Build and Play! Your Own V&L Model Equipped with LLM! | by Yuichi Inoue | Sep, 2023

In the GIT research papers, it was explained that a strong vision encoder is used while random parameters are adopted for the language model. This time, since the goal is to ultimately use a 7B-class language model, a pre-trained model will be applied to the language model instead. The following modules will be examined for fine-tuning. The GIT Projection, being a randomly initialized module, is always included. Some combinations may seem redundant, but they are explored without too much concern for this trial.

Modules set for training keep their gradients enabled, while gradients are disabled for all the rest.

import numpy as np

# Specifying the parameters to train (training all would increase memory usage)
for name, p in model.model.named_parameters():
    if np.any([k in name for k in keys_finetune]):
        p.requires_grad = True
    else:
        p.requires_grad = False
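For reference, keys_finetune is simply a list of substrings that are matched against parameter names. The exact strings depend on how the modules are named in this implementation, so the following values are purely illustrative:

keys_finetune = ["git_projection"]          # hypothetical: train the GIT Projection only
keys_finetune = ["git_projection", "lora"]  # hypothetical: GIT Projection + LoRA parameters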

The Vision Encoder and LLM used for this examination are:

openai/clip-vit-base-patch16
facebook/opt-350m
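As a minimal sketch, these two pre-trained models could be loaded with Hugging Face Transformers as follows; how they are wired together into the GIT-style model is omitted here:

from transformers import AutoModelForCausalLM, AutoTokenizer, CLIPImageProcessor, CLIPVisionModel

# Load the pre-trained vision encoder and language model used in this examination.
vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
image_processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch16")
language_model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")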

Training uses the COCO dataset and runs for 5 epochs.

Here are the target modules trained during each experiment:

Proj: GIT Projection. Initialized randomly, so it’s always trained.
LoRA: Applied to the Query, Key, and Value projections of the self-attention in the language model (a sketch with the peft library follows the note below).
OPT: All layers were trained.
ViT: All layers were trained.
Head: The final lm_head of OPT was trained.

(Note: LoRA can also be applied to ViT, but to avoid making the experiments too complicated, it wasn’t included this time.)
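As a rough sketch, applying LoRA to these attention projections with the peft library might look like the following. The module names q_proj, k_proj, and v_proj follow the Hugging Face OPT implementation; the attribute holding the language model (model.lm) and the rank/alpha/dropout values are illustrative assumptions, not the exact settings used here:

from peft import LoraConfig, get_peft_model

# Wrap the language model so that LoRA adapters are added to the
# Query, Key, and Value projections of the self-attention layers.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj"],
)
model.lm = get_peft_model(model.lm, lora_config)  # "model.lm" is a hypothetical attribute name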

This figure shows training loss. Proj, LoRA, OPT, ViT, and Head in the legend are the trained modules explained above. (figure made by the author)

As shown in the training loss plot, it’s apparent that some groups are not performing well. These are the cases where OPT itself is included in the training. Although all experiments were conducted under fairly similar conditions, more detailed adjustments, such as the learning rate, might be necessary when fine-tuning the language model. Next, the results are examined excluding the models where OPT was included in the training.

This figure shows training loss without full finetuning results. Proj, LoRA, OPT, ViT, and Head in the legend are the trained modules explained above. (figure made by the author)
This figure shows validation loss. Proj, LoRA, OPT, ViT, and Head in the legend are the trained modules explained above. (figure made by the author)

Both training and validation loss decreased most with the Projection+LoRA model. Fine-tuning the final Head layer showed nearly identical outcomes. If ViT is also trained, the loss appears slightly higher and the results seem unstable. Even when adding LoRA during ViT training, the loss still tends to be high. For fine-tuning with this data, using a pre-trained ViT model without updating its parameters seems to yield more stable results. The effectiveness of LoRA has been acknowledged in various places, and it is evident from this experiment that adding LoRA to the LLM improved both training and validation loss.

Reviewing the inference results on some test data:

Example results of GIT-OPT. Pictures are cited from M3IT dataset, and text results were made by the author’s model

When training OPT itself, the results are as poor as the loss values suggest, leaving the model at a loss for words. Additionally, when training ViT, the output makes semantic sense but describes something entirely different from the given image. However, the other results seem to capture the features of the images to some extent. For instance, the first image mentions “cat” and “banana”, and the second one identifies “traffic sign”. Comparing results with and without LoRA, the model without LoRA tends to repeat similar words, while adding LoRA makes the output slightly more natural. Training the Head results in intriguing outputs, like using “playing” instead of “eating” for the first image. While there are some unnatural elements in these results, it can be deduced that the training was successful in capturing image features.

For fine-tuning conditions in the earlier experiments, a slightly smaller language model, OPT-350m, was used. Now, the intention is to switch to a 7B-class language model. Rather than settling for OPT alone, stronger LLMs, LLaMA and MPT, will also be introduced.

Integrating these two models can be done in a similar fashion to OPT. Referring to the forward functions of the LlamaModel and MPTModel, combine the projected image vectors with text tokens, and change the mask from Causal Attention Mask to GIT’s Attention Mask. One thing to note: for MPT, the mask isn’t (0, -inf), but (False, True). The subsequent processes can be implemented similarly.
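As a rough sketch of the mask construction (the projected image features are assumed to be concatenated with the text token embeddings just before this step; broadcasting to the (batch, heads, query, key) shape each model expects is omitted):

import torch

def build_git_attention_mask(num_image: int, num_text: int) -> torch.Tensor:
    # GIT-style mask in the additive (0, -inf) convention: image tokens attend
    # to all image tokens, and text tokens attend to all image tokens plus
    # earlier text tokens (causal). For MPT, the same pattern would be
    # expressed with booleans instead of (0, -inf), as noted above.
    total = num_image + num_text
    mask = torch.full((total, total), float("-inf"))
    mask[:, :num_image] = 0.0  # every query may attend to the image keys
    causal = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
    text_block = mask[num_image:, num_image:]  # view into the text-to-text region
    text_block[causal] = 0.0                   # causal attention among text tokens
    return mask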

To use the 7B-class model with OPT, merely change the model name from facebook/opt-350m to facebook/opt-6.7b.

For LLaMA, with LLaMA2 now available, that will be the model of choice. To use this pre-trained model, approvals from both Meta and Hugging Face are needed. A Hugging Face account is necessary, so make sure to set one up. Approvals typically come within a few hours. Afterwards, log into Hugging Face on the terminal where training is executed.

huggingface-cli login

You can log in using the token created in Hugging Face account → Settings → Access Token.
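Alternatively, the login can be done programmatically with the huggingface_hub library (the token string below is only a placeholder):

from huggingface_hub import login

# Log in with the access token created under Settings -> Access Tokens.
login(token="hf_xxxxxxxxxxxxxxxx")  # placeholder token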

Training parameters remain consistent, using the COCO dataset and lasting for 3 epochs. Based on results from Experiment 1, the modules set for fine-tuning were Projection + LoRA.

Let’s take a look at the results.

This figure shows training loss (figure made by the author)
This figure shows validation loss (figure made by the author)

Reviewing the loss, it’s apparent that the models using LLaMA2 and MPT as LLM show a more satisfactory reduction. Let’s also observe the inference results.

Example results of GIT-LLMs. Pictures are cited from M3IT dataset, and text results were made by the author’s model

Regarding the first image, for all models, the expressions seem more natural compared to OPT-350m. There are no bizarre expressions like “a banana with a banana”, highlighting the strength of the LLM. For the second image, there is still some difficulty with phrases like “a traffic light” or “a building”. For such complex images, there might be a need to consider upgrading the ViT model.

Finally, let’s run inference on images that became popular with GPT-4.

Example results of GIT-LLMs. A picture is cited from here, and text results were made by the author’s models

Although fluent responses were anticipated since an LLM is in use, the outcomes are quite simple. This might be because the model was trained solely on COCO.

Given the underwhelming results of the previous experiment, it was decided to incorporate data other than COCO for training. The M3IT dataset currently in use is quite comprehensive, and it can handle a significant amount of data in the same format as COCO.

This table is cited from Table 3 of “M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning”

The intention is to use data from this source, excluding the “Chinese” and “Video” categories. Originally, the COCO training dataset contained 566,747 examples. By combining it with additional sources, this increased to 1,361,650. Although the size has roughly doubled, the dataset should be of higher quality thanks to the increased diversity of tasks.

Handling multiple PyTorch datasets can easily be done with ConcatDataset.

dataset_list = [datasets.load_dataset("MMInstruction/M3IT", i) for i in m3it_name_list]
train_dataset = torch.utils.data.ConcatDataset([d["train"] for d in dataset_list])
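Here, m3it_name_list is the list of M3IT subset names to load. The names below are only an illustrative selection; the training described above uses all tasks except the “Chinese” and “Video” categories, and the exact config names should be checked against the M3IT dataset card:

# Illustrative subset names only, not the full list used for training.
m3it_name_list = ["coco", "textcap", "vqa-v2", "okvqa", "a-okvqa"]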

The training was conducted for 1 epoch with the LLaMA2 model, fine-tuning the Projection and LoRA as in Experiment 2.

As there’s no loss to compare to this time, let’s dive straight into the inference results.

Example results of GIT-LLaMA2. Pictures are cited from M3IT dataset, and text results were made by the author’s model
Example results of GIT-LLaMA2. Pictures are cited from M3IT dataset, and text results were made by the author’s model
Example results of GIT-LLaMA2. Pictures are cited from M3IT dataset, and text results were made by the author’s model

Along with solving simple problems, the model now handles more complex challenges. By adding datasets for tasks more intricate than just captioning, the capabilities have expanded significantly. Achieving this level of accuracy with only 1 epoch of training was surprising.

Let’s test it with the following example image. Given the increased variety in the dataset, the way the questions were presented was slightly modified.

Example results of GIT-LLaMA2. A picture is cited from here, and text results were made by the author’s models

While the description “Umbrella” was still weird, it feels like the model is getting better. To improve further, there’s a need to increase the number of training epochs, add more types or volumes of datasets, and leverage a more powerful ViT or LLM. Nonetheless, it’s impressive that such a model could be developed in just half a day given the computational and data resources.

