DistilVit ~ Image Captioning Model

Ankur Kumar released a popular model on Hugging Face to generate captions for images and blogged about it.

This model was also published as ONNX weights by Xenova so it can be used in Transformers.js; see https://huggingface.co/Xenova/vit-gpt2-image-captioning

The model does a pretty good job, even if in some cases I had better results with https://huggingface.co/microsoft/git-base-coco. But the GIT architecture is not yet supported by the ONNX converters, and my current understanding of these architectures is that most of the accuracy comes from large, high-quality training data. So for now I am betting that I can get good results with ViT.

Ankur used the google/vit-base-patch16-224-in21k image encoder and the GPT2 text decoder and fine-tuned them on the COCO dataset, which contains 120k labeled images.

I wanted to reduce the model size and speed it up a little bit, so I decided to build the same model with DistilGPT2 in place of GPT2. According to its documentation, DistilGPT2 is 2 times faster and 33% smaller.
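To make the swap concrete, here is a minimal sketch of how a ViT encoder and a DistilGPT2 decoder can be paired with the `transformers` library. It is not the code from the training script: the `build_captioner` helper is a hypothetical name, the `distilgpt2` model id is an assumption, and reusing the EOS token for padding is a common workaround rather than something the original post confirms.

```python
ENCODER_ID = "google/vit-base-patch16-224-in21k"  # encoder named in the post
DECODER_ID = "distilgpt2"  # assumed Hugging Face id for DistilGPT2


def build_captioner(encoder_id=ENCODER_ID, decoder_id=DECODER_ID):
    """Pair a pretrained image encoder with a pretrained text decoder (hypothetical helper)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import (
        AutoTokenizer,
        ViTImageProcessor,
        VisionEncoderDecoderModel,
    )

    # The cross-attention layers linking the two halves are freshly
    # initialized, which is why fine-tuning on captions is required.
    model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_id, decoder_id
    )
    image_processor = ViTImageProcessor.from_pretrained(encoder_id)
    tokenizer = AutoTokenizer.from_pretrained(decoder_id)

    # GPT-2 style tokenizers ship without a padding token; reusing EOS
    # is an assumption made for this sketch.
    tokenizer.pad_token = tokenizer.eos_token
    model.config.decoder_start_token_id = tokenizer.bos_token_id
    model.config.pad_token_id = tokenizer.pad_token_id
    return model, image_processor, tokenizer
```

Calling `build_captioner()` downloads both sets of pretrained weights, so it needs network access the first time it runs.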

I took Ankur's code snippets and recreated a training script that is published here: https://github.com/tarekziade/distilvit

The major differences are:

- Once tokenized, the dataset (450GiB) is saved so it can be reused.
- The training resumes from the last checkpoint on failure.
- Only the last 10 checkpoints are kept, because each one is 2.1GiB and it's easy to fill your disks.
- Some labels were breaking the training because they had different sizes, so I had to create a data collator to fix this. I am not sure why this is happening, since the tokenizer is supposed to pad everything.
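As an illustration of the last point, here is a minimal pure-Python sketch of label padding in a collator. It is not the collator from the repository: the `pad_labels` name is hypothetical, and padding with -100 is an assumption based on the fact that PyTorch's cross-entropy loss ignores that value by default.

```python
IGNORE_INDEX = -100  # assumed padding value; skipped by PyTorch's cross-entropy loss by default


def pad_labels(features, pad_to=None):
    """Return a copy of the batch where every "labels" list has the same length."""
    # Use the longest label sequence in the batch unless a length is given.
    target = pad_to or max(len(f["labels"]) for f in features)
    padded = []
    for f in features:
        labels = list(f["labels"])[:target]  # truncate if longer than target
        labels += [IGNORE_INDEX] * (target - len(labels))  # pad if shorter
        padded.append({**f, "labels": labels})
    return padded


batch = pad_labels([{"labels": [5, 6, 7]}, {"labels": [8]}])
print([f["labels"] for f in batch])  # [[5, 6, 7], [8, -100, -100]]
```

In a real training loop the padded lists would then be stacked into a tensor together with the pixel values.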

The training took 45 hours on my 2xRTX4090 GPUs, but one GPU was often idling and the other one was underused. I assume this is because some of the image processing is done on the CPU. I also had several crashes on GPU parallelization, which seems to be a bug somewhere in CUDA or torch, but I could resume every time.

I published the model at https://huggingface.co/tarekziade/distilvit and the quantized ONNX weights are down to 87MiB for the encoder and 98MiB for the decoder, so the whole system weighs under 200MiB when used in Transformers.js.

I've noticed a 30% speedup on average with the non-quantized version on my M1, and, compared to the original model, the generated text was also properly capitalized and punctuated (see infere.py in my repo). I am not sure why, because the COCO dataset has uncased labels, sometimes with no periods.

The metrics at the end of the training were:

eval_loss: 0.19939416646957397
eval_rouge1: 43.006
eval_rouge2: 16.9939
eval_rougeL: 38.8923
eval_rougeLsum: 38.8877
eval_gen_len: 11.327256736227712
eval_runtime: 1816.5255
eval_samples_per_second: 13.77
eval_steps_per_second: 1.721
train_runtime: 46263.3695
train_samples_per_second: 38.373
train_steps_per_second: 4.797
train_loss: 0.05974134062104816

My interpretation is that the delta between the eval loss and the train loss suggests there's room for improvement if the model gets trained on more data, and maybe for more epochs. So I am going to add the Flickr30k dataset alongside COCO and train on both to see if it gets better.
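To put a number on that delta, here is a small calculation using only the metrics listed above. Reading the samples-to-steps ratio as a batch size is an inference from those figures, not something the training logs state.

```python
# Values copied from the end-of-training metrics.
eval_loss = 0.19939416646957397
train_loss = 0.05974134062104816

# How many times larger the eval loss is than the train loss.
loss_ratio = eval_loss / train_loss

# Samples processed per optimizer/eval step, which hints at the batch size.
train_samples_per_step = 38.373 / 4.797
eval_samples_per_step = 13.77 / 1.721

print(f"eval/train loss ratio: {loss_ratio:.2f}")  # 3.34
print(f"train samples per step: {train_samples_per_step:.1f}")  # 8.0
print(f"eval samples per step: {eval_samples_per_step:.1f}")  # 8.0
```

The eval loss is a bit over three times the train loss, which is the gap discussed above.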

The results are still pretty good. So it seems worth pursuing.
