Image Captioning is the task of automatically generating a textual description for an image. It combines a CNN (to extract visual features) with Natural Language Processing (to generate the caption).
The entire code is in the Jupyter notebook, which should make it easier to follow.
Dependencies
1. Keras 2.3.1
2. Tensorflow-gpu 2.2.0
3. tqdm
4. numpy
5. pandas
6. matplotlib
7. pickle
8. PIL (Pillow)
9. glob
Important: this code is implemented using Tensorflow-gpu, so you need an NVIDIA GPU with the corresponding drivers installed.
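A quick way to confirm that TensorFlow can actually see your GPU before training:

```python
import tensorflow as tf

# Lists the GPUs visible to TensorFlow; an empty list means the
# driver/CUDA setup is not working and training will fall back to CPU.
print(tf.config.list_physical_devices('GPU'))
```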
I have used the Flickr8k dataset (about 1 GB). MS-COCO and Flickr30k are other datasets that you can use. Flickr8k contains:
1. training images: 6,000
2. validation images: 1,000
3. test images: 1,000
Each image has 5 captions describing it (see the parsing sketch below).
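As a reference point, here is a minimal sketch of how the caption file can be parsed. The file name `Flickr8k.token.txt` and its tab-separated `image.jpg#n<TAB>caption` layout follow the standard Flickr8k distribution; adjust if your copy differs:

```python
from collections import defaultdict

def load_captions(token_file):
    """Map each image file name to its list of 5 captions."""
    captions = defaultdict(list)
    with open(token_file, 'r') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            image_id, caption = line.split('\t')
            image_id = image_id.split('#')[0]  # drop the '#0'..'#4' suffix
            captions[image_id].append(caption)
    return captions

# captions = load_captions('Flickr8k.token.txt')  # ~8,000 images x 5 captions
```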
In Image Captioning, a CNN is used to extract the features from an image, which are then fed, along with the captions, into an RNN. To extract the features, we use a model pre-trained on ImageNet. I tried out VGG-16, ResNet-50, and InceptionV3. VGG-16 has about 138 million parameters and its top-5 error on ImageNet is 7.3%. InceptionV3 has about 24 million parameters and its top-5 error on ImageNet is 3.46%. For reference, human top-5 error on ImageNet is 5.1%.
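As an illustration, below is a minimal feature-extraction sketch using the headless InceptionV3 from Keras; `pooling='avg'` collapses the final feature map into a single 2048-d vector per image. The exact extraction code in the notebook may differ:

```python
import numpy as np
from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras.preprocessing.image import load_img, img_to_array

# Headless InceptionV3: global average pooling turns the last
# convolutional feature map into one 2048-d vector per image.
feature_extractor = InceptionV3(weights='imagenet', include_top=False,
                                pooling='avg')

def extract_features(image_path):
    img = load_img(image_path, target_size=(299, 299))  # InceptionV3 input size
    x = img_to_array(img)
    x = np.expand_dims(x, axis=0)        # add the batch dimension
    x = preprocess_input(x)              # scale pixels to [-1, 1]
    return feature_extractor.predict(x)  # shape: (1, 2048)
```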
To create the model, the captions first have to be passed through an embedding layer; I set the embedding size to 300. The image below shows the model that I used.
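For readers without the image handy, here is a minimal sketch of the common "merge" encoder-decoder that this kind of model typically follows; `vocab_size` and `max_length` are placeholder values, not taken from the notebook:

```python
from keras.models import Model
from keras.layers import Input, Dense, Embedding, LSTM, Dropout, add

vocab_size = 8000   # hypothetical vocabulary size
max_length = 34     # hypothetical maximum caption length
embedding_dim = 300

# Image branch: project the 2048-d CNN feature into the embedding space.
image_input = Input(shape=(2048,))
img_dense = Dense(embedding_dim, activation='relu')(Dropout(0.5)(image_input))

# Caption branch: embed the partial caption and run it through an LSTM.
caption_input = Input(shape=(max_length,))
cap_embed = Embedding(vocab_size, embedding_dim, mask_zero=True)(caption_input)
cap_lstm = LSTM(embedding_dim)(Dropout(0.5)(cap_embed))

# Merge the two branches and predict the next word in the caption.
merged = add([img_dense, cap_lstm])
hidden = Dense(embedding_dim, activation='relu')(merged)
output = Dense(vocab_size, activation='softmax')(hidden)

model = Model(inputs=[image_input, caption_input], outputs=output)
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])
```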
After training the model for 50 epochs with a batch size of 512, the accuracy reached 75% and the loss dropped to 0.911.
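Assuming the training arrays have already been prepared (image features `X1`, padded partial captions `X2`, and one-hot next-word targets `y` are placeholder names), the training call would look roughly like:

```python
# Train for 50 epochs with a batch size of 512, as described above;
# the validation split is an assumption, not taken from the notebook.
model.fit([X1, X2], y, epochs=50, batch_size=512, validation_split=0.1)
```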
Finally, here are some results that I got. The code for the results is in the Jupyter notebook, and you can generate your own captions by writing some code at the end; a greedy-decoding sketch follows the example captions below.
1. True caption: Three child soccer players go for the ball .
2. True caption: A dog wading in the water with a ball in his mouth .
3. True caption: A large white bird flies over water .
4. True caption: small dog running in the grass with a toy in its mouth .
5. True caption: The girls is jumping into the air on the beach .
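As a starting point for generating your own captions, here is a sketch of greedy decoding: it repeatedly predicts the most likely next word until the end token appears. It assumes captions were wrapped in `startseq`/`endseq` tokens during training and that a fitted Keras `Tokenizer` is available, both common conventions rather than details confirmed by the notebook:

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_caption(model, tokenizer, photo_feature, max_length):
    # 'startseq'/'endseq' are assumed to be the start/end tokens
    # that wrapped every caption during training.
    text = 'startseq'
    for _ in range(max_length):
        seq = tokenizer.texts_to_sequences([text])[0]
        seq = pad_sequences([seq], maxlen=max_length)
        yhat = model.predict([photo_feature, seq])
        word = tokenizer.index_word.get(int(np.argmax(yhat)))
        if word is None or word == 'endseq':  # stop at the end token
            break
        text += ' ' + word
    return text.replace('startseq', '').strip()
```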