A. ResNet-50
The ResNet50 convolutional neural network (CNN) is a very deep network with 50 layers. Although network depth is important for neural networks, deeper networks are more challenging to train. The residual structure of ResNet50, in which skip connections add a block's input to its output, facilitates training and allows networks to be much deeper, which leads to increased performance on different tasks. In addition to being substantially deeper than its "plain" counterparts, ResNet50 has a much reduced number of parameters (weights).
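To make the residual structure concrete, the following is a minimal sketch of a ResNet-style bottleneck block in TensorFlow/Keras; the function name, filter counts, and layer arrangement are illustrative assumptions, not the authors' implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=1):
    """Illustrative ResNet-style bottleneck block with a skip connection.

    Adding the block input to its output is what makes very deep networks
    such as ResNet50 easier to optimise.
    """
    shortcut = x

    # 1x1 conv reduces channels, 3x3 conv processes features, 1x1 conv restores channels
    y = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)

    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)

    y = layers.Conv2D(4 * filters, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)

    # Project the shortcut when the spatial size or channel count changes
    if stride != 1 or shortcut.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)

    return layers.ReLU()(layers.Add()([y, shortcut]))
```

In practice, a pre-trained ResNet50 can also be loaded directly, e.g. via tf.keras.applications.ResNet50(weights="imagenet"), rather than building the blocks by hand.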
B. Inception V3
InceptionV3's fundamental design is based on GoogLeNet [21]. The utilisation of Lin's "Network in Network" technique [22], which increased the representational power of neural networks, is one of the essential components of the Inception framework. As a result, the dimensionality is reduced with 1 x 1 convolutions, lowering the computational cost. The Inception architecture was created to lower the computational expense of deep learning-based image classification. An Inception module typically contains three possible convolution sizes and one max pooling. InceptionV3's core architecture is made up of stacked Inception modules; a dropout layer, a fully connected layer with 1024 ReLU units, and an average pooling layer with a 5 x 5 filter and stride 3 are all used for dimension reduction. The channels are aggregated following the convolutional operations, and the fusion operator is then applied to the output of the preceding layer. As a result, the design helps to reduce overfitting and enhances the network's adaptability.
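As a rough illustration of how an Inception module combines several convolution sizes with 1 x 1 dimension-reduction convolutions, the sketch below builds a GoogLeNet-style module in TensorFlow/Keras; the function name and filter-count parameters are assumptions for illustration, not the exact InceptionV3 modules.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, f_pool):
    """Illustrative Inception-style module with parallel branches.

    1x1 convolutions reduce the channel dimension before the larger
    convolutions, lowering the computational cost; the branch outputs are
    aggregated by concatenation along the channel axis.
    """
    # Branch 1: 1x1 convolution
    b1 = layers.Conv2D(f1, 1, activation="relu", padding="same")(x)

    # Branch 2: 1x1 reduction followed by 3x3 convolution
    b3 = layers.Conv2D(f3_reduce, 1, activation="relu", padding="same")(x)
    b3 = layers.Conv2D(f3, 3, activation="relu", padding="same")(b3)

    # Branch 3: 1x1 reduction followed by 5x5 convolution
    b5 = layers.Conv2D(f5_reduce, 1, activation="relu", padding="same")(x)
    b5 = layers.Conv2D(f5, 5, activation="relu", padding="same")(b5)

    # Branch 4: max pooling followed by 1x1 convolution
    bp = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    bp = layers.Conv2D(f_pool, 1, activation="relu", padding="same")(bp)

    # Channel aggregation (fusion) of all branches
    return layers.Concatenate(axis=-1)([b1, b3, b5, bp])
```

The concatenation at the end is the "channel aggregation" referred to above: each branch contributes feature maps at the same spatial resolution, and the module's output stacks them along the channel dimension.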
C. VGG16
VGG16 is one of the most widely applied pre-trained CNN models. It has 16 weight layers in all, 13 of which are convolutional and three of which are fully connected [23]. The ReLU (rectified linear unit) activation function is utilised to increase the model's nonlinearity, while a Softmax layer on top of the final fully connected layer is employed for classification. There are approximately 138 million parameters in total. According to the VGG16 network configuration, the input is a 224 x 224 RGB image. The only pre-processing is to subtract the mean RGB value, computed over the training set, from each pixel. The image is passed through the first stack of two 3 x 3 convolution layers, each followed by a ReLU activation; each layer has 64 filters. To retain spatial resolution after convolution, the convolution stride is set to 1 pixel. All of the VGG16 network's hidden layers employ the ReLU activation function.
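The sketch below expresses the input pre-processing and the first stack described above in TensorFlow/Keras; the mean RGB values shown are the commonly quoted ImageNet training-set means, and the variable names are illustrative assumptions rather than the paper's code.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Commonly quoted ImageNet mean RGB values used for VGG pre-processing (assumption)
VGG_MEAN_RGB = np.array([123.68, 116.779, 103.939], dtype="float32")

# 224 x 224 RGB input, mean subtraction, then two 3x3 convolutions with
# 64 filters each (stride 1, ReLU), followed by 2x2 max pooling.
inputs = tf.keras.Input(shape=(224, 224, 3))
x = layers.Lambda(lambda t: t - VGG_MEAN_RGB)(inputs)               # subtract mean RGB per pixel
x = layers.Conv2D(64, 3, strides=1, padding="same", activation="relu")(x)
x = layers.Conv2D(64, 3, strides=1, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(2, strides=2)(x)                            # halve spatial resolution
first_stack = tf.keras.Model(inputs, x)
```

For the complete pre-trained network, the standard Keras model is available via tf.keras.applications.VGG16(weights="imagenet"), which follows the same configuration.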