Model

Figure 1: Overview of our Reflective Decoding Network (RDN) for Image Captioning


Figure 2: Examples of captions generated by our RDN compared with the base decoder (a traditional LSTM), together with the reflective attention weight distributions over past generated hidden states when predicting the key words highlighted in green. Thicker lines indicate larger weights, and the red line marks the largest contribution.

Abstract

State-of-the-art image captioning methods mostly focus on improving visual features, while less attention has been paid to utilizing the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary coherence between words and the syntactic paradigm of sentences are also important for generating high-quality image captions. Following the conventional encoder-decoder framework, we propose the Reflective Decoding Network (RDN) for image captioning, which enhances both the long-sequence dependency and the position perception of words in the caption decoder. Our model learns to collaboratively attend to both visual and textual features while perceiving each word's relative position in the sentence, so as to maximize the information delivered in the generated caption. We evaluate the effectiveness of RDN on the COCO image captioning dataset and achieve superior performance over previous methods. Further experiments reveal that our approach is particularly advantageous for hard cases, i.e., images with complex scenes that are difficult to describe.
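To make the idea of attending over previously generated words concrete, the following PyTorch-style snippet sketches a textual attention module over past decoder hidden states. It is a minimal illustration, not the paper's implementation: the module name ReflectiveTextualAttention, the dimensions, and the additive scoring form are all our own assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ReflectiveTextualAttention(nn.Module):
    # Illustrative sketch: attend over the decoder's past hidden states
    # so the current prediction can "reflect" on earlier generated words.
    def __init__(self, hidden_dim, att_dim):
        super().__init__()
        self.proj_past = nn.Linear(hidden_dim, att_dim)
        self.proj_cur = nn.Linear(hidden_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, past_states, cur_state):
        # past_states: (batch, t, hidden_dim) -- hidden states h_1..h_t so far
        # cur_state:   (batch, hidden_dim)    -- current decoding state
        e = self.score(torch.tanh(
            self.proj_past(past_states) + self.proj_cur(cur_state).unsqueeze(1)
        )).squeeze(-1)                       # (batch, t) unnormalized scores
        alpha = F.softmax(e, dim=-1)         # attention weights over past words
        context = (alpha.unsqueeze(-1) * past_states).sum(dim=1)
        return context, alpha                # textual context + weights (as visualized in Figure 2)

In the full model, a textual context of this kind would be combined with the visually attended features before predicting the next word; that combination is omitted here.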

Hard Image Captioning

Download: Hard Captioning Split set (Image IDs) | CIDEr Scores for Comparison on Hard Split

Definition: Compared to the traditional captioning evaluation on the 'Karpathy' split, in this paper we further investigate the effect of the average length of the annotations (ground-truth captions) on captioning performance, since images with longer annotations on average generally contain more complex scenes and are thus harder to caption. Specifically, we rank the whole 'Karpathy' test set (5000 images) by the average length of their annotations in descending order and extract four subsets of different sizes (the full set, top-1000, top-500, and top-300 respectively), as sketched below. A smaller subset corresponds to longer annotations on average and thus to harder image captioning.
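The subsets can be reproduced from the 'Karpathy' test annotations along the following lines. This is a minimal sketch: the input is assumed to be a dict mapping each test image ID to its list of ground-truth caption strings, and all names (build_hard_splits, captions) are illustrative rather than taken from the released files.

def build_hard_splits(captions, sizes=(1000, 500, 300)):
    # captions: image_id -> list of ground-truth caption strings
    avg_len = {
        img_id: sum(len(c.split()) for c in caps) / len(caps)
        for img_id, caps in captions.items()
    }
    # Rank images by average annotation length, longest first.
    ranked = sorted(avg_len, key=avg_len.get, reverse=True)
    splits = {"all": ranked}
    for k in sizes:
        splits[f"top-{k}"] = ranked[:k]   # smaller subset = longer annotations = harder
    return splits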


Figure 3: Performance comparison between our RDN model and Up-Down on hard image captioning as a function of the average length of annotations (ground-truth captions). We rank the 'Karpathy' test set by the average length of their annotations in descending order and extract four subsets of different sizes. A smaller subset corresponds to longer annotations on average and thus to harder captioning. Our model shows a larger advantage over Up-Down in the harder cases.

The hard image captioning experiment compares our RDN with Up-Down (the main difference between the two models is that Up-Down uses a traditional LSTM decoder). The performance of both models decreases as the average length of the annotations increases, reflecting that captioning becomes harder. However, our model shows a larger advantage over Up-Down in the harder cases, which in turn validates the ability of RDN to capture long-term dependencies within captions.
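For reference, per-subset CIDEr scores like those provided in the download above can be computed with the standard COCO caption evaluation toolkit. The sketch below uses pycocoevalcap's Cider scorer; the gts/res dictionaries are assumed to follow its usual format (image ID mapped to lists of tokenized caption strings), and the function name is ours.

from pycocoevalcap.cider.cider import Cider

def cider_on_subset(gts, res, subset_ids):
    # gts: image_id -> list of reference captions (tokenized strings)
    # res: image_id -> single-element list with the generated caption
    gts_sub = {i: gts[i] for i in subset_ids}
    res_sub = {i: res[i] for i in subset_ids}
    score, _ = Cider().compute_score(gts_sub, res_sub)
    return score

Running this on the full set, top-1000, top-500, and top-300 image IDs yields the per-subset comparison plotted in Figure 3.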

Citation

Bibtex

@InProceedings{Ke_2019_ICCV,
author = {Ke, Lei and Pei, Wenjie and Li, Ruiyu and Shen, Xiaoyong and Tai, Yu-Wing},
title = {Reflective Decoding Network for Image Captioning},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}
}