Review on a Deep Learning that Make Image Speaks Naturally

Posted by Mohamad Ivan Fanany

Printed version

This writing summarizes and reviews a deep learning which make image speaks naturally: Deep Visual-Semantic Alignments for Generating Image Descriptions


  • A quick glance at an image is sufficient for a human to point out and describe an immense amount of details about the visual scene [8].
  • Still, such remarkable ability is an elusive task for our visual recognition models.

Addressed problem:

  • Build a model that generates free-form natural language descriptions of image regions.
  • Strives to take a step towards the goal of generating dense, free-form descriptions of images


  • Flickr8K [15], Flickr30K [49] and MSCOCO [30]
  • These datasets contain 8,000, 31,000 and 123,000 images respectively and each is annotated with 5 sentences using Amazon Mechanical Turk.
  • For Flickr8K and Flickr30K, 1,000 images for validation, 1,000 for testing and the rest for training (consistent with [15], 18).
  • For MSCOCO, 5,000 images for both validation and testing.

Data Preprocessing:

  • Convert all sentences to lowercase, discard non-alphanumeric characters.
  • Filter words to those that occur at least 5 times in the training set, which results in 2538, 7414, and 8791 words for Flickr8k, Flickr30K, and MSCOCO datasets respectively.

Previous works:

  • Majority of previous work in visual recognition has focused on labeling images with a fixed set of visual categories, and great progress has been achieved in these endeavors [36, 6].
  • Some pioneering approaches that address the challenge of generating image descriptions have been developed [22, 7].
  • The focus of these previous works has been on reducing complex visual scenes into a single sentence, which the authors consider as an unnecessary restriction.

Dense Image Annotation:

  • Barnard et al. [1] and Socher et al. [41] studied the multimodal correspondence between words and images to annotate segments of images.
  • [27, 12, 9] studied the problem of holistic scene understanding in which the scene type, objects and their spatial support in the image is inferred.The difference: the focus of these previous works is on correctly labeling scenes, objects and regions with a fixed set of categories, while the focus of the reviewed paper is on richer and higher-level descriptions of regions.

Generating textual description:

  • Pose the task as a retrieval problem, where the most compatible annotation in the training set is transferred to a test image [15, 42, 7, 36, 17], or where training annotations are broken up and stitched together [ 24, 28, 25].
  • Generating image captions based on fixed templates that are filled based on the content of the image [13, 23, 7, 46, 47, 4]. This approach still imposes limits on the variety of outputs, but the advantage is that the final results are more likely to be syntactically correct.
  • Instead of using a fixed template, some approaches that use a generative grammar have also been developed [35, 48].
  • Srivastava et al. [43] uses a Deep Boltzmann Machine to learn a joint distribution over a images and tags. However, they do not generate extended phrases.
  • Kiros et al. [20] developed a log-bilinear model that can generate full sentence descriptions. However, their model uses a fixed window context, while the proposed Recurrent Neural Network model can condition the probability distribution over the next word in the sentence on all previously generated words.
  • Mao et al. [31] introduced a multimodal Recurrent Neural Network architecture for generating image descriptions on the full image level, but their model is more complex and incorporates the image information in a stream of processing that is separate from the language model.

Grounding natural language in images:

  • Kong et al. [21] develop a Markov Random Field that infers correspondences from parts of sentences to objects to improve visual scene parsing in RGBD images.
  • Matuszek et al. [32] learn joint language and perception model for grounded attribute learning in a robotic setting.
  • Zitnick et al. [51] reason about sentences and their grounding in cartoon scenes.
  • Lin et al. [29] retrieve videos from a sentence description using an intermediate graph representation.
  • The basic form of the proposed model is inspired by Frome et al. [10] who associate words and images through a semantic embedding.
  • Karpathy et al. [18], who decompose images and sentences into fragments and infer their inter-modal alignment using a ranking objective.The difference:Previous model is based on grounding dependency tree relations, whereas the proposed model aligns contiguous segments of sentences which are more meaningful, interpretable, and not fixed in length.

Neural network in visual and language domains:

  • On the image side, Convolutional Neural Networks (CNNs) [26, 22] have recently emerged as a powerful class of models for image classification and object detection [38].
  • On the sentence side, the proposed work takes advantage of pretrained word vectors [34, 16,2] to obtain low-dimensional representations of words.
  • Recurrent Neural Networks have been previously used in language modeling [33, 44], but this paper additionally conditions these models on images.


  • Design of a model that is rich enough to reason simultaneously about contents of images and their representation in the domain of natural language.
  • The model should be free of assumptions about specific hard-coded templates, rules or categories and instead rely primarily on training data.
  • Datasets of image captions are available in large quantities on the internet, but these descriptions multiplex mentions of several entities whose locations in the images are unknown.

Key ideas:

  • The model leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between text and visual data.
  • The model is based on a novel combination of:
    • Convolutional Neural Networks over image regions
    • Bidirectional Recurrent Neural Networks over sentences,
    • Structured objective that aligns the two modalities through a multimodal embedding.
  • Closed vocabularies of visual concepts constitute a convenient modeling assumption, however, they are vastly restrictive when compared to the enormous amount of rich descriptions that a human can compose.
  • Treating the sentences as weak labels, in which contiguous segments of words correspond to some particular, but unknown location in the image.
  • Infer the alignments of word segments and use them to learn a generative model of descriptions.


  • Develop a deep neural network model that infers the latent alignment between segments of sentences and the region of the image that they describe.
  • Associates the two modalities through a common, multimodal embedding space and a structured objective.
  • Validate the effectiveness of the proposed approach on image-sentence retrieval experiments in which the proposed models surpass the state-of-the-art.
  • Introduce a multimodal Recurrent Neural Network architecture that takes an input image and generates its description in text.
  • The generated sentences significantly outperform retrieval based baselines, and produce sensible qualitative predictions.
  • Train the model on the inferred correspondences and evaluate its performance on a new dataset of region-level annotations.
  • Code, data and annotations are publicly available.


  • The input is a set of images and their corresponding sentence descriptions.
  • Present a model that aligns segments of sentences to the visual regions that they describe through a multimodal embedding.
  • Treat these correspondences as training data for our multimodal Recurrent Neural Network model which learns to generate the descriptions.

Representing Images:

  • Sentence descriptions make frequent references to objects and their attributes. [23], 18].
  • Girshick et al. [11] detect objects in every image with a Region Convolutional Neural Network (RCNN). The CNN is pre-trained on ImageNet [3] and finetuned on the 200 classes of the ImageNet Detection Challenge [38].
  • Karpathy et al. [18], use the top 19 detected locations and the whole image and compute the representations based on the pixels inside each bounding box.
  • CNN transforms the pixels inside bounding box into 4096-dimensional activations of the fully connected layer immediately before the classifier.
  • The CNN parameters contain approximately 60 million parameters and the architecture closely follows the network of Krizhevsky et al [22].
  • The weights matrix has dimensions h × 4096, where h is the size of the multimodal embedding space (h ranges from 1000-1600 in experiments). Every image is thus represented as a set of h-dimensional vectors.

Representing Sentences:

  • Represent the words in the sentence in the same h dimensional embedding space that the image regions occupy.
  • The simplest approach: project every individual word directly into this embedding.
    Shortcomings: does not consider any ordering and word context information in the sentence.
  • Extended approach: use word bigrams, or dependency tree relations as previously proposed [18].
    Shortcomings: still imposes an arbitrary maximum size of the context window and require the use of Dependency Tree Parsers that might be trained on unrelated text corpora.
  • Use a bidirectional recurrent neural network (BRNN) [39] to compute the word representations.
  • The BRNN takes a sequence of N words (encoded in a 1-of-k representation) and transforms each one into an h-dimensional vector.
  • The representation of each word is enriched by a variably-sized context around that word.
  • The weights specify a word embedding matrix that is initialized with 300-dimensional word2vec [34].
  • the BRNN consists of two independent streams of processing, one moving left to right and the other right to left.
  • The final h-dimensional representation for the word is a function of both the word at that location and also its surrounding context in the sentence.
  • Every word representation is a function of all words in the entire sentence, but the empirical finding is that the final word representations align most strongly to the visual concept of the word at that location.

Image and Sentence Alignments:

  • Map every image and sentence into a set of vectors in a common h dimensional space.
  • Labels are at the level of entire images and sentences.
  • Formulate an image-sentence score as a function of the individual scores that measure how well a word aligns to a region of an image.
  • Intuitively, a sentence-image pair should have a high matching score if its words have a confident support in the image.
  • Karpathy et al. [18], interpreted the dot product between an image fragment and a sentence fragment as a measure of similarity and used these to define the score between the image and the sentence.
  • Every word aligns to the single best image region.
  • The objective function encourages aligned image-sentences pairs to have a higher score than misaligned pairs, by a margin.


  • Evaluate a compatibility score between all pairs of test images and sentences.
  • Report the median rank of the closest ground truth result in the list and Recall @K, which measures the fraction of times a correct item was found among the top K results.


  • Compare the proposed full model (“Our model: BRNN”) to the following baselines:
    • DeViSE [10]: a model that learns a score between words and images.
    • Karpathy et al. [18]: averaged the word and image region representations to obtain a single vector for each modality.
    • Socher et al. [42] is trained with a similar objective, but instead of averaging the word representations, they merge word vectors into a single sentence vector with a Recursive Neural Network.
    • Kiros et al. [19] who use an LSTM [14] to encode sentences, and they reported results on Flickr8K and Flickr30K. They outperform the proposed model with a more powerful CNN (OxfordNet[40]).
  • In all of these cases, the proposed full model (“Our model: BRNN”) provides consistent improvements.


  • The proposed model (RNN) can only generates a description of one input array of pixels at a fixed resolution. A more sensible approach might be to use multiple saccades around the image to identify all entities, their mutual interactions and wider context before generating a description.
  • The RNN couples the visual and language domains in the hidden representation only through additive interactions, which are known to be less expressive than more complicated multiplicative interactions [44, 14].
  • Going directly from an image-sentence dataset to region-level annotations as part of a single model that is trained end-to-end with a single objective remains an open problem.

My Review:

  • The paper is a nice reading since the results are encouraging and interesting.
  • The paper gives comprehensive summary on state-of-the-arts image to text researches.
  • Unfortunately, the paper did not mention any strategy nor direction for future research.
  • As explained in the limitations, still the proposed model does not as intelligence as human to describe an image, since it should learn from pair image-text examples provided by mechanical turk. More or less, it is similar to a condition where we were asked to describe an image but only using some given texts. The true challenges is actually when there are no texts to start with. This seems requires exponentially more complex automatic objects and relation between objects description system.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s