From Images To Sentences through Scene Description Graphs

Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos, Cornelia Fermuller


In this paper we propose the construction of linguistic descriptions of images. This is achieved through the extraction of scene description graphs (SDGs) from visual scenes using an automatically constructed knowledge base. SDGs are constructed using both vision and reasoning. Specifically, commonsense reasoning1 is applied to (a) detections obtained from existing perception methods on given images, (b) a "commonsense" knowledge base constructed using natural language processing of image annotations, and (c) lexical ontological knowledge from resources such as WordNet. Amazon Mechanical Turk (AMT)-based evaluations on the Flickr8k, Flickr30k, and MS-COCO datasets show that in most cases, sentences auto-constructed from SDGs obtained by our method give a more relevant and thorough description of an image than a recent state-of-the-art image caption-based approach. Our image-sentence alignment evaluation results are also comparable to those of recent state-of-the-art approaches.
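To make the idea of a scene description graph concrete, here is a minimal, hypothetical sketch in Python. The class name, relation labels, and linearization below are illustrative assumptions for exposition, not the paper's actual implementation: an SDG can be modeled as a graph whose nodes are detected concepts (entities, actions, scenes) and whose edges carry semantic relations inferred from perception and commonsense knowledge.

```python
from dataclasses import dataclass, field

@dataclass
class SDG:
    """Minimal scene description graph sketch: nodes are detected
    concepts, edges are (subject, relation, object) triples."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)

    def add_relation(self, subj, rel, obj):
        # Register both endpoints as nodes and record the triple.
        self.nodes.update({subj, obj})
        self.edges.append((subj, rel, obj))

    def to_sentence(self):
        # Naive linearization: join triples into a crude description.
        # A real system would use templates or a language model here.
        return "; ".join(f"{s} {r} {o}" for s, r, o in self.edges)

# Hypothetical detections from an image of a dog catching a frisbee
g = SDG()
g.add_relation("dog", "agent_of", "catch")
g.add_relation("catch", "object", "frisbee")
g.add_relation("dog", "located_in", "park")
print(g.to_sentence())
# prints: dog agent_of catch; catch object frisbee; dog located_in park
```

The point of the sketch is the separation of concerns: perception proposes nodes, reasoning proposes edges, and sentence generation is a downstream traversal of the graph rather than a single end-to-end mapping.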

1 Commonsense reasoning and commonsense knowledge can be of many types (Davis). Commonsense knowledge can belong to different levels of abstraction (e.g., ConceptNet, CYC). In this paper, we focus on capturing and reasoning with knowledge about natural activities.


Recently, researchers have revived the power of deep neural networks (RNNs, CNNs, and LSTM encoder/decoders; LSTMs were first published by Hochreiter & Schmidhuber in 1997) and applied them to sequential inputs and outputs in application scenarios ranging from text to images and videos.

For the domain of static images, starting with the work of Karpathy, several systems have been published that successfully generate image captions (i.e., sequences of well-connected, meaningful words) for non-domain-specific images. These captions have been shown to outperform previous approaches.

Our work, however, was motivated by both the success and failure cases of neural networks. Although the accuracy achieved in individual object and scene detection outperforms previously reported results, there are several images (see our paper for examples) for which there is hardly any correlation between the generated caption and the image. In a sense, such systems lack interpretability (i.e., justifiability), which is caused by their lack of explicit modeling of commonsense knowledge. Our motivation is precisely to bring commonsense reasoning and knowledge into the currently community-accepted end-to-end learning paradigm.

ArXiv Version:

Please download our paper from here. Based on this paper, we proposed a general architecture for image understanding in: “DeepIU: An Architecture for Image Understanding“.

Full Architecture:


Vision-Reasoning Architecture that utilizes Commonsense Knowledge Base to create a Descriptive Graph from a static Image


The Revised General Architecture:


Flickr 8k Scene Constituent annotation:

You can find the constituent annotations and predictions for the Flickr8k and Flickr30k test images here.


Website for Visualization:

You can browse some of our image description and SDG results here. (Acknowledgement: the original template for the website was obtained from the NeuralTalk website.)


You can find the detailed comparative AMT results on thoroughness and relevance for the Flickr8k, Flickr30k, and MS-COCO test images here.

Related Work (Papers citing the work):

  • Chen, Hua, Antoine Trouve, Kazuaki J. Murakami, and Akira Fukuda. “Semantic image retrieval for complex queries using a knowledge parser.” Multimedia Tools and Applications (2017): 1-19.
  • Dai B, Zhang Y, Lin D. “Detecting Visual Relationships with Deep Relational Networks.”, arXiv preprint arXiv:1704.03114. 2017 Apr 11.
  • Bouchakwa, Mariam, Yassine Ayadi, and Ikram Amous. “Modeling the semantic content of the socio-tagged images based on the extended conceptual graphs formalism.” Proceedings of the 14th International Conference on Advances in Mobile Computing and Multi Media. ACM, 2016.
  • Wiriyathammabhum, P., Summers-Stay, D., Fermüller, C., & Aloimonos, Y. (2016). Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics. ACM Computing Surveys (CSUR) 2016,49(4), 71.
  • Muraoka, Masayasu, Sumit Maharjan, Masaki Saito, Kota Yamaguchi, Naoaki Okazaki, Takayuki Okatani, and Kentaro Inui. “Recognizing Open-Vocabulary Relations between Objects in Images.”, PACLIC 2016 [.pdf]
  • Kilickaya, Mert, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. “Re-evaluating Automatic Metrics for Image Captioning.” arXiv preprint arXiv:1612.07600, 2016. [.pdf]
  • Frank Keller. “Jointly Representing Images and Text: Dependency Graphs, Word Senses, and Multimodal Embeddings.” Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion. ACM, 2016 [abstract,.pdf]
  • Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould, SPICE: Semantic Propositional Image Caption Evaluation, CVPR 2016  [.pdf]
  • Y Yang, Y Li, C Fermuller, Y Aloimonos, Neural Self Talk: Image Understanding via Continuous Questioning and Answering. [.pdf]
  • S Aditya, C Baral, Y Yang, C Fermuller, Y Aloimonos, DeepIU: An Architecture for Image Understanding, Advances of Cognitive Systems 2016 [.pdf]


Citation:

@article{aditya2015images,
    title={From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge},
    author={Aditya, Somak and Yang, Yezhou and Baral, Chitta and Fermuller, Cornelia and Aloimonos, Yiannis},
    journal={arXiv preprint arXiv:1511.03292},
    year={2015}
}