From Images To Sentences through Scene Description Graphs

Somak Aditya, Yezhou Yang, Chitta Baral, Yiannis Aloimonos, Cornelia Fermüller


In this paper we propose the construction of linguistic descriptions of images. This is achieved through the extraction of scene description graphs (SDGs) from visual scenes using an automatically constructed knowledge base. SDGs are constructed using both vision and reasoning. Specifically, commonsense reasoning¹ is applied to (a) detections obtained from existing perception methods on given images, (b) a “commonsense” knowledge base constructed using natural language processing of image annotations, and (c) lexical ontological knowledge from resources such as WordNet. Amazon Mechanical Turk (AMT)-based evaluations on the Flickr8k, Flickr30k and MS-COCO datasets show that in most cases, sentences auto-constructed from SDGs obtained by our method give a more relevant and thorough description of an image than a recent state-of-the-art image-caption-based approach. Our Image-Sentence Alignment Evaluation results are also comparable to those of recent state-of-the-art approaches.

1 Commonsense reasoning and commonsense knowledge can be of many types (Davis). Commonsense knowledge can belong to different levels of abstraction (ConceptNet, CYC). In this paper, we focus on capturing, and reasoning with, knowledge about natural activities.


Recently, researchers have revived the power of deep neural networks (RNNs, CNNs, and LSTM encoder/decoders; LSTMs were first published in Hochreiter et al., 1997) and applied them to sequential inputs and outputs in different application scenarios, ranging from text to images and videos.

For the domain of static images, starting with Karpathy, several works have been published that successfully generate image captions (i.e., a sequence of well-connected, meaningful words) for a non-domain-specific image. These captions have been shown to outperform those of previous approaches.

Our work, however, was motivated by both the successes and the failure cases of neural network-based models. Although the accuracy achieved in individual object and scene detection outperforms previously reported results, there are several images (see our work for examples) for which there is hardly any correlation between the generated caption and the image. In a sense, such a system lacks interpretability (i.e., justifiability), which is caused by its lack of explicit modeling of “commonsense” knowledge. Our motivation is precisely to bring commonsense reasoning and knowledge into the current community-accepted end-to-end learning paradigm.

ArXiv Version and Subsequent Conference/Journal Articles:

Please download our paper from here. Based on the arXiv draft, we further proposed a general architecture for image understanding in “DeepIU: An Architecture for Image Understanding” (ACS 4, 2016). An extension of this work (a more complete version with new experiments), “Image Understanding using Vision and Reasoning through Scene Description Graph”, has been accepted at the CVIU journal (December 2017). A pre-print draft is available from here.

Full Architecture:


Vision-Reasoning Architecture that utilizes Commonsense Knowledge Base to create a Descriptive Graph from a static Image
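As a rough illustration of the kind of output such an architecture targets, the sketch below builds a toy scene description graph from detections and a small knowledge base, then renders it as a sentence. It is not the authors' implementation: the `SDG` class, the edge labels (`agent`, `recipient`, `location`), the detection scores, the knowledge-base entries, and the scoring rule are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class SDG:
    """Toy scene description graph: labeled directed edges between node names."""
    edges: list = field(default_factory=list)  # (source, relation, target)

    def add(self, source, relation, target):
        self.edges.append((source, relation, target))

    def targets(self, source, relation):
        return [t for s, r, t in self.edges if s == source and r == relation]


# Hypothetical detector output: label -> confidence (values are invented).
detections = {"person": 0.92, "horse": 0.88, "frisbee": 0.15}
scene = "field"

# Hypothetical knowledge base: (agent, object) -> {event: co-occurrence score}.
knowledge_base = {
    ("person", "horse"): {"ride": 0.7, "feed": 0.2},
    ("person", "frisbee"): {"throw": 0.8},
}


def build_sdg(detections, scene, kb, threshold=0.5):
    """Keep confident detections, then pick the best-supported event."""
    kept = {label for label, conf in detections.items() if conf >= threshold}
    best = None  # (score, event, agent, obj)
    for (agent, obj), events in kb.items():
        if agent in kept and obj in kept:
            for event, prior in events.items():
                # Assumed scoring rule: knowledge-base prior times detection confidences.
                score = prior * detections[agent] * detections[obj]
                if best is None or score > best[0]:
                    best = (score, event, agent, obj)
    graph = SDG()
    if best is not None:
        _, event, agent, obj = best
        graph.add(event, "agent", agent)
        graph.add(event, "recipient", obj)
        graph.add(event, "location", scene)
    return graph


def to_sentence(graph):
    """Template-based rendering of the single event in the toy graph."""
    if not graph.edges:
        return ""
    event = graph.edges[0][0]
    agent = graph.targets(event, "agent")[0]
    obj = graph.targets(event, "recipient")[0]
    place = graph.targets(event, "location")[0]
    return f"A {agent} {event}s a {obj} in a {place}."


sdg = build_sdg(detections, scene, knowledge_base)
print(to_sentence(sdg))  # A person rides a horse in a field.
```

The low-confidence `frisbee` detection is dropped before reasoning, so the knowledge base only has to choose between events that involve the person and the horse. Keeping the graph as an explicit intermediate structure is what makes the final sentence traceable back to individual detections and knowledge-base entries.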

The Revised General Architecture:


Flickr 8k Scene Constituent annotation:

You can find the constituent annotations and predictions for the Flickr8k and Flickr30k test images here.


The website for Visualization:

You can browse some of our image description and SDG results here. (Acknowledgement: the original template for the website was obtained from the NeuralTalk website.)


You can find the detailed comparative AMT results on thoroughness and relevance for the Flickr8k, Flickr30k and MS-COCO test images here (re-coined as the COMPOSITE dataset by Anderson et al., 2016). If you use this data for your research, please cite the journal version of the previous arXiv draft (see the BibTeX below).

Related Work (Papers citing the work):

  • Doore, Stacy, Kate Beard, and Nicholas Giudice. “Spatial Prepositions in Natural-Language Descriptions of Indoor Scenes.” In International Conference on Spatial Information Theory, pp. 255-260. Springer, Cham, 2017.
  • Chen, Hua, Antoine Trouve, Kazuaki J. Murakami, and Akira Fukuda. “Semantic image retrieval for complex queries using a knowledge parser.” Multimedia Tools and Applications (2017): 1-19.
  • Dai B, Zhang Y, Lin D. “Detecting Visual Relationships with Deep Relational Networks.” arXiv preprint arXiv:1704.03114, 2017.
  • Bouchakwa, Mariam, Yassine Ayadi, and Ikram Amous. “Modeling the semantic content of the socio-tagged images based on the extended conceptual graphs formalism.” Proceedings of the 14th International Conference on Advances in Mobile Computing and Multi-Media. ACM, 2016.
  • Wiriyathammabhum, P., Summers-Stay, D., Fermüller, C., & Aloimonos, Y. (2016). Computer Vision and Natural Language Processing: Recent Approaches in Multimedia and Robotics. ACM Computing Surveys (CSUR) 2016,49(4), 71.
  • Muraoka, Masayasu, Sumit Maharjan, Masaki Saito, Kota Yamaguchi, Naoaki Okazaki, Takayuki Okatani, and Kentaro Inui. “Recognizing Open-Vocabulary Relations between Objects in Images.” PACLIC 2016.
  • Kilickaya, Mert, Aykut Erdem, Nazli Ikizler-Cinbis, and Erkut Erdem. “Re-evaluating Automatic Metrics for Image Captioning.” arXiv preprint arXiv:1612.07600, 2016.
  • Frank Keller. “Jointly Representing Images and Text: Dependency Graphs, Word Senses, and Multimodal Embeddings.” Proceedings of the 2016 ACM workshop on Vision and Language Integration Meets Multimedia Fusion. ACM, 2016.
  • Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. “SPICE: Semantic Propositional Image Caption Evaluation.” ECCV 2016.
  • Y Yang, Y Li, C Fermüller, Y Aloimonos. “Neural Self Talk: Image Understanding via Continuous Questioning and Answering.”
  • S Aditya, C Baral, Y Yang, C Fermüller, Y Aloimonos. “DeepIU: An Architecture for Image Understanding.” Advances in Cognitive Systems, 2016.


@article{aditya2017image,
title = "Image Understanding using vision and reasoning through Scene Description Graph",
journal = "Computer Vision and Image Understanding",
volume = "",
number = "",
year = "2017",
note = "In Press, Accepted Manuscript",
issn = "1077-3142",
doi = "",
url = "",
author = "Somak Aditya and Yezhou Yang and Chitta Baral and Yiannis Aloimonos and Cornelia Fermüller"
}


@article{aditya2015images,
    title={From Images to Sentences through Scene Description Graphs using Commonsense Reasoning and Knowledge},
    author={Aditya, Somak and Yang, Yezhou and Baral, Chitta and Fermuller, Cornelia and Aloimonos, Yiannis},
    journal={arXiv preprint arXiv:1511.03292},
    year={2015}
}