What is Visual Dialog?

Visual Dialog is a novel task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the agent has to answer the question.

VisDial dataset:
  • >200k images from COCO
  • 1 dialog per image
  • 10 rounds of question-answers per dialog
  • >2M dialog question-answers in total
Dec 2016: The VisDial v0.5 dataset and code for the real-time chat interface used to collect data on AMT are now available!

Later versions of the dataset, code, pretrained models and a Visual Chatbot on CloudCV coming soon!

Email — contact@visualdialog.org


We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task to serve as a general test of machine intelligence, while being grounded in vision enough to allow objective evaluation of individual responses and benchmarking of progress. We develop a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog dataset (VisDial). Data collection is underway; on completion, VisDial will contain 1 dialog with 10 question-answer pairs for each of the ~200k images from COCO, for a total of ~2M dialog question-answer pairs.

We introduce a family of neural encoder-decoder models for Visual Dialog with 3 encoders — Late Fusion, Hierarchical Recurrent Encoder and Memory Network — and 2 decoders (generative and discriminative), which outperform a number of sophisticated baselines. We propose a retrieval-based evaluation protocol for Visual Dialog in which the AI agent is asked to sort a set of candidate answers and is evaluated on metrics such as the mean reciprocal rank of the human response. We quantify the gap between machine and human performance on the Visual Dialog task via human studies. Putting it all together, we demonstrate the first 'visual chatbot'!
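The retrieval-based metric above is straightforward to compute: for each question, the agent ranks the candidate answers, and we take the reciprocal of the rank assigned to the human response, averaged over questions. A minimal sketch (function name and example ranks are illustrative, not from the paper):

```python
def mean_reciprocal_rank(human_answer_ranks):
    """Mean reciprocal rank: average of 1 / (rank of the human response
    among the sorted candidate answers), one rank per question."""
    return sum(1.0 / r for r in human_answer_ranks) / len(human_answer_ranks)

# Hypothetical example: the human response ranked 1st, 2nd, and 4th
# across three questions.
mrr = mean_reciprocal_rank([1, 2, 4])
print(mrr)  # (1 + 0.5 + 0.25) / 3 ≈ 0.583
```

Higher is better; a perfect agent that always ranks the human response first achieves an MRR of 1.0.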

Read more in the paper.


@article{visdial,
  title={{V}isual {D}ialog},
  author={Abhishek Das and Satwik Kottur and Khushi Gupta and Avi Singh and Deshraj Yadav and Jos\'e M.F. Moura and
    Devi Parikh and Dhruv Batra},
  journal={arXiv preprint arXiv:1611.08669},
  year={2016}
}

VisDial Dataset

Distribution of lengths for VisDial questions and answers

Percentage coverage of unique answers over all answers from the training dataset, compared to VQA. VisDial has more unique answers, indicating greater answer diversity.

Most frequent answer responses in VisDial except for 'yes'/'no'

Distribution of first n-grams for VisDial questions and answers. The ordering of the words starts towards the center and radiates outwards. The arc length is proportional to the number of questions containing the word. White areas are words with contributions too small to show.


Late Fusion Encoder

This encoder embeds the image, the concatenated dialog history (including the caption), and the question into separate vectors, then performs a 'late fusion' of these into a joint embedding.
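The fusion step itself is simple to sketch: concatenate the three modality embeddings and pass them through a learned linear layer with a nonlinearity. The following numpy sketch uses hypothetical dimensions and random weights in place of the trained CNN/LSTM embeddings and fusion weights used in the paper:

```python
import numpy as np

def late_fusion_encode(img_vec, hist_vec, ques_vec, W):
    """Concatenate the per-modality embeddings, then fuse them into one
    joint embedding with a single linear layer + tanh."""
    fused = np.concatenate([img_vec, hist_vec, ques_vec])
    return np.tanh(W @ fused)

rng = np.random.default_rng(0)
# Hypothetical sizes: 8-d embedding per modality, 16-d joint embedding.
img, hist, ques = rng.normal(size=8), rng.normal(size=8), rng.normal(size=8)
W = rng.normal(size=(16, 24))  # stands in for a trained fusion weight matrix
joint = late_fusion_encode(img, hist, ques, W)
print(joint.shape)  # (16,)
```

The joint embedding is then handed to the decoder; the encoder itself imposes no structure on the history beyond concatenation, which is what makes the fusion 'late'.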

Hierarchical Recurrent Encoder

This encoder contains a dialog-level recurrent neural network on top of a QA-level recurrent block. Each QA-level recurrent block optionally includes an attention-over-history mechanism to attend to the rounds of history relevant to the current question.
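The two-level structure can be sketched with a minimal tanh RNN: a QA-level RNN encodes each round's token sequence into one vector, and a dialog-level RNN then runs over those per-round encodings. All dimensions and weights below are hypothetical stand-ins for the learned recurrent units in the paper, and the optional attention-over-history mechanism is omitted for brevity:

```python
import numpy as np

def rnn_final_state(seq, W_in, W_rec):
    """Minimal tanh RNN: fold a sequence of vectors into a final hidden state."""
    h = np.zeros(W_rec.shape[0])
    for x in seq:
        h = np.tanh(W_in @ x + W_rec @ h)
    return h

rng = np.random.default_rng(1)
d_tok, d_qa, d_dlg = 6, 8, 10  # hypothetical token / QA / dialog state sizes
W_in_qa, W_rec_qa = rng.normal(size=(d_qa, d_tok)), rng.normal(size=(d_qa, d_qa))
W_in_dlg, W_rec_dlg = rng.normal(size=(d_dlg, d_qa)), rng.normal(size=(d_dlg, d_dlg))

# QA-level RNN: each round (5 toy token embeddings) becomes one vector...
rounds = [rng.normal(size=(5, d_tok)) for _ in range(3)]  # 3 rounds of history
qa_states = [rnn_final_state(r, W_in_qa, W_rec_qa) for r in rounds]
# ...and the dialog-level RNN runs over the sequence of round encodings.
dialog_state = rnn_final_state(qa_states, W_in_dlg, W_rec_dlg)
print(dialog_state.shape)  # (10,)
```

The final dialog-level state summarizes the whole conversation and is passed to the decoder.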

Memory Network Encoder

This encoder treats each previous QA pair as a 'fact' in its memory bank and learns to 'poll' the stored facts and the image to develop a context vector, which is used to decode the answer.
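The 'polling' step is an attention over the memory bank: the question embedding is scored against each stored fact, the scores are softmax-normalized, and the context vector is the resulting weighted sum of facts. A minimal numpy sketch with hypothetical dimensions (the paper's facts and questions come from learned embeddings, and the image is polled alongside the facts):

```python
import numpy as np

def memory_encode(question_vec, fact_vecs):
    """Attend ('poll') over stored QA facts: softmax of dot-product scores,
    then a weighted sum of the facts yields the context vector."""
    scores = fact_vecs @ question_vec          # one score per stored fact
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ fact_vecs                 # context vector

rng = np.random.default_rng(2)
facts = rng.normal(size=(4, 8))   # 4 previous QA pairs, 8-d embeddings each
question = rng.normal(size=8)
context = memory_encode(question, facts)
print(context.shape)  # (8,)
```

Because the attention weights are a function of the current question, each new round of dialog can pull a different mixture of past facts into the context vector used for decoding.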


We thank Harsh Agrawal and Jiasen Lu for help on the AMT data collection interface. We also thank Xiao Lin, Ramprasaath Selvaraju and Latha Pemula for model discussions. Finally, we are grateful to the developers of Torch for building an excellent framework.