What is Visual Dialog?

Visual Dialog is a novel task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the agent has to answer the question.
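
As a rough illustration of the task setup described above, the inputs and output can be sketched as a small Python interface. This is only a minimal sketch under stated assumptions: the names `DialogState`, `Agent`, `EchoAgent`, and `run_dialog` are hypothetical and do not come from the VisDial code release, and the placeholder agent ignores the image entirely.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


@dataclass
class DialogState:
    """One query to the agent: an image, the dialog so far, and a new question."""
    image_id: str  # hypothetical identifier standing in for the image itself
    history: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) pairs
    question: str = ""  # the follow-up question to be answered


class Agent(Protocol):
    """Anything that can produce an answer for a DialogState."""
    def answer(self, state: DialogState) -> str: ...


class EchoAgent:
    """Placeholder agent (assumption): ignores the image and returns a canned reply."""
    def answer(self, state: DialogState) -> str:
        return f"(round {len(state.history) + 1}) I cannot see the image yet."


def run_dialog(agent: Agent, image_id: str, questions: List[str]) -> List[Tuple[str, str]]:
    """Ask the questions in order, feeding earlier rounds back in as dialog history."""
    history: List[Tuple[str, str]] = []
    for q in questions:
        state = DialogState(image_id=image_id, history=list(history), question=q)
        history.append((q, agent.answer(state)))
    return history
```

A real agent would replace `EchoAgent` with a model that conditions on the image as well as on the history and the question.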

VisDial dataset:
  • >200k images from COCO
  • 1 dialog / image
  • 10 rounds of question-answer pairs / dialog
  • Total >2M dialog question-answer pairs
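
The total follows from the three figures above: images times dialogs per image times rounds per dialog. A short check, using the lower bound of 200,000 images (the function name `min_total_qa_pairs` is made up for this illustration):

```python
def min_total_qa_pairs(num_images: int, dialogs_per_image: int = 1,
                       rounds_per_dialog: int = 10) -> int:
    """Lower bound on question-answer pairs given the per-image dialog structure."""
    return num_images * dialogs_per_image * rounds_per_dialog


# 200,000 images x 1 dialog x 10 rounds
print(min_total_qa_pairs(200_000))  # 2000000
```
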
Dec 2016: VisDial v0.5 dataset and code for the real-time chat interface used to collect data on AMT are now available!

Later versions of the dataset, code, pretrained models, and a Visual Chatbot on CloudCV are coming soon!


Email — [email protected]


Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning

Abhishek Das*, Satwik Kottur*, José M.F. Moura, Stefan Lee and Dhruv Batra
* equal contribution
ArXiv 2017 [Bibtex] [PDF]

Visual Dialog

Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh and Dhruv Batra
CVPR 2017 (Spotlight) [Bibtex] [PDF]


Acknowledgements

We thank Harsh Agrawal and Jiasen Lu for help on the AMT data collection interface; Xiao Lin, Ramprasaath Selvaraju and Latha Pemula for model discussions; Marco Baroni, Antoine Bordes, Mike Lewis, and Marc'Aurelio Ranzato for helpful discussions. Finally, we are grateful to the developers of Torch for building an excellent framework.