What is Visual Dialog?
Visual Dialog is a novel task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a follow-up question about the image, the agent has to answer the question.
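As a rough illustration of the task's inputs, the sketch below models one example as an image reference, a dialog history, and a follow-up question. The class names, field names, and the `flatten_context` helper are hypothetical and chosen for illustration only; they are not the VisDial data format or any released API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogRound:
    """One completed question-answer exchange about the image."""
    question: str
    answer: str


@dataclass
class VisualDialogInput:
    """What the agent sees before answering (field names are hypothetical)."""
    image_id: str
    history: List[DialogRound] = field(default_factory=list)
    question: str = ""


def flatten_context(example: VisualDialogInput) -> str:
    """Render the dialog history and the follow-up question as plain text."""
    lines = [f"Q: {r.question} A: {r.answer}" for r in example.history]
    lines.append(f"Q: {example.question}")
    return "\n".join(lines)


# A made-up example: one earlier round plus the question to be answered.
example = VisualDialogInput(
    image_id="example_image",
    history=[DialogRound("How many people are there?", "Two")],
    question="What are they doing?",
)
print(flatten_context(example))
```

An agent for this task would take such a structure, together with the image itself, and produce the answer to the final question.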
VisDial dataset:
- 140k images from COCO
- 1 dialog / image
- 10 rounds of question-answer pairs / dialog
- Total 1.4M dialog question-answer pairs
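The total follows directly from the other three figures. A quick check, taking the rounded counts from the list above at face value (the variable names are illustrative):

```python
# Rounded figures from the dataset summary; the exact counts may differ slightly.
images = 140_000
dialogs_per_image = 1
rounds_per_dialog = 10

total_qa_pairs = images * dialogs_per_image * rounds_per_dialog
print(f"{total_qa_pairs:,} question-answer pairs")
```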
Mar 2017: The VisDial v0.9 dataset and the code for the real-time chat interface used to collect data on AMT are now available!
Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
* equal contribution
We thank Harsh Agrawal and Jiasen Lu for help on the AMT data collection interface; Xiao Lin, Ramprasaath Selvaraju and Latha Pemula for model discussions; Marco Baroni, Antoine Bordes, Mike Lewis, and Marc'Aurelio Ranzato for helpful discussions. Finally, we are grateful to the developers of Torch for building an excellent framework. This work was funded in part by NSF CAREER awards to Dhruv Batra and Devi Parikh, ONR YIP awards to Dhruv Batra and Devi Parikh, ONR Grant N00014-14-1-0679 to Dhruv Batra, a Sloan Fellowship to Devi Parikh, ARO YIP awards to Dhruv Batra and Devi Parikh, an Allen Distinguished Investigator award to Devi Parikh from the Paul G. Allen Family Foundation, ICTAS Junior Faculty awards to Dhruv Batra and Devi Parikh, Google Faculty Research Awards to Dhruv Batra and Devi Parikh, Amazon Academic Research Awards to Dhruv Batra and Devi Parikh, an AWS in Education Research grant to Dhruv Batra, and NVIDIA GPU donations to Dhruv Batra.