Visual Dialog Challenge 2018


We are pleased to announce the first Visual Dialog Challenge!

Visual Dialog is a novel task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history (consisting of the image caption and a sequence of previous questions and answers), the agent has to answer a follow-up question in the dialog. To perform well on this task, the agent needs to ground the query not only in the visual content but also in the dialog history.

We believe that the next generation of intelligent systems will need to possess this ability to hold a dialog about visual content for a variety of applications, and we encourage teams to participate and help push the state of the art in this exciting area!


04 Jun 2018 — Visual Dialog Challenge 2018 announced!
mid-June — VisDial v1.0 test set release.
mid-August — Submission deadline for participants.
08 Sep 2018 — Winners' announcement at ECCV 2018, Munich, Germany.

Dataset Description

The challenge will be conducted on v1.0 of the VisDial dataset, which is based on COCO images.

For context, the currently available VisDial v0.9 contains 1 dialog with 10 question-answer pairs (starting from an image caption) on ~120k images from COCO-trainval, totalling ~1.2 million question-answer pairs. This data forms the training set for the challenge.

For VisDial v1.0, we have collected dialogs for an additional ~10k COCO-like images — bringing the total size of the dataset to ~1.3 million dialog QA pairs for ~130k images. We have worked closely with the COCO team to ensure that this distribution of images and captions matches that of the training set. This additional data forms the val and test sets for the challenge. See FAQ below for details.

train — 123,287 images with 10-round dialogs + candidate answers for each question

val — 2,000 images with 10-round dialogs + candidate answers for each question

test — 8,000 images with n-round dialogs (n anywhere from 1 to 10) and 1 follow-up question + candidate answers

The download links and more information on data format can be found here.

Participation Guidelines

To participate, teams must register on EvalAI and create a team for the challenge (see this Quick-Start Guide).
The challenge has three phases:

Phase            #(Images) x #(Dialog rounds)   Submissions   Results                  Leaderboard
val              2,064 x 10                     unlimited     immediate                none
test-std         4,000 x 1                      5 per day     immediate                public (optional)
test-challenge   4,000 x 1                      5 total       announced at ECCV 2018   private, announced at ECCV 2018

While answers are already provided for the val set, this phase is useful for sanity-checking the result format without using up submissions in the other phases. For the test-std and test-challenge phases, results must be submitted on the full test set. By default, submissions to the test-std phase are private but can be voluntarily released to the public leaderboard, with a limit of one public leaderboard entry per team. Submissions to the test-challenge phase are considered entries into the challenge. For multiple submissions to test-challenge, the approach with the highest test-std accuracy will be used.

It is not acceptable to create multiple accounts for a single team in order to bypass these limits. The one exception is a group working on multiple unrelated methods, in which case all sets of results may be submitted for evaluation. Results must be submitted to the evaluation server by the challenge deadline -- no exceptions will be made.

Quickstart: We have provided a Lua Torch implementation of the models from the Visual Dialog paper. This codebase also provides example dataloaders and scripts to output results in the challenge submission format (described in the next section).

Submission Format

To submit to a phase, teams must upload a JSON file containing their model's answer rankings in the following format:

  [{
    'image_id': int,
    'round_id': int,
    'ranks': [int x100]
  }, {...}]
where ranks is an array of ranks (1-100) for the 100 candidate answers, with the first entry corresponding to the rank of the first candidate answer. We provide an example file here. When submitting, teams should also include a method name, method description, project URL, and publication URL if available.
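As a sketch of how such a file might be produced, the snippet below converts per-candidate model scores into 1-100 ranks and writes them out in the submission format. The `write_submission` helper and the dummy scores are purely illustrative, not part of any official tooling:

```python
import json

def write_submission(predictions, path):
    """predictions: list of (image_id, round_id, scores), where scores is a
    list of 100 model scores, one per candidate answer (higher = better)."""
    entries = []
    for image_id, round_id, scores in predictions:
        # Sort candidate indices by descending score; best candidate gets rank 1.
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        ranks = [0] * len(scores)
        for rank, idx in enumerate(order, start=1):
            ranks[idx] = rank
        entries.append({
            "image_id": image_id,
            "round_id": round_id,
            "ranks": ranks,
        })
    with open(path, "w") as f:
        json.dump(entries, f)

# Illustrative only: 100 dummy scores for a single question.
dummy_scores = [float(i) for i in range(100)]
write_submission([(42, 5, dummy_scores)], "submission.json")
```

Note that ranks here are a permutation of 1-100, so ties in model scores must be broken in some consistent way before writing the file.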


For evaluation, we will have the following:

  • Retrieval metrics: As in the Visual Dialog paper, we report R@1, R@5, R@10, mean rank, and mean reciprocal rank (MRR). These metrics are computed for all three challenge phases.
  • Human annotation based evaluation: Available for test phases, along with retrieval metrics. More details will be released with the release of the test split.
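The retrieval metrics above can be sketched as follows, assuming `gt_ranks` holds the rank (1-100) that the model assigned to the ground-truth answer for each evaluated question; this is an illustration of the standard definitions, not the official evaluation code:

```python
def retrieval_metrics(gt_ranks):
    """gt_ranks: list of ranks (1-100) assigned to the ground-truth answers."""
    n = len(gt_ranks)
    return {
        # R@k: fraction of questions where the ground truth is ranked in the top k.
        "R@1": sum(r <= 1 for r in gt_ranks) / n,
        "R@5": sum(r <= 5 for r in gt_ranks) / n,
        "R@10": sum(r <= 10 for r in gt_ranks) / n,
        # Mean rank of the ground-truth answer (lower is better).
        "mean_rank": sum(gt_ranks) / n,
        # Mean reciprocal rank (higher is better).
        "mrr": sum(1.0 / r for r in gt_ranks) / n,
    }

# Toy example with four questions:
metrics = retrieval_metrics([1, 2, 10, 50])
```

In the toy example, R@1 is 0.25 (one of four ground truths ranked first) and R@10 is 0.75.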

Frequently Asked Questions (FAQ)

  • Why aren't the val and test sets the same as COCO?
  • Conversations in Visual Dialog are seeded with captions, and releasing captions for COCO test images would compromise the integrity of the COCO benchmarks. Hence, we've worked closely with the COCO team to ensure that the images and annotations for our evaluation splits are distributed similarly to the training set.

  • Why are only single rounds evaluated per image in the test set?
  • As the task is set up, agents must see the dialog history before answering questions -- making it impossible to test agents on more than one round per dialog without giving away the answers to other questions. We have sampled rounds uniformly for testing and will provide analysis at the workshop.

  • Isn't evaluation on the first round just VQA?
  • Not quite! Even at the first round, agents are primed with a caption for the image and questions often refer to objects referenced in it.

  • I don't see my question here, what do I do?
  • Ping us on Discord, or email us at [email protected]!


Satwik Kottur
Carnegie Mellon
Abhishek Das
Georgia Tech
Deshraj Yadav
Georgia Tech
Karan Desai
Georgia Tech
Stefan Lee      
Georgia Tech
Devi Parikh
Georgia Tech, FAIR
Dhruv Batra
Georgia Tech, FAIR

Email — [email protected]