Visual Dialog Challenge 2020



We are pleased to announce the third Visual Dialog Challenge!

Visual Dialog is a task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image and a dialog history (consisting of the image caption and a sequence of previous questions and answers), the agent has to answer a follow-up question in the dialog. To perform well on this task, the agent needs to ground the query not only in the visual content but also in the dialog history.

We believe that the next generation of intelligent systems will need to possess this ability to hold a dialog about visual content for a variety of applications, and we encourage teams to participate and help push the state of the art in this exciting area!

The first and second editions of this challenge were organised on the VisDial v1.0 dataset, with results announced at the SiVL workshop at ECCV 2018 and the VQA and Dialog workshop at CVPR 2019, respectively.




14 Apr 2020 — Update on how challenge winners will be picked.
3 Feb 2020 — Visual Dialog Challenge 2020 announced!
14 May 2020 (23:59:59 GMT) — Submission deadline for participants.
14 Jun 2020 — Winners' announcement at the VQA and Dialog Workshop, CVPR 2020.

For questions about the challenge, ping us on Discord, or email us at [email protected].

Dataset Description

The challenge will be conducted on v1.0 of the VisDial dataset, which is based on COCO images.

VisDial v1.0 contains 1 dialog with 10 question-answer pairs (starting from an image caption) on ~130k images from COCO-trainval and Flickr, totalling ~1.3 million question-answer pairs. The v1.0 training set consists of dialogs on ~120k images from COCO-trainval, while the validation and test sets consist of dialogs on an additional ~10k COCO-like images from Flickr. We have worked closely with the COCO team to ensure that these additional images match the distribution of images and captions of the training set.

Note that the v1.0 training set combines v0.9 training and v0.9 validation splits collected on COCO-train2014 and COCO-val2014 images respectively. See FAQ below for more details on splits.

train — 123,287 images with all 10 rounds of dialog (follow-up questions + candidate answers for each round)

val — 2,000 images with all 10 rounds of dialog (follow-up questions + candidate answers for each round)

test — 8,000 images with `n` rounds of dialog (`n` anywhere from 1 to 10) and 1 follow-up question + candidate answers

The download links and more information on data format can be found here.

Participation Guidelines

To participate, teams must register on EvalAI and create a team for the challenge (see this quickstart guide). The challenge page is available here.

The challenge has three phases:

Phase | #(Images) x #(Dialog rounds) | Submissions | Results | Leaderboard
val | 2,064 x 10 | unlimited | immediate | none
test-std | 4,000 x 1 | 5 total | immediate | public (optional)
test-challenge | 4,000 x 1 | 5 total | announced at CVPR 2020 | private, announced at CVPR 2020

While answers are already provided for the val set, this phase is useful for sanity-checking the result format without using up submissions in the other phases. For the test-std and test-challenge phases, results must be submitted on the full test set. By default, submissions to the test-std phase are private, but they can be voluntarily released to the public leaderboard, with a limit of one public leaderboard entry per team. Submissions to the test-challenge phase are considered entries into the challenge. If a team makes multiple submissions to test-challenge, the approach with the highest test-std accuracy will be used.

It is not acceptable to create multiple accounts for a single team in order to bypass these limits. The one exception is a group working on multiple unrelated methods; in that case, all sets of results can be submitted for evaluation. Results must be submitted to the evaluation server by the challenge deadline -- no exceptions will be made.

Starter code

The Lua Torch implementation supports all models from the Visual Dialog paper. PyTorch starter code comes with Late Fusion (LF) Encoder - Discriminative Decoder support with detector features. Both of these codebases include dataloaders for VisDial v1.0, pretrained models, scripts to save model predictions in the challenge submission format, as well as code to train your own models.

Submission Format

To submit to a phase, teams must upload a JSON file containing their model's answer rankings in the following format:

  [{
    'image_id': int,
    'round_id': int,
    'ranks': [int x100]
  }, {...}]
where ranks is an array of ranks (1-100) for the 100 candidate answers, with the first entry corresponding to the rank of the first candidate answer. We provide an example submission file here (from the Late Fusion + Attention model from the Torch codebase). When submitting, teams should also include a method name, method description, project URL, and publication URL if available.
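As a concrete illustration, the sketch below converts per-candidate model scores into the ranks field and writes a submission file in this format. The prediction-triple layout and function names are hypothetical conveniences, not part of the challenge tooling:

```python
import json

def scores_to_ranks(scores):
    """Convert candidate-answer scores to ranks 1..N.

    The highest-scoring candidate gets rank 1; the i-th entry of the
    output is the rank of the i-th candidate answer, matching the
    submission format above.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def write_submission(predictions, path="submission.json"):
    """predictions: hypothetical list of (image_id, round_id, scores)
    triples produced by your model, with 100 scores per instance."""
    entries = [
        {"image_id": image_id, "round_id": round_id, "ranks": scores_to_ranks(scores)}
        for image_id, round_id, scores in predictions
    ]
    with open(path, "w") as f:
        json.dump(entries, f)
```
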


For evaluation, we have the following:

  • Retrieval metrics / evaluation using sparse annotations: We have mean reciprocal rank (MRR), recall (R@{1, 5, 10}), and mean rank as described in the Visual Dialog paper. Evaluation through these metrics is carried out in all three challenge phases.
  • Evaluation using dense annotations: As some of the candidate options may be semantically identical (e.g. 'yeah' and 'yes'), we have had four human annotators indicate whether each of the 100 candidate answers is correct for each val and test phase instance. For evaluation, we report the normalized discounted cumulative gain (NDCG) over the top K ranked options, where K is the number of answers marked as correct by at least one annotator. For this computation, we consider the relevance of an answer to be the fraction of annotators that marked it as correct.
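To make the sparse-annotation metrics concrete, here is a minimal sketch (function name is illustrative) that computes them from the rank the model assigned to the single ground-truth answer in each evaluated instance:

```python
def retrieval_metrics(gt_ranks):
    """Sparse-annotation metrics, given the rank (1 = best of the 100
    candidates) of the ground-truth answer for each evaluated instance."""
    n = len(gt_ranks)
    return {
        # MRR: mean of 1/rank over instances
        "mrr": sum(1.0 / r for r in gt_ranks) / n,
        # R@k: fraction of instances where the answer is ranked in the top k
        **{f"r@{k}": sum(r <= k for r in gt_ranks) / n for k in (1, 5, 10)},
        # Mean rank of the ground-truth answer (lower is better)
        "mean_rank": sum(gt_ranks) / n,
    }
```
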

Challenge winners will be picked based on both the NDCG and MRR evaluation metrics. We will first rank submissions on each metric individually, and then average ranks across the two for the final leaderboard.

Note that the averaged rank is computed over the same single submission per team. This ensures that submissions ranking high on the final leaderboard are ones that do well on both metrics simultaneously.
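The selection rule above can be sketched as follows (team names and scores are purely illustrative):

```python
def final_leaderboard(teams):
    """teams: dict mapping team name -> (ndcg, mrr).

    Rank teams on each metric separately (higher is better), then order
    the final leaderboard by the average of the two per-metric ranks;
    a lower average rank wins.
    """
    def ranks_by(metric_idx):
        order = sorted(teams, key=lambda t: -teams[t][metric_idx])
        return {t: r for r, t in enumerate(order, start=1)}

    ndcg_rank, mrr_rank = ranks_by(0), ranks_by(1)
    avg = {t: (ndcg_rank[t] + mrr_rank[t]) / 2 for t in teams}
    return sorted(teams, key=lambda t: avg[t])
```

A team that is first on one metric but last on the other can thus be beaten by a team that is second on both.
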

NDCG is invariant to the order of options with identical relevance and to the order of options outside of the top K. Here is an example:

Consider five answer options, ranked from high to low: ["yes", "yes it is", "probably", "two", "no"],
with corresponding ground-truth relevances: [1, 1, 0.5, 0, 0],
so that K, the number of relevant options (i.e. options with non-zero relevance), is 3.
Here, NDCG remains unchanged for two cases:

  • Swapping options with same relevance.
    Ranking NDCG
    ["yes", "yes it is", "two", "probably", "no"] 0.8670
    ["yes it is", "yes", "two", "probably", "no"] 0.8670
  • Shuffling options after first K indices.
    Ranking NDCG
    ["yes", "two", "yes it is", "probably", "no"] 0.7974
    ["yes", "two", "yes it is", "no", "probably"] 0.7974

Download dense annotations on v1.0 val and (a small subset of) v1.0 train

Relevance scores from these dense annotations are available for download: for VisDial v1.0 val (5 human annotators per instance) here, and for a small subset of 2,000 images from VisDial v1.0 train (200 instances per round x 10 rounds; 2 human annotators per instance) here.

Winners' announcement and analysis

Winning teams of the Visual Dialog Challenge 2020 were announced at the VQA and Dialog Workshop at CVPR 2020.
Slides from the presentation are available here and include analysis of the submissions made to the challenge.


NOTE: Challenge winners were picked based on both the NDCG and MRR evaluation metrics. We first ranked submissions on each metric individually, and then averaged ranks across the two for the final leaderboard.

Position | Team | NDCG | MRR | R@1 | R@5 | R@10 | Mean Rank
1 | Technion [Slides] [Talk] | 73.35 ± 0.25 | 70.42 ± 0.42 | 58.59 ± 0.55 | 82.85 ± 0.42 | 88.84 ± 0.35 | 3.96 ± 0.08
1 | idansc | 73.49 ± 0.25 | 69.80 ± 0.42 | 58.61 ± 0.55 | 81.21 ± 0.44 | 89.19 ± 0.35 | 3.91 ± 0.08
2 | SES-100M | 75.86 ± 0.27 | 63.84 ± 0.47 | 55.62 ± 0.56 | 72.20 ± 0.50 | 83.70 ± 0.41 | 5.84 ± 0.11
2 | MReaL Lab | 75.70 ± 0.25 | 64.12 ± 0.43 | 50.81 ± 0.56 | 80.03 ± 0.45 | 90.92 ± 0.32 | 3.83 ± 0.06
3 | VD-BERT | 75.92 ± 0.29 | 51.84 ± 0.46 | 39.91 ± 0.55 | 63.45 ± 0.54 | 78.56 ± 0.46 | 6.57 ± 0.10
3 | taufik | 73.87 ± 0.27 | 65.12 ± 0.46 | 55.84 ± 0.56 | 74.58 ± 0.49 | 85.22 ± 0.40 | 4.89 ± 0.09
4 | fga_leo | 56.07 ± 0.23 | 69.59 ± 0.40 | 55.94 ± 0.56 | 87.01 ± 0.38 | 94.42 ± 0.26 | 3.02 ± 0.05
4 | lalaland | 75.04 ± 0.28 | 60.44 ± 0.45 | 48.94 ± 0.56 | 73.59 ± 0.49 | 85.54 ± 0.39 | 5.99 ± 0.12
4 | VisDial-BERT | 63.34 ± 0.23 | 68.79 ± 0.41 | 55.20 ± 0.56 | 86.15 ± 0.39 | 93.88 ± 0.27 | 3.12 ± 0.05
5 | M | 40.50 ± 0.24 | 48.64 ± 0.44 | 33.85 ± 0.53 | 64.54 ± 0.53 | 76.28 ± 0.48 | 8.18 ± 0.13

Frequently Asked Questions (FAQ)

  • Why aren't the val and test sets the same as COCO?
  • Conversations in Visual Dialog are seeded with captions, and releasing captions for COCO test images would compromise the integrity of the COCO benchmarks. Hence, we have worked closely with the COCO team to ensure images for our evaluation splits are distributed similarly to the training set.

  • Why are only single rounds evaluated per image in the test set?
  • As the task is set up, agents must see the dialog history before answering questions -- making it impossible to test agents on more than one round per dialog without giving away the answers to other questions. We have sampled rounds uniformly for testing and will provide analysis at the workshop.

  • Why is there only one track? Won't discriminative models dominate the competition, since it is based on ranking a set of options?
  • We considered having two tracks -- one for generative models and one for discriminative models. But the distinction between the two can get blurry (e.g., non-parametric models that internally maintain a large list of answer options), and the separation would be difficult to enforce in practice anyway. So for now, we have a single track. Note that our choice of ranking for evaluation isn't an endorsement of either approach (generative or discriminative) and we welcome all submissions.

    Empirical findings from the 1st Visual Dialog Challenge indicate that generative models perform comparably to (and sometimes even better than) discriminative models on the NDCG metric -- for example, 53.67 vs. 49.58 on VisDial v1.0 test-std for Memory Network + Attention with generative vs. discriminative decoding respectively. Code and models are available here.

  • Isn't evaluation on the first round just VQA?
  • Not quite! Even at the first round, agents are primed with a caption for the image and questions often refer to objects referenced in it.

  • I don't see my question here, what do I do?
  • Ping us on Discord, or email us at [email protected]!


Vishvak Murahari
Georgia Tech
Ayush Shrivastava
Georgia Tech
Karan Desai
Georgia Tech
Rishabh Jain
Georgia Tech
Deshraj Yadav
Georgia Tech
Abhishek Das
Georgia Tech
Stefan Lee      
Georgia Tech
Devi Parikh
Georgia Tech, FAIR
Dhruv Batra
Georgia Tech, FAIR

Email — [email protected]


This work is supported by grants awarded to Dhruv Batra and Devi Parikh.