Rethinking Benchmarking in AI: Adversarial NLI, Dynabench and Hateful Memes

The current benchmarking paradigm in AI has many issues: benchmarks saturate quickly, are susceptible to overfitting, contain exploitable annotator artifacts, have unclear or imperfect evaluation metrics, and do not measure what we really care about. I will talk about my work on rethinking the way we do benchmarking in AI, covering the Adversarial NLI dataset and the recently launched Dynabench platform. I'll also discuss the Hateful Memes Challenge, which is designed to measure true multimodal reasoning and understanding capabilities in vision and language models.