NLP Power

The First Workshop on Efficient Benchmarking in NLP.

Workshop schedule 🍭🌙

The workshop programme is now available here.

Round table: We are organizing a round table on benchmarks and their problems. If this is a topic you would like to elaborate on, feel free to join the discussion!

GLUE authors, let’s stick together!

⏰: May 26, 16:50 Dublin time, via Zoom, during the final hours of the workshop

Invited speakers

Ulises A. Mejias (SUNY Oswego)


This presentation will frame the increasing monopolistic power of Big Tech in the field of NLP as an instance of data colonialism. The concept of data colonialism explains the emergence of a new social order based not on the extraction of natural resources or labor, but on the appropriation of human life —including language— through data. This new type of colonialism is already resulting in a highly unequal social arrangement that is deeply incompatible with human freedom and autonomy. And as research in NLP becomes the exclusive domain of powerful corporations, language itself is becoming a zone of colonial extractivism. Resisting this order will require strategies that decolonial thinking has foregrounded for centuries.

Ulises A. Mejias is Professor of Communication Studies and director of the Institute for Global Engagement at SUNY Oswego. Mejias is a Fulbright Specialist fellow from 2021 to 2025, a co-founder of the Non-Aligned Technologies Movement and the network Tierra Común, and serves on the Board of Directors of Humanities New York, a National Endowment for the Humanities affiliate. Research interests include critical internet studies, network theory and science, philosophy of technology, sociology of communication, and political economy of digital media.

Anna Rumshisky (UMASS, Amazon)


Multiple studies have shown that Transformers are remarkably robust, and retain performance advantages under massive pruning and other disruptions. Contrary to this received wisdom, in this talk, we will show that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of features in the layer outputs. We will also consider the performance of several pre-trained Transformer families and pre-training objectives on a number of downstream probing tasks, and discuss some important challenges to these architectures.

Anna Rumshisky is an Associate Professor of Computer Science at the University of Massachusetts Lowell, where she heads the Text Machine Lab for NLP. Her primary research area is machine learning for natural language processing, with a focus on deep learning techniques.

He He (CILVR / ML2)


Model robustness and spurious correlations have received increasing attention in the NLP community, both in methods and evaluation. The term “spurious correlation” has been abused to refer to any undesirable shortcuts learned by the model, as judged by domain knowledge. However, in NLP, many features (e.g. word overlap and negation) are not spurious in the way that the background is a spurious feature for classifying the object in an image. They carry important information that humans need to make predictions. In this talk, we argue that it is more productive to consider features from the aspects of necessity and sufficiency, and discuss the implications of this categorization in representation, learning, and evaluation.

He He is an Assistant Professor of Computer Science and Data Science at the CILVR Lab (Computational Intelligence, Learning, Vision, and Robotics) at NYU, where she develops reliable communication in natural language between machines and humans. Recent research directions include (1) text generation, (2) robust language understanding, and (3) dialogue systems.

Important dates


Workshop description

Benchmarks have played a crucial role in accelerating progress in the field of NLP, covering a wide range of research directions: natural language understanding (GLUE, SuperGLUE), natural language generation (GEM), cross-lingual knowledge transfer (XGLUE, XTREME), probing and interpretation (LINSPECTOR, SentEval), hate speech and bias (HateCheck, StereoSet, HONEST), and robustness to adversarial attacks (RobustnessGym, AdvGLUE). Although benchmarking has become standard practice for evaluating upcoming models against one another and against human solvers, a number of unresolved issues and methodological concerns remain.

The main objectives of this workshop are to (1) create a space for critical reflection on current benchmarks and evaluation tools, (2) encourage the development of improved or new benchmarks and evaluation tools that resolve current challenges, (3) develop better approaches to model ranking, and (4) rethink benchmarking strategies so that they better account for computational costs, energy and ethical considerations, out-of-domain language capabilities, and end-user preferences. We welcome submissions on ongoing and finished research and hope to provide an opportunity for participants to present their work and exchange ideas. Particular topics of interest include, but are not limited to:

Computational race & carbon footprints: A recent trend of scaling the number of parameters in pretrained language models to hundreds of billions (Brown et al., 2020) has facilitated novel state-of-the-art results on NLU benchmarks at the cost of millions of dollars and a large carbon footprint (Strubell et al., 2019; Lottick et al., 2019; Bender et al., 2021). This often results in insignificant improvements; e.g., the overall scores of the current top-3 models on the SuperGLUE leaderboard differ by at most 0.3%. The trend also leads to unequal access to computational resources (Couldry and Mejias, 2020) and makes such models difficult to use in academic and industrial settings (Schick and Schütze, 2021).

Linguistic competence: Modern models struggle to learn rare phenomena from data, even as the size of pre-training corpora increases (Zhang et al., 2020). Linguistic phenomena generally follow a Zipfian distribution, meaning that most of them are hard to learn because they occur rarely in natural language data (Rogers, 2021).

Reproducibility crisis: Recent studies report that the stated results in NLP are hard to reproduce, both for neural models (Belz et al., 2021a; Bianchi and Hovy, 2021) and for human solvers (Belz et al., 2021b), a scientific problem that has been termed the “reproducibility crisis” (Baker, 2016). Systematic reviews of this kind raise doubts about the reliability of model rankings on canonical leaderboards, specifically against human baselines, which are widely regarded as the standard way of tracking progress in the field.

Model and human evaluation design: The NLP field has not yet reached a consensus on methodological guidelines for model and human evaluation in benchmarks (Bowman and Dahl, 2021; Rodriguez et al., 2021). Many works detail cognitive biases in human evaluation (Schoch et al., 2020), critique the unreliability of crowd-sourced human baselines (Nangia and Bowman, 2019), reinforce the need for independent evaluation in multiple languages (Linzen, 2020), highlight the necessity of measuring the contribution of a particular task to the resulting score (Choudhury and Deshpande, 2021), and call for evaluating inference time and memory consumption (Kiela et al., 2021) along with user preferences (Ethayarajh and Jurafsky, 2020).

Application to real-world scenarios: Recent studies reveal that even the models that achieve the highest scores on existing benchmarks can be fooled by carefully crafted adversarial examples (Nie et al., 2020). This justifies the need for systematic evaluation of model robustness in the form of adversarial attacks and other transformations applied to test data (Wang et al., 2021; Goel et al., 2021).

Data collection & leakage: Benchmark datasets are traditionally collected from publicly available resources that may contain malicious and ethically inappropriate data (Liang et al., 2020). Moreover, recent works reveal data leakage between train and test sets (Elangovan et al., 2021; Lewis et al., 2021), raising questions about whether the models indeed demonstrate generalization abilities or simply learn statistical artefacts in the pre-training and downstream data. This has stimulated the design of zero- and few-shot scenarios that assess the generalization capacity of the models (Hou et al., 2020; Alex et al., 2021).

Program Committee

Emergency reviewers:


Contact information

Email: nlp_power at

Ethics policy

Authors are required to honour the ethical code set out in the ACL Code of Ethics, and comply with the ethics guidelines for ARR submissions.

Anti-Harassment Policy

NLP Power 2022 adheres to the ACL Anti-Harassment Policy.

Demographic Diversity

The open exchange of ideas, the freedom of thought and expression, and respectful scientific debate are central to our workshop. First, we will make sure that the ACL Anti-Harassment Policy is respected during the organization and execution of the event. Second, our approach to the selection of invited speakers and PC members takes into consideration the need for demographic diversity. Third, our organizing team includes individuals of diverse genders and sociodemographic backgrounds.