This is the page of Task 11 at SemEval 2023 on Learning with Disagreements (Le-Wi-Di), 2nd edition.
10/01/23: test data released, the evaluation phase begins!
- Deadline to submit your predictions valid for the competition at the SemEval workshop: 31/01/23, 23:59 UTC
- Go to our Codalab page to get the data and start preparing your models!
- Watch the video presentation of the task here!
Overview
In recent years, the assumption that natural language (NL) expressions have a single, clearly identifiable interpretation in a given context has increasingly been recognized as just a convenient idealization. The objective of the Learning with Disagreement shared task is to provide a unified testing framework for learning from disagreements, using datasets containing information about disagreements in interpreting language. Learning with Disagreement (Le-Wi-Di) 2021 created a benchmark consisting of six existing and widely used datasets, but focused primarily on semantic ambiguity and image classification.
For SemEval 2023, we are running a second shared task on the topic of Learning with Disagreement: (i) the focus is entirely on subjective tasks, where training with aggregated labels makes much less sense, and (ii) while relying on the same infrastructure, it involves new datasets. We believe that the shared task thus reformulated is extremely timely, given the current high degree of interest in subjective tasks such as offensive language detection in general, and in the issue of disagreements in such data in particular (Basile et al., 2021; Leonardelli et al., 2021; Akhtar et al., 2021; Davani et al., 2022; Uma et al., 2021), and we hope it will attract substantial interest from the community.
The Datasets
Our focus is entirely on subjective tasks, where training with aggregated labels makes much less sense. To this end, we collected a benchmark of four (textual) datasets with different characteristics in terms of genre (social media and conversations), language (English and Arabic), task (misogyny, hate speech, and offensiveness detection) and annotation method (experts, specific demographic groups, AMT crowd). All datasets, however, provide a multiplicity of labels for each instance.
The four datasets presented are:
- The HS-Brexit dataset: an entirely new dataset of tweets on Brexit for abusive language detection, annotated for hate speech (HS), aggressiveness and offensiveness by six annotators belonging to two distinct groups: a target group of three Muslim immigrants in the UK and a control group of three other individuals.
- The ArMIS dataset: a new dataset of Arabic tweets annotated for misogyny detection by annotators with different demographic characteristics ("Moderate Female", "Liberal Female" and "Conservative Male").
- The ConvAbuse dataset: a dataset of 4,185 English dialogues conducted between users and two conversational agents. The user utterances have been annotated by at least three experts in gender studies using a hierarchical labelling scheme (with the following categories: Abuse binary; Abuse severity; Directedness; Target; Type).
- The MultiDomain Agreement dataset: a dataset of around 10k English tweets from three domains (BLM, Election, Covid-19). Each tweet is annotated for offensiveness by 5 annotators via AMT. Particular focus was put on pre-selecting tweets for annotation that are likely to lead to disagreement. Indeed, almost one third of the dataset ended up being annotated with a 2 vs 3 annotator split, and another third with a 1 vs 4 split.
Aim of the task and data format
We encourage participants to develop methods able to capture agreement and disagreement, rather than focusing on building the best model. To this end, we developed a harmonized JSON format used to release all datasets. Features that are common to all datasets are thus released in a homogeneous format, making it easier for participants to test their methods across all the datasets.
Among the information released that is common to all datasets, and of particular relevance for the task, are the disaggregated crowd-annotation labels and the annotators' references. Moreover, dataset-specific information is also released and varies for each dataset, from the demographics of the annotators (ArMIS and HS-Brexit datasets), to the other annotations made by the same annotators within the same dataset (all datasets), to additional annotations given for the same item by the same annotator (HS-Brexit and ConvAbuse datasets). Participants can leverage this dataset-specific information to improve performance on a specific dataset.
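As a minimal sketch of how the disaggregated labels could be used, the snippet below loads a JSON file and turns each item's individual annotations into a soft label (the fraction of annotators choosing the positive class). The field names ("annotations", "annotators") and the file name are illustrative assumptions, not the official schema; please check the files released on Codalab for the exact keys.

```python
import json

def load_items(path):
    """Load a Le-Wi-Di style JSON file mapping item ids to item dictionaries.

    NOTE: the structure assumed here is hypothetical; consult the released
    data for the actual field names.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def soft_label(item):
    """Convert disaggregated 0/1 annotations into the share of positive votes."""
    labels = [int(x) for x in item["annotations"]]
    return sum(labels) / len(labels)

# Hypothetical usage:
# items = load_items("HS-Brexit_train.json")
# for item_id, item in items.items():
#     print(item_id, soft_label(item), item["annotators"])
```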
The competition
The shared task is hosted on Codalab. Please refer to the Codalab platform for more detailed information. Note that the competition has not started yet and only a sample of the datasets has been released. The competition will start with the practice phase on 01/09/22. Please subscribe to our Google group to be kept updated with news about the task!
Important Dates
- Sample data ready 15 July 2022
- Training data ready 1 September 2022
- Evaluation start 10 January 2023
- Evaluation end by 31 January 2023 (latest date; task organizers may choose an earlier date)
- System paper submission due February 2023
- Task paper submission due February 2023
- Notification to authors March 2023
- Camera ready due April 2023
- SemEval workshop Summer 2023 (co-located with a major NLP conference)
Organisers
- Elisa Leonardelli, FBK Trento, Italy
- Gavin Abercrombie, Heriot Watt University Edinburgh, UK
- Valerio Basile, University of Turin, IT
- Tommaso Fornaciari, Italian National Police, IT
- Barbara Plank, LMU Munich, DE
- Massimo Poesio, Queen Mary University of London, UK
- Verena Rieser, Heriot Watt University Edinburgh, UK
- Alexandra N Uma, Connex One, UK
Communication
Join our Google group. We'll keep you updated with news about the task. Contact us directly if you have further inquiries. Follow us on Twitter for news about learning with disagreements and more!