Annotating Online Misogyny

Online misogyny, a category of online abusive language, has serious and harmful social consequences. Automatic detection of misogynistic language online, while imperative, poses complicated challenges to data gathering, data annotation, and bias …

PROCAT: Product Catalogue Dataset for Implicit Clustering, Permutation Learning and Structure Prediction

In this dataset paper we introduce PROCAT, a novel e-commerce dataset containing expertly designed product catalogues consisting of individual product offers grouped into complementary sections. We aim to address the scarcity of existing datasets in …

Hyperparameter Power Impact in Transformer Language Model Training

Training large language models can consume a large amount of energy. We hypothesize that the language model's configuration impacts its energy consumption, and that there is room for power consumption optimisation in modern large language models. To …

Summarizing scientific literature on the basis of deconstructed systematic reviews and meta-analyses

There is an acute need for large-scale help digesting scientific literature. In 2018, the total number of published scientific articles was estimated at 2.52 million and the number of scientific journals at around 30,000. With such vast amounts of …

An IDR Framework of Opportunities and Barriers between HCI and NLP

This paper presents a framework of opportunities and barriers/risks between the two research fields Natural Language Processing (NLP) and Human-Computer Interaction (HCI). The framework is constructed by following an interdisciplinary research-model …

DanFEVER: claim verification dataset for Danish

Automatic detection of false claims is a difficult task. Existing data to support this task has largely been limited to English. We present a dataset, DANFEVER, intended for claim verification in Danish. The dataset builds upon the task framing of …

The Danish Gigaword Corpus

Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely available one billion …

Abusive Language Recognition in Russian

Abusive phenomena are commonplace in language on the web. The scope of recognizing abusive language is broad, covering many behaviours and forms of expression. This work addresses automatic detection of abusive language in Russian. The lexical, …

Discriminating Between Similar Nordic Languages

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, …

Offensive Language and Hate Speech Detection for Danish

The presence of offensive language on social media platforms and the implications this poses are becoming a major concern in modern society. Given the enormous amount of content created every day, automatic methods are required to detect and deal with …