Danish NLP

We do research in language processing for Danish, focusing on tools, resources, and language preservation.

Projects

Danish Gigaword

It’s hard to develop good tools for processing Danish when no large, wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus of Danish with over a billion words.

See gigaword.dk for releases and further info.

PI: Leon Derczynski; Co-I: Manuel R. Ciosici.

LITHME

EU COST, 80,000 DKK, 2020-2024

LITHME is a COST Action network with members from every EU member state and a number of countries outside the EU. Action chair: Dave Sayers (University of Jyväskylä).

Our aims:

  • to prepare linguistics and its subdisciplines for what is coming;
  • to facilitate longer-term dialogue between linguists and technology developers.

NLPL

NordForsk/NeIC, 2017-2020

The Nordic Language Processing Laboratory (NLPL) is a collaboration of university research groups in Natural Language Processing (NLP) in Northern Europe. Our vision is to implement a virtual laboratory for large-scale NLP research by (a) creating new ways to enable data- and compute-intensive NLP research through a common software, data and service stack in multiple Nordic HPC centres, (b) pooling competencies within the user community and among expert support teams, and (c) enabling internationally competitive, data-intensive research and experimentation on a scale that would be difficult to sustain on commodity computing resources.

Theses

  • Automatic Quote Selection in News Production / Automatisk citatudvælgelse i nyhedsproduktion. Lasse Funder Andersen, 2021, MSc.
  • Automatic Text Summarization For Danish Using BERT. Lukas Christian Nielsen, Sebastian Lindegaard Veile, 2020, MSc.

Resources

DKStance

Danish stance detection and misinformation classification dataset

Danish Stance detection tool

DAGW Validation tool

A validator for the Danish GigaWord (DAGW) project format

daner

Danish NER tool

dapipe

Danish NLP pipeline

DKHate

Danish abusive language dataset

PolStance

Danish political stance dataset

DKIE

Information Extraction pipeline for Danish in the GATE architecture

SAM lexicon

Sentiment Analysis Multitool. Danish sentiment lexicon

SAM dataset

Sentiment Analysis Multitool. Danish sentiment-annotated dataset

SAM classifier

Sentiment Analysis Multitool. Sentiment classification for Danish

E4C-2010

Danish Clinical notes (patientjournaler) - deeply anonymised

Danish Clinical Word Representations

Word vectors and word clusters for Danish clinical text

DAGW

Danish Gigaword corpus

Faroese parallel text

Bornholmsk-Danish word embeddings

Bornholmsk-Danish word embeddings aligned in the FastText space

Danish Brown Clusters

Danish Brown clusters (134M tokens, a=5000)

TV2 Regionerne News Corpus

News stories and metadata published by TV2 Regionerne since 2016

Bornholmsk-Danish parallel text

Danish CoreNLP Part-of-Speech tagger model

Danish CoreNLP NER model

Political stance tagger

Danish political stance tagger

Bornholmsk baseline corpus

Corpus of Bornholmsk drawn from various sources, formal/informal, contemporary/older