Danish NLP

We do research in language processing for Danish, focusing on tools, resources, and language preservation.


Danish Gigaword

It is hard to develop good tools for processing Danish when no large, wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus of Danish text with over a billion words.

See gigaword.dk for releases and further info.

PI: Leon Derczynski; Co-I: Manuel R. Ciosici.


EU COST, 80,000 DKK, 2020-2024

LITHME is a COST Action network with members from every EU member state, plus a number of countries outside the EU. Action chair: Dave Sayers (University of Jyväskylä).

Our aims:

  • to prepare linguistics and its subdisciplines for what is coming;
  • to facilitate longer-term dialogue between linguists and technology developers.


NordForsk/NeIC, 2017-2020

The Nordic Language Processing Laboratory (NLPL) is a collaboration of university research groups in Natural Language Processing (NLP) in Northern Europe. Our vision is to implement a virtual laboratory for large-scale NLP research by (a) creating new ways to enable data- and compute-intensive NLP research through a common software, data, and service stack in multiple Nordic HPC centres, (b) pooling competencies within the user community and among expert support teams, and (c) enabling internationally competitive, data-intensive research and experimentation on a scale that would be difficult to sustain on commodity computing resources.


  • Automatic Quote Selection in News Production / Automatisk citatudvælgelse i nyhedsproduktion. Lasse Funder Andersen, 2021, MSc.
  • Automatic Text Summarization For Danish Using BERT. Lukas Christian Nielsen, Sebastian Lindegaard Veile, 2020, MSc.



Danish stance detection & misinformation classification dataset

Danish stance detection tool

DAGW Validation tool

A validator for the Danish Gigaword (DAGW) project format


Danish NER tool


Danish NLP pipeline


Danish abusive language dataset


Danish political stance dataset


Information Extraction pipeline for Danish in the GATE architecture

SAM lexicon

Sentiment Analysis Multitool. Danish sentiment lexicon

SAM dataset

Sentiment Analysis Multitool. Danish sentiment-annotated dataset

SAM classifier

Sentiment Analysis Multitool. Sentiment classification for Danish


Danish Clinical notes (patientjournaler) - deeply anonymised

Danish Clinical Word Representations

Word vectors and word clusters for Danish clinical text


Danish Gigaword corpus

Faroese parallel text

Bornholmsk-Danish word embeddings

Bornholmsk-Danish word embeddings aligned in the FastText space

Danish Brown Clusters

Danish Brown clusters (134M tokens, a=5000)

TV2 Regionerne News Corpus

News stories and metadata published by TV2 Regionerne since 2016

Bornholmsk-Danish parallel text

Danish CoreNLP Part-of-Speech tagger model

Danish CoreNLP NER model

Political stance tagger

Bornholmsk baseline corpus

Corpus of Bornholmsk drawn from various sources, formal/informal, contemporary/older