Danish NLP

We do research in language processing for Danish, focusing on tools, resources, and language preservation.

Projects

Danish Gigaword

It’s hard to develop good tools for processing Danish when no large, wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus of Danish with over a billion words.

See gigaword.dk for releases and further info.

PI: Leon Derczynski; Co-I: Manuel R. Ciosici.

LITHME

EU COST, 80,000 DKK, 2020-2024

LITHME is a COST Action network with members from every EU member state, plus a number of countries outside the EU. Action chair: Dave Sayers (University of Jyväskylä).

Our aims:

to prepare linguistics and its subdisciplines for what is coming;
to facilitate longer-term dialogue between linguists and technology developers.

NLPL

NordForsk/NeIC, 2017-2020

The Nordic Language Processing Laboratory (NLPL) is a collaboration of university research groups in Natural Language Processing (NLP) in Northern Europe. Our vision is to implement a virtual laboratory for large-scale NLP research by (a) creating new ways to enable data- and compute-intensive NLP research through a common software, data and service stack in multiple Nordic HPC centres, (b) pooling competencies within the user community and among expert support teams, and (c) enabling internationally competitive, data-intensive research and experimentation on a scale that would be difficult to sustain on commodity computing resources.

Publications

Theses

Automatic Quote Selection in News Production / Automatisk citatudvælgelse i nyhedsproduktion. Lasse Funder Andersen, 2021, MSc.
Automatic Text Summarization For Danish Using BERT. Lukas Christian Nielsen, Sebastian Lindegaard Veile, 2020, MSc.

Press

This Powerful AI Technique Led to Clashes at Google and Fierce Debate in Tech. Here’s Why., Morning Brew, 2021.03.29
https://www.kmd.dk/presse/pressemeddelelser-og-nyheder/sprogmodellen-aelaectra-vil-forbedre-dansk-sprogteknologi-paa-en-klimavenlig-maade, KMD, 2021.01.05
Startup træner den første dansk-sprogede AI-model: Vi synes, det giver mening at lægge den åbent ud, 2019.12.09
Forskere kæmper for, at man fortsat kan sige »Hon e en go pibel« og blive forstået, Berlingske, 2019.11.19
»Ijn bruner katt«: Kunstig intelligens skal redde truet dansk dialekt, Politiken, 2019.09.26
Politikerne og vælgere har hver deres valgkamp på nettet, Mandag Morgen, 2019.05

Resources

DKStance

Danish Stance detection & misinformation classification dataset

Homepage: https://github.com/danish-stance-detectors/Data
Paper: https://www.aclweb.org/anthology/W19-6122/
Creator: Anders Lillie, Emil Middelboe
Year: 2019
Size: 3007 posts
License: MIT
Download: https://github.com/danish-stance-detectors/Data/archive/master.zip

Danish Stance detection tool

Danish Stance detection tool

Homepage: https://github.com/danish-stance-detectors/Stance
Paper: https://www.aclweb.org/anthology/W19-6122/
Data statement: n/a
Creator: Anders Lillie, Emil Middelboe
Year: 2019
License: MIT
Download: https://github.com/danish-stance-detectors/Stance/archive/master.zip

DAGW Validation tool

A validator for the Danish GigaWord (DAGW) project format

Homepage: https://github.com/ITUnlp/dagw_validator
Paper: https://arxiv.org/abs/2005.03521
Data statement: n/a
Creator: Manuel Ciosici
Year: 2020
License: MIT
Download: https://github.com/ITUnlp/dagw_validator/archive/master.zip

daner

Danish NER tool

Homepage: https://github.com/ITUnlp/daner
Paper: https://arxiv.org/abs/1906.11608
Data statement: n/a
Creator: Leon Derczynski
Year: 2018
License: GNU GPL v3
Download: https://github.com/ITUnlp/daner/archive/master.zip

dapipe

Danish NLP pipeline

Homepage: https://github.com/ITUnlp/dapipe
Paper: https://arxiv.org/abs/1906.11608
Data statement: n/a
Creator: Leon Derczynski
Year: 2018
License: CC-BY-NC-SA
Download: https://github.com/ITUnlp/dapipe/archive/master.zip

DKHate

Danish abusive language dataset

Homepage: https://figshare.com/articles/Danish_Hate_Speech_Abusive_Language_data/12220805
Paper: https://www.aclweb.org/anthology/2020.lrec-1.430/
DOI: https://figshare.com/articles/Danish_Hate_Speech_Abusive_Language_data/12220805
Creator: Gudbjartur Ingi Sigurbergsson
Year: 2020
Size: 3600 documents
License: CC-BY 4.0
Download: https://figshare.com/articles/Danish_Hate_Speech_Abusive_Language_data/12220805

PolStance

Danish political stance dataset

Homepage: https://figshare.com/articles/Danish_political_stance_dataset/12382592
Paper: https://www.aclweb.org/anthology/W19-6121/
DOI: https://raw.githubusercontent.com/rasleh/Political-Stance-in-Danish/master/Scraper/out/quote_db.csv
Data statement: https://github.com/rasleh/Political-Stance-in-Danish/blob/master/DATASTATEMENT.md
Creator: Rasmus Lehmann
Year: 2019
Size: 898 quotes
License: CC-BY 4.0
Download: https://raw.githubusercontent.com/rasleh/Political-Stance-in-Danish/master/Scraper/out/quote_db.csv

DKIE

Information Extraction pipeline for Danish in the GATE architecture

Homepage: https://gate.ac.uk
Paper: https://www.aclweb.org/anthology/E14-2016/
DOI: https://github.com/GateNLP/gate-core
Data statement: n/a
Creator: Leon Derczynski
Year: 2014
License: GNU LGPL v3
Download: https://github.com/GateNLP/gate-core

SAM lexicon

Sentiment Analysis Multitool. Danish Sentiment lexicon

Homepage: https://github.com/steffan267/Sentiment-Analysis-on-Danish-Social-Media/tree/master/Lexicon
Paper: https://github.com/lucaspuvis/SAM/raw/master/Thesis.pdf
DOI: https://figshare.com/articles/SAM_lexicon/12382418
Creator: Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen
License: Apache
Download: https://figshare.com/articles/SAM_lexicon/12382418

SAM dataset

Sentiment Analysis Multitool. Danish Sentiment annotated dataset

Homepage: https://github.com/steffan267/Sentiment-Analysis-on-Danish-Social-Media
Paper: https://github.com/lucaspuvis/SAM/raw/master/Thesis.pdf
DOI: https://figshare.com/articles/SAM_dataset/12382397
Creator: Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen
License: CC-BY-NC-SA
Download: https://figshare.com/articles/SAM_dataset/12382397

SAM classifier

Sentiment Analysis Multitool. Sentiment classification for Danish

Homepage: https://github.com/lucaspuvis/SAM
Paper: https://github.com/lucaspuvis/SAM/raw/master/Thesis.pdf
DOI: https://github.com/lucaspuvis/SAM/archive/master.zip
Data statement: n/a
Creator: Mads Guldborg Kjeldgaard Kongsbak, Steffan Eybye Christensen, Lucas Høyberg Puvis de Chavannes, Peter Due Jensen
Year: 2019
License: CC-BY 4.0
Download: https://github.com/lucaspuvis/SAM/archive/master.zip

E4C-2010

Danish Clinical notes (patientjournaler) - deeply anonymised

Paper: https://journals.sagepub.com/doi/full/10.1177/1460458216647760
DOI: https://itu.dk
Creator: Kostas Pantazos, Søren Lauesen, Søren Lippert
Year: 2011
Size: 5.8M records
License: Restricted
Download: https://itu.dk

Danish Clinical Word Representations

Word vectors and word clusters for Danish clinical text

Homepage: https://figshare.com/articles/Word_Representations_for_Clinical_Danish/12377858
DOI: https://figshare.com/articles/Word_Representations_for_Clinical_Danish/12377858
Creator: Leon Derczynski
Year: 2020
Size: 380K words
License: CC-BY 4.0
Download: https://figshare.com/articles/Word_Representations_for_Clinical_Danish/12377858

DAGW

Danish Gigaword corpus

Homepage: https://gigaword.dk
Paper: https://arxiv.org/abs/2005.03521
Creator: Danish Gigaword Consortium
Year: 2020
Size: 10^9 words
License: CC-BY
Download: https://gigaword.dk

Faroese parallel text

Faroese parallel text

Homepage: https://figshare.com/articles/NLPL_Faroese_Danish_Parallel_Corpus/12384047
DOI: https://figshare.com/articles/NLPL_Faroese_Danish_Parallel_Corpus/12384047
Creator: Leon Derczynski
Year: 2020
Size: 5K sentence pairs
License: CC-BY
Download: https://figshare.com/articles/NLPL_Faroese_Danish_Parallel_Corpus/12384047

Bornholmsk-Danish word embeddings

Bornholmsk-Danish word embeddings aligned in the FastText space

Homepage: https://github.com/leondz/bornholmsk
Paper: https://www.aclweb.org/anthology/W19-6138/
Creator: Leon Derczynski
Year: 2020
License: CC-BY
Download: https://github.com/leondz/bornholmsk/blob/master/bornholmsk.300d.cc-da-aligned.all.bz2

Danish Brown Clusters

Danish Brown clusters (134M tokens, a=5000)

Paper: https://arxiv.org/abs/1906.11608
Creator: Leon Derczynski
Year: 2018
Size: 778K words
License: CC-BY
Download: http://itu.dk/~leod/dansk-brown.tar.bz2

TV2 Regionerne News Corpus

News stories and metadata published by TV2 Regionerne since 2016

Homepage: https://figshare.com/articles/TV2_Regionerne_News_Corpus/12382610
DOI: https://figshare.com/articles/TV2_Regionerne_News_Corpus/12382610
Creator: TV2 Regionerne
Year: 2020
Size: 50K stories
License: CC-BY
Download: https://figshare.com/articles/TV2_Regionerne_News_Corpus/12382610

Bornholmsk-Danish parallel text

Bornholmsk-Danish parallel text

Homepage: https://github.com/leondz/bornholmsk
Paper: https://www.aclweb.org/anthology/W19-6138/
Creator: Leon Derczynski, Alex Speed Kjeldsen
Year: 2019
Size: 5K sentence pairs
License: CC-BY
Download: https://github.com/leondz/bornholmsk/raw/master/parallel.da.da-bornholm.zip

Danish CoreNLP Part-of-Speech tagger model

Danish CoreNLP Part-of-Speech tagger model

Homepage: https://github.com/GateNLP/gateplugin-Lang_Danish
Paper: https://www.aclweb.org/anthology/E14-2016/
Data statement: n/a
Creator: Leon Derczynski
Year: 2014
License: CC-BY
Download: https://github.com/GateNLP/gateplugin-Lang_Danish/blob/master/src/main/resources/resources/pos/ddt-pos.model

Danish CoreNLP NER model

Danish CoreNLP NER model

Homepage: https://github.com/GateNLP/gateplugin-Lang_Danish
Paper: https://arxiv.org/abs/1906.11608
Data statement: n/a
Creator: Leon Derczynski
Year: 2014
License: CC-BY
Download: https://github.com/ITUnlp/daner/blob/master/da01.model.gz

Political stance tagger

Danish political stance tagger

Paper: https://www.aclweb.org/anthology/W19-6121/
DOI: https://github.com/rasleh/Political-Stance-in-Danish/archive/master.zip
Data statement: n/a
Creator: Rasmus Lehmann
Year: 2019
License: CC-BY 4.0
Download: https://github.com/rasleh/Political-Stance-in-Danish/archive/master.zip

Bornholmsk baseline corpus

Corpus of Bornholmsk drawn from various sources, formal/informal, contemporary/older

Homepage: https://github.com/leondz/bornholmsk
Paper: https://www.aclweb.org/anthology/W19-6138/
Creator: Leon Derczynski, Alex Speed Kjeldsen
Year: 2019
Size: 175K words
License: CC-BY
Download: https://github.com/leondz/bornholmsk/blob/master/da-bornholm.clean.txt