We do research in language processing for Danish, focusing on tools, resources, and language preservation.
It is hard to develop good tools for processing Danish computationally when no large, wide-coverage dataset of Danish text is readily available. To address this, the Danish Gigaword Project (DAGW) maintains a corpus of Danish with over a billion words.
See gigaword.dk for releases and further info.
PI: Leon Derczynski; Co-I: Manuel R. Ciosici.
EU COST, 80.000 DKK, 2020-2024
LITHME is a COST Action network with members from every EU member state, plus a number of countries outside the EU. Action chair: Dave Sayers (University of Jyväskylä).
Our aims:
NordForsk/NeIC, 2017-2020
The Nordic Language Processing Laboratory (NLPL) is a collaboration of university research groups in Natural Language Processing (NLP) in Northern Europe. Our vision is to implement a virtual laboratory for large-scale NLP research by (a) creating new ways to enable data- and compute-intensive NLP research through a common software, data and service stack in multiple Nordic HPC centres, (b) pooling competencies within the user community and among expert support teams, and (c) enabling internationally competitive, data-intensive research and experimentation on a scale that would be difficult to sustain on commodity computing resources.
Danish stance detection & misinformation classification dataset
Danish stance detection tool
A validator for the Danish Gigaword (DAGW) project format
Danish NER tool
Danish NLP pipeline
Danish abusive language dataset
Danish political stance dataset
Information Extraction pipeline for Danish in the GATE architecture
Sentiment Analysis Multitool. Danish Sentiment lexicon
Sentiment Analysis Multitool. Danish Sentiment annotated dataset
Sentiment Analysis Multitool. Sentiment classification for Danish
Danish Clinical notes (patientjournaler) - deeply anonymised
Word vectors and word clusters for Danish clinical text
Danish Gigaword corpus
Faroese parallel text
Bornholmsk-Danish word embeddings aligned in the FastText space
Danish Brown clusters (134M tokens, a=5000)
News stories and metadata published by TV2 Regionerne since 2016
Bornholmsk-Danish parallel text
Danish CoreNLP Part-of-Speech tagger model
Danish CoreNLP NER model
Danish political stance tagger
Corpus of Bornholmsk drawn from various sources, formal/informal, contemporary/older