Introducing DoReCo: Language Documentation Reference Corpora

On the ELAR blog this week, we are introducing the DoReCo project (Language Documentation Reference Corpus). DoReCo started in March 2019 and is a French-German collaborative project that brings together spoken language corpora from about 50 languages, extracted from documentations of small and often endangered languages.

Matt Stave, postdoctoral researcher at Laboratoire Dynamique du Langage in Lyon, tells us more about the project and its aims.

Please can you tell us a bit about DoReCo?

The DoReCo project brings together spoken language corpora from 50+ languages, primarily small and often endangered languages. All corpora have at least 10,000 words, transcriptions, and translations in a majority language, and a subset (30+) have additional morphological annotation. During the course of the project, corpora will be time-aligned at the phoneme and word levels. This is a collaborative project between the Leibniz-Zentrum Allgemeine Sprachwissenschaft (ZAS) in Berlin and the Laboratoire Dynamique du Langage (DDL) in Lyon, headed by PIs Manfred Krifka (ZAS) and Frank Seifart (DDL), and funded jointly by the Deutsche Forschungsgemeinschaft (DFG) and the Agence Nationale de la Recherche (ANR).

What was the motivation behind the DoReCo project?

Many claims about the structure of language and language processing have been made with reference to only a small number of languages. This is understandable: until recently, language data from non-majority languages could be hard to come by. However, researchers working on the documentation of smaller languages have been moving more and more towards the collection of annotated spoken corpora. These corpora, in addition to being valuable resources for language communities and for language description, also have the potential to be very valuable resources for improving our understanding of language generally.

The DoReCo corpora will be of interest to researchers working on all aspects of language, from pragmatics to morphosyntax to phonetics. In addition to the lexical and morphological annotations, corpora will be given phonemically time-aligned annotations by using the MAUS (Munich Automatic Segmentation System) software, which we offer back to the corpus creators.

Dozens of corpus creators from around the world have already contributed their corpora, with more on the way.

Why is DoReCo important? Can you give our readers an insight into why this project is so worthwhile?

Corpus-based approaches to linguistic analysis are very useful, both for showing large-scale patterns in language and for identifying interesting irregularities in the patterns that require a closer analysis. But corpus linguistics has mainly been limited to Indo-European languages and a handful of other majority languages, which has limited our understanding of how humans process language. By bringing together a collection of publicly shareable corpora from a diverse set of languages, we hope to enable researchers to ask and answer more questions about human language generally, rather than just about a small subset of human languages. This will sound a bit nerdy, but I think that living at a moment when it is suddenly possible to ask so many questions about the world’s languages is quite exciting!

We at the DoReCo project will be exploring a number of theoretical questions on the corpora ourselves. The ZAS group will be looking at universal claims of phonetic lengthening and segment (in)compressibility, examining the influence of language-specific phonemic and morphophonemic properties. The DDL group will be looking at information rate and information packaging, testing whether languages tend towards an optimized, universal “attractor state” for information rate, and whether they tend to package similar amounts of information in inter-pausal units, regardless of language-specific morphological patterns.

The three text tiers show alignment at the utterance, word, and phoneme levels.

Anything else you would like to share with our readers?

The DoReCo project began only a few months ago, and at this point we have received and begun processing corpora for more than 25 languages, with many more on the way. But we are still eager to make the corpus even more comprehensive. If you are interested in participating in the project, check out our website at http://doreco.info/, or contact either of the post-docs on the project: Ludger Paschen (paschen@leibniz-zas.de) or Matt Stave (stave.matthew@gmail.com).
