INEL Nganasan Corpus

Name: INEL Nganasan Corpus
Published: 2025-05-02
License: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode

Brykina, Maria; Gusev, Valentin; Szeverényi, Sándor; Wagner-Nagy, Beáta

doi:10.25592/uhhfdm.17419

May 2, 2025 Dataset Open Access

Data manager(s)

Lazarenko, Elena; Riaposov, Aleksandr; Lehmberg, Timm

Editor(s)

Wagner-Nagy, Beáta; Arkhipov, Alexandre

Corpus Citation

Brykina, Maria; Gusev, Valentin; Szeverényi, Sándor; Wagner-Nagy, Beáta. INEL Nganasan Corpus. Version 1.0. Publication date 2025-05-02. https://hdl.handle.net/11022/0000-0007-FE63-C. Archived at Universität Hamburg. In: The INEL corpora of indigenous Northern Eurasian languages. https://hdl.handle.net/11022/0000-0007-F45A-1

Corpus Description

The INEL Nganasan corpus has been created within the long-term INEL project ("Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages"), 2016–2033. The corpus is largely based on the Nganasan Spoken Language Corpus, which has been adapted to the INEL standards and supplemented with new texts. The corpus enables typologically oriented, corpus-based research on Nganasan and expands the documentation of the lesser-described indigenous languages of Northern Eurasia.

The INEL Nganasan corpus consists of two parts. The glossed (searchable) part of the corpus includes texts provided with source media files (whenever available) and annotated transcripts. The archival part of the corpus contains non-glossed texts, represented either by audio recordings (optionally with preliminary transcriptions) or by scanned pages of manuscripts or publications.

The corpus includes Nganasan texts recorded between 1933 and 2019. The sources of the corpus are:

Audio recordings made by Maria Brykina, Valentin Gusev, Sándor Szeverényi and Beáta Wagner-Nagy.
Legacy audio recordings made by A. Aksyonova, Svetlana S. Aksyonova, Josefina Budzisch, Michael Daniel, Oksana E. Dobzhanskaya, Eugene Helimski, Nadezhda T. Kosterkina, Jean-Luc Lambert, Marina D. Lyublinskaya, N. A. Popov, Florian Sobanski, Eugénie Stapert, Larisa Y. Turdagina, Zsuzsa Várnai, Peter Voliak, Tatjana Zhdanova and possibly other people.
Legacy manuscript transcriptions made by Ekaterina P. Boldt, Eugene Helimski, Nadezhda T. Kosterkina, I. E. Machkinis, E. P. Nojfeld, A. K. Stolyarova, Natalia M. Tereshchenko and Tatjana Zhdanova.
Texts published by Ekaterina P. Boldt, I. E. Machkinis, Tibor Mikola, Georgij N. Prokofiev and A. K. Stolyarova.

Corpus size

The glossed (searchable) part of the corpus contains 236 texts, 34,872 sentences and 221,747 tokens. The total duration of the audio recordings is 49 hours 53 minutes.

The archival part of the corpus contains 98 hours of audio material (210 texts) and 30 manuscripts.
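For rough orientation, the figures given for the glossed part imply a few simple averages. The sketch below only divides the published totals. The variable and function names are illustrative and not part of the corpus documentation, and the results say nothing about how the values are distributed across individual texts.

```python
# Totals for the glossed (searchable) part, as stated in the corpus description.
TEXTS = 236
SENTENCES = 34_872
TOKENS = 221_747


def corpus_averages(texts: int, sentences: int, tokens: int) -> dict[str, float]:
    """Return simple per-unit averages derived from the corpus totals."""
    return {
        "tokens_per_sentence": tokens / sentences,
        "sentences_per_text": sentences / texts,
        "tokens_per_text": tokens / texts,
    }


stats = corpus_averages(TEXTS, SENTENCES, TOKENS)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```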

Funding

The INEL Nganasan corpus has been produced in the context of the joint research funding of the German Federal Government and Federal States in the Academies’ Programme, with funding from the Federal Ministry of Education and Research and the Free and Hanseatic City of Hamburg. The Academies’ Programme is coordinated by the Union of the German Academies of Sciences and Humanities.

The Nganasan Spoken Language Corpus, which was integrated into the INEL Nganasan corpus, was created as part of the project Corpus based grammatical studies on Nganasan at the Institute of Finno-Ugric/Uralic Studies of Universität Hamburg. The project was supported by the Deutsche Forschungsgemeinschaft under grant number WA3153/2-1 between 2014 and 2017.

Contributions/Acknowledgements

Many native speakers shared their knowledge of Nganasan and thus made the existence of this corpus possible (see the documentation file below, Appendix A1). We are especially grateful to those who spent days and sometimes months working with us: Svetlana S. Aksyonova, Zinaida S. Chebodaeva, Nikolai S. Chunanchar, Nina D. Chunanchar, Yuliya M. Goricheva, Ekaterina Ch. Kokore, Ekaterina S. Kosterkina, Nadezhda T. Kosterkina, Svetlana M. Kudryakova, Serafima M. Kupchik, Tat`yana T. Kuzenko, Aleksandr Ch. Momde, Dar`ya Ch. Momde, Vera L. Momde, Vasilij F. Porbin, Evdokiya D. Porbina, Mariya M. Porbina, Zoya Ch. Porbina, Galina F. Porotova, Ekaterina N. Sovalova, Lodun N. Turdagina, Nadezhda K. Turdagina, Tat`yana D. Turkina, Mariya D. Yarotskaya, Sy`ku M. Yarotskaya.
The Department of Siberian Indigenous Languages of Tomsk State Pedagogical University and the Institute for Linguistic Studies RAS kindly provided access to their archives.
The Dudinka branch of GTRK “Norilsk” generously provided access to the Nganasan part of its extensive audio archive.
The Taimyr House of National Arts and the City Centre of National Arts in Dudinka helped and supported us during our field trips.

Searching the corpus

The corpus can be downloaded from the ZFDM Repository using the links provided below and browsed or searched locally using the EXMARaLDA software or, alternatively, ELAN.

Online search with the Tsakorpus platform is available at https://inel.corpora.uni-hamburg.de/NganasanCorpus/search.

Remote search with EXMARaLDA is also possible without downloading all the files (see https://inel.corpora.uni-hamburg.de/portal/help/en/index.php).

See the user documentation (section 3) for details on transcription, annotation tiers and annotation tags. Find further information and links on the Nganasan Corpus page at the INEL Resources portal: https://inel.corpora.uni-hamburg.de/portal/corpora/nganasan/.

Files (151.3 GB)

Name	MD5	Size
nganasan-1.0-documentation.pdf	911195244a6f330ea301215d21ed84ed	3.5 MB
nganasan-1.0-lite.zip	5ad2dc3aa08b5e9d7844e64ebb5ad1e0	119.5 MB
nganasan-1.0-mp3.zip	bc4ad3b6b389fa8a1d28d9882aebd998	13.2 GB
nganasan-1.0-standard.zip	7d162a7ea5f89cc6887ae5fac8cae21a	60.5 GB
nganasan-1.0-video.zip	050e39b5eb55148aed6878d66bb49745	77.5 GB
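Because the larger archives run to tens of gigabytes, it may be worth checking a finished download against the MD5 checksums listed for the files. The following is a minimal sketch using only the Python standard library. The helper names `md5_of_file` and `verify_download` are hypothetical and not part of any INEL or repository tooling; the file names and checksum values are copied from the file list.

```python
import hashlib
from pathlib import Path

# MD5 checksums copied from the repository file list.
EXPECTED_MD5 = {
    "nganasan-1.0-documentation.pdf": "911195244a6f330ea301215d21ed84ed",
    "nganasan-1.0-lite.zip": "5ad2dc3aa08b5e9d7844e64ebb5ad1e0",
    "nganasan-1.0-mp3.zip": "bc4ad3b6b389fa8a1d28d9882aebd998",
    "nganasan-1.0-standard.zip": "7d162a7ea5f89cc6887ae5fac8cae21a",
    "nganasan-1.0-video.zip": "050e39b5eb55148aed6878d66bb49745",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path):
    """Return True if the file's MD5 matches the listed checksum for its name."""
    path = Path(path)
    expected = EXPECTED_MD5.get(path.name)
    if expected is None:
        raise KeyError(f"No checksum listed for {path.name}")
    return md5_of_file(path) == expected
```

A call such as `verify_download("nganasan-1.0-lite.zip")` returns `False` for a truncated or corrupted download. Reading in chunks keeps memory use constant even for the 77.5 GB video archive.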

Publication date: May 2, 2025

DOI: 10.25592/uhhfdm.17419

Keyword(s): Uralic; Samoyedic; Nganasan; endangered language; language contact; language documentation; legacy data; INEL; AdWHH; text corpus; speech corpus; parallel texts; folklore; tales; narrative; song; transcription; time-aligned audio; morphological glossing; part-of-speech; borrowings; code-switching; existential predication; locative predication; possessive predication; English translation; Russian translation; EXMARaLDA; ELAN; XML; ISO/TEI

Alternate identifiers: 11022/0000-0007-FE63-C

License (for files): Creative Commons Attribution Non Commercial Share Alike 4.0 International

Versions

Version 1.0 10.25592/uhhfdm.17419

May 2, 2025

Cite all versions? You can cite all versions by using the DOI 10.25592/uhhfdm.17418. This DOI represents all versions, and will always resolve to the latest one.
