INEL Evenki Corpus

Name: INEL Evenki Corpus
Published: 2024-12-31
License: https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode

Däbritz, Chris Lasse; Gusev, Valentin; Stoynova, Natalia

doi:10.25592/uhhfdm.16605

December 31, 2024 Dataset Open Access


Data manager(s)

Ferger, Anne; Jettka, Daniel; Lazarenko, Elena; Lehmberg, Timm; Riaposov, Aleksandr

Editor(s)

Wagner-Nagy, Beáta; Arkhipov, Alexandre

Corpus Citation

Däbritz, Chris Lasse; Gusev, Valentin; Stoynova, Natalia. 2024. INEL Evenki Corpus. Version 2.0. Publication date 2024-12-31. Archived at Universität Hamburg. https://hdl.handle.net/11022/0000-0007-FE38-D. In: The INEL corpora of indigenous Northern Eurasian languages. https://hdl.handle.net/11022/0000-0007-F45A-1

Corpus Description

The INEL Evenki Corpus has been created within the long-term INEL project (Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages), 2016–2033.
The corpus enables typologically aware, corpus-based grammatical research on the Evenki (< Tungusic) language and expands the documentation of the lesser-described indigenous languages of Northern Eurasia.
The INEL Evenki Corpus covers Northern (Taimyr, Khantayskoe Ozero, Ilimpi, Yerbogachyon) and Southern (Sym, Barhahan, and to a smaller extent Stony Tunguska and Nepa) Evenki dialects. These are exactly the dialects that are or were in contact with other languages included in the INEL project, first and foremost Dolgan and Selkup. The INEL Evenki Corpus contains texts from different sources:

- Published texts from several text collections: Vasilevich (1936): the Ilimpi, Yerbogachyon, Sym, Nepa dialects; Anisimov (1936): the Stony Tunguska dialect; Brodskaya (1967): the Khantayskoe Ozero dialect.
- Transcripts of recordings obtained from the Taimyr House of National Arts (TDNT) in Dudinka (2000s) as well as transcripts of recordings made by and from Tat`yana V. Bolina, all of them representing the Khantayskoe Ozero dialect. For these texts, corresponding time-aligned audio files are available.
- Texts from the handwritten archive of the Russian ethnographer and linguist Konstantin M. Rychkov recorded in the 1900s/1910s, covering the Taimyr, Ilimpi, Sym, and Barhahan dialects.

Each text in the corpus is provided with morphological glossing, translation into English, Russian, and German, as well as annotation of Russian borrowings. Some texts also have annotations for syntactic functions, semantic roles, information status, as well as for existential, locative, and possessive predication.

Corpus size

Northern dialects (Ilimpi, Yerbogachyon, Khantayskoe Ozero, Taimyr):
176 texts, 7,091 sentences, 34,931 tokens
Southern “sh” dialects (Sym, Barhahan):
425 texts, 12,395 sentences, 55,674 tokens
Southern “s” dialects (Stony Tunguska, Nepa):
11 texts, 445 sentences, 2,659 tokens
Total: 612 texts, 19,931 sentences, 93,264 tokens
Total duration of audio: 3 hours 58 minutes (69 texts)
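
The subtotals above can be cross-checked against the reported totals with a few lines of Python. This is only a minimal sketch: the figures are copied from the listing, while the dictionary keys and the helper name `sum_counts` are illustrative choices, not part of the corpus or its tooling.

```python
# Figures copied from the "Corpus size" listing; group labels are shorthand.
subcorpora = {
    "Northern": {"texts": 176, "sentences": 7091, "tokens": 34931},
    "Southern 'sh'": {"texts": 425, "sentences": 12395, "tokens": 55674},
    "Southern 's'": {"texts": 11, "sentences": 445, "tokens": 2659},
}
reported_total = {"texts": 612, "sentences": 19931, "tokens": 93264}


def sum_counts(groups):
    """Add up texts, sentences and tokens across all dialect groups."""
    totals = {"texts": 0, "sentences": 0, "tokens": 0}
    for counts in groups.values():
        for key in totals:
            totals[key] += counts[key]
    return totals


computed_total = sum_counts(subcorpora)
print("Subtotals match reported totals:", computed_total == reported_total)
```

The three dialect groups add up exactly to the stated totals of texts, sentences, and tokens.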

New in release 2.0

- The total size of the corpus has roughly doubled (from 47,708 to 93,264 tokens):
  - New texts in the Sym dialect from the Rychkov archive have been added (15,495 tokens); the entire Sym collection from the archive is now included in the corpus.
  - A text collection in the Barhahan dialect from the Rychkov archive has been included in the corpus (30,061 tokens).
- Some errors in glossing have been fixed.
- Glossing has been unified at some points (e.g. the analysis of finite past tense forms as finite verbs vs. participles: all such forms are now glossed as finite verbs).
- Many glossing labels have been changed; in particular, most ambiguous grammatical glosses have been disambiguated by numbers and/or by semantic specifications: e.g. DIM for four affixes ⇒ DIM1, DIM2, DIM3, DIM4; NMLZ ⇒ NMLZ.TMP, NMLZ.PT, etc.
- The structure of metadata has been slightly modified (e.g. fields for the source type and availability of audio files have been added).
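
The growth figures for release 2.0 can be checked in the same spirit. The sketch below uses only the token counts stated in this section; the variable names are illustrative.

```python
# Token counts as stated in the release notes for version 2.0.
tokens_v1 = 47_708
tokens_v2 = 93_264
added_tokens = {
    "Sym (Rychkov archive)": 15_495,
    "Barhahan (Rychkov archive)": 30_061,
}

total_added = sum(added_tokens.values())
growth_factor = tokens_v2 / tokens_v1

print("Tokens added:", total_added)
print("Old size plus additions equals new size:", tokens_v1 + total_added == tokens_v2)
print(f"Growth factor: {growth_factor:.2f}")
```

The two Rychkov collections account for the entire increase, and the growth factor comes out just under 2, consistent with the corpus roughly doubling in size.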

Funding

The corpus has been produced in the context of the joint research funding of the German Federal Government and Federal States in the Academies’ Programme, with funding from the Federal Ministry of Education and Research and the Free and Hanseatic City of Hamburg. The Academies’ Programme is coordinated by the Union of the German Academies of Sciences and Humanities.

Contributions/Acknowledgements

The Taimyr House of National Arts (TDNT) provided valuable audio material (see above).
Tat`yana V. Bolina (TDNT Leading Methodologist for Evenki folklore and culture) recorded further Evenki material in 2018 and 2019.
The Institute of Oriental Manuscripts of the Russian Academy of Sciences (IOM RAS / IVR; Институт восточных рукописей РАН) in Saint Petersburg provided scanned manuscripts from the Rychkov archive (The Archives of the Orientalists of IOM RAS, Coll. 49, inv. 1, items 4, 5, 6а, 6б, 6в).

Searching the corpus

The corpus can be downloaded from the ZFDM Repository using the links provided below and browsed or searched locally using the EXMARaLDA software or, alternatively, ELAN.

Online search with the Tsakorpus platform is available at https://inel.corpora.uni-hamburg.de/EvenkiCorpus/search.

Remote search with EXMARaLDA is also possible without downloading all the files (see https://inel.corpora.uni-hamburg.de/portal/help/en/index.php#search).

See the user documentation (section 3) for details on transcription, annotation tiers and annotation tags. Find further information and links on the Evenki Corpus page at the INEL Resources portal: https://inel.corpora.uni-hamburg.de/portal/corpora/evenki/.

Files (3.1 GB)

Name	MD5	Size
evenki-2.0-documentation.pdf	8c3472ec27035d8d56c70d50b57dc55d	2.5 MB
evenki-2.0-lite.zip	395717280876078cd33d54382b9717e1	61.5 MB
evenki-2.0-mp3.zip	1476fa0e1374563b41e2c32850d2d4aa	1.0 GB
evenki-2.0-standard.zip	e578975ec4c2517a30e7aed597338e15	2.0 GB
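
After downloading, the files can be checked against the MD5 checksums listed for them. The following is a minimal sketch using Python's standard `hashlib` module. The checksums are copied from the listing; the function names `md5_of_file` and `verify_downloads`, and the assumption that the files sit in the current directory under their original names, are illustrative and not part of any INEL tooling.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file listing of release 2.0.
EXPECTED_MD5 = {
    "evenki-2.0-documentation.pdf": "8c3472ec27035d8d56c70d50b57dc55d",
    "evenki-2.0-lite.zip": "395717280876078cd33d54382b9717e1",
    "evenki-2.0-mp3.zip": "1476fa0e1374563b41e2c32850d2d4aa",
    "evenki-2.0-standard.zip": "e578975ec4c2517a30e7aed597338e15",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Compare every listed file found in `directory` with its expected checksum.

    Files that are not present are skipped, so partial downloads
    (e.g. only the lite package) can still be checked.
    """
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        if path.is_file():
            results[name] = md5_of_file(path) == expected
    return results


if __name__ == "__main__":
    for name, ok in verify_downloads().items():
        print(f"{name}: {'OK' if ok else 'MISMATCH'}")
```

Reading in chunks matters here because the standard package is about 2 GB; a mismatch most likely indicates an incomplete or corrupted download.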

Publication date: December 31, 2024

DOI: 10.25592/uhhfdm.16605

Keyword(s):

Tungusic; Evenki; endangered language; language contact; language documentation; legacy data; INEL; AdWHH; text corpus; speech corpus; parallel texts; folklore; tales; narrative; conversation; song; transcription; time-aligned audio; morphological glossing; part-of-speech; borrowings; code-switching; semantic roles; syntactic functions; information status; existential predication; locative predication; possessive predication; English translation; German translation; Russian translation; EXMARaLDA; ELAN; XML; ISO/TEI

Related identifiers:

Previous versions:
11022/0000-0007-F43C-3

License (for files): Creative Commons Attribution Non Commercial Share Alike 4.0 International

Versions

Version 2.0 10.25592/uhhfdm.16605	Dec 31, 2024
Version 1.0 10.25592/uhhfdm.9628	Dec 31, 2021

Cite all versions? You can cite all versions by using the DOI 10.25592/uhhfdm.9627. This DOI represents all versions, and will always resolve to the latest one.
