Dataset Open Access

INEL Evenki Corpus

Däbritz, Chris Lasse; Gusev, Valentin; Stoynova, Natalia

Data manager(s)
Ferger, Anne; Jettka, Daniel; Lazarenko, Elena; Lehmberg, Timm; Riaposov, Aleksandr
Editor(s)
Wagner-Nagy, Beáta; Arkhipov, Alexandre

Corpus Citation

Däbritz, Chris Lasse; Gusev, Valentin; Stoynova, Natalia. 2024. INEL Evenki Corpus. Version 2.0. Publication date 2024-12-31. Archived at Universität Hamburg. https://hdl.handle.net/11022/0000-0007-FE38-D. In: The INEL corpora of indigenous Northern Eurasian languages. https://hdl.handle.net/11022/0000-0007-F45A-1

Corpus Description

The INEL Evenki Corpus has been created within the long-term INEL project (Grammatical Descriptions, Corpora and Language Technology for Indigenous Northern Eurasian Languages), 2016–2033.
The corpus enables typologically aware, corpus-based grammatical research on the Evenki (< Tungusic) language and expands the documentation of the lesser-described indigenous languages of Northern Eurasia.
The INEL Evenki Corpus covers Northern (Taimyr, Khantayskoe Ozero, Ilimpi, Yerbogachyon) and Southern (Sym, Barhahan, and to a smaller extent Stony Tunguska and Nepa) Evenki dialects. These are precisely the dialects that are or were in contact with other languages included in the INEL project, that is, first and foremost, Dolgan and Selkup. The INEL Evenki Corpus contains texts from different sources:

  1. Published texts from several text collections: Vasilevich (1936): the Ilimpi, Yerbogachyon, Sym, Nepa dialects; Anisimov (1936): the Stony Tunguska dialect; Brodskaya (1967): the Khantayskoe Ozero dialect.
  2. Transcripts of recordings obtained from the Taimyr House of National Arts (TDNT) in Dudinka (2000s) as well as transcripts of recordings made by and from Tat`yana V. Bolina, all of them representing the Khantayskoe Ozero dialect. For these texts, corresponding time-aligned audio files are available.
  3. Texts from the handwritten archive of the Russian ethnographer and linguist Konstantin M. Rychkov recorded in the 1900s/1910s, covering the Taimyr, Ilimpi, Sym, and Barhahan dialects.

Each text in the corpus is provided with morphological glossing, translation into English, Russian, and German, as well as annotation of Russian borrowings. Some texts also have annotations for syntactic functions, semantic roles, information status, as well as for existential, locative, and possessive predication.

Corpus size

  • Northern dialects (Ilimpi, Yerbogachyon, Khantayskoe Ozero, Taimyr):
    176 texts, 7,091 sentences, 34,931 tokens
  • Southern “sh” dialects (Sym, Barhahan):
    425 texts, 12,395 sentences, 55,674 tokens
  • Southern “s” dialects (Stony Tunguska, Nepa):
    11 texts, 445 sentences, 2,659 tokens
  • Total: 612 texts, 19,931 sentences, 93,264 tokens
  • Total duration of audio: 3 hours 58 minutes (69 texts)

New in release 2.0

  • The total size of the corpus has roughly doubled (from 47,708 to 93,264 tokens):
    • new texts in the Sym dialect from the Rychkov archive have been added (15,495 tokens); the entire Sym collection from the archive is now included in the corpus
    • a text collection in the Barhahan dialect from the Rychkov archive has been included in the corpus (30,061 tokens)
  • Some errors in glossing have been fixed
  • Glossing has been unified in several respects (e.g. finite past tense forms, previously analysed either as finite verbs or as participles, are now all glossed as finite verbs)
  • Many glossing labels have been changed; in particular, most ambiguous grammatical glosses have been disambiguated by numbers and/or by semantic specifications: e.g. DIM for four affixes  ⇒  DIM1, DIM2, DIM3, DIM4; NMLZ ⇒ NMLZ.TMP, NMLZ.PT, etc.
  • The structure of metadata has been slightly modified (e.g. fields for the source type and availability of audio files have been added)
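The token figures quoted in this list can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers given on this page; the variable names are illustrative, and it assumes that the texts carried over from the previous release kept their token counts.

```python
# Token counts as quoted in the release notes above.
tokens_previous_release = 47_708
added_sym_tokens = 15_495        # new Sym texts from the Rychkov archive
added_barhahan_tokens = 30_061   # Barhahan collection from the Rychkov archive
tokens_release_2_0 = 93_264

# The two additions together should account for the whole increase.
added = added_sym_tokens + added_barhahan_tokens
assert tokens_previous_release + added == tokens_release_2_0

growth = tokens_release_2_0 / tokens_previous_release
print(f"added tokens: {added}")          # added tokens: 45556
print(f"growth factor: {growth:.2f}")    # growth factor: 1.95
```

The two additions from the Rychkov archive account for the entire increase, which corresponds to a growth factor of about 1.95.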

Funding

The corpus has been produced in the context of the joint research funding of the German Federal Government and Federal States in the Academies’ Programme, with funding from the Federal Ministry of Education and Research and the Free and Hanseatic City of Hamburg. The Academies’ Programme is coordinated by the Union of the German Academies of Sciences and Humanities.

Contributions/Acknowledgements

  • The Taimyr House of National Arts (TDNT) provided valuable audio material (see above).
  • Tat`yana V. Bolina (TDNT Leading Methodologist for Evenki folklore and culture) recorded further Evenki material in 2018 and 2019.
  • The Institute of Oriental Manuscripts of the Russian Academy of Sciences (IOM RAS / IVR; Институт восточных рукописей РАН) in Saint Petersburg provided scanned manuscripts from the Rychkov archive (The Archives of the Orientalists of IOM RAS, Coll. 49, inv. 1, items 4, 5, 6а, 6б, 6в).

Searching the corpus

The corpus can be downloaded from the ZFDM Repository using the links provided below and browsed or searched locally using the EXMARaLDA software or, alternatively, ELAN.

Online search with the Tsakorpus platform is available at https://inel.corpora.uni-hamburg.de/EvenkiCorpus/search.

Remote search with EXMARaLDA is also possible without downloading all the files (see https://inel.corpora.uni-hamburg.de/portal/help/en/index.php#search).

See the user documentation (section 3) for details on transcription, annotation tiers and annotation tags. Find further information and links on the Evenki Corpus page at the INEL Resources portal: https://inel.corpora.uni-hamburg.de/portal/corpora/evenki/.

Files (3.1 GB)

Name                            Size      MD5
evenki-2.0-documentation.pdf    2.5 MB    8c3472ec27035d8d56c70d50b57dc55d
evenki-2.0-lite.zip             61.5 MB   395717280876078cd33d54382b9717e1
evenki-2.0-mp3.zip              1.0 GB    1476fa0e1374563b41e2c32850d2d4aa
evenki-2.0-standard.zip         2.0 GB    e578975ec4c2517a30e7aed597338e15
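
After downloading, the files can be checked against the MD5 checksums listed above. The following is a minimal sketch using only the Python standard library; the function names (md5_of, verify_downloads) and the assumption that the files sit in a single local directory under their original names are illustrative and not part of the corpus distribution.

```python
import hashlib
from pathlib import Path

# Checksums copied from the file list on this page.
EXPECTED_MD5 = {
    "evenki-2.0-documentation.pdf": "8c3472ec27035d8d56c70d50b57dc55d",
    "evenki-2.0-lite.zip": "395717280876078cd33d54382b9717e1",
    "evenki-2.0-mp3.zip": "1476fa0e1374563b41e2c32850d2d4aa",
    "evenki-2.0-standard.zip": "e578975ec4c2517a30e7aed597338e15",
}


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(directory="."):
    """Map each expected file name to True (match), False (mismatch)
    or None (file not present in the directory)."""
    results = {}
    for name, expected in EXPECTED_MD5.items():
        path = Path(directory) / name
        results[name] = (md5_of(path) == expected) if path.is_file() else None
    return results


if __name__ == "__main__":
    labels = {True: "OK", False: "MISMATCH", None: "not found"}
    for name, ok in verify_downloads().items():
        print(f"{name}: {labels[ok]}")
```

Reading in chunks keeps memory use low for the larger archives. A mismatch usually indicates an incomplete or corrupted download, in which case the file should be fetched again.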
