The Spoken Wikipedia Corpora

Baumann, Timo

doi:10.25592/uhhfdm.1875

October 27, 2017 Dataset Open Access

JSON-LD (schema.org) Export

{
  "@context": "https://schema.org/",
  "@id": "http://doi.org/10.25592/uhhfdm.1875",
  "@type": "Dataset",
  "contributor": [
    {"@type": "Person", "name": "Stegen, Florian"},
    {"@id": "https://orcid.org/0000-0003-2203-1783", "@type": "Person", "name": "Baumann, Timo"},
    {"@id": "https://orcid.org/0000-0002-4880-2016", "@type": "Person", "name": "K\u00f6hn, Arne"}
  ],
  "creator": [
    {"@id": "https://orcid.org/0000-0003-2203-1783", "@type": "Person", "name": "Baumann, Timo"}
  ],
  "datePublished": "2017-10-27",
  "description": "<p>The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.</p>\n\n<p>Timo Baumann and Arne K&ouml;hn and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening, in Language Resources and Evaluation, Special Issue representing significant contributions of LREC 2016.</p>\n\n<p>Arne K&ouml;hn, Florian Stegen, Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).</p>\n\n<p>&nbsp;</p>\n\n<p><strong>CLARIN Metadata summary for The Spoken Wikipedia Corpora (CMDI-based)</strong></p>\n\n<p><strong>Title: </strong>The Spoken Wikipedia Corpora<br>\n<strong>Description: </strong> The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.<br>\n<strong>Publication date: </strong>2017<br>\n<strong>Data owner: </strong> Timo Baumann - Universit&auml;t Hamburg<br>\n<strong>Contributors: </strong> Timo Baumann (author), Arne K&ouml;hn (author), Florian Stegen (author)<br>\n<strong>Languages: </strong> <a href=\"https://www.ethnologue.com/language/eng\">English (eng)</a>, <a href=\"https://www.ethnologue.com/language/deu\">German (deu)</a>, <a href=\"https://www.ethnologue.com/language/nld\">Dutch (nld)</a><br>\n<strong>Size: </strong> 5397 article, 1005 hour<br>\n<strong>Segmentation units: </strong> other<br>\n<strong>Genre: </strong> encyclopedia<br>\n<strong>Modality: </strong> spoken<br>\n<strong>References: </strong> Timo Baumann; Arne K&ouml;hn; Felix Hennig (2018) The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening <strong>References: </strong> Arne K&ouml;hn; Florian Stegen; Timo Baumann (2016) Mining the Spoken Wikipedia for Speech Data and Beyond</p>\n\n<p>&nbsp;</p>",
  "distribution": [
    {"@type": "DataDownload", "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.xml", "encodingFormat": "xml"},
    {"@type": "DataDownload", "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.cmdi", "encodingFormat": "cmdi"},
    {"@type": "DataDownload", "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.zip", "encodingFormat": "zip"}
  ],
  "identifier": "http://doi.org/10.25592/uhhfdm.1875",
  "keywords": ["linguistics", "English", "German", "Dutch"],
  "license": "https://creativecommons.org/licenses/by-sa/4.0/legalcode",
  "name": "The Spoken Wikipedia Corpora",
  "url": "https://www.fdr.uni-hamburg.de/record/1875",
  "version": "2.0"
}
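As a minimal sketch of how the export above might be consumed, the following Python snippet parses an abridged copy of the record and lists the download URLs by format. Only a handful of fields are reproduced and the long description is omitted; the helper name `download_urls` is hypothetical and not part of any repository API.

```python
import json

# Abridged copy of the JSON-LD record above (description and people omitted).
RECORD = """
{
  "@context": "https://schema.org/",
  "@id": "http://doi.org/10.25592/uhhfdm.1875",
  "@type": "Dataset",
  "name": "The Spoken Wikipedia Corpora",
  "version": "2.0",
  "license": "https://creativecommons.org/licenses/by-sa/4.0/legalcode",
  "distribution": [
    {"@type": "DataDownload",
     "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.xml",
     "encodingFormat": "xml"},
    {"@type": "DataDownload",
     "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.cmdi",
     "encodingFormat": "cmdi"},
    {"@type": "DataDownload",
     "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.zip",
     "encodingFormat": "zip"}
  ]
}
"""


def download_urls(record: dict) -> dict:
    """Map each encodingFormat to its contentUrl (hypothetical helper)."""
    return {
        entry["encodingFormat"]: entry["contentUrl"]
        for entry in record.get("distribution", [])
        if entry.get("@type") == "DataDownload"
    }


record = json.loads(RECORD)
urls = download_urls(record)

# Print one "format: url" line per downloadable file.
for fmt, url in sorted(urls.items()):
    print(f"{fmt}: {url}")
```

Keying the result by `encodingFormat` assumes each format occurs at most once in `distribution`, which holds for this record but is not guaranteed by schema.org in general.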

Publication date: October 27, 2017

DOI: 10.25592/uhhfdm.1875

Keyword(s): linguistics, English, German, Dutch

Published in: Language Resources and Evaluation, vol. 53, pp. 303–329.

License (for files): Creative Commons Attribution Share Alike 4.0 International

Versions

Version 2.0, DOI 10.25592/uhhfdm.1875, Oct 27, 2017

Cite all versions? You can cite all versions by using the DOI 10.25592/uhhfdm.1874. This DOI represents all versions, and will always resolve to the latest one.

DOI Badge

Markdown

[![DOI](https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.1875.svg)](https://doi.org/10.25592/uhhfdm.1875)

reStructuredText

.. image:: https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.1875.svg
   :target: https://doi.org/10.25592/uhhfdm.1875

HTML

<a href="https://doi.org/10.25592/uhhfdm.1875"><img src="https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.1875.svg" alt="DOI"></a>

Image URL

https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.1875.svg

Target URL

https://doi.org/10.25592/uhhfdm.1875
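The badge snippets above all follow the same pattern: an image URL built from the DOI, wrapped in a link to the DOI resolver. The sketch below generates the three variants from a DOI string. The function name `badge_snippets` is hypothetical, and the badge URL pattern is inferred only from the snippets shown here, so it is an assumption that it holds for other records.

```python
from html import escape

# Badge URL prefix as it appears in the snippets above (assumed to generalize).
BADGE_BASE = "https://www.fdr.uni-hamburg.de/badge/DOI"


def badge_snippets(doi: str) -> dict:
    """Return Markdown, reStructuredText, and HTML badge markup for a DOI."""
    image = f"{BADGE_BASE}/{doi}.svg"
    target = f"https://doi.org/{doi}"
    return {
        "markdown": f"[![DOI]({image})]({target})",
        # The :target: option must be indented under the image directive.
        "rst": f".. image:: {image}\n   :target: {target}",
        # Escape attribute values in case a DOI contains HTML-special characters.
        "html": f'<a href="{escape(target)}"><img src="{escape(image)}" alt="DOI"></a>',
    }


snippets = badge_snippets("10.25592/uhhfdm.1875")
for name, markup in snippets.items():
    print(f"{name}:\n{markup}\n")
```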
