The Spoken Wikipedia Corpora

Baumann, Timo

doi:10.25592/uhhfdm.1875

October 27, 2017 Dataset Open Access

JSON Export

{"conceptdoi":"10.25592/uhhfdm.1874","conceptrecid":"1874","created":"2020-10-27T11:00:38.765070+00:00","doi":"10.25592/uhhfdm.1875","id":1875,"links":{"badge":"https://www.fdr.uni-hamburg.de/badge/doi/10.25592/uhhfdm.1875.svg","conceptbadge":"https://www.fdr.uni-hamburg.de/badge/doi/10.25592/uhhfdm.1874.svg","conceptdoi":"http://doi.org/10.25592/uhhfdm.1874","doi":"http://doi.org/10.25592/uhhfdm.1875"},"metadata":{"access_right":"open","access_right_category":"success","communities":[{"id":"hzsk"},{"id":"uhh"}],"contributors":[{"name":"Stegen, Florian","type":"DataCurator"},{"name":"Baumann, Timo","orcid":"0000-0003-2203-1783","type":"DataCurator"},{"name":"K\u00f6hn, Arne","orcid":"0000-0002-4880-2016","type":"DataCurator"}],"creators":[{"name":"Baumann, Timo","orcid":"0000-0003-2203-1783"}],"description":"<p>The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.</p>\n\n<p>Timo Baumann and Arne K&ouml;hn and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening, in Language Resources and Evaluation, Special Issue representing significant contributions of LREC 2016.</p>\n\n<p>Arne K&ouml;hn, Florian Stegen, Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).</p>\n\n<p>&nbsp;</p>\n\n<p><strong>CLARIN Metadata summary for The Spoken Wikipedia Corpora (CMDI-based)</strong></p>\n\n<p><strong>Title: </strong>The Spoken Wikipedia Corpora<br>\n<strong>Description: </strong> The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.<br>\n<strong>Publication date: </strong>2017<br>\n<strong>Data owner: </strong> Timo Baumann - Universit&auml;t Hamburg<br>\n<strong>Contributors: </strong> Timo Baumann (author), Arne K&ouml;hn (author), Florian Stegen (author)<br>\n<strong>Languages: </strong> <a href=\"https://www.ethnologue.com/language/eng\">English (eng)</a>, <a href=\"https://www.ethnologue.com/language/deu\">German (deu)</a>, <a href=\"https://www.ethnologue.com/language/nld\">Dutch (nld)</a><br>\n<strong>Size: </strong> 5397 article, 1005 hour<br>\n<strong>Segmentation units: </strong> other<br>\n<strong>Genre: </strong> encyclopedia<br>\n<strong>Modality: </strong> spoken<br>\n<strong>References: </strong> Timo Baumann; Arne K&ouml;hn; Felix Hennig (2018) The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening <strong>References: </strong> Arne K&ouml;hn; Florian Stegen; Timo Baumann (2016) Mining the Spoken Wikipedia for Speech Data and Beyond</p>\n\n<p>&nbsp;</p>","doi":"10.25592/uhhfdm.1875","journal":{"issue":"2","pages":"303\u2013329","title":"Language Resources and Evaluation","volume":"53"},"keywords":["linguistics","English","German","Dutch"],"license":{"id":"CC-BY-SA-4.0"},"publication_date":"2017-10-27","related_identifiers":[{"identifier":"10.25592/uhhfdm.1874","relation":"isVersionOf","scheme":"doi"}],"relations":{"version":[{"count":1,"index":0,"is_last":true,"last_child":{"pid_type":"recid","pid_value":"1875"},"parent":{"pid_type":"recid","pid_value":"1874"}}]},"resource_type":{"title":"Dataset","type":"dataset"},"title":"The Spoken Wikipedia Corpora","version":"2.0"},"owners":[798],"revision":4,"updated":"2025-06-05T12:13:23.062434+00:00"}
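
As an illustration of how the JSON export can be consumed, the following is a minimal Python sketch that pulls the citation-relevant fields out of a record. It works on a trimmed excerpt of the export rather than the full record; the field names are those that appear in the export, while the `summarize` helper and the shape of its return value are assumptions made for this example and are not part of the repository's API.

```python
import json

# Trimmed excerpt of the JSON export (only a few fields kept for illustration).
record_json = """
{"conceptdoi": "10.25592/uhhfdm.1874",
 "doi": "10.25592/uhhfdm.1875",
 "metadata": {
   "creators": [{"name": "Baumann, Timo", "orcid": "0000-0003-2203-1783"}],
   "keywords": ["linguistics", "English", "German", "Dutch"],
   "license": {"id": "CC-BY-SA-4.0"},
   "publication_date": "2017-10-27",
   "title": "The Spoken Wikipedia Corpora",
   "version": "2.0"}}
"""


def summarize(record: dict) -> dict:
    """Collect citation-relevant fields (hypothetical helper, not a repository API)."""
    meta = record["metadata"]
    return {
        "title": meta["title"],
        "version": meta["version"],
        "doi": record["doi"],                # DOI of this specific version
        "concept_doi": record["conceptdoi"], # DOI covering all versions
        "creators": [c["name"] for c in meta["creators"]],
        "license": meta["license"]["id"],
        "published": meta["publication_date"],
    }


summary = summarize(json.loads(record_json))
for key, value in summary.items():
    print(f"{key}: {value}")
```

The version DOI and the concept DOI are kept as separate keys because they serve different purposes: the first pins a citation to version 2.0, the second always resolves to the latest version.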

Publication date:

October 27, 2017

DOI:

10.25592/uhhfdm.1875

Keyword(s):

linguistics English German Dutch

Published in:

Language Resources and Evaluation, 53(2), pp. 303–329.

Communities:

hzsk, uhh

License (for files):

Creative Commons Attribution Share Alike 4.0 International

Versions

Version 2.0 10.25592/uhhfdm.1875

Oct 27, 2017

Cite all versions? You can cite all versions by using the DOI 10.25592/uhhfdm.1874. This DOI represents all versions, and will always resolve to the latest one.

Zentrum für Nachhaltiges Forschungsdatenmanagement

DOI Badge

Markdown

[![DOI](https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.1875.svg)](https://doi.org/10.25592/uhhfdm.1875)

reStructuredText

.. image:: https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.1875.svg
   :target: https://doi.org/10.25592/uhhfdm.1875

HTML

<a href="https://doi.org/10.25592/uhhfdm.1875"><img src="https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.1875.svg" alt="DOI"></a>

Image URL

https://www.fdr.uni-hamburg.de/badge/DOI/10.25592/uhhfdm.1875.svg

Target URL

https://doi.org/10.25592/uhhfdm.1875
