Dataset Open Access

The Spoken Wikipedia Corpora

Baumann, Timo


JSON Export

{"conceptdoi":"10.25592/uhhfdm.1874","conceptrecid":"1874","created":"2020-10-27T11:00:38.765070+00:00","doi":"10.25592/uhhfdm.1875","id":1875,"links":{"badge":"https://www.fdr.uni-hamburg.de/badge/doi/10.25592/uhhfdm.1875.svg","conceptbadge":"https://www.fdr.uni-hamburg.de/badge/doi/10.25592/uhhfdm.1874.svg","conceptdoi":"http://doi.org/10.25592/uhhfdm.1874","doi":"http://doi.org/10.25592/uhhfdm.1875"},"metadata":{"access_right":"open","access_right_category":"success","communities":[{"id":"hzsk"},{"id":"uhh"}],"contributors":[{"name":"Stegen, Florian","type":"DataCurator"},{"name":"Baumann, Timo","orcid":"0000-0003-2203-1783","type":"DataCurator"},{"name":"K\u00f6hn, Arne","orcid":"0000-0002-4880-2016","type":"DataCurator"}],"creators":[{"name":"Baumann, Timo","orcid":"0000-0003-2203-1783"}],"description":"<p>The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.</p>\n\n<p>Timo Baumann and Arne K&ouml;hn and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening, in Language Resources and Evaluation, Special Issue representing significant contributions of LREC 2016.</p>\n\n<p>Arne K&ouml;hn, Florian Stegen, Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).</p>\n\n<p>&nbsp;</p>\n\n<p><strong>CLARIN Metadata summary for The Spoken Wikipedia Corpora (CMDI-based)</strong></p>\n\n<p><strong>Title: </strong>The Spoken Wikipedia Corpora<br>\n<strong>Description: </strong> The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.<br>\n<strong>Publication date: </strong>2017<br>\n<strong>Data owner: </strong> Timo Baumann - Universit&auml;t Hamburg<br>\n<strong>Contributors: </strong> Timo Baumann (author), Arne K&ouml;hn (author), Florian Stegen (author)<br>\n<strong>Languages: </strong> <a href=\"https://www.ethnologue.com/language/eng\">English (eng)</a>, <a href=\"https://www.ethnologue.com/language/deu\">German (deu)</a>, <a href=\"https://www.ethnologue.com/language/nld\">Dutch (nld)</a><br>\n<strong>Size: </strong> 5397 article, 1005 hour<br>\n<strong>Segmentation units: </strong> other<br>\n<strong>Genre: </strong> encyclopedia<br>\n<strong>Modality: </strong> spoken<br>\n<strong>References: </strong> Timo Baumann; Arne K&ouml;hn; Felix Hennig (2018) The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening <strong>References: </strong> Arne K&ouml;hn; Florian Stegen; Timo Baumann (2016) Mining the Spoken Wikipedia for Speech Data and Beyond</p>\n\n<p>&nbsp;</p>","doi":"10.25592/uhhfdm.1875","journal":{"issue":"2","pages":"303\u2013329","title":"Language Resources and Evaluation","volume":"53"},"keywords":["linguistics","English","German","Dutch"],"license":{"id":"CC-BY-SA-4.0"},"publication_date":"2017-10-27","related_identifiers":[{"identifier":"10.25592/uhhfdm.1874","relation":"isVersionOf","scheme":"doi"}],"relations":{"version":[{"count":1,"index":0,"is_last":true,"last_child":{"pid_type":"recid","pid_value":"1875"},"parent":{"pid_type":"recid","pid_value":"1874"}}]},"resource_type":{"title":"Dataset","type":"dataset"},"title":"The Spoken Wikipedia Corpora","version":"2.0"},"owners":[798],"revision":4,"updated":"2025-06-05T12:13:23.062434+00:00"}
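
Reading the export programmatically

The record above is a single JSON object whose "description" field carries an HTML fragment with named entities (for example &ouml; and &ndash;). The following is a minimal sketch, not part of the record, of how a few fields could be pulled out with the Python standard library. RECORD_JSON is a shortened excerpt copied from the export (the description is cut down to one sentence), and the helper names plain_text and summarize are hypothetical, introduced only for this illustration.

```python
import html
import json
import re

# Shortened excerpt of the export above (assumption: only these fields are
# needed; the description is truncated to a single sentence for brevity).
RECORD_JSON = r'''
{"doi": "10.25592/uhhfdm.1875",
 "conceptdoi": "10.25592/uhhfdm.1874",
 "metadata": {
   "title": "The Spoken Wikipedia Corpora",
   "version": "2.0",
   "license": {"id": "CC-BY-SA-4.0"},
   "keywords": ["linguistics", "English", "German", "Dutch"],
   "creators": [{"name": "Baumann, Timo", "orcid": "0000-0003-2203-1783"}],
   "contributors": [
     {"name": "Stegen, Florian", "type": "DataCurator"},
     {"name": "Baumann, Timo", "orcid": "0000-0003-2203-1783", "type": "DataCurator"},
     {"name": "K\u00f6hn, Arne", "orcid": "0000-0002-4880-2016", "type": "DataCurator"}
   ],
   "description": "<p>Timo Baumann and Arne K&ouml;hn and Felix Hennig. 2018.</p>"
 }}
'''


def plain_text(fragment: str) -> str:
    """Drop HTML tags, decode named entities, and collapse whitespace."""
    without_tags = re.sub(r"<[^>]+>", " ", fragment)
    return " ".join(html.unescape(without_tags).split())


def summarize(record: dict) -> dict:
    """Collect the fields most often needed when citing the dataset."""
    meta = record["metadata"]
    return {
        "doi": record["doi"],
        "title": meta["title"],
        "version": meta["version"],
        "license": meta["license"]["id"],
        "creators": [c["name"] for c in meta["creators"]],
        "contributors": [c["name"] for c in meta["contributors"]],
        "description": plain_text(meta["description"]),
    }


record = json.loads(RECORD_JSON)  # \u00f6 escapes are decoded by the parser
summary = summarize(record)
for key, value in summary.items():
    print(f"{key}: {value}")
```

The JSON parser resolves the \uXXXX escapes, while html.unescape handles the entities inside the description; the two layers of encoding are independent, so both steps are needed to obtain readable text.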
