Dataset Open Access

The Spoken Wikipedia Corpora

Baumann, Timo


Citation Style Language JSON Export

{"DOI":"10.25592/uhhfdm.1875","abstract":"<p>The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.</p>\n\n<p>Timo Baumann and Arne K&ouml;hn and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening, in Language Resources and Evaluation, Special Issue representing significant contributions of LREC 2016.</p>\n\n<p>Arne K&ouml;hn, Florian Stegen, Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).</p>\n\n<p>&nbsp;</p>\n\n<p><strong>CLARIN Metadata summary for The Spoken Wikipedia Corpora (CMDI-based)</strong></p>\n\n<p><strong>Title: </strong>The Spoken Wikipedia Corpora<br>\n<strong>Description: </strong> The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.<br>\n<strong>Publication date: </strong>2017<br>\n<strong>Data owner: </strong> Timo Baumann - Universit&auml;t Hamburg<br>\n<strong>Contributors: </strong> Timo Baumann (author), Arne K&ouml;hn (author), Florian Stegen (author)<br>\n<strong>Languages: </strong> <a href=\"https://www.ethnologue.com/language/eng\">English (eng)</a>, <a href=\"https://www.ethnologue.com/language/deu\">German (deu)</a>, <a href=\"https://www.ethnologue.com/language/nld\">Dutch (nld)</a><br>\n<strong>Size: </strong> 5397 articles, 1005 hours<br>\n<strong>Segmentation units: </strong> other<br>\n<strong>Genre: </strong> encyclopedia<br>\n<strong>Modality: </strong> spoken<br>\n<strong>References: </strong> Timo Baumann; Arne K&ouml;hn; Felix Hennig (2018) The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening <strong>References: </strong> Arne K&ouml;hn; Florian Stegen; Timo Baumann (2016) Mining the Spoken Wikipedia for Speech Data and Beyond</p>\n\n<p>&nbsp;</p>","author":[{"family":"Baumann, Timo"}],"container_title":"Language Resources and Evaluation","id":"1875","issue":"2","issued":{"date-parts":[[2017,10,27]]},"page":"303\u2013329","title":"The Spoken Wikipedia Corpora","type":"dataset","version":"2.0","volume":"53"}
