Dataset Open Access

The Spoken Wikipedia Corpora

Baumann, Timo


JSON-LD (schema.org) Export

{"@context":"https://schema.org/","@id":"http://doi.org/10.25592/uhhfdm.1875","@type":"Dataset","contributor":[{"@type":"Person","name":"Stegen, Florian"},{"@id":"https://orcid.org/0000-0003-2203-1783","@type":"Person","name":"Baumann, Timo"},{"@id":"https://orcid.org/0000-0002-4880-2016","@type":"Person","name":"K\u00f6hn, Arne"}],"creator":[{"@id":"https://orcid.org/0000-0003-2203-1783","@type":"Person","name":"Baumann, Timo"}],"datePublished":"2017-10-27","description":"<p>The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.</p>\n\n<p>Timo Baumann and Arne K&ouml;hn and Felix Hennig. 2018. The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening, in Language Resources and Evaluation, Special Issue representing significant contributions of LREC 2016.</p>\n\n<p>Arne K&ouml;hn, Florian Stegen, Timo Baumann. 2016. Mining the Spoken Wikipedia for Speech Data and Beyond, in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016).</p>\n\n<p>&nbsp;</p>\n\n<p><strong>CLARIN Metadata summary for The Spoken Wikipedia Corpora (CMDI-based)</strong></p>\n\n<p><strong>Title: </strong>The Spoken Wikipedia Corpora<br>\n<strong>Description: </strong> The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are &ndash; for one reason or another &ndash; unable or unwilling to consume the written version of the article. Our resource, the Spoken Wikipedia Corpus, consolidates the Spoken Wikipediae, adding text segmentation, normalization, time-alignment and further annotations, making it accessible for research and fostering new ways of interacting with the material.<br>\n<strong>Publication date: </strong>2017<br>\n<strong>Data owner: </strong> Timo Baumann - Universit&auml;t Hamburg<br>\n<strong>Contributors: </strong> Timo Baumann (author), Arne K&ouml;hn (author), Florian Stegen (author)<br>\n<strong>Languages: </strong> <a href=\"https://www.ethnologue.com/language/eng\">English (eng)</a>, <a href=\"https://www.ethnologue.com/language/deu\">German (deu)</a>, <a href=\"https://www.ethnologue.com/language/nld\">Dutch (nld)</a><br>\n<strong>Size: </strong> 5397 article, 1005 hour<br>\n<strong>Segmentation units: </strong> other<br>\n<strong>Genre: </strong> encyclopedia<br>\n<strong>Modality: </strong> spoken<br>\n<strong>References: </strong> Timo Baumann; Arne K&ouml;hn; Felix Hennig (2018) The Spoken Wikipedia Corpus Collection: Harvesting, Alignment and an Application to Hyperlistening <strong>References: </strong> Arne K&ouml;hn; Florian Stegen; Timo Baumann (2016) Mining the Spoken Wikipedia for Speech Data and Beyond</p>\n\n<p>&nbsp;</p>","distribution":[{"@type":"DataDownload","contentUrl":"https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.xml","encodingFormat":"xml"},{"@type":"DataDownload","contentUrl":"https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.cmdi","encodingFormat":"cmdi"},{"@type":"DataDownload","contentUrl":"https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.zip","encodingFormat":"zip"}],"identifier":"http://doi.org/10.25592/uhhfdm.1875","keywords":["linguistics","English","German","Dutch"],"license":"https://creativecommons.org/licenses/by-sa/4.0/legalcode","name":"The Spoken Wikipedia Corpora","url":"https://www.fdr.uni-hamburg.de/record/1875","version":"2.0"}
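
The export above is a schema.org Dataset record whose "distribution" array lists the downloadable files. As a minimal sketch, assuming the record is available as a JSON string, the download entries could be read with Python's standard json module. The helper name list_downloads and the trimmed excerpt are illustrative assumptions, not part of the record or of any repository API; the field values are copied from the export.

```python
import json

# Trimmed excerpt of the JSON-LD export (assumption: only the fields used
# below are kept; values are copied from the record, not fetched live).
RECORD_JSON = """
{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "name": "The Spoken Wikipedia Corpora",
  "version": "2.0",
  "license": "https://creativecommons.org/licenses/by-sa/4.0/legalcode",
  "distribution": [
    {"@type": "DataDownload",
     "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.xml",
     "encodingFormat": "xml"},
    {"@type": "DataDownload",
     "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.cmdi",
     "encodingFormat": "cmdi"},
    {"@type": "DataDownload",
     "contentUrl": "https://www.fdr.uni-hamburg.de/api/files/2e9aa9ac-ea6b-4ef4-99d9-d2c7867110f7/swc-2.0.zip",
     "encodingFormat": "zip"}
  ]
}
"""


def list_downloads(record):
    """Return (encodingFormat, contentUrl) pairs for each DataDownload entry.

    Hypothetical helper for illustration; it only walks the parsed dict.
    """
    return [
        (entry.get("encodingFormat"), entry.get("contentUrl"))
        for entry in record.get("distribution", [])
        if entry.get("@type") == "DataDownload"
    ]


record = json.loads(RECORD_JSON)
downloads = list_downloads(record)

# Print one line per file: format, then URL.
for fmt, url in downloads:
    print(fmt, url)
```

The sketch only parses the metadata; it does not download anything. Reuse of the files themselves is governed by the license URL given in the record.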
