Voicer: A Crowd Sourcing Tool for Speech Data Collection

Abstract

Speech corpora do not exist for most low-resource languages, and creating one for such a language is challenging and demands significant time and effort. This paper gives an overview of related data collection strategies and highlights a few issues prevalent in existing approaches. Its objectives are threefold. First, it introduces an open-source tool called “Voicer”, accessible from both handheld devices and computers, that can be used to collect speech data for a specific domain in a short span of time, irrespective of the language. Second, it demonstrates the tool's capability by using it to build a Sinhala speech corpus consisting of 10 hours of speech data for 39 different sentences in the banking domain. Finally, it provides a framework for evaluating a speech corpus, along with the lessons learned during data collection, with a view to contributing to future research.

Publication
2018 18th International Conference on Advances in ICT for Emerging Regions (ICTer)
Sudeepa Nadeeshan
Research Assistant

My research interests include Intelligent Transport Systems and Machine Learning.
