Multilingual Spoken Words Corpus - 50 Languages and Over 23 Million Audio Keyword Examples

Today we are excited to announce the initial release of the Multilingual Spoken Words Corpus (MSWC), a large and growing audio dataset of spoken words in 50 languages. These languages are collectively spoken by over 5 billion people, and for most of them this is the first publicly available, free-to-use dataset for training voice interfaces. The dataset is licensed under CC-BY 4.0 to encourage academic research and commercial work in keyword spotting, spoken term search, and other applications that can benefit people across the planet. Our ultimate goal is to make voice-based keyword spotting interfaces available for any keyword in any language.

Voice-based interaction is already democratizing access to technology. For example, keyword spotting is a core feature of the voice assistants found on many smart devices (such as Apple’s Siri, Amazon’s Alexa, and Google Assistant). Keyword spotting systems use very low-power hardware to listen continuously for a key phrase that triggers an action, such as turning on lights or waking up a more sophisticated interface. For some people, this kind of interaction is a modern convenience, but for others, such as people who are visually impaired, it can be a life-changing capability.

Robust voice interaction requires training machine learning models on large datasets. Traditionally, building such keyword datasets has required enormous effort: many thousands of utterances must be collected and validated for each keyword of interest, from a diverse set of speakers and environmental contexts. Unfortunately, most existing public keyword datasets are monolingual and contain only a handful of keywords. Many commonly spoken languages have no public dataset at all, which makes it extremely difficult to provide basic voice capabilities to speakers of these languages.

To address these challenges, MLCommons® has developed, and will maintain and update, MSWC, a large speech recognition dataset of spoken words in 50 languages. In total, the dataset contains over 340,000 words and 23 million one-second audio samples, adding up to over 6,000 hours of speech. We constructed this dataset by applying open-source tools to extract individual words from crowdsourced sentences donated to the Common Voice project. The extracted words can then be used to train keyword spotting models for voice assistants across a diverse array of languages.
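As a quick sanity check on the figures above, the short sketch below converts the sample count into hours. It assumes every sample is exactly one second long and uses the quoted "over 23 million" as a lower bound; the constant and function names are illustrative only and are not part of any MSWC tooling.

```python
# Back-of-the-envelope check of the dataset size quoted above.
# Assumption: each audio sample is exactly one second long, and
# 23,000,000 is treated as a lower bound on the sample count.
SAMPLE_COUNT = 23_000_000
SECONDS_PER_HOUR = 3600


def total_hours(samples: int, seconds_per_sample: float = 1.0) -> float:
    """Return the total audio duration in hours."""
    return samples * seconds_per_sample / SECONDS_PER_HOUR


hours = total_hours(SAMPLE_COUNT)
print(f"{hours:,.0f} hours")  # prints "6,389 hours"
```

The result, roughly 6,389 hours, is consistent with the "over 6,000 hours" stated for the corpus.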

Of the languages in our dataset, 12 are “high-resource” with over 100 hours of data, 12 are “medium-resource” with 10 to 100 hours of data, and 26 are “low-resource” with under 10 hours of data. To our knowledge, the MSWC dataset is the only open-source spoken word dataset for 46 of these languages. Each keyword has predefined training, validation, and testing splits. We are also releasing the open-source tools used to build the dataset and categorize the keywords.
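The resource tiers described above can be expressed as a small helper. This is a minimal sketch: the function name and the treatment of the exact boundary values (10 and 100 hours both counted as medium-resource) are assumptions drawn from the wording "10 to 100 hours", not a definition taken from the MSWC tools.

```python
def resource_tier(hours: float) -> str:
    """Classify a language by hours of available audio.

    Assumed boundaries: more than 100 hours is high-resource,
    10 to 100 hours (inclusive) is medium-resource, and anything
    under 10 hours is low-resource.
    """
    if hours > 100:
        return "high-resource"
    if hours >= 10:
        return "medium-resource"
    return "low-resource"


# Tier counts as stated in the announcement; they sum to 50 languages.
TIER_COUNTS = {"high-resource": 12, "medium-resource": 12, "low-resource": 26}
total_languages = sum(TIER_COUNTS.values())
print(total_languages)  # prints 50
```

For example, a hypothetical language with 42 hours of extracted audio would fall into the medium-resource tier under these assumptions.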

Contributors to the MSWC include researchers from Coqui, Factored, Google, Harvard University, Intel, Landing AI, NVIDIA, and the University of Michigan. The dataset can be downloaded at mlcommons.org/words. For more information, please read our paper accepted to the 2021 Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.