MLCommons Multilingual Spoken Words Dataset
The MLCommons Multilingual Spoken Words Corpus (MSWC) is a large and growing audio dataset of spoken words in 50 languages for academic research and commercial applications in keyword spotting and spoken term search, licensed under CC-BY 4.0. The dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours).
The dataset has many use cases, ranging from voice-enabled consumer devices to call center automation. All alignments are included in the dataset. Please see our paper for a detailed analysis of the contents of the data and methods for detecting potential outliers, along with baseline accuracy metrics for keyword spotting models trained on our dataset compared to models trained on a manually recorded keyword dataset.
- Read our full paper here.
- Join the MSWC mailing list here.
- Connect with other MSWC users on the MLCommons Discord server.
- Get started by trying out our introductory tutorial notebook here on Google Colab.
- Watch our NeurIPS talk here.
Download Disclaimers
- By using the Cloudflare mirror, MLCommons requires that you agree not to attempt to determine the identity of the speakers in the dataset.
- By using the Alibaba mirror, MLCommons requires that you agree not to attempt to determine the identity of the speakers in the dataset.
- By using the Google mirror, Google requires that you agree not to attempt to determine the identity of the speakers in the dataset.
Full Dataset
- License: CC-BY 4.0
- Audio Format: Opus
- Size: 124 GB
- Description: All 50 languages
Microset
- License: CC-BY 4.0
- Audio Format: Opus
- Size: 584 MB
- Description: Small subset of 51 English and Spanish words for prototyping
Metadata
- License: CC-BY 4.0
- Size: 103 MB
- Description: The metadata file contains our dataset version info and per-language metadata, organized as JSON dictionaries keyed by each language's ISO code. The per-language metadata contains the following items: the full language name, the number of words the dataset contains in that language, a dictionary mapping each word to its number of clips, and another dictionary mapping each word to the Opus filenames of its clips.
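As a rough illustration of the metadata layout described above, the following Python sketch parses a tiny hand-written sample and reads back the per-language items. The key names used here (`version`, `language`, `number_of_words`, `wordcounts`, `filenames`), the sample words, and the clip filenames are all assumptions made for illustration; check them against the actual metadata file before relying on them.

```python
import json

# Hypothetical sample mirroring the described structure. All key names,
# words, and filenames below are assumptions, not copied from the real file.
SAMPLE_JSON = """
{
  "version": "0.0-example",
  "en": {
    "language": "English",
    "number_of_words": 2,
    "wordcounts": {"hello": 2, "world": 1},
    "filenames": {
      "hello": ["hello/clip_a.opus", "hello/clip_b.opus"],
      "world": ["world/clip_c.opus"]
    }
  }
}
"""


def language_summary(metadata, isocode):
    """Return the language name, word count, and total clip count."""
    entry = metadata[isocode]
    total_clips = sum(entry["wordcounts"].values())
    return {
        "language": entry["language"],
        "words": entry["number_of_words"],
        "clips": total_clips,
    }


def top_words(metadata, isocode, n=10):
    """Return the n words with the most clips as (word, count) pairs."""
    counts = metadata[isocode]["wordcounts"]
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n]


def clips_for_word(metadata, isocode, word):
    """Return the list of clip filenames for a word, or [] if absent."""
    return metadata[isocode]["filenames"].get(word, [])


metadata = json.loads(SAMPLE_JSON)
print(language_summary(metadata, "en"))
print(top_words(metadata, "en", n=1))
print(clips_for_word(metadata, "en", "world"))
```

To use this with the downloaded file instead of the sample, replace `json.loads(SAMPLE_JSON)` with `json.load` on the opened metadata file, adjusting the key names if they differ from the ones assumed here.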
Language
Primary languages in our dataset by country
This map depicts 28 primary languages that are included in our 50-language dataset, highlighted by country. Our dataset contains keywords in the following 50 languages: Arabic, Assamese, Basque, Breton, Catalan, Chinese, Chuvash, Czech, Dhivehi, Dutch, English, Esperanto, Estonian, French, Frisian, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Indonesian, Interlingua, Irish, Italian, Kinyarwanda, Kyrgyz, Latvian, Lithuanian, Maltese, Mongolian, Oriya, Persian, Polish, Portuguese, Romanian, Russian, Sakha, Slovak, Slovenian, Spanish, Sursilvan, Swedish, Tamil, Tatar, Turkish, Ukrainian, Vallader, Vietnamese, and Welsh.