MLCommons Datasets

Multilingual Spoken Words

The MLCommons Multilingual Spoken Words corpus is a large and growing audio dataset of spoken words in 50 languages, intended for academic research and commercial applications in keyword spotting and spoken term search. It is licensed under CC-BY 4.0.

About the dataset

The MLCommons Multilingual Spoken Words dataset contains more than 340,000 keywords, totaling 23.4 million 1-second spoken examples (over 6,000 hours).
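As a quick sanity check on these figures, the sketch below converts the clip count to hours. It assumes every clip is exactly 1 second long and uses the rounded totals quoted above; the constant names are illustrative and not part of the dataset.

```python
# Rounded totals quoted in the dataset description.
NUM_CLIPS = 23_400_000      # 23.4 million spoken examples
NUM_KEYWORDS = 340_000      # "more than 340,000", so this is a lower bound
CLIP_SECONDS = 1.0          # assumption: each clip is exactly 1 second

SECONDS_PER_HOUR = 3600

# Total audio duration in hours.
total_hours = NUM_CLIPS * CLIP_SECONDS / SECONDS_PER_HOUR

# Rough average number of clips per keyword. Because the keyword count
# is a lower bound, the true average is at most this value.
avg_clips_per_keyword = NUM_CLIPS / NUM_KEYWORDS

print(f"Total duration: {total_hours:.0f} hours")
print(f"Average clips per keyword (upper bound): {avg_clips_per_keyword:.1f}")
```

Under these assumptions the total comes to 6,500 hours, which is consistent with the "over 6,000 hours" figure.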

It has many use cases, ranging from voice-enabled consumer devices to call center automation. All alignments are included in the dataset. Please see our paper for a detailed analysis of the dataset's contents and methods for detecting potential outliers, along with baseline accuracy metrics for keyword spotting models trained on our dataset compared to models trained on a manually recorded keyword dataset.

Download Disclaimers
By using the Cloudflare mirror, MLCommons requires that you agree not to attempt to determine the identity of the speakers in the dataset. By using the Alibaba Mirror, MLCommons requires that you agree not to attempt to determine the identity of the speakers in the dataset. By using the Google mirror, Google requires that you agree not to attempt to determine the identity of the speakers in the dataset.

Full Dataset

  • License: CC-BY 4.0
  • Audio Format: Opus
  • Size: 124 GB
  • Description: All 50 languages

Microset

  • License: CC-BY 4.0
  • Audio Format: Opus
  • Size: 584 MB
  • Description: Small subset of 51 English and Spanish words for prototyping

Metadata

  • License: CC-BY 4.0
  • Size: 103 MB
  • Description: The metadata file contains our dataset version info and metadata organized as JSON dictionaries keyed by each language's ISO code. The per-language metadata contains the following items: the full language name, the number of words in the dataset for that language, a dictionary mapping each word to its number of clips, and another dictionary mapping each word to the Opus filenames of its clips.
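
The layout described for the metadata file can be explored with a few lines of Python. The sketch below is a minimal illustration and not an official loader. It uses a tiny in-memory sample instead of the real file, and every key name in it ("version", "language", "number_of_words", "wordcounts", "filenames") is an assumption chosen to mirror the description, so check the names against the downloaded file before relying on them. The sample words and filenames are likewise made up.

```python
import json

# Toy stand-in for the metadata file. All key names and values here are
# hypothetical and only mirror the structure described above.
SAMPLE_METADATA = json.loads("""
{
  "version": "sample",
  "en": {
    "language": "English",
    "number_of_words": 2,
    "wordcounts": {"hello": 2, "world": 1},
    "filenames": {
      "hello": ["hello/clip_a.opus", "hello/clip_b.opus"],
      "world": ["world/clip_c.opus"]
    }
  }
}
""")


def summarize_language(metadata, iso_code):
    """Return (language name, number of words, total clips) for one language."""
    entry = metadata[iso_code]
    total_clips = sum(entry["wordcounts"].values())
    return entry["language"], entry["number_of_words"], total_clips


def words_with_at_least(metadata, iso_code, min_clips):
    """Return, in sorted order, the words that have at least min_clips clips."""
    counts = metadata[iso_code]["wordcounts"]
    return sorted(word for word, n in counts.items() if n >= min_clips)


def counts_match_filenames(metadata, iso_code):
    """Check that each word's clip count equals its number of listed files."""
    entry = metadata[iso_code]
    return all(
        len(entry["filenames"].get(word, [])) == n
        for word, n in entry["wordcounts"].items()
    )


name, num_words, num_clips = summarize_language(SAMPLE_METADATA, "en")
print(f"{name}: {num_words} words, {num_clips} clips")
print("Words with 2+ clips:", words_with_at_least(SAMPLE_METADATA, "en", 2))
print("Counts consistent:", counts_match_filenames(SAMPLE_METADATA, "en"))
```

To try this against the real file, replace the sample with `json.load` on the downloaded metadata and adjust the key names to whatever the file actually uses. Filtering by minimum clip count is a common first step when choosing target keywords for a keyword spotting experiment.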

Language

Language | License | Size | Cloudflare mirror | Alibaba mirror | Google mirror
--- | --- | --- | --- | --- | ---
English | CC-BY 4.0 | 32.45 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
German | CC-BY 4.0 | 17.95 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
French | CC-BY 4.0 | 12.44 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Catalan | CC-BY 4.0 | 11.18 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Kinyarwanda | CC-BY 4.0 | 8.08 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Spanish | CC-BY 4.0 | 6.85 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Russian | CC-BY 4.0 | 2.84 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Italian | CC-BY 4.0 | 3.02 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Polish | CC-BY 4.0 | 2.47 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Basque | CC-BY 4.0 | 2.33 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Persian | CC-BY 4.0 | 6.33 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Dutch | CC-BY 4.0 | 1.13 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Esperanto | CC-BY 4.0 | 1.55 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Portuguese | CC-BY 4.0 | 1.09 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Welsh | CC-BY 4.0 | 1.70 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Tatar | CC-BY 4.0 | 0.53 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Czech | CC-BY 4.0 | 0.42 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Ukrainian | CC-BY 4.0 | 0.34 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Estonian | CC-BY 4.0 | 0.31 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Turkish | CC-BY 4.0 | 0.38 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Mongolian | CC-BY 4.0 | 0.18 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Kyrgyz | CC-BY 4.0 | 0.23 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Arabic | CC-BY 4.0 | 0.16 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Frisian | CC-BY 4.0 | 0.15 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Swedish | CC-BY 4.0 | 0.18 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Maltese | CC-BY 4.0 | 0.12 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Indonesian | CC-BY 4.0 | 0.20 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Greek | CC-BY 4.0 | 0.12 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Breton | CC-BY 4.0 | 0.08 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Sursilvan | CC-BY 4.0 | 0.08 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Romanian | CC-BY 4.0 | 0.09 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Slovenian | CC-BY 4.0 | 0.06 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Sakha | CC-BY 4.0 | 0.05 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Latvian | CC-BY 4.0 | 0.07 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Interlingua | CC-BY 4.0 | 0.07 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Slovak | CC-BY 4.0 | 0.03 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Chuvash | CC-BY 4.0 | 0.04 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Irish | CC-BY 4.0 | 0.05 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Chinese | CC-BY 4.0 | 0.06 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Georgian | CC-BY 4.0 | 0.02 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Hakha Chin | CC-BY 4.0 | 0.03 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Hausa | CC-BY 4.0 | 0.02 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Vallader | CC-BY 4.0 | 0.02 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Tamil | CC-BY 4.0 | 0.01 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Vietnamese | CC-BY 4.0 | 0.00 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Assamese | CC-BY 4.0 | 0.00 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Guarani | CC-BY 4.0 | 0.00 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Oriya | CC-BY 4.0 | 0.00 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Dhivehi | CC-BY 4.0 | 0.00 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments
Lithuanian | CC-BY 4.0 | 0.14 GB | Audio, Splits, Alignments | Audio, Splits, Alignments | Audio, Splits, Alignments

Primary languages in our dataset by country

This map depicts the 28 primary languages included in our 50-language dataset, highlighted by country. Our dataset contains keywords in the following 50 languages: Arabic, Assamese, Basque, Breton, Catalan, Chinese, Chuvash, Czech, Dhivehi, Dutch, English, Esperanto, Estonian, French, Frisian, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Indonesian, Interlingua, Irish, Italian, Kinyarwanda, Kyrgyz, Latvian, Lithuanian, Maltese, Mongolian, Oriya, Persian, Polish, Portuguese, Romanian, Russian, Sakha, Slovak, Slovenian, Spanish, Sursilvan, Swedish, Tamil, Tatar, Turkish, Ukrainian, Vallader, Vietnamese, and Welsh.