Building language models, one story at a time

August 20, 2022

One-third of the world’s languages are spoken in Africa, but less than 1% of African languages are represented online. This is significant because the language you speak, write or sign shapes your online experience. Language is the cornerstone of your identity, the connection to your past and the key to your future. When we can’t experience the internet in our language, it limits what we can learn, what jobs we can have, what stories we can access, and so much more.

In my home country, Mali, eighty percent of the population speaks Bambara as a first or second language. It is also spoken in Burkina Faso, Ivory Coast, Liberia and Guinea, making it one of West Africa’s most widely spoken languages. But if Bambara is your primary language, it can be difficult to have an immersive internet experience. That’s why I’ve set out to make the internet more accessible to Bambara speakers, remove the language barrier, and bring this primarily spoken language online for everyone.

To achieve this goal, we need to build a language model for Bambara. Language models require lots of data, which for speech typically means many hours of transcribed recordings of people speaking the language, so that computers can learn it using techniques from the field of Natural Language Processing. Unfortunately, Bambara lacks readily available data to train on. Researchers call such languages “low-resourced.” My team at Robots Mali has been working on this challenge for years as part of a collaborative project called Bayɛlɛmabaga. Through collaboration with the Google Research team in Accra, we’re closer to accomplishing our goal of building more resources (written and bilingual texts) for Bambara.

To overcome the challenge of being “low-resourced,” we teamed up with those who hold the culture’s knowledge, rich history and teachings. Malian griots are the real keepers of the Bambara collective memory, passing on their knowledge only through oral storytelling. So we gathered more than thirty griots and recorded them narrating stories handed down through generations. We transcribed and translated each tale to preserve the knowledge for future generations. While griots are traditionally older men, for this project we worked to assemble a representative group that is diverse in age, gender and background.

Using these recordings, we’ve been able to build a model that transcribes Bambara speech into text, known as an Automatic Speech Recognition (ASR) model, which in turn makes translation into other languages easier. As a result, we are making the world’s information more accessible to millions of Bambara speakers and releasing our findings for the research community and everyone else to benefit from. Our work has allowed us to uplift traditional practices while building a new future for Bambara speakers. We’re in contact with the National Museum of Mali to donate all of the beautiful stories that the griots have narrated, so that their rich history and teachings will be available to the local community and the public. Furthermore, the project has been selected to be showcased next week at the Deep Learning Indaba 2022, the largest machine learning conference in Africa.

Most importantly, we identified oral literature as a viable source of data for low-resourced languages. Many languages are underrepresented online, and this project represents a big step towards bringing more of them online. Of course, there’s still a lot of work to do. But by introducing this work to the community, we are giving researchers new tools to keep breaking down the online language barrier.
