Nordic language BERT models: “Languages with fewer speakers are underrepresented on the internet”

JAXenter: Your company has released open source versions of Danish and Norwegian BERT models. What is the difference between your models and the multilanguage BERT version released by Google that includes Danish and Norwegian?

Jens Dahl Møllerhøj: The multilingual BERT model released by Google is trained on more than a hundred different languages. It performs poorly for Nordic languages such as Danish or Norwegian because they are underrepresented in the training data. For example, Danish text makes up only about 1% of the total training data. To illustrate, the multilingual BERT model has a vocabulary of 120,000 words*, which leaves room for only about 1,200 Danish words. This is where BotXO’s model comes in: it has a vocabulary of 32,000 Danish words.

(* Actually, “words” is a bit imprecise. In practice, the model divides rare words into pieces; for example, the word “inconsequential” becomes “in-”, “-con-” and “-sequential”. Since these word pieces are shared across languages, there is room for more than 1,200 Danish “words” in Google’s multilingual model.)
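
To see the effect in practice, one can compare how the multilingual tokenizer splits a Danish word into word pieces. The following is a minimal sketch, assuming the Hugging Face transformers library is installed; “bert-base-multilingual-cased” is Google’s public multilingual checkpoint, while the Danish model path is a placeholder rather than an official identifier.

# Minimal sketch, assuming the Hugging Face `transformers` library.
from transformers import AutoTokenizer

# Google's public multilingual BERT checkpoint.
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# A Danish word is likely to be split into several word pieces, because
# few whole Danish words fit into the shared 120,000-word vocabulary.
print(multilingual.tokenize("uhensigtsmæssigt"))  # prints a list of word pieces

# A tokenizer trained only on Danish text keeps far more whole Danish words
# in its 32,000-word vocabulary. The path below is a placeholder, not an
# identifier confirmed by the interview.
# danish = AutoTokenizer.from_pretrained("path/to/botxo-danish-bert")
# print(danish.tokenize("uhensigtsmæssigt"))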

SEE ALSO: Combatting AI bias: remembering the human at the heart of the data is key

JAXenter: What makes a language particularly well suited for NLP tasks—and what makes the process especially complicated?

Jens Dahl Møllerhøj: The performance of general-purpose language models such as BERT is dependent on the amount and quality of training data available. Since languages with fewer speakers are underrepresented on the internet, it can be challenging to gather enough data to train big language models.

JAXenter: What were the greatest challenges in creating the Danish and Norwegian BERT models?

Jens Dahl Møllerhøj: Getting the training process to run fast enough required running it on custom Google hardware called TPUs. As these chips are experimental, they are not particularly well documented, and getting the algorithm running on them took quite a bit of experimentation. Moreover, renting Google’s TPUs is expensive, so it is important to make the algorithms run as fast as possible to keep the cost down.

JAXenter: Do you have plans to develop BERT models for other languages?

Jens Dahl Møllerhøj: Yes! In fact, we have already released a Swedish BERT model. The release of the Danish and Norwegian models was a success, so we continued working on other Nordic languages. We trained the Swedish BERT model on an astonishing 25 GB of raw text data, ten times more than was used for the previously largest Swedish BERT model.

Working on the Swedish BERT model has been a bit different because the language uses a different character set than Danish and Norwegian: besides the basic Latin alphabet, it includes the vowels Å, Ä and Ö. In addition, Swedish is spoken by 10 million people, almost as many as the Danish and Norwegian populations combined.

Now, we are planning to work on a Finnish BERT model, and we want to run a more detailed analysis of the data for different languages.

JAXenter: In what areas and by whom do you expect your open source models to be used?

Jens Dahl Møllerhøj: We expect that the Danish model will be useful for any Danish private company, educational institution, NGO or public organization in need of Danish-language AI. We hope that others can develop it further and use it to improve their products and services as well as to build new solutions.

We have already observed that other researchers are using the models for general text classification, sentiment analysis and entity extraction. At the same time, many of BotXO’s customers are running experiments and using the models in different projects.
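
For readers who want to try such use cases, the sketch below shows how a released BERT checkpoint could be loaded for text classification with the Hugging Face transformers library. It is only an illustration under stated assumptions: the model path is a placeholder for wherever the Danish model is published, and the classification head is untrained until it is fine-tuned on labelled data.

# Minimal sketch, assuming `transformers` and `torch` are installed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path; replace with the actual published checkpoint.
model_name = "path/to/danish-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Encode a Danish sentence and run it through the classification head.
inputs = tokenizer("Det var en rigtig god oplevelse.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Class probabilities; only meaningful after fine-tuning on labelled examples.
print(logits.softmax(dim=-1))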

We hope that the Norwegian and Swedish models will help data scientists in Norway and Sweden build state-of-the-art natural language processing solutions.

SEE ALSO: How to track progress and collaborate in data science and machine learning projects

Last but not least, the artificial intelligence field is developing very quickly, and open-source resources are very important. Sharing them allows the exchange of ideas and lets us help each other with our projects.

The models, instructions for data scientists and engineers, and discussions about the use cases can be found here: github.com/botxo.

JAXenter: Thank you for the interview!
