Nordic language BERT models: “Languages with fewer speakers are underrepresented on the internet”

JAXenter: Your company has released open source versions of Danish and Norwegian BERT models. What is the difference between your models and the multilanguage BERT version released by Google that includes Danish and Norwegian?

Jens Dahl Møllerhøj: The multilingual BERT model released by Google is trained on more than a hundred different languages. It performs poorly for Nordic languages such as Danish or Norwegian because they are underrepresented in the training data. For example, Danish text makes up only about 1% of the total training data. To illustrate, the multilingual BERT model has a vocabulary of 120,000 words*, which leaves room for only about 1,200 Danish words. This is where BotXO’s model comes in: it has a vocabulary of 32,000 Danish words.

(* Actually, “words” is a bit imprecise. In practice, the model divides rare words into pieces; for example, the word “inconsequential” becomes “in-”, “-con-” and “-sequential”. Since these word pieces are shared across languages, there is room for more than 1,200 Danish “words” in Google’s multilingual model.)
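
To see the effect in practice, one can compare how the multilingual tokenizer splits a Danish word into word pieces. The following is a minimal sketch, assuming the Hugging Face transformers library is installed; “bert-base-multilingual-cased” is Google’s public multilingual checkpoint, while the Danish model path is a placeholder rather than an official identifier.

# Minimal sketch, assuming the Hugging Face `transformers` library.
from transformers import AutoTokenizer

# Google's public multilingual BERT checkpoint.
multilingual = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

# A Danish word is likely to be split into several word pieces, because
# few whole Danish words fit into the shared 120,000-word vocabulary.
print(multilingual.tokenize("uhensigtsmæssigt"))  # prints a list of word pieces

# A tokenizer trained only on Danish text keeps far more whole Danish words
# in its 32,000-word vocabulary. The path below is a placeholder, not an
# identifier confirmed by the interview.
# danish = AutoTokenizer.from_pretrained("path/to/botxo-danish-bert")
# print(danish.tokenize("uhensigtsmæssigt"))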

SEE ALSO: Combatting AI bias: remembering the human at the heart of the data is key

JAXenter: What makes a language particularly well suited for NLP tasks—and what makes the process especially complicated?

Jens Dahl Møllerhøj: The performance of general-purpose language models such as BERT is dependent on the amount and quality of training data available. Since languages with fewer speakers are underrepresented on the internet, it can be challenging to gather enough data to train big language models.

JAXenter: What were the greatest challenges in creating the Danish and Norwegian BERT models?

Jens Dahl Møllerhøj: Getting the training process to run fast enough required running it on custom Google hardware called TPUs. As these chips are experimental, they are not particularly well documented, and getting the algorithm running on them took quite a bit of experimentation. Moreover, renting Google’s TPUs is expensive, so it is important to make the algorithms run as fast as possible to keep the cost down.

JAXenter: Do you have plans to develop BERT models for other languages?

Jens Dahl Møllerhøj: Yes! In fact, we have already released a Swedish BERT model. The release of the Danish and Norwegian models was a success, so we continued working on other Nordic languages. We trained the Swedish BERT model on an astonishing 25 GB of raw text data, ten times more than was used for the previously largest Swedish BERT model.

Working on the Swedish BERT model has been a bit different because the language uses a different character set than Danish and Norwegian: besides the basic Latin alphabet, it includes the vowels Å, Ä and Ö. In addition, Swedish is spoken by 10 million people, almost as many as the Danish and Norwegian populations combined.

Now, we are planning to work on a Finnish BERT model, and we want to run a more detailed analysis of the data for different languages.

JAXenter: In what areas and by whom do you expect your open source models to be used?

Jens Dahl Møllerhøj: We expect that the Danish model will be useful for any Danish private company, educational institution, NGO or public organization in need of Danish-language AI. We hope that others can develop it further and use it to improve their products and services as well as to build new solutions.

We have already observed that other researchers are using the models for general text classification, sentiment analysis and entity extraction. At the same time, many of BotXO’s customers are running experiments and using the models in different projects.
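
For readers who want to try such use cases, the sketch below shows how a released BERT checkpoint could be loaded for text classification with the Hugging Face transformers library. It is only an illustration under stated assumptions: the model path is a placeholder for wherever the Danish model is published, and the classification head is untrained until it is fine-tuned on labelled data.

# Minimal sketch, assuming `transformers` and `torch` are installed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder path; replace with the actual published checkpoint.
model_name = "path/to/danish-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Encode a Danish sentence and run it through the classification head.
inputs = tokenizer("Det var en rigtig god oplevelse.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Class probabilities; only meaningful after fine-tuning on labelled examples.
print(logits.softmax(dim=-1))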

We hope that the Norwegian and Swedish models will help data scientists in Norway and Sweden build state-of-the-art natural language processing solutions.

SEE ALSO: How to track progress and collaborate in data science and machine learning projects

Last but not least, the artificial intelligence field is developing very quickly, and open-source resources are very important. Sharing them allows the exchange of ideas and lets us help each other with our projects.

The models, instructions for data scientists and engineers, and discussions about the use cases can be found here: github.com/botxo.

JAXenter: Thank you for the interview!
