OpenLID v2
It’s been just over a year since the original OpenLID paper was published, and in that time, I’ve been thinking about a number of improvements that could be made to the model. The big selling point of the OpenLID dataset was its higher reliability compared to other similar datasets: we chose not to include sources where the labels weren’t assigned by humans, and every language and every source had had a sample of the training data audited to make sure the labels were accurate (to the best of our ability). This meant that the dataset was smaller and covered fewer languages compared to similar resources, but the resulting model showed that training on data with more trustworthy labels resulted in better performance and higher domain robustness.
Thinking about reliability then, in this blog post I present the OpenLID v2 dataset and model. The biggest change is to the labelling: rather than following the categories in FLORES-200, OpenLID v2 primarily uses macrolanguage labels where available, since these are a more honest representation of what the model is able to distinguish. The preprocessing scripts have also been improved, leading to better sentence segmentation for more languages.
I discuss the changes in more detail below, and for even more detail, you can read my thesis!
Change 1: altered labels to favour macrolanguages
A recurring theme in language identification is that deciding which language labels to use (and how to convert between them) is a nightmare. A single language can be labelled in a wide variety of ways depending on numerous factors (not least the politics of the labeller). Relevant for this discussion, though, is the level of granularity required. Some language varieties (e.g. Arabic, Malay, Chinese) fall within a macrolanguage: a set of mutually intelligible language varieties which usually form a continuum from one variety to another. This makes single-label language identification very difficult, since an individual utterance may be acceptable in multiple individual languages and the line between one language and another is usually not fixed.
We based the original language varieties covered by OpenLID on the FLORES-200 evaluation dataset so that we would have a way to measure the performance of our model. For some reason, the designers of FLORES-200 favoured labels corresponding to individual language varieties over those corresponding to macrolanguages, e.g. als_Latn (Tosk Albanian) rather than sqi_Latn (the Albanian macrolanguage). This level of specificity can be helpful, but in practice, we found that our model struggled to distinguish between very similar languages, particularly Arabic dialects. Nearly all of the lowest scores on the FLORES dev-test set were for Arabic language varieties: the highest F1 score among them was only 0.49 (for Moroccan Arabic), far below the macroaverage F1 score of 0.93 for the OpenLID language identification model overall.
I ended up dedicating a whole chapter of my thesis to trying to solve this problem for Arabic dialects in particular, but in the end, I decided that the most honest thing to do for the labelling in OpenLID v2 was to label at the macrolanguage level by default. I want to keep OpenLID as a single-label classifier for simplicity, and I hope that those who need finer granularity can filter down to their language of choice! That said, I did consider each macrolanguage on a case-by-case basis (as far as my knowledge would allow).
In most cases, only one individual member of the macrolanguage was present among the languages covered (e.g. Central Aymara, Central Kanuri), so I simply relabelled these with the corresponding macrolanguage. Otherwise, I combined the individual varieties under a single macrolanguage label, on the grounds that the classifier was more likely to be reliable at the macrolanguage level. There were two exceptions to this rule: Malay and Norwegian. In both cases, the classifier seemed able to distinguish the individual languages well, so I kept the individual labels.
You can find the script I used to relabel languages on the HuggingFace repo. I do not speak most of the languages covered by OpenLID, so please get in touch if you can advise on how best to label your language!
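To make the idea concrete, here is a minimal sketch of what the relabelling step looks like in principle. The mapping below covers only a few illustrative codes (the full mapping is in the script linked above), and the helper functions are hypothetical rather than taken from that script.

```python
# Illustrative macrolanguage relabelling (not the full OpenLID v2 mapping).

# Individual language code -> macrolanguage code (a few examples only)
MACRO_MAP = {
    "ayr_Latn": "aym_Latn",  # Central Aymara -> Aymara
    "knc_Latn": "kau_Latn",  # Central Kanuri -> Kanuri (Latin script)
    "knc_Arab": "kau_Arab",  # Central Kanuri -> Kanuri (Arabic script)
    "als_Latn": "sqi_Latn",  # Tosk Albanian -> Albanian
}

def relabel(code: str) -> str:
    """Map an individual language code to its macrolanguage code where one applies."""
    return MACRO_MAP.get(code, code)

def relabel_line(line: str) -> str:
    """Relabel a fastText-style training line of the form '__label__xxx_Xxxx text'."""
    label, _, text = line.partition(" ")
    code = label.removeprefix("__label__")
    return f"__label__{relabel(code)} {text}"
```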
Change 2: preprocessing improvements
After publishing the original OpenLID dataset, one of the people using it got in touch to point out that some languages had overly long sentences. I found that the very simple sentence segmentation algorithm I had used didn't cover enough languages, so I decided to write a slower but more powerful preprocessing pipeline in Python.
I've put the script I used to clean the corpus on the HuggingFace repo. It uses a language-specific sentence splitter wherever possible, as well as normalising punctuation, replacing non-printing characters, normalising Unicode characters, and removing emoji. For best results, I strongly recommend preprocessing data in a similar way prior to classification.
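For a rough idea of what this involves, here is a simplified sketch of the cleaning steps using only the Python standard library. It is not the actual pipeline from the repo: the punctuation normalisation and emoji regex are deliberately minimal stand-ins, and the language-specific sentence splitting is omitted.

```python
import re
import unicodedata

# Very rough emoji coverage: pictographs, dingbats, and regional indicators.
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"
    "\U00002700-\U000027BF"
    "\U0001F1E6-\U0001F1FF"
    "]"
)

def clean_line(text: str) -> str:
    # Normalise Unicode (compatibility composition)
    text = unicodedata.normalize("NFKC", text)
    # Drop non-printing/control characters
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C")
    # Minimal punctuation normalisation: curly quotes -> straight quotes
    text = text.translate(str.maketrans({"“": '"', "”": '"', "‘": "'", "’": "'"}))
    # Remove emoji
    text = EMOJI_RE.sub("", text)
    # Collapse whitespace
    return re.sub(r"\s+", " ", text).strip()
```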
Final dataset and model
The OpenLID v2 dataset covers 188 languages, with a mean of 619,799 lines of data per language class. A fastText model trained on this data with the same hyperparameters as the original OpenLID model achieves a macroaverage F1 score of 0.977, improving over the previous F1 of 0.933.
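If you want to train something similar yourself, the sketch below shows roughly what supervised fastText training looks like on data in this format (one sentence per line, prefixed with a __label__xxx_Xxxx tag). The file path and hyperparameter values here are placeholders, not the exact OpenLID settings.

```python
import fasttext

# Train a supervised fastText classifier on lines like
# "__label__eng_Latn some cleaned, sentence-split text ...".
# Hyperparameter values are placeholders only.
model = fasttext.train_supervised(
    input="openlid_v2_train.txt",  # hypothetical path to the prepared training file
    loss="softmax",
    dim=256,
    minn=2,   # character n-gram lengths
    maxn=5,
    epoch=2,
    lr=0.8,
)

model.save_model("openlid_v2.bin")

# Predict the (macro)language of a cleaned input sentence
labels, probs = model.predict("Dies ist ein Beispielsatz.", k=1)
print(labels[0], probs[0])  # e.g. __label__deu_Latn with its probability
```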
I hope to add more languages to OpenLID in the future with the help of the expanded FLORES+ dataset. Please get in touch with any comments or suggestions!