Welcome!
I’m a research assistant in the Institute for Language, Cognition and Computation at the University of Edinburgh. My current research centres around data-driven approaches to natural language processing (NLP) for under-served languages.
🏗️ Current work
- Version 2.0 of the HPLT datasets is out! The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. You can access the data on the HPLT website and find more information in the paper. It was a massive team effort, as demonstrated by the 34 co-authors!
- As part of the HPLT release, we released an update to OpenLID-v2. We improved performance for some Iberian and Kurdish languages and altered the labelling scheme to align with FLORES+ (see the changelog). I also tidied up the codebase to make it easier to update in the future!
- Following the very successful first shared task of the Open Language Data Initiative, you can find updated versions of FLORES+ and OLDI Seed hosted on Hugging Face. Huge thanks and appreciation to all contributors and my collaborators!
🤝 I’m part of…
HPLT: High Performance Language Technologies
We are building large monolingual and multilingual datasets in 70+ languages and using them to train powerful and efficient language and translation models. I’m working on dataset quality, ensuring the data we provide is as trustworthy and useful as possible.
OLDI: Open Language Data Initiative
We aim to improve NLP technologies by championing foundational multilingual datasets. With a focus on under-served languages, we encourage language communities to expand on and improve these datasets, which are then openly available to the research community.
ML Commons
I’m contributing to a project on better language identification for the web through the Data-centric ML Working Group.
CDT in NLP: Centre for Doctoral Training in Natural Language Processing
I did my PhD as part of the first cohort of the CDT in NLP and I love being part of such a great community of researchers! If you’d like to do your PhD at the University of Edinburgh, I recommend looking at the new CDT in Responsible and Trustworthy in-the-world NLP.
🧑‍💻 Selected papers
Findings of the WMT 2024 Shared Task of the Open Language Data Initiative
We ran a shared task through the Open Language Data Initiative to improve and expand the FLORES+ and MT Seed multilingual datasets. We accepted ten submissions covering 16 languages, extending the range of languages included and improving the quality of the datasets (published at WMT 2024).
Code-switched language identification is harder than you think
We investigate language identification at scale for code-switched text. We find that no current approach is adequate and give recommendations for future work (published at EACL 2024).
OpenLID: An open dataset and model for language identification
A model for fast natural language identification for 200+ languages, plus open access to all the data used for training (published at ACL 2023). Find the latest version of OpenLID here!
🎓 See Semantic Scholar for a full list of my publications!