Welcome!
I’m a Senior Research Engineer with the Common Crawl Foundation. I work on developing open data products, supporting the crawl, and championing open data at scale, particularly for under-served languages.
🏗️ Current work
- There is an open call for papers for the First Workshop on Multilingual Data Quality Signals (WMDQS), hosted at COLM 2025. We invite long and short research papers related to data quality in multilingual data. There is also a shared task on language identification for web data; please see the website for more details!
- The Open Language Data Initiative is running a second shared task hosted at WMT 2025. We are asking for extensions and improvements to FLORES+ and Seed, as well as new high-quality, massively-parallel and open-source datasets. Full instructions on how to contribute are available on OLDI’s website.
- Version 2.0 of the HPLT datasets is out! The monolingual portion of the data contains 8T tokens covering 193 languages, while the parallel data contains 380M sentence pairs covering 51 languages. You can access the data on the HPLT website and find more information in the paper.
🤝 I’m part of…
Common Crawl Foundation
Common Crawl maintains a free, open repository of web crawl data that can be used by anyone. It aims to make wholesale extraction, transformation and analysis of open web data accessible to researchers.
OLDI: Open Language Data Initiative
We aim to improve NLP technologies by championing foundational multilingual datasets. With a focus on under-served languages, we encourage language communities to expand on and improve these datasets, which are then openly available to the research community.
ML Commons
I’m contributing to a project on better language identification for the web through the Data-centric ML Working Group. We’re part of the team running the First Workshop on Multilingual Data Quality Signals (WMDQS)!
HPLT: High Performance Language Technologies
HPLT builds large monolingual and multilingual datasets in 70+ languages and uses them to train powerful and efficient language and translation models. I was a postdoc with this project and worked on dataset quality.
CDT in NLP: Centre for Doctoral Training in Natural Language Processing
I did my PhD as part of the first cohort of the CDT in NLP and I love being part of such a great community of researchers! If you’d like to do your PhD at the University of Edinburgh, I recommend looking at the new CDT in Responsible and Trustworthy in-the-world NLP.
🧑‍💻 Selected papers
Findings of the WMT 2024 Shared Task of the Open Language Data Initiative
We ran a shared task through the Open Language Data Initiative to improve and expand the FLORES+ and MT Seed multilingual datasets. We accepted ten submissions covering 16 languages, extending the range of languages included and improving the quality of the datasets (published at WMT 2024).
Code-switched language identification is harder than you think
We investigate language identification at scale for code-switched text. We find that no current approach is adequate and give recommendations for future work (published at EACL 2024).
OpenLID: An open dataset and model for language identification
A model for fast natural language identification for 200+ languages, plus open access to all the data used for training (published at ACL 2023). Find the latest version of OpenLID here!
🎓 See Semantic Scholar for a full list of my publications!