About
I’m a linguist technologist who researches data-driven multilinguality. My background is cross-disciplinary: before my PhD in NLP at the University of Edinburgh, I studied Middle Eastern Studies and Economics and worked as an analyst developer, data scientist and applied researcher (among other things). At Common Crawl, I develop open data products and help support the crawl, especially for underserved languages. I like to work in the open and collaborate widely.
Current projects
- Open Language Data Initiative (2024–present): I am one of the lead organisers of OLDI, which develops and maintains training and evaluation datasets for underserved languages, such as FLORES+ and OLDI-Seed. The initiative works with language communities to build resources that support machine translation and NLP research.
- CommonLID (2024–present): A community-driven, human-annotated benchmark for language identification, covering 109 language varieties. In the accompanying paper, we use it alongside five other common evaluation sets to show that existing evaluations overestimate LID accuracy for many languages in the web domain.
Datasets, models, software
- HPLT (dataset, 2026). Large-scale web-derived processed text data covering the languages of Europe and beyond.
- CommonLID (dataset, 2026). A community-created language identification benchmark covering 109 language varieties.
- OLDI (dataset, 2026). OLDI’s collection of foundational datasets, including FLORES+ and OLDI-Seed.
- OpenLID (model, 2023). An open model for fast language identification covering 200 language varieties.
- OpenLID (dataset, 2023). Audited text with reliable language labels covering 200 language varieties.
- Multi-Sentence Questions (dataset, 2020). 162K English multi-sentence questions extracted from Stack Exchange.
Publications
Very Large-Scale Multilingual Resources for LLMs and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models. Stephan Oepen, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Maja Buljan, Laurie Burchell, et al. Proceedings of the Fifteenth Language Resources and Evaluation Conference. 2026.
CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data. Pedro Ortiz Suarez, Laurie Burchell, Catherine Arnett, Rafael Mosquera-Gómez, Sara Hincapie-Monsalve, Thom Vaughan, et al. arXiv preprint arXiv:2601.18026. 2026.
HPLT's Second Data Release. Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Laurie Burchell, Pinzhen Chen, Mariia Fedorova, et al. Proceedings of Machine Translation Summit XX: Volume 2. 2025.
An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT). Laurie Burchell, Ona de Gibert, Nikolay Arefyev, Mikko Aulamo, Marta Bañón, Pinzhen Chen, et al. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025.
Findings of the WMT 2025 Shared Task of the Open Language Data Initiative. David Dale, Laurie Burchell, Jean Maillard, Idris Abdulmumin, Antonios Anastasopoulos, Isaac Caswell, et al. Proceedings of the Tenth Conference on Machine Translation. 2025.
Findings of the WMT 2024 Shared Task of the Open Language Data Initiative. Laurie Burchell, Jean Maillard, Antonios Anastasopoulos, Christian Federmann, Philipp Koehn, and Skyler Wang. Proceedings of the Ninth Conference on Machine Translation. 2024.
Code-Switched Language Identification is Harder Than You Think. Laurie Burchell, Alexandra Birch, Robert Thompson, and Kenneth Heafield. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024.
An Open Dataset and Model for Language Identification. Laurie Burchell, Alexandra Birch, Nikolay Bogoychev, and Kenneth Heafield. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023.
Exploring diversity in back translation for low-resource machine translation. Laurie Burchell, Alexandra Birch, and Kenneth Heafield. Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing. 2022.
The University of Edinburgh's Submission to the WMT22 Code-Mixing Shared Task (MixMT). Faheem Kirefu, Vivek Iyer, Pinzhen Chen, and Laurie Burchell. Proceedings of the Seventh Conference on Machine Translation (WMT). 2022.
The University of Edinburgh's English-German and English-Hausa Submissions to the WMT21 News Translation Task. Pinzhen Chen, Jindřich Helcl, Ulrich Germann, Laurie Burchell, Nikolay Bogoychev, Antonio Valerio Miceli Barone, et al. Proceedings of the Sixth Conference on Machine Translation. 2021.
Querent Intent in Multi-Sentence Questions. Laurie Burchell, Jie Chi, Tom Hosking, Nina Markl, and Bonnie Webber. Proceedings of the 14th Linguistic Annotation Workshop. 2020.
Talks
- Expanding Linguistic and Cultural Coverage in Common Crawl. Cohere Labs Community Talks, Online (April 2026).
- Common Crawl: open web data for everybody. AI Lab, Howest University of Applied Sciences, Belgium (April 2026).
- Community Spotlight: CommonLID. Eleuther AI, Online (February 2026).
- Multilinguality at Common Crawl: Improving Language Coverage for the Largest Open Web Corpus. 2026 Winter School on Multilinguality in LLM Development and Evaluation, Skeikampen, Norway (February 2026).
- Common Crawl: Open web data for everybody. Turing Seminar, University of Bristol (November 2025).
- Open web data in the age of LLMs. Open Data Camp 2025, Edinburgh (September 2025).
Teaching and supervising
- Masters supervision, University of Helsinki (2025–present): Co-supervising a dissertation on code-switched language identification, with Tommi Jauhiainen and Jörg Tiedemann.
- Tutorial, LREC (2026): Co-organised and presented a half-day tutorial on Low-Resource, High-Impact: Building Corpora for Inclusive Language Technologies, with Katya Artemova, Daryna Dementieva, Shu Okabe, Mariya Shmatova, and Pedro Ortiz Suarez.
- Masters supervision, University of Edinburgh (2023): Co-supervised a dissertation on improving Modern Tibetan to English machine translation through a novel dataset, with Lexi Birch.
- Summer Workshop Lead, UK Civil Service (2020): Led a three-month interdisciplinary research workshop on deep-learning approaches to code-switching in text; delivered an introductory NLP course to other participants and built a competitive token-level language identification system.
Other projects
- ML Commons Data-Centric ML Research Working Group (2024–2025): Member of the working group, supporting development of benchmarking datasets and advising on issues in language identification.
- High Performance Language Technologies (2024–2026): An EU-funded project providing petabytes of open monolingual and parallel data for the languages of the EU and beyond, plus models and analysis. I worked on the corpus-building pipelines, particularly language identification, and was the first author of our ACL 2025 paper.
Professional activities
Event organisation
- WMDQS (2025): program chair for the first Workshop on Multilingual Data Quality Signals, co-located with COLM in Montréal. The workshop included a shared task on language identification for web text.
- OLDI at WMT (2024, 2025, 2026): co-organiser of all three editions of the Open Language Data Initiative shared task, focused on training and evaluation data for underserved languages.
Reviewing
Conferences: ACL Rolling Review (2023–present), First Workshop on Multilingual Multicultural Evaluation (2026), EMNLP (2023), ACL (2023), Workshop on Machine Translation (2022), NAACL Student Research Workshop (2022), Dravidian LangTech (2022), ACL Student Research Workshop (2022)
Journals: IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
Grants: Horizon Europe UTTER, first Financial Support to Third Parties call (2023)
Outreach
- Web Archives for Social Sciences Datathon, University of Bristol (2025): facilitator
- Women in AI Open Day (2020, 2022): ambassador
Employment
Common Crawl Foundation (2025–present)
Principal Research Engineer (2026–present)
Senior Research Engineer (2025–2026)
Building new datasets and tools that make Common Crawl’s web data more useful, with a particular focus on improving coverage of languages other than English. The role spans applied research, collaborations with external partners, and contributing to the day-to-day running of the crawl itself.
- Co-led CommonLID: a community-built benchmark for language identification on real web data covering 109 language varieties and 350,000+ annotated lines, released on Hugging Face and accepted to ACL 2026.
- Built a new language identification system in Rust (open-source release pending), presented at the IIPC Web Archiving Conference
- Improved Common Crawl’s documentation to make the crawl easier to use for new researchers and practitioners
University of Edinburgh (2024–2025)
Postdoctoral Research Associate
Worked on the High Performance Language Technologies (HPLT) project, helping to build the infrastructure and methods behind open, large-scale multilingual datasets.
- First author on the HPLT v2 dataset paper (ACL 2025), describing an open multilingual corpus covering 193 languages and 8 trillion tokens
- Owned the language identification stage of the HPLT data pipeline; contributed to other components, including sharding
- Mentored junior researchers, supporting their development in both research and engineering
UK Civil Service (2016–2019)
Analyst Developer (2016–2017)
Applied Researcher (2017–2019)
Joined through the graduate Analyst Development Programme, then moved into an applied research role on an interdisciplinary team working with mathematicians and analysts to extract value from large, heterogeneous data sources.
- Wrote Python analytics for non-technical analysts and contributed to an internal framework for time-series analysis and downstream ML
- Led a collaborative machine-learning research project with academic partners
- Completed the Data Science Development Programme (year-long; taught courses plus an applied project)
Education
University of Edinburgh (2019–2024): PhD in Natural Language Processing with Integrated Study. Thesis: Improving natural language processing for under-served languages through increased training data diversity, supervised by Kenneth Heafield and Alexandra Birch.
University of Warwick (2013–2015): MSc and Diploma in Economics (Distinction). Dissertation: Is there statistical discrimination against workers with disabilities in the UK?, supervised by Roland Rathelot.
University of Cambridge (2009–2013): MA (Cantab.) in Middle Eastern Studies (2.i). Dissertation: How Islamic is Islamic banking?, supervised by Timothy Winter.
Skills
Daily: Python, Rust, Bash, NumPy/Pandas, Hugging Face, AWS
Used substantially: PyTorch, Slurm and HPC clusters, Nutch, Marian, Sockeye 3, Go, WARC extraction and large-scale data pipelines, SQL
Reading knowledge: Java, C/C++, Matlab
Natural languages: English (native), Spanish (advanced), Modern Standard Arabic (intermediate), Egyptian Arabic (intermediate), French (intermediate), German (basic), Persian (basic), Scottish Gaelic (basic).