Welcome!
I’m a research assistant in the Institute for Language, Cognition and Computation at the University of Edinburgh. My current research centres on data-driven approaches to natural language processing (NLP) for under-served languages.
🏗️ Current work…
- As part of building the data pipeline for HPLT, I’m investigating methods to measure the quality of multilingual web-scraped corpora. I’ve also been learning Go!
- I’m building a new 🤗 page for the Open Language Data Initiative. When it’s done, it will host the foundational FLORES and MT Seed datasets so that they’re easier to access and can be updated quickly as they grow and improve. It will also host an improved version of the OpenLID model and corpus.
- I have officially finished my PhD! I’ll be graduating in November, and I’m looking forward to swooshing about in the gown :)
🧑‍💻 Past projects…
Code-switched language identification is harder than you think
We investigate language identification at scale for code-switched text. We find that no current approach is adequate and give recommendations for future work (published at EACL 2024).
OpenLID: An open dataset and model for language identification
A model for fast natural language identification for 200+ languages, plus open access to all the data used for training (published at ACL 2023).
Exploring diversity in back translation for low-resource machine translation
We define and measure diversity in training data for low-resource machine translation, investigating the effect of different kinds of diversity on final performance (published at DeepLo 2022).
Querent intent in multi-sentence questions
Multi-sentence questions (MSQs) are sequences of questions that need to be answered as a unit. We identify five types of MSQs and create a new labelled dataset (published at LAW 2020).
🎓 See Semantic Scholar for a full list of my publications!
🤝 I’m part of…
HPLT: High Performance Language Technologies
We are building large monolingual and multilingual datasets in 70+ languages and using them to train powerful and efficient language and translation models. I’m working on dataset quality, ensuring the data we provide is as trustworthy and useful as possible.
OLDI: Open Language Data Initiative
We aim to improve NLP technologies by championing foundational multilingual datasets. With a focus on under-served languages, we encourage language communities to expand on and improve these datasets, which are then openly available to the research community.
MLCommons
I’m contributing to a project on better language identification for the web through the Data-centric ML Working Group.
CDT in NLP: Centre for Doctoral Training in Natural Language Processing
I did my PhD as part of the first cohort of the CDT in NLP, and I love being part of such a great community of researchers! If you’d like to do your PhD at the University of Edinburgh, I recommend looking at the new CDT in Responsible and Trustworthy in-the-world NLP.