Welcome!

I’m a research assistant in the Institute for Language, Cognition and Computation at the University of Edinburgh. My current research centres on data-driven approaches to natural language processing (NLP) for under-served languages.

🏗️ Current work…

  • Following the very successful first shared task of the Open Language Data Initiative, you can find updated versions of FLORES+ and [OLDI Seed](https://huggingface.co/datasets/openlanguagedata/oldi_seed) hosted on Hugging Face. Huge thanks and appreciation to all contributors and my collaborators!
  • I’ve made an updated version of the OpenLID dataset and model. The new version aims to be more reliable and to improve language preprocessing. Please see this blog post to find out more about the main changes.
  • I’m now working on bitexting for HPLT: finding the parallel data within our monolingual corpora. I’m also looking at implementing Data Portraits to allow fast inspection of what’s in our datasets.
  • I have officially finished my PhD! I’ll be graduating in November - looking forward to swooshing about in the gown :)

🧑‍💻 Past projects…

Code-switched language identification is harder than you think

We investigate language identification at scale for code-switched text. We find that no current approach is adequate and give recommendations for future work (published at EACL 2024).

OpenLID: An open dataset and model for language identification

A model for fast natural language identification for 200+ languages, plus open access to all the data used for training (published at ACL 2023).

Exploring diversity in back translation for low-resource machine translation

We define and measure diversity in training data for low-resource machine translation, investigating the effect of different kinds of diversity on final performance (published at DeepLo 2022).

Querent intent in multi-sentence questions

Multi-sentence questions (MSQs) are sequences of questions which need to be answered as a unit. We identify five types of MSQs and create a new labelled dataset (published at LAW 2020).

🎓 See Semantic Scholar for a full list of my publications!

🤝 I’m part of…

HPLT: High Performance Language Technologies

We are building large monolingual and multilingual datasets in 70+ languages and using them to train powerful and efficient language and translation models. I’m working on dataset quality, ensuring the data we provide is as trustworthy and useful as possible.

OLDI: Open Language Data Initiative

We aim to improve NLP technologies by championing foundational multilingual datasets. With a focus on under-served languages, we encourage language communities to expand on and improve these datasets, which are then made openly available to the research community.

ML Commons

I’m contributing to a project on better language identification for the web through the Data-centric ML Working Group.

CDT in NLP: Centre for Doctoral Training in Natural Language Processing

I did my PhD as part of the first cohort of the CDT in NLP and I love being part of such a great community of researchers! If you’d like to do your PhD at the University of Edinburgh, I recommend looking at the new CDT in Responsible and Trustworthy in-the-world NLP.