I am a Principal Research Engineer at the Common Crawl Foundation. I work on open multilingual data at scale, especially for underserved and under-represented languages.
News
- 16 May 2026 — Spoke at the Low Resource, High Impact tutorial at LREC 2026 in Palma, Mallorca, with colleagues from Toloka and TUM.
- 23 April 2026 — Gave a talk about Common Crawl and our research into underserved languages at Howest University of Applied Sciences in Kortrijk, Belgium.
- 21 April 2026 — Presented recent work on “Improved language identification for web crawl data” at the IIPC Web Archiving Conference in Brussels.
- 6 April 2026 — Our paper CommonLID: Re-evaluating State-of-the-Art Language Identification Performance on Web Data was accepted to the main conference at ACL 2026.
- 25 February 2026 — Gave a Community Spotlight Talk for EleutherAI with co-authors Pedro Ortiz Suarez and Catherine Arnett, covering our recent work on CommonLID.