The working title of my doctoral thesis is “Data Augmentation and Filtering for Low-resource Neural Machine Translation.”

Like any deep learning system, the models used for neural machine translation (NMT) generally require a large amount of data to attain high performance. However, the vast majority of the world’s languages have only a limited amount of text available for training translation systems, and the data that is available is often noisy, domain-specific, or both. Such languages are referred to as “low-resource languages”, in contrast with higher-resource languages such as Mandarin Chinese, French, and German. Creating effective low-resource NMT systems is an active area of research in the field of natural language processing.

My research looks at ways to make the most of the data we do have for low-resource languages. I focus on two approaches: generating artificial training data (augmentation), and selecting the highest-quality examples from the available training data (filtering). In this way, I aim to improve the downstream performance of low-resource NMT systems and make high-quality translation available for more of the world’s languages.

Current project

In my current project, I aim to study the relationship between quantitative measures of diversity in the training corpus and final NMT system performance. Previous work has shown that more ‘diverse’ training data produces stronger NMT systems. However, there is no rigorous definition of what is meant by ‘diversity’ in this context, and little research on what kinds of diversity are most important for NMT.

To investigate this phenomenon, I create several pseudo-parallel corpora using back-translation, a common data augmentation technique. By varying the generation method, I alter the amount of diversity in each corpus, and I measure this diversity with a range of metrics. Finally, I measure the relationship between diversity and final NMT system performance. By exploring the interaction between different aspects of diversity and final system performance, I aim to encourage a more evidence-based approach to the future development of data augmentation techniques.
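
As a rough illustration of what a quantitative diversity measure can look like, the sketch below computes a distinct-n score: the ratio of unique n-grams to total n-grams in a corpus. This is only one possible lexical diversity metric, chosen here for simplicity; it is not necessarily one of the metrics used in the project, and the function name `distinct_n`, the whitespace tokenisation, and the toy corpora are all assumptions made for the example.

```python
from collections import Counter


def distinct_n(sentences, n=2):
    """Return the ratio of unique n-grams to total n-grams in a corpus.

    Assumption: sentences are tokenised by splitting on whitespace.
    A real pipeline would more likely use a proper tokeniser.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Collect every contiguous n-gram in the sentence.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    # An empty corpus (or one with no n-grams) gets a score of 0.0.
    return len(counts) / total if total else 0.0


# Hypothetical toy corpora: one repeats a sentence, the other does not.
repetitive = ["the cat sat on the mat", "the cat sat on the mat"]
varied = ["the cat sat on the mat", "a dog slept under the old table"]

print(distinct_n(repetitive, n=2))  # 5 unique bigrams out of 10
print(distinct_n(varied, n=2))      # 11 unique bigrams out of 11
```

A higher score indicates less repetition at the n-gram level. A score like this captures only surface lexical variety, so it would normally be read alongside other measures, such as syntactic or semantic ones.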