French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus


Abstract

This paper investigates the impact of different types and sizes of training corpora on language models. Asking the fundamental question of quality versus quantity, we compare four French corpora by pre-training four different ELMos and evaluating them on the downstream tasks of dependency parsing, POS tagging, and Named Entity Recognition. We present and assess the relevance of a new balanced French corpus, CaBeRnet, that features a representative range of language usage, including a balanced variety of genres (oral transcriptions, newspapers, popular magazines, technical reports, fiction, academic texts), in both oral and written styles. We hypothesize that a linguistically representative corpus will allow the language models to be more efficient, and therefore yield better evaluation scores across different evaluation sets and tasks.
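For readers unfamiliar with how pre-trained ELMo models are consumed downstream, here is a minimal sketch using the AllenNLP `Elmo` module to extract contextual embeddings from tokenized French text. The French options and weights file names are hypothetical placeholders, not artifacts released with the paper.

```python
# Minimal sketch: contextual embeddings from a pre-trained ELMo via AllenNLP.
from allennlp.modules.elmo import Elmo, batch_to_ids

OPTIONS_FILE = "elmo_fr_options.json"  # hypothetical path to the model config
WEIGHT_FILE = "elmo_fr_weights.hdf5"   # hypothetical path to trained weights

# One output representation (a learned mix of the biLM layers), no dropout.
elmo = Elmo(OPTIONS_FILE, WEIGHT_FILE, num_output_representations=1, dropout=0.0)

# Pre-tokenized French sentences; batch_to_ids maps tokens to character ids.
sentences = [["Le", "chat", "dort", "."], ["Bonjour", "le", "monde", "!"]]
character_ids = batch_to_ids(sentences)

# Forward pass returns a dict; 'elmo_representations' is a list of tensors,
# one per requested output representation.
output = elmo(character_ids)
embeddings = output["elmo_representations"][0]  # (batch, max_tokens, dim)
print(embeddings.shape)
```

These token-level embeddings would then feed a task-specific model (a parser, tagger, or NER system), which is how the corpora compared in the paper are evaluated against one another.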

Type: Publication

Publication: In 8th Workshop on the Challenges in the Management of Large Corpora
Pedro Ortiz Suarez
Senior Research Scientist

I’m a Senior Research Scientist at the Common Crawl Foundation.