This page pertains to UD version 2.

Introduction

In version 2.1, there are 6 different corpora for French:

French

Github repository

The French data come from the universal Google dataset (version 2.0): a mix of random sentences sampled from Google News, Blogger, Wikipedia and Google local reviews. The conversion to UD POS tags and UD dependencies was performed automatically, using heuristic rules and fixed word lists produced by native speakers of the language.
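As a rough illustration of such a conversion pass, here is a minimal Python sketch of a heuristic, list-based POS mapping; the source tagset, the mapping and the word list are invented for the example and are not the rules actually used:

```python
# Hypothetical sketch of a heuristic, list-based POS conversion of the
# kind described above; the tagset, mapping and word list are invented.

# Fixed word list of the sort produced by native speakers (sample only).
FRENCH_DETERMINERS = {"le", "la", "les", "un", "une", "des"}

# Coarse mapping from a hypothetical source tagset to UD POS tags.
TAG_MAP = {"N": "NOUN", "V": "VERB", "A": "ADJ", "D": "DET", "P": "ADP"}

def convert_pos(word: str, source_tag: str) -> str:
    """Map a source tag to a UD POS tag; word lists override the map."""
    if word.lower() in FRENCH_DETERMINERS:
        return "DET"
    return TAG_MAP.get(source_tag, "X")   # X marks unanalyzable words in UD

print(convert_pos("les", "P"))   # -> DET (the word list wins)
print(convert_pos("chat", "N"))  # -> NOUN
```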

More information about the original Google dataset can be found in the following paper:

Ryan McDonald, Joakim Nivre, Yvonne Quirmbach-Brundage, Yoav Goldberg, Dipanjan Das, Kuzman Ganchev, Keith Hall, Slav Petrov, Hao Zhang, Oscar Täckström, Claudia Bedini, Núria Bertomeu Castelló and Jungmee Lee. (2013) Universal Dependency Annotation for Multilingual Parsing. Proc. of ACL 2013.

Since version 1.0, many improvements and corrections have been made to the automatically converted data.

Subtypes of relations used in version 2.1

acl:relcl
aux:caus
aux:pass
csubj:pass
flat:foreign
flat:name
nmod:poss
nsubj:pass
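This inventory can be checked against a release with a short scan of the CoNLL-U file, since subtyped relations are exactly the DEPREL values containing a colon. The sketch below assumes a standard 10-column CoNLL-U file; the file name is a placeholder:

```python
from collections import Counter

subtypes = Counter()
with open("fr-ud-train.conllu", encoding="utf-8") as f:  # placeholder name
    for line in f:
        if line.startswith("#") or not line.strip():
            continue                  # sentence metadata / blank separators
        cols = line.rstrip("\n").split("\t")
        if "-" in cols[0] or "." in cols[0]:
            continue                  # multiword-token and empty-node lines
        if ":" in cols[7]:            # DEPREL is the 8th CoNLL-U column
            subtypes[cols[7]] += 1

for rel, count in subtypes.most_common():
    print(rel, count)
```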

French-FTB

Github repository

French-PUD

Github repository

French-ParTUT

Github repository

UD_French-ParTUT data is derived from the pre-existing parallel treebank Par(allel)TUT.

ParTUT is a morpho-syntactically annotated collection of Italian/French/English parallel sentences, which includes texts from different sources, representing different genres and domains, and is released in several formats.

ParTUT comprises approximately 167,000 tokens, with an average of 2,100 sentences per language. The texts currently available in the collection were gathered from a wide variety of sources and domains:

ParTUT data can be downloaded here and here (CoNLL format only).

Corpus splitting

The corpus was randomly split using a script. In order to meet the CoNLL 2017 Shared Task requirements, and considering the limited amount of data available for the French section, we split the treebank so as to obtain at least 10K words for both the test and development sets, leaving the remainder for training. The corpus is thus partitioned as follows:

Moreover, in order to preserve the 1:1 correspondence among the three language sections, all of them were partitioned in the same way; the same sentences, in the same order, are therefore found in the training, development and test sets of the English and Italian treebanks as well.
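A minimal sketch of this splitting strategy: one random permutation of sentence indices fills the test and development sets until each reaches the 10K-word threshold, and the same permutation is reused for every language so the parallel alignment survives. The helper name, seed and threshold below are illustrative; the actual ParTUT script may differ.

```python
import random

def split_indices(sent_lengths, min_words=10_000, seed=42):
    """Return (train, dev, test) sentence-index lists.

    sent_lengths[i] is the word count of sentence i. Test and dev are
    filled first, until each holds at least min_words words; everything
    else goes to training.
    """
    order = list(range(len(sent_lengths)))
    random.Random(seed).shuffle(order)  # one shared random permutation
    train, dev, test = [], [], []
    dev_words = test_words = 0
    for i in order:
        if test_words < min_words:
            test.append(i)
            test_words += sent_lengths[i]
        elif dev_words < min_words:
            dev.append(i)
            dev_words += sent_lengths[i]
        else:
            train.append(i)
    return train, dev, test

# Reusing the same index lists for the French, English and Italian
# sections keeps the 1:1 sentence correspondence across the treebanks.
```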

Basic statistics

References

Acknowledgments

We are deeply grateful to Project Syndicate© for allowing us to download and use their articles as text material under the terms of educational use.

French-Sequoia

Github repository

The UD_French-Sequoia corpus is an automatic conversion of the surface representation of the French Sequoia corpus. The conversion was done with the Grew software.

Basic statistics and splitting

The whole corpus contains 70,624 tokens in 3,099 sentences.
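Counts of this kind can be reproduced with a few lines of Python over the released CoNLL-U file; the file name below is a placeholder, not the actual release path.

```python
def count_conllu(path):
    """Count sentences and syntactic words in a CoNLL-U file."""
    sentences = tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if cols[0].isdigit():      # word line; skips comments, blanks,
                tokens += 1            # 1-2 ranges and 1.1 empty nodes
                if cols[0] == "1":     # ID 1 opens a new sentence
                    sentences += 1
    return sentences, tokens

# Placeholder file name; point this at the actual Sequoia release file.
print(count_conllu("fr_sequoia-ud-train.conllu"))
```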

In UD_French-Sequoia, data were randomly split into:

References

Marie Candito, Guy Perrier, Bruno Guillaume, Corentin Ribeyre, Karën Fort, Djamé Seddah and Éric de la Clergerie. (2014) Deep Syntax Annotation of the Sequoia French Treebank. Proc. of LREC 2014, Reykjavik, Iceland.

Guy Perrier, Marie Candito, Bruno Guillaume, Corentin Ribeyre, Karën Fort and Djamé Seddah. (2014) Un schéma d’annotation en dépendances syntaxiques profondes pour le français. Proc. of TALN 2014, Marseille, France.

French-Spoken

Github repository