UD for French 
In version 2.17, French is covered by nine treebanks, but two of them do not contain modern French:
- UD_French-ALTS contains 16th-century data, which is quite different from modern French.
- UD_French-PoitevinDIVITAL contains data in Poitevin-Saintongeais (Glottocode: poit1241).
The description below applies to the seven modern French corpora.
Tokenization and Word Segmentation
- In general, words are delimited by whitespace characters, and punctuation marks are treated as separate words.
- Only numbers can contain spaces (following the regexp [0-9 ,]+).
- There are several closed classes of contractions that are treated as multi-word tokens and segmented into individual syntactic words. For instance, au -> à + le, auquel -> à + lequel. Note that du and des are ambiguous and may or may not be split, depending on their usage.
For more details, see tokenization.
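The segmentation rules above can be sketched in a few lines of Python. This is only an illustration: the `CONTRACTIONS` table is a small, hypothetical subset of the closed classes (du and des are left out because they are ambiguous), and the helper names `is_number_token` and `to_conllu_rows` are not part of any UD tool.

```python
import re

# The number pattern quoted in the text above: digits, spaces and commas.
NUMBER_RE = re.compile(r"[0-9 ,]+")

# Illustrative subset of unambiguous contractions (an assumption, not the
# full closed class). du and des are omitted because they are ambiguous.
CONTRACTIONS = {
    "au": ["à", "le"],
    "aux": ["à", "les"],
    "duquel": ["de", "lequel"],
}


def is_number_token(token):
    """True if the whole token matches the number pattern."""
    return NUMBER_RE.fullmatch(token) is not None


def to_conllu_rows(tokens):
    """Turn surface tokens into (ID, FORM) rows, CoNLL-U style.

    A contraction yields one range row (e.g. "3-4") followed by one row
    per syntactic word. The case of the surface form is not propagated
    to the syntactic words in this sketch.
    """
    rows = []
    next_id = 1
    for token in tokens:
        parts = CONTRACTIONS.get(token.lower())
        if parts is None:
            rows.append((str(next_id), token))
            next_id += 1
            continue
        last_id = next_id + len(parts) - 1
        rows.append((f"{next_id}-{last_id}", token))
        for part in parts:
            rows.append((str(next_id), part))
            next_id += 1
    return rows


for row_id, form in to_conllu_rows(["Il", "va", "au", "marché", "."]):
    print(row_id, form, sep="\t")
```

The range row keeps the surface form, while the following rows carry the syntactic words that receive POS tags and dependency relations.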
Morphology
Tags
This is an overview only. For more detailed discussion and examples, see the list of French POS tags and French features.
French uses all 17 universal POS categories:
- PART is used only in UD_French-ParTUT for the negation particle ne, which is annotated ADV in other treebanks (Grew-match).
- All French corpora have only five auxiliary verbs (AUX) (Grew-match):
  - être (to be) is used as copula, as tense auxiliary and as passive auxiliary
  - avoir (to have) is used as tense auxiliary
  - faire (to make) and refaire (to make again) are used in causative constructions
  - voir (to see) is used in the specific construction se voir
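As a rough illustration of the closed auxiliary list, the following sketch scans CoNLL-U text and reports AUX tokens whose lemma is not one of the five listed above. The function name `unexpected_aux` and the sample sentence are hypothetical; the column layout is the standard ten-column CoNLL-U format.

```python
# The five auxiliary lemmas listed above.
AUX_LEMMAS = {"être", "avoir", "faire", "refaire", "voir"}


def unexpected_aux(conllu_text):
    """Return (line_number, form, lemma) for AUX tokens outside the list."""
    problems = []
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a token line
        form, lemma, upos = cols[1], cols[2], cols[3]
        if upos == "AUX" and lemma not in AUX_LEMMAS:
            problems.append((lineno, form, lemma))
    return problems


# Hypothetical sample sentence, written by hand for this sketch.
sample = (
    "# text = Il a dormi\n"
    "1\tIl\til\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\ta\tavoir\tAUX\t_\t_\t3\taux:tense\t_\t_\n"
    "3\tdormi\tdormir\tVERB\t_\t_\t0\troot\t_\t_\n"
)
print(unexpected_aux(sample))
```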
Nominal Features
- Nominal words (NOUN, PROPN and PRON) have an inherent Gender feature with one of two values: Masc or Fem.
- The two values of the Number feature are Sing and Plur. The following parts of speech inflect for number: NOUN, PROPN, PRON, ADJ, DET, VERB, AUX (finite and participles).
- Case has 4 possible values: Nom, Dat, Acc, Com. It occurs only with personal pronouns (PRON). The “case” (i.e., role w.r.t. predicates or other phrases) of other nominals is expressed using prepositions, not morphologically.
- Definite has 2 values: Ind, Def. It is used to distinguish the indefinite and definite articles (DET).
Polarity
- Polarity is used only with the Neg value to mark the negative adverbs ne, pas, plus, jamais.
Verbal Features
- Infinitives only have the VerbForm=Inf feature.
- Finite verbs always have one of the four values of Mood: Ind, Imp, Sub and Cnd.
- Finite verbs can have one of four values of Tense: Past, Imp, Pres, Fut.
- Past participles have VerbForm=Part, Tense=Past, Gender (Masc or Fem), and Number (Sing or Plur).
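The constraints in this list can be expressed as a small consistency check over the FEATS column. The sketch below is an assumption-laden illustration, not a UD validation rule: the helper names are hypothetical, it assumes finite verbs carry VerbForm=Fin (the usual UD value, not stated above), and it ignores the layered variants used by the SUD-based treebanks.

```python
MOODS = {"Ind", "Imp", "Sub", "Cnd"}
TENSES = {"Past", "Imp", "Pres", "Fut"}


def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'Mood=Ind|Tense=Pres'."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def check_verbal_feats(feats):
    """Return a list of human-readable issues (empty if none are found)."""
    f = parse_feats(feats)
    issues = []
    form = f.get("VerbForm")
    if form == "Inf":
        if set(f) != {"VerbForm"}:
            issues.append("infinitive with extra features")
    elif form == "Fin":  # assumption: finite verbs are marked VerbForm=Fin
        if f.get("Mood") not in MOODS:
            issues.append("finite verb without a valid Mood")
        if "Tense" in f and f["Tense"] not in TENSES:
            issues.append("unexpected Tense value")
    elif form == "Part" and f.get("Tense") == "Past":
        for name in ("Gender", "Number"):
            if name not in f:
                issues.append(f"past participle without {name}")
    return issues


print(check_verbal_feats("Gender=Fem|Number=Sing|Tense=Past|VerbForm=Part"))
print(check_verbal_feats("Tense=Past|VerbForm=Part"))
```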
Pronouns, Determiners, Quantifiers
- PronType is used with pronouns (PRON) and determiners (DET).
- NumType is used with numerals (NUM) and adjectives (ADJ).
- The Poss=Yes feature marks possessive personal determiners (e.g. mon “my”).
- The Reflex=Yes feature is always used on PRON together with PronType=Prs, and it marks reflexive pronouns (me, te, se, nous, vous).
- Person is a lexical feature of personal pronouns (PRON) and has three values: 1, 2 and 3. Person is not marked on other types of pronouns or on nouns, although they can almost always be interpreted as 3rd person.
- The layered features Number[psor] and Person[psor] are used on possessive personal determiners to indicate possessor-related features.
Note that since version 2.17, the four treebanks built from SUD (GSD, Sequoia, ParisStories and Rhapsodie) use a more detailed feature system:
- Number[cxtx] and Gender[cxtx] when the corresponding feature is not morphologically marked but can be inferred from the context.
- Number[lex] and Gender[lex] when the corresponding feature is lexical and not morphological, such as the Gender of nouns.
- Tense[denom], which is used for denominative features (associated with VerbForm=Part).
See the following paper for more details: Sylvain Kahane, Bruno Guillaume, Léna Brun, and Simeng Song. 2025. Status of morphosyntactic features Illustration with written and spoken French UD treebanks. In Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025), pages 154–159, Ljubljana, Slovenia. Association for Computational Linguistics.
See French features for links.
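Layered feature names such as Number[psor] or Gender[lex] can be separated into a base name and a layer when processing the FEATS column. The following is a minimal sketch; the function names and the example FEATS string (a plausible annotation for a possessive determiner like mon) are assumptions made for illustration.

```python
import re

# A feature name optionally followed by a bracketed layer, e.g. Number[psor].
LAYERED_RE = re.compile(r"^(?P<name>[A-Za-z0-9]+)(?:\[(?P<layer>[a-z0-9]+)\])?$")


def split_feature_name(key):
    """Split 'Number[psor]' into ('Number', 'psor'); plain names get layer None."""
    match = LAYERED_RE.match(key)
    if match is None:
        raise ValueError(f"not a valid feature name: {key!r}")
    return match.group("name"), match.group("layer")


def group_by_layer(feats):
    """Group a FEATS string into {layer: {name: value}}; None is the default layer."""
    grouped = {}
    if feats == "_":
        return grouped
    for pair in feats.split("|"):
        key, value = pair.split("=", 1)
        name, layer = split_feature_name(key)
        grouped.setdefault(layer, {})[name] = value
    return grouped


# Hypothetical FEATS value for a possessive determiner.
example = "Gender=Masc|Number=Sing|Number[psor]=Sing|Person[psor]=1|Poss=Yes|PronType=Prs"
print(group_by_layer(example))
```

Keeping the layer as a separate key makes it easy to compare, for example, the Number of the determiner itself with the Number of its possessor.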
Syntax
This is an overview only. For more detailed discussion and examples, see the list of French relations.
Core Arguments, Oblique Arguments and Adjuncts
- The dominant word order in French is SVO, but other word orders are also possible.
- Nominal subject (nsubj) is generally a bare noun phrase without preposition. If it is a personal pronoun, it must be in the nominative form. The morphology of a finite verb (or auxiliary) cross-references the person and number of its subject.
- Direct nominal object (obj) is a bare noun phrase or a pronoun with accusative case.
  The accusative pronoun is a clitic and its position in the word order is fixed: it precedes the verb (except when Mood=Imp).
- The relation iobj is used for indirect objects when they are pronouns and come with no preposition.
- The relation obl for oblique arguments is given with subtype arg or mod in most treebanks (see table below).
Relations Overview
The following relation subtypes are used in French:
| Corpus | FQB | GSD | ParisStories | ParTUT | PUD | Rhapsodie | Sequoia |
|---|---|---|---|---|---|---|---|
| acl:relcl | 77 | 3240 | 310 | 301 | 227 | 507 | 520 |
| advcl:cleft | 17 | 212 | 40 | 78 | 20 | ||
| aux:caus | 3 | 250 | 16 | 13 | 9 | 27 | 34 |
| aux:pass | 247 | 3401 | 105 | 241 | 226 | 134 | 759 |
| aux:tense | 503 | 3837 | 1012 | 568 | 492 | 948 | |
| csubj:pass | 26 | 1 | 2 | 1 | 4 | ||
| dep:comp | 15 | 27 | 40 | 5 | |||
| expl:comp | 176 | 211 | 298 | 28 | 293 | 44 | |
| expl:pass | 687 | 23 | 33 | 57 | |||
| expl:pv | 1017 | 49 | 2 | 242 | |||
| expl:subj | 333 | 931 | 314 | 83 | 425 | 237 | |
| flat:foreign | 131 | 1075 | 3 | 113 | 6 | 136 | |
| flat:name | 581 | 7005 | 31 | 61 | 252 | 161 | 807 |
| iobj:agent | 24 | 1 | 1 | 1 | |||
| nmod:appos | 4 | 121 | |||||
| nsubj:caus | 1 | 132 | 4 | 4 | 4 | 14 | 16 |
| nsubj:outer | 23 | 23 | 14 | 3 | |||
| nsubj:pass | 240 | 3666 | 41 | 224 | 200 | 123 | 620 |
| obj:agent | 111 | 3 | 9 | 4 | 12 | ||
| obj:lvc | 554 | 84 | 68 | 2 | |||
| obl:agent | 30 | 1554 | 2 | 69 | 1 | 3 | 281 |
| obl:arg | 570 | 8670 | 508 | 80 | 812 | 1608 | |
| obl:mod | 611 | 15927 | 1057 | 81 | 1118 | 2392 | |
| parataxis:insert | 183 | 15 | 126 | ||||
| parataxis:parenth | 27 | 39 |
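Counts of this kind could be recomputed from a treebank file with a short script. The sketch below is one possible way to do it, not necessarily how the table was produced; the function name `count_subtypes` and the sample sentence are hypothetical.

```python
from collections import Counter


def count_subtypes(conllu_text):
    """Count subtyped relations (DEPREL values containing ':') in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multi-word token ranges (3-4) and empty nodes (5.1)
        deprel = cols[7]
        if ":" in deprel:
            counts[deprel] += 1
    return counts


# Hypothetical sample sentence, written by hand for this sketch.
sample = (
    "# text = Il a été vu\n"
    "1\tIl\til\tPRON\t_\t_\t4\tnsubj:pass\t_\t_\n"
    "2\ta\tavoir\tAUX\t_\t_\t4\taux:tense\t_\t_\n"
    "3\tété\têtre\tAUX\t_\t_\t4\taux:pass\t_\t_\n"
    "4\tvu\tvoir\tVERB\t_\t_\t0\troot\t_\t_\n"
)
for relation, n in sorted(count_subtypes(sample).items()):
    print(relation, n, sep="\t")
```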
Treebanks
There are nine French UD treebanks:
- UD_French-ALTS
- UD_French-FQB
- UD_French-GSD
- UD_French-ParisStories
- UD_French-ParTUT
- UD_French-PoitevinDIVITAL
- UD_French-PUD
- UD_French-Rhapsodie
- UD_French-Sequoia
Note that UD_French-FTB has been retired because it was not updated to follow the latest validation constraints.