
This page pertains to UD version 2.

UD for English

UD English contains data from multiple treebanks created by different teams at different times, often with different conversion tools (from gold constituent treebanks, such as the English Web Treebank for English-EWT, or from different gold dependency treebanks, such as English-GUM). As a result, differences may sometimes be found across treebanks, though an effort is made to harmonize them when issues are identified.

Tokenization and Word Segmentation

Words are generally delimited by whitespace or punctuation. No tokens in any of the UD English corpora currently contain whitespace. Multiword tokens should be used for English clitics, such as ’ll (reduced form of the auxiliary will), n’t (reduced form of not) and ’s (possessive clitic); for example, don’t = do + n’t. As of mid 2021, multiword tokens are used in GUM, GUMReddit, and EWT; they are partially used in ParTUT (for forms like ain’t and can’t, but not for concatenative forms like John’s or she’ll); and they are not used in PUD, LinES, Pronouns, or ESL. Where multiword tokens are not present, English clitics can usually be identified using the SpaceAfter=No annotation, which also distinguishes between otherwise identical token sequences, such as “can not” versus “cannot”.
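The interaction between multiword tokens and SpaceAfter=No can be illustrated with a minimal Python sketch. It assumes the standard ten-column CoNLL-U layout (ID, FORM, ..., MISC), with multiword tokens given as ID ranges such as 2-3. The function names and both sample sentences are made up for illustration and are not taken from any of the treebanks.

```python
def parse_conllu_sentence(block):
    """Split one CoNLL-U sentence into rows of 10 tab-separated columns."""
    rows = []
    for line in block.strip().splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        rows.append(line.split("\t"))
    return rows


def surface_text(rows):
    """Rebuild the surface string from multiword tokens and SpaceAfter=No."""
    pieces = []
    covered_until = 0  # last word ID covered by the current multiword token
    for cols in rows:
        tok_id, form, misc = cols[0], cols[1], cols[9]
        if "." in tok_id:
            continue  # empty nodes have no surface form
        if "-" in tok_id:
            covered_until = int(tok_id.split("-")[1])
        elif int(tok_id) <= covered_until:
            continue  # syntactic word inside a multiword token
        no_space = "SpaceAfter=No" in misc.split("|")
        pieces.append(form if no_space else form + " ")
    return "".join(pieces).rstrip()


def make_sentence(lines):
    """Illustration helper: turn space-separated columns into tab-separated ones."""
    return "\n".join("\t".join(line.split()) for line in lines)


# Hypothetical rows, written by hand for this sketch.
WITH_MWT = make_sentence([
    "1 I I PRON PRP _ 4 nsubj _ _",
    "2-3 don't _ _ _ _ _ _ _ _",
    "2 do do AUX VBP _ 4 aux _ _",
    "3 n't not PART RB _ 4 advmod _ _",
    "4 know know VERB VB _ 0 root _ SpaceAfter=No",
    "5 . . PUNCT . _ 4 punct _ _",
])

# Same idea without a multiword token: SpaceAfter=No on "can" yields "cannot".
WITHOUT_MWT = make_sentence([
    "1 I I PRON PRP _ 4 nsubj _ _",
    "2 can can AUX MD _ 4 aux _ SpaceAfter=No",
    "3 not not PART RB _ 4 advmod _ _",
    "4 go go VERB VB _ 0 root _ _",
])

print(surface_text(parse_conllu_sentence(WITH_MWT)))
print(surface_text(parse_conllu_sentence(WITHOUT_MWT)))
```

Dropping SpaceAfter=No from the second sample would produce “can not” instead, which is the distinction the annotation preserves in corpora without multiword tokens.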

Units that should be regarded as separate syntactic words include:

Units that are not tokenized apart include:

Morphology

Tags

All corpora use the full range of UPOS tags. The XPOS column uses the Penn Treebank tagset (as extended in subsequent LDC corpus releases). Note that XPOS does not have a simple mapping to UPOS tags, as UD guidelines enforce complex relations between dependency relations and POS tags: for example, since the relation advmod must generally have the tag ADV, UPOS may have ADV for some non-adverbial XPOS tags, and vice versa.
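One way to see where XPOS and UPOS diverge in a given treebank is to cross-tabulate the two columns. The sketch below assumes rows already split into the ten CoNLL-U columns (UPOS at index 3, XPOS at index 4); the function names and the three sample rows are hypothetical and only show the shape of the computation.

```python
from collections import Counter, defaultdict


def xpos_upos_table(rows):
    """Count, for each XPOS tag, how often each UPOS tag co-occurs with it."""
    table = defaultdict(Counter)
    for cols in rows:
        if "-" in cols[0] or "." in cols[0]:
            continue  # multiword token ranges and empty nodes carry no tags
        upos, xpos = cols[3], cols[4]
        table[xpos][upos] += 1
    return table


def one_to_many(table):
    """Return the XPOS tags that map to more than one UPOS tag."""
    return sorted(xpos for xpos, counts in table.items() if len(counts) > 1)


# Hypothetical rows for illustration only.
sample_rows = [
    "1 quickly quickly ADV RB _ 0 root _ _".split(),
    "2 n't not PART RB _ 1 advmod _ _".split(),
    "3 dog dog NOUN NN _ 1 obj _ _".split(),
]

table = xpos_upos_table(sample_rows)
print(one_to_many(table))
```

Run over a full treebank file, the same table gives a quick list of XPOS tags for which a single fixed UPOS mapping would be wrong.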

Closed class auxiliaries (tagged AUX) include:

Features

All treebanks currently contain whitespace information, except for English-ESL. Morphological features are included in all corpora except English-ESL. In some corpora these are added automatically using CoreNLP (EWT, GUM) and in some cases supplemented using information from other annotation layers (e.g. GUM).

Syntax

Standard deprels are used, except for clf, which is not used in any treebank. Commonly used custom subtypes include obl:npmod for oblique nominals (corresponding to Stanford Dependencies npadvmod), and nmod:tmod and obl:tmod for temporal nouns used adverbially (e.g. “today”), based on the Stanford Dependencies label tmod. Additional subtypes distinguish passives (nsubj:pass, csubj:pass), pre-nominal possessives (nmod:poss), predeterminers (det:predet for “both” in “both the children”), preconjunctions (cc:preconj for “either” in “either X or Y”), and phrasal verb particles (compound:prt for “up” in “pick up”).
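Because subtypes are written as universal-relation:subtype, tools that only need the universal relation can split on the first colon. The following is a small sketch of that convention; the helper names are hypothetical, and the sample list simply reuses labels mentioned above.

```python
from collections import Counter


def split_deprel(deprel):
    """Split a label such as 'obl:npmod' into ('obl', 'npmod')."""
    base, _, subtype = deprel.partition(":")
    return base, subtype or None  # None when there is no subtype


def subtype_counts(deprels):
    """Count only the labels that carry a subtype."""
    return Counter(d for d in deprels if ":" in d)


# Labels taken from the paragraph above, in an arbitrary order.
sample = ["nsubj", "obl:npmod", "nsubj:pass", "compound:prt", "obl:npmod"]

print([split_deprel(d) for d in sample])
print(subtype_counts(sample))
```

Collapsing labels to their universal part in this way makes cross-treebank comparison easier when a treebank does not use a particular subtype.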

For more information, see the list of English relations.

Treebanks

There are six English UD treebanks:

Comparative statistics for tags in the treebanks are available here:

https://universaldependencies.org/treebanks/en-comparison.html