
This page pertains to UD version 2.


UD for English

UD English contains data from multiple treebanks created by different teams at different times, often with different conversion tools (from gold constituent treebanks, such as the English Web Treebank for English-EWT, or from gold dependency treebanks, such as English-GUM). As a result, differences may sometimes be found across treebanks, though an effort is made to harmonize them when issues are identified.

Tokenization and Word Segmentation

Words are generally delimited by whitespace or punctuation. No tokens in any of the UD English corpora currently contain whitespace. Multiword tokens should be used for English clitics, such as ’ll (reduced form of the auxiliary will), n’t (reduced form of not) and ’s (possessive clitic); for example, don’t = do + n’t. As of mid 2021, multiword tokens are used in the following English corpora: GUM, GUMReddit, and EWT. They are partially used in ParTUT (for forms like ain’t and can’t, but not for concatenative forms like John’s or she’ll), and are not used in PUD, LinES, Pronouns, or ESL. If multiword tokens are not present, clitics in English can usually be identified using the SpaceAfter=No annotation, which also distinguishes between otherwise identical token sequences, such as “can not” versus “cannot”.
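As a rough illustration of how multiword token ranges and SpaceAfter=No interact, the sketch below rebuilds the surface string of a sentence from simplified CoNLL-U rows. The rows, the three-column layout (ID, FORM, MISC) and the function name surface_text are made up for this example; it assumes there are no empty nodes (IDs such as 1.1) and is not taken from any UD tool.

```python
# Hypothetical sentence, reduced to three CoNLL-U columns: ID, FORM, MISC.
ROWS_WITH_MWT = [
    ("1", "I", "_"),
    ("2-3", "don't", "_"),   # multiword token covering words 2 and 3
    ("2", "do", "_"),
    ("3", "n't", "_"),
    ("4", "know", "SpaceAfter=No"),
    ("5", ".", "_"),
]

def surface_text(rows):
    """Rebuild the surface string, preferring multiword token forms over their parts."""
    pieces = []
    covered_until = 0  # last word ID covered by the most recent multiword token
    for tok_id, form, misc in rows:
        if "-" in tok_id:
            # Range line such as "2-3": emit its form and skip the words it covers.
            _, end = (int(x) for x in tok_id.split("-"))
            covered_until = end
        elif int(tok_id) <= covered_until:
            continue
        pieces.append(form)
        if "SpaceAfter=No" not in misc.split("|"):
            pieces.append(" ")
    return "".join(pieces).rstrip()

# Without multiword tokens, SpaceAfter=No alone separates "cannot" from "can not".
CANNOT = [("1", "can", "SpaceAfter=No"), ("2", "not", "_")]
CAN_NOT = [("1", "can", "_"), ("2", "not", "_")]

print(surface_text(ROWS_WITH_MWT))
print(surface_text(CANNOT))
print(surface_text(CAN_NOT))
```

In a treebank without multiword tokens, the same function works unchanged, because the clitic is simply a word whose predecessor carries SpaceAfter=No.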

Units that should be regarded as separate syntactic words include:

Clitic auxiliaries (’ll, ’m, ’s, ’ve, ’d, …)
Possessive genitive markers (’s, ’)
Clitic negation (n’t, and also not in cannot)
Most hyphenated terms (search-engine becomes 3 words: search, -, engine)

Units that are not tokenized apart include:

Acronyms (FBI, U.S.)
Abbreviations without spaces (e.g., i.e.)
Some hyphenated words, with common prefixes or occasionally suffixes, such as e-mail or co-ordinated
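The two lists above can be approximated with a toy splitter, shown here only to make the rules concrete. The clitic pattern, the KEEP_HYPHENATED set and the function name split_token are illustrative assumptions, not the procedure used to build any of the treebanks, which rely on their own conversion tools and manual correction. The sketch uses straight apostrophes and ignores many real cases (for example, quoted words ending in an apostrophe).

```python
import re

# Illustrative, non-exhaustive clitic pattern (assumption for this sketch).
CLITIC_RE = re.compile(r"^(.+?)(n't|'ll|'m|'s|'ve|'d|'re|')$", re.IGNORECASE)

# Hyphenated words kept as one token; a real list would be much longer.
KEEP_HYPHENATED = {"e-mail", "co-ordinated"}

def split_token(token):
    """Split one whitespace-delimited token into syntactic words (toy rules)."""
    lower = token.lower()
    if lower == "cannot":
        # "cannot" is treated as can + not.
        return [token[:3], token[3:]]
    if "-" in token and lower not in KEEP_HYPHENATED:
        # Most hyphenated terms: the hyphen becomes its own word.
        return [part for part in re.split(r"(-)", token) if part]
    match = CLITIC_RE.match(token)
    if match:
        return [match.group(1), match.group(2)]
    # Acronyms, abbreviations and everything else stay whole.
    return [token]

for example in ["don't", "John's", "cannot", "search-engine", "e-mail", "U.S."]:
    print(example, "->", split_token(example))
```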

Morphology

Features

All treebanks currently contain whitespace information, except for English-ESL. Morphological features are included in all corpora except English-ESL. In some corpora these are added automatically using CoreNLP (EWT, GUM) and in some cases supplemented using information from other annotation layers (e.g. GUM).

The English-specific documentation pages for the tags AUX, DET, and PRON also discuss morphological features.

Syntax

Standard deprels are used, except for clf, which is not used in any treebank. Commonly used custom subtypes include obl:npmod for oblique nominals (corresponding to Stanford Dependencies npadvmod), and nmod:tmod and obl:tmod for temporal nouns used adverbially (e.g. “today”), based on the Stanford Dependencies label tmod. Additionally, there are subtypes for passives (nsubj:pass, csubj:pass), pre-nominal possessives (nmod:poss), predeterminers (det:predet for “both” in “both the children”), preconjunctions (cc:preconj for “either” in “either X or Y”), and a special compound subtype for phrasal verb particles (compound:prt for “up” in “pick up”).
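When comparing treebanks, it can help to separate the universal relation from its subtype, since both are written together as relation:subtype. The following is a minimal sketch of that split; the helper name split_deprel and the sample label list are assumptions for illustration, using only labels mentioned above plus obj.

```python
from collections import Counter

def split_deprel(deprel):
    """Return (universal relation, subtype or None) for a label like 'nsubj:pass'."""
    base, _, subtype = deprel.partition(":")
    return base, subtype or None

# Sample labels, not drawn from any particular treebank.
LABELS = [
    "nsubj:pass", "csubj:pass", "obl:npmod", "obl:tmod", "nmod:tmod",
    "nmod:poss", "det:predet", "cc:preconj", "compound:prt", "obj",
]

# Count labels by universal relation, ignoring subtypes.
by_base = Counter(split_deprel(label)[0] for label in LABELS)
print(by_base.most_common())
```

Collapsing to the universal relation in this way makes counts comparable across treebanks that differ in which subtypes they annotate.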

For more information, see the list of English relations.

Treebanks

There are ten active English UD treebanks:

The following treebank is retired (no longer being maintained or included in releases):

English-ESL

Comparative statistics for tags in the treebanks are available here: