
This page pertains to UD version 2.

UD English Pronouns

Language: English (code: en)
Family: Indo-European, Germanic

This treebank has been part of Universal Dependencies since the UD v2.5 release.

The following people have contributed to making this treebank part of UD: Robert Munro.

Repository: UD_English-Pronouns
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.5

License: CC BY-SA 4.0

Genre: grammar-examples

Questions, comments? General annotation questions (either English-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [rmunro (æt) alumni • stanford • edu]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually
UPOS annotated manually, natively in UD style
XPOS annotated manually
Features annotated manually, natively in UD style
Relations annotated manually, natively in UD style

Description

UD English-Pronouns is a dataset created to make pronoun identification more accurate and to provide a more balanced distribution across genders. The dataset initially targets the Independent Genitive pronouns “hers”, (independent) “his”, (singular) “theirs”, “mine”, and (singular) “yours”.

The Independent Genitive pronoun “hers” is wrongly classified as a noun or adjective by the most widely used parsers (as of October 2019). This includes (alphabetically) Amazon Comprehend, Google Natural Language API, and the Stanford Parser. The error was traced to the data rather than to the algorithms themselves, so this dataset was created to fix it.

Cause of errors: missing examples and annotations

The errors in widely used syntactic parsers most likely arise because “hers” is rare in the existing datasets and completely absent from any standard test data. The pronoun “hers” occurs only three times across all the Universal Dependencies datasets (as of October 2019). In none of those three occurrences is it marked with “Case=Gen”, “Poss=Yes”, or “PronType=Prs”, which would be the correct morphological features (FEATS) for “hers” in any context.

In one of the three occurrences, “hers” was correctly annotated as “P3SG-GEN-INDEP” in the language-specific part-of-speech tag (XPOS) field, but this field is largely ignored by general-purpose syntactic parsers.
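The feature check described above can be sketched in a few lines of Python. This is only an illustration: it assumes the standard 10-column, tab-separated CoNLL-U token layout (FEATS in the sixth column), the function names are hypothetical, and both token lines are invented rather than copied from any treebank.

```python
# Features the text lists as correct for "hers" in any context.
REQUIRED_FEATS = {"Case": "Gen", "Poss": "Yes", "PronType": "Prs"}


def parse_feats(feats_field):
    """Turn a CoNLL-U FEATS string such as 'Case=Gen|Poss=Yes' into a dict."""
    if feats_field == "_":  # "_" marks an empty FEATS column
        return {}
    return dict(pair.split("=", 1) for pair in feats_field.split("|"))


def missing_feats(token_line):
    """Return the required features that a CoNLL-U token line lacks."""
    columns = token_line.rstrip("\n").split("\t")
    feats = parse_feats(columns[5])  # FEATS is the sixth column
    return {k: v for k, v in REQUIRED_FEATS.items() if feats.get(k) != v}


# Invented token lines for illustration only (XPOS left empty as "_").
good = "\t".join([
    "5", "hers", "hers", "PRON", "_",
    "Case=Gen|Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs",
    "3", "nmod", "_", "_",
])
bad = "\t".join([
    "5", "hers", "hers", "NOUN", "_", "Number=Plur", "3", "nmod", "_", "_",
])

print(missing_feats(good))  # {}
print(missing_feats(bad))   # all three required features are reported
```

Running such a check over every token whose form is “hers” would be one way to surface the annotation gaps described here.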

The three examples are in the training data, so the complete absence of “hers” from the development and test data might have let this error slip under the radar.

In general, Masculine pronouns are 4x more frequent than Feminine pronouns in the UD English datasets compiled to date (October 2019), which also contributes to why the error might have been missed.
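A frequency comparison of this kind could be reproduced with a short script along the following lines. It is a sketch under stated assumptions: it relies on the standard CoNLL-U column layout (UPOS in the fourth column, FEATS in the sixth), the helper names are hypothetical, and the sample sentence is invented, so its counts say nothing about the real treebanks.

```python
from collections import Counter


def count_pronoun_genders(conllu_text):
    """Count Gender values on tokens tagged PRON in CoNLL-U text."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != "PRON":
            continue  # keep only well-formed pronoun tokens
        for pair in cols[5].split("|"):
            if pair.startswith("Gender="):
                counts[pair.split("=", 1)[1]] += 1
    return counts


def tok(idx, form, upos, feats):
    """Build a simplified token line; unused columns are left as '_'."""
    return "\t".join([str(idx), form, "_", upos, "_", feats, "_", "_", "_", "_"])


# Invented sample, not taken from any treebank.
SAMPLE = "\n".join([
    "# text = He saw her and him.",
    tok(1, "He", "PRON", "Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs"),
    tok(2, "saw", "VERB", "Tense=Past"),
    tok(3, "her", "PRON", "Case=Acc|Gender=Fem|Number=Sing|Person=3|PronType=Prs"),
    tok(4, "and", "CCONJ", "_"),
    tok(5, "him", "PRON", "Case=Acc|Gender=Masc|Number=Sing|Person=3|PronType=Prs"),
    tok(6, ".", "PUNCT", "_"),
])

counts = count_pronoun_genders(SAMPLE)
print(counts["Masc"], counts["Fem"])  # 2 1
```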

So, the errors result from a combination of the inherent imbalance in the datasets (and, by extension, in the sources they are drawn from) and gaps in annotation to date. There are also linguistic reasons for the gap, as outlined below.

Inherent gender bias

Feminine Independent Genitive pronoun

The Feminine Dependent and Independent Genitive pronouns differ from the Masculine Genitive pronouns by having two forms, “her” and “hers”, instead of using the same form, “his”, for both. For example: “her car”, “car of hers” / “his car”, “car of his”.

It is almost certain that the most popular syntactic parsers identify the Masculine Independent Genitive pronoun correctly because “his” has the same form as the Dependent Genitive.

So, while the errors result from arbitrary linguistic distinctions that are not any person’s fault, they have resulted in a situation that patterns with gender bias. For example, if you are building an information extraction system that relies on pronouns to know who possesses what, it will be more accurate for information about people referred to by Masculine pronouns than by Feminine ones.

Singular Neutral Genitive pronoun

Every instance of “they/them/their/theirs” in the existing datasets is annotated as plural, so this also presents a potential gender bias in the data. Many individuals prefer “they/them/their/theirs” as their singular personal pronouns. So, this dataset also targets examples of singular “theirs”.

Other Genitive pronouns

For comprehensiveness across the most widely used Independent Genitive Singular pronouns, “mine” and (singular) “yours” are also included. There is a very large number of additional singular pronouns used in variants of English (like “ze”). Their existence is acknowledged here, and the dataset can be extended to them by applying find-and-replace to the “they/them/their/theirs” variants in the dataset.
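The find-and-replace extension mentioned above might look roughly like the sketch below. The “ze/zir/zir/zirs” mapping is only one assumed paradigm chosen for illustration (other paradigms exist), and the function name is hypothetical.

```python
import re

# One assumed paradigm, for illustration only; other variants are in use.
REPLACEMENTS = {"they": "ze", "them": "zir", "their": "zir", "theirs": "zirs"}

# Whole-word, case-insensitive match on the four source forms.
PATTERN = re.compile(r"\b(theirs|their|them|they)\b", re.IGNORECASE)


def swap_pronouns(text, mapping=REPLACEMENTS):
    """Replace they/them/their/theirs, keeping an initial capital letter."""
    def replace(match):
        word = match.group(0)
        new = mapping[word.lower()]
        return new.capitalize() if word[0].isupper() else new
    return PATTERN.sub(replace, text)


print(swap_pronouns("The car is theirs, and I gave them their keys."))
# The car is zirs, and I gave zir zir keys.
```

A plain text substitution like this does not adjust verb agreement (“they drive” versus “ze drives”) or the lemma and feature columns of an annotated file, so converted sentences would still need manual review.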

Grammatical Diversity

The Independent Genitives can occur in more syntactic contexts than any other pronoun: more than all the other pronouns combined. So, the new dataset is adding a lot more grammatical diversity to the overall Universal Dependencies dataset, too!

How the dataset was created and is structured

The dataset was created manually, targeting grammatical diversity. For example, there are sentences with “hers” appearing as the subject, object, indirect object, and oblique arguments; sentences with “hers” in a conjunction; sentences with “hers” in a complement clause; etc.

The majority of sentences are completely unique in terms of their dependency tree and constituents. Sentences that share the same dependency tree and constituent structure alternate either an important linguistic feature in English, like regular and irregular verbs, or a linguistic feature that would have different syntax and/or morphology in other languages, like the locative/ablative case distinction.

The “comment” field describes exactly what grammatical structure(s) are captured in each sentence. The Independent Genitive is the most flexible pronoun, able to appear in a sentence almost anywhere any noun phrase can appear.
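Sentence-level fields like this one can be read with a small parser. The sketch below assumes the field is stored using the usual CoNLL-U convention of `# key = value` comment lines before each sentence; the function name is hypothetical and the two sample sentences and their comments are invented, not taken from the treebank.

```python
def sentence_metadata(conllu_text):
    """Yield one dict of '# key = value' comment fields per sentence."""
    fields = {}
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:  # a blank line ends the current sentence
            if fields:
                yield fields
            fields = {}
        elif line.startswith("#") and "=" in line:
            key, value = line[1:].split("=", 1)
            fields[key.strip()] = value.strip()
        # token lines are ignored in this sketch
    if fields:  # last sentence when the text has no trailing blank line
        yield fields


# Invented sample for illustration; token lines omitted for brevity.
SAMPLE = """\
# text = Hers is red.
# comment = subject of a copular clause

# text = I borrowed hers.
# comment = direct object
"""

for meta in sentence_metadata(SAMPLE):
    print(meta["text"], "->", meta["comment"])
```

The same function would pick up any other `# key = value` field attached to a sentence, so it does not need to know the field names in advance.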

A “previous” field is also added to supply a plausible preceding sentence. Independent pronouns refer to entities that are not syntactically local, and those entities are typically made salient through context. It would be too unnatural for each sentence to name, within the same sentence, the entity that the pronoun refers to, so the previous sentence should help.

Acknowledgments

The dataset was created by Robert Munro while writing Human-in-the-Loop Machine Learning (Manning Publications).

Please cite this book if using this dataset.

Statistics of UD English Pronouns

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PUNCT, SCONJ, VERB

Features

Case, Definite, Degree, Gender, Mood, Number, NumType, Person, Polarity, Poss, PronType, Tense, VerbForm

Relations

advcl, advmod, amod, appos, aux, aux:pass, case, cc, ccomp, conj, cop, csubj, det, det:predet, expl, iobj, mark, nmod, nsubj, obj, obl, orphan, parataxis, punct, root, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Relations Overview