UD English Pronouns
Language: English (code: en)
Family: IE
This treebank has been part of Universal Dependencies since the UD v2.5 release.
The following people have contributed to making this treebank part of UD: Robert Munro.
Repository: UD_English-Pronouns
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: grammar-examples
Questions, comments? General annotation questions (either English-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [rmunro (æt) alumni • stanford • edu]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually, natively in UD style |
XPOS | annotated manually |
Features | annotated manually, natively in UD style |
Relations | annotated manually, natively in UD style |
Description
UD English-Pronouns is a dataset created to make pronoun identification more accurate and to give a more balanced distribution across genders. The dataset initially targets the Independent Genitive pronouns: “hers”, (independent) “his”, (singular) “theirs”, “mine”, and (singular) “yours”.
As of October 2019, the Independent Genitive pronoun “hers” is wrongly classified as a noun or adjective by the most widely used parsers. This includes (alphabetically) Amazon Comprehend, Google Natural Language API, and the Stanford Parser. This error was traced to the data, not the algorithms themselves, and so this dataset was created to fix these errors.
Cause of errors: missing examples and annotations
The errors in widely used syntactic parsers most likely arise because “hers” is rare in the existing datasets and completely absent from any standard test data. The pronoun “hers” occurs only three times in the entire Universal Dependencies datasets (as of October 2019). In none of those three occurrences is it marked with “Case=Gen”, “Poss=Yes”, or “PronType=Prs”, which would be the correct morphological features (FEATS) for “hers” in any context.
In one of the three occurrences, “hers” was correctly annotated as “P3SG-GEN-INDEP” in the language-specific part-of-speech tag (XPOS) field. But this field is largely ignored by general-purpose syntactic parsers.
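The FEATS check described above can be automated. The following is a minimal sketch, assuming the data is in the standard ten-column, tab-separated CoNLL-U format; the function names and the two-token sample are invented for illustration and are not taken from any treebank.

```python
# Expected FEATS for "hers", as listed in the text above.
EXPECTED = {"Case": "Gen", "Poss": "Yes", "PronType": "Prs"}


def parse_feats(feats):
    """Turn a CoNLL-U FEATS string such as 'Case=Gen|Poss=Yes' into a dict."""
    if feats in ("", "_"):
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def missing_feats(conllu_text, form="hers"):
    """Yield (line_number, missing_feature_names) for tokens whose FORM is `form`."""
    for lineno, line in enumerate(conllu_text.splitlines(), start=1):
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10 or cols[1].lower() != form:
            continue
        feats = parse_feats(cols[5])
        missing = sorted(k for k, v in EXPECTED.items() if feats.get(k) != v)
        if missing:
            yield lineno, missing


# Invented sample: one fully annotated token and one mis-tagged token.
SAMPLE = "\n".join([
    "\t".join(["1", "Hers", "her", "PRON", "_",
               "Case=Gen|Gender=Fem|Number=Sing|Person=3|Poss=Yes|PronType=Prs",
               "0", "root", "_", "_"]),
    "\t".join(["2", "hers", "hers", "NOUN", "_", "Number=Sing",
               "1", "conj", "_", "_"]),
])

problems = list(missing_feats(SAMPLE))
print(problems)
```

Only the second sample token is reported, because it lacks all three expected features.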
The three examples are in the training data, so the complete absence of “hers” in the development and test data might have let this error slip under the radar.
In general, Masculine pronouns are 4x more frequent than Feminine pronouns in the UD English datasets that have been compiled to date (October 2019). This also contributes to why the error might have been missed.
So, the errors result from a combination of the inherent imbalance in the datasets (and, by extension, the sources they are drawn from) and gaps in annotation to date. There are also linguistic reasons for the gap, as outlined below.
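An imbalance between Masculine and Feminine pronouns of the kind described above can be measured by counting the `Gender` feature on tokens tagged `PRON`. This is a rough sketch that assumes CoNLL-U input; `gender_counts` is a hypothetical helper and the sample rows are invented, so the counts below say nothing about any real treebank.

```python
from collections import Counter


def gender_counts(conllu_text):
    """Count Gender values on tokens whose UPOS is PRON."""
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or cols[3] != "PRON":
            continue
        for pair in cols[5].split("|"):
            if pair.startswith("Gender="):
                counts[pair.split("=", 1)[1]] += 1
    return counts


def row(idx, form, upos, feats):
    """Build one invented CoNLL-U token line (only the columns used here matter)."""
    return "\t".join([str(idx), form, "_", upos, "_", feats, "0", "root", "_", "_"])


SAMPLE = "\n".join([
    row(1, "his", "PRON", "Gender=Masc|Number=Sing|Person=3"),
    row(2, "his", "PRON", "Gender=Masc|Number=Sing|Person=3"),
    row(3, "hers", "PRON", "Gender=Fem|Number=Sing|Person=3"),
    row(4, "car", "NOUN", "Number=Sing"),
])

counts = gender_counts(SAMPLE)
ratio = counts["Masc"] / counts["Fem"]
print(counts, ratio)
```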
Inherent gender bias
Feminine Independent Genitive pronoun
The Feminine Dependent and Independent Genitive pronouns differ from the Masculine Genitive pronouns by having two forms, “her” and “hers”, instead of using the same form, “his”, for both. For example: “her car”, “car of hers” / “his car”, “car of his”.
It is almost certain that the most popular syntactic parsers identify the Masculine Independent Genitive pronoun correctly because “his” has the same form as the Dependent Genitive.
So, while the errors result from arbitrary linguistic distinctions that are not any person’s fault, they have resulted in a situation that patterns with gender bias. For example, if you are building an information extraction system that relies on pronouns to know who possesses what, it will be more accurate for information about people referred to by Masculine pronouns than by Feminine ones.
Singular Neutral Genitive pronoun
Every instance of “they/them/their/theirs” in the existing datasets is annotated as plural, so this also presents a potential gender bias in the data. Many individuals prefer “they/them/their/theirs” as their singular personal pronouns. So, this dataset also targets examples of singular “theirs”.
Other Genitive pronouns
For comprehensiveness across the most widely used Independent Genitive Singular pronouns, “mine”, and (singular) “yours” are also included. There are a very large number of additional variations of singular pronouns used in variants of English (like “ze”). Their existence is acknowledged here and the dataset can be extended to these with find-and-replace for the “they/them/their/theirs” variants in the dataset.
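The find-and-replace extension mentioned above could look roughly like the sketch below. The mapping is an assumption made purely for illustration (pronoun paradigms such as “ze” vary between speakers), and `swap_pronouns` is a hypothetical helper, not part of the dataset's tooling.

```python
import re


def swap_pronouns(text, mapping):
    """Replace whole-word pronouns per `mapping`, keeping initial capitalisation."""
    # Longest keys first so "theirs" is tried before "their".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b",
                         re.IGNORECASE)

    def repl(match):
        word = match.group(0)
        new = mapping[word.lower()]
        return new.capitalize() if word[0].isupper() else new

    return pattern.sub(repl, text)


# Hypothetical mapping: these target forms are an illustrative assumption.
ZE = {"they": "ze", "them": "zir", "their": "zir", "theirs": "zirs"}

print(swap_pronouns("Theirs is the car that I like.", ZE))
print(swap_pronouns("I like their car and theirs.", ZE))
```

Applied to a CoNLL-U file, a replacement like this would need to touch the FORM and LEMMA columns and the sentence text comment consistently, which this sketch does not attempt.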
Grammatical Diversity
The Independent Genitives can occur in more syntactic contexts than any other pronoun: more than all the other pronouns combined. So, the new dataset is adding a lot more grammatical diversity to the overall Universal Dependencies dataset, too!
How the dataset was created and is structured
The dataset was created manually, targeting grammatical diversity. For example, there are sentences with “hers” appearing as the subject, object, indirect object, and oblique arguments; sentences with “hers” in a conjunction; sentences with “hers” in a complement clause; etc.
The majority of sentences are completely unique in terms of their dependency tree and constituents. Sentences that share the same dependency tree and constituent structure alternate an important linguistic feature in English, like regular and irregular verbs, or linguistic features that would have different syntax and/or morphology in other languages, like the locative/ablative case distinction.
The “comment” field describes exactly what grammatical structure(s) are captured in each sentence. The Independent Genitive is the most flexible pronoun, able to appear in a sentence almost anywhere any noun phrase can appear.
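Sentence-level fields such as “comment” can be read with a few lines of code. This is a minimal sketch, assuming the fields are stored as standard CoNLL-U comment lines of the form `# key = value`; the sample sentence, its identifier, and its field values are invented for illustration and are not copied from the treebank.

```python
def sentence_metadata(conllu_text):
    """Return one dict of '# key = value' comment fields per sentence."""
    sentences, current = [], {}
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:
            # A blank line ends the current sentence block.
            if current:
                sentences.append(current)
                current = {}
        elif line.startswith("#") and "=" in line:
            key, value = line[1:].split("=", 1)
            current[key.strip()] = value.strip()
    if current:
        sentences.append(current)
    return sentences


# Invented sample block (not from the treebank).
SAMPLE = "\n".join([
    "# sent_id = example-1",
    "# text = Hers is clean.",
    "# previous = Whose car is clean?",
    "# comment = Independent Genitive as subject",
    "1\tHers\ther\tPRON\t_\t_\t3\tnsubj\t_\t_",
    "",
])

meta = sentence_metadata(SAMPLE)
print(meta[0]["comment"])
```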
A “previous” field is also added to include a viable previous sentence. Independent pronouns refer to entities that are not syntactically local, and those entities are typically made salient through context. It would be unnatural for a sentence to contain the explicit entity that its pronoun refers to. So, the previous sentence should help. Here are two examples of these comments:
Acknowledgments
The dataset was created by Robert Munro while writing Human-in-the-Loop Machine Learning (Manning Publications).
Please cite this book if using this dataset.
Statistics of UD English Pronouns
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – NOUN – NUM – PART – PRON – PUNCT – SCONJ – VERB
Features
Definite – Degree – Gender – Mood – Number – NumType – Person – Polarity – Poss – PronType – Tense – VerbForm
Relations
advcl – advmod – amod – appos – aux – aux:pass – case – cc – ccomp – conj – cop – det – det:predet – expl – iobj – mark – nmod – nsubj – obj – obl – orphan – parataxis – punct – root – xcomp
Tokenization and Word Segmentation
- This corpus contains 285 sentences, 1640 tokens and 1705 syntactic words.
- This corpus contains 330 tokens (20%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 3 types of words that contain both letters and punctuation. Examples: 's, n't, 'll
- This corpus contains 65 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
- There are 9 types of multi-word tokens. Examples: It's, car's, dealer's, isn't, Hers'll, His'll, Mine'll, Theirs'll, Yours'll.
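The figures above are mutually consistent: each multi-word token replaces its syntactic words in the token count, so 1705 words minus 65 × 2 covered words plus 65 multi-word tokens gives 1640 tokens. The sketch below shows one way such counts could be derived, assuming CoNLL-U input in which multi-word tokens have range IDs such as `1-2`; `token_stats` is a hypothetical helper and the sample sentence is invented.

```python
def token_stats(conllu_text):
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    sentences = words = mwts = mwt_words = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # blank line closes a sentence
            continue
        if line.startswith("#"):
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tok_id = line.split("\t", 1)[0]
        if "-" in tok_id:
            # Range ID: a multi-word token covering words start..end.
            start, end = map(int, tok_id.split("-"))
            mwts += 1
            mwt_words += end - start + 1
        elif "." not in tok_id:
            words += 1  # plain integer ID: a syntactic word
    return {
        "sentences": sentences,
        "tokens": words - mwt_words + mwts,
        "words": words,
        "multiword_tokens": mwts,
    }


def row(tok_id, form):
    """Build one invented CoNLL-U line; only ID and FORM matter here."""
    return "\t".join([tok_id, form] + ["_"] * 8)


SAMPLE = "\n".join([
    "# text = Hers'll be clean.",
    row("1-2", "Hers'll"), row("1", "Hers"), row("2", "'ll"),
    row("3", "be"), row("4", "clean"), row("5", "."),
    "",
])

stats = token_stats(SAMPLE)
print(stats)

# Consistency check on the numbers reported for this corpus.
assert 1705 - 65 * 2 + 65 == 1640
```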
Morphology
Tags
- This corpus uses 13 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, NOUN, NUM, PART, PRON, PUNCT, SCONJ, VERB
- This corpus does not use the following tags: PROPN, INTJ, SYM, X
- This corpus contains 5 word types tagged as particles (PART): 's, n't, not, nt, to
- This corpus contains 7 lemmas tagged as pronouns (PRON): her, his, it, my, their, there, your
- This corpus contains 3 lemmas tagged as determiners (DET): all, any, the
- This corpus contains 2 lemmas tagged as auxiliaries (AUX): be, will
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: be
- There are 4 (de)verbal forms:
  - Fin
    - AUX: 's, is, was, 'll, ai
    - VERB: cleaned, like, drove, sold, gave, is, knew, liked, Take, accelerated
  - Ger
    - VERB: using
  - Inf
    - VERB: clean
  - Part
    - VERB: cleaned, seeing, cleaning, painted
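The tag inventory reported at the start of this list (13 of the 17 UPOS tags used) can be reproduced with a set difference. This is a small sketch assuming CoNLL-U input; `upos_coverage` is a hypothetical helper, and the sample rows are invented.

```python
# The 17 universal part-of-speech tags defined by UD.
ALL_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def upos_coverage(conllu_text):
    """Return (used_tags, unused_tags) as sorted lists."""
    used = set()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Only plain word lines (integer ID) carry a UPOS tag.
        if len(cols) == 10 and cols[0].isdigit():
            used.add(cols[3])
    return sorted(used), sorted(ALL_UPOS - used)


# Tags this page lists as used in the corpus.
LISTED = set("ADJ ADP ADV AUX CCONJ DET NOUN NUM PART PRON PUNCT SCONJ VERB".split())
unused_listed = sorted(ALL_UPOS - LISTED)
print(unused_listed)

# Invented two-word sample.
SAMPLE = "\n".join([
    "\t".join(["1", "Hers", "her", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "broke", "break", "VERB", "_", "_", "0", "root", "_", "_"]),
])
used, unused = upos_coverage(SAMPLE)
```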
Nominal Features
- Gender
  - Fem
    - PRON: hers
  - Masc
    - PRON: his
  - Neut
    - PRON: mine, theirs, yours
- Number
  - Plur
    - NOUN: dealers, cars, bumps
    - VERB-Fin: like, sell, sold
  - Sing
    - AUX-Fin: 's, is, was, ai
    - NOUN: dealer, car, paint
    - PRON: hers, his, mine, theirs, yours, it
    - VERB-Fin: cleaned, drove, is, sold, accelerated, broke, get, hit, parks, saw
- Definite
  - Def
    - DET: the
Degree and Polarity
- Degree
  - Pos
    - ADJ: easy
- Polarity
  - Neg
    - PART: n't, not, nt
Verbal Features
- Mood
  - Imp
    - VERB-Fin: Take
  - Ind
    - AUX-Fin: 's, is, was, ai
    - VERB-Fin: cleaned, like, drove, sold, gave, is, knew, liked, accelerated, broke
  - Sub
    - VERB-Fin: get
- Tense
  - Past
    - AUX-Fin: was
    - VERB-Fin: cleaned, drove, sold, gave, knew, liked, accelerated, broke, came, got
    - VERB-Part: cleaned, painted
  - Pres
    - AUX-Fin: 's, is, ai
    - VERB-Fin: like, is, Take, do, get, parks, sell, sells
    - VERB-Part: seeing, cleaning
Pronouns, Determiners, Quantifiers
- PronType
  - Art
    - DET: the
  - Ind
    - DET: Any
  - Prs
    - PRON: hers, his, mine, theirs, yours, it
  - Tot
    - DET: all
- NumType
  - Card
    - NUM: One
- Poss
  - Yes
    - PRON: hers, his, mine, theirs, yours
- Person
  - 1
    - AUX-Fin: was, ai
    - PRON: mine, it
    - VERB-Fin: cleaned, like, drove, sold, accelerated, broke, get, hit, saw, sell
  - 2
    - AUX-Fin: ai
    - PRON: yours, it
    - VERB-Fin: cleaned, like, drove, sold, accelerated, broke, get, hit, saw, sell
  - 3
    - AUX-Fin: 's, is, was, ai
    - PRON: hers, his, theirs, it
    - VERB-Fin: cleaned, like, sold, drove, is, parks, sells, accelerated, broke, get
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Example: be.
- This corpus uses 1 lemma as an auxiliary (aux). Example: will.
- This corpus uses 1 lemma as a passive auxiliary (aux:pass). Example: be.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB-Fin--NOUN (140)
  - VERB-Fin--PRON (70)
  - VERB-Part--NOUN (10)
  - VERB-Part--PRON (5)
- obj
  - VERB-Fin--NOUN (35)
  - VERB-Fin--PRON (5)
  - VERB-Part--NOUN (10)
  - VERB-Part--PRON (10)
- iobj
  - VERB-Fin--PRON (5)
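Counts of this kind (verb parent, noun or pronoun child, grouped by relation) could be gathered along the lines of the sketch below. It assumes CoNLL-U input and labels each pair in the same `VERB-Fin--NOUN` style used above; `verb_argument_counts` is a hypothetical helper, not the script that produced these statistics, and the sample sentence is invented.

```python
import re
from collections import Counter


def verb_argument_counts(conllu_text, deprels=("nsubj", "obj", "iobj")):
    """Count (deprel, 'VERB-<VerbForm>--<child UPOS>') pairs."""
    counts = Counter()
    for block in re.split(r"\n\s*\n", conllu_text.strip()):
        # Index the words of one sentence by their ID.
        rows = {}
        for line in block.splitlines():
            cols = line.split("\t")
            if len(cols) == 10 and cols[0].isdigit():
                rows[cols[0]] = cols
        for cols in rows.values():
            parent = rows.get(cols[6])  # HEAD column; "0" (root) has no row
            if parent is None or cols[7] not in deprels:
                continue
            if parent[3] != "VERB" or cols[3] not in ("NOUN", "PRON"):
                continue
            feats = dict(p.split("=", 1) for p in parent[5].split("|") if "=" in p)
            form = feats.get("VerbForm")
            label = f"VERB-{form}--{cols[3]}" if form else f"VERB--{cols[3]}"
            counts[(cols[7], label)] += 1
    return counts


# Invented sample sentence: "Dealers sold hers ."
SAMPLE = "\n".join([
    "\t".join(["1", "Dealers", "dealer", "NOUN", "_", "Number=Plur", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "sold", "sell", "VERB", "_", "Mood=Ind|Tense=Past|VerbForm=Fin", "0", "root", "_", "_"]),
    "\t".join(["3", "hers", "her", "PRON", "_", "Poss=Yes|PronType=Prs", "2", "obj", "_", "_"]),
    "\t".join(["4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
])

arg_counts = verb_argument_counts(SAMPLE)
print(arg_counts)
```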