UD English CHILDES
Language: English (code: en)
Family: IE
This treebank has been part of Universal Dependencies since the UD v2.16 release.
The following people have contributed to making this treebank part of UD: Xiulin Yang, Zhuoxuan Ju, Lanni Bu, Zoey Liu, Nathan Schneider.
Repository: UD_English-CHILDES
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.16
License: CC BY-SA 4.0
Genre: spoken
Questions, comments? General annotation questions (either English-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [xy236 (æt) georgetown • edu]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | assigned by a program, with some manual corrections, but not a full manual verification |
UPOS | assigned by a program, with some manual corrections, but not a full manual verification |
XPOS | assigned by a program, not checked manually |
Features | not available |
Relations | annotated manually, natively in UD style |
Description
This repository contains Universal Dependencies (UD) trees for utterances from child–adult spoken interactions in English, drawn from CHILDES transcripts.
This treebank builds on three existing treebanks (details under References). We compile, harmonize, and manually correct the major UD-style annotations of CHILDES data into a consistent, unified UD format, resulting in a gold-standard treebank of 48K sentences and 236K tokens.
Overall Statistics
Child | Corpus | Child Age Range (Corpus Age Range) | Gold Sents | Gold Toks |
---|---|---|---|---|
Laura | Braunwald | 1;3–7;0 (1;3–7;0) | 4,622 | 21,079 |
Adam | Brown | 1;6–5;2 (1;6–5;2) | 16,736 | 84,643 |
Eve | Brown | 1;6–5;1 (1;6–5;2) | 2,207 | 8,497 |
Abe | Kuczaj | 2;4–5;0 (2;4–5;0) | 4,167 | 22,437 |
Sarah | Brown | 1;6–5;2 (1;6–5;2) | 5,347 | 23,233 |
Lily | Providence | 0;11–4;0 (0;11–4;0) | 1,499 | 6,337 |
Naima | Providence | 1;3–3;11 (0;11–4;0) | 2,534 | 14,360 |
Violet | Providence | 0;11–4;0 (0;11–4;0) | 721 | 1,857 |
Thomas | Thomas | 2;0–4;11 (2;0–4;11) | 4,240 | 20,333 |
Emma | Weist | 2;2–4;10 (2;1–5;0) | 2,423 | 13,730 |
Roman | Weist | 2;2–4;9 (2;1–5;0) | 3,653 | 20,557 |
Overall | NA | NA | 48,183 | 236,941 |
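Ages in the table use the CHILDES `years;months` notation, so `1;6` means one year and six months. A minimal sketch of converting such strings to months, for example to sort or bin children by age. The helper names `age_to_months` and `range_to_months` are hypothetical and not part of any treebank tooling, and an optional `.days` suffix is assumed to be safe to ignore:

```python
from typing import Tuple


def age_to_months(age: str) -> int:
    """Convert a CHILDES-style age such as '1;6' or '2;4.15' to whole months."""
    years, _, rest = age.partition(";")
    # Drop an optional '.days' suffix; an empty months field counts as 0.
    months = rest.split(".")[0] or "0"
    return int(years) * 12 + int(months)


def range_to_months(age_range: str) -> Tuple[int, int]:
    """Split a range such as '1;6–5;2' on the en dash and convert both ends."""
    start, end = age_range.split("–")
    return age_to_months(start), age_to_months(end)


# Lily's age range as given in the table above.
lily = range_to_months("0;11–4;0")
```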
Train, dev, test split statistics
Split | Children | Corpus | Gold Sents |
---|---|---|---|
Train | Adam, Lily, Naima, Sarah, Roman, Laura, Abe | Brown, Providence, Weist, Kuczaj, Braunwald | 34,732 |
Dev | Adam, Lily, Naima, Sarah, Roman, Laura, Abe | Brown, Providence, Weist, Kuczaj, Braunwald | 3,860 |
Test | Eve, Violet, Emma, Thomas | Brown, Providence, Weist, Thomas | 9,591 |
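The split table implies that the test children do not appear in the train or dev data. The sketch below checks that reading against the figures listed on this page. The dictionaries are typed in by hand from the two tables and are not loaded from the treebank files, so treat it as an illustration of the check rather than a validation script:

```python
# Children per split, copied from the split table.
SPLIT_CHILDREN = {
    "train": {"Adam", "Lily", "Naima", "Sarah", "Roman", "Laura", "Abe"},
    "dev": {"Adam", "Lily", "Naima", "Sarah", "Roman", "Laura", "Abe"},
    "test": {"Eve", "Violet", "Emma", "Thomas"},
}

# Gold sentence counts per split, copied from the split table.
SPLIT_SENTS = {"train": 34732, "dev": 3860, "test": 9591}

# Gold sentence counts for the test children, copied from Overall Statistics.
TEST_CHILD_SENTS = {"Eve": 2207, "Violet": 721, "Emma": 2423, "Thomas": 4240}

# Test children are held out: none of them occurs in train or dev.
seen_in_training = SPLIT_CHILDREN["train"] | SPLIT_CHILDREN["dev"]
held_out = SPLIT_CHILDREN["test"].isdisjoint(seen_in_training)

# The per-child counts of the test children add up to the test split size.
test_total = sum(TEST_CHILD_SENTS.values())

# The three splits together cover the whole treebank.
all_sents = sum(SPLIT_SENTS.values())
```

Holding out whole children rather than random utterances means test scores reflect generalization to unseen speakers.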
Example
Acknowledgments
We acknowledge Ida Szubert, Omri Abend, Samuel Gibbon, Louis Mahon, Sharon Goldwater, Mark Steedman, and Emily Prud’hommeaux for their contributions to the original UD treebanking efforts. We also thank Brian MacWhinney for helpful discussions.
Statistics of UD English CHILDES
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Relations
acl – acl:relcl – advcl – advmod – amod – appos – aux – aux:pass – case – cc – cc:preconj – ccomp – compound – compound:prt – conj – cop – csubj – dep – det – det:predet – discourse – dislocated – expl – fixed – flat – goeswith – iobj – mark – nmod – nmod:poss – nmod:tmod – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:npmod – obl:tmod – obl:unmarked – orphan – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 36737 sentences, 213547 tokens and 226470 syntactic words.
- This corpus contains 36742 tokens (17%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 13 types of words that contain both letters and punctuation. Examples: 's, n't, 're, 'm, 'll, 've, 'd, o'clock, ones', Peters', Sophies', cats', it's
- This corpus contains 12905 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
- There are 425 types of multi-word tokens. Examples: don't, it's, that's, I'm, wanna, gonna, you're, what's, hafta, he's, didn't, can't, let's, I'll, there's, where's, they're, doesn't, we're, here's, she's, won't, isn't, we'll, who's, you've, aren't, haven't, needta, wasn't, you'll, I've, daddy's, what're, mommy's, dat's, one's, we've, gotta, sposta, baby's, hasta, Mummy's, I'd, useta, wouldn't, couldn't, it'll, you'd, where'd.
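The difference between tokens, syntactic words, and multi-word tokens follows from the CoNLL-U format: a multi-word token such as `don't` has a range ID (`2-3`) and is followed by its syntactic words (`do`, `n't`), and a token not followed by a space carries `SpaceAfter=No` in the MISC column. A minimal sketch of how such counts could be computed. The sample sentence is invented for illustration, and `count_units` is a hypothetical helper, not the script that produced the figures above:

```python
def count_units(conllu: str) -> dict:
    """Count surface tokens, syntactic words, multi-word tokens, and
    tokens marked SpaceAfter=No in a CoNLL-U string."""
    tokens = words = mwts = no_space = 0
    covered_until = 0  # last word ID covered by the current multi-word token
    for line in conllu.splitlines():
        if not line.strip():
            covered_until = 0  # sentence boundary: word IDs restart at 1
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        tok_id, misc = cols[0], cols[9]
        if "." in tok_id:
            continue  # empty node, neither a token nor a syntactic word
        if "-" in tok_id:
            # Range line: one surface token spanning several words.
            mwts += 1
            tokens += 1
            covered_until = int(tok_id.split("-")[1])
        else:
            words += 1
            if int(tok_id) <= covered_until:
                continue  # part of a multi-word token, already counted
            tokens += 1
        if "SpaceAfter=No" in misc.split("|"):
            no_space += 1
    return {"tokens": tokens, "words": words,
            "mwts": mwts, "no_space_after": no_space}


# Invented example: "I don't know."
ROWS = [
    "1 I I PRON _ _ 4 nsubj _ _",
    "2-3 don't _ _ _ _ _ _ _ _",
    "2 do do AUX _ _ 4 aux _ _",
    "3 n't not PART _ _ 4 advmod _ _",
    "4 know know VERB _ _ 0 root _ SpaceAfter=No",
    "5 . . PUNCT _ _ 4 punct _ _",
]
SAMPLE = "\n".join("\t".join(row.split(" ")) for row in ROWS)
result = count_units(SAMPLE)
```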
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 20 word types tagged as particles (PART): 's, Jwww, a, cha, da, dat, dey, dis, dra, haf, n, n't, na, not, s, sa, t, ta, to, wa
- This corpus contains 90 lemmas tagged as pronouns (PRON): a, all, anybody, anyone, anything, be, bumpa, cha, de, dem, dere, dese, dey, dis, dwww, everybody, everyone, everything, ge, ha, haf, he, hee, her, hers, herself, him, himself, his, hiss, i, it, its, itself, jwww, le, me, mine, my, myself, nobody, noone, nothing, one, our, ours, ourselves, rwww, sh, she, some, somebody, someone, something, swww, th, that, the, theirs, them, themselves, there, these, they, this, those, to, twww, u, uhhuh, uhoh, uhuh, uma, us, w, we, whaddya, what, whatever, which, who, whoever, whom, whose, y, ya, you, your, yours, yourself
- This corpus contains 42 lemmas tagged as determiners (DET): a, all, anoder, another, any, awoh, both, da, de, dere, dese, det, dey, dis, dose, each, every, h, half, le, ne, no, of, pe, quite, s, some, stoy, such, that, the, there, these, this, those, uhhuh, uhoh, um, what, which, whichever, yer
- Out of the above, 19 lemmas occurred sometimes as PRON and sometimes as DET: a, all, de, dere, dese, dey, dis, le, some, that, the, there, these, this, those, uhhuh, uhoh, what, which
- This corpus contains 15 lemmas tagged as auxiliaries (AUX): be, can, could, do, get, have, may, might, must, need, ought, shall, should, will, would
- Out of the above, 6 lemmas occurred sometimes as AUX and sometimes as VERB: be, do, get, have, need, will
- This corpus does not use the VerbForm feature.
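The overlap figures above (lemmas seen both as PRON and as DET, or both as AUX and as VERB) amount to set intersections over (lemma, UPOS) pairs. A small sketch under that assumption. The pairs below are a hand-picked toy sample, not treebank output, and `lemmas_by_upos` is a hypothetical helper:

```python
from collections import defaultdict


def lemmas_by_upos(pairs):
    """Group lemmas into one set per UPOS tag."""
    by_tag = defaultdict(set)
    for lemma, upos in pairs:
        by_tag[upos].add(lemma)
    return by_tag


# Toy (lemma, UPOS) pairs; in practice these would come from columns 3 and 4
# of each word line in the CoNLL-U files.
PAIRS = [
    ("that", "PRON"), ("that", "DET"), ("the", "DET"),
    ("he", "PRON"), ("be", "AUX"), ("be", "VERB"), ("can", "AUX"),
]

by_tag = lemmas_by_upos(PAIRS)
pron_and_det = sorted(by_tag["PRON"] & by_tag["DET"])
aux_and_verb = sorted(by_tag["AUX"] & by_tag["VERB"])
```

Lemmas such as `uhhuh` or `de` in the overlap list are a reminder that UPOS in this treebank was not fully verified by hand, so the intersection is also a quick way to find candidates for correction.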
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
- ExtPos
  - ADP
    - ADP: up
  - ADV
    - ADJ: Less, upside
    - ADP: of, out, because, at, in
    - ADV: how, as, instead, upside, kind, so, at, more, of, rather
    - NOUN: kind, upside, sort
    - PRON: what
- Typo
  - Yes
    - ADP: up
    - ADV: inside, may
    - INTJ: Uh, Mm, Ah, Wee, Whoo
    - NOUN: Beep, Bok, Night, P, Woof, u
    - PRON: Who, it
    - PROPN: pa, Arf, R, hm
    - VERB: let
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): be.
- This corpus uses 14 lemmas as auxiliaries (aux). Examples: do, be, can, will, have, would, could, should, might, must, may, shall, get, need.
- This corpus uses 4 lemmas as passive auxiliaries (aux:pass). Examples: be, get, have, will.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (1639)
  - VERB--NOUN-ADP('s) (3)
  - VERB--PRON (16574)
- obj
  - VERB--NOUN (6461)
  - VERB--NOUN-ADP(on) (1)
  - VERB--NOUN-ADP(up) (1)
  - VERB--PRON (6674)
  - VERB--PRON-ADP(about) (1)
  - VERB--PRON-ADP(for) (1)
- iobj
  - VERB--NOUN (25)
  - VERB--PRON (610)
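Counts like these can be approximated by looking up each dependent's head and comparing the two UPOS tags. The sketch below does this for one sentence. It is a simplified reading of the statistic that ignores the case-marker patterns such as `VERB--NOUN-ADP(on)`; the sentence is invented, and `core_argument_pairs` is a hypothetical helper:

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}


def core_argument_pairs(rows):
    """Count parent--child UPOS pairs for core arguments in one sentence.

    rows: CoNLL-U word lines, each already split into its 10 columns
    (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC).
    """
    upos_of = {row[0]: row[3] for row in rows}
    counts = Counter()
    for row in rows:
        child_upos, head, deprel = row[3], row[6], row[7]
        if (deprel in CORE_RELATIONS
                and child_upos in {"NOUN", "PRON"}
                and upos_of.get(head) == "VERB"):
            counts[(deprel, "VERB--" + child_upos)] += 1
    return counts


# Invented example: "I see dogs"
SENTENCE = [line.split() for line in [
    "1 I I PRON _ _ 2 nsubj _ _",
    "2 see see VERB _ _ 0 root _ _",
    "3 dogs dog NOUN _ _ 2 obj _ _",
]]
pairs = core_argument_pairs(SENTENCE)
```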
Relations Overview
- This corpus uses 12 relation subtypes: acl:relcl, aux:pass, cc:preconj, compound:prt, det:predet, nmod:poss, nmod:tmod, nsubj:outer, nsubj:pass, obl:npmod, obl:tmod, obl:unmarked
- The following 2 relation types are not used in this corpus at all: clf, list
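Both figures can be rederived from the relation list under Relations: subtypes are the labels containing a colon, and unused relation types are the universal relations whose base label never occurs. A short sketch, with the 37 universal relations of UD v2 typed in by hand:

```python
# Relations used in this corpus, copied from the Relations list on this page.
USED = """acl acl:relcl advcl advmod amod appos aux aux:pass case cc cc:preconj
ccomp compound compound:prt conj cop csubj dep det det:predet discourse
dislocated expl fixed flat goeswith iobj mark nmod nmod:poss nmod:tmod nsubj
nsubj:outer nsubj:pass nummod obj obl obl:npmod obl:tmod obl:unmarked orphan
parataxis punct reparandum root vocative xcomp""".split()

# The 37 universal dependency relations defined in UD v2.
UNIVERSAL = """acl advcl advmod amod appos aux case cc ccomp clf compound conj
cop csubj dep det discourse dislocated expl fixed flat goeswith iobj list mark
nmod nsubj nummod obj obl orphan parataxis punct reparandum root vocative
xcomp""".split()

# Subtypes are written as base:subtype.
subtypes = sorted(rel for rel in USED if ":" in rel)

# A universal relation counts as used if it occurs bare or as a subtype's base.
used_bases = {rel.split(":")[0] for rel in USED}
unused = sorted(set(UNIVERSAL) - used_bases)
```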