
This page pertains to UD version 2.

UD English CHILDES

Language: English (code: en)
Family: Indo-European (IE)

This treebank has been part of Universal Dependencies since the UD v2.16 release.

The following people have contributed to making this treebank part of UD: Xiulin Yang, Zhuoxuan Ju, Lanni Bu, Zoey Liu, Nathan Schneider.

Repository: UD_English-CHILDES
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.18

License: CC BY-SA 4.0

Genre: spoken

Questions, comments? General annotation questions (either English-specific or cross-linguistic) can be raised in the main UD issue tracker. Bugs in this treebank can be reported in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [xy236 (æt) georgetown • edu]. The treebank is developed directly in the UD repository, so bug fixes can be submitted as pull requests against the dev branch.

Annotation	Source
Lemmas	assigned by a program, with some manual corrections, but not a full manual verification
UPOS	assigned by a program, with some manual corrections, but not a full manual verification
XPOS	assigned by a program, not checked manually
Features	not available
Relations	annotated manually, natively in UD style

Description

This repository contains Universal Dependencies (UD) trees for utterances from child–adult spoken interactions in English, drawn from CHILDES transcripts.

This treebank builds on three existing treebanks (details under References). We compile, harmonize, and manually correct major UD-style annotations of CHILDES data into a consistent, unified UD format, yielding a gold-standard treebank of 48K sentences and 236K tokens.

Overall Statistics

Child	Corpus	Child Age Range	Gold Sents	Gold Toks
Laura	Braunwald	1;3–7;0 (1;3–7;0)	4,622	21,079
Adam	Brown	1;6–5;2 (1;6–5;2)	16,770	84,643
Eve	Brown	1;6–5;1 (1;6–5;2)	2,207	8,497
Abe	Kuczaj	2;4–5;0 (2;4–5;0)	4,167	22,437
Sarah	Brown	1;6–5;2 (1;6–5;2)	5,347	23,233
Lily	Providence	0;11–4;0 (0;11–4;0)	1,499	6,337
Naima	Providence	1;3–3;11 (0;11–4;0)	2,534	14,360
Violet	Providence	0;11–4;0 (0;11–4;0)	721	1,857
Thomas	Thomas	2;0–4;11 (2;0–4;11)	4,240	20,333
Emma	Weist	2;2–4;10 (2;1–5;0)	2,423	13,730
Roman	Weist	2;2–4;9 (2;1–5;0)	3,653	20,557
Overall	NA	NA	48,183	236,941
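
The age ranges in the table follow the CHILDES years;months notation, so 1;3 means one year and three months. As a minimal sketch (the function names are hypothetical and not part of the treebank tooling), such ranges can be converted to months for comparison across children:

```python
def age_to_months(age):
    """Convert a 'years;months' string such as '1;3' to a number of months."""
    years, months = age.split(";")
    return int(years) * 12 + int(months)


def range_in_months(age_range):
    """Convert a range such as '1;3–7;0' (en dash separator) to (start, end) in months."""
    start, end = age_range.split("–")
    return age_to_months(start), age_to_months(end)


print(range_in_months("1;3–7;0"))   # (15, 84)
print(range_in_months("0;11–4;0"))  # (11, 48)
```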

Train, dev, test split statistics

Split	Children	Corpus	Gold Sents
Train	Adam, Lily, Naima, Sarah, Roman, Laura, Abe	Brown, Providence, Weist, Kuczaj, Braunwald	34,732
Dev	Adam, Lily, Naima, Sarah, Roman, Laura, Abe	Brown, Providence, Weist, Kuczaj, Braunwald	3,860
Test	Eve, Violet, Emma, Thomas	Brown, Providence, Weist, Thomas	9,591
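
The two tables above can be cross-checked with a few lines of arithmetic. The sketch below copies the sentence counts from the tables (the variable names are illustrative) and checks that the splits add up to the overall total and that no child appears in both the train/dev and test portions:

```python
# Gold sentence counts per child, copied from the Overall Statistics table.
sents = {
    "Laura": 4622, "Adam": 16770, "Eve": 2207, "Abe": 4167,
    "Sarah": 5347, "Lily": 1499, "Naima": 2534, "Violet": 721,
    "Thomas": 4240, "Emma": 2423, "Roman": 3653,
}

# Split sizes and child assignment, copied from the split table.
split_sents = {"train": 34732, "dev": 3860, "test": 9591}
train_dev_children = {"Adam", "Lily", "Naima", "Sarah", "Roman", "Laura", "Abe"}
test_children = {"Eve", "Violet", "Emma", "Thomas"}

total = sum(sents.values())
test_total = sum(sents[c] for c in test_children)
train_dev_total = sum(sents[c] for c in train_dev_children)

print(total == sum(split_sents.values()))                            # True
print(test_total == split_sents["test"])                             # True
print(train_dev_total == split_sents["train"] + split_sents["dev"])  # True
print(train_dev_children.isdisjoint(test_children))                  # True
```

Train and dev share the same seven children, while the test set consists of four children who appear in neither.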

Example

Acknowledgments

We acknowledge Ida Szubert, Omri Abend, Samuel Gibbon, Louis Mahon, Sharon Goldwater, Mark Steedman, and Emily Prud’hommeaux for their contributions to the original UD treebanking efforts. We also thank Brian MacWhinney for helpful discussions.

Statistics of UD English CHILDES

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X

Features

ExtPos – Typo

Relations

acl – acl:relcl – advcl – advmod – amod – appos – aux – aux:pass – case – cc – cc:preconj – ccomp – compound – compound:prt – conj – cop – csubj – csubj:pass – dep – det – det:predet – discourse – dislocated – expl – fixed – flat – goeswith – iobj – list – mark – nmod – nmod:npmod – nmod:poss – nmod:unmarked – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:npmod – obl:unmarked – orphan – parataxis – punct – reparandum – root – vocative – xcomp

Tokenization and Word Segmentation

This corpus contains 48183 sentences, 289817 tokens and 302740 syntactic words.

This corpus contains 48188 tokens (17%) that are not followed by a space.

This corpus does not contain words with spaces.

This corpus contains 21 types of words that contain both letters and punctuation. Examples: 's, n't, 're, 't, 'm, 'll, 've, 'd, o'clock, all_gone, dum-dum, my_goodness, night-night, cock-a-doodle-doo, ones', Peters', Sophies', Ups-a-daisy, cats', dark_time, it's

This corpus contains 12905 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
There are 425 types of multi-word tokens. Examples: don't, it's, that's, I'm, wanna, gonna, you're, what's, hafta, he's, didn't, can't, let's, I'll, there's, where's, they're, doesn't, we're, here's, she's, won't, isn't, we'll, who's, you've, aren't, haven't, needta, wasn't, you'll, I've, daddy's, what're, mommy's, dat's, one's, we've, gotta, sposta, baby's, hasta, Mummy's, I'd, useta, wouldn't, couldn't, it'll, you'd, where'd.
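
The distinction between tokens, syntactic words, and multi-word tokens follows from the CoNLL-U format: a multi-word token occupies a range line (ID such as `2-3`) followed by one line per syntactic word. The sketch below counts these units in a CoNLL-U string. The sample sentence is invented for illustration and is not taken from the treebank, and the counting rules are an assumption about how such statistics can be derived, not the script that generated this page.

```python
# Invented sample sentence in CoNLL-U (10 tab-separated columns per line).
SAMPLE = "\n".join([
    "# text = I don't know.",
    "1\tI\tI\tPRON\t_\t_\t4\tnsubj\t_\t_",
    "2-3\tdon't\t_\t_\t_\t_\t_\t_\t_\t_",
    "2\tdo\tdo\tAUX\t_\t_\t4\taux\t_\t_",
    "3\tn't\tnot\tPART\t_\t_\t4\tadvmod\t_\t_",
    "4\tknow\tknow\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "5\t.\t.\tPUNCT\t_\t_\t4\tpunct\t_\t_",
    "",
])


def count_units(conllu):
    """Count sentences, surface tokens, syntactic words, and multi-word tokens."""
    counts = {"sentences": 0, "tokens": 0, "words": 0, "mwts": 0, "no_space": 0}
    covered_until = 0   # last word ID covered by the current multi-word token
    in_sentence = False
    for line in conllu.splitlines():
        if not line.strip():          # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        if not in_sentence:
            counts["sentences"] += 1
            in_sentence = True
        cols = line.split("\t")
        tok_id, misc = cols[0], cols[9]
        if "." in tok_id:             # empty node, not a token or word
            continue
        if "-" in tok_id:             # range line of a multi-word token
            covered_until = int(tok_id.split("-")[1])
            counts["mwts"] += 1
            counts["tokens"] += 1
            is_token = True
        else:
            counts["words"] += 1
            # A word inside a multi-word token is not a separate surface token.
            is_token = int(tok_id) > covered_until
            if is_token:
                counts["tokens"] += 1
        if is_token and "SpaceAfter=No" in misc.split("|"):
            counts["no_space"] += 1
    return counts


print(count_units(SAMPLE))
```

The page's own figures are consistent with this relationship: 302740 words minus 289817 tokens leaves 12923 extra words spread over 12905 multi-word tokens, which gives an average of about 2.00 words per multi-word token.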

Morphology

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

ExtPos
- ADP
  - ADP: out, because, of, up
  - ADV: instead, what
  - SCONJ: Because
- ADV
  - ADJ: more
  - ADP: of, at, As, because, in, out
  - ADV: how, as, each, kind, at, instead, of, upside, 'd, Less
  - NOUN: upside, kind, sort
  - PRON: what
- PRON
  - PRON: what
- SCONJ
  - ADP: in
  - ADV: So
  - SCONJ: so, as

Typo
- Yes
  - ADP: up
  - ADV: inside, may
  - INTJ: Uh, Mm, Ah, Wee, Whoo
  - NOUN: Beep, Bok, Night, P, Woof, u
  - PRON: Who, it
  - PROPN: pa, Arf, R, hm
  - VERB: let
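
The listings above appear to be grouped by feature, then value, then UPOS tag, with the word forms that carry the feature. As a hedged sketch of how such a grouping could be built from the FEATS column of CoNLL-U (`Name=Value` pairs separated by `|`, or `_` when empty), with hypothetical helper names and invented sample rows:

```python
from collections import defaultdict


def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'ExtPos=ADP|Typo=Yes' into a dict."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def group_by_feature(rows):
    """Group (form, upos, feats) rows as {feature: {value: {upos: [forms]}}}."""
    table = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for form, upos, feats in rows:
        for name, value in parse_feats(feats).items():
            forms = table[name][value][upos]
            if form not in forms:     # list each form once, in first-seen order
                forms.append(form)
    return table


# Invented rows for illustration, not extracted from the treebank.
rows = [
    ("because", "ADP", "ExtPos=ADP"),
    ("of", "ADP", "_"),
    ("kind", "NOUN", "ExtPos=ADV"),
    ("up", "ADP", "Typo=Yes"),
]
table = group_by_feature(rows)
print(table["ExtPos"]["ADP"]["ADP"])  # ['because']
print(table["Typo"]["Yes"]["ADP"])    # ['up']
```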

Syntax

Auxiliary Verbs and Copula

This corpus uses 1 lemma as a copula (cop). Example: be.

This corpus uses 14 lemmas as auxiliaries (aux). Examples: do, be, can, will, have, would, could, should, might, may, must, shall, get, need.
This corpus uses 6 lemmas as passive auxiliaries (aux:pass). Examples: be, get, have, do, might, will.
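
Lemma inventories like the ones above can be collected by grouping the LEMMA column by the DEPREL column. The sketch below assumes CoNLL-U token lines as input. The function name and the sample lines are illustrative and not drawn from the treebank.

```python
from collections import defaultdict


def lemmas_by_relation(token_lines, relations=("cop", "aux", "aux:pass")):
    """Collect the sorted set of lemmas attached via each relation of interest."""
    found = defaultdict(set)
    for line in token_lines:
        cols = line.split("\t")
        lemma, deprel = cols[2], cols[7]   # LEMMA and DEPREL columns
        if deprel in relations:
            found[deprel].add(lemma)
    return {rel: sorted(found[rel]) for rel in relations}


# Invented token lines for illustration.
lines = [
    "2\tis\tbe\tAUX\t_\t_\t3\tcop\t_\t_",
    "2\tdo\tdo\tAUX\t_\t_\t4\taux\t_\t_",
    "2\twas\tbe\tAUX\t_\t_\t3\taux\t_\t_",
    "2\tgot\tget\tAUX\t_\t_\t3\taux:pass\t_\t_",
    "1\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_",
]
print(lemmas_by_relation(lines))
# {'cop': ['be'], 'aux': ['be', 'do'], 'aux:pass': ['get']}
```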

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN (2084)
- VERB--NOUN-ADP('s) (3)
- VERB--NOUN-ADP(on) (1)
- VERB--PRON (22904)
- VERB--PRON-ADP(for) (1)

obj
- VERB--NOUN (8738)
- VERB--NOUN-ADP(at) (3)
- VERB--NOUN-ADP(follow) (1)
- VERB--NOUN-ADP(in) (3)
- VERB--NOUN-ADP(like) (4)
- VERB--NOUN-ADP(of) (5)
- VERB--NOUN-ADP(on) (5)
- VERB--NOUN-ADP(through) (1)
- VERB--NOUN-ADP(to) (4)
- VERB--NOUN-ADP(up) (1)
- VERB--NOUN-ADP(with) (4)
- VERB--PRON (9380)
- VERB--PRON-ADP(about) (2)
- VERB--PRON-ADP(at) (3)
- VERB--PRON-ADP(for) (4)
- VERB--PRON-ADP(from) (1)
- VERB--PRON-ADP(in) (1)
- VERB--PRON-ADP(like) (6)
- VERB--PRON-ADP(of) (1)
- VERB--PRON-ADP(on) (3)
- VERB--PRON-ADP(with) (7)

iobj
- VERB--NOUN (33)
- VERB--PRON (776)
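
One plausible reading of the pattern labels above is parent UPOS, child UPOS, and, where present, an ADP dependent of the child with its lemma in parentheses. Under that assumption, the counts could be reproduced roughly as follows. The function name, the dictionary layout, and the sample sentence are all hypothetical.

```python
from collections import Counter

CORE_RELATIONS = ("nsubj", "obj", "iobj")


def argument_patterns(sentence):
    """Count (relation, pattern) pairs for VERB parents with NOUN/PRON children."""
    by_id = {w["id"]: w for w in sentence}
    patterns = Counter()
    for w in sentence:
        head = by_id.get(w["head"])
        if w["deprel"] not in CORE_RELATIONS:
            continue
        if head is None or head["upos"] != "VERB":
            continue
        if w["upos"] not in ("NOUN", "PRON"):
            continue
        label = "VERB--" + w["upos"]
        # Assumption: the ADP suffix names the first adposition attached to the child.
        adps = [c["lemma"] for c in sentence
                if c["head"] == w["id"] and c["upos"] == "ADP"]
        if adps:
            label += "-ADP(" + adps[0] + ")"
        patterns[(w["deprel"], label)] += 1
    return patterns


# Invented sentence "I gave her cookies", for illustration only.
sentence = [
    {"id": 1, "lemma": "I", "upos": "PRON", "head": 2, "deprel": "nsubj"},
    {"id": 2, "lemma": "give", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 3, "lemma": "she", "upos": "PRON", "head": 2, "deprel": "iobj"},
    {"id": 4, "lemma": "cookie", "upos": "NOUN", "head": 2, "deprel": "obj"},
]
for key, count in sorted(argument_patterns(sentence).items()):
    print(key, count)
```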

Relations Overview

This corpus uses 13 relation subtypes: acl:relcl, aux:pass, cc:preconj, compound:prt, csubj:pass, det:predet, nmod:npmod, nmod:poss, nmod:unmarked, nsubj:outer, nsubj:pass, obl:npmod, obl:unmarked.
The following relation type is not used in this corpus at all: clf.
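
Both statements can be derived from the relation list in the Relations section. The sketch below copies that list, treats any label containing a colon as a subtype, and compares the base labels against the 37 universal relations of UD v2 (the variable names are illustrative):

```python
# The 37 universal dependency relations defined by UD v2.
UNIVERSAL = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relation labels used in this treebank, copied from the Relations section.
USED = """
acl acl:relcl advcl advmod amod appos aux aux:pass case cc cc:preconj ccomp
compound compound:prt conj cop csubj csubj:pass dep det det:predet discourse
dislocated expl fixed flat goeswith iobj list mark nmod nmod:npmod nmod:poss
nmod:unmarked nsubj nsubj:outer nsubj:pass nummod obj obl obl:npmod
obl:unmarked orphan parataxis punct reparandum root vocative xcomp
""".split()

# A subtype is a label of the form base:subtype.
subtypes = sorted(rel for rel in USED if ":" in rel)
# A universal relation counts as used if it occurs bare or as the base of a subtype.
used_bases = {rel.split(":")[0] for rel in USED}
unused = sorted(UNIVERSAL - used_bases)

print(len(subtypes))  # 13
print(unused)         # ['clf']
```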