UD Chinese CFL
Language: Chinese (code: zh)
Family: Sino-Tibetan
This treebank has been part of Universal Dependencies since the UD v2.1 release.
The following people have contributed to making this treebank part of UD: John Lee, Herman Leung, Keying Li.
Repository: UD_Chinese-CFL
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.14
License: CC BY-SA 4.0
Genre: learner-essays
Questions, comments? General annotation questions (either Chinese-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [keyingli3-c (æt) my • cityu • edu • hk, tswong-c (æt) my • cityu • edu • hk, jsylee (æt) cityu • edu • hk]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | assigned by a program, with some manual corrections, but not a full manual verification |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | assigned by a program, with some manual corrections, but not a full manual verification |
Relations | annotated manually, natively in UD style |
Description
The Chinese-CFL UD treebank was manually annotated by Keying Li, with minor manual revisions by Herman Leung and John Lee, at City University of Hong Kong. It is based on essays written by learners of Mandarin Chinese as a foreign language. The data is in Simplified Chinese.
.CONLLUX (extension files)
[NOTE: This is a temporary measure for procedures whose descriptions are not yet available in the UD guidelines.]
Included is an additional .conllux file for the .conllu file of the same name. The .conllux counterpart file contains extra information not ordinarily stored in any of the 10 columns of the CoNLL-U format. The non-duplicate columns in .conllux for this treebank are columns 3 (distributional tag), 6 (distributional head), 7 (distributional relation), and 10 (alignment). [If the data in columns 3, 6, and 7 of the .conllux file are the same as their counterparts in .conllu, the distributional annotation is the same as the morphological annotation. For more information on “distributional” vs. “morphological” annotation, see the descriptions further below.]
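The comparison described above can be sketched in Python as follows. This is a minimal sketch, not part of any UD tooling: the function names token_rows and diff_columns are hypothetical, both files are assumed to contain the same tokens in the same order, and the default column positions treat 3, 6, and 7 as 0-based indices into the tab-split line (UPOS, HEAD, and DEPREL in standard CoNLL-U). That indexing is an assumption and should be checked against the actual files.

```python
def token_rows(text):
    """Yield tab-split token rows, skipping comment and blank lines."""
    for line in text.splitlines():
        if line.strip() and not line.startswith("#"):
            yield line.split("\t")


def diff_columns(conllu_text, conllux_text, columns=(3, 6, 7)):
    """Return (token_id, form, column, conllu_value, conllux_value) tuples
    for every cell in which the two files disagree.

    Assumption: `columns` holds 0-based positions in the tab-split row,
    and both texts contain the same tokens in the same order.
    """
    diffs = []
    for base, ext in zip(token_rows(conllu_text), token_rows(conllux_text)):
        for col in columns:
            if base[col] != ext[col]:
                diffs.append((base[0], base[1], col, base[col], ext[col]))
    return diffs
```

An empty result means the distributional annotation coincides with the morphological annotation for every token. Token IDs restart in each sentence, so a fuller version would also record the sent_id.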
ALIGNMENTS
Alignments are linked to native-Chinese-speaker corrections (by Keying Li) of the learner sentences; the storage location of the corrected sentences is yet to be determined. All sentences pertaining to the learner corpus have a sent_id beginning with CFL-; original learner sentences have the parallel-treebank extension /ori in the sent_id, whereas the corrected sentences have the extension /crr. Each alignment consists of the full sent_id, followed by ‘#’ and the index of the aligned token. In a one-to-many alignment, the individual alignments are separated by commas (e.g. CFL_A_1-5/crr#5,CFL_A_1-5/crr#6 means the token is aligned to tokens 5 and 6 of the corrected (‘crr’) sentence of ‘CFL_A_1-5’).
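The alignment format described above is simple enough to parse with a few lines of Python. The following is a minimal sketch; the function name parse_alignment is hypothetical, and treating ‘_’ or an empty string as “no alignment” is an assumption about how unaligned tokens are marked.

```python
def parse_alignment(value):
    """Split an alignment string into (sent_id, token_index) pairs.

    Assumption: an unaligned token carries "_" or an empty string.
    """
    if value in ("", "_"):
        return []
    pairs = []
    for item in value.split(","):
        # The token index follows the last '#' in each item.
        sent_id, _, index = item.strip().rpartition("#")
        pairs.append((sent_id, int(index)))
    return pairs


print(parse_alignment("CFL_A_1-5/crr#5,CFL_A_1-5/crr#6"))
# [('CFL_A_1-5/crr', 5), ('CFL_A_1-5/crr', 6)]
```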
BASIC STATISTICS
- Tree count: 451
- Word count: 7256
- Token count: 7256
- Dep. relations: 45, of which 13 language-specific
- POS tags: 15
- Category=value feature pairs: 0
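Counts of this kind can be recomputed from a .conllu file. The sketch below is illustrative only (the function name basic_counts is hypothetical) and follows the standard CoNLL-U conventions: an integer ID is a word, a range ID such as 1-2 is a multiword token covering several words, and a decimal ID such as 1.1 is an empty node that counts as neither.

```python
def basic_counts(conllu_text):
    """Count trees, syntactic words, and surface tokens in CoNLL-U text."""
    trees = words = tokens = 0
    covered_until = 0    # last word ID covered by the current multiword token
    in_sentence = False
    for line in conllu_text.splitlines():
        line = line.strip()
        if not line:                 # a blank line ends the sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):     # comment line, e.g. sent_id or text
            continue
        if not in_sentence:
            trees += 1
            in_sentence = True
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:          # multiword token range, e.g. "1-2"
            tokens += 1
            covered_until = int(token_id.split("-")[1])
        elif "." in token_id:        # empty node, e.g. "1.1"
            continue
        else:
            words += 1
            if int(token_id) > covered_until:
                tokens += 1
    return {"trees": trees, "words": words, "tokens": tokens}
```

For a treebank without multiword tokens, the word and token counts are identical, as they are in the figures above.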
GENERAL COMMENTS
A “literal annotation” is preferred, i.e., one should annotate “as if the sentence were as syntactically well-formed as it can be, possibly ignoring meaning” (Ragheb and Dickinson, 2014).
WORD SEGMENTATION
Non-words are allowed only when there are spelling errors resulting from orthographic or phonetic confusion. An orthographic confusion must involve characters with similar appearance, e.g., between 了 and 子 in *花花公了.
Phonetic confusion must involve characters with the same pronunciation but different tones, e.g., between 關 and 管 in the sentence *不關多貴我也買; or characters whose pronunciations differ in an easily confusable pair, such as {j, zh} or {x, sh}.
In these cases, the lemma of the misspelt word is its corrected version. For example, the lemma of *花花公了 is 花花公子, and the lemma of 不關 is 不管.
LEMMA
The lemma is the same as the word, except when the word contains a spelling error.
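Because the lemma differs from the word only when there is a spelling error, misspelt words can be listed by comparing the FORM and LEMMA columns (the second and third columns of standard CoNLL-U). The following is a minimal sketch; the function name misspelt_tokens is hypothetical.

```python
def misspelt_tokens(conllu_text):
    """Return (form, lemma) pairs for tokens whose FORM and LEMMA differ."""
    pairs = []
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        form, lemma = cols[1], cols[2]
        # Rows without a lemma ("_") are skipped rather than reported.
        if lemma != "_" and form != lemma:
            pairs.append((form, lemma))
    return pairs
```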
POS TAGGING
POS tagging is performed on the basis of the lemma, rather than the word. Hence, in the sentence *不關多貴我也買, 不關 is not tagged as VERB but rather as SCONJ, on account of its lemma 不管.
When determining the POS, one usually considers both the “morphological evidence”, i.e., the linguistic form of the word, as well as the “distributional evidence”, i.e., its syntactic use in the sentence. In a well-formed sentence, these two kinds of evidence should agree; in learner text, however, they may conflict (Ragheb and Dickinson, 2014).
Consider the word 可怕 kepa “scary” in the sentence *我可怕他 “*I scary him”. Morphological evidence suggests the word 可怕 kepa “scary” should be tagged as an adjective (ADJ), reflecting its normal usage. Distributional evidence suggests it should be tagged as a verb, since the trailing pronoun 他 ta “him” implies its use as a verb with a direct object.
When these two kinds of evidence contradict one another, the morphological evidence prevails. The example sentence is thus tagged as:
我/PRON 可怕/ADJ 他/PRON
However, we also include the “distributional POS tag” in column 3 of the .conllux file.
DEPENDENCY RELATIONS
Missing words
When a word seems to be missing in the learner sentence, we annotate according to the UD guidelines on promotion by head elision. For example, in the sentence fragment 在中國最近幾年 zai zhongguo zuijin ji nian “in China recent few years”, we promote 年 nian “year” to be the root. Although both 中國 zhongguo “China” and 年 nian “year” would be obl dependents if a verb were present, 年 nian “year” is promoted because it is closer to the expected location of the verb.
Word-order errors
The annotation should assume no word-order error. Consider, for example, the sentence *我被了他打一頓. The aspect particle 了 le usually modifies the verb that immediately precedes it, and is probably misplaced in this sentence. It is most likely intended to modify 打 da “hit”, and should immediately follow 打 da rather than 被 bei, the passive marker.
To adhere to the principle of “literal annotation”, rather than annotating 了 le as the child of 打 da “hit” with the aux relation, we annotate 了 le as the child of 被 bei with the dep relation.
dep (unspecified dependency)
When learner errors make it difficult to characterize the grammatical relation between a word and the rest of the sentence, we use the dep relation. Typically, when the POS tag differs from the distributional POS tag, the dep relation is needed.
Consider the sentence *我可怕他 “*I scary him”. From the point of view of its POS tag, it is unclear how the word 可怕 kepa “scary”, as an adjective, relates to the pronoun. We thus consider 可怕 kepa as the head of 他 ta “him” with the dep relation.
When a word has a different distributional POS tag, we also include a “distributional” dependency relation on the basis of the word’s distributional POS tag. This relation is stored in column 7 of the .conllux file. In the example sentence above, the word 可怕 kepa “scary”, as a verb, is the head of 他 ta “him” with the obj relation.
REFERENCES
Marwa Ragheb and Markus Dickinson. 2014. Developing a Corpus of Syntactically-annotated Learner Language for English. Proceedings of the 13th International Workshop on Treebanks and Linguistic Theories (TLT).
Acknowledgments
This work is partially supported by a Strategic Research Grant (Project no. 7004494) from City University of Hong Kong.
Statistics of UD Chinese CFL
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB
Features
Relations
acl – advcl – advmod – advmod:df – amod – appos – aux – case – case:loc – cc – ccomp – clf – compound – compound:dir – compound:ext – compound:vo – compound:vv – conj – cop – csubj – dep – det – discourse – discourse:sp – dislocated – flat – iobj – mark – mark:adv – mark:rel – nmod – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:agent – obl:patient – obl:tmod – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 451 sentences and 7256 tokens.
- This corpus contains 7256 tokens (100%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus does not contain words that contain both letters and punctuation.
Morphology
Tags
- This corpus uses 15 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB
- This corpus does not use the following tags: SYM, X
- This corpus contains 11 word types tagged as particles (PART): 了, 吗, 吧, 呢, 和, 啊, 嗬, 地, 得, 没, 的
- This corpus contains 42 lemmas tagged as pronouns (PRON): 一切, 为什么, 人, 人家, 什么, 他, 你, 你们, 其, 其中, 其他, 别, 别人, 到处, 另外, 各, 咱们, 哪个, 哪儿, 哪里, 大家, 女士们, 她, 它, 怎么样, 我, 我们, 我门, 所有, 有的, 每, 自己, 谁, 这, 这儿, 这样, 这里, 那, 那儿, 那样, 那里, 首先
- This corpus contains 30 lemmas tagged as determiners (DET): 一些, 一点, 个, 什么, 以上, 几, 别的, 前, 另, 各, 哪, 哪个, 很多, 所有, 整, 有些, 有的, 本, 每, 许多, 许许多多, 这, 这些, 这样, 这里, 那, 那个, 那些, 那样, 那里
- Out of the above, 12 lemmas occurred sometimes as PRON and sometimes as DET: 什么, 各, 哪个, 所有, 有的, 每, 这, 这样, 这里, 那, 那样, 那里
- This corpus contains 17 lemmas tagged as auxiliaries (AUX): 了, 会, 可以, 可能, 应该, 得, 必须, 想, 愿意, 敢, 是, 有, 着, 能, 要, 过, 需要
- Out of the above, 9 lemmas occurred sometimes as AUX and sometimes as VERB: 了, 会, 得, 想, 是, 有, 要, 过, 需要
- This corpus does not use the VerbForm feature.
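Overlap figures such as the lemmas that occur both as PRON and as DET can be derived by grouping lemmas by UPOS tag and intersecting the resulting sets. The following is a minimal sketch under standard CoNLL-U column conventions; the function names lemmas_by_upos and shared_lemmas are hypothetical.

```python
from collections import defaultdict


def lemmas_by_upos(conllu_text):
    """Map each UPOS tag to the set of lemmas that carry it."""
    by_tag = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep ordinary word rows only (integer ID); this skips comments,
        # blank lines, multiword-token ranges, and empty nodes.
        if cols[0].isdigit():
            by_tag[cols[3]].add(cols[2])
    return by_tag


def shared_lemmas(conllu_text, tag_a, tag_b):
    """Return the sorted lemmas that occur with both tags."""
    by_tag = lemmas_by_upos(conllu_text)
    return sorted(by_tag[tag_a] & by_tag[tag_b])
```

Calling shared_lemmas(text, "PRON", "DET") or shared_lemmas(text, "AUX", "VERB") on the treebank file would give lists comparable to the ones above.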
Nominal Features
Degree and Polarity
- Neg
- AUX: 没有, 不得
- PART: 没
- VERB: 没有, 不好意思, 不见, 不通, 不顾
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): 是.
- This corpus uses 16 lemmas as auxiliaries (aux). Examples: 了, 着, 要, 会, 能, 想, 过, 可以, 应该, 得, 敢, 需要, 可能, 有, 必须, 愿意.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (126)
- VERB--NOUN-ADP(在) (1)
- VERB--PRON (393)
- obj
- VERB--NOUN (394)
- VERB--PRON (91)
- iobj
- VERB--NOUN (1)
- VERB--PRON (5)
- VERB--PRON-ADP(给) (1)
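Counts of parent-child patterns like those above could be gathered along the following lines. This is only a sketch under stated assumptions, not the procedure used to produce these statistics: the function name argument_patterns is hypothetical, relation labels are matched exactly (so nsubj:pass is not counted under nsubj), and patterns involving an adposition, such as VERB--NOUN-ADP(在), are not reproduced.

```python
from collections import Counter


def argument_patterns(conllu_text, relations=("nsubj", "obj", "iobj")):
    """Count (relation, 'VERB--CHILD') patterns where a VERB governs a
    NOUN or PRON through one of the given relations."""
    counts = Counter()
    for block in conllu_text.strip().split("\n\n"):
        rows = {}
        for line in block.splitlines():
            cols = line.split("\t")
            if cols[0].isdigit():        # ordinary word rows only
                rows[cols[0]] = cols
        for cols in rows.values():
            head = rows.get(cols[6])     # HEAD column; "0" has no row
            if (head is not None and cols[7] in relations
                    and head[3] == "VERB" and cols[3] in ("NOUN", "PRON")):
                counts[(cols[7], "VERB--" + cols[3])] += 1
    return counts
```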
Relations Overview
- This corpus uses 14 relation subtypes: advmod:df, case:loc, compound:dir, compound:ext, compound:vo, compound:vv, discourse:sp, mark:adv, mark:rel, nsubj:outer, nsubj:pass, obl:agent, obl:patient, obl:tmod
- The following 5 relation types are not used in this corpus at all: expl, fixed, list, orphan, goeswith