UD for Chinese
Tokenization and Word Segmentation
- Because Chinese is written without whitespace between words, most Chinese treebanks are pre-tokenized, either manually or automatically.
See more details on Chinese tokenization.
Morphology
This is an overview only. For more detailed discussion and examples, see the list of Chinese POS tags and details for each treebank in Chinese morphology.
Tags
- Chinese uses all 17 universal POS categories, including particles (PART).
- The following words are particles in Chinese:
- the genitive/associative/relativizer/nominalizer marker 的 de
- 得 de in V-得 extent/descriptive constructions (see compound:ext)
- the manner adverbializer 地 de
- the “et cetera” markers 等 děng
- sentence-final particles (句末助詞 jùmò zhùcí)
- and the object relativizer 所 suǒ.
- Possessive determiners are tagged as determiners (DET) rather than pronouns (PRON):
- 其 qí “his/her/its; their”
- 本 běn “our”
- Chinese auxiliary verbs (AUX) are:
- Modal and modal-like auxiliaries, which are preverbal and do not have to be adjacent to the verb
- They can be pre-modified by the negator 不 bù, but they cannot take an object or be post-modified by an aspect marker.
- 能 néng “be able to”
- 會 huì “will”
- 可以 kěyǐ “can”
- 應該 yīnggāi “should”
- 肯 kěn “be willing to”
- 敢 gǎn “dare”
- 有 yǒu (perfective)
- 沒有 méiyǒu (negative perfective)
- Note that certain modal auxiliaries can also function as main verbs, usually when they have a direct object or full clausal complement
- Copulae, in the narrow sense of pure linking words for nonverbal predication
- 是 shì
- 為 wéi
- Aspect markers, which must come immediately after the verb
- 了 le (perfective)
- 著/着 zhe (durative)
- 過 guo (experiential)
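The auxiliary classes listed above can be summarized as a small lookup table. This is only a minimal sketch: the names `AUX_CLASSES` and `aux_class` are hypothetical and not part of any UD tool, and the word lists simply copy the examples given above. Because some modal forms can also be main verbs, a real tagger cannot decide on the word form alone.

```python
# Illustrative lookup of the AUX classes listed above.
# AUX_CLASSES and aux_class are hypothetical names, not part of any UD tool.
AUX_CLASSES = {
    "modal": ["能", "會", "可以", "應該", "肯", "敢", "有", "沒有"],
    "copula": ["是", "為"],
    "aspect": ["了", "著", "着", "過"],
}


def aux_class(form):
    """Return the auxiliary class of a word form, or None if it is not listed."""
    for name, forms in AUX_CLASSES.items():
        if form in forms:
            return name
    return None


print(aux_class("可以"))  # modal
print(aux_class("着"))    # aspect
print(aux_class("書"))    # None
```

The table only records candidate forms. Whether a given token is actually AUX still depends on context, for example whether a modal form takes a direct object.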
Features
Since Chinese is mostly isolating, few features are needed in the treebanks:
- NounType=Clf is used to distinguish classifiers from common NOUNs
- Person is used to categorize personal pronouns
- NumType is used for cardinal (Card) and ordinal (Ord) numerals
- Number=Plur is used in the rare cases where a plural suffix is used: 我們 wǒmen “we” vs. 我 wǒ “I”
- Polarity=Neg is used with the negators 不, 未, 沒, 別, 無
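In CoNLL-U files these features appear in the FEATS column as `Name=Value` pairs separated by `|`, with `_` standing for "no features". The sketch below shows one way to read such a string. The helper name `parse_feats` and the example feature strings are illustrative assumptions, not taken from any particular treebank.

```python
def parse_feats(feats):
    """Parse a CoNLL-U FEATS string such as 'Number=Plur|Person=1' into a dict."""
    if feats == "_":  # underscore means the token carries no features
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Hypothetical feature strings, chosen to match the features described above.
examples = {
    "我們": "Number=Plur|Person=1",
    "不": "Polarity=Neg",
    "個": "NounType=Clf",
    "的": "_",
}

for form, feats in examples.items():
    print(form, parse_feats(feats))
```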
Syntax
This is an overview only. For more detailed discussion and examples, see the list of Chinese relations and details for each treebank in Chinese syntax.
Core Arguments, Oblique Arguments and Adjuncts
- A nominal subject (nsubj) is a noun phrase that is not introduced by a preposition.
- A finite subordinate clause may serve as the subject and is labeled csubj.
- For the purposes of UD, objects are divided into core objects, labeled obj or iobj, and oblique objects, labeled obl:arg.
- All prepositional objects are considered oblique.
- A clausal complement, labeled ccomp, is a full clause that functions like an object of the verb. It has its own subject, which is not obligatorily coreferent with any argument of the matrix verb. By contrast, an open clausal complement, labeled xcomp, also functions like an object of another verb but obligatorily lacks an overt subject.
- Extra attention has to be paid to distinguishing a subjectless ccomp from a subject-control xcomp.
- In Mandarin Chinese, oblique nominals may include prepositional phrases, preverbal coverb phrases (where the coverbs are treated as prepositions), and comparative objects introduced by 比 bǐ “than”. They are labeled obl. Other cases are:
- For nominals that don’t include a nominal marker, specifically locational pronouns and nouns, such as 這裡 zhèlǐ “here”, 那裡 nàlǐ “there”, 前面 qiánmiàn “front side”.
- For temporal nouns functioning as adjuncts, see obl:tmod.
- For nominals introduced by 被 bèi and 把 bǎ, see obl:agent and obl:patient, respectively.
- In passive clauses, the subject is labeled nsubj:pass if it is nominal and csubj:pass if it is clausal.
- The auxiliary verb in periphrastic passive is labeled aux:pass.
Non-verbal Clauses
- Different copula words in Mandarin Chinese are labeled cop:
- The copula verb 是 shì “be” is used in equational, possessive, and benefactive nonverbal clauses.
- The words 為 wéi “be, be as” and 非 fēi “not be” are also included if they are the only verb in a sentence.
Compounds
- Compounds are very frequent in Mandarin Chinese, where words are predominantly disyllabic. The compound relation is used primarily for noun-noun compounds. The latter nominal is typically the head of the compound.
- This applies to any nominal preceding and modifying another nominal unless the relationship between the two is a possessive one (see nmod).
- For verb and verb-object compounds, see compound:dir, compound:ext, compound:vo, and compound:vv.
Relations Overview
- The following relation subtypes are used in Chinese:
- nsubj:pass for nominal subjects of passive verbs
- csubj:pass for clausal subjects of passive verbs
- obl:arg for prepositional objects
- obl:agent for agents of passive verbs
- obl:patient for objects in BA construction
- obl:tmod for temporal modifiers
- aux:pass for passive auxiliaries
- dislocated:vo for dislocated objects of verb-object compounds
- advmod:df for duration or frequency adverbial modifiers
- discourse:sp for sentence particles
- mark:adv for manner adverbializers
- mark:rel for adjectival, relativizer, and nominalizer 的 de
- case:loc for localizers
- compound:dir for directional verb compounds
- compound:ext for extent and descriptive verb compounds
- compound:vo for verb-object compounds
- compound:vv for verb-verb compounds
- The following relation types are not used in Chinese at all: expl
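The inventory above lends itself to a simple sanity check on the DEPREL column of a Chinese treebank. The following is a minimal sketch under stated assumptions: `CHINESE_SUBTYPES`, `UNUSED_RELATIONS`, and `check_deprel` are hypothetical names, the sets only transcribe the lists above, and the official UD validator applies many more rules than this.

```python
# Relation subtypes listed above for Chinese (transcribed, not authoritative).
CHINESE_SUBTYPES = {
    "nsubj:pass", "csubj:pass", "obl:arg", "obl:agent", "obl:patient",
    "obl:tmod", "aux:pass", "dislocated:vo", "advmod:df", "discourse:sp",
    "mark:adv", "mark:rel", "case:loc", "compound:dir", "compound:ext",
    "compound:vo", "compound:vv",
}

# Universal relations that the overview says are not used in Chinese.
UNUSED_RELATIONS = {"expl"}


def check_deprel(deprel):
    """Classify a DEPREL value as 'ok', 'unused', or 'unknown-subtype'."""
    base = deprel.split(":", 1)[0]
    if base in UNUSED_RELATIONS:
        return "unused"
    if ":" in deprel and deprel not in CHINESE_SUBTYPES:
        return "unknown-subtype"
    # Plain universal relations are accepted without further checks here.
    return "ok"


for rel in ["nsubj", "obl:patient", "expl", "nmod:poss"]:
    print(rel, check_deprel(rel))
```

Relations without a subtype are passed through unchecked, since the overview lists only the subtypes and the single unused relation.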
Treebanks
There are 6 Chinese UD treebanks: