
This page pertains to UD version 2.

UD for Classical Chinese

Tokenization and Word Segmentation

Classical Chinese texts have neither spaces nor punctuation marks between words or sentences. Every word consists of a single character, except for several nouns and proper nouns.
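
The segmentation rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build the treebank: the function name `segment` and the `multi_char_words` lexicon are hypothetical, and the sample lexicon entry is chosen for demonstration only.

```python
def segment(text, multi_char_words=frozenset()):
    """Split text into one-character words, keeping listed longer words intact.

    `multi_char_words` is a hypothetical lexicon of multi-character
    (proper) nouns; it is an assumption of this sketch.
    """
    # Longest lexicon entry decides how far ahead we need to look.
    max_len = max((len(w) for w in multi_char_words), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to length 2.
        for n in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + n] in multi_char_words:
                tokens.append(text[i:i + n])
                i += n
                break
        else:
            # No multi-character word starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


# Illustrative lexicon only; not taken from the treebank.
print(segment("孟子見梁惠王", {"孟子"}))
```

With an empty lexicon the function simply returns one token per character, which matches the default case described above.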



The predicate-object-final structure of very early Chinese texts had only three categories of words: predicate, object, and final. In our linguistic model we tentatively call them “verb”, “noun”, and “particle”, respectively. Several words were specialised as verbs and several as nouns, but by around the Zhou (周) dynasty most words were used in two or all three categories.

In that era, we can observe very early modifier usages of verbs. Several verbs became specialised as adverbial modifiers, which later gave rise to adverbs. Between verbs and adverbs, auxiliary verbs were almost entirely specialised to auxiliary uses, but were occasionally used as verbs. Adjectival usages of verbs had not yet been specialised into adjectives in that era; on the other hand, some verbs gave rise to prepositions.

For POS-tagging of Classical Chinese texts in UD, we use VERB, ADV, AUX, ADP, and SCONJ to fill the UPOS field of each verb-origin word, following the overview of modifier usages above. For noun-origin words we use NOUN, PROPN, PRON, NUM, and ADV (noun-origin adverbs, including 何), categorising them from a rather modern point of view. For particle-origin words we use PART, CCONJ, and INTJ, in keeping with the UD v2 guidelines. We rarely use SYM, and do not use ADJ, DET, PUNCT, or X.
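
The tag inventory described above can be written down as a small lookup table, which may be handy when checking annotations. The sketch below is an assumption-laden illustration: the names `UPOS_BY_ORIGIN`, `RARE_UPOS`, `UNUSED_UPOS`, and `check_upos` are hypothetical and not part of any UD tool; only the tag sets themselves come from the paragraph above.

```python
# Tag sets as listed in the text, grouped by the origin of the word.
UPOS_BY_ORIGIN = {
    "verb": {"VERB", "ADV", "AUX", "ADP", "SCONJ"},
    "noun": {"NOUN", "PROPN", "PRON", "NUM", "ADV"},
    "particle": {"PART", "CCONJ", "INTJ"},
}
RARE_UPOS = {"SYM"}                        # rarely used
UNUSED_UPOS = {"ADJ", "DET", "PUNCT", "X"}  # not used at all


def check_upos(tag):
    """Return (status, origins) for a UPOS tag under the scheme above.

    status is one of "allowed", "rare", "unused", or "unknown";
    origins lists the word origins for which the tag is used.
    """
    if tag in UNUSED_UPOS:
        return ("unused", [])
    if tag in RARE_UPOS:
        return ("rare", [])
    origins = sorted(o for o, tags in UPOS_BY_ORIGIN.items() if tag in tags)
    if origins:
        return ("allowed", origins)
    return ("unknown", [])


print(check_upos("ADV"))  # ADV appears under both verb and noun origin
print(check_upos("ADJ"))
```

Note that ADV is the only tag shared between two origins, so the tag alone does not tell whether an adverb is verb-origin or noun-origin.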




There is one Classical Chinese UD treebank: