
This page pertains to UD version 2.

PUNCT: punctuation


Punctuation marks are non-alphabetical characters and character groups used in many languages to delimit linguistic units in printed text.

Punctuation is not taken to include logograms such as $, %, and §, which are instead tagged as SYM. (Hint: if it corresponds to a word that you pronounce, such as dollar or percent, it is SYM and not PUNCT.)

Spoken corpora contain symbols representing pauses, laughter and other sounds; we treat them as punctuation, too. In these cases it is not even required that all characters of the token be non-alphabetical. One can represent a pause using a special character such as #, or using a more descriptive code such as [:pause].
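
The distinctions above can be sketched as a small heuristic. This is only an illustration and not part of the UD guidelines or any UD tool: the function name `guess_upos` and the two lookup sets are hypothetical, and the sets hold just the examples mentioned on this page. A real corpus would need its own lists, since which symbols are pronounced as words and how pauses are coded both depend on the language and the transcription scheme.

```python
import unicodedata
from typing import Optional

# Hypothetical, non-exhaustive set of logograms that correspond to a
# pronounced word (dollar, percent, section); these are SYM, not PUNCT.
SPOKEN_SYMBOLS = {"$", "%", "§"}

# Hypothetical pause/laughter codes for a spoken corpus. UD does not
# prescribe these; each corpus defines its own.
SPOKEN_CORPUS_MARKERS = {"#", "[:pause]"}


def guess_upos(token: str) -> Optional[str]:
    """Return "SYM", "PUNCT", or None when this heuristic cannot decide."""
    # Check the SYM list first: "%" and "§" are Unicode punctuation
    # characters, so the generic test below would mislabel them.
    if token in SPOKEN_SYMBOLS:
        return "SYM"
    # Corpus-specific markers count as PUNCT even if they contain letters.
    if token in SPOKEN_CORPUS_MARKERS:
        return "PUNCT"
    # Otherwise, treat a token made only of Unicode punctuation
    # characters (general categories starting with "P") as PUNCT.
    if token and all(unicodedata.category(ch).startswith("P") for ch in token):
        return "PUNCT"
    return None
```

The order of the checks matters here: the Unicode punctuation category alone is not a reliable proxy for PUNCT, so the pronounced-word list is consulted before the generic character test.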


