
This page pertains to UD version 2.

PUNCT: punctuation


Punctuation marks are non-alphabetical characters and character groups used to delimit linguistic units in printed text.

Punctuation is not taken to include logograms such as $, %, and §, which are instead tagged as SYM.



Prague Dependency Treebank

The PDT texts date from the early 1990s and contain no e-mail addresses. Had any occurred, the PDT tokenization rules would have split them at every dot and at sign. The same holds for telephone numbers. For example, tel.: (05) 4321 6014 is analyzed as eight tokens (NOUN PUNCT PUNCT PUNCT NUM PUNCT NUM NUM).
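The splitting behaviour described above can be approximated with a short Python sketch. This is not the PDT tokenizer: the regular expression, the function names `tokenize` and `rough_tag`, and the `SYMBOLS` set are assumptions made for illustration. The sketch treats every run of letters or digits as one token and every other non-space character as a token of its own, then marks the tokens whose tag follows from their shape alone.

```python
import re

# Assumed approximation of the splitting rule: a run of word characters
# (letters, digits, underscore) is one token; any other non-space
# character becomes a separate token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

# Logograms named in the definition above; they are SYM, not PUNCT.
SYMBOLS = {"$", "%", "§"}


def tokenize(text):
    """Split text into tokens, breaking at every punctuation character."""
    return TOKEN_RE.findall(text)


def rough_tag(token):
    """Return NUM, SYM or PUNCT when the token's shape decides the tag.

    Returns None for everything else (e.g. 'tel'), because choosing
    NOUN, PROPN and so on requires a real tagger.
    """
    if token.isdigit():
        return "NUM"
    if token in SYMBOLS:
        return "SYM"
    if re.fullmatch(r"[^\w\s]", token):
        return "PUNCT"
    return None


if __name__ == "__main__":
    for example in ("tel.: (05) 4321 6014", "jan.novak@example.cz"):
        tokens = tokenize(example)
        print(tokens, [rough_tag(t) for t in tokens])
```

Under these assumptions the telephone example yields eight tokens, matching the count given above, and the made-up address `jan.novak@example.cz` is split at both dots and at the at sign. Note that `\w` also matches the underscore, so an underscore inside an address would not trigger a split in this sketch.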


PUNCT in other languages: [bej] [bg] [ca] [cs] [cy] [da] [el] [en] [es] [et] [fi] [fr] [ga] [grc] [hy] [hyw] [it] [ja] [ka] [kk] [kpv] [ky] [myv] [no] [pt] [ru] [sl] [sv] [tr] [tt] [uk] [u] [urj] [xcl] [yue] [zh]