This is part of archived UD v1 documentation. See http://universaldependencies.org/ for the current version.

Tokenization

Tokenization of Ancient Greek is almost strictly based on the graphic word, since the language shows a strict correspondence between graphic word and morphological word. More precisely, note the following:

Sentence split

Sentence splitting is based on the punctuation found in the original edited text, which has been added or modified by the editor of the manuscript(s).

In standard editorial practice, four punctuation marks are found: the comma (,), encoded as U+002C “COMMA”; the period (.), encoded as U+002E “FULL STOP”; the point above the line (·), encoded as U+00B7 “MIDDLE DOT”, which corresponds in function to the English colon and semicolon; and the question mark (;), encoded as U+003B “SEMICOLON”.
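As a quick check of the encodings listed above, the following Python sketch prints the code point and Unicode name of each mark. The dictionary `PUNCTUATION` and the helpers `describe` and `normalize_punctuation` are hypothetical names introduced here for illustration. The normalization step rests on an assumption the guidelines do not state: that some digital texts may use the Greek-specific characters U+0387 and U+037E, which Unicode NFC normalization maps to U+00B7 and U+003B respectively.

```python
import unicodedata

# The four marks named in the text, keyed by their role (hypothetical table).
PUNCTUATION = {
    "comma": "\u002C",
    "period": "\u002E",
    "point above the line": "\u00B7",
    "question mark": "\u003B",
}


def describe(char):
    """Return a 'U+XXXX NAME' label for a single character."""
    return "U+{:04X} {}".format(ord(char), unicodedata.name(char))


def normalize_punctuation(text):
    """Apply NFC normalization.

    Assumption (not stated in the guidelines): the input may contain
    U+0387 GREEK ANO TELEIA or U+037E GREEK QUESTION MARK. NFC replaces
    them with U+00B7 and U+003B, the code points listed in the text.
    """
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for role, char in PUNCTUATION.items():
        print(role, "->", describe(char))
```

Running the script lists each mark with its code point and name, so it can be compared directly against the encodings given in the paragraph above.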

All of these except the comma are taken to mark the end of a sentence. If a sentence is particularly long, the editor of a digital text may decide to split it into two or more sentences, regardless of the punctuation of the original text.
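The punctuation-based rule can be sketched in Python as follows. This is only a minimal illustration, not the procedure actually used to build the treebank: the function name `split_sentences` is hypothetical, the sample string is invented for the example, and the sketch assumes that each sentence-final mark is followed by whitespace. It does not model the editor's discretionary splitting of long sentences.

```python
import re

# Sentence-final marks named in the text: full stop (U+002E),
# middle dot (U+00B7) and semicolon (U+003B, the question mark).
# The comma (U+002C) is deliberately left out.
SENTENCE_FINAL = "\u002E\u00B7\u003B"

# Split on whitespace that directly follows a sentence-final mark.
_BOUNDARY = re.compile("(?<=[" + re.escape(SENTENCE_FINAL) + r"])\s+")


def split_sentences(text):
    """Split text into sentences after each sentence-final mark.

    Hypothetical helper: assumes marks are followed by whitespace and
    ignores any manual splitting of long sentences by an editor.
    """
    parts = _BOUNDARY.split(text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    # Invented sample; escapes make the code points explicit.
    sample = "ἦλθον, εἶδον\u00B7 τίς εἶ\u003B χαῖρε."
    for sentence in split_sentences(sample):
        print(sentence)
```

With the sample string, the comma leaves its clause attached to the first sentence, while the middle dot, the semicolon, and the full stop each close a sentence.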