UD for Old Irish

Treebank Classification and Pre-tokenization Considerations

Both spelling and word separation in Old Irish texts can be highly irregular. In modern editions, some editors attempt to reproduce the text faithfully as it survives in the original manuscript, while others alter it by standardising spelling and word spacing, by capitalising proper nouns, and by introducing forms of punctuation not present in the original manuscript. These editorial changes can drastically affect word boundaries and, hence, the forms which tokens may take. A tokenization standard based solely on the conventions generally adhered to in such altered editions would therefore be inappropriate for editions which aim to preserve manuscript orthography as closely as possible, and in some cases impossible to apply to them.

To mark this distinction, all Old Irish treebanks should identify in their README documentation which type of edition they represent, using either the “diplomatic” or the “critical” designation. This information should also be included in the treebank name and URL using the abbreviations Dip and Crit (for example, the URL of the Diplomatic St. Gall Glosses treebank ends: …/UD_Old_Irish-DipSGG).
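The naming convention can be illustrated with a minimal Python sketch. It assumes that the pattern seen in the single example above (UD_Old_Irish-DipSGG) generalises to other treebanks. The function name `treebank_name` and the parameter `corpus_code` are hypothetical and are not part of any UD tooling.

```python
# Abbreviations taken from the guideline text: "Dip" and "Crit".
DESIGNATION_ABBREVIATIONS = {"diplomatic": "Dip", "critical": "Crit"}


def treebank_name(designation: str, corpus_code: str) -> str:
    """Build a treebank name such as 'UD_Old_Irish-DipSGG'.

    `corpus_code` (e.g. 'SGG') is a hypothetical name for the
    corpus-specific part of the treebank name.
    """
    try:
        abbreviation = DESIGNATION_ABBREVIATIONS[designation.lower()]
    except KeyError:
        raise ValueError(
            f"designation must be 'diplomatic' or 'critical', got {designation!r}"
        ) from None
    # Assumed pattern: fixed language prefix, designation abbreviation, corpus code.
    return f"UD_Old_Irish-{abbreviation}{corpus_code}"


print(treebank_name("diplomatic", "SGG"))  # UD_Old_Irish-DipSGG
```

Rejecting any designation other than the two named in the guidelines keeps a treebank from being published under an unclassified name.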

For the purpose of choosing the correct designation for a new treebank, the following definitions should be adhered to.



Tokenization and Word Segmentation

Words are not necessarily delimited by whitespace characters or punctuation in Old Irish texts. Manuscript sources tend to combine unstressed syntactic words (including common clitics like the copula and definite article) with surrounding stress-bearing words. As a result, a large proportion of the strings which appear between space characters are purely orthographic units rather than syntactic words. Some treebanks may employ whitespace characters between all syntactic words in accordance with the prescribed Old Irish tokenization guidelines. Where this is done, the README documentation for the treebank should state so clearly. See, for example, the Diplomatic St. Gall Glosses treebank, which employs spacing in this way.

In Roman script, the whitespace character can sometimes be used to delineate word boundaries (as described above). Ogham script, however, has a discrete space mark consisting of a stemline devoid of any other markings.
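As a rough illustration, the Python sketch below splits a string on both ordinary whitespace and the ogham space mark, which Unicode encodes as U+1680 (OGHAM SPACE MARK). It assumes that the text has already been spaced so that these characters mark word boundaries. As noted above, that is not guaranteed for manuscript spacing. The function name `split_on_space_marks` is hypothetical.

```python
from typing import List

OGHAM_SPACE_MARK = "\u1680"  # Unicode OGHAM SPACE MARK (a bare stemline)


def split_on_space_marks(text: str) -> List[str]:
    """Split text on runs of whitespace and/or ogham space marks.

    This recovers only the strings delimited by spacing; it does not
    decide whether each string is a syntactic word.
    """
    # Normalise ogham space marks to ordinary spaces, then split on whitespace.
    return text.replace(OGHAM_SPACE_MARK, " ").split()


# Two groups of arbitrary ogham letters separated by one ogham space mark.
sample = "\u168B\u1690" + OGHAM_SPACE_MARK + "\u168A\u1694"
print(len(split_on_space_marks(sample)))  # 2
```

Replacing the space mark explicitly, rather than relying on a library's notion of whitespace, keeps the behaviour the same across tools that may classify U+1680 differently.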

Punctuation is infrequent in manuscript sources; however, punctuation characters not present in the original manuscript material may be introduced by the editors of some modern editions. Aside from these, the following exceptions occur:

No multiword tokens occur. Where adjectives or nouns precede other nouns, they generally remain separate tokens, as with “sen-” in the term “sen-grec”. The following exceptions occur:

The following general advice on tokenization may not be intuitive to those familiar with Old Irish:






