This page pertains to UD version 2.

UD for Old Irish

Treebank Classification and Pre-tokenisation Considerations

Both spelling and word-separation in Old Irish texts can be highly irregular. In modern editions some editors attempt to faithfully reproduce the text as it survives in the original manuscript. These are generally referred to as diplomatic editions. Other editors may alter the text so that it does not resemble exactly the contents of a single manuscript source. This may be done with the aim of emulating a theorised earlier exemplar from which one or more existing manuscript sources are believed to have been copied. In such cases the resulting work is generally referred to as a critical edition. Editors may also alter texts by standardising spelling, by silently introducing word spacing, by capitalising certain letter characters in accordance with modern orthographic practice, and by introducing forms of punctuation not present in the original manuscript. While these changes are not necessarily associated with critical editions, they alter the text in such a manner that it cannot be referred to as entirely diplomatic. Texts edited in such a manner will therefore be referred to broadly as “critical editions” here also.

It is necessary to mark a distinction between diplomatic editions and editions which may have been edited to any extent (here “critical editions”). To mark this distinction all Old Irish treebanks should identify in their README documentation which type of edition they represent by using either the “diplomatic” or “critical” designation. This information should also be included in the treebank name and URL using the abbreviations Dip and Crit (for example, the Diplomatic St. Gall Glosses treebank URL ends: …/UD_Old_Irish-DipSGG).
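As a rough illustration of the naming convention described above, the following sketch builds a treebank name from an edition designation and a corpus abbreviation. The function name `treebank_name` and the dictionary are hypothetical, introduced only for this example; the `UD_Old_Irish-` prefix and the Dip and Crit abbreviations are taken from the text.

```python
# Hypothetical helper, not part of any UD tooling: it only illustrates
# the Dip/Crit naming convention described in the paragraph above.
EDITION_ABBREVIATIONS = {"diplomatic": "Dip", "critical": "Crit"}


def treebank_name(edition_type: str, corpus_abbreviation: str) -> str:
    """Build a treebank name such as UD_Old_Irish-DipSGG."""
    key = edition_type.strip().lower()
    if key not in EDITION_ABBREVIATIONS:
        raise ValueError(
            "edition_type must be 'diplomatic' or 'critical', got: " + repr(edition_type)
        )
    return "UD_Old_Irish-" + EDITION_ABBREVIATIONS[key] + corpus_abbreviation


# "SGG" is the abbreviation used for the St. Gall Glosses in the example above.
print(treebank_name("diplomatic", "SGG"))  # UD_Old_Irish-DipSGG
```

Rejecting any designation other than “diplomatic” or “critical” mirrors the requirement that every treebank declare exactly one of the two.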

For the purpose of choosing the correct designation for a new treebank, the following definitions should be adhered to.

Diplomatic:

Critical:

Tokenisation and Word Segmentation

Words are not necessarily delimited by whitespace characters or punctuation in Old Irish texts. Manuscript sources tend to combine unstressed syntactic words (including common clitics like the copula and definite article) with surrounding parts of speech that bear stress. This practice results in many compounds which are purely orthographic but are composed of two or more lexical words.

In Roman script the whitespace character can sometimes be used to delineate word boundaries (as described above). Ogham script, however, has a discrete space mark consisting of a stemline devoid of any other markings.
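For text encoded in Unicode, the Ogham space mark has its own code point, U+1680 (OGHAM SPACE MARK). The sketch below assumes Unicode-encoded input and splits a string on that character. The function name `split_ogham_words` is hypothetical, and the sample string is an arbitrary sequence of Ogham letters chosen only to show the mechanics, not a real inscription.

```python
import unicodedata

# U+1680 is the Unicode code point for the Ogham space mark.
OGHAM_SPACE_MARK = "\u1680"


def split_ogham_words(text: str) -> list:
    """Split Unicode-encoded Ogham text on the Ogham space mark."""
    return [chunk for chunk in text.split(OGHAM_SPACE_MARK) if chunk]


# Arbitrary placeholder letters (U+1681 to U+1684), not a real inscription.
sample = "\u1681\u1682" + OGHAM_SPACE_MARK + "\u1683\u1684"

print(unicodedata.name(OGHAM_SPACE_MARK))  # OGHAM SPACE MARK
print(len(split_ogham_words(sample)))      # 2
```

Splitting on the explicit code point, rather than on generic whitespace, keeps the Ogham word boundary distinct from any Roman-script spacing an editor may have introduced.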

Orthographic combinations of discrete lexical words should be separated during tokenisation:

Prepositional pronouns (conjugated prepositions) should not be separated during tokenisation, as these are deemed to be discrete words in their own right.

Punctuation is infrequent in manuscript sources; however, punctuation characters not present in the original manuscript material may be introduced by editors of some modern editions. Aside from these, the following exceptions occur:

No multiword tokens occur. Where adjectives or nouns precede other nouns, they generally remain separate tokens, as with “sen-” in the term “sen-grec”.
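In the CoNLL-U format, a multiword token is written as a line whose ID field is a range such as `1-2`. Since no multiword tokens occur in Old Irish treebanks, a simple check for such range IDs can serve as a sanity test. The following is a minimal sketch, not an official validator: the function name `multiword_token_lines` is hypothetical, and the sample rows use placeholder underscores for every annotation column, with the forms `sen` and `grec` assumed purely for illustration.

```python
import re

# A multiword token line in CoNLL-U has a range ID such as "1-2".
MULTIWORD_ID = re.compile(r"^\d+-\d+$")


def multiword_token_lines(conllu_text: str) -> list:
    """Return the 1-based line numbers whose ID field is a range."""
    hits = []
    for number, line in enumerate(conllu_text.splitlines(), start=1):
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        token_id = line.split("\t", 1)[0]
        if MULTIWORD_ID.match(token_id):
            hits.append(number)
    return hits


# Placeholder rows: only ID and FORM are filled in; other columns are "_".
blank_columns = "\t".join(["_"] * 8)
sample = "1\tsen\t" + blank_columns + "\n2\tgrec\t" + blank_columns + "\n"

print(multiword_token_lines(sample))  # []
```

An empty list indicates that every token line carries a plain integer ID, which is consistent with the guideline above.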

Some general advice on tokenisation follows which may not be intuitive to those familiar with Old Irish:

Morphology

POS-Tags


Features

Syntax

References

Bergin, Osborn. (1938). On the Syntax of the Verb in Old Irish. Ériu, 12, pages 197–214.

Doyle, Adrian, John P. McCrae, and Clodagh Downey. (2019). A Character-Level LSTM Network Model for Tokenizing the Old Irish text of the Würzburg Glosses on the Pauline Epistles. In Proceedings of the Celtic Language Technology Workshop, pages 70–79, Dublin, Ireland. European Association for Machine Translation. https://www.aclweb.org/anthology/W19-6910/

Doyle, Adrian and John P. McCrae. (2024). Developing a Part-of-speech Tagger for Diplomatically Edited Old Irish Text. In Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024, pages 11–21, Torino, Italia. ELRA and ICCL. https://aclanthology.org/2024.lt4hala-1.2/

McCone, Kim. (1997). The Early Irish Verb - Second Edition Revised with Index. An Sagart, Maynooth.

Ó hUiginn, Ruairí. (1987). Notes on Old Irish Syntax. Ériu, 38, pages 177–183.

Stifter, David. (2006). Sengoidelc. Syracuse University Press, New York.

Thurneysen, Rudolf. (1946). A Grammar of Old Irish. Binchy, D. A. and Bergin, Osborn (tr.), Reprinted 2010, Dublin Institute for Advanced Studies.

Treebanks

There are two Old Irish UD treebanks: