
This page pertains to UD version 2.

UD Irish Cadhan

Language: Irish (code: ga)
Family: Indo-European, Celtic

This treebank has been part of Universal Dependencies since the UD v2.11 release.

The following people have contributed to making this treebank part of UD: Kevin Scannell, Theodorus Fransen.

Repository: UD_Irish-Cadhan
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.13

License: CC BY-SA 4.0

Genre: fiction, nonfiction, bible, poetry

Questions, comments? General annotation questions (either Irish-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [kscanne (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.

Annotation Source
Lemmas annotated manually
UPOS annotated manually, natively in UD style
XPOS not available
Features annotated manually, natively in UD style
Relations annotated manually, natively in UD style

Description

This is the Cadhan Aonair UD treebank, consisting of 150 sentences randomly sampled from six pre-standard Irish texts. It was subsequently augmented with a late Early Modern Irish syllabic poem representing 43 sentences, described in a separate section below.

Irish underwent a major spelling standardization in the 1940s and 1950s, and as a result it can be challenging to apply modern language technologies to older, “pre-standard” texts. For many years now, the general strategy for tagging and parsing older Irish texts has been to pre-process them with an automatic standardizer (Scannell, 2014) and then to use existing tools designed for the modern language. This approach has been successful, but it has some inherent limitations. First and foremost, since there are no resources for directly tagging or parsing pre-standard texts, the standardizer must do its job without the benefit of linguistic annotations. This places an upper bound on the performance of the standardizer, and therefore on the full pipeline for analyzing older texts. In addition, certain grammatical phenomena have all but disappeared in the modern language (e.g. the dative case); these cannot be properly handled with the existing approach.

Our primary aim in creating this treebank was to establish a test set for evaluating lemmatization, tagging, and parsing of pre-standard Irish texts. This should enable experimentation with various approaches that we hope will eventually outperform the existing pipeline. Although the test set is quite small (150 sentences, 3804 tokens), we hope to expand it enough to allow the training of a parser designed to act directly on pre-standard texts.
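To make the intended use as a test set concrete, the following is a minimal sketch of scoring a system's output against gold annotations in CoNLL-U format. It reports lemma accuracy, UPOS accuracy, and unlabeled and labeled attachment scores (UAS, LAS). The helper names (row, read_conllu, score) are hypothetical, and the three-word sentence is a hand-written toy example, not a sentence from this treebank. The sketch assumes that gold and predicted files share the same tokenization; it does not handle differing segmentations.

```python
from typing import Dict, List, Tuple

# (form, lemma, upos, head, deprel) for one syntactic word
Token = Tuple[str, str, str, int, str]


def row(idx: int, form: str, lemma: str, upos: str, head: int, deprel: str) -> str:
    """Build one 10-column CoNLL-U line; columns not needed here are '_'."""
    return "\t".join([str(idx), form, lemma, upos, "_", "_", str(head), deprel, "_", "_"])


def read_conllu(text: str) -> List[List[Token]]:
    """Parse CoNLL-U text into sentences of (form, lemma, upos, head, deprel)."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        if not line.strip():           # a blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):       # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():      # skip multiword ranges (1-2) and empty nodes (1.1)
            continue
        current.append((cols[1], cols[2], cols[3], int(cols[6]), cols[7]))
    if current:
        sentences.append(current)
    return sentences


def score(gold: List[List[Token]], pred: List[List[Token]]) -> Dict[str, float]:
    """Per-word accuracy for lemma, UPOS, head (UAS) and head+deprel (LAS)."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sentence counts differ")
    counts = {"lemma": 0, "upos": 0, "uas": 0, "las": 0}
    total = 0
    for g_sent, p_sent in zip(gold, pred):
        if len(g_sent) != len(p_sent):
            raise ValueError("gold and predicted tokenizations differ")
        for g, p in zip(g_sent, p_sent):
            total += 1
            counts["lemma"] += g[1] == p[1]
            counts["upos"] += g[2] == p[2]
            counts["uas"] += g[3] == p[3]
            counts["las"] += g[3] == p[3] and g[4] == p[4]
    if total == 0:
        raise ValueError("no words to score")
    return {name: hits / total for name, hits in counts.items()}


# Toy data (invented for illustration): "Tá sé anseo".
GOLD = "\n".join([
    row(1, "Tá", "bí", "VERB", 0, "root"),
    row(2, "sé", "sé", "PRON", 1, "nsubj"),
    row(3, "anseo", "anseo", "ADV", 1, "advmod"),
]) + "\n"

# A hypothetical system output with one lemma error and one label error.
PRED = "\n".join([
    row(1, "Tá", "tá", "VERB", 0, "root"),
    row(2, "sé", "sé", "PRON", 1, "nsubj"),
    row(3, "anseo", "anseo", "ADV", 1, "obl"),
]) + "\n"

results = score(read_conllu(GOLD), read_conllu(PRED))
print(results)
```

For real experiments, a standard evaluation script that aligns differing tokenizations is likely a better choice than this word-by-word comparison.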

The corpus contains 25 sentences each from six different books published between 1602 and 1936. Texts published in the late 19th century and early 20th century are much easier to process than older texts. The orthography, while quite different from the standard, is much more consistent than what one finds in texts published before the 1880s. We selected three books published in this later period, one from each of the major Irish dialects: Deoraidheacht by Pádraic Ó Conaire (1910, Connacht Irish), Peig by Peig Sayers (1936, Munster Irish), and Scairt an Dúthchais, a translation of Jack London’s Call of the Wild by Niall Ó Domhnaill (1932, Ulster Irish). We then selected three older (and consequently more challenging) texts to round out the corpus: Foras Feasa ar Éirinn by Seathrún Céitinn (1634), the 1602 translation of the Gospel of John by Uilliam Ó Domhnaill, and Cín Lae Amhlaoibh, a diary kept by Amhlaoibh Ó Súilleabháin between 1827 and 1835.

The annotations were produced by standardizing the texts, parsing them with a UDPipe model trained on the modern Irish treebank, projecting the annotations back to the source texts, and then manually correcting the results. Full details are available in Scannell (2022).
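The projection step can be illustrated with a short sketch. It assumes a strict one-to-one alignment between source tokens and standardized tokens, which is a simplification made here for illustration; the actual procedure is the one described in Scannell (2022) and may differ. The Word class, the project function, and the example annotations are all hypothetical and hand-written, not output from the standardizer or from UDPipe.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Word:
    form: str
    lemma: str
    upos: str
    head: int      # 1-based index of the head word, 0 for the root
    deprel: str


def project(source_forms: List[str], parsed: List[Word]) -> List[Word]:
    """Copy annotations from the standardized words onto the source forms.

    Assumes one standardized word per source word, in the same order,
    so head indices remain valid after projection.
    """
    if len(source_forms) != len(parsed):
        raise ValueError("projection here requires a one-to-one alignment")
    return [replace(word, form=form) for form, word in zip(source_forms, parsed)]


# Pre-standard spelling of a title mentioned above.
source = ["Scairt", "an", "Dúthchais"]

# Stand-in for parser output on the standardized text (hand-written).
parsed_standard = [
    Word("Scairt", "scairt", "NOUN", 0, "root"),
    Word("an", "an", "DET", 3, "det"),
    Word("Dúchais", "dúchas", "NOUN", 1, "nmod"),
]

projected = project(source, parsed_standard)
for word in projected:
    print(word)
```

The projected words keep the original spelling in the form field while carrying the lemma, tag, and dependency information obtained from the standardized text; these are then the starting point for manual correction.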

Acknowledgments

References

Statistics of UD Irish Cadhan

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X

Features

Abbr, Aspect, Case, Definite, Degree, Foreign, Form, Gender, Mood, NounType, Number, NumType, PartType, Person, Polarity, Poss, PrepForm, PronType, Reflex, Tense, Typo, VerbForm

Relations

acl, acl:relcl, advcl, advmod, amod, appos, case, case:voc, cc, ccomp, compound:prt, conj, cop, csubj:cleft, csubj:cop, det, dislocated, fixed, flat:foreign, flat:name, mark, mark:prt, nmod, nmod:poss, nsubj, nummod, obj, obl, obl:prep, obl:tmod, parataxis, punct, root, vocative, xcomp, xcomp:pred

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

Relations Overview