UD Irish Cadhan
Language: Irish (code: ga)
Family: IE
This treebank has been part of Universal Dependencies since the UD v2.11 release.
The following people have contributed to making this treebank part of UD: Kevin Scannell, Theodorus Fransen.
Repository: UD_Irish-Cadhan
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: fiction, nonfiction, bible, poetry
Questions, comments? General annotation questions (either Irish-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [kscanne (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | annotated manually, natively in UD style |
Relations | annotated manually, natively in UD style |
Description
This is the Cadhan Aonair UD treebank, consisting of 150 sentences randomly sampled from six pre-standard Irish texts. It was subsequently augmented with a late Early Modern Irish syllabic poem representing 43 sentences, described in a separate section below.
Irish underwent a major spelling standardization in the 1940s and 1950s, and as a result it can be challenging to apply modern language technologies to older, “pre-standard” texts. For many years now, the general strategy for tagging and parsing older Irish texts has been to pre-process them with an automatic standardizer (Scannell, 2014), and then to use existing tools designed for the modern language. This approach has been successful, but has some inherent limitations. First and foremost, since there are no resources for directly tagging or parsing pre-standard texts, the standardizer must do its job without the benefit of linguistic annotations. This places an upper bound on the performance of the standardizer, and therefore on the full pipeline for analyzing older texts. In addition, there are certain grammatical phenomena that have all but disappeared in the modern language (e.g. the dative case); these cannot be properly handled with the existing approach.
Our primary aim in creating this treebank was to establish a test set for evaluating lemmatization, tagging, and parsing of pre-standard Irish texts. This should enable experimentation with various approaches that we hope will eventually outperform the existing pipeline. Although the test set is quite small (150 sentences, 3804 tokens), we hope to expand it enough to allow the training of a parser designed to act directly on pre-standard texts.
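As a rough illustration of how a test set like this might be used, the sketch below computes unlabeled and labeled attachment scores (UAS and LAS) from gold and predicted (head, relation) pairs. The function name and the toy sentence are hypothetical, and the sketch assumes that gold and predicted analyses share the same word segmentation; it is not the evaluation procedure used for this treebank.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (head index, dependency relation)


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two equally long lists of arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted analyses must cover the same words")
    if not gold:
        raise ValueError("cannot score an empty analysis")
    # UAS counts words whose head is correct; LAS also requires the right label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / len(gold), las_hits / len(gold)


# Toy four-word analysis, invented for illustration (not taken from the treebank).
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

With only a few thousand tokens in the test set, small score differences between systems should be read with caution.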
The corpus contains 25 sentences each from six different books published between 1602 and 1936. Texts published in the late 19th century and early 20th century are much easier to process than older texts. The orthography, while quite different from the standard, is much more consistent than what one finds in texts published before the 1880s. We selected three books published in this later period, one from each of the major Irish dialects: Deoraidheacht by Pádraic Ó Conaire (1910, Connacht Irish), Peig by Peig Sayers (1936, Munster Irish), and Scairt an Dúthchais, a translation of Jack London’s Call of the Wild by Niall Ó Domhnaill (1932, Ulster Irish). We then selected three older (and consequently more challenging) texts to round out the corpus: Foras Feasa ar Éirinn by Seathrún Céitinn (1634), the 1602 translation of the Gospel of John by Uilliam Ó Domhnaill, and Cín Lae Amhlaoibh, a diary kept by Amhlaoibh Ó Súilleabháin between 1827 and 1835.
The annotations were produced by standardizing the texts, parsing them with a UDPipe model trained on the modern Irish treebank, projecting the annotations back to the source texts, and then manually correcting the results. Full details are available in Scannell (2022).
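The projection step can be pictured with a minimal sketch, assuming a strict one-to-one, position-by-position alignment between the original words and their standardized counterparts. The `Word` class, the `project` function, and the two-word example are hypothetical, and the parser output is written by hand rather than produced by UDPipe. The actual procedure is described in Scannell (2022) and may differ, in particular where standardization splits or merges words.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    form: str
    lemma: str = "_"
    upos: str = "_"
    head: int = 0
    deprel: str = "_"


def project(source_forms: List[str], parsed_standard: List[Word]) -> List[Word]:
    """Copy annotations from standardized words back onto the original forms.

    Assumption: both sequences have the same length and align by position.
    """
    if len(source_forms) != len(parsed_standard):
        raise ValueError("this sketch only handles one-to-one alignments")
    return [
        Word(form, w.lemma, w.upos, w.head, w.deprel)
        for form, w in zip(source_forms, parsed_standard)
    ]


# Hand-written stand-in for parser output on the standardized text.
parsed = [Word("an", "an", "DET", 2, "det"), Word("saol", "saol", "NOUN", 0, "root")]
projected = project(["an", "saoghal"], parsed)
for w in projected:
    print(w.form, w.upos, w.head, w.deprel)
```

The projected analysis keeps the pre-standard spelling while inheriting the tags and tree of the standardized text, which is then the starting point for manual correction.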
Acknowledgments
- Thanks to Teresa Lynn for her many years of work on the Irish treebank, without which none of this research would be possible.
- Thanks to my undergraduate students Sai Shreyas Bhavanasi and Jianjun Zhang at Saint Louis University for many discussions that helped me understand the mathematics behind cross-lingual word embeddings more deeply.
- This project arose out of conversations with Charlie Dillon at the Royal Irish Academy in early 2020 just before the COVID pandemic; my thanks to Charlie and the RIA for hosting me during that visit, and for inspiring this line of research.
References
- Scannell, Kevin P. (2014) Statistical models for text normalization and machine translation, Proceedings of the 1st Celtic Language Technology Workshop at COLING 2014, Baile Átha Cliath, 23 August 2014.
- Scannell, Kevin P. (2022) Diachronic Parsing of Pre-Standard Irish, Proceedings of the 4th Celtic Language Technology Workshop (CLTW 2022) at LREC 2022, Marseille, France, 20 June 2022.
Statistics of UD Irish Cadhan
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB
Features
Abbr – Aspect – Case – Definite – Degree – Foreign – Form – Gender – Mood – NounType – Number – NumType – PartType – Person – Polarity – Poss – PrepForm – PronType – Reflex – Tense – Typo – VerbForm
Relations
acl – acl:relcl – advcl – advmod – amod – appos – case – case:voc – cc – ccomp – compound:prt – conj – cop – csubj:cleft – csubj:cop – det – dislocated – fixed – flat – flat:name – mark – mark:prt – nmod – nmod:poss – nsubj – nummod – obj – obl – obl:prep – obl:tmod – parataxis – punct – root – vocative – xcomp – xcomp:pred
Tokenization and Word Segmentation
- This corpus contains 193 sentences, 4709 tokens and 4783 syntactic words.
- This corpus contains 562 tokens (12%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 78 types of words that contain both letters and punctuation. Examples: 's, d', 'na, d'á, m', n-a, .i., ó'n, 'sa', 'san, a-ta, n-am, agam-sa, do'n, droch-nós, n-Éirinn, n-áit, orm-sa, th', 'e, 'n, 'n-a, 'nar, 'á, Shean-Ghallaibh, a-bhus, a-mháin, a-niú, a-nuas, an-mheidhir, chroidhe-se, chughaibh-se, d'ár, de'n, dhuit-si, duit-si, fa'r, h-ala, h-anfa, h-iomhadh, h-éin, lat-sa, leath-phinginighe, lán-cheaptha, lér', mbocht-chara, mion-roinn, monairc-si, n-abrom, n-agh
- This corpus contains 71 multi-word tokens. On average, one multi-word token consists of 2.04 syntactic words.
- There are 50 types of multi-word tokens. Examples: 'sa, 'sar, 'sdo, 'sgan, dot, fad, gidheadh, 'fhios, 'sda, 'sgo, anadhbharsin, aoinne, dorinne, lem', 'sní, Cilldaluadh, FitzUrsula, ad, adeir, adeirthear, aoinní, ara, arsa'n, cait, ceidImpeir, cia, céidEmpir, dhiáidhsin, dhobhí, dhona, dochluinim, dochualadar, dochuáidh, dochí, dochúaidh, doním, dó-dhéag, eintíre, fearso, ger, id, im', it, leacoidhre, lálá, neitheadhso, neithese, shiar-thuaith, tréd, ód.
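The counts above follow from the CoNLL-U layout: multi-word tokens appear as range lines (for example `1-2`) followed by their syntactic words, and a missing space is recorded as `SpaceAfter=No` in the MISC column. The sketch below is one way such counts could be computed; the function and the placeholder sentence (`ab c.`) are hypothetical, and this is not the script that generated the figures on this page.

```python
def conllu_stats(text: str) -> dict:
    """Count sentences, tokens, syntactic words and multi-word tokens."""
    stats = {"sentences": 0, "tokens": 0, "words": 0, "mwt": 0, "no_space_after": 0}
    covered_until = 0   # last word ID covered by the current multi-word token
    in_sentence = False
    for line in text.splitlines():
        if not line.strip():            # blank line ends a sentence
            in_sentence = False
            covered_until = 0
            continue
        if line.startswith("#"):        # sentence-level comment
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        if not in_sentence:
            stats["sentences"] += 1
            in_sentence = True
        word_id, misc = cols[0], cols[9]
        if "." in word_id:              # empty node, not a word or token
            continue
        no_space = "SpaceAfter=No" in misc.split("|")
        if "-" in word_id:              # multi-word token range line
            covered_until = int(word_id.split("-")[1])
            stats["mwt"] += 1
            stats["tokens"] += 1
            stats["no_space_after"] += no_space
            continue
        stats["words"] += 1
        if int(word_id) > covered_until:  # word that is its own token
            stats["tokens"] += 1
            stats["no_space_after"] += no_space
    return stats


def row(word_id: str, form: str, misc: str = "_") -> str:
    return "\t".join([word_id, form, "_", "_", "_", "_", "_", "_", "_", misc])


# Placeholder sentence with one multi-word token; the forms are not Irish.
sample = "\n".join([
    "# text = ab c.",
    row("1-2", "ab"), row("1", "a"), row("2", "b"),
    row("3", "c", "SpaceAfter=No"), row("4", "."), "",
])
stats = conllu_stats(sample)
print(stats)

# Consistency check on the figures reported above.
tokens, words, mwts, no_space = 4709, 4783, 71, 562
print(round(1 + (words - tokens) / mwts, 2))  # 2.04 words per multi-word token
print(round(100 * no_space / tokens))         # 12 percent
```

The last two lines show that the reported averages agree with the raw counts: the 74 extra syntactic words are spread over 71 multi-word tokens.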
Morphology
Tags
- This corpus uses 15 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB
- This corpus does not use the following tags: SYM, X
- This corpus contains 44 word types tagged as particles (PART): 'nar, 's, An, Nir, Ua, Ui, Uí, a, ar, d', d'á, da, dar, dho, do, dob, dá, fa'r, far, go, gor, gu, gur, lér', mac, mic, na, nach, nachar, nar, ni, nior, noch, ná, nách, nár, ní, níor, níos, ro, ré, Í, Ó, ór'
- This corpus contains 22 lemmas tagged as pronouns (PRON): a, ar, cad, ceachtar, cé, ea, féin, iad, mise, mé, seisean, seo, siad, sibh, sin, sinn, sé, sí, tusa, tú, é, í
- This corpus contains 14 lemmas tagged as determiners (DET): a, an, aon, bhur, do, eile, gach, mo, na, seo, sin, uile, ár, úd
- Out of the above, 3 lemmas occurred sometimes as PRON and sometimes as DET: a, seo, sin
- This corpus contains 1 lemma tagged as an auxiliary (AUX): is
- There are 4 (de)verbal forms:
- Cop
- AUX: is, gurab, nach, as, ba, ní, fá, gur, nar, budh
- PART: dob
- SCONJ: 's, mas, ós
- Inf
- NOUN: bheith, chur, dhéanamh, déanamh, rádh, scríobhadh, thabhairt, adhradh, breith, bualadh
- Part
- ADJ: Cinnte, beite, briste, ceaptha, cuachta, cuirthe, dúnta, foghlumtha, fágtha, fáighte
- Vnoun
- NOUN: cur, tabhairt, teacht, dul, gabháil, baint, breith, brúghadh, ceapadh, dol
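The PRON/DET overlap reported in the Tags list above can be reproduced directly from the two lemma lists given there. The snippet below is only a small check on the listed values; the variable names are arbitrary.

```python
# Lemma lists copied from the Tags section above.
pron = ("a ar cad ceachtar cé ea féin iad mise mé seisean seo "
        "siad sibh sin sinn sé sí tusa tú é í").split()
det = "a an aon bhur do eile gach mo na seo sin uile ár úd".split()

# Lemmas that occur with both tags.
shared = sorted(set(pron) & set(det))
print(len(pron), len(det), shared)
```

This prints `22 14 ['a', 'seo', 'sin']`, matching the counts and the three ambiguous lemmas listed above.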
Nominal Features
- Fem
- ADJ: ghloin, mhaith, shuthoin, aintréin, bheg, bhán, breagh, buidhe, direach, doirbhe
- ADP: aici, uirre, dhi, di, lei, léithi, ría
- DET: na, a, n-a, ná
- NOUN: beatha, leith, bliadhna, cuid, oidhche, réir, thoil, laimh, linn, láimh
- PRON: sí, í, si, sise
- PROPN: Éireann, Éirinn, Danann, Eireann, Uladh, n-Éirinn, Callain, Casga, Chill, Chnámhchoill
- Masc
- ADJ: beag, mór, dil, maith, Caoimh, Caoin, Sasanach, aisteach, allta, amhain
- ADP: ann, aige, air, 'na, na, d'á, dá, dó, as, da
- DET: a, an, n-a, do, na
- NOUN: lá, duine, fhios, saoghal, bith, fear, la, ainm, creidimh, mac
- PRON: sé, é, hé, se, e, seision, eisean, seisean
- PROPN: Iósa, Dia, Dé, Sacsaibh, Ursula, Beare, Bheannchair, Bhuck, Comhghall, Dhia
- Plur
- ADJ: dearga, dlightheacha, doirbhe, dubha, feósaidhe, forruadha, maslaightheacha, ruadha, uathbhásach', éugcruaidhe
- ADP: aca, againn, díobh, orra, riú, 'na, da, dhóibh, diobh, leó
- DET: na, a, ar, bhur, mur, bar, bhar
- NOUN: bliadhna, Riogha, comhmbráithribh, daoine, dhaoinibh, dáoinibh, eachtrann, fhearaibh, mac, monairc
- PRON: siad, iad, sibh, iád, hiád, sibh-se, sinn, siád, íad
- PROPN: Sacsaibh, Gall, Caomhánaigh, Fhidhic, Gaileoin, Grég, Romhán, Tuathaibh, gConnachtaibh, mead
- VERB: Táid, atáid, Bhamar, anfaidís, bhfacadar, bhfuairsiod, bhfúairset, bhíd, bhídís, choimhédadar
- Sing
- ADJ: beag, mór, breagh, buidhe, dil, dubh, fada, feargach, ghloin, maith
- ADP: ann, aige, air, 'na, agam, liom, don, na, d'á, dá
- DET: an, a, mo, na, do, m', d, n-a, t, mó
- NOUN: lá, duine, fhios, saoghal, tan, beatha, bith, la, leith, ainm
- PRON: sé, é, mé, tú, hé, sí, tusa, se, í, mise
- PROPN: Iósa, dia, Éireann, Uladh, Éirinn, Dé, Ursula, Beare, Bheannchair, Bhuck
- VERB: déna, fuairais, Bíodh, Tuig, biodh, chualas, féuch, thrasgrais, ttugais, Bhíos
- Dat
- ADJ: aintréin
- NOUN: leith, laimh, láimh, comhmbráithribh, dhaoinibh, droing, dáoinibh, fhearaibh, gcéill, Bhreathnachaibh
- PROPN: Éirinn, Sacsaibh, n-Éirinn, Tuathaibh, gConnachtaibh
- Gen
- ADJ: Caoimh, allta, bhig, bhuidhe, buidhe, caitliceach, chatharmaigh, chruim, faon, gallda
- DET: na, an, ná
- NOUN: creidimh, Mathghamhna, anma, athar, cogaidh, domhain, eachtrann, fhir, hoidhche, mac
- PROPN: Éireann, Uladh, Dé, Bheannchair, Danann, Eireann, Gall, Laighean, Bhuck, Cairbre
- Nom
- ADJ: beag, mór, breagh, dubh, fada, feargach, ghloin, maith, mhaith, shuthoin
- NOUN: lá, duine, fhios, saoghal, tan, bith, la, ainm, beatha, bliadhna
- PROPN: Dia, Iósa, Ursula, Beare, Comhghall, Dhia, Dhía, Iosa, Pilát, Séamas
- Voc
- ADJ: dil
- NOUN: thighearna, léaghthóir, shaoghail, athuir, bhuachaill, chealguire, dhaltha, mheic, rún, úa
- PROPN: Thoirrdhealbhaidh
- Def
- DET: an, na, gach, gac, 'e, 'n, ná, san
- NOUN: tan, fhios, la, lá, shaoghail, éis, ais, bhfearann, bhflaitheas, duine
- NOUN-Inf: admhail, bhfhaicsin, choinneáil, chur, dhol, dhéanamh, fhios, geimhliughadh, leighes, mheas
- PROPN: Iósa, dia, Éireann, Buck, Uladh, Éirinn, Dé, Sacsaibh, Ursula, Bangor
Degree and Polarity
- Cmp
- ADJ: mó
- Cmp,Sup
- ADJ: buaine, mó, fearr, luaithe, mo, fheárr, mhó
- Pos
- ADJ: maith, amháin, iomdha, ionann, mór, cóir, geal, mithidh, sásta, Mó
- Neg
- AUX-Cop: nach, ní, nar, ni, nír, muna, ná'r
- PART: ní, nach, ná, níor, nior, na, nách, Nir, nachar, nar
- VERB: rabh, raibh, bheadh, fhuil, Níl, anfaidís, bainfeadh, beadh, bean, bhena
Verbal Features
- Hab
- VERB: bhíd, bhíonn, biodh, bionn
- Imp (Aspect)
- VERB: bhíodh, Teidheadh, bhídís, chleachtadh, mbiodh, ndéantaoi, riomhthaoi, thógadh, tugadh
- Cnd
- VERB: mbeadh, bheadh, leigfeadh, rachadh, Bhrisfeadh, Tharraingeochadh, anfaidís, bainfeadh, beadh, bhiadh
- Imp (Mood)
- PART: na, ná
- VERB: déna, Bíodh, Tuig, féuch, Biodh, Smuain, Tabhair, Treig, abair, bean
- Ind
- VERB: bhí, raibh, rabh, adubhairt, bhfuil, thug, atá, tug, arsa, lean
- Int
- AUX-Cop: Nach, an
- Sub
- PART: Go
- VERB: bhfuilnge, mbeith, mbera, n-iarra, sábháilidh, ttugadh
- Fut
- VERB: Brisfidh, Inneosad, bhaileochaidh, bheidh, bhias, bhuailfidh, bhéas, chreidfe, chreidfios, chuire
- Past
- AUX-Cop: gurab, ba, fá, gur, nar, budh, dobadh, nír, dob, dobudh
- PART: gur, níor, nior, 'nar, Nir, a, ar, dar, dob, fa'r
- PART-Cop: dob
- VERB: bhí, raibh, rabh, adubhairt, thug, tug, arsa, lean, bhíodh, chuir
- Pres
- AUX-Cop: is, nach, gurab, as, ní, ni, Ag, an, darab, gor
- VERB: bhfuil, atá, ta, tá, a-ta, Adeir, Táid, atáid, fhuil, ngairthear
Pronouns, Determiners, Quantifiers
- Art
- ADP: don, ó'n, 'sa', 'san, san, annsa, den, do'n, de'n, isin
- DET: an, na, 'e, 'n, ná, san
- Dem
- DET: eile, so, sin, soin, seo, se, úd, adaí, oile
- PRON: sin, so, Seo, shoin
- Emp
- ADP: agam-sa, orm-sa, chughaibh-se, dhamhsa, dhuit-si, dhíobhse, duit-si, fúthasan, ionnadsa, ionnamsa
- PRON: tusa, mise, meisi, seision, eisean, seisean, sibh-se, sise
- VERB: mbéidision
- Ind
- DET: aon, uile, ein
- PRON: ceachtar
- Int
- ADV: cá
- PRON: cé, cad, Cía, Gidh, cia, créd, céard
- Prs
- ADP: 'á, ghá
- Rel
- ADP: ar, d'á, d'ár, dá
- AUX-Cop: nach, nar, dobadh, fá, ba, dob, dobudh
- PART: a, do, d'á, dho, noch, da, dá, 'nar, ar, dar
- PRON: a, ar
- VERB: atá, tá, a-ta, bhias, áta
- Card
- NUM: trí, ceithre, dhá, tri, míle, sé, ceid, cuig, céad, céid
- Ord
- NUM: dara, treas, chéad, naomhadh, seiseadh, t-ochtmhadh, mhíle
- Yes (Poss)
- ADP: 'na, da, d'á, na, dá, ana, ina, dhá, 'n-a, 'á
- DET: a, mo, do, ar, m', d, n-a, bhur, t, mur
- Yes (Reflex)
- PRON: féin, fein, fhéin
- 0
- VERB: cuireadh, ngairthear, ngoirthear, rugadh, adeirthear, buaileadh, chonnaictheas, deirtear, dtáinigtheas, dubhradh
- 1
- ADP: agam, liom, againn, orm, agam-sa, dhom, dhíom, linn, orm-sa, asam
- DET: mo, ar, m', mó
- PRON: mé, mise, meisi, me, mhé, sinn
- VERB: chualas, Bhamar, Bhíos, Dhíolas, Feicim, Fuaras, Guidhim, Inneosad, Rugas, Thangas
- 2
- ADP: dhuit, leat, libh, duit, ort, uáit, agad, agaibh, chugaibh, chughaibh-se
- DET: do, d, bhur, t, mur, th', bar, bhar, d', t'
- PRON: tú, sibh, tusa, sibh-se, thú, tu
- VERB: déna, fuairais, Tuig, féuch, thrasgrais, ttugais, Rugais, Smuain, Tabhair, Treig
- 3
- ADP: ann, 'na, aige, air, aca, da, d'á, na, dá, díobh
- DET: a, n-a, do, na
- PRON: sé, é, siad, hé, iad, sí, se, í, e, ea
- VERB: Bíodh, Táid, atáid, biodh, anfaidís, bhfacadar, bhfuairsiod, bhfúairset, bhíd, bhídís
Other Features
- Abbr
- Yes
- ADJ: .i.
- ADV: .i.
- Foreign
- Yes
- PROPN: Buck, Bhuck, Hanmer, Bangor, Dyea, François, Hibernia, Hiberus, Klondike, Mercedes
- Form
- Direct
- PART: a, do, noch, ro
- VERB: atá, tá, a-ta, áta
- Direct,Len
- PART: dho
- Ecl
- AUX-Cop: mba
- DET: gach
- NOUN: bhfearann, bhflaitheas, ndeireadh, bhfear, ccoir, dtaobh, gcomhnaidhe, gcédna, gcéill, n-áit
- NOUN-Inf: bhfhaicsin, bplanntughadh, mbeith, ndol, ngabail, ngeineamhain, ttecht
- NUM: naon, náon, ttrí
- PROPN: n-Éirinn, bhFailghe, bhFréamhainn, gConnachtaibh, mBaile, n-Áird, nAodh, nAssardha, neabhra
- VERB: bhfuil, mbeadh, ngairthear, ngoirthear, ttugadh, ttugais, bhfacadar, bhfhuilim, bhfuair, bhfuairsiod
- Ecl,Emp
- NOUN: Ndíasa, mbreithirsean, natharsa
- Emp
- NOUN: ainmsean, monairc-si, sonsan, tsáoghailsi
- Emp,Len
- NOUN: chroidhe-se
- VERB: ghlacadarsan
- HPref
- ADJ: haereach, haireach, holc
- NOUN: hoidhche, Hiudaighe, Híudaidhe, h-ala, h-anfa, h-éin, haicmeadha, haimsir, haimsire, haithrighe
- PRON: hé, hiád
- PROPN: hAodh, hÉireann
- VERB: háitigheadh
- Indirect
- PART: a, d'á, da, dá, 'nar, ar, dar, fa'r, far, lér'
- Len
- ADJ: mhór, ghloin, mhaith, shuthoin, bheg, bhig, bhuidhe, bhán, chatharmaigh, cheart
- ADP: dhe, dho, dhochum, dhom, dhá, dhíom, dhó, dhamh, dhamhsa, dhi
- NOUN: bheith, fhios, chur, dhéanamh, shaoghail, thoil, thabhairt, thighearna, thús, bhocsa
- NOUN-Inf: bheith, chur, dhéanamh, thabhairt, bhriseadh, choinneáil, chongbhail, chuma, dhol, dhul
- NUM: dhá, chéad, mhíle, thrí, cheithre, chúig, dhó, fhichid, sheachtmhoghad, tri
- PART: dho
- PRON: fhéin, mhé, shoin, thú
- PROPN: Bheannchair, Bhuck, Dhia, Dhía, Chairbre, Chesar, Chill, Chomhghaill, Chomhghall, Chriosd
- SCONJ: dhá
- VERB: bhí, thug, bheadh, bhíodh, chuir, Dhearc, bhi, chualas, fhuil, fhág
- VF
- AUX-Cop: gurab, darab, dob, dárab
- PART-Cop: dob
- NounType
- NotSlender
- ADJ: dearga, doirbhe, dubha, feósaidhe, forruadha, ruadha, úra
- Slender
- ADJ: uathbhásach'
- Strong
- NOUN: Níuduidheadh, bpóilíní, dhearbhraithreach, dtairngeadh, dtairrngeadh, mbáillí, ndáoine, neitheadh
- PROPN: nAssardha
- Weak
- NOUN: eachtrann, mac, Bolg, Gall, bhflaitheas, ccóigidh, crecht, deisgiobal, fear, gcor
- PROPN: Gall, Grég, Romhán, mead
- PartType
- Ad
- PART: go, gu
- Cmpl
- PART: go, gu, nach, ná, nachar, nar, nách, nár
- Comp
- PART: níos, a
- Inf
- PART: do, a, d', dho
- Pat
- PART: mac, Ua, Ó, Ui, Uí, mic, Í
- Sup
- PART: 's, dob
- PART-Cop: dob
- Vb
- PART: do, a, ní, d', gur, An, níor, dho, go, nior
- Voc
- PART: a
- PrepForm
- Cmpd
- ADP: i, ar, do, re, tar, d', go, le, ós, ima
- NOUN: leith, eis, linn, ndiaidh, nós, reír, éis, cceann, coinne, cois
- Typo
- Yes
- NOUN: Righthigh
- SCONJ: da
- VERB: ndubhairt
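Listings like the ones above group word forms by feature value and UPOS tag. A minimal sketch of that grouping, working from the FEATS column, is shown below. The function name is hypothetical, and the four example rows use forms and values from the Form and Reflex listings above, with their feature strings abridged to those values.

```python
from collections import defaultdict


def feature_index(rows):
    """Map feature name -> value -> UPOS -> set of word forms."""
    index = defaultdict(lambda: defaultdict(lambda: defaultdict(set)))
    for form, upos, feats in rows:
        if feats == "_":                     # no features on this word
            continue
        for item in feats.split("|"):        # features are separated by "|"
            name, value = item.split("=", 1)
            index[name][value][upos].add(form)
    return index


# (form, UPOS, FEATS) triples; feature strings abridged for illustration.
rows = [
    ("hé", "PRON", "Form=HPref"),
    ("fhéin", "PRON", "Form=Len|Reflex=Yes"),
    ("dho", "PART", "Form=Direct,Len"),
    ("bhí", "VERB", "Form=Len"),
]
index = feature_index(rows)
for value in sorted(index["Form"]):
    for upos in sorted(index["Form"][value]):
        print(value, upos, sorted(index["Form"][value][upos]))
```

Multi-valued features such as `Direct,Len` are kept as a single value here, which is how they appear in the listing above.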
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): is.
- This corpus does not contain auxiliaries: the aux relation is unused, and the only AUX-tagged lemma, is, is the copula.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (1)
- VERB--NOUN-Gen (1)
- VERB--NOUN-Nom (97)
- VERB--NOUN-Nom-ADP(ach) (2)
- VERB--NOUN-Nom-ADP(le) (1)
- VERB--PRON (84)
- obj
- VERB--NOUN-Gen (1)
- VERB--NOUN-Nom (100)
- VERB--PRON (27)
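Patterns such as `VERB--NOUN-Nom` and `VERB--NOUN-Nom-ADP(le)` combine the parent's tag, the child's tag, the child's Case value if it has one, and any adposition attached to the child. The sketch below shows one plausible way to derive them; the exact procedure behind the counts above is not documented here, and the function names and the toy sentence are hypothetical.

```python
from collections import Counter


def parse_feats(feats: str) -> dict:
    """Turn a FEATS string such as 'Case=Nom|Number=Sing' into a dict."""
    if feats in ("_", ""):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def argument_patterns(sentence) -> Counter:
    """Count nsubj/obj patterns between VERB parents and NOUN/PRON children."""
    by_id = {w["id"]: w for w in sentence}
    counts = Counter()
    for w in sentence:
        if w["deprel"] not in ("nsubj", "obj") or w["upos"] not in ("NOUN", "PRON"):
            continue
        parent = by_id.get(w["head"])
        if parent is None or parent["upos"] != "VERB":
            continue
        pattern = f"VERB--{w['upos']}"
        case = parse_feats(w["feats"]).get("Case")
        if case:
            pattern += f"-{case}"
        # Adpositions attached to the argument via the case relation.
        adps = sorted(c["lemma"] for c in sentence
                      if c["head"] == w["id"] and c["deprel"] == "case"
                      and c["upos"] == "ADP")
        for lemma in adps:
            pattern += f"-ADP({lemma})"
        counts[(w["deprel"], pattern)] += 1
    return counts


# Toy sentence, invented for illustration (not taken from the treebank).
sentence = [
    {"id": 1, "lemma": "cuir", "upos": "VERB", "feats": "_", "head": 0, "deprel": "root"},
    {"id": 2, "lemma": "sé", "upos": "PRON", "feats": "_", "head": 1, "deprel": "nsubj"},
    {"id": 3, "lemma": "litir", "upos": "NOUN", "feats": "Case=Nom", "head": 1, "deprel": "obj"},
]
patterns = argument_patterns(sentence)
for (rel, pattern), n in sorted(patterns.items()):
    print(rel, pattern, n)
```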
Relations Overview
- This corpus uses 11 relation subtypes: acl:relcl, case:voc, compound:prt, csubj:cleft, csubj:cop, flat:name, mark:prt, nmod:poss, obl:prep, obl:tmod, xcomp:pred
- The following 2 main types are never used alone; they are always subtyped: compound, csubj
- The following 10 relation types are not used in this corpus at all: iobj, expl, discourse, aux, clf, list, orphan, goeswith, reparandum, dep
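The three statements above can be checked against the relation inventory in the Relations section and the 37 universal relations defined by UD v2. The snippet below is a small consistency check on the listed values; the variable names are arbitrary.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL = """acl advcl advmod amod appos aux case cc ccomp clf compound conj
cop csubj dep det discourse dislocated expl fixed flat goeswith iobj list mark
nmod nsubj nummod obj obl orphan parataxis punct reparandum root vocative
xcomp""".split()

# Relations used in this treebank, copied from the Relations section above.
used = """acl acl:relcl advcl advmod amod appos case case:voc cc ccomp
compound:prt conj cop csubj:cleft csubj:cop det dislocated fixed flat flat:name
mark mark:prt nmod nmod:poss nsubj nummod obj obl obl:prep obl:tmod parataxis
punct root vocative xcomp xcomp:pred""".split()

subtypes = sorted(r for r in used if ":" in r)
main_used = {r.split(":")[0] for r in used}
always_subtyped = sorted(m for m in main_used if m not in used)
unused = sorted(set(UNIVERSAL) - main_used)

print(len(subtypes), subtypes)
print(always_subtyped)
print(len(unused), unused)
```

This yields 11 subtypes, the two always-subtyped main types `compound` and `csubj`, and the same 10 unused relations listed above (printed in alphabetical order).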