UD Scottish Gaelic ARCOSG
Language: Scottish Gaelic (code: gd)
Family: IE
This treebank has been part of Universal Dependencies since the UD v2.5 release.
The following people have contributed to making this treebank part of UD: Colin Batchelor.
Repository: UD_Scottish_Gaelic-ARCOSG
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: nonfiction, fiction, news, spoken
Questions, comments? General annotation questions (either Scottish Gaelic-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [colin • r • batchelor (æt) googlemail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | annotated manually in non-UD style, automatically converted to UD |
UPOS | annotated manually in non-UD style, automatically converted to UD |
XPOS | annotated manually |
Features | annotated manually in non-UD style, automatically converted to UD |
Relations | annotated manually in non-UD style, automatically converted to UD |
Description
A treebank of Scottish Gaelic based on the Annotated Reference Corpus Of Scottish Gaelic (ARCOSG).
The Scottish Gaelic treebank takes its data from ARCOSG, the Annotated Reference Corpus of Scottish Gaelic (Lamb et al. 2016), with an annotation scheme based on that of the Irish UD treebank. Full bibliographic details of the source texts are given in the ARCOSG documentation.
It contains eight subcorpora, each made up of a varying number of original files of approximately 1000 tokens each. All files listed below are in the training set unless they are explicitly marked as being in test or dev. In the ARCOSG documentation the names of contributors are largely given in Gaelic; I have kept these and glossed them with the English names where those will be more familiar to non-Gaelic speakers.
- Conversation. c01 is in test, c03 in dev and the rest in train. These are transcripts of interviews in the Western Isles from 1998 to 2000. In c03 and c04 speakers 2, 4 and 5 are children.
- Sport. s06 is in test, s08 in dev and the rest in train. s01 to s05 are Radio nan Gàidheal commentary on a match between Scotland and Australia; s06 to s10 on Scotland vs. Yugoslavia.
- Oral narrative.
- n01: Na Trì Leinntean Canaich (test)
- n02: Conall Gulban (dev)
- n03: Na Fiantaichean
- n04: Gille an Fheadain Duibh
- n05: Bodach Ròcabarraigh
- n06: Iain Beag MacAnndra
- n07: Fear a’ Churracain Ghlais
- n08: Boban Saor
- n09: Bean ‘ic Odrum
- n10: Blàr Chàirinis
- News scripts from Radio nan Gàidheal in the early 1990s.
- ns01: Màiri Anna NicUalraig (Mary Ann Kennedy)
- ns02: Dòmhnall Moireasdan
- ns03: Iseabail NicIllinnein
- ns04: Innes Rothach
- ns05: Innes Rothach (test)
- ns06: Pàdraig MacAmhlaigh (dev)
- ns07: Dòmhnall Moireasdan (test)
- ns08: Màiri Anna NicUalraig (dev)
- ns09: Seumas Domhnallach
- ns10: Seumas Domhnallach
- Public interview
- p01: Peataichean, conversation on Coinneach MacÌomhair’s programme
- p02: Fred MacAulay and Martin MacDonald
- p03: John MacInnes and William Matheson
- p04: Geamaichean Sholais 1, conversation on Coinneach MacÌomhair’s programme (test)
- p05: Geamaichean Sholais 2 (dev)
- p06: Bonn Comhraidh, 1980s political discussion programme
- p07: Conversation on Coinneach MacÌomhair’s programme 2000-01-17 part 1
- p08: Conversation on Coinneach MacÌomhair’s programme 2000-01-17 part 2
- Fiction
- f01: Am Fainne by Eilidh Watt
- f02: from Cùmhnantan by Tormod MacGill-Eain
- f03: Droch Àm by Pòl MacAonghais (test)
- f04: Spàl Tìm by Cailean T. MacCoinneach
- f05: Teine a Loisgeas by Eilidh Watt
- f06: Beul na h-Oidhche by Somhairle MacGill-Eain (Sorley Maclean)
- f07: from An t-Aonaran by Iain Mac a’ Ghobhainn (Iain Crichton Smith)
- f08: Briseadh na Cloiche by Iain Moireach (dev)
- Formal prose:
- fp01: Trì Ginealaichean by D. E. Dòmhnallach
- fp02: Nua-Bhàrdachd Ghàidhlig by Dòmhnall MacAmhlaigh (Donald MacAulay)
- fp03: Mairead N. Lachlainn by Somhairle MacGill-Eain (test)
- fp04: from Bith-eòlas (‘Biology’), a translation by Ruairidh MacThòmais (Derick Thomson)
- fp05: Aramach am Bearnaraidh
- fp06: Blàr a’ Chumhaing by Iain A. MacDonald
- fp07: Na Marbhrannan by Coinneach D. MacDhòmhnaill
- fp08: Cainnt is Cànan by J. MacInnes
- fp09: from Dòmhnall Uilleam Stiùbhart (Donald William Stewart)’s unpublished PhD thesis (dev)
- Popular writing: columns from The Scotsman:
- pw01: An Cuir am Papa… by Aileig O Hianlaidh (Alex O’Henley)
- pw02: A bith mar Chorra… by Joina NicDhomnaill (test)
- pw03: Pàdraig Sellar by Ùisdean MacIllinnein
- pw04: A’ Cur Às Dhuinn Fhìn by Aonghas Mac-a-Phì
- pw05: Aon Dùthaich by Murchadh MacLeòid
- pw06: Blas a’ Ghuga by Coinneach MacLeòid (dev)
- pw07: Luchd-ciùil by Criosaidh Dick
- pw08: Na Gàidheil Ùra by Criosaidh Dick
- pw09: A’ Siubhail gu Rèidh by Tormod Domhnallach (dev)
- pw10: Poileaticeans by Niall M. Brownlie
- pw11: Oifigeir Gàidhlig by Aileig O Hianlaidh (test)
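The split assignments listed above can be collected into a small lookup. This is only a sketch and not part of the treebank tooling: the names `TEST_FILES`, `DEV_FILES` and `split_of` are made up for illustration, and the file identifiers are copied from the lists above.

```python
# File identifiers explicitly marked "test" or "dev" in the subcorpus lists.
TEST_FILES = {"c01", "s06", "n01", "ns05", "ns07", "p04", "f03", "fp03", "pw02", "pw11"}
DEV_FILES = {"c03", "s08", "n02", "ns06", "ns08", "p05", "f08", "fp09", "pw06", "pw09"}


def split_of(file_id: str) -> str:
    """Return 'test', 'dev' or 'train' for an ARCOSG file identifier."""
    if file_id in TEST_FILES:
        return "test"
    if file_id in DEV_FILES:
        return "dev"
    # Every file that is not explicitly marked belongs to the training set.
    return "train"


print(split_of("ns07"), split_of("pw06"), split_of("n05"))
```

With this lookup, each of test and dev receives ten files, and all remaining files fall through to train.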
See https://universaldependencies.org/gd/index.html for detailed linguistic documentation.
Acknowledgments
We wish to thank all of the contributors to ARCOSG and fellow Celtic language UD developers Teresa Lynn, Kevin Scannell, Johannes Heinecke and Fran Tyers.
References
- Colin Batchelor, 2019. Universal dependencies for Scottish Gaelic: syntax, in Proceedings of CLTW2019 at Machine Translation Summit XVII, Dublin, August
- Lamb, William, Sharon Arbuthnot, Susanna Naismith, and Samuel Danso. 2016. Annotated Reference Corpus of Scottish Gaelic (ARCOSG), 1997–2016 [dataset]. Technical report, University of Edinburgh; School of Literatures, Languages and Cultures; Celtic and Scottish Studies. https://doi.org/10.7488/ds/1411.
- Lynn, Teresa and Jennifer Foster, [Universal Dependencies for Irish](http://www.nclt.dcu.ie/~tlynn/Lynn_CLTW2016.pdf), CLTW 2016, Paris, France, July 2016
Statistics of UD Scottish Gaelic ARCOSG
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Case – Definite – Degree – Foreign – Form – Gender – Mood – Number – NumForm – NumType – PartType – Person – Polarity – Poss – PronType – Reflex – Tense – Typo – VerbForm
Relations
acl – acl:relcl – advcl – advmod – amod – appos – aux:pass – case – case:voc – cc – ccomp – compound – conj – cop – csubj:cleft – csubj:cop – csubj:outer – dep – det – discourse – dislocated – fixed – flat – flat:foreign – flat:name – mark – mark:prt – nmod – nmod:poss – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:smod – obl:tmod – orphan – parataxis – punct – reparandum – root – vocative – xcomp – xcomp:pred
Tokenization and Word Segmentation
- This corpus contains 4741 sentences, 86089 tokens and 89958 syntactic words.
- This corpus contains 5200 tokens (6%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 1242 types of words that contain both letters and punctuation. Examples: a', 's, a’, [Name], a-mach, b', 'n, ars’, a-steach, bh', co-dhiubh, th', a-staigh, a's, [Placename], h-uile, ’s, ma-thà, an-diugh, a-rithist, ars', dh’fhalbh, 'dol, a-null, d’, a-nis, h-Alba, a-nuas, ge-tà, 'm, h-eileanan, a-muigh, ‘s, a-nise, 'g, a-sin, taobh-sa, a-nall, a-rèir, 'ic, dh’Alba, an-dràsda, h-Astràilianaich, a-seo, dh’fhàg, co-dhiù, ‘n, b’, d', dh’fheuch
- This corpus contains 3835 multi-word tokens. On average, one multi-word token consists of 2.01 syntactic words.
- There are 230 types of multi-word tokens. Examples: ann, aca, air, ga, aige, dhan, 'se, dha, agad, leatha, ris, 'na, 'ga, orra, againn, dhaibh, na, san, dhen, sa, agam, dhiubh, se, 'sa, a'm, riutha, leis, 'san, aice, bhon, dhuinn, oirre, dhomh, dhut, mun, roimhe, às, agaibh, den, dheth, gan, dhi, leotha, dhe, dhuibh, don, fodha, ort, rium, orm.
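The figures above are mutually consistent, which can be checked with a little arithmetic. Each multi-word token counts once as a token but contributes all of its parts as syntactic words, so the surplus of words over tokens, divided by the number of multi-word tokens, gives the average number of extra words per multi-word token. The variable names below are illustrative; the numbers are the ones quoted in the list.

```python
tokens = 86089          # surface tokens
words = 89958           # syntactic words
multiword_tokens = 3835
no_space_after = 5200   # tokens not followed by a space

# Each multi-word token of n words adds (n - 1) to the word count.
extra_words = words - tokens
avg_words_per_mwt = 1 + extra_words / multiword_tokens

# Share of tokens that are not followed by a space, as a percentage.
no_space_percent = 100 * no_space_after / tokens

print(extra_words)                  # 3869
print(round(avg_words_per_mwt, 2))  # 2.01
print(round(no_space_percent))      # 6
```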
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 45 word types tagged as particles (PART): 'g, 'ga, 'ic, 'm, 'n, 's, Mc, Mhic, Nic, O, a, a', ach, ag, air, am, an, a’, b', bu, cha, chan, d', do, d’, g', ga, gan, gu, gum, gun, gur, mac, na, nach, nam, nan, nas, r', ri, Ó, ‘, ‘n, ’, ’g
- This corpus contains 77 lemmas tagged as pronouns (PRON): 'd, 'n, a, a-chèile, a-seo, a-sin, a-siud, aige, an, ann, ar, b'e, bith, brith, bè, c'à, car, carson, cia, ciamar, co, cuin, cuin', cuine, cà, cà', càil, càit, càit', càite, cáit, cèile, céile, cò, có, diamar, do, dè, dé, e, fein, fèin, féin, gar, ge, gu, i, iad, mar, mheud, mi, mis', mo, na, péin, sean, seo, seothach, shean, shin, sib', sibh, sibh-se, sin, sineach, sinn, siod, siodach, siud, siudach, son, thu, thus', ur, àsan, è, ì
- This corpus contains 19 lemmas tagged as determiners (DET): 'sa, a, a', an, ar, do, eile, gach, mo, sa, san, seo, sin, sineach, siud, the, ud, uile, ur
- Out of the above, 10 lemmas occurred sometimes as PRON and sometimes as DET: a, an, ar, do, mo, seo, sin, sineach, siud, ur
- This corpus contains 2 lemmas tagged as auxiliaries (AUX): is, rach
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: rach
- There are 3 (de)verbal forms:
- Fin
- AUX: chaidh, thèid, deach, tèid, rachadh
- VERB: tha, bha, robh, eil, bheil, chaidh, bhiodh, fhuair, thuirt, ars’
- Inf
- NOUN: bhith, chur, dhèanamh, thoirt, dhol, feuchainn, cur, ràdh, toirt, thogail
- Vnoun
- NOUN: dol, ràdh, tighinn, feuchainn, iarraidh, faighinn, cur, dèanamh, coimhead, ruith
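The overlap between the PRON and DET lemma lists reported above can be reproduced as a set intersection. This is a sketch over the lemma lists as printed in this section, not over the treebank files themselves; the variable names are illustrative.

```python
# Lemma lists copied from the PRON and DET entries above.
PRON_LEMMAS = set(
    "'d, 'n, a, a-chèile, a-seo, a-sin, a-siud, aige, an, ann, ar, b'e, bith, "
    "brith, bè, c'à, car, carson, cia, ciamar, co, cuin, cuin', cuine, cà, cà', "
    "càil, càit, càit', càite, cáit, cèile, céile, cò, có, diamar, do, dè, dé, e, "
    "fein, fèin, féin, gar, ge, gu, i, iad, mar, mheud, mi, mis', mo, na, péin, "
    "sean, seo, seothach, shean, shin, sib', sibh, sibh-se, sin, sineach, sinn, "
    "siod, siodach, siud, siudach, son, thu, thus', ur, àsan, è, ì".split(", ")
)
DET_LEMMAS = set(
    "'sa, a, a', an, ar, do, eile, gach, mo, sa, san, seo, sin, sineach, siud, "
    "the, ud, uile, ur".split(", ")
)

# Lemmas that occur sometimes as PRON and sometimes as DET.
both = sorted(PRON_LEMMAS & DET_LEMMAS)
print(len(both), both)
```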
Nominal Features
- Fem
- ADJ: eile, mhòr, ùr, àrd, shaor, mhath, bheag, mhór, beaga, Buidhe
- DET: na, an, a’, a, a', 'n, nan, nam, ‘n, am
- NOUN: bliadhna, buille, bhliadhna, obair, cuid, Gàidhlig, tè, aghaidh, dòigh, leithid
- PRON: i, a, ise, h-i, h-ì
- PROPN: [Name], Màiri, Anna, Mairearad, Inis, Catrìona, Ann, Mo, Sarah, Belle
- Masc
- ADJ: eile, beag, ùr, mòr, math, mór, òg, dubh, ghoirid, ùra
- DET: an, na, a’, a, am, nan, a', 'n, nam, 'm
- NOUN: duine, fear, fhios, taobh, rud, daoine, latha, àite, taigh, leth
- PRON: e, esan, a, h-e, ise, è, aige, mise, sinne
- PROPN: [Name], Iain, Dòmhnall, Tormod, Mhurchaidh, Alasdair, Aonghais, Garaidh, Labhruinn, lain
- Dual
- ADJ: bhuig
- DET: an
- NOUN: bhois, bhròig, cuidhean
- Plur
- ADJ: eile, ùra, beaga, mòra, móra, dùthchail, annasach, làidir, Èireannach, ionadail
- DET: na, nan, an, nam, am, ar, ur, ’n, 'n, 'm
- NOUN: daoine, dhaoine, cluicheadairean, rudan, h-Astràilianaich, h-eileanan, beathaichean, mionaidean, bliadhnaichean, cùisean
- PRON: iad, sinn, sibh, an, iadsan, sinne, sibhse, ar, àsan, ur
- Sing
- ADJ: eile, ùr, beag, mhòr, mòr, math, àrd, ghoirid, òg, mór
- DET: an, a’, a, na, a', am, 'n, mo, do, 'm
- NOUN: duine, fear, fhios, taobh, rud, latha, àite, bliadhna, buille, taigh
- PRON: e, i, mi, a, thu, esan, mise, ise, tu, thusa
- Dat
- ADJ: eile, ùr, ghoirid, ùra, dubh, mór, móra, Albannach, Eòrpach, annasach
- NOUN: taobh, àite, àm, aghaidh, leth, thaobh, duine, dòigh, ceann, bhliadhna
- PROPN: [Name], Dòmhnall, Iain, Dhòmhnall, Garaidh, Labhruinn, Màiri, Tormod, dh’[Name], lain
- Gen
- ADJ: eile, Ghlais, àrd, Buidhe, mhòir, Bhàin, ùr, Ruaidh, bhig, dùthchail
- DET: na, an, a’, nan, a', nam, am, a
- NOUN: bliadhna, Gàidhlig, pàirce, latha, obrach, taighe, dùthcha, dìon, pàrlamaid, Gaidhealtachd
- PROPN: [Name], Iain, Dhòmhnaill, Sheumais, lain, Dhonnchaidh, Brìde, Lachlainn, Mairearaid, Ràghaill
- Nom
- ADJ: eile, ùr, beag, mhòr, mòr, math, shaor, àrd, òg, mòra
- NOUN: fhios, fear, duine, rud, daoine, ball, latha, buille, bliadhna, taobh
- PROPN: [Name], Iain, Dòmhnall, Màiri, Tormod, Alasdair, Anna, Eachann, Garaidh, Murchadh
- Voc
- ADJ: dhuibh, bhochd, òig
- NOUN: dhuine, 'ille, Rìgh, ghràidh, 'illean, bhalaich, ghràidhein, 'ill', bheadragain, bhròinein
- PROPN: [Name], Mhurchaidh, Aonghais, Iain, Raghnaill, Dhòmhnaill, Anna, Choinnich, Sheonaidh, Ann
- Def
- DET: an, na, a’, a', am, nan, 'n, nam, 'm, ‘n
Degree and Polarity
- Cmp,Sup
- ADJ: fhearr, fhaide, fheàrr, motha, mhotha, lugha, tràithe, àirde, luaithe, shine
- ADV: tràithe
- Aff
- AUX: gur, an, gun
- Neg
- AUX: chan, nach, cha
- PART: cha, nach, chan, na
Verbal Features
- Ind
- AUX-Fin: chaidh, thèid, deach, tèid, rachadh
- VERB-Fin: tha, bha, robh, eil, bheil, chaidh, bhiodh, fhuair, thuirt, ars’
- Int
- AUX: an, nach
- Fut
- AUX-Fin: thèid, tèid
- VERB-Fin: bidh, feumaidh, bhios, bi, gheibh, nì, faodaidh, bheir, bhi, thèid
- Past
- AUX: b', bu, chaidh, b’, deach
- AUX-Fin: chaidh, deach
- PART: do, d’, d'
- VERB-Fin: bha, robh, chaidh, fhuair, thuirt, ars’, chuir, thàinig, bh', rinn
- Pres
- AUX: 's, is, gur, as, chan, nach, an, cha, ‘s, gun
- PART: cha
- VERB-Fin: tha, eil, bheil, th', thà, th’, thathar, thathas, 'eil, 'l
Pronouns, Determiners, Quantifiers
- Art
- DET: an, na, a’, sin, a', seo, am, a, h-uile, nan
- Dem
- PRON: sin, seo, siud, sineach, a-sin, an, a-seo, seothach, siod, a
- Int
- PART: an, a, am, 'm, na, 'n, ‘n
- PRON: dè, cò, dé, ciamar, carson, gu, cà, có, cuin', mheud
- Prs
- DET: a, an, mo, do, am, ar, m', d', t', ur
- PRON: e, iad, i, mi, a, thu, sinn, esan, fhèin, sibh
- Rel
- AUX: as, is, 's
- PART: a, nach, a'
- PRON: na
- Card
- NUM: aon, dà, deug, trì, dhà, fhichead, fichead, ceithir, seachd, mìle
- Ord
- NUM: chiad, cheud, dàrna, naodhamh, ochdamh, t-seachdamh, naoidheamh, treas, dara, ceathramh
- Yes
- DET: a, an, mo, do, am, ar, m', d', t', ur
- PRON: a, an, mo, ar, do, ur
- Yes
- PRON: fhèin, fhéin, chèile, fhìn, a, chéile, péin, a-chèile, fhein, fèin
- 0
- AUX-Fin: Rachadh
- VERB-Fin: rinneadh, thathar, thugadh, chuireadh, thathas, dh'fhaoidte, faodar, feumar, fhuaras, rugadh
- 1
- DET: mo, ar, m', m’, ’r
- PRON: mi, sinn, mise, sinne, mo, ar, mis', mis’, àsan
- VERB-Fin: chanainn, rachainn, bhithinn, bithinn, bhiomaid, chanain-sa, dh’aontaichinn, faigheamaid, faighinn, Bitheamaid
- 2
- DET: do, d', t', ur, bhur, d’
- PRON: thu, sibh, tu, thusa, sibhse, tusa, do, thus', ur, sibh-se
- VERB-Fin: feuch, can, cuir, abair, bi, gabh, till, trobhad, Cumaibh, saoil
- 3
- DET: a, an, am, ’n, 'n, 'm
- PRON: e, iad, i, a, esan, ise, an, iadsan, àsan, h-e
Other Features
- Foreign
- Yes
- ADJ: okay, extra, flat, fresh, important, spotless, British, Celtic, English, First
- ADV: really, exactly, straight, Celtic, absolutely, alright, forward, particular, still, totally
- CCONJ: so
- DET: the
- INTJ: well, okay, right, so, A, really, sorry, thanks
- NOUN: tug-of-war, Shir, contract, vet, Radio, council, point, terrorists, tribunal, van
- NOUN-Vnoun: sublet
- NUM: fifty, forty-thousand, three
- PROPN: Sir, Dad, Apprentice, Aquaculture, Backpackers, Bhridge, Boys, Centre, Community, Green
- VERB-Fin: dhifferentiates, test
- X: the, a, of, on, I, Isles, and, in, poverty, Cheatharnaigh
- Form
- Emp
- ADP: shon-sa, dheidhinn-sa
- NOUN: taobh-sa, bheachd-sa, ìre-sa, bheachd-san, bliadhna-sa, aobhar-sa, athair-san, bhràithrean-sa, bhàta-sa, cumail-san
- NOUN-Vnoun: cumail-san, leughadh-ne
- PRON: esan, mise, ise, thusa, iadsan, sinne, sibhse, tusa, àsan, mis'
- NumForm
- Digit
- NUM: 1751, 1674, 2, 1692, 1702, 1651, 1660, 1686, 1689, 1690
- Roman
- NUM: II
- Word
- NUM: aon, dà, deug, trì, dhà, fhichead, fichead, ceithir, chiad, cheud
- PartType
- Ad
- PART: gu
- Cmpl
- PART: gun, gu, gum, nach, g', 'g, gan, gur
- Comp
- PART: nas, na, bu, b', 's
- Inf
- PART: a, 'ic, a'
- Num
- PART: a
- Pat
- PART: mac, 'ic, Nic, O, Mhic, Ó, Mc
- Vb
- PART: a, cha, chan, an, nach, am, na, 'm, 'n, a'
- Voc
- PART: a, a'
- Typo
- Yes
- PROPN: lain
- VERB-Fin: dh’fhabh
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop): is.
- This corpus uses 1 lemma as a passive auxiliary (aux:pass): rach.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB-Fin--NOUN (86)
- VERB-Fin--NOUN-Dat (12)
- VERB-Fin--NOUN-Gen (2)
- VERB-Fin--NOUN-Nom (2469)
- VERB-Fin--PRON (4024)
- obj
- VERB-Fin--NOUN (29)
- VERB-Fin--NOUN-Dat (4)
- VERB-Fin--NOUN-Nom (862)
- VERB-Fin--PRON (352)
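One way to read these counts is to ask what share of each relation is filled by a pronoun rather than a noun. The sketch below only re-uses the counts listed above; the dictionary and function names are illustrative.

```python
# Child categories of finite verbs, with the counts listed above.
NSUBJ = {"NOUN": 86, "NOUN-Dat": 12, "NOUN-Gen": 2, "NOUN-Nom": 2469, "PRON": 4024}
OBJ = {"NOUN": 29, "NOUN-Dat": 4, "NOUN-Nom": 862, "PRON": 352}


def pronoun_share(counts: dict) -> float:
    """Percentage of dependents in `counts` that are pronouns."""
    return 100 * counts["PRON"] / sum(counts.values())


print(sum(NSUBJ.values()), round(pronoun_share(NSUBJ), 1))  # 6593 61.0
print(sum(OBJ.values()), round(pronoun_share(OBJ), 1))      # 1247 28.2
```

On these figures, pronouns make up roughly 61% of the nominal subjects of finite verbs but only about 28% of their objects.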
Relations Overview
- This corpus uses 15 relation subtypes: acl:relcl, aux:pass, case:voc, csubj:cleft, csubj:cop, csubj:outer, flat:foreign, flat:name, mark:prt, nmod:poss, nsubj:outer, nsubj:pass, obl:smod, obl:tmod, xcomp:pred
- The following 2 main types are not used alone, they are always subtyped: aux, csubj
- The following 5 relation types are not used in this corpus at all: iobj, expl, clf, list, goeswith
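The three statements above can be derived from the relation inventory alone. The sketch below takes the relations listed under "Relations" earlier in this section and compares them with the 37 universal relations of UD v2. The list `UNIVERSAL` is written out by hand here as an assumption of the sketch, not read from any UD tool, and all variable names are illustrative.

```python
# The 37 universal dependency relations of UD v2 (hand-written list).
UNIVERSAL = """acl advcl advmod amod appos aux case cc ccomp clf compound conj
cop csubj dep det discourse dislocated expl fixed flat goeswith iobj list mark
nmod nsubj nummod obj obl orphan parataxis punct reparandum root vocative
xcomp""".split()

# Relations used in this treebank, as listed under "Relations".
USED = """acl acl:relcl advcl advmod amod appos aux:pass case case:voc cc ccomp
compound conj cop csubj:cleft csubj:cop csubj:outer dep det discourse
dislocated fixed flat flat:foreign flat:name mark mark:prt nmod nmod:poss nsubj
nsubj:outer nsubj:pass nummod obj obl obl:smod obl:tmod orphan parataxis punct
reparandum root vocative xcomp xcomp:pred""".split()

# Subtypes are the labels of the form main:subtype.
subtypes = sorted(r for r in USED if ":" in r)

# Main types that occur only with a subtype, never on their own.
never_alone = sorted({r.split(":")[0] for r in subtypes} - set(USED))

# Universal relations that occur neither alone nor as the base of a subtype.
unused = sorted(set(UNIVERSAL) - {r.split(":")[0] for r in USED})

print(len(subtypes))  # 15
print(never_alone)    # ['aux', 'csubj']
print(unused)         # ['clf', 'expl', 'goeswith', 'iobj', 'list']
```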