
This page pertains to UD version 2.

UD for Scottish Gaelic

At present UD for Scottish Gaelic contains a single corpus, the Annotated Reference Corpus of Scottish Gaelic.

Tokenisation and Word Segmentation

Words are delimited by whitespace or punctuation. There are no multiword tokens.

Reconstructing spacing

Context: ARCOSG does not contain the original texts, so we have to reconstruct them in a consistent way. We use GOC (Gaelic Orthographic Conventions, https://www.sqa.org.uk/files_ccc/SQA-Gaelic_Orthographic_Conventions-En-e.pdf) for consistency in reconstructing spacing, but don’t apply any other corrections.

According to the latest GOC:

Also (not covered explicitly by GOC but shown in examples):

If an elided a’ or ag before a verbal noun is indicated by an apostrophe, close this up.

Close up around the hyphen in a-measg, a-rèir, a-thaobh and similar, but don’t close up around hyphens that are being used as dashes. Also, don’t attempt to bring the text into line with GOC by adding or removing hyphens.

Also close up dhà-na-tri (see fp05_012).
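The hyphen rule can be sketched in Python. This is only an illustration under stated assumptions: it assumes the input is a list of tokens in which a spaced hyphen appears as a separate item, and the function name close_up_hyphens and the HYPHENATED table are hypothetical, not part of any ARCOSG or UD tooling.

```python
# Hypothetical table of pairs to close up. The guidelines name a-measg,
# a-rèir and a-thaobh "and similar", so a real table would be longer.
HYPHENATED = {("a", "measg"), ("a", "rèir"), ("a", "thaobh")}


def close_up_hyphens(tokens):
    """Join 'a', '-', 'measg' into 'a-measg'; leave other hyphens alone.

    Assumes (this is not stated in the guidelines) that a spaced hyphen
    shows up as its own item in the token list.
    """
    out = []
    i = 0
    while i < len(tokens):
        is_listed_pair = (
            i + 2 < len(tokens)
            and tokens[i + 1] == "-"
            and (tokens[i].lower(), tokens[i + 2].lower()) in HYPHENATED
        )
        if is_listed_pair:
            out.append(tokens[i] + "-" + tokens[i + 2])
            i += 3
        else:
            # Hyphens used as dashes fall through here untouched.
            out.append(tokens[i])
            i += 1
    return out
```

Because only listed pairs are joined, a hyphen standing in for a dash stays as it is, and no hyphens are added or removed to match GOC.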

Multiword tokens

The original version of ARCOSG contains tokens that contain spaces. For UD, however, we need to split these up. For the moment we duplicate the UPOS and the XPOS for each of the words. PROPNs have a flat:name relation; others have a fixed relation but this needs to be improved.
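As a rough sketch of the splitting described above (in Python, with a hypothetical function name and a placeholder XPOS value, since the real conversion code is not shown here), each part of a spaced token receives a copy of the UPOS and XPOS, and non-initial parts attach to the first part:

```python
def split_spaced_token(form, upos, xpos, first_id=1):
    """Split a token containing spaces into one word per part.

    UPOS and XPOS are duplicated on every part. Non-initial parts attach
    to the first part with flat:name (PROPN) or fixed (everything else).
    The head and deprel of the first part are left to the wider analysis.
    """
    deprel = "flat:name" if upos == "PROPN" else "fixed"
    words = []
    for offset, part in enumerate(form.split()):
        word = {"id": first_id + offset, "form": part, "upos": upos, "xpos": xpos}
        if offset > 0:
            word["head"] = first_id
            word["deprel"] = deprel
        words.append(word)
    return words


# "XPOS" is a placeholder string, not a real ARCOSG tag.
words = split_spaced_token("sam bith", "ADJ", "XPOS")
```

Attaching every later part to the first follows the head-first convention that UD uses for fixed and flat.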

Some difficult cases follow:

na b’/na bu

Ideally this should be exactly parallel with nas.

sam bith

rud sam bith, ‘whatever’ and so forth. Currently both sam and bith are marked as ADJ but there is clearly internal structure.

Conversely, there are single tokens in ARCOSG that correspond to more than one word in the UD sense. Here are the most common families:



Standard UPOS tags are used throughout. Generally we follow the choices made in the Irish UD treebanks.


Gaelic has two genders (masculine and feminine), four cases (nominative/accusative, genitive, dative and vocative), three numbers (singular, dual and plural), the usual three persons and an impersonal form.

The words fèin and chèile take Reflex=Yes.

The indicative mood is default and we mark the conditional (Cnd), imperative (Imp) and interrogative (Int) moods. The tenses we mark are

We also follow Irish in marking three pronoun types (Emp = emphatic, Int = interrogative and Rel = relative), polarity (Neg on negative particles) and the following particle types: Ad (adverbialiser), Comp (comparative), Cmpl (complement), Inf (agreement particle), Int (interrogative), Num (numerical), Pat (patronymic), Vb (verbal) and Voc (vocative).

We also have Foreign=Yes for words that are in Irish or English according to the original ARCOSG tagging.
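These features end up in the FEATS column of a CoNLL-U file, where they are sorted alphabetically by name and joined with |. A minimal Python sketch of assembling such a string follows; the helper name format_feats is hypothetical and the feature bundles shown are illustrative, not taken from the treebank.

```python
def format_feats(feats):
    """Render a dict of features as a CoNLL-U FEATS string.

    Features are sorted case-insensitively by name; an empty bundle is
    written as an underscore.
    """
    if not feats:
        return "_"
    ordered = sorted(feats.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in ordered)


# Illustrative bundles only.
noun_feats = format_feats({"Number": "Sing", "Case": "Gen", "Gender": "Fem"})
reflexive_feats = format_feats({"Reflex": "Yes"})
```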


VSO clause structure

Main clauses and subordinate clauses are VSO. The subject almost invariably follows the verb but

However, if there is an externally-controlled complement in the progressive aspect, a nominal object follows the verbal noun but a pronominal object precedes it.

The same applies to the usual form of the passive:

Core arguments, oblique arguments and adjuncts

The core arguments are marked by nsubj and obj if they are noun phrases. Oblique arguments and adjuncts are marked by obl when they are prepositional phrases. Occasionally they are noun phrases in which case we use obl:tmod if they indicate a stretch of time or obl:smod if they indicate a distance.
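The choice of label for oblique arguments and adjuncts can be written out as a small decision function. This is only a sketch in Python: the function name and the string values "PP", "NP", "time" and "distance" are hypothetical conventions for the illustration, not categories used by any existing tool.

```python
def oblique_deprel(phrase_type, meaning=None):
    """Pick the label for an oblique argument or adjunct.

    phrase_type: "PP" or "NP" (hypothetical labels for this sketch).
    meaning: "time" or "distance" for noun phrases, otherwise None.
    """
    if phrase_type == "PP":
        return "obl"
    if phrase_type == "NP" and meaning == "time":
        return "obl:tmod"
    if phrase_type == "NP" and meaning == "distance":
        return "obl:smod"
    # The paragraph above does not cover any other combination.
    raise ValueError(f"no rule for {phrase_type!r} with meaning {meaning!r}")
```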

In terms of clausal subjects csubj:cop is used for expressions like:

Language-specific labels

With one exception, these follow Irish:

Some specific cases

The verbal noun

Annotate as a NOUN and an xcomp:pred of the VERB.

In inversion structures, the object is obj of the verbal noun.

With aspect markers (continuous tenses and depictives)

ag, air, ri and so forth preceding it have a case relationship as in Irish.
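To make the attachments concrete, here is a sketch in Python that encodes an invented sentence (Tha mi a’ dol, ‘I am going’; it is not taken from ARCOSG) and prints it in the ten-column CoNLL-U layout. The UPOS of the aspect marker is left as an underscore because it is not stated in this section, and the helper name to_conllu is hypothetical.

```python
# Invented example, not from the corpus. Only ID, FORM, UPOS, HEAD and
# DEPREL are filled in; every other column is left as "_".
SENTENCE = [
    {"id": 1, "form": "Tha", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 2, "form": "mi", "upos": "PRON", "head": 1, "deprel": "nsubj"},
    # The aspect marker is case of the verbal noun; its UPOS is not given here.
    {"id": 3, "form": "a’", "upos": "_", "head": 4, "deprel": "case"},
    # The verbal noun is a NOUN and xcomp:pred of the verb.
    {"id": 4, "form": "dol", "upos": "NOUN", "head": 1, "deprel": "xcomp:pred"},
]


def to_conllu(sentence):
    """Render the words as tab-separated CoNLL-U lines."""
    lines = []
    for word in sentence:
        columns = [
            str(word["id"]),    # ID
            word["form"],       # FORM
            "_",                # LEMMA
            word["upos"],       # UPOS
            "_",                # XPOS
            "_",                # FEATS
            str(word["head"]),  # HEAD
            word["deprel"],     # DEPREL
            "_",                # DEPS
            "_",                # MISC
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines)


print(to_conllu(SENTENCE))
```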

Inversion structures

The noun preceding it is an obj of it.

agus, is and ’s’

air ais

While ais is tagged as Nf in phrases like air ais no air adhart there seems to be no good reason to treat the first half differently from the second half, so air is case of ais and ais is the head and obl of whatever it is modifying.


Auxiliary use: we follow the Irish UD treebank and treat bi as a VERB, and the verbal noun as a NOUN linked back to bi with an xcomp:pred deprel.

Predicative use: again, we follow Irish and use xcomp:pred for predicative adjectives, PPs and adverbs. There is a construction exemplified in c02_009a, c02_009b and c02_010 (bi… agam… ri dhol…); in this case we assume that the PP with aig is the quirky experiencer and ri is the predicate.

However (see f01_028), there are also uses of bi for extent in time (n03_041) and space.

còrr is and friends

Example taken from pw01_015a: in còrr is deich bliadhna, bliadhna is conjoined with còrr and deich is a nummod of bliadhna.

From ns04_053: in thachair an tubaist còrr is bliadhna gu leth, còrr is obl:tmod of thachair because the phrase as a whole is a time phrase.

dè cho…

‘how’ as in ‘how big’. dè remains PRON and cho is advmod of the succeeding adjective.


When feuch is tagged as Vm-2s, the sense in which it is usually used is ‘to try to’, in which case it is linked to the higher clause with an xcomp deprel. For example, in n04_002 (… gu robh e ‘dol a dh’fhalbh feuch a faigheadh…), feuch is an xcomp of dh’fhalbh.

fhios agad and variants

‘you know’. Treat as parataxis as it is explicitly excluded from discourse. See also parataxis below.

an ìre mhath

This means ‘almost’. See s08_061b for an example. Use nmod.


‘S, b’, bu, ‘se, ‘sann and so on are cop and the root is whatever has been fronted by it. We treat ‘S e as a fixed expression where e has a fixed relation with the AUX. Likewise ‘S ann, except of course ann is divided up into an and e and both have a fixed relation with the AUX.

Following Cox in Geàrr Ghràmar na Gàidhlig, p. 284, in phrases like is ann a cheannaich mi bainne, cheannaich is still the root even though it’s preceded by a relative particle.

Again we follow Irish and whatever comes after the root is a subject, be it a nominal subject, nsubj, or a clausal subject, csubj:cleft or csubj:cop.


Mas (‘if’) is divided into the two words ma (SCONJ) and is (AUX).
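A split like this can be driven by a lookup table. The Python sketch below is only an illustration: the table name SPLITS and the function split_fused are hypothetical, and the mas entry is the only one that comes from these guidelines.

```python
# Hypothetical lookup table; only the mas entry comes from the guidelines.
SPLITS = {"mas": [("ma", "SCONJ"), ("is", "AUX")]}


def split_fused(form):
    """Return (form, upos) pairs for a token.

    Forms with no entry in SPLITS come back unchanged with upos None.
    The original casing of a split form is not preserved in this sketch.
    """
    return SPLITS.get(form.lower(), [(form, None)])
```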

nach maireann

(as in Dr Calum MacGilleathain nach maireann, ‘The late Dr Calum Maclean’) This is acl:relcl of the deceased because nach is the negative relativiser.


Where a long sentence contains many instances of “ars’ esan” and “ars’ ise”, treat them like punctuation: make each one parataxis of the most contentful word in the nearest quoted text so as to avoid non-projectivity. Sentence n01_038 is an example of this.

an t-seachdain seo chaidh and others

‘last week’, literally ‘this week that went’. Treat chaidh as being acl:relcl of t-seachdain (pw05_005, also ceud in the sense of ‘century’: see fp01_034).


In most dialects the person (or thing) that is able to do something follows the preposition do, so it is of course obl. In some, however, you can say, for example, ’s urrainn mi; in this case mi is nmod of urrainn.


There is one Scottish Gaelic UD treebank: