
This page pertains to UD version 2.

UD for Scottish Gaelic

At present UD for Scottish Gaelic contains a single corpus, the Annotated Reference Corpus of Scottish Gaelic.

Tokenisation and Word Segmentation

Words are delimited by whitespace or punctuation. There are no multiword tokens. There are, however, multitoken words.

Reconstructing spacing

Context: ARCOSG does not contain the original texts, so we have to reconstruct them in a consistent way. We use GOC (Gaelic Orthographic Conventions, https://www.sqa.org.uk/files_ccc/SQA-Gaelic_Orthographic_Conventions-En-e.pdf) for consistency in reconstructing spacing, but don’t apply any other corrections.

According to the latest GOC:

Also (not covered explicitly by GOC but shown in examples):

If an elided a’ or ag before a verbal noun is indicated by an apostrophe (’), close this up.

Close up around the hyphen in a-measg, a-rèir, a-thaobh and similar, but don’t close up around hyphens that are being used as dashes. Also, don’t attempt to bring the text into line with GOC by adding or removing hyphens.

Also close up dhà-na-tri (see fp05_012).
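The hyphen rule can be sketched in Python as follows. This is only an illustration under stated assumptions, not the procedure used to build the treebank: the function name close_up_hyphens and the list SECOND_PARTS are hypothetical, the list is deliberately incomplete, and the input is assumed to be a string in which the hyphen may have stray spaces on either side.

```python
import re

# Hypothetical, non-exhaustive list of second elements that are written
# closed up after "a-" (a-measg, a-rèir, a-thaobh).
SECOND_PARTS = ("measg", "rèir", "thaobh")

# Match a standalone "a", optional spaces, a hyphen, optional spaces,
# and then one of the listed second elements.
_PATTERN = re.compile(r"\b([Aa])\s*-\s*(" + "|".join(SECOND_PARTS) + r")\b")


def close_up_hyphens(text: str) -> str:
    """Close up 'a - measg' to 'a-measg'; leave every other hyphen alone.

    Hyphens used as dashes are untouched because they are not followed
    by one of the listed second elements. No hyphens are added or removed.
    """
    return _PATTERN.sub(r"\1-\2", text)
```

Working from an explicit word list, rather than closing up every spaced hyphen, is what keeps dashes untouched in this sketch.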

Multiword tokens

The original version of ARCOSG contains tokens that contain spaces. For UD, however, we need to split these up. The XPOS is duplicated for each of these words but the UPOS need not be.

PROPNs have a flat:name relation; others use fixed.
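As a rough sketch of this splitting step, the Python function below turns one spaced ARCOSG token into one row per word. The function name split_spaced_token and the dictionary layout are hypothetical. It assumes that the choice between flat:name and fixed is made from the UPOS of the first word, and it relies on the general UD convention that the later words of a flat or fixed expression attach to the first word.

```python
from typing import Dict, List, Optional


def split_spaced_token(form: str, xpos: str, upos_tags: List[str],
                       first_id: int = 1) -> List[Dict[str, Optional[object]]]:
    """Split a token containing spaces into one row per word.

    The XPOS is duplicated on every word; the UPOS is supplied per word
    because it need not be the same for each one.
    """
    words = form.split()
    if len(words) != len(upos_tags):
        raise ValueError("need exactly one UPOS tag per word")

    # Assumption: the relation is decided by the first word's UPOS.
    relation = "flat:name" if upos_tags[0] == "PROPN" else "fixed"

    rows = []
    for offset, (word, upos) in enumerate(zip(words, upos_tags)):
        rows.append({
            "id": first_id + offset,
            "form": word,
            "upos": upos,
            "xpos": xpos,
            # The first word's head and deprel depend on the sentence,
            # so they are left unset here.
            "head": None if offset == 0 else first_id,
            "deprel": None if offset == 0 else relation,
        })
    return rows
```

For example, split_spaced_token("Dùn Èideann", "Nt", ["PROPN", "PROPN"], first_id=4) yields two rows, the second attached to word 4 with flat:name. The XPOS value "Nt" here is purely illustrative.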

Multitoken words

Conversely, there are single tokens in ARCOSG that correspond to more than one word in the UD sense. Here are the most common families:

Morphology

Parts of speech

Standard UPOS tags are used throughout. Generally we follow the choices made in the Irish UD treebanks.

Features

Gaelic has two genders (masculine and feminine), four cases (nominative/accusative, genitive, dative and vocative), three numbers (singular, dual and plural), the usual three persons and an impersonal form.

The words fèin and chèile take Reflex=Yes.

The indicative mood is default and we mark the conditional (Cnd), imperative (Imp) and interrogative (Int) moods. The tenses we mark are

We also follow Irish in marking three pronoun types (Emp = emphatic, Int = interrogative and Rel = relative), polarity (Neg on negative particles) and the following particle types: Ad (adverbialiser), Comp (comparative), Cmpl (complement), Inf (agreement particle), Int (interrogative), Num (numerical), Pat (patronymic), Vb (verbal) and Voc (vocative).

We also have Foreign=Yes for words that are in Irish or English according to the original ARCOSG tagging.

Syntax

VSO clause structure

Main clauses and subordinate clauses are VSO. The subject almost invariably follows the verb but

However, in an externally controlled complement in the progressive aspect, a nominal object follows the verbal noun, whereas a pronominal object precedes it.

Core arguments, oblique arguments and adjuncts

The core arguments are marked by nsubj and obj if they are noun phrases. Oblique arguments and adjuncts are marked by obl when they are prepositional phrases. Occasionally they are noun phrases in which case we use obl:tmod if they indicate a stretch of time or obl:smod if they indicate a distance.

In terms of clausal subjects csubj:cop is used for expressions like:

Language-specific labels

With three exceptions, these follow Irish:

Some specific cases

The verbal noun

Annotate as a NOUN.

With aspect markers (continuous tenses and depictives)

Here it has VerbType=VNoun. The markers ag, air, ri and so forth that precede it have a case relation, as in Irish. The verbal noun itself is an xcomp:pred of the verb bi.

Inversion structures and rach-passives

Here it has VerbType=Inf. Usually it is preceded by an infinitive particle a but this is elided where it begins with a vowel or fh. In inversion structures, the object is obj of the verbal noun, with the exception of rach-passives where it is nsubj:pass or exceptionally nsubj:outer.

agus, is and ’s

air ais

In ARCOSG, ais is tagged as Nf (fossilized noun). However there are phrases like air ais no air adhart in which there seems to be no good reason to treat the first half differently from the second half, even if ais is no longer productive.

c04_024: ‘she did not write back yet’

1	cha	cha	PART	Qn	PartType=Vb|Polarity=Neg	3	mark:prt	_	_
2	do	do	PART	Q--s	Tense=Past	3	mark:prt	_	_
3	sgrìobh	sgrìobh	VERB	V-s	Tense=Past	0	root	_	_
4	i	i	PRON	Pp3sf	Gender=Fem|Number=Sing|Person=3	3	nsubj	_	_
5	air	air	ADP	Sp	_	6	case	_	_
6	ais	ais	NOUN	Nf	_	3	obl	_	_
7	fhathast	fhathast	ADV	Rt	_	3	advmod	_	_
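An analysis like the one above can be checked mechanically by reading the ten tab-separated CoNLL-U columns. The Python sketch below is a minimal illustration: the helper names parse_conllu_rows and dependents are hypothetical, and it handles only plain word lines, not comments, ranges or empty nodes. It rebuilds the rows of c04_024 and looks up what attaches to ais.

```python
from typing import Dict, List

# The ten CoNLL-U columns, in order.
COLUMNS = ("id", "form", "lemma", "upos", "xpos",
           "feats", "head", "deprel", "deps", "misc")

# Rows of c04_024, joined with explicit tabs so the separators survive copying.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    ("1", "cha", "cha", "PART", "Qn", "PartType=Vb|Polarity=Neg", "3", "mark:prt", "_", "_"),
    ("2", "do", "do", "PART", "Q--s", "Tense=Past", "3", "mark:prt", "_", "_"),
    ("3", "sgrìobh", "sgrìobh", "VERB", "V-s", "Tense=Past", "0", "root", "_", "_"),
    ("4", "i", "i", "PRON", "Pp3sf", "Gender=Fem|Number=Sing|Person=3", "3", "nsubj", "_", "_"),
    ("5", "air", "air", "ADP", "Sp", "_", "6", "case", "_", "_"),
    ("6", "ais", "ais", "NOUN", "Nf", "_", "3", "obl", "_", "_"),
    ("7", "fhathast", "fhathast", "ADV", "Rt", "_", "3", "advmod", "_", "_"),
])


def parse_conllu_rows(text: str) -> List[Dict[str, object]]:
    """Parse plain word lines into dicts; ID and HEAD become integers."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}")
        row: Dict[str, object] = dict(zip(COLUMNS, fields))
        row["id"] = int(fields[0])
        row["head"] = int(fields[6])
        rows.append(row)
    return rows


def dependents(rows: List[Dict[str, object]], head_id: int) -> List[Dict[str, object]]:
    """Return the rows whose HEAD is head_id, in sentence order."""
    return [row for row in rows if row["head"] == head_id]
```

With these helpers, dependents(parse_conllu_rows(SAMPLE), 6) returns the single row for air, whose deprel is case, and the row for ais itself has head 3 and deprel obl.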

bi

Auxiliary use: we follow the Irish UD treebank and treat bi as a VERB, and the verbal noun as a NOUN linked back to bi with an xcomp:pred deprel.

Predicative use: again, we follow Irish and use xcomp:pred for predicative adjectives, PPs and adverbs. There is a construction, exemplified in c02_009a, c02_009b and c02_010, of the form bi… agam… ri dhol…; in this case we assume that the PP with aig is the quirky experiencer and ri is the predicate.

However (see f01_028), there are also uses of bi for extent in time (n03_041) and space.

còrr is and friends

Example taken from pw01_015a: in còrr is deich bliadhna, bliadhna is conjoined with còrr and deich is a nummod of bliadhna.

From ns04_053: in the sentence thachair an tubaist còrr is bliadhna gu leth, the word còrr is obl:tmod of thachair because the phrase as a whole is a time phrase.

dè cho…

‘how’ as in ‘how big’. Dè remains PRON and cho is advmod of the succeeding adjective.

feuch

When this is tagged as Vm-2s, it is usually used in the sense ‘to try to’, in which case it is linked to the higher clause with an xcomp deprel. For example, in n04_002 (… gu robh e ‘dol a dh’fhalbh feuch a faigheadh…), feuch is an xcomp of dh’fhalbh.

fhios agad and variants

‘you know’. Treat as parataxis as it is explicitly excluded from discourse. See also parataxis below.

foreign words

Usually English (en) but sometimes Early Modern Irish (ghc).

If they’re the names of institutions (mostly in the news subcorpus) or borrowings being used in a matter-of-fact way (mostly in the conversation subcorpus) then they are tagged with their original parts of speech and joined by flat. OrigLang=en (or whichever language) goes in the MISC column. If they’re being used appositively or are titles of works, or are reported speech in another language, then tag everything with X and use flat:foreign to join them. They have Foreign=Yes and no other features in the morphology column. Lang=en goes in the MISC column.
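The two treatments of foreign words can be summarised in a short Python sketch. The function name annotate_foreign, the integrated flag and the dictionary layout are hypothetical, and the example words and POS tags in the usage note are illustrative rather than taken from the corpus. The text above does not say which features the first case takes, so the sketch uses "_" as a placeholder there.

```python
from typing import Dict, List, Optional


def annotate_foreign(words: List[str], lang: str, integrated: bool,
                     original_upos: Optional[List[str]] = None
                     ) -> List[Dict[str, Optional[str]]]:
    """Build per-word annotations for a run of foreign words.

    integrated=True:  institution names or matter-of-fact borrowings.
                      Original POS, joined by flat, OrigLang in MISC.
    integrated=False: appositions, titles of works, reported speech.
                      UPOS X, joined by flat:foreign, Foreign=Yes as the
                      only feature, Lang in MISC.
    """
    if integrated:
        if original_upos is None or len(original_upos) != len(words):
            raise ValueError("integrated words need one original UPOS each")
        upos_tags = list(original_upos)
        feats, relation, misc = "_", "flat", f"OrigLang={lang}"  # feats: placeholder
    else:
        upos_tags = ["X"] * len(words)
        feats, relation, misc = "Foreign=Yes", "flat:foreign", f"Lang={lang}"

    rows = []
    for index, (word, upos) in enumerate(zip(words, upos_tags)):
        rows.append({
            "form": word,
            "upos": upos,
            "feats": feats,
            # The first word's deprel depends on the sentence, so it is unset.
            "deprel": None if index == 0 else relation,
            "misc": misc,
        })
    return rows
```

For instance, annotate_foreign(["Top", "Gear"], "en", integrated=False) tags both words X with Foreign=Yes and Lang=en, and links the second to the first with flat:foreign.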

an ìre mhath

This means ‘almost’. See s08_061b for an example. Use nmod.

is

‘S, b’, bu, ‘se, ‘sann and so on are cop and the root is whatever has been fronted by it. We treat ‘S e as a fixed expression where e has a fixed relation with the AUX. Likewise ‘S ann, except of course ann is divided up into an and e and both have a fixed relation with the AUX.

Following Cox in Geàrr Ghràmar na Gàidhlig, p. 284, in phrases like is ann a cheannaich mi bainne, cheannaich is still the root even though it’s preceded by a relative particle.

Again we follow Irish and whatever comes after the root is a subject, be it a nominal subject, nsubj, or a clausal subject, csubj:cleft or csubj:cop.

mas

Mas (‘if’) is divided into the two words ma (SCONJ) and is (AUX).

nach maireann

As in Dr Calum MacGilleathain nach maireann, ‘the late Dr Calum Maclean’. This is acl:relcl of the deceased because nach is the negative relativiser.

parataxis

Where you have a big long sentence with lots of “ars’ esan” and “ars’ ise”s in it, treat them like punctuation and make them parataxis of the most contentful content word in the nearest quoted text so as to avoid non-projectivity. Sentence n01_038 is an example of this.

an t-seachdain seo chaidh and others

‘last week’, literally ‘this week that went’. Treat chaidh as being acl:relcl of t-seachdain (pw05_005, also ceud in the sense of ‘century’: see fp01_034).

urrainn

In most dialects the person (or thing) that is able follows the preposition do, and so is nmod. In some, however, you can say, for example, ’s urrainn mi, in which case mi is nmod of urrainn.

vocables

There are no vocables in ARCOSG, but in the event of a future poetry/song corpus the words in them should be connected by flat.


Treebanks

There is one Scottish Gaelic UD treebank:

References