
This page pertains to UD version 2.

UD for Scottish Gaelic

At present UD for Scottish Gaelic contains a single corpus, the Annotated Reference Corpus of Scottish Gaelic.

Tokenisation and Word Segmentation

Words are delimited by whitespace or punctuation. There are no multiword tokens.

Reconstructing spacing

Context: ARCOSG does not contain the original texts, so we have to reconstruct them in a consistent way. We use GOC (Gaelic Orthographic Conventions, https://www.sqa.org.uk/files_ccc/SQA-Gaelic_Orthographic_Conventions-En-e.pdf) for consistency in reconstructing spacing, but don’t apply any other corrections.

According to the latest GOC:

Also (not covered explicitly by GOC but shown in examples):

If an elided a’ or ag before a verbal noun is indicated by an apostrophe, close this up.

Close up around the hyphen in a-measg, a-rèir, a-thaobh and similar, but don’t close up around hyphens that are being used as dashes. Also, don’t attempt to bring the text into line with GOC by adding or removing hyphens.

Also close up dhà-na-tri (see fp05_012).
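The hyphen rule can be sketched in code. This is only an illustration under stated assumptions: it assumes the input arrives as whitespace-separated tokens with each hyphen standing alone, and the list `CLOSED_FORMS` and the function `close_up_hyphens` are hypothetical names, not part of any ARCOSG or UD tool. Only sequences that match a listed form are closed up, so a free-standing hyphen used as a dash is left untouched.

```python
# Hypothetical list of forms to close up. The guidelines name these and
# "similar" ones, so a real list would be longer.
CLOSED_FORMS = ["a-measg", "a-rèir", "a-thaobh", "dhà-na-tri"]


def _pattern(form):
    """Turn 'a-measg' into the spaced token sequence ['a', '-', 'measg']."""
    tokens = []
    for i, part in enumerate(form.split("-")):
        if i:
            tokens.append("-")
        tokens.append(part)
    return tokens


# Try longer patterns first so that a longer form wins over a shorter overlap.
PATTERNS = sorted((_pattern(f) for f in CLOSED_FORMS), key=len, reverse=True)


def close_up_hyphens(tokens):
    """Join token runs that spell a listed hyphenated form; keep dashes apart."""
    out = []
    i = 0
    while i < len(tokens):
        for pattern in PATTERNS:
            window = tokens[i:i + len(pattern)]
            if [t.lower() for t in window] == pattern:
                # Keep the original casing; only the spaces are dropped.
                out.append("".join(window))
                i += len(pattern)
                break
        else:
            # No listed form starts here, so copy the token unchanged.
            out.append(tokens[i])
            i += 1
    return out
```

Matching against an explicit list, rather than closing up every hyphen, reflects the instruction not to add or remove hyphens and not to touch dashes.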

Multiword tokens

The original version of ARCOSG contains multiword tokens. For UD, however, we need to split these up. For the moment we duplicate the UPOS and the XPOS for each of the words. PROPNs have a flat relation; others have a fixed relation but this needs to be improved.
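As a rough sketch of the splitting step, the function below copies the UPOS and XPOS onto every word and attaches the non-initial words to the first one with flat (for PROPN) or fixed (otherwise). Several things here are assumptions: that the parts of an ARCOSG multiword token can be separated on whitespace, the name `split_multiword`, the dictionary layout, and the placeholder XPOS value `"TAG"`. The head-first attachment follows the general UD convention for flat and fixed rather than anything stated on this page.

```python
def split_multiword(form, upos, xpos):
    """Split one multiword token into words that share its UPOS and XPOS."""
    # PROPNs take a flat relation; everything else takes fixed for now.
    relation = "flat" if upos == "PROPN" else "fixed"
    rows = []
    for index, word in enumerate(form.split(), start=1):
        rows.append({
            "id": index,          # position within the split token only
            "form": word,
            "upos": upos,         # duplicated on every word
            "xpos": xpos,         # duplicated on every word
            # The first word keeps whatever head the whole token had
            # (left unset here); later words attach to the first word.
            "head": None if index == 1 else 1,
            "deprel": None if index == 1 else relation,
        })
    return rows


# "TAG" is a placeholder, not a real ARCOSG tag.
for row in split_multiword("sam bith", "ADJ", "TAG"):
    print(row)
```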

Some difficult cases follow:

na b’/na bu

Ideally this should be exactly parallel with nas.

sam bith

Currently both words are marked as ADJ, but there is clearly internal structure.

Morphology

Tags

Standard UPOS tags are used throughout. Generally we follow the choices made in the Irish UD treebanks.

Features

Gaelic has two genders (masculine and feminine), four cases (nominative/accusative, genitive, dative and vocative), three numbers (singular, dual and plural), the usual three persons and an impersonal form.

The words fèin and chèile take Reflex=Yes.

The indicative mood is default and we mark the conditional (Cnd), imperative (Imp) and interrogative (Int) moods. The tenses we mark are

We also follow Irish in marking three pronoun types (Emp = emphatic, Int = interrogative and Rel = relative), polarity (Neg on negative particles) and the following particle types: Ad (adverbialiser), Comp (comparative), Cmpl (complement), Inf (agreement particle), Int (interrogative), Num (numerical), Pat (patronymic), Vb (verbal) and Voc (vocative).

We also have Foreign=Yes for words that are in Irish or English according to the original ARCOSG tagging.
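In CoNLL-U these features end up in a single FEATS column as `Name=Value` pairs joined by `|` and sorted alphabetically by name, with `_` for an empty set. A minimal sketch of that serialisation follows; the helper name `format_feats` is hypothetical, and the values `Gen`, `Fem` and `Sing` are the standard UD value names rather than ones listed on this page.

```python
def format_feats(features):
    """Serialise a feature dictionary as a CoNLL-U FEATS string."""
    if not features:
        return "_"  # CoNLL-U uses an underscore for an empty column
    # UD sorts features alphabetically by name, ignoring case.
    pairs = sorted(features.items(), key=lambda item: item[0].lower())
    return "|".join(f"{name}={value}" for name, value in pairs)


# A feminine genitive singular noun, then a word tagged as foreign.
print(format_feats({"Number": "Sing", "Gender": "Fem", "Case": "Gen"}))
print(format_feats({"Foreign": "Yes"}))
```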

Syntax

VSO clause structure

Main clauses and subordinate clauses are VSO. The subject almost invariably follows the verb but

However, in an externally controlled complement in the progressive aspect, a nominal object follows the verbal noun, whereas a pronominal object precedes it.

The same applies to the usual form of the passive:

Core arguments, oblique arguments and adjuncts

The core arguments are marked by nsubj and obj if they are noun phrases. Oblique arguments and adjuncts are marked by obl when they are prepositional phrases. Occasionally they are noun phrases, in which case we use obl:tmod if they indicate a stretch of time or obl:smod if they indicate a distance.
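The choice of label for obliques and adjuncts can be written as a small decision function. This is a sketch only: the function name `oblique_label` and the string codes `"PP"`, `"NP"`, `"time"` and `"distance"` are hypothetical, and the page does not say what to do with a noun-phrase oblique that expresses neither time nor distance, so that case raises an error rather than guessing.

```python
def oblique_label(phrase_type, meaning=None):
    """Pick the relation for an oblique argument or adjunct."""
    if phrase_type == "PP":
        return "obl"            # prepositional phrases are plain obl
    if phrase_type == "NP":
        if meaning == "time":
            return "obl:tmod"   # noun phrase giving a stretch of time
        if meaning == "distance":
            return "obl:smod"   # noun phrase giving a distance
    # Anything else is not covered by the rule described in the text.
    raise ValueError(f"no label defined for {phrase_type!r} / {meaning!r}")
```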

For clausal subjects, csubj:cop is used for expressions like:

Language-specific labels

With one exception, these follow Irish:


Treebanks

There is one Scottish Gaelic UD treebank:

References