
This page pertains to UD version 2.

UD Irish IDT

Language: Irish (code: ga)
Family: Indo-European, Celtic

This treebank has been part of Universal Dependencies since the UD v1.0 release.

The following people have contributed to making this treebank part of UD: Teresa Lynn, Jennifer Foster, Sarah McGuinness, Abigail Walsh, Jason Phelan, Kevin Scannell.

Repository: UD_Irish-IDT
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.13

License: CC BY-SA 3.0

Genre: news, fiction, web, legal, government

Questions, comments? General annotation questions (either Irish-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [teresa • lynn (æt) adaptcentre • ie; jennifer • foster (æt) dcu • ie]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
UPOS annotated manually in non-UD style, automatically converted to UD, with some manual corrections of the conversion
XPOS assigned by a program, with some manual corrections, but not a full manual verification
Features assigned by a program, with some manual corrections, but not a full manual verification
Relations assigned by a program, with some manual corrections, but not a full manual verification
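
The annotation layers listed above correspond to columns of the CoNLL-U format in which UD treebanks are distributed (LEMMA, UPOS, XPOS, FEATS, and HEAD/DEPREL for relations). The following is a minimal sketch of reading one token line into named fields. The example line, including the XPOS placeholder "XTAG", is invented for illustration and is not copied from the treebank; the function name is likewise hypothetical.

```python
# Column names of the ten tab-separated CoNLL-U fields.
CONLLU_FIELDS = [
    "id", "form", "lemma", "upos", "xpos",
    "feats", "head", "deprel", "deps", "misc",
]


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(
            f"expected {len(CONLLU_FIELDS)} columns, got {len(values)}"
        )
    token = dict(zip(CONLLU_FIELDS, values))
    # FEATS is a '|'-separated list of Name=Value pairs, or '_' when empty.
    feats = token["feats"]
    token["feats"] = (
        {} if feats == "_"
        else dict(pair.split("=", 1) for pair in feats.split("|"))
    )
    return token


# Hypothetical token line (illustrative values only, "XTAG" is a placeholder).
example = "1\tTá\tbí\tVERB\tXTAG\tMood=Ind|Tense=Pres\t0\troot\t_\t_"
token = parse_token_line(example)
print(token["lemma"], token["upos"], token["feats"], token["deprel"])
```

Keeping FEATS as a dictionary makes it straightforward to look up a single feature, such as Tense, without re-splitting the string.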

Description

A Universal Dependencies 4910-sentence treebank for modern Irish.

The Irish UD Treebank (IUDT) is a conversion of the Irish Dependency Treebank (IDT), which was part of a PhD research project by Teresa Lynn at Dublin City University, Ireland (Lynn, 2016).

The (smaller) IDT dataset has also been released on GitHub (https://github.com/tlynn747/IrishDependencyTreebank).

The treebank contains 4910 sentences.

The first 2924 sentences were taken from the New Corpus of Ireland-Irish (NCII), with text from books, newswire, websites and other media. These sentences are a subset of a gold-standard POS-tagged corpus for Irish made available by Elaine Uí Dhonnchadha of Trinity College Dublin.

The subsequent 1986 sentences were taken from a corpus of Irish public administration translations and are available under the Open Data (PSI) directive for sharing of public data. The sources and their shares of these sentences are:
Citizens Information website: 20%
Dublin City Council (DCC): 25%
Department of Culture, Heritage and the Gaeltacht (DCHG): 9%
Udaras na Gaeltachta: 25%
EUbookshop: 21%

The conversion from the IDT annotation scheme to the UD annotation scheme for the first release (1020 IDT trees) was designed by Teresa Lynn and Jennifer Foster at Dublin City University, Ireland. The mapping to UD is reported in Lynn et al. (2016). Conversion of sentences 1-1020 was automatic, with manual review. Subsequent updates and changes have combined automatic labelling with manual review. All trees with sentence ID greater than 1021 were created through an automatic pre-parsing approach followed by manual review.

The UD Treebank is split into two sets as follows:

Note: the 451 dev trees were taken from the set of newly annotated trees in the v2.5 release. The selection of test sentences hasn't changed since v1.0 (but annotations and quality have!).

Acknowledgments

We wish to thank all of the contributors to the original IDT annotation, including Elaine Uí Dhonnchadha for her gold POS-tagged corpus and linguistic advice. We would also like to acknowledge linguistic advice offered by Kevin Scannell in the conversion to UD effort.

Expansion of the IUDT from 2019-2021 is funded by the Irish Government Department of Culture, Heritage and the Gaeltacht under the GaelTech project.

This research is partially supported by Science Foundation Ireland through the ADAPT Centre for Digital Content Technology. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

Statistics of UD Irish IDT

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X

Features

Abbr, Aspect, Case, Definite, Degree, Dialect, Foreign, Form, Gender, Mood, NounType, Number, NumType, PartType, Person, Polarity, Poss, PrepForm, PronType, Reflex, Tense, Typo, VerbForm

Relations

acl, acl:relcl, advcl, advmod, amod, appos, case, case:voc, cc, ccomp, compound, compound:prt, conj, cop, csubj:cleft, csubj:cop, det, discourse, dislocated, fixed, flat, flat:foreign, flat:name, goeswith, list, mark, mark:prt, nmod, nmod:poss, nsubj, nsubj:outer, nummod, obj, obl, obl:prep, obl:tmod, orphan, parataxis, punct, root, vocative, xcomp, xcomp:pred

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
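
The restriction to verb parents and noun or pronoun children can be sketched as a filter over CoNLL-U token lines. This is only an illustration under stated assumptions, not the script used to produce the statistics: the function name is hypothetical, the sentence is invented rather than taken from the treebank, and treating only NOUN and PRON (not PROPN) as nominal children is an assumption.

```python
from collections import Counter

# Assumption: "nouns or pronouns" means UPOS NOUN or PRON; PROPN is excluded.
NOMINAL_UPOS = {"NOUN", "PRON"}


def verb_nominal_relations(sentence_lines):
    """Count relation labels on edges from a VERB parent to a NOUN/PRON child."""
    tokens = {}
    for line in sentence_lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        # Keep UPOS, HEAD and DEPREL, indexed by the token ID.
        tokens[int(cols[0])] = (cols[3], int(cols[6]), cols[7])

    counts = Counter()
    for upos, head, deprel in tokens.values():
        # HEAD 0 is the artificial root and never appears in `tokens`.
        if upos in NOMINAL_UPOS and head in tokens and tokens[head][0] == "VERB":
            counts[deprel] += 1
    return counts


# Invented example sentence ("Chonaic sí an madra"), for illustration only.
sentence = [
    "# text = Chonaic sí an madra",
    "1\tChonaic\tfeic\tVERB\t_\t_\t0\troot\t_\t_",
    "2\tsí\tsí\tPRON\t_\t_\t1\tnsubj\t_\t_",
    "3\tan\tan\tDET\t_\t_\t4\tdet\t_\t_",
    "4\tmadra\tmadra\tNOUN\t_\t_\t1\tobj\t_\t_",
]
counts = verb_nominal_relations(sentence)
print(counts)
```

In this sketch the determiner is ignored because its parent is a noun, while the pronoun and the noun are counted because both attach directly to the verb.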

Relations Overview