
This page pertains to UD version 2.


UD Irish TwittIrish

Language: Irish (code: ga)
Family: IE

This treebank has been part of Universal Dependencies since the UD v2.8 release.

The following people have contributed to making this treebank part of UD: Lauren Cassidy, Teresa Lynn, Jennifer Foster, Sarah McGuinness.

Repository: UD_Irish-TwittIrish
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.18

License: CC BY-SA 4.0

Genre: social

Questions, comments? General annotation questions (either Irish-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [lauren • cassidy (æt) adaptcentre • ie; teresa • lynn (æt) adaptcentre • ie]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation	Source
Lemmas	annotated manually
UPOS	annotated manually, natively in UD style
XPOS	not available
Features	not available
Relations	annotated manually, natively in UD style

Description

A Universal Dependencies treebank of 2596 tweets in modern Irish.

The TwittIrish treebank contains 2596 Irish language tweets from two corpora: 1297 tweets from the New Twitter Corpus [NTC] and 1299 tweets from the Lynn Twitter Corpus [LTC].

NTC consists of 25000 tweets posted between 2010 and 2019 randomly sampled from a database of 14111 users who have tweeted in Irish.
LTC consists of 1493 tweets posted between 2009 and 2014 randomly sampled from 950000 tweets by 8000 users. Lemmas and POS tags were added to LTC as part of a PhD research project by Dr. Teresa Lynn at Dublin City University, Ireland (Lynn, 2016; Lynn, Scannell and Maguire, 2015). The LTC data was further annotated with code-switching information (Lynn and Scannell, 2019). The LTC data can be found here: https://github.com/tlynn747/IrishTwitterPOS.

Irish language tweets were identified by Kevin Scannell as part of the Indigenous Tweets website project http://indigenoustweets.com/. Non-Irish tweets were filtered out using a simple character-trigram language identifier.
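
As an illustration of the filtering step, here is a minimal sketch of a character-trigram language identifier. It is not the identifier actually used to build the corpus: the add-one smoothing, the function names and the toy training sentences are all assumptions made for this example.

```python
import math
from collections import Counter


def char_trigrams(text):
    """Return the overlapping character trigrams of the lowercased, space-padded text."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(samples):
    """Count trigram frequencies over a list of training strings."""
    counts = Counter()
    for sample in samples:
        counts.update(char_trigrams(sample))
    return counts


def identify(text, models):
    """Return the language whose trigram model assigns the text the highest
    add-one-smoothed log-probability."""
    vocab_size = len(set().union(*models.values()))
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[tri] + 1) / (total + vocab_size))
            for tri in char_trigrams(text)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data, invented for this example (not taken from the corpus).
models = {
    "ga": train([
        "tá an aimsir go hálainn inniu",
        "an bhfuil tú ag teacht anocht",
        "go raibh maith agat",
    ]),
    "en": train([
        "the weather is lovely today",
        "are you coming tonight",
        "thank you very much",
    ]),
}

print(identify("go raibh maith agat a chara", models))
print(identify("are you coming today", models))
```

A real identifier would be trained on far more text per language. Code-switched tweets, which the LTC annotation covers, are a known hard case for this kind of classifier.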

The conversion from the LTC annotation scheme to the UD annotation scheme was designed by Lauren Cassidy as part of a PhD project, supervised by Dr. Teresa Lynn and Dr. Jennifer Foster at Dublin City University, Ireland. The conversion was automatic, with manual review, in consultation with other researchers working on UD annotation of User Generated Content (Sanguinetti et al., 2020).

Trees were parsed automatically using the Irish UD Treebank [IUDT] (Lynn and Foster, 2016) as training data, followed by manual review. The IUDT can be found here: https://github.com/UniversalDependencies/UD_Irish-IDT.

Acknowledgments

We wish to thank all of the contributors to the IUDT annotation, Kevin Scannell for providing data and linguistic advice, and James Barry for improving the accuracy of automatic parsing by experimenting with different models.

The creation of the TwittIrish treebank from 2019 to 2023 is funded by the Irish Government's Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media under the GaelTech project.

This research is partially supported by Science Foundation Ireland through the ADAPT Centre for Digital Content Technology. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.

References

TwittIrish: A Universal Dependencies Treebank of Tweets in Modern Irish (Cassidy et al., ACL 2022)
Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations (Sanguinetti et al., Lang Resources & Evaluation 2022)
Code-switching in Irish tweets: A preliminary analysis (Lynn and Scannell, 2019)
Irish Dependency Treebanking and Parsing (Lynn, Dublin City University 2016)
Universal Dependencies for Irish (Lynn and Foster, CLTW 2016)
Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets (Lynn et al., W-NUT 2015)

Statistics of UD Irish TwittIrish

POS Tags

ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X

Features

Relations

acl – acl:relcl – advcl – advmod – amod – appos – aux – case – case:voc – cc – ccomp – compound – compound:prt – conj – cop – csubj – csubj:cleft – csubj:cop – dep – det – det:poss – discourse – discourse:emo – expl – fixed – flat – flat:foreign – flat:name – goeswith – iobj – list – mark – mark:prt – nmod – nmod:poss – nmod:tmod – nsubj – nsubj:outer – nummod – obj – obl – obl:prep – obl:tmod – orphan – parataxis – parataxis:hashtag – parataxis:rt – parataxis:sentence – parataxis:url – punct – reparandum – root – vocative – vocative:mention – xcomp – xcomp:pred

Tokenization and Word Segmentation

This corpus contains 2596 sentences and 47790 tokens.

This corpus contains 6493 tokens (14%) that are not followed by a space.
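
The counts above (6493 of 47790 tokens is about 13.6%, reported as 14%) can be approximated from a CoNLL-U file. The sketch below assumes that a missing following space is marked with SpaceAfter=No in the MISC column, as is conventional in UD. It counts syntactic words only, and the official statistics script may treat multiword tokens differently. The two sample sentences and their annotations are invented for illustration and are not taken from the treebank.

```python
def conllu_counts(conllu_text):
    """Return (sentences, tokens, tokens_without_following_space)."""
    sentences = tokens = no_space = 0
    at_sentence_start = True
    for line in conllu_text.splitlines():
        if not line.strip():
            at_sentence_start = True  # a blank line ends the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        if at_sentence_start:
            sentences += 1
            at_sentence_start = False
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return sentences, tokens, no_space


def row(idx, form, lemma, upos, head, deprel, misc="_"):
    """Build one 10-column CoNLL-U line (XPOS, FEATS and DEPS left empty)."""
    return "\t".join(
        [str(idx), form, lemma, upos, "_", "_", str(head), deprel, "_", misc]
    )


# Invented sample; the annotations are illustrative only.
SAMPLE = "\n".join([
    "# text = Maith thú, @user241!",
    row(1, "Maith", "maith", "ADJ", 0, "root"),
    row(2, "thú", "tú", "PRON", 1, "nsubj", "SpaceAfter=No"),
    row(3, ",", ",", "PUNCT", 4, "punct"),
    row(4, "@user241", "@user241", "PROPN", 1, "vocative:mention", "SpaceAfter=No"),
    row(5, "!", "!", "PUNCT", 1, "punct"),
    "",
    "# text = Tá sé fuar.",
    row(1, "Tá", "bí", "VERB", 0, "root"),
    row(2, "sé", "sé", "PRON", 1, "nsubj"),
    row(3, "fuar", "fuar", "ADJ", 1, "xcomp:pred", "SpaceAfter=No"),
    row(4, ".", ".", "PUNCT", 1, "punct"),
    "",
])

sentences, tokens, no_space = conllu_counts(SAMPLE)
print(sentences, tokens, no_space, f"{100 * no_space / tokens:.0f}%")
```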

This corpus does not contain words with spaces.

This corpus contains 3724 types of words that contain both letters and punctuation. Examples: #gaeilge, @user241, @user1140, @user263, @user27, @user288, d', @user412, @user635, #gaa, 's, #tg4, @user187, @user660, @user619, @user880, @user1530, @user1478, @user1697, b', n-éirí, @user229, @user640, @user1639, @user1648, @user663, @user886, @user423, @user505, @user1523, @user312, d’, #clg, :D, @user791, @user891, Foinse.ie, @user1158, @user1349, @user1368, @user1747, @user292, @user850, m', #lágaeilge, #snag, @user1175, @user1606, @user402, @user792
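
One way to reproduce this kind of count is to test each distinct word form for at least one letter and at least one punctuation character. The sketch below uses Unicode general categories (L* for letters, P* for punctuation). This is an assumed reading of "letters and punctuation", and the official statistics script may define it differently.

```python
import unicodedata


def has_letters_and_punct(form):
    """True if the form contains at least one letter and one punctuation character."""
    categories = [unicodedata.category(ch) for ch in form]
    has_letter = any(cat.startswith("L") for cat in categories)
    has_punct = any(cat.startswith("P") for cat in categories)
    return has_letter and has_punct


# A few forms from the examples above, plus three that should not match.
forms = ["#gaeilge", "@user241", "d'", "d’", "n-éirí", ":D", "Foinse.ie",
         "gaeilge", "...", "2019"]

mixed_types = {form for form in forms if has_letters_and_punct(form)}
print(len(mixed_types))
```

Counting over a set gives the number of types rather than tokens, which matches the wording of the statistic.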

Morphology

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

This corpus uses 2 lemmas as copulas (cop). Examples: is, be.

This corpus uses 10 lemmas as auxiliaries (aux). Examples: be, will, do, can, have, would, could, might, must, should.

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).

nsubj
- VERB--NOUN (731)
- VERB--NOUN-ADP(do) (1)
- VERB--PRON (637)
- VERB--PRON-ADP(chun) (1)

obj
- VERB--NOUN (453)
- VERB--NOUN-ADP(chun) (1)
- VERB--NOUN-ADP(do) (1)
- VERB--NOUN-ADP(le) (1)
- VERB--PRON (110)
- VERB--PRON-ADP(on) (1)

iobj
- VERB--NOUN (1)
- VERB--PRON (1)
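
The pattern labels above can be read as parent UPOS, child UPOS and, where present, the lemma of an adposition attached to the child. The following sketch tallies patterns under that reading. It is an interpretation of the notation rather than the script that generated this page, and the sample sentences (including the third, deliberately artificial one) are invented for illustration.

```python
from collections import Counter

CORE_RELATIONS = ("nsubj", "obj", "iobj")


def word(idx, lemma, upos, head, deprel):
    """Minimal word record holding only the fields the sketch needs."""
    return {"id": idx, "lemma": lemma, "upos": upos, "head": head, "deprel": deprel}


def core_argument_patterns(sentences):
    """Count VERB--NOUN/PRON patterns per core relation, adding the lemma of
    any ADP child attached to the argument with the relation `case`."""
    counts = {rel: Counter() for rel in CORE_RELATIONS}
    for words in sentences:
        by_id = {w["id"]: w for w in words}
        for w in words:
            if w["deprel"] not in CORE_RELATIONS:
                continue  # exact match only; subtypes such as nsubj:outer are ignored
            parent = by_id.get(w["head"])
            if parent is None or parent["upos"] != "VERB":
                continue
            if w["upos"] not in ("NOUN", "PRON"):
                continue
            adpositions = sorted(
                c["lemma"] for c in words
                if c["head"] == w["id"] and c["deprel"] == "case" and c["upos"] == "ADP"
            )
            pattern = f"VERB--{w['upos']}"
            if adpositions:
                pattern += "-ADP(" + "+".join(adpositions) + ")"
            counts[w["deprel"]][pattern] += 1
    return counts


# Invented sample sentences; the annotations are illustrative only.
sentences = [
    [word(1, "léigh", "VERB", 0, "root"), word(2, "an", "DET", 3, "det"),
     word(3, "cailín", "NOUN", 1, "nsubj"), word(4, "leabhar", "NOUN", 1, "obj")],
    [word(1, "bí", "VERB", 0, "root"), word(2, "sé", "PRON", 1, "nsubj"),
     word(3, "fuar", "ADJ", 1, "xcomp:pred")],
    # Artificial case: an object that carries an adposition.
    [word(1, "tabhair", "VERB", 0, "root"), word(2, "do", "ADP", 3, "case"),
     word(3, "rud", "NOUN", 1, "obj")],
]

for relation, patterns in core_argument_patterns(sentences).items():
    print(relation, dict(patterns))
```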

Relations Overview

This corpus uses 21 relation subtypes: acl:relcl, case:voc, compound:prt, csubj:cleft, csubj:cop, det:poss, discourse:emo, flat:foreign, flat:name, mark:prt, nmod:poss, nmod:tmod, nsubj:outer, obl:prep, obl:tmod, parataxis:hashtag, parataxis:rt, parataxis:sentence, parataxis:url, vocative:mention, xcomp:pred
The following 2 relation types are not used in this corpus at all: dislocated, clf
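
Both figures can be checked against the relation list given under "Relations" on this page. The short sketch below derives the subtypes (labels containing a colon) and compares the base labels with the 37 universal relations of UD v2. The variable names are illustrative.

```python
# Relation labels as listed under "Relations" on this page.
USED = """
acl acl:relcl advcl advmod amod appos aux case case:voc cc ccomp compound
compound:prt conj cop csubj csubj:cleft csubj:cop dep det det:poss discourse
discourse:emo expl fixed flat flat:foreign flat:name goeswith iobj list mark
mark:prt nmod nmod:poss nmod:tmod nsubj nsubj:outer nummod obj obl obl:prep
obl:tmod orphan parataxis parataxis:hashtag parataxis:rt parataxis:sentence
parataxis:url punct reparandum root vocative vocative:mention xcomp xcomp:pred
""".split()

# The 37 universal relations of UD v2.
UNIVERSAL = """
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj dep
det discourse dislocated expl fixed flat goeswith iobj list mark nmod nsubj
nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split()

subtypes = sorted(rel for rel in USED if ":" in rel)
used_bases = {rel.split(":")[0] for rel in USED}
unused = sorted(set(UNIVERSAL) - used_bases)

print(len(subtypes))  # 21
print(unused)         # ['clf', 'dislocated']
```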