UD Irish TwittIrish
Language: Irish (code: ga)
Family: IE
This treebank has been part of Universal Dependencies since the UD v2.8 release.
The following people have contributed to making this treebank part of UD: Lauren Cassidy, Teresa Lynn, Jennifer Foster, Sarah McGuinness.
Repository: UD_Irish-TwittIrish
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: social
Questions, comments? General annotation questions (either Irish-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [lauren • cassidy (æt) adaptcentre • ie; teresa • lynn (æt) adaptcentre • ie]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | not available |
Relations | annotated manually, natively in UD style |
Description
A Universal Dependencies treebank of 2596 tweets in modern Irish.
The TwittIrish treebank contains 2596 Irish language tweets from two corpora: 1297 tweets from the New Twitter Corpus [NTC] and 1299 tweets from the Lynn Twitter Corpus [LTC].
- NTC consists of 25000 tweets posted between 2010 and 2019 randomly sampled from a database of 14111 users who have tweeted in Irish.
- LTC consists of 1493 tweets posted between 2009 and 2014 randomly sampled from 950000 tweets by 8000 users. Lemmas and POS tags were added to LTC as part of a PhD research project by Dr. Teresa Lynn at Dublin City University, Ireland (Lynn, 2016; Lynn, Scannell and Maguire, 2015). The LTC data was further annotated with code-switching information (Lynn and Scannell, 2019). The LTC data can be found here: https://github.com/tlynn747/IrishTwitterPOS.
Irish language tweets were identified by Kevin Scannell as part of the Indigenous Tweets website project http://indigenoustweets.com/. Non-Irish tweets were filtered out using a simple character-trigram language identifier.
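The trigram filtering step can be illustrated with a minimal sketch. This is not the identifier used for Indigenous Tweets, whose details are not given here; it is a generic character-trigram scorer with add-one smoothing, and the function names and toy training sentences are assumptions made for illustration only.

```python
import math
from collections import Counter

def trigrams(text):
    """Return the character trigrams of a lowercased, space-padded string."""
    padded = f"  {text.lower()} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(samples):
    """Build a trigram frequency profile from a list of example strings."""
    counts = Counter()
    for sample in samples:
        counts.update(trigrams(sample))
    return counts

def score(text, profile, vocab_size):
    """Log-probability of the text under one profile, with add-one smoothing."""
    total = sum(profile.values())
    return sum(
        math.log((profile[t] + 1) / (total + vocab_size)) for t in trigrams(text)
    )

def identify(text, profiles):
    """Return the language code whose profile gives the text the highest score."""
    vocab = set()
    for profile in profiles.values():
        vocab.update(profile)
    return max(profiles, key=lambda lang: score(text, profiles[lang], len(vocab)))

# Toy training data (hypothetical; a real identifier needs far more text).
PROFILES = {
    "ga": train([
        "Tá an aimsir go hálainn inniu",
        "Conas atá tú",
        "Go raibh maith agat",
        "Is maith liom an Ghaeilge",
    ]),
    "en": train([
        "The weather is lovely today",
        "How are you",
        "Thank you very much",
        "I like the English language",
    ]),
}

def is_irish(text):
    """Keep a tweet only if the Irish profile scores highest."""
    return identify(text, PROFILES) == "ga"
```

Filtering then amounts to keeping the tweets for which `is_irish` returns `True`. Tweets that mix Irish and English, which the treebank does contain, are exactly the cases where such a simple identifier is least reliable.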
The conversion from the LTC annotation scheme to the UD annotation scheme was designed by Lauren Cassidy as part of a PhD project, supervised by Dr. Teresa Lynn and Dr. Jennifer Foster at Dublin City University, Ireland. The conversion was automatic, with manual review, in consultation with other researchers working on UD annotation of User Generated Content (Sanguinetti et al., 2020).
Trees were parsed automatically using the Irish UD Treebank [IUDT] (Lynn and Foster, 2016) as training data, followed by manual review. The IUDT can be found here: https://github.com/UniversalDependencies/UD_Irish-IDT.
Acknowledgments
We wish to thank all of the contributors to the IUDT annotation, Kevin Scannell for providing data and linguistic advice, and James Barry for improving the accuracy of automatic parsing by experimenting with different models.
The creation of the TwittIrish treebank from 2019 to 2023 is funded by the Irish Government's Department of Tourism, Culture, Arts, Gaeltacht, Sport and Media under the GaelTech project.
This research is partially supported by Science Foundation Ireland through the ADAPT Centre for Digital Content Technology. The ADAPT Centre for Digital Content Technology is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
References
- TwittIrish: A Universal Dependencies Treebank of Tweets in Modern Irish (Cassidy et al., ACL 2022)
- Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations (Sanguinetti et al., Lang Resources & Evaluation 2022)
- Code-switching in Irish tweets: A preliminary analysis (Lynn and Scannell, 2019)
- Irish Dependency Treebanking and Parsing (Lynn, Dublin City University 2016)
- Universal Dependencies for Irish (Lynn and Foster, CLTW 2016)
- Minority Language Twitter: Part-of-Speech Tagging and Analysis of Irish Tweets (Lynn et al., W-NUT 2015)
Statistics of UD Irish TwittIrish
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Relations
acl – acl:relcl – advcl – advmod – amod – appos – aux – case – case:voc – cc – ccomp – compound – compound:prt – conj – cop – csubj – csubj:cleft – csubj:cop – dep – det – det:poss – discourse – discourse:emo – expl – fixed – flat – flat:foreign – flat:name – goeswith – iobj – list – mark – mark:prt – nmod – nmod:poss – nmod:tmod – nsubj – nsubj:outer – nummod – obj – obl – obl:prep – obl:tmod – orphan – parataxis – parataxis:hashtag – parataxis:rt – parataxis:sentence – parataxis:url – punct – reparandum – root – vocative – vocative:mention – xcomp – xcomp:pred
Tokenization and Word Segmentation
- This corpus contains 2596 sentences and 47790 tokens.
- This corpus contains 6493 tokens (14%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 3724 types of words that contain both letters and punctuation. Examples: #gaeilge, @user241, @user1140, @user263, @user27, @user288, d', @user412, @user635, #gaa, 's, #tg4, @user187, @user660, @user619, @user880, @user1530, @user1478, @user1697, b', n-éirí, @user229, @user640, @user1639, @user1648, @user663, @user886, @user423, @user505, @user1523, @user312, d’, #clg, :D, @user791, @user891, Foinse.ie, @user1158, @user1349, @user1368, @user1747, @user292, @user850, m', #lágaeilge, #snag, @user1175, @user1606, @user402, @user792
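Counts like the ones above can be recomputed from a CoNLL-U file. The sketch below assumes the standard ten-column CoNLL-U layout and counts only lines with plain integer IDs. The sample sentence and its annotation are invented for illustration and are not taken from the treebank, and the function name `token_stats` is hypothetical.

```python
# An invented one-sentence sample in CoNLL-U format (tab-separated columns).
SAMPLE = "\n".join([
    "# text = Dia duit! #gaeilge",
    "1\tDia\tDia\tPROPN\t_\t_\t0\troot\t_\t_",
    "2\tduit\tdo\tADP\t_\t_\t1\tnmod\t_\tSpaceAfter=No",
    "3\t!\t!\tPUNCT\t_\t_\t1\tpunct\t_\t_",
    "4\t#gaeilge\t#gaeilge\tPROPN\t_\t_\t1\tparataxis:hashtag\t_\t_",
    "",
])

def token_stats(conllu_text):
    """Return (sentences, tokens, tokens not followed by a space)."""
    sentences = tokens = no_space = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():          # a blank line ends the current sentence
            in_sentence = False
            continue
        if line.startswith("#"):      # sentence-level comment
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():     # skip ranges (1-2) and empty nodes (1.1)
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):   # MISC is the tenth column
            no_space += 1
    return sentences, tokens, no_space

sentences, tokens, no_space = token_stats(SAMPLE)
share = 100 * no_space / tokens   # percentage of tokens with no following space
```

For the figures reported above, 6493 / 47790 is about 13.6%, which the page rounds to 14%.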
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 61 word types tagged as particles (PART): 's, Mag, Mc, Mhac, Mhic, Mo, Nic, O', O'Toole, San, Uí, Why, a, ag, an, ana, ar, ba, bhfuil, cha, char, d', de, do, droch, dár, d’, g, go, gur, ina, inar, iontach, is, lena, mac, n, n', n't, na, nach, ndéanann, ni, nil, nior, nios, not, ná, nár, ní, níor, níos, o, o'Connor, s, seminar, to, á, ín, ó, óg
- This corpus contains 63 lemmas tagged as pronouns (PRON): @user1297, They, a, all, be, cad, caidé, ceard, cibé, cé, céard, ea, everyone, féin, he, her, himself, his, i, iad, is, it, le, me, mise, muid, muidne, my, mé, mí, our, se, seo, seó, she, siad, sibh, sin, sinn, sise, siúd, spré, sé, sí, that, their, them, there, this, tusa, tú, u, us, we, what, who, y, ya, you, your, yourself, é, í
- This corpus contains 46 lemmas tagged as determiners (DET): Die, a, achan, all, an, another, any, aon, bhur, brón, chuile, cibé, cé, cúpla, do, eile, else, gach, his, i, is, la, le, leath, meus, mo, my, na, no, our, pé, s, s.c., seo, sin, siúd, some, such, the, this, uile, watever, you, your, ár, úd
- Out of the above, 16 lemmas occurred sometimes as PRON and sometimes as DET: a, all, cibé, cé, his, i, is, le, my, our, seo, sin, siúd, this, you, your
- This corpus contains 11 lemmas tagged as auxiliaries (AUX): be, can, could, do, have, is, might, must, should, will, would
- Out of the above, 5 lemmas occurred sometimes as AUX and sometimes as VERB: be, can, do, have, is
- This corpus does not use the VerbForm feature.
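The overlap figures above (lemmas seen under two different UPOS tags) reduce to a set intersection. A minimal sketch, assuming the corpus has already been read into (lemma, UPOS) pairs; the helper names and the handful of pairs are illustrative, not an extract of the real data.

```python
from collections import defaultdict

def lemmas_by_upos(pairs):
    """Map each UPOS tag to the set of lemmas observed with it."""
    table = defaultdict(set)
    for lemma, upos in pairs:
        table[upos].add(lemma)
    return table

def shared_lemmas(pairs, tag_a, tag_b):
    """Lemmas that occur at least once with tag_a and at least once with tag_b."""
    table = lemmas_by_upos(pairs)
    return sorted(table[tag_a] & table[tag_b])

# A few (lemma, UPOS) pairs chosen for illustration.
PAIRS = [
    ("seo", "PRON"), ("seo", "DET"),
    ("sin", "PRON"), ("sin", "DET"),
    ("mé", "PRON"), ("an", "DET"),
    ("be", "AUX"), ("be", "VERB"), ("will", "AUX"),
]

pron_det = shared_lemmas(PAIRS, "PRON", "DET")
aux_verb = shared_lemmas(PAIRS, "AUX", "VERB")
```

Many of the shared lemmas listed above are English words or very short forms, so on real data this kind of check is also a quick way to surface tokens worth a second look during manual review.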
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 2 lemmas as copulas (cop). Examples: is, be.
- This corpus uses 10 lemmas as auxiliaries (aux). Examples: be, will, do, can, have, would, could, might, must, should.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (731)
  - VERB--NOUN-ADP(do) (1)
  - VERB--PRON (637)
  - VERB--PRON-ADP(chun) (1)
- obj
  - VERB--NOUN (453)
  - VERB--NOUN-ADP(chun) (1)
  - VERB--NOUN-ADP(do) (1)
  - VERB--NOUN-ADP(le) (1)
  - VERB--PRON (110)
  - VERB--PRON-ADP(on) (1)
- iobj
  - VERB--NOUN (1)
  - VERB--PRON (1)
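The pattern counts above can be approximated with a short sketch. It assumes that a suffix such as `-ADP(do)` means the noun or pronoun has an adposition with that lemma attached to it via `case`; this reading, the dictionary-based token representation, and the example sentence are assumptions, not a description of the script that generated the page.

```python
from collections import Counter

CORE = {"nsubj", "obj", "iobj"}

def core_patterns(sentence):
    """Count (relation, pattern) pairs for VERB parents with NOUN/PRON children."""
    by_id = {tok["id"]: tok for tok in sentence}
    counts = Counter()
    for tok in sentence:
        head = by_id.get(tok["head"])   # None for the root, whose head is 0
        if head is None or tok["deprel"] not in CORE:
            continue
        if head["upos"] != "VERB" or tok["upos"] not in {"NOUN", "PRON"}:
            continue
        pattern = f"VERB--{tok['upos']}"
        # Append any adposition attached to the child through a case relation.
        for dep in sentence:
            if (dep["head"] == tok["id"] and dep["deprel"] == "case"
                    and dep["upos"] == "ADP"):
                pattern += f"-ADP({dep['lemma']})"
        counts[(tok["deprel"], pattern)] += 1
    return counts

# "Chonaic mé an madra" ("I saw the dog"), annotated by hand for illustration.
SENTENCE = [
    {"id": 1, "lemma": "feic", "upos": "VERB", "head": 0, "deprel": "root"},
    {"id": 2, "lemma": "mé", "upos": "PRON", "head": 1, "deprel": "nsubj"},
    {"id": 3, "lemma": "an", "upos": "DET", "head": 4, "deprel": "det"},
    {"id": 4, "lemma": "madra", "upos": "NOUN", "head": 1, "deprel": "obj"},
]

patterns = core_patterns(SENTENCE)
```

The relation test is an exact match, so subtyped relations such as `nsubj:outer` are not counted here. Core arguments marked by an adposition are unexpected in UD, which may explain why each such pattern appears only once in the counts above.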
Relations Overview
- This corpus uses 21 relation subtypes: acl:relcl, case:voc, compound:prt, csubj:cleft, csubj:cop, det:poss, discourse:emo, flat:foreign, flat:name, mark:prt, nmod:poss, nmod:tmod, nsubj:outer, obl:prep, obl:tmod, parataxis:hashtag, parataxis:rt, parataxis:sentence, parataxis:url, vocative:mention, xcomp:pred
- The following 2 relation types are not used in this corpus at all: dislocated, clf
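Both figures can be checked against the relation inventory listed earlier on this page. The sketch below takes that inventory and the 37 universal relations of UD v2 as plain strings. The helper names are hypothetical.

```python
# The 37 universal dependency relations defined in UD v2.
UNIVERSAL = set("""
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj dep
det discourse dislocated expl fixed flat goeswith iobj list mark nmod nsubj
nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split())

# Relations used in this treebank, as listed under "Relations" above.
USED = """
acl acl:relcl advcl advmod amod appos aux case case:voc cc ccomp compound
compound:prt conj cop csubj csubj:cleft csubj:cop dep det det:poss discourse
discourse:emo expl fixed flat flat:foreign flat:name goeswith iobj list mark
mark:prt nmod nmod:poss nmod:tmod nsubj nsubj:outer nummod obj obl obl:prep
obl:tmod orphan parataxis parataxis:hashtag parataxis:rt parataxis:sentence
parataxis:url punct reparandum root vocative vocative:mention xcomp xcomp:pred
""".split()

def subtypes(relations):
    """Relations that carry a language- or treebank-specific subtype."""
    return sorted(r for r in set(relations) if ":" in r)

def unused_universal(relations):
    """Universal relations that never occur, with or without a subtype."""
    used_base = {r.split(":", 1)[0] for r in relations}
    return sorted(UNIVERSAL - used_base)

n_subtypes = len(subtypes(USED))      # number of distinct subtyped relations
missing = unused_universal(USED)      # universal relations absent from USED
```

With the inventory above, `n_subtypes` is 21 and `missing` is `['clf', 'dislocated']`, matching the two statements in this section.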