UD Alemannic DIVITAL
Language: Alemannic (code: gsw)
Family: Indo-European (IE)
This treebank has been part of Universal Dependencies since the UD v2.17 release.
The following people have contributed to making this treebank part of UD: Nathanaël Beiner, Barbara Hoff, Delphine Bernhard.
Repository: UD_Alemannic-DIVITAL
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.17
License: CC BY-SA 4.0
Genre: fiction, nonfiction, legal, spoken, wiki, bible
Questions, comments? General annotation questions (either Alemannic-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [dbernhard (æt) unistra • fr]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
| Annotation | Source |
|---|---|
| Lemmas | not available |
| UPOS | annotated manually, natively in UD style |
| XPOS | not available |
| Features | not available |
| Relations | annotated manually, natively in UD style |
Description
UD_Alemannic-DIVITAL is a manually corrected treebank of Alemannic Alsatian consisting of sentences from several genres.
The corpus consists mostly of Low Alemannic Alsatian sentences. The sentences have been automatically annotated and manually verified.
The MISC column includes a gloss in French (Gloss[fr]) and a lemma in German (Lemma[de]).
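As a minimal sketch of how these MISC attributes could be read, the snippet below splits a MISC value on `|` and `=`, following the general CoNLL-U convention. The helper name `parse_misc` and the sample token line (form, gloss and lemma values) are invented for illustration and are not taken from the treebank.

```python
def parse_misc(misc: str) -> dict[str, str]:
    """Turn a CoNLL-U MISC value into a dict of attribute -> value."""
    if not misc or misc == "_":
        return {}
    attributes = {}
    for item in misc.split("|"):
        # Split on the first '=' only, so values may themselves contain '='.
        key, _, value = item.partition("=")
        attributes[key] = value
    return attributes


# Hypothetical token line (invented values, tab-separated CoNLL-U columns).
line = "3\tHüs\t_\tNOUN\t_\t_\t0\troot\t_\tGloss[fr]=maison|Lemma[de]=Haus|SpaceAfter=No"
columns = line.split("\t")
misc = parse_misc(columns[9])
print(misc.get("Gloss[fr]"), misc.get("Lemma[de]"))
```

Returning an empty dict for `_` keeps lookups with `.get()` safe for tokens that carry no MISC attributes.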
Document metadata is included at the beginning of each new document (#newdoc): author, source, genre, audience, discourse_type, domain, factuality, form, origin, channel, language_variety.
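The sketch below shows one way the document-level metadata might be collected. It assumes that each field is stored as a `# key = value` comment line following the `# newdoc` line; this layout is an assumption, not something the description above confirms, so check it against the actual files. The function name `read_newdoc_metadata` and the sample values are hypothetical.

```python
# Metadata fields listed in the treebank description.
METADATA_KEYS = {
    "author", "source", "genre", "audience", "discourse_type", "domain",
    "factuality", "form", "origin", "channel", "language_variety",
}


def read_newdoc_metadata(lines):
    """Return one dict of metadata per document (assumed '# key = value' layout)."""
    documents = []
    current = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("# newdoc"):
            # A new document starts: open a fresh metadata record.
            current = {}
            documents.append(current)
        elif current is not None and line.startswith("#") and "=" in line:
            key, _, value = line[1:].partition("=")
            key = key.strip()
            # Keep only the known document-level fields; skip sent_id, text, etc.
            if key in METADATA_KEYS:
                current[key] = value.strip()
    return documents


# Invented sample, for illustration only.
sample = [
    "# newdoc id = doc1",
    "# author = Jane Doe",
    "# genre = fiction",
    "# sent_id = doc1-s1",
    "# text = ...",
]
print(read_newdoc_metadata(sample))
```

Filtering on the known field names prevents sentence-level comments from being mistaken for document metadata.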
For details on the pre-annotation and manual correction process see:
- Barbara Hoff, Nathanaël Beiner, and Delphine Bernhard. 2025. Universal Dependencies for the Alemannic Alsatian Dialects. In Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025), pages 10–22, Ljubljana, Slovenia. Association for Computational Linguistics.
- Delphine Bernhard, Nathanaël Beiner, and Barbara Hoff. 2025. Pre-annotation Matters: A Comparative Study on POS and Dependency Annotation for an Alsatian Dialect. In Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025), pages 173–186, Vienna, Austria. Association for Computational Linguistics.
The annotation guidelines are detailed in:
- Nathanaël Beiner, Barbara Hoff, Carole Werner, and Delphine Bernhard. 2025. Syntactic annotation guidelines for Alsatian – DIVITAL project (Version 1). NAKALA - https://nakala.fr (Huma-Num - CNRS). https://doi.org/10.34847/NKL.5B6CS6WU
Information on metadata can be found in:
- Marianne Vergez-Couret, Delphine Bernhard, Michael Nauge, Myriam Bras, Pablo Ruiz Fabo, and Carole Werner. 2024. Managing Fine-grained Metadata for Text Bases in Extremely Low Resource Languages: The Cases of Two Regional Languages of France. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 212–221, Torino, Italia. ELRA and ICCL.
Acknowledgments
The following people were involved in the creation of this dataset:
- Nathanaël Beiner (data annotation, guidelines)
- Barbara Hoff (data annotation, guidelines)
- Delphine Bernhard (advice on annotations, data collection, selection and pre-processing)
The work was supported by the French National Research Agency (project ANR-21-CE27-0004 DIVITAL).
References
If you use this treebank, please cite this paper:
@inproceedings{hoff-etal-2025-universal,
    title = "{U}niversal {D}ependencies for the {A}lemannic {A}lsatian {D}ialects",
    author = {Hoff, Barbara and
      Beiner, Nathana{\"e}l and
      Bernhard, Delphine},
    editor = {Jablotschkin, Sarah and
      K{\"u}bler, Sandra and
      Zinsmeister, Heike},
    booktitle = "Proceedings of the 23rd International Workshop on Treebanks and Linguistic Theories (TLT, SyntaxFest 2025)",
    month = aug,
    year = "2025",
    address = "Ljubljana, Slovenia",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.tlt-1.2/",
    pages = "10--22",
    ISBN = "979-8-89176-291-6",
}
Statistics of UD Alemannic DIVITAL
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Epenthesis – Foreign – Typo
Relations
acl – acl:relcl – advcl – advcl:relcl – advmod – advmod:emph – advmod:lmod – advmod:tmod – amod – appos – aux – aux:pass – case – cc – cc:preconj – ccomp – compound – compound:prt – conj – cop – csubj – csubj:outer – det – det:poss – det:predet – discourse – dislocated – expl – expl:pv – fixed – flat – flat:name – goeswith – mark – nmod – nmod:lmod – nmod:poss – nmod:tmod – nsubj – nsubj:outer – nsubj:pass – nummod – obj – obl – obl:agent – obl:arg – obl:lmod – obl:tmod – orphan – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 977 sentences, 19334 tokens and 19743 syntactic words.
- This corpus contains 3376 tokens (17%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 206 types of words that contain both letters and punctuation. Examples: d’, d', 's, ’s, s', m'r, d'r, 'm, 'r, ’ne, g’sinn, g’sààt, l', z', frz., g’komme, so-n-, 'ne, Nàtionàl-, g’hett, wisse-n-, ’m, 'em, -ed-, ABC-Buech, Diwan-Netzwerk, Diwan-Schuele, Regional-, biss'l, d'ran, d'rvon, de⸗n⸗, g'säit, g’funde, g’schlàcht, g’sindigt, kumme-n-, mi', numme-n-, od'r, wid'r, worre-n-, wùrre-n-, z’, àng’fànge, ⸗i, 'rem, 'rüs, -ere, -ewer-
- This corpus contains 409 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
- There are 74 types of multi-word tokens. Examples: im, vùm, ìm, àm, vum, zuem, des, vùme, bim, zum, am, ime, mìtem, ìme, ùffem, ins, vùnere, ìnere, dùrichs, mit'm, mìteme, mìtere, noochem, sowie, àme, àns, ìns, ùffs, üs'm, em, l'abbé, mit'em, mìme, noocheme, uf'm, voreme, zem, züem, ànere, ém, ùntereme, aux, du, fers, fonana, foum, hìnterem, i's, jedesmol, l’Homme.
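The figures above are consistent with one another: 409 multi-word tokens of 2.00 words each add 409 words to the 19334 tokens, giving 19743 syntactic words. The sketch below shows how such counts could be derived from a CoNLL-U file, relying on the standard ID conventions (ranges such as `1-2` for multi-word tokens, decimals such as `1.1` for empty nodes). The function name `count_units` and the sample sentence are invented for illustration.

```python
def count_units(lines):
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    sentences = tokens = words = mwts = 0
    in_sentence = False
    range_end = 0  # last word ID covered by the current multi-word token
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the sentence.
            in_sentence = False
            range_end = 0
            continue
        if line.startswith("#"):
            continue  # comment / metadata line
        if not in_sentence:
            in_sentence = True
            sentences += 1
        token_id = line.split("\t", 1)[0]
        if "-" in token_id:
            # Multi-word token line: one surface token spanning several words.
            range_end = int(token_id.split("-")[1])
            tokens += 1
            mwts += 1
        elif "." in token_id:
            continue  # empty node (enhanced dependencies), not a word
        else:
            words += 1
            if int(token_id) > range_end:
                tokens += 1  # a word that is also its own surface token
    return {"sentences": sentences, "tokens": tokens, "words": words, "mwts": mwts}


# Invented sample sentence, for illustration only.
sample = [
    "# sent_id = example-1",
    "1-2\tim\t_\t_\t_\t_\t_\t_\t_\t_",
    "1\tin\t_\tADP\t_\t_\t3\tcase\t_\t_",
    "2\tdem\t_\tDET\t_\t_\t3\tdet\t_\t_",
    "3\tHüs\t_\tNOUN\t_\t_\t0\troot\t_\tSpaceAfter=No",
    "4\t.\t_\tPUNCT\t_\t_\t3\tpunct\t_\t_",
    "",
]
print(count_units(sample))
```

To run it on the treebank, pass an open file object instead of the sample list, since the function only iterates over lines.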
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 22 word types tagged as particles (PART): am, fer, im, in, ne, nem, nemm, nemmi, net, nimmeh, nimmi, nit, nitt, nét, nëm, nìmm, nìmmi, nìt, ze, zu, z’, àm
- Lemmas are not annotated in this corpus, so every word carries the placeholder lemma _. The lemma-based counts for pronouns (PRON), determiners (DET) and auxiliaries (AUX) therefore each report only this single placeholder, which also explains the reported PRON/DET and AUX/VERB overlaps.
- This corpus does not use the VerbForm feature.
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
- Epenthesis
  - Yes
    - ADP: gajen'
    - ADV: so-n-, numme⸗n⸗, o, so
    - AUX: worre-n-, wùrre-n-, hàn, welle⸗n⸗
    - DET: eso-n-
    - SCONJ: wo
    - VERB: wisse-n-, wissen, Wissen', Wissen-, Wisse⸗n, bekùmme-n-, frschiasan', gangen, geh'n, gelaje-n-
- Foreign
  - Yes
    - ADJ: constitutionnel, européenne, international, nationale, régional, supérieur, Alsacienne, Basque, Culturelle, Législatives
    - ADP: de, d', d’, pour, en, an, du, à
    - ADV: enfin, bien, ex, finalement, merci, également
    - AUX: hàn, sommes
    - CCONJ: et
    - DET: les, la, l', de, le, ma, Das, dem, den, des
    - INTJ: Bravo, Eh, allez, Oui, Salut, Sapristi, bien
    - NOUN: Conseil, Rapport, République, droits, Bretzel, Institut, Or, article, bilinguisme, langue
    - NUM: IIIe
    - PRON: -toi, Toi, ich, je, nous
    - PROPN: Alsace, France, ONU, Europe, Gascogne, IPA, La, Moselle, oc, AMCT
    - SCONJ: que
    - VERB: coûte, Foie, VIVE, aussuchen, cherche, choisir, matar, parlez, passant, suche
    - X: bon, Alsace, BA, BE, BI, BO, BU, Little, Pace, Texas
- Typo
  - Yes
    - ADV: o, so
    - PRON: sin
    - SCONJ: wo
Syntax
Auxiliary Verbs and Copula
- Lemmas are not annotated in this corpus, so the copula (cop), auxiliary (aux) and passive auxiliary (aux:pass) relations are each reported with the single placeholder lemma _.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (322)
  - VERB--NOUN-ADP(_) (2)
  - VERB--PRON (680)
- obj
  - VERB--NOUN (516)
  - VERB--NOUN-ADP(_) (4)
  - VERB--PRON (187)
Reflexive Verbs
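- This corpus contains 11 distinct verb-reflexive pairs in which the verb has an expl:pv child. Because lemmas are not annotated, the verb lemma appears as the placeholder _ in each pair. Examples: _ sich, _ sìch, _ mi, _ mich, _ éich, _ anànder, _ di, _ eich, _ eijch, _ enànder, _ sin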
Relations Overview
- This corpus uses 22 relation subtypes: acl:relcl, advcl:relcl, advmod:emph, advmod:lmod, advmod:tmod, aux:pass, cc:preconj, compound:prt, csubj:outer, det:poss, det:predet, expl:pv, flat:name, nmod:lmod, nmod:poss, nmod:tmod, nsubj:outer, nsubj:pass, obl:agent, obl:arg, obl:lmod, obl:tmod
- The following 4 relation types are not used in this corpus at all: iobj, clf, list, dep
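These two figures can be cross-checked against the relation inventory listed under Relations. The sketch below separates subtyped labels (those containing a colon) from base labels and compares the base labels with the 37 universal relations of UD v2. The variable names are illustrative, and the relation list is copied from this page rather than read from the treebank files.

```python
# Relation labels as listed in the Relations section of this page.
RELATIONS = """
acl acl:relcl advcl advcl:relcl advmod advmod:emph advmod:lmod advmod:tmod
amod appos aux aux:pass case cc cc:preconj ccomp compound compound:prt conj
cop csubj csubj:outer det det:poss det:predet discourse dislocated expl
expl:pv fixed flat flat:name goeswith mark nmod nmod:lmod nmod:poss
nmod:tmod nsubj nsubj:outer nsubj:pass nummod obj obl obl:agent obl:arg
obl:lmod obl:tmod orphan parataxis punct reparandum root vocative xcomp
""".split()

# The 37 universal dependency relations of UD v2.
UNIVERSAL = """
acl advcl advmod amod appos aux case cc ccomp clf compound conj cop csubj
dep det discourse dislocated expl fixed flat goeswith iobj list mark nmod
nsubj nummod obj obl orphan parataxis punct reparandum root vocative xcomp
""".split()

# Subtypes are labels of the form base:subtype.
subtypes = [rel for rel in RELATIONS if ":" in rel]

# A universal relation counts as used if it occurs bare or with any subtype.
used_bases = {rel.split(":")[0] for rel in RELATIONS}
unused = sorted(set(UNIVERSAL) - used_bases)

print(len(subtypes), "subtypes")
print("unused universal relations:", unused)
```

The same set difference works on labels collected from the DEPREL column of the treebank files, which would be the more reliable source if the list on this page falls out of date.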