UD Maltese MUDT
Language: Maltese (code: mt)
Family: Afro-Asiatic
This treebank has been part of Universal Dependencies since the UD v2.3 release.
The following people have contributed to making this treebank part of UD: Slavomír Čéplö, Daniel Zeman.
Repository: UD_Maltese-MUDT
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: news, legal, nonfiction, fiction, wiki
Questions, comments? General annotation questions (either Maltese-specific or cross-linguistic) can be raised in the main UD issue tracker. Bugs in this treebank can be reported in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [bulbul (æt) bulbul • sk]. Development of the treebank happens directly in the UD repository, so bug fixes can be submitted as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | not available |
UPOS | annotated manually in non-UD style, automatically converted to UD |
XPOS | annotated manually |
Features | not available |
Relations | annotated manually, natively in UD style |
Description
MUDT (Maltese Universal Dependencies Treebank) is a manually annotated treebank of Maltese, a Semitic language of Malta descended from North African Arabic with a significant amount of Italo-Romance influence. MUDT was designed as a balanced corpus with four major genres (see Splitting below) represented roughly equally.
Origin
This treebank is the product of the PhD thesis "Constituent order in Maltese: A quantitative analysis" by Slavomír Čéplö. The text (see References) contains a detailed description of the annotation decisions and the composition of the treebank. The treebank was originally produced in accordance with UD v1; this version has been brought up to the UD v2.5 standard.
Splitting
MUDT contains 2074 sentences and 44,162 tokens (both defined orthographically) in the following text types:
Text type | Subtype | Sentence count |
---|---|---|
newspaper | news | 239 |
newspaper | op-eds | 240 |
newspaper | Subtotal | 479 |
quasi-spoken | newspaper interviews | 280 |
quasi-spoken | parliament: debates and Q&A | 294 |
quasi-spoken | Subtotal | 574 |
fiction | short stories | 246 |
fiction | novel chapters | 251 |
fiction | Subtotal | 497 |
non-fiction | humanities | 249 |
non-fiction | science, encyclopedic and instructional | 275 |
non-fiction | Subtotal | 524 |
Total | | 2074 |
The annotated sentences have been manually split into train, test and dev sets as follows:
File | Sentence count | Token count |
---|---|---|
mt_mudt-ud-train.conllu | 1123 | 22880 |
mt_mudt-ud-test.conllu | 518 | 11073 |
mt_mudt-ud-dev.conllu | 433 | 10209 |
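As a quick consistency check, the per-file counts in the table can be summed and compared with the corpus totals reported above (2074 sentences, 44,162 tokens). The snippet below is only a sketch: the numbers are copied from the table by hand, and the variable names are illustrative.

```python
# Per-file (sentence count, token count), copied from the table above.
splits = {
    "mt_mudt-ud-train.conllu": (1123, 22880),
    "mt_mudt-ud-test.conllu": (518, 11073),
    "mt_mudt-ud-dev.conllu": (433, 10209),
}

total_sentences = sum(sentences for sentences, _ in splits.values())
total_tokens = sum(tokens for _, tokens in splits.values())

# Share of all sentences that falls into each split, in percent.
sentence_share = {
    name: 100 * sentences / total_sentences
    for name, (sentences, _) in splits.items()
}

print(total_sentences, total_tokens)
for name, share in sentence_share.items():
    print(f"{name}: {share:.1f}% of sentences")
```

The totals agree with the figures given for the whole corpus, so the three files together cover every annotated sentence.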
Acknowledgments
Statistics of UD Maltese MUDT
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – SYM – VERB – X
Features
Relations
acl – advcl – advmod – advmod:neg – amod – appos – aux – aux:neg – aux:part – aux:pass – case – case:det – cc – ccomp – compound – conj – cop – cop:expl – csubj – dep – det – discourse – dislocated – expl – fixed – flat – flat:name – goeswith – iobj – list – mark – nmod – nmod:poss – nsubj – nsubj:pass – nummod – obj – obl – obl:agent – obl:arg – orphan – parataxis – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 2074 sentences and 44162 tokens.
- This corpus contains 10625 tokens (24%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 236 types of words that contain both letters and punctuation. Examples: l-, il-, ta', tal-, fil-, f', mill-, b', għall-, fl-, it-, is-, ma', x', t-, lill-, id-, ir-, bil-, m', jista', mal-, d-, s-, fis-, tas-, ix-, tat-, r-, tad-, bl-, in-, fit-, tista', fid-, iż-, 'l, tar-, tan-, n-, għas-, x-, bis-, ż-, 'il, għat-, jibqa', tax-, ċ-, fix-
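The count of tokens not followed by a space can be reproduced from the CoNLL-U files, where a missing space is recorded as `SpaceAfter=No` in the tenth (MISC) column. The following is a minimal sketch rather than the script that produced the statistics above: the function names are hypothetical, the toy fragment is made up for illustration and not taken from the treebank, and it assumes that only regular word lines (integer IDs) should be counted.

```python
def iter_words(conllu_text):
    """Yield the 10 CoNLL-U columns of every regular word line."""
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue  # skip multiword-token ranges and empty nodes
        yield cols


def count_no_space_after(conllu_text):
    """Return (tokens without a following space, all tokens)."""
    total = 0
    no_space = 0
    for cols in iter_words(conllu_text):
        total += 1
        # MISC holds '|'-separated attributes, or '_' when empty.
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return no_space, total


# Toy fragment (illustrative only): article "il-" written together with its noun.
rows = [
    ("1", "il-", "_", "DET", "_", "_", "2", "det", "_", "SpaceAfter=No"),
    ("2", "ktieb", "_", "NOUN", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    ("3", ".", "_", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
sample = "# text = il-ktieb.\n" + "\n".join("\t".join(r) for r in rows) + "\n\n"

no_space, total = count_no_space_after(sample)
print(f"{no_space} of {total} tokens are not followed by a space")
```

To run it on the real data, read one of the `.conllu` files into a string and pass it to `count_no_space_after`.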
Morphology
Tags
- This corpus uses 17 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X
- This corpus contains 4 word types tagged as particles (PART): la, le, m', ma
- This corpus contains 1 lemma tagged as a pronoun (PRON): _
- This corpus contains 1 lemma tagged as a determiner (DET): _
- Out of the above, 1 lemma occurred sometimes as PRON and sometimes as DET: _
- This corpus contains 1 lemma tagged as an auxiliary (AUX): _
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: _
- This corpus does not use the VerbForm feature.
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus uses 1 lemma as a copula (cop). Examples: _.
- This corpus uses 1 lemma as an auxiliary (aux). Examples: _.
- This corpus uses 1 lemma as a passive auxiliary (aux:pass). Examples: _.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (1068)
  - VERB--NOUN-ADP(_) (1)
  - VERB--PRON (347)
  - VERB--PRON-ADP(_) (1)
- obj
  - VERB--NOUN (1327)
  - VERB--NOUN-ADP(_) (18)
  - VERB--PRON (138)
  - VERB--PRON-ADP(_) (17)
- iobj
  - VERB--NOUN (29)
  - VERB--NOUN-ADP(_) (11)
  - VERB--PRON (12)
  - VERB--PRON-ADP(_) (9)
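Counts of this kind can be approximated directly from the CoNLL-U files. The sketch below is not the script behind the figures above. It assumes that the `-ADP(...)` suffix marks a child that itself has an ADP dependent attached as `case`, with the ADP's lemma in parentheses (always `_` here, since lemmas are not annotated). The function names and the toy sentence with placeholder forms are hypothetical.

```python
from collections import Counter


def read_sentences(conllu_text):
    """Yield each sentence as a list of regular word rows (10 columns each)."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
            sentence = []
            continue
        if line.startswith("#"):
            continue  # comment line
        cols = line.split("\t")
        if len(cols) == 10 and cols[0].isdigit():
            sentence.append(cols)
    if sentence:
        yield sentence


def count_verb_arguments(conllu_text, relations=("nsubj", "obj", "iobj")):
    """Count (relation, pattern) pairs for VERB parents with NOUN/PRON children."""
    counts = Counter()
    for sentence in read_sentences(conllu_text):
        upos_by_id = {cols[0]: cols[3] for cols in sentence}
        # Assumed reading of "-ADP": the child has an ADP dependent labelled case.
        adp_lemma_by_head = {
            cols[6]: cols[2]
            for cols in sentence
            if cols[3] == "ADP" and cols[7].split(":")[0] == "case"
        }
        for word_id, _, _, upos, _, _, head, deprel, _, _ in sentence:
            if deprel not in relations or upos not in ("NOUN", "PRON"):
                continue
            if upos_by_id.get(head) != "VERB":
                continue
            pattern = f"VERB--{upos}"
            if word_id in adp_lemma_by_head:
                pattern += f"-ADP({adp_lemma_by_head[word_id]})"
            counts[(deprel, pattern)] += 1
    return counts


# Toy sentence with placeholder forms A-D (illustrative only).
rows = [
    ("1", "A", "_", "PRON", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "B", "_", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "C", "_", "ADP", "_", "_", "4", "case", "_", "_"),
    ("4", "D", "_", "NOUN", "_", "_", "2", "obj", "_", "SpaceAfter=No"),
    ("5", ".", "_", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
sample = "\n".join("\t".join(r) for r in rows) + "\n\n"

counts = count_verb_arguments(sample)
for (relation, pattern), n in sorted(counts.items()):
    print(relation, pattern, n)
```

Because the relation labels are matched exactly, subtypes such as `nsubj:pass` are not folded into `nsubj`; pass a different `relations` tuple to count them.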