UD South Levantine Arabic MADAR
Language: South Levantine Arabic (code: ajp)
Family: Afro-Asiatic
This treebank has been part of Universal Dependencies since the UD v2.7 release.
The following people have contributed to making this treebank part of UD: Shorouq Zahra.
Repository: UD_South_Levantine_Arabic-MADAR
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15
License: CC BY-SA 4.0
Genre: spoken, social
Questions, comments? General annotation questions (either South Levantine Arabic-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [shorouqjzahra (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | annotated manually |
UPOS | annotated manually, natively in UD style |
XPOS | not available |
Features | not available |
Relations | annotated manually, natively in UD style |
Description
The South_Levantine_Arabic-MADAR treebank consists of 100 manually-annotated sentences taken from the MADAR (Multi-Arabic Dialect Applications and Resources) project.
TO-DO: Add 20 annotated sentences from CCC as a train set.
The treebank contains 100 manually annotated sentences in the South Levantine dialect primarily spoken in Amman. The sentences were taken from the “MADAR Parallel Corpus Dataset” (Bouamor et al., 2018), which consists of parallel texts translated into 25 dialects spoken in 25 different cities in the Arab World. The original texts were taken from the Basic Traveling Expression Corpus (BTEC) (described in Takezawa et al., 2007).
Sentences in the treebank can best be described as short conversational tourism-related texts.
The treebank was created as part of the “Language Technology: Research and Development” course at Uppsala University. You can view the report here: “Parsing Low-Resource Levantine Arabic: Annotation Projection versus Small-Sized Annotated Data”. The report describes two methods for parsing low-resource Levantine Arabic using the treebank provided in this repo (but split instead into three sets: train, dev, and test).
Acknowledgments
Big thanks to Houda Bouamor, Nizar Habash, and the MADAR project team for creating the multi-dialect parallel corpus and allowing me to use the Amman portion of it prior to official release.
References
- Bouamor, Houda, Nizar Habash, Mohammad Salameh, Wajdi Zaghouani, Owen Rambow, Dana Abdulrahim, Ossama Obeid, Salam Khalifa, Fadhl Eryani, Alexander Erdmann and Kemal Oflazer. The MADAR Arabic Dialect Corpus and Lexicon. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018.
- Takezawa, Toshiyuki, et al. “Multilingual spoken language corpus development for communication research.” International Journal of Computational Linguistics & Chinese Language Processing, Volume 12, Number 3, September 2007: Special Issue on Invited Papers from ISCSLP 2006. 2007.
Statistics of UD South Levantine Arabic MADAR
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB – X
Features
Relations
acl – advcl – advmod – advmod:emph – amod – aux – case – cc – ccomp – conj – dep – det – discourse – flat:foreign – iobj – mark – nmod – nmod:poss – nsubj – nummod – obj – obl – obl:arg – parataxis – punct – root – xcomp
Tokenization and Word Segmentation
- This corpus contains 100 sentences and 789 tokens.
- This corpus contains 228 tokens (29%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus does not contain words that contain both letters and punctuation.
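The counts above can be reproduced from a CoNLL-U file with a few lines of Python. The following is a minimal sketch, not the script that generated this page: it assumes one word per line with ten tab-separated columns, treats a blank line as a sentence boundary, ignores multiword-token ranges and empty nodes, and reads the `SpaceAfter=No` flag from the MISC column. The two-sentence sample uses invented placeholder forms, not data from the treebank.

```python
# Invented two-sentence sample (placeholder word forms, not treebank data).
ROWS = [
    ("1", "w1", "w1", "NOUN", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "w2", "w2", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    ("3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
    None,  # a blank line marks the sentence boundary
    ("1", "w3", "w3", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"),
    ("2", "?", "?", "PUNCT", "_", "_", "1", "punct", "_", "_"),
]
SAMPLE = "\n".join("\t".join(r) if r else "" for r in ROWS) + "\n"


def corpus_stats(conllu_text):
    """Return (sentences, tokens, tokens_not_followed_by_space)."""
    sentences = tokens = no_space = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # blank line closes the current sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        # Skip malformed lines, multiword ranges ("1-2"), empty nodes ("1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return sentences, tokens, no_space


sentences, tokens, no_space = corpus_stats(SAMPLE)
share = round(100 * no_space / tokens)  # percentage of tokens without a following space
```

The percentage reported above follows the same arithmetic: 228 out of 789 tokens rounds to 29%.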
Morphology
Tags
- This corpus uses 16 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X
- This corpus does not use the following tags: SYM
- This corpus contains 3 word types tagged as particles (PART): ان, ما, مش
- This corpus contains 11 lemmas tagged as pronouns (PRON): آنَا, أنا, أنَا, بيلدينج, حَسَب, عنا, عند, كَ, مين, هُو, هُوَ
- This corpus contains 11 lemmas tagged as determiners (DET): أي, أَي, أَيّ, اللي, شو, كَم, كَيفَ, كُل, مَا, هاد, هاي
- This corpus contains 7 lemmas tagged as auxiliaries (AUX): حَب, راح, عَم, قِدِر, كَان, لَازِم, مُمكِن
- Out of the above, 4 lemmas occurred sometimes as AUX and sometimes as VERB: حَب, راح, قِدِر, كَان
- This corpus does not use the VerbForm feature.
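Lemma inventories per tag, and the overlap between AUX and VERB, can be derived by grouping (lemma, UPOS) pairs. The sketch below is one possible way to do this and is not the tool used to produce these statistics; the function names are hypothetical, and the pairs are a small illustrative subset consistent with the lists above rather than the full corpus.

```python
from collections import defaultdict

# Illustrative (lemma, UPOS) pairs; a real run would read them from the
# LEMMA and UPOS columns of the CoNLL-U file.
PAIRS = [
    ("كَان", "AUX"),
    ("كَان", "VERB"),
    ("راح", "AUX"),
    ("راح", "VERB"),
    ("عَم", "AUX"),
    ("مش", "PART"),
]


def lemmas_by_upos(pairs):
    """Group the distinct lemmas observed under each UPOS tag."""
    table = defaultdict(set)
    for lemma, upos in pairs:
        table[upos].add(lemma)
    return table


def aux_verb_overlap(pairs):
    """Lemmas that occur at least once as AUX and at least once as VERB."""
    table = lemmas_by_upos(pairs)
    return sorted(table["AUX"] & table["VERB"])


TABLE = lemmas_by_upos(PAIRS)
OVERLAP = aux_verb_overlap(PAIRS)
```

Because the lemmas are compared as exact strings, differently vocalized spellings of the same word count as separate lemmas.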
Nominal Features
Degree and Polarity
Verbal Features
Pronouns, Determiners, Quantifiers
Other Features
Syntax
Auxiliary Verbs and Copula
- This corpus does not contain copulas.
- This corpus uses 7 lemmas as auxiliaries (aux). Examples: راح, لَازِم, مُمكِن, قِدِر, حَب, عَم, كَان.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (11)
  - VERB--PRON (11)
  - VERB--PRON-ADP(عند) (1)
- obj
  - VERB--NOUN (33)
  - VERB--NOUN-ADP(بِ) (1)
  - VERB--NOUN-ADP(عَلَى) (1)
  - VERB--NOUN-ADP(لِ) (1)
  - VERB--PRON (14)
- iobj
  - VERB--NOUN (2)
  - VERB--PRON (1)
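A pattern table like the one above can be approximated by visiting every noun or pronoun attached to a verb as `nsubj`, `obj` or `iobj` and recording any adposition that depends on it. This is a hedged sketch under assumptions: the dictionary keys, the helper `make_token`, and the rule that the ADP suffix comes from an ADP child with relation `case` are illustrative guesses, not a description of how the statistics on this page were actually computed. The sample sentence is invented.

```python
from collections import Counter

CORE_RELATIONS = {"nsubj", "obj", "iobj"}
NOMINAL_TAGS = {"NOUN", "PRON"}


def make_token(idx, lemma, upos, head, deprel):
    """Hypothetical helper: one token as a plain dictionary."""
    return {"id": idx, "lemma": lemma, "upos": upos, "head": head, "deprel": deprel}


def core_argument_patterns(sentence):
    """Count (relation, pattern) pairs for verb parents with nominal children."""
    by_id = {t["id"]: t for t in sentence}
    counts = Counter()
    for child in sentence:
        relation = child["deprel"].split(":")[0]  # drop any subtype
        if relation not in CORE_RELATIONS or child["upos"] not in NOMINAL_TAGS:
            continue
        parent = by_id.get(child["head"])
        if parent is None or parent["upos"] != "VERB":
            continue
        pattern = "VERB--" + child["upos"]
        # Assumed rule: adpositions attached to the child with relation "case".
        adpositions = [
            t["lemma"]
            for t in sentence
            if t["head"] == child["id"] and t["upos"] == "ADP" and t["deprel"] == "case"
        ]
        if adpositions:
            pattern += "-ADP(" + ",".join(adpositions) + ")"
        counts[(relation, pattern)] += 1
    return counts


# Invented sentence: pronoun subject, verb, and an object marked by an adposition.
SENTENCE = [
    make_token(1, "w1", "PRON", 2, "nsubj"),
    make_token(2, "w2", "VERB", 0, "root"),
    make_token(3, "p", "ADP", 4, "case"),
    make_token(4, "w4", "NOUN", 2, "obj"),
]
PATTERNS = core_argument_patterns(SENTENCE)
```

Summing the resulting counters over all sentences would give per-relation totals in the same shape as the list above.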
Relations Overview
- This corpus uses 4 relation subtypes: advmod:emph, flat:foreign, nmod:poss, obl:arg
- The following main type is never used alone; it is always subtyped: flat
- The following 13 relation types are not used in this corpus at all: csubj, vocative, expl, dislocated, cop, appos, clf, fixed, compound, list, orphan, goeswith, reparandum
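The three figures above follow from comparing the relations listed for this corpus with the 37 universal relation types of UD v2. The sketch below shows that set arithmetic; the function name `relation_overview` is hypothetical, and the `USED` list is copied from the Relations section of this page.

```python
# The 37 universal dependency relations of UD v2.
UNIVERSAL_RELATIONS = (
    "acl advcl advmod amod appos aux case cc ccomp clf compound conj cop "
    "csubj dep det discourse dislocated expl fixed flat goeswith iobj list "
    "mark nmod nsubj nummod obj obl orphan parataxis punct reparandum root "
    "vocative xcomp"
).split()

# Relations listed for this corpus in the Relations section.
USED = (
    "acl advcl advmod advmod:emph amod aux case cc ccomp conj dep det "
    "discourse flat:foreign iobj mark nmod nmod:poss nsubj nummod obj obl "
    "obl:arg parataxis punct root xcomp"
).split()


def relation_overview(used):
    """Return (subtypes, main types only seen subtyped, unused main types)."""
    subtypes = sorted(r for r in used if ":" in r)
    bare = {r for r in used if ":" not in r}
    subtyped_main = {r.split(":")[0] for r in subtypes}
    always_subtyped = sorted(subtyped_main - bare)
    unused = sorted(set(UNIVERSAL_RELATIONS) - (bare | subtyped_main))
    return subtypes, always_subtyped, unused


SUBTYPES, ALWAYS_SUBTYPED, UNUSED = relation_overview(USED)
```

With the list from this page, the result has 4 subtypes, `flat` as the only main type that never appears bare, and 13 unused main types.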