UD Bambara CRB
Language: Bambara (code: bm)
Family: Mande
This treebank has been part of Universal Dependencies since the UD v2.3 release.
The following people have contributed to making this treebank part of UD: Katya Aplonova, Francis Tyers.
Repository: UD_Bambara-CRB
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.14
License: CC BY-SA 4.0
Genre: nonfiction, news
Questions, comments? General annotation questions (either Bambara-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on GitHub. If you want to collaborate, please contact [aplooon (æt) gmail • com]. Development of the treebank happens directly in the UD repository, so you may submit bug fixes as pull requests against the dev branch.
Annotation | Source |
---|---|
Lemmas | annotated manually in non-UD style, automatically converted to UD |
UPOS | annotated manually in non-UD style, automatically converted to UD |
XPOS | annotated manually |
Features | annotated manually in non-UD style, automatically converted to UD |
Relations | annotated manually in non-UD style, automatically converted to UD |
Description
The UD Bambara treebank is a section of the Corpus Référence du Bambara annotated natively with Universal Dependencies.
Bambara (also known as Bamana) is the most widely spoken language of the Manding language group (Niger-Congo > Mande > Western Mande). It is spoken mainly in Mali by 13-14 million people, of whom around four million are L1 speakers. Development of the Bambara Reference Corpus started in April 2012 (Vydrin 2013, Maslinsky 2014). The corpus includes a non-disambiguated sub-corpus and a disambiguated one. At present, the whole corpus contains about nine million tokens. The corpus was annotated using the UD Annotatrix annotation tool (Tyers, Sheyanova, Washington 2018).
Documentation for the treebank is available on the UD web site.
Acknowledgments
The conversion and annotation were done by Katya Aplonova and Francis M. Tyers at the Higher School of Economics in Moscow. We would like to thank the developers and annotators of the Corpus Référence du Bambara for permission to base this treebank on their work.
Citation
If you use this corpus in your research, please cite:
@inproceedings{aplonova_2018,
author = {Aplonova, K. and Tyers, F. M.},
title = {Towards a dependency treebank for Bambara},
booktitle = {Proceedings of the 16th Conference on Treebanks and Linguistic Theories},
pages = {138--146},
year = 2018
}
References
- Maslinsky, K. (2014). Daba: a model and tools for Manding corpora. In Proceedings of TALAf 2014: Traitement Automatique des Langues Africaines, pages 114–122.
- Tyers, F. M., Sheyanova, M., and Washington, J. N. (2018). UD Annotatrix: An annotation tool for Universal Dependencies. In Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories.
- Vydrin, V. (2013). Bamana reference corpus (BRC). Procedia - Social and Behavioral Sciences, 95, pages 75–80.
Statistics of UD Bambara CRB
POS Tags
ADJ – ADP – ADV – AUX – CCONJ – DET – INTJ – NOUN – NUM – PART – PRON – PROPN – PUNCT – SCONJ – VERB – X
Features
Aspect – Definite – Mood – Number – NumType – Person – Polarity – PronType – Tense – Valency – VerbForm – Voice
Relations
acl – advcl – advmod – amod – appos – aux – case – cc – ccomp – compound – compound:redup – conj – dep – det – det:rel – discourse – dislocated – fixed – flat – mark – nmod – nmod:poss – nsubj – nummod – obj – obl – orphan – parataxis – parataxis:obj – punct – reparandum – root – vocative – xcomp
Tokenization and Word Segmentation
- This corpus contains 1026 sentences and 13823 tokens.
- This corpus contains 1843 tokens (13%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus contains 6 types of words that contain both letters and punctuation. Examples: k', n', y', b', kelen-kelen, t'
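Counts like the ones above can be recomputed from the treebank's CoNLL-U files. The following is a minimal sketch, assuming the standard 10-column CoNLL-U layout, with `SpaceAfter=No` stored in the MISC column. The function name `tokenization_stats` and the `SAMPLE` fragment are illustrative only; the fragment is a toy example built from word forms listed on this page and is not a sentence from the treebank. The sketch counts syntactic words, which may differ from the token count in treebanks that contain multiword tokens.

```python
# Toy CoNLL-U fragment for illustration; NOT taken from the treebank.
SAMPLE = (
    "# text = a nana.\n"
    "1\ta\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnana\t_\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)


def tokenization_stats(conllu_text):
    """Return (sentences, words, words not followed by a space)."""
    sentences = 0
    words = 0
    no_space = 0
    in_sentence = False
    for line in conllu_text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if line.startswith("#"):
            # Sentence-level comment such as "# text = ...".
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            continue
        if "-" in cols[0] or "." in cols[0]:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        words += 1
        if "SpaceAfter=No" in cols[9].split("|"):
            no_space += 1
    return sentences, words, no_space


if __name__ == "__main__":
    n_sent, n_words, n_no_space = tokenization_stats(SAMPLE)
    share = 100 * n_no_space / n_words
    print(f"{n_sent} sentences, {n_words} words, {n_no_space} without a following space ({share:.0f}%)")
```

To apply this to the real data, read a `.conllu` file from the UD_Bambara-CRB repository into a string and pass it to `tokenization_stats`.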
Morphology
Tags
- This corpus uses 16 UPOS tags out of 17 possible: ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X
- This corpus does not use the following tags: SYM
- This corpus contains 20 word types tagged as particles (PART): bada, bani, bilen, de, dennin, diye, dun, dè, dɛ, fana, hali, k', ko, koyi, kòni, le, sa, wa, wo, yo
- This corpus contains 36 lemmas tagged as pronouns (PRON): _, bɛ́ɛ, dì, dɔ̀w, dɔ́, dɔ́w, jɔ̀n, jɔ́n, jɔ́nì, minw, mun, mín, mîn, mùn, mùnna, né, nê, nìn, sí, à, àle, àlê, á, án, áw, é, ê, í, ò, ò.lú, òlû, ó, ù, ń, ɔn, ɲɔ́gɔn
- This corpus contains 19 lemmas tagged as determiners (DET): _, bɛ́, bɛ́ɛ, dòn, dɔw, dɔ́, dɔ́rɔn, jùmɛn, minw, mín, mîn, ninw, nìn, sí, wɛ́rɛ, yɛ̀rɛ, yɛ̀rɛ̂, ìn, ò
- Out of the above, 9 lemmas occurred sometimes as PRON and sometimes as DET: _, bɛ́ɛ, dɔ́, minw, mín, mîn, nìn, sí, ò
- This corpus contains 12 lemmas tagged as auxiliaries (AUX): bɛ, bɛ́na, ka, kàna, ma, mán, mána, na, tùn, tɛ, tɛ́na, ye
- Out of the above, 1 lemma occurred sometimes as AUX and sometimes as VERB: tɛ
- There are 2 (de)verbal forms:
  - Part
    - ADJ: jalenba, jelenba, sigilen
    - VERB: nalen, sigilen, kotò, selen, jèlen, bannen, bintò, bòlen, bònnen, dalen
  - Vnoun
    - NOUN: kanliba, falennò, tobili, nyininkali, FURULI, foli, furakèli, nyinini
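The per-tag lemma lists and the overlaps between them (for example, lemmas that occur both as PRON and as DET) can be derived by grouping lemmas by UPOS. The sketch below assumes standard CoNLL-U input; `lemmas_by_upos`, `ROWS`, and `SAMPLE` are hypothetical names, and the sample rows are toy single-token "sentences" that reuse lemmas listed above rather than real treebank content.

```python
from collections import defaultdict

# Toy (form, lemma, UPOS) rows for illustration; NOT taken from the treebank.
ROWS = [
    ("nin", "nìn", "PRON"),
    ("nin", "nìn", "DET"),
    ("a", "à", "PRON"),
]
# Each row becomes its own one-token sentence in CoNLL-U format.
SAMPLE = "".join(
    f"1\t{form}\t{lemma}\t{upos}\t_\t_\t0\troot\t_\t_\n\n"
    for form, lemma, upos in ROWS
)


def lemmas_by_upos(conllu_text):
    """Map each UPOS tag to the set of lemmas that carry it."""
    lemmas = defaultdict(set)
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        # Keep only regular word lines: 10 columns and a plain integer ID.
        # This also skips comments, blank lines, ranges, and empty nodes.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        lemmas[cols[3]].add(cols[2])
    return lemmas


if __name__ == "__main__":
    by_tag = lemmas_by_upos(SAMPLE)
    both = by_tag["PRON"] & by_tag["DET"]
    print(f"{len(by_tag['PRON'])} PRON lemmas, {len(by_tag['DET'])} DET lemmas")
    print("Lemmas seen as both PRON and DET:", ", ".join(sorted(both)))
```

The same set intersection works for any pair of tags, such as AUX and VERB.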
Nominal Features
- Number
  - Plur
    - DET: ninw, dòw, minw
    - NOUN: kòròkèw, denw, misiw, sagaw, dunanw, gòòtèw, julaw, nyèdenw, sosow, surukuw
    - PRON: u, olu, an, aw, dòw, dow, minw, a
    - PROPN: warabaw
  - Sing
    - PRON: a, n, ne, e, i, ale, ele, à
- Definite
  - Def
    - DET: nin, ninw, in
    - PRON: nin
Degree and Polarity
- Polarity
  - Neg
    - AUX: tè, ma, kana, te, tɛ, man, tèna
    - VERB: tè, tɛ
  - Pos
    - AUX: ye, bè, ka, be, bɛ, mana, bèna, y', b', na
    - VERB: tagara, bè, ye, nana, bòra, sera, kèra, tora, cira, banna
Verbal Features
- Aspect
  - Imp
    - AUX: bè, tè, be, bɛ, te, tɛ, b'
    - VERB: be, bè, tè, tɛ
  - Perf
    - ADJ-Part: jalenba, jelenba, sigilen
    - AUX: ye, ma, y'
    - VERB: tagara, nana, bòra, sera, kèra, tora, cira, banna, bolila, donna
    - VERB-Part: nalen, sigilen, selen, jèlen, bannen, bòlen, bònnen, dalen, dibilen, dilen
  - Prog
    - VERB-Part: kotò, bintò, natò
- Mood
  - Cnd
    - AUX: mana
  - Imp
    - AUX: kana, ye
  - Sub
    - AUX: ka
- Tense
  - Fut
    - AUX: bèna, na, n', tèna
  - Past
    - AUX: tun
- Voice
  - Cau
    - VERB: labò, lajigin, lajè, dalajè, laminè, latila
Pronouns, Determiners, Quantifiers
- PronType
  - Dem
    - DET: nin, ninw
    - PRON: nin
  - Emp
    - PRON: ne, e, ale, aw, ele
  - Int
    - ADV: min
  - Prs
    - PRON: a, n, i, u, olu, an, à
  - Rcp
    - PRON: nyògòn, nyògon
  - Rel
    - DET: min, minw
    - PRON: min, minw, mun
- NumType
  - Card
    - NUM: 6
  - Ord
    - ADJ: SABANAN, filanan, tannan
    - NUM: NAN
- Person
  - 1
    - PRON: n, ne, an
  - 2
    - PRON: e, i, aw, a
  - 3
    - PRON: a, u, ale, ele, e, à
Other Features
- Valency
  - 1
    - VERB: tagara, nana, bòra, sera, kèra, tora, cira, banna, bolila, donna
  - 2
    - AUX: ye, y'
- 1
Syntax
Auxiliary Verbs and Copula
- This corpus does not contain copulas.
- This corpus uses 12 lemmas as auxiliaries (aux). Examples: ka, ye, bɛ, tɛ, ma, tùn, kàna, mána, bɛ́na, na, mán, tɛ́na.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
  - VERB--NOUN (537)
  - VERB--PRON (1211)
  - VERB-Part--NOUN (15)
  - VERB-Part--PRON (15)
- obj
  - VERB--NOUN (517)
  - VERB--PRON (501)
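Parent/child counts of this kind can be approximated by resolving each word's HEAD within its sentence and tallying (relation, parent tag, child tag) triples. The sketch below is one way to do that under stated assumptions: it treats a parent whose FEATS contain `VerbForm=Part` as `VERB-Part`, which is a guess at how the labels above were produced, and all function names and the `SAMPLE` data are hypothetical. The sample is a toy fragment built from word forms on this page, not treebank content.

```python
from collections import Counter

# Toy CoNLL-U fragment for illustration; NOT taken from the treebank.
SAMPLE = (
    "1\ta\t_\tPRON\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tnana\t_\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t_\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
    "1\tdenw\t_\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tnalen\t_\tVERB\t_\tAspect=Perf|VerbForm=Part\t0\troot\t_\t_\n"
    "\n"
)


def read_sentences(conllu_text):
    """Yield each sentence as a list of 10-column word rows."""
    sentence = []
    for line in conllu_text.splitlines():
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # Keep regular words only (skips comments, ranges, empty nodes).
        if len(cols) == 10 and cols[0].isdigit():
            sentence.append(cols)
    if sentence:
        yield sentence


def parent_label(cols):
    """UPOS of the parent, suffixed with -Part for participles (assumption)."""
    feats = cols[5].split("|")
    return cols[3] + "-Part" if "VerbForm=Part" in feats else cols[3]


def count_verb_arguments(conllu_text, relations=("nsubj", "obj")):
    """Count (relation, parent label, child UPOS) for verb parents."""
    counts = Counter()
    for sentence in read_sentences(conllu_text):
        by_id = {cols[0]: cols for cols in sentence}
        for cols in sentence:
            head = by_id.get(cols[6])  # None for the root (HEAD = 0)
            if head is None:
                continue
            if (cols[7] in relations and head[3] == "VERB"
                    and cols[3] in ("NOUN", "PRON")):
                counts[(cols[7], parent_label(head), cols[3])] += 1
    return counts


if __name__ == "__main__":
    for (rel, parent, child), n in sorted(count_verb_arguments(SAMPLE).items()):
        print(f"{rel}: {parent}--{child} ({n})")
```

Passing other relation names in `relations` (for example `obl`) extends the tally to oblique arguments.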