UD Kadiweu Unicamp
Language: Kadiweu (code: kbc)
Family: Guaicuruan
This treebank has been part of Universal Dependencies since the UD v2.18 release.
The following people have contributed to making this treebank part of UD: Filomena Spatti Sandalo, Leonel Figueiredo de Alencar, Charlotte Chambelland Galves, Luiz Veronesi, Daniel Zeman.
Repository: UD_Kadiweu-Unicamp
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.18
License: CC BY-NC-SA 4.0
Genre: grammar-examples
Questions, comments? General annotation questions (either Kadiweu-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [sandalo (æt) unicamp • br, leonel • de • alencar (æt) ufc • br]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.
| Annotation | Source |
|---|---|
| Lemmas | annotated manually |
| UPOS | annotated manually, natively in UD style |
| XPOS | annotated manually |
| Features | annotated manually, natively in UD style |
| Relations | annotated manually, natively in UD style |
Description
UD_Kadiweu-UNICAMP is a treebank for Kadiwéu (ISO-639: kbc), an endangered Indigenous language of Brazil. It consists of isolated sentences produced by native speakers.
Kadiwéu is a polysynthetic language spoken in the state of Mato Grosso do Sul, Brazil. It is severely endangered: among approximately 1,500 Kadiwéu people, fewer than 300 speak the language, as many have shifted to Portuguese (Pires 2022). Kadiwéu is the only representative of the Waikurúan (also written Guaicuruan) linguistic family in Brazil. This family includes four additional languages: Toba, Pilagá, and Mocoví, mostly spoken in Argentina, and Abipón, formerly spoken in Argentina but now extinct (Sandalo 1995).
UD_Kadiweu-UNICAMP is the first treebank for a Waikurúan language in the UD collection, contributing to the documentation and computational modeling of a poorly documented and under-resourced language family. It is an ongoing project, currently consisting of isolated sentences produced by native speakers, most of which are translations of Portuguese sentences. Future versions will also include narratives and other genres.
Acknowledgments
The construction of this treebank has been funded by the São Paulo Research Foundation (FAPESP) through the DACILAT project (grant No. 22/09158-5). It is part of the postdoctoral research of Leonel Figueiredo de Alencar at the Department of Linguistics of the State University of Campinas (UNICAMP), under the supervision of Filomena Spatti Sandalo, coordinator of the DACILAT project, and in collaboration with Charlotte Chambelland Galves.
We are much indebted to the speakers of Kadiwéu for sharing their knowledge of their language and for providing translations and acceptability judgements on constructed sentences.
References
- Alencar, L. F. de. (2021). Uma gramática computacional de um fragmento do nheengatu / A computational grammar for a fragment of Nheengatu. Revista de Estudos da Linguagem, 29(3), 1717–1777. https://doi.org/10.17851/2237-2083.29.3.1717-1777
- Galves, C., Sandalo, F., Sena, T. A. de, & Veronesi, L. (2017). Annotating a polysynthetic language: From Portuguese to Kadiwéu. Cadernos de Estudos Linguísticos, 59(3), 631–648. https://doi.org/10.20396/cel.v59i3.8651003
- Pires, V. (2022). Palavras kadiwéu do mundo ancestral e do mundo novo: palavras novas, palavras antigas, palavras humildes e palavras honorificadas (Master’s thesis). Universidade Estadual de Campinas. https://hdl.handle.net/20.500.12733/4592
- Sandalo, F. (1995). A grammar of Kadiwéu (PhD dissertation). University of Pittsburgh.
- Sandalo, F., & Galves, C. (2023). Anotando sintaticamente uma língua originária do Brasil: O problema de Anchieta. Cadernos de Estudos Linguísticos, 65, e023007. https://doi.org/10.20396/cel.v65i00.8673592
- Sandalo, F., Pires, V., Galves, C., Silva, H., Francisco, O., & Silva, S. (2024a). Corpus Kadiwéu. In L. Veronesi & C. Galves (Eds.), The Tycho Brahe Platform. https://www.tycho.iel.unicamp.br/
- Sandalo, F., Pires, V., Galves, C., Silva, H., Francisco, O., & Silva, S. (2024b). Corpus Kadiwéu – gramática pedagógica. In L. Veronesi & C. Galves (Eds.), The Tycho Brahe Platform. https://www.tycho.iel.unicamp.br/
Statistics of UD Kadiweu Unicamp
POS Tags
ADJ – ADV – AUX – DET – NOUN – PART – PRON – PROPN – PUNCT – SCONJ – VERB
Features
AdvType – Aspect – Degree – Deixis – Gender – Gender[obj] – Mood – Number – Number[obj] – Number[psor] – Person – Person[erg] – Person[obj] – Person[psor] – Polarity – Poss – PronType – VerbForm – Voice
Relations
acl:relcl – advcl – advmod – aux – det – dislocated – mark – nmod:poss – nsubj – obj – punct – root
Tokenization and Word Segmentation
- This corpus contains 71 sentences, 306 tokens and 318 syntactic words.
- This corpus contains 137 tokens (45%) that are not followed by a space.
- This corpus does not contain words with spaces.
- This corpus does not contain words that contain both letters and punctuation.
- This corpus contains 12 multi-word tokens. On average, one multi-word token consists of 2.00 syntactic words.
- There are 10 types of multi-word tokens. Examples: adakake, madi, aGipegetege, aGipegitegi, aidei, alidi, aneiwaGadi, anenapioi, meninitibeci, mijo.
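The counts above can be reproduced from a CoNLL-U file with a few lines of Python. The following is a minimal sketch, not the script that generated this page: the function name `tokenization_stats` and the toy sentence are illustrative assumptions, and the toy word forms are invented placeholders rather than Kadiweu data. The last lines check that the reported figures are mutually consistent.

```python
def tokenization_stats(conllu_text):
    """Count sentences, surface tokens, syntactic words and multi-word tokens."""
    stats = {"sentences": 0, "tokens": 0, "words": 0, "mwts": 0, "no_space_after": 0}
    in_sentence = False
    covered_until = 0  # last word ID covered by the current multi-word token
    for line in conllu_text.splitlines():
        if not line.strip():
            in_sentence = False  # a blank line ends the sentence
            continue
        if line.startswith("#"):
            continue  # sentence-level comment
        cols = line.split("\t")
        if not in_sentence:
            in_sentence = True
            covered_until = 0
            stats["sentences"] += 1
        tok_id, misc = cols[0], cols[9].split("|")
        if "." in tok_id:
            continue  # empty node, not a token or a word
        if "-" in tok_id:
            # Range line such as "1-2": one surface token spanning several words
            covered_until = int(tok_id.split("-")[1])
            stats["mwts"] += 1
            is_token = True
        else:
            stats["words"] += 1
            # A word is its own token only if no multi-word token covers it
            is_token = int(tok_id) > covered_until
        if is_token:
            stats["tokens"] += 1
            stats["no_space_after"] += "SpaceAfter=No" in misc
    return stats


# Toy sentence with invented placeholder forms (NOT Kadiweu data)
TOY_ROWS = [
    ["1-2", "ab", "_", "_", "_", "_", "_", "_", "_", "_"],
    ["1", "a", "a", "PRON", "_", "_", "3", "nsubj", "_", "_"],
    ["2", "b", "b", "AUX", "_", "_", "3", "aux", "_", "_"],
    ["3", "c", "c", "VERB", "_", "_", "0", "root", "_", "SpaceAfter=No"],
    ["4", ".", ".", "PUNCT", "_", "_", "3", "punct", "_", "_"],
]
TOY_CONLLU = "# sent_id = toy-1\n" + "\n".join("\t".join(r) for r in TOY_ROWS) + "\n\n"
toy_stats = tokenization_stats(TOY_CONLLU)

# Consistency check of the figures reported above
reported = {"tokens": 306, "words": 318, "mwts": 12, "no_space_after": 137}
words_from_tokens = reported["tokens"] + reported["mwts"] * (2 - 1)  # 318
share_no_space = round(100 * reported["no_space_after"] / reported["tokens"])  # 45
```

Each multi-word token of two syntactic words adds one word beyond the token count, which is why 306 tokens and 12 multi-word tokens give 318 words.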
Morphology
Tags
- This corpus uses 11 UPOS tags out of 17 possible: ADJ, ADV, AUX, DET, NOUN, PART, PRON, PROPN, PUNCT, SCONJ, VERB
- This corpus does not use the following tags: NUM, ADP, CCONJ, INTJ, SYM, X
- This corpus contains 1 word type tagged as a particle (PART): aG
- This corpus contains 10 lemmas tagged as pronouns (PRON): ane, da, di, ee, napioi, ni, niGida, niGidi, niGijo, niGina
- This corpus contains 8 lemmas tagged as determiners (DET): adi, eliodi, ica, ijo, niGida, niGijo, niGina, niGini
- Out of the above, 3 lemmas occurred sometimes as PRON and sometimes as DET: niGida, niGijo, niGina
- This corpus contains 1 lemma tagged as an auxiliary (AUX): jaG
- There is 1 (de)verbal form:
  - Fin
    - VERB: iwaGadi, idei, ipegitegi, ninitibeci, ipegetege, DapicoGo, Ninitibigiwaji, Te, dowediteloco, ipegitaGagi
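The tag inventory figures in this list are simple set operations. As a small sketch (the variable names are illustrative; the 17-tag inventory is the standard UD UPOS set), the unused tags and the lemmas shared between PRON and DET can be derived as follows:

```python
# The 17 universal POS tags defined by UD
UNIVERSAL_UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}

# Tags and lemmas as listed in this section
used_upos = {"ADJ", "ADV", "AUX", "DET", "NOUN", "PART", "PRON", "PROPN",
             "PUNCT", "SCONJ", "VERB"}
pron_lemmas = {"ane", "da", "di", "ee", "napioi", "ni",
               "niGida", "niGidi", "niGijo", "niGina"}
det_lemmas = {"adi", "eliodi", "ica", "ijo",
              "niGida", "niGijo", "niGina", "niGini"}

unused_upos = sorted(UNIVERSAL_UPOS - used_upos)
pron_and_det = sorted(pron_lemmas & det_lemmas)

print(unused_upos)   # ['ADP', 'CCONJ', 'INTJ', 'NUM', 'SYM', 'X']
print(pron_and_det)  # ['niGida', 'niGijo', 'niGina']
```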
Nominal Features
- Gender
  - Fem
    - DET: ajo, NaGani, adi, NaGajo, naGana, naGani
    - NOUN: niwatece, wetiGa, Etogo, Iwalo, libiniena, liwatece, lomigo, nigotaGa, GanigotGa, Iwalepodi
    - PRON: naGada, naGadi, Ada, Adi, Ani, NaGajo, NaGana, naGajo
    - VERB: etadi
  - Fem,Masc
    - NOUN: lodawa, Gadodawa
  - Masc
    - DET: ica, NiGida, ijo, NiGijo
    - NOUN: iGeladi, libinienigi, liGeladi, looligi, naigi, weiigi, niganaGacanajo, nioladi, LotaGa, eyodi
    - PRON: ee
    - PROPN: João
- Number
  - Plur
    - DET: NiGinoa
    - NOUN: libinienaGa, libinienigipi, Iwalepodi, lionigipi, loigipodi, naodigijedi, wetiadi
    - VERB-Fin: Ninitibigiwaji
  - Sing
    - DET: ica, ajo, NaGani, NiGida, adi, ijo, NaGajo, NiGijo, naGana, naGani
    - NOUN: iGeladi, libinienigi, liGeladi, niwatece, looligi, naigi, weiigi, wetiGa, niganaGacanajo, nioladi
    - PRON: naGada, naGadi, Ada, Adi, Ani, NaGajo, NaGana, ee
Degree and Polarity
- Degree
  - Dim
    - NOUN: libinienigi, libiniena, libinienaGa, libinienigipi, niganigawanigi
- Polarity
  - Neg
    - PART: aG
Verbal Features
- Aspect
  - Perf
    - AUX: ja
- Mood
  - Ind
    - VERB-Fin: iwaGadi, idei, ipegitegi, ninitibeci, ipegetege, DapicoGo, Ninitibigiwaji, Te, dowediteloco, ipegitaGagi
- Voice
  - Appl
    - VERB: ipegitegi, eniteloco, ipegetege, dowediteloco, ipegitaGagi, ipegitege
    - VERB-Fin: ipegitegi, ipegetege, dowediteloco, ipegitaGagi, ipegitege
  - Inv
    - VERB: dapiko
Pronouns, Determiners, Quantifiers
- PronType
  - Dem
    - ADV: digoida
    - DET: ica, ajo, NaGani, NiGida, adi, ijo, NaGajo, NiGijo, NiGinoa, naGana
    - PRON: naGada, naGadi, Ada, Adi, Ani, NaGajo, NaGana, naGajo
  - Ind
    - ADV: eliodi
    - DET: eliodi
  - Prs
    - PRON: ee
  - Rel
    - PRON: ane
- Poss
  - Yes
    - PRON: naGajo
- Person
  - 1
    - PRON: ee
  - 3
    - VERB: iwaGadi, ninitibeci, etadi, DapicoGo, Ninitibigiwaji
    - VERB-Fin: iwaGadi, ninitibeci, DapicoGo, Ninitibigiwaji
- Number[psor]
  - Sing
    - NOUN: iGeladi, eyodi, iGonagi
Other Features
- AdvType
  - Loc
    - ADV: digoida
- Deixis
  - Remt
    - ADV: digoida
- Gender[obj]
  - Fem
    - VERB-Fin: ipegetege
  - Masc
    - VERB-Fin: ipegitegi
- Number[obj]
  - Sing
    - VERB-Fin: ipegitaGagi
- Person[erg]
  - 3
    - VERB-Fin: idei, ipegitegi, ipegetege, dowediteloco, ipegitaGagi, ipegitege
- Person[obj]
  - 2
    - VERB-Fin: ipegitaGagi
  - 3
    - VERB-Fin: ipegitegi, ipegetege, ipegitege
- Person[psor]
  - 1
    - NOUN: iGeladi, eyodi, iGonagi
  - 2
    - NOUN: Gadodawa, GanigotGa, Ganioxoa, Ganixoa, Gawenigi
  - 3
    - NOUN: libinienigi, liGeladi, looligi, lidi, nioladi, LotaGa, libiniena, libinienaGa, libinienigipi, lidGegi
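Several of the features listed in this section are layered, that is, written with a bracketed layer such as `[obj]`, `[erg]` or `[psor]`. The sketch below shows one way to read a CoNLL-U FEATS string and separate the base feature from its layer. The helper names `parse_feats` and `split_layer` are illustrative, and the example string is a hypothetical combination of values from the lists above, not a line copied from the treebank.

```python
def parse_feats(feats):
    """Turn a CoNLL-U FEATS string into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def split_layer(name):
    """Split 'Person[psor]' into ('Person', 'psor'); plain names get layer None."""
    if name.endswith("]") and "[" in name:
        base, layer = name[:-1].split("[", 1)
        return base, layer
    return name, None


# Hypothetical FEATS value for a possessed noun (illustration only)
example = parse_feats("Gender=Masc|Number=Sing|Person[psor]=1")
layers = {name: split_layer(name) for name in example}

print(example["Person[psor]"])   # 1
print(layers["Person[psor]"])    # ('Person', 'psor')
print(layers["Gender"])          # ('Gender', None)
```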
Syntax
Auxiliary Verbs and Copula
- This corpus does not contain copulas.
- This corpus uses 1 lemma as an auxiliary (aux). Example: jaG.
Core Arguments, Oblique Arguments and Adjuncts
Here we consider only relations between verbs (parent) and nouns or pronouns (child).
- nsubj
- VERB--NOUN (6)
- VERB-Fin--NOUN (26)
- VERB-Fin--PRON (11)
- obj
- VERB--NOUN (3)
- VERB-Fin--NOUN (15)
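Counts of this kind can be gathered by pairing each noun or pronoun with its head verb. The following is a rough sketch under two assumptions: that a label such as `VERB-Fin` stands for the UPOS tag plus the VerbForm value, and that each word is given as a tuple `(id, upos, feats, head, deprel)`. The function names and the toy sentence are hypothetical and are not taken from the treebank tooling or data.

```python
from collections import Counter


def label(upos, feats):
    """Return the UPOS tag, extended with the VerbForm value if one is present."""
    for pair in feats.split("|"):
        if pair.startswith("VerbForm="):
            return upos + "-" + pair.split("=", 1)[1]
    return upos


def core_argument_counts(rows):
    """Count (deprel, parent--child) pairs with a VERB parent and a NOUN/PRON child."""
    by_id = {row[0]: row for row in rows}
    counts = Counter()
    for _, upos, feats, head, deprel in rows:
        parent = by_id.get(head)  # None for the root, whose head is 0
        if parent is None or parent[1] != "VERB" or upos not in ("NOUN", "PRON"):
            continue
        pair = label(parent[1], parent[2]) + "--" + label(upos, feats)
        counts[(deprel, pair)] += 1
    return counts


# Toy sentence: noun subject, finite verb, noun object (invented, not Kadiweu data)
toy_rows = [
    (1, "NOUN", "_", 2, "nsubj"),
    (2, "VERB", "VerbForm=Fin", 0, "root"),
    (3, "NOUN", "_", 2, "obj"),
]
toy_counts = core_argument_counts(toy_rows)

print(toy_counts[("nsubj", "VERB-Fin--NOUN")])  # 1
print(toy_counts[("obj", "VERB-Fin--NOUN")])    # 1
```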
Relations Overview
- This corpus uses 2 relation subtypes: acl:relcl, nmod:poss
- The following 2 main types are never used alone; they are always subtyped: acl, nmod
- The following 25 relation types are not used in this corpus at all: iobj, csubj, ccomp, xcomp, obl, vocative, expl, discourse, cop, appos, nummod, amod, clf, case, conj, cc, fixed, flat, compound, list, parataxis, orphan, goeswith, reparandum, dep
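The three statements above follow from comparing the labels used in this corpus with the 37 universal relations of UD. A minimal sketch of that comparison, with illustrative variable names:

```python
# The 37 universal dependency relations defined by UD
UNIVERSAL_RELATIONS = {
    "acl", "advcl", "advmod", "amod", "appos", "aux", "case", "cc", "ccomp",
    "clf", "compound", "conj", "cop", "csubj", "dep", "det", "discourse",
    "dislocated", "expl", "fixed", "flat", "goeswith", "iobj", "list", "mark",
    "nmod", "nsubj", "nummod", "obj", "obl", "orphan", "parataxis", "punct",
    "reparandum", "root", "vocative", "xcomp",
}

# Labels used in this corpus, as listed under "Relations"
used_labels = ["acl:relcl", "advcl", "advmod", "aux", "det", "dislocated",
               "mark", "nmod:poss", "nsubj", "obj", "punct", "root"]

subtypes = sorted(l for l in used_labels if ":" in l)
main_types_used = {l.split(":")[0] for l in used_labels}
used_bare = {l for l in used_labels if ":" not in l}

# Main types that occur only with a subtype, never on their own
always_subtyped = sorted(main_types_used - used_bare)
unused = sorted(UNIVERSAL_RELATIONS - main_types_used)

print(subtypes)         # ['acl:relcl', 'nmod:poss']
print(always_subtyped)  # ['acl', 'nmod']
print(len(unused))      # 25
```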