This page pertains to UD version 2.

UD Romanian Nonstandard

Language: Romanian (code: ro)
Family: Indo-European (IE)

This treebank has been part of Universal Dependencies since the UD v2.1 release.

The following people have contributed to making this treebank part of UD: Cătălina Mărănduc, Cenel-Augusto Perez, Victoria Bobicev, Cătălin Mititelu, Florinel Hociung, Valentin Roșca, Roman Untilov, Petru Rebeja.

Repository: UD_Romanian-Nonstandard
Search this treebank on-line: PML-TQ
Download all treebanks: UD 2.15

License: CC BY-SA 4.0

Genre: bible, poetry

Questions, comments? General annotation questions (either Romanian-specific or cross-linguistic) can be raised in the main UD issue tracker. You can report bugs in this treebank in the treebank-specific issue tracker on Github. If you want to collaborate, please contact [catalinamaranduc (æt) gmail • com, perez_cenel_augusto (æt) yahoo • com, victoria • bobicev (æt) gmail • com]. Development of the treebank happens outside the UD repository. If there are bugs, either the original data source or the conversion procedure must be fixed. Do not submit pull requests against the UD repository.

Annotation Source
Lemmas: annotated manually in non-UD style, automatically converted to UD
UPOS: annotated manually in non-UD style, automatically converted to UD
XPOS: annotated manually
Features: annotated manually in non-UD style, automatically converted to UD
Relations: annotated manually in non-UD style, automatically converted to UD

Description

The Romanian Non-standard UD treebank (called UAIC-RoDia) is based on the UAIC-RoDia Treebank (the treebank of the Faculty of Computer Science, “Al. I. Cuza” University, Iași, Romania). This is a balanced treebank. Its contemporary standard part (Perez, 2014) was included in the UD-Romanian-RRT Treebank. Since 2015, the UAIC Treebank has been extended with several nonstandard language genres: Old Romanian, chat, and folklore (Mărănduc 2015, 2016, 2017c, 2018; Perez 2016), on the grounds that nonstandard language is used more than the standard one. The digitization of cultural heritage covers old texts as well as folklore, which is an oral phenomenon threatened with extinction (Mărănduc, 2017b).

As of March 2020, the UAIC-RoDia Treebank (ISLRN 156-635-615-024-0) has 34,794 sentences in its basic format.

For the first release, we converted part of the New Testament from Alba Iulia (1648), 916 sentences, to the UD format. It is the first printed New Testament in Romanian and is set in Cyrillic letters. The Latin-alphabet text was obtained with an OCR program built at the Institute of Mathematics and Computer Science in Chișinău, Republic of Moldova, by a group of researchers led by Alexander Colesnicov and Ludmila Malahov (Colesnicov 2016, Cojocaru 2017).

The second part of the first release consists of 284 sentences of folklore in verse: 230 sentences from Romania and 54 from the Republic of Moldova (where the Romanian language is spoken) (Bobicev 2016).

For the second release, we finished converting the first part of the New Testament (1648) to UD format: all the prefaces and the four Gospels, 5,172 sentences in total, including the 916 from the first release.

The third release covers the entire Alba Iulia New Testament (1648).

The next release adds Flower of Gifts, Moldavian Ballads, and Romanian Ballads.

In addition, the contribution from the Republic of Moldova now amounts to 1,805 sentences of folklore.

On 23 September 2019 we added a new sub-corpus, Caragea’s Law (1818). In May 2020 we added the whole of Dosoftei’s “David’s Psalms translation with rhymes” (1673) and the first part of Ion Neculce’s “Chronicle” (1743), with the remainder to follow. In October 2020 we added 1,000 sentences of “Romanian Ballads”. The folklore is at the beginning of the train document, but 50 sentences are at the end of the test and dev documents. Also in October 2020 we added the rest of Ion Neculce’s “Chronicle” (1743).

Acknowledgments

Statistics of UD Romanian Nonstandard

POS Tags

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X

Features

AdpType, Case, Compound, Definite, Degree, Gender, Mood, Number, Number[psor], NumForm, NumType, PartType, Person, Polarity, Polite, Position, Poss, PronType, Reflex, Strength, Tense, Typo, Variant, VerbForm

Relations

acl, advcl, advcl:tcl, advmod, advmod:tmod, amod, appos, aux, aux:pass, case, cc, cc:preconj, ccomp, ccomp:pmod, clf, compound, conj, cop, csubj, csubj:pass, dep, det, discourse, expl, expl:impers, expl:pass, expl:poss, expl:pv, fixed, flat, goeswith, iobj, list, mark, nmod, nmod:tmod, nsubj, nsubj:pass, nummod, obj, obl, obl:agent, obl:pmod, orphan, parataxis, punct, root, vocative, xcomp

Tokenization and Word Segmentation

Morphology

Tags

Nominal Features

Degree and Polarity

Verbal Features

Pronouns, Determiners, Quantifiers

Other Features

Syntax

Auxiliary Verbs and Copula

Core Arguments, Oblique Arguments and Adjuncts

Here we consider only relations between verbs (parent) and nouns or pronouns (child).
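
The restriction above can be illustrated with a minimal sketch that counts relation labels between a verbal parent and a nominal child in CoNLL-U data. This is not the script used to produce the statistics on this page. The sample sentence, the function names, and the choice to count PROPN alongside NOUN and PRON are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical sample sentence (not taken from the treebank), built from the
# ten CoNLL-U columns:
# ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
ROWS = [
    ("1", "Ion", "Ion", "PROPN", "_", "_", "2", "nsubj", "_", "_"),
    ("2", "vede", "vedea", "VERB", "_", "_", "0", "root", "_", "_"),
    ("3", "casa", "casă", "NOUN", "_", "_", "2", "obj", "_", "_"),
    ("4", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"),
]
SAMPLE = "# text = Ion vede casa.\n" + "\n".join("\t".join(r) for r in ROWS) + "\n"

PARENT_TAGS = {"VERB"}
# Assumption: proper nouns are counted together with nouns and pronouns.
CHILD_TAGS = {"NOUN", "PROPN", "PRON"}


def read_sentences(text):
    """Yield each sentence as a list of token rows (lists of columns)."""
    sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        elif line.startswith("#"):
            continue  # sentence-level comment
        else:
            cols = line.split("\t")
            # Keep plain word lines only; skip multiword-token ranges ("1-2")
            # and empty nodes ("1.1"), whose IDs are not plain integers.
            if cols[0].isdigit():
                sentence.append(cols)
    if sentence:
        yield sentence


def verb_nominal_relations(text):
    """Count DEPREL labels where the parent is a verb and the child is nominal."""
    counts = Counter()
    for sent in read_sentences(text):
        upos_by_id = {cols[0]: cols[3] for cols in sent}
        for cols in sent:
            upos, head, deprel = cols[3], cols[6], cols[7]
            if upos in CHILD_TAGS and upos_by_id.get(head) in PARENT_TAGS:
                counts[deprel] += 1
    return counts


if __name__ == "__main__":
    print(verb_nominal_relations(SAMPLE))
```

To run the same count over a treebank file, pass the file's contents to `verb_nominal_relations` in place of `SAMPLE`.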

Reflexive Verbs

Reflexive Passive

Verbs with Reflexive Core Objects

Relations Overview