UD for Soi
The AHA Soi Treebank is a small treebank for contemporary Soi. Its corpus is collected and annotated manually. We prepared this treebank based on interviews with Soi speakers.
Tokenization and Word Segmentation
- Words are generally delimited by whitespace or punctuation.
- Punctuation marks are written attached to the neighboring word, but we always tokenize them as separate tokens.
- Coordinating conjunctions and prepositions are separated from the words that follow them in a sentence.
- For words that have two or more parts, such as words with a prefix or suffix, a semi-space is used between the parts.
- There are no multiword tokens in Soi.
- Elements that are usually written joined together in Persian script are separated by whitespace.
- We use Persian script in this version; later versions may use a transliteration to make the work more accurate. However, we do not adhere to Persian script rules. For example, we might start a word with a vowel and no base. Note that Soi does not naturally have a written form and can be written in any script.
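The whitespace and punctuation rules above can be illustrated with a minimal sketch. It is not the tool used to build the treebank. The function name `tokenize` is hypothetical, and the sketch assumes that the "semi-space" is the Unicode zero-width non-joiner (U+200C), which is kept inside the word rather than treated as a token boundary.

```python
import unicodedata

# Assumption: the "semi-space" joining word parts is the zero-width
# non-joiner. It is not whitespace, so str.split() leaves it in place.
ZWNJ = "\u200c"


def tokenize(sentence: str) -> list[str]:
    """Split on whitespace, then split punctuation off as separate tokens."""
    tokens: list[str] = []
    for chunk in sentence.split():
        current = ""
        for ch in chunk:
            # Unicode general categories starting with "P" are punctuation,
            # which covers both Latin and Persian punctuation marks.
            if unicodedata.category(ch).startswith("P"):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens
```

For example, `tokenize("ab, c.")` yields four tokens, with the comma and the full stop separated from the words they were attached to. A word whose parts are joined by the semi-space stays a single token.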
Morphology
Tags
- We will probably use all UPOS tags, but because most of our sentences are based on interviews with Soi speakers, some tags, such as SYM, may not appear naturally. Also, because our corpus is under development, we still do not know which tags may never be used. However, we do not use PROPN, based on the Seraji corpus.
- For XPOS tags, our basis at this level is the Seraji corpus.
- At this level, closed class auxiliaries (tagged AUX) include “دار” only.
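The tag conventions above can be checked mechanically on a CoNLL-U file. The following is a rough sketch, not part of the treebank's tooling. The function name `check_conllu` is hypothetical, and it assumes that “دار” is recorded in the LEMMA column (the third of the ten tab-separated CoNLL-U columns).

```python
# Assumption: the auxiliary listed above is stored as a lemma.
ALLOWED_AUX_LEMMAS = {"دار"}
# PROPN is not used in this treebank, per the list above.
BANNED_UPOS = {"PROPN"}


def check_conllu(text: str) -> list[str]:
    """Return human-readable problems found in CoNLL-U formatted text."""
    problems: list[str] = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # Skip blank sentence separators and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            problems.append(f"line {line_no}: expected 10 columns, got {len(cols)}")
            continue
        lemma, upos = cols[2], cols[3]
        if upos in BANNED_UPOS:
            problems.append(f"line {line_no}: UPOS {upos} is not used")
        if upos == "AUX" and lemma not in ALLOWED_AUX_LEMMAS:
            problems.append(f"line {line_no}: unexpected AUX lemma {lemma!r}")
    return problems
```

An empty result means no token violated the two conventions; it says nothing about the rest of the annotation.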
Features
Syntax
- Standard deprels are used.
- The following relation subtypes are used in Soi:
Treebanks
There is only one Soi UD treebank: