This is part of archived UD v1 documentation. See http://universaldependencies.org/ for the current version.


Tokenization

The tokenization in the Italian UD treebank is a straightforward segmentation based on whitespace and punctuation. The following special cases deserve to be mentioned:

Abbreviations are treated as single words regardless of whether they contain dots or other punctuation symbols. Examples: etc., ecc., es., art., tel., U.S.A., a.C., S.O.S., L., sig.
Numerical expressions: dots (separating thousands), commas (separating decimals) and colons (separating hours from minutes, and minutes from seconds) are part of the same token. Examples: 4.755.000, 19,30, 355.089,40, 20:24:0
URLs are rendered as a single token.
Proper names may contain a dash (“-”). Examples: C-212-300, E-commerce, D-day, Yamate-dōri
Punctuation: “…” is a single token.
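
The special cases above can be approximated with a single ordered regular expression. The following is only a minimal sketch, not the tool used to build the treebank: the name `tokenize`, the pattern order, the short abbreviation list (taken from the examples above) and the treatment of three ASCII dots as an ellipsis are all assumptions made for illustration.

```python
import re

# Hypothetical token pattern: earlier alternatives take priority,
# so the special cases are listed before the generic word rule.
TOKEN_RE = re.compile(
    r"""
    https?://\S+                          # URLs are kept as a single token
  | (?:[A-Za-z]\.){2,}                    # dotted abbreviations: U.S.A., a.C., S.O.S.
  | \b(?:etc|ecc|es|art|tel|sig|L)\.      # abbreviations listed in the guidelines
  | \d+(?:[.,:]\d+)+                      # 4.755.000, 19,30, 355.089,40, 20:24:0
  | \.\.\. | …                            # ellipsis as one token (ASCII form is an assumption)
  | \w+(?:-\w+)+                          # names with dashes: C-212-300, D-day
  | \w+                                   # ordinary words and plain numbers
  | [^\w\s]                               # any other punctuation mark, one per token
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Return the list of tokens found in text, skipping whitespace."""
    return TOKEN_RE.findall(text)


print(tokenize("Vedi art. 5, ecc. … costa 355.089,40 alle 19,30"))
```

Note that this sketch joins every hyphenated string, not only proper names, and lets a URL absorb any punctuation that follows it without a space; a real tokenizer would need extra rules for both.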

Multi-word tokens

The Italian UD treebank does not contain multiword tokens.

Fused words

According to the UD guidelines, the basic units of annotation are syntactic words (not phonological or orthographic words). We therefore systematically split off clitics and articulated prepositions, as in the following examples:

specializzarsi = specializzare si = “to specialize oneself”
andarsene = andare se ne = “to go away”
mangiarlo = mangiare lo = “to eat it”
mangiarselo = mangiare se lo = “to eat it oneself”
della = di la = “of the”
all’ = a l’ = “to the”
degli = di gli = “of the”
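
One simple way to picture this splitting is a lookup table from orthographic forms to their syntactic words. The sketch below assumes such a table; `SPLITS` and `split_fused` are hypothetical names, the table holds only the examples listed above, and a real system would need morphological analysis rather than a fixed list.

```python
# Hypothetical lookup table built from the examples in the guidelines.
SPLITS = {
    "specializzarsi": ["specializzare", "si"],
    "andarsene": ["andare", "se", "ne"],
    "mangiarlo": ["mangiare", "lo"],
    "mangiarselo": ["mangiare", "se", "lo"],
    "della": ["di", "la"],
    "all’": ["a", "l’"],
    "degli": ["di", "gli"],
}


def split_fused(tokens):
    """Replace each known fused form with its syntactic words.

    Lookup is case-insensitive; known forms come back lowercased,
    unknown tokens are passed through unchanged.
    """
    words = []
    for token in tokens:
        words.extend(SPLITS.get(token.lower(), [token]))
    return words


print(split_fused(["Voglio", "mangiarselo"]))
```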

Sentence splitting

Each sentence contains only one root. Splitting is usually performed after an end-of-sentence dot, or after a colon or semicolon when these punctuation marks separate unrelated subparts of a sentence. Items in a list may sometimes be rendered as separate sentences.
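
The one-root constraint can be checked mechanically. The sketch below assumes the sentence is available in the tab-separated CoNLL-U format, where the seventh column (HEAD) is 0 for the root; the function name `count_roots` and the two-token sample are hypothetical and serve only as illustration.

```python
def count_roots(conllu_sentence):
    """Count word lines whose HEAD column (7th field) is 0."""
    roots = 0
    for line in conllu_sentence.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Only plain integer IDs are syntactic words; range IDs such as
        # "1-2" carry no HEAD value and are ignored here.
        if len(cols) > 6 and cols[0].isdigit() and cols[6] == "0":
            roots += 1
    return roots


# Hypothetical two-token sentence used only to exercise the check.
sample = (
    "1\tPiove\tpiovere\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\t.\t.\tPUNCT\t_\t_\t1\tpunct\t_\t_"
)
assert count_roots(sample) == 1
```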