Tokenization
Tokenization in the Italian UD treebank is a straightforward segmentation based on whitespace and punctuation. The following special cases are worth noting:
- Abbreviations are treated as single words regardless of whether they contain dots or other punctuation symbols. Examples: etc., ecc., es., art., tel., U.S.A., a.C., S.O.S., L., sig.
- Numerical expressions: dots (separating thousands), commas (separating decimals) and colons (separating hours from minutes, and minutes from seconds) are kept inside a single token. Examples: 4.755.000, 19,30, 355.089,40, 20:24:0
- URLs are rendered as a single token.
- Proper names may contain a dash (“-”). Examples: C-212-300, E-commerce, D-day, Yamate-dōri
- Punctuation: the ellipsis “…” is a single token.
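The special cases above can be approximated with an ordered regular expression. The following is only a minimal sketch, not the tokenizer used to build the treebank: the abbreviation list is copied from the examples above, the URL pattern is a simplifying assumption (it only recognises http/https and would swallow trailing punctuation), and elision apostrophes are not handled.

```python
import re

# Hypothetical abbreviation list, copied from the examples in this section.
ABBREVIATIONS = ["etc.", "ecc.", "es.", "art.", "tel.",
                 "U.S.A.", "a.C.", "S.O.S.", "L.", "sig."]

# Longest abbreviations first, so that a longer entry wins over a shorter one.
_abbr = "|".join(re.escape(a)
                 for a in sorted(ABBREVIATIONS, key=len, reverse=True))

# Alternatives are tried in order at each position.
TOKEN_RE = re.compile("|".join([
    r"https?://\S+",                   # URLs as one token (assumed pattern)
    r"(?<!\w)(?:" + _abbr + r")(?!\w)",  # abbreviations keep their dots
    r"\d+(?:[.,:]\d+)*",               # 4.755.000, 19,30, 20:24:0
    r"\w+(?:-\w+)*",                   # words, including C-212-300, D-day
    r"…|\.\.\.",                       # ellipsis as one token
    r"\S",                             # any other non-space character
]))

def tokenize(text):
    """Return the list of tokens found in text, ignoring whitespace."""
    return TOKEN_RE.findall(text)

print(tokenize("Costa 4.755.000 lire, ecc. …"))
print(tokenize("Alle 20:24:0 il C-212-300 parte."))
```

The order of the alternatives matters: URLs and abbreviations are tried before the generic word and punctuation patterns, so their internal dots are never split off.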
Multi-word tokens
The Italian UD treebank does not contain multiword tokens.
Fused words
According to the UD guidelines, the basic units of annotation are syntactic words (not phonological or orthographic words). We therefore systematically split off clitics and articulated prepositions. Examples follow:
- specializzarsi = specializzare si = “to specialize oneself”
- andarsene = andare se ne = “to go away”
- mangiarlo = mangiare lo = “to eat it”
- mangiarselo = mangiare se lo = “to eat it oneself”
- della = di la = “of the”
- all’ = a l’ = “to the”
- degli = di gli = “of the”
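The splits listed above can be illustrated with a simple lookup. This is a hedged sketch only: the table `FUSED` and the helper names are hypothetical, the entries are limited to the examples in this section, and a realistic splitter would need a full lexicon or a morphological analyser rather than a hand-written dictionary.

```python
# Hypothetical lookup table containing only the examples from this section.
FUSED = {
    "specializzarsi": ["specializzare", "si"],
    "andarsene": ["andare", "se", "ne"],
    "mangiarlo": ["mangiare", "lo"],
    "mangiarselo": ["mangiare", "se", "lo"],
    "della": ["di", "la"],
    "all’": ["a", "l’"],
    "degli": ["di", "gli"],
}

def split_fused(token):
    """Return the syntactic words for a token, or the token itself if unknown."""
    return FUSED.get(token.lower(), [token])

def syntactic_words(tokens):
    """Expand every fused token in a token list into its syntactic words."""
    words = []
    for token in tokens:
        words.extend(split_fused(token))
    return words

print(syntactic_words(["Voglio", "mangiarlo"]))
print(syntactic_words(["Il", "libro", "degli", "studenti"]))
```

Note that the lookup lowercases known forms, so the original capitalisation of a fused token is not preserved in this sketch.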
Sentence splitting
Each sentence contains only one root. Splitting is usually performed after an end-of-sentence dot or after a colon or semicolon when these punctuation marks separate unrelated subparts of a sentence. Items in a list may sometimes be rendered as separate sentences.
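The one-root constraint can be checked mechanically on CoNLL-U data. The sketch below assumes the standard ten-column, tab-separated CoNLL-U layout, in which the seventh column (HEAD) is 0 for the root; the function name `count_roots` and the two-word sample sentence are illustrative assumptions, not taken from the treebank.

```python
def count_roots(sentence_block):
    """Count the words whose HEAD column is 0 in one CoNLL-U sentence."""
    roots = 0
    for line in sentence_block.splitlines():
        if not line or line.startswith("#"):
            continue                    # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue                    # skip range lines (1-2) and empty nodes (1.1)
        if cols[6] == "0":              # HEAD == 0 marks the root
            roots += 1
    return roots

# Illustrative sentence, written by hand for this example.
sample = "\n".join([
    "# text = Piove.",
    "\t".join(["1", "Piove", "piovere", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["2", ".", ".", "PUNCT", "_", "_", "1", "punct", "_", "_"]),
])

print(count_roots(sample) == 1)
```

A sentence block for which this count differs from 1 would indicate that two unrelated subparts were not split, or that the annotation is incomplete.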