This page still pertains to UD version 1.

Introduction

The Italian corpus annotated according to the UD annotation scheme was obtained by conversion from ISDT (Italian Stanford Dependency Treebank), released for the dependency parsing shared task of Evalita-2014 (Bosco et al. 2014).

ISDT is a resource annotated according to the Stanford dependencies scheme (de Marneffe et al. 2008, 2013a, 2013b, 2014), obtained through a semi-automatic conversion process starting from MIDT (the Merged Italian Dependency Treebank). MIDT in turn was obtained by merging two existing Italian treebanks that differ in both corpus composition and annotation scheme: TUT and ISST-TANL.

The details of the harmonization and conversion process leading to MIDT are discussed in Bosco, Montemagni and Simi (2012). The Stanford annotation scheme, applied to an enriched version of MIDT, was adapted to the specific features of the Italian language. See Bosco, Montemagni and Simi (2013, 2014) for a discussion.

The final conversion step, leading to UD, is in progress. A preliminary release was issued in January 2015.

Corpus composition

| Original format | Source | Genre | Size in tokens | Size in sentences |
|---|---|---|---|---|
| TUT-CONLL | Evalita 2011 Dependency parsing | Legal texts, news articles, Wikipedia articles | 112,877 | 3,802 |
| ISST-TANL | Evalita 2011 Domain adaptation task | Newspaper articles | 89,102 | 4,043 |
| ISST-TANL | SPLeT 2012 | Legal texts: European directives | 6,893 | 259 |
| MIDT | Several QA competitions | Questions | 23,391 | 2,162 |
| MIDT | Evalita 2014 Dependency parsing: test data set (partial) | News articles | 8,375 | 300 |
| TUT-CONLL | Parallel TUT (Italian part) | Various genres | 61,460 | 2,111 |
| UD | Due Parole | Simplified Italian news | 23,718 | 1,138 |
| TOTAL | | | 325,816 | 13,815 |

NOTE: comment lines are excluded from the token counts.
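
The note above can be illustrated with a minimal sketch of how such counts might be computed from a CoNLL-U style file. This is not the script used to produce the table. The function name `count_tokens_and_sentences` and the sample lines are hypothetical, and the sketch assumes that sentences are separated by blank lines, that comment lines start with `#`, and that only lines with a plain integer ID are counted as tokens (multiword-token ranges such as `1-2` are skipped). Whether the figures in the table treat range lines this way is not stated on this page.

```python
def count_tokens_and_sentences(lines):
    """Count token lines and sentences in CoNLL-U style text, skipping comments."""
    tokens = 0
    sentences = 0
    in_sentence = False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if one is open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            continue
        if line.startswith("#"):
            # Comment lines carry metadata and are not counted as tokens.
            continue
        in_sentence = True
        token_id = line.split("\t", 1)[0]
        # Assumption: count only plain integer IDs; ranges like "1-2" are skipped.
        if token_id.isdigit():
            tokens += 1
    if in_sentence:
        # The last sentence may not be followed by a blank line.
        sentences += 1
    return tokens, sentences


# Illustrative data only (rows truncated to four columns), not taken from the treebank.
sample = [
    "# sent_id = 1",
    "1\tIl\til\tDET",
    "2\tgatto\tgatto\tNOUN",
    "3\tdorme\tdormire\tVERB",
    "",
    "# sent_id = 2",
    "1-2\tDel\t_\t_",
    "1\tDi\tdi\tADP",
    "2\til\til\tDET",
    "3\tresto\tresto\tNOUN",
    "",
]
tokens, sentences = count_tokens_and_sentences(sample)
print(tokens, sentences)
```

To apply the same function to a file, an open file handle can be passed in place of `sample`, since the function only iterates over lines.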

Acknowledgments

We wish to thank all of the contributors to the original annotation efforts, as well as the supporting organizations, i.e. the Institute for Computational Linguistics “A. Zampolli”, the University of Pisa, and the University of Torino.

References