Corpus Italiano

Description of the corpus

The Paisà corpus is a large collection of Italian web texts, licensed under Creative Commons (Attribution-ShareAlike and Attribution-Noncommercial-ShareAlike). It was created within the PAISÀ project.

The PAISÀ documents were selected in two different ways. A part of the corpus was constructed using a method inspired by the WaCky project. We created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list and used the pairs as queries to the Yahoo! search engine to retrieve candidate pages. We limited hits to pages in Italian carrying one of the following Creative Commons licenses: CC-Attribution, CC-Attribution-Sharealike, CC-Attribution-Sharealike-Non-commercial, or CC-Attribution-Non-commercial. Pages wrongly tagged as CC-licensed were eliminated using a blacklist populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system.
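The random pairing of vocabulary terms described above can be sketched as follows. This is only an illustration, not the project's actual script: the function name `make_query_pairs`, the `seed` parameter, and the choice to treat pairs as unordered and distinct are all assumptions, and the search-engine call itself is left out.

```python
import random


def make_query_pairs(vocabulary, n_pairs, seed=None):
    """Return n_pairs distinct two-word queries drawn at random.

    Assumption: a pair is unordered ("casa mare" equals "mare casa")
    and never repeats a word; the source does not specify either point.
    """
    rng = random.Random(seed)
    words = sorted(set(vocabulary))
    max_pairs = len(words) * (len(words) - 1) // 2
    if n_pairs > max_pairs:
        raise ValueError(f"only {max_pairs} distinct pairs are possible")
    pairs = set()
    while len(pairs) < n_pairs:
        first, second = rng.sample(words, 2)  # two different words
        pairs.add(tuple(sorted((first, second))))
    return [" ".join(pair) for pair in sorted(pairs)]
```

With a basic vocabulary of a few thousand terms, a call such as `make_query_pairs(vocabulary, 50000)` would yield the query strings to submit to a search engine.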

The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity, and Wikivoyage. The text was extracted from the official Wikimedia Foundation dumps with Wikipedia Extractor.

Once all materials were downloaded, the collection was filtered by discarding empty documents and documents containing fewer than 150 words.
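A minimal sketch of such a length filter is shown below. The source does not say how words were counted, so whitespace tokenization is an assumption here, as are the names `MIN_WORDS` and `keep_document`.

```python
MIN_WORDS = 150  # threshold stated in the text


def keep_document(text, min_words=MIN_WORDS):
    """Return True if the document has at least min_words words.

    Assumption: words are counted by splitting on whitespace.
    Empty documents have zero words and are therefore rejected.
    """
    return len(text.split()) >= min_words
```

A collection could then be filtered with `[doc for doc in documents if keep_document(doc)]`.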

The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, and approximately 5,600 are from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.

Documents are marked in the corpus by an XML "text" tag with "id" and "url" attributes, the first corresponding to a unique numeric code assigned to each document, the second providing the original URL of the document.
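The document markers can be read with a few lines of code. The sketch below assumes that each opening tag sits on a single line and that attribute values are double-quoted; neither detail is stated above, and `parse_text_tag` is a hypothetical helper name.

```python
import re
from html import unescape

# Assumption: the opening tag looks like <text id="..." url="...">
TEXT_TAG = re.compile(r"<text\b([^>]*)>")
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')


def parse_text_tag(line):
    """Return {'id': ..., 'url': ...} for an opening text tag, else None."""
    match = TEXT_TAG.search(line)
    if match is None:
        return None
    attrs = {name: unescape(value)
             for name, value in ATTRIBUTE.findall(match.group(1))}
    if "id" not in attrs or "url" not in attrs:
        return None
    return attrs
```

Calling `parse_text_tag` on every line of a corpus file and keeping the non-`None` results would give an index of document ids and source URLs.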

For more on the construction process, see the section "construction steps"; for the partners' contributions, see the section "partnership". Online access to the corpus is available via a dedicated interface. Additionally, the corpus is provided for download in different versions.

Paisà corpus and frequency lists for download

The compiled Paisà corpus and the information derived from it are licensed by the Paisà project (www.corpusitaliano.it) under a Creative Commons Attribution-Noncommercial-ShareAlike license.

It is a collection of cleaned and linguistically annotated web texts. Copyright remains with the sources at the indicated URLs; the texts are used partly under Creative Commons Attribution-ShareAlike and partly under Attribution-Noncommercial-ShareAlike, as in effect at the time of download in September/October 2010.

The frequency files provide simple listings of the lemmas found in the entire Paisà corpus, together with their counts (format: LEMMA,COUNT). The lemmas are listed in descending frequency order. The reduced list contains only lemmas that are composed of letters and the following three signs: .-'
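The LEMMA,COUNT format and the reduced-list criterion can be illustrated with the sketch below. It is not the project's own tooling: the function names are hypothetical, splitting on the last comma is an assumption made so that a lemma containing a comma still parses, and using `str.isalpha` as the definition of "letter" is also an assumption.

```python
ALLOWED_SIGNS = ".-'"  # the three signs named in the text


def parse_frequency_line(line):
    """Split one 'LEMMA,COUNT' line into (lemma, count).

    Assumption: the count follows the LAST comma on the line.
    """
    lemma, count = line.rstrip("\r\n").rsplit(",", 1)
    return lemma, int(count)


def is_reduced_lemma(lemma):
    """True if the lemma has only letters and the signs . - '"""
    return bool(lemma) and all(
        ch.isalpha() or ch in ALLOWED_SIGNS for ch in lemma
    )
```

Applying `is_reduced_lemma` to the lemmas of the full list should approximate the reduced list, subject to the assumptions above.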

Eurac Research CLARIN Centre (ERCC) download page

For citing the corpus:

Lyding, V. / Stemle, E. / Borghetti, C. / Brunello, M. / Castagnoli, S. / Dell'Orletta, F. / Dittmann, H. / Lenci, A. / Pirrelli, V. (2014): "The PAISÀ Corpus of Italian Web Texts". In: Proceedings of the 9th Web as Corpus Workshop (WaC-9), Association for Computational Linguistics, Gothenburg, Sweden, April 2014, pp. 36-43.

Data format

Distributed data adhere to the following rules:

data files contain one or more sentences separated by a blank line;
a sentence consists of one or more tokens, each one starting on a new line;
a token consists of fields described in the table below;
fields are separated by one tab character;
all data files will contain the fields documented in the following table;
data files are UTF-8 encoded (Unicode).

Field 1	ID	Token counter, starting at 1 for each new sentence
Field 2	FORM	Word form or punctuation symbol
Field 3	LEMMA	Lemma of word form
Field 4	CPOSTAG	Coarse-grained part-of-speech tag
Field 5	POSTAG	Fine-grained part-of-speech tag
Field 6	FEATS	Morpho-syntactic features
Field 7	HEAD	Head of the current token, which is either a value of ID or zero ('0')
Field 8	DEPREL	Dependency relation linking the token to its head, which is 'ROOT' when the value of HEAD is zero. See the dependency tagset for more information.
Field 9	not used
Field 10	not used
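
The rules and field table above translate fairly directly into a reader. The following is a sketch under stated assumptions rather than an official parser: the `Token` class and `read_sentences` function are hypothetical names, and skipping lines that begin with `<` (on the guess that document markup is interleaved with token lines) is an assumption about the file layout.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Token:
    id: int        # Field 1
    form: str      # Field 2
    lemma: str     # Field 3
    cpostag: str   # Field 4
    postag: str    # Field 5
    feats: str     # Field 6, kept as the raw string
    head: int      # Field 7, 0 for the root
    deprel: str    # Field 8


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per blank-line-separated sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():          # blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("<"):      # assumed markup line, not a token
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 fields: {line!r}")
        sentence.append(Token(int(fields[0]), fields[1], fields[2],
                              fields[3], fields[4], fields[5],
                              int(fields[6]), fields[7]))
    if sentence:                      # file may lack a final blank line
        yield sentence
```

Opening a data file with `open(path, encoding="utf-8")` and passing the file object to `read_sentences` would iterate over its sentences; fields 9 and 10 are ignored because they are not used.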

The morpho-syntactic and dependency tagsets were jointly developed by the Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR) and the University of Pisa in the framework of the TANL (Text Analytics and Natural Language processing) project and were used for the annotation of the ISST-TANL dependency annotated corpus.

An annotation example follows:

ID	FORM	LEMMA	CPOSTAG	POSTAG	FEATS	HEAD	DEPREL
1	Gli	il	R	RD	num=p|gen=m	2	det
2	stati	stati	S	S	num=p|gen=m	4	subj
3	membri	membro	S	S	num=p|gen=m	2	mod
4	provvedono	provvedere	V	V	num=p|per=3|mod=i|ten=p	0	ROOT
5	affinché	affinché	C	CS	_	4	mod
6	il	il	R	RD	num=s|gen=m	7	det
7	gestore	gestore	S	S	num=s|gen=m	9	subj_pass
8	sia	essere	V	VA	num=s|per=3|mod=c|ten=p	9	aux
9	obbligato	obbligare	V	V	num=s|mod=p|gen=m	5	sub
10	a	a	E	E	_	9	arg
11	trasmettere	trasmettere	V	V	mod=f	10	prep
12	all'	a	E	EA	num=s|gen=n	11	comp_ind
13	autorità	autorità	S	S	num=n|gen=f	12	prep
14	competente	competente	A	A	num=s|gen=n	13	mod
15	una	una	R	RI	num=s|gen=f	16	det
16	notifica	notifica	S	S	num=s|gen=f	11	obj
17	entro	entro	E	E	_	11	comp_temp
18	i	il	R	RD	num=p|gen=m	20	det
19	seguenti	seguente	A	A	num=p|gen=n	20	mod
20	termini	termine	S	S	num=p|gen=m	17	prep
21	.	.	F	FS	_	4	punc
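
To work with an annotated sentence like the one above, the FEATS and HEAD columns usually need a little unpacking. The sketch below assumes that FEATS holds `name=value` pairs separated by a pipe character, with `_` meaning no features, and that HEAD values index tokens of the same sentence; `parse_feats` and `children_by_head` are hypothetical helper names.

```python
from collections import defaultdict


def parse_feats(feats):
    """Turn 'num=p|gen=m' into {'num': 'p', 'gen': 'm'}; '_' gives {}."""
    if not feats or feats == "_":
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))


def children_by_head(rows):
    """Map each HEAD value to the list of token IDs that depend on it.

    rows is an iterable of (id, head, deprel) tuples for ONE sentence.
    Key 0 lists the root token(s). A token with HEAD 0 whose DEPREL is
    not 'ROOT' contradicts the format description and raises ValueError.
    """
    tree = defaultdict(list)
    for token_id, head, deprel in rows:
        if head == 0 and deprel != "ROOT":
            raise ValueError(f"token {token_id}: HEAD is 0 but DEPREL is {deprel!r}")
        tree[head].append(token_id)
    return dict(tree)
```

For the first four tokens of the example, `children_by_head` reports that token 2 ("stati") governs tokens 1 and 3, and that token 4 ("provvedono") is the root.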