Construction of the corpus
The corpus was constructed according to the following steps:
- creation of a seed list of random combinations of frequent Italian words (see the sketch after this list)
  - words taken from the "Vocabolario di base della lingua italiana" (VdB) by Tullio De Mauro
  - 50,000 tuples overall
- retrieval of URLs via a search engine and cleaning of the URL lists
  - using the BootCaT tools with Yahoo!
  - using Yahoo!'s option to select pages licensed under Creative Commons Attribution
  - removal of pages wrongly classified as Creative Commons by the search engine (based on manually created blacklists)
- download of web content for the URLs and creation of cleaned corpora
  - for general pages, using the KrdWrd tools for web page retrieval and clean-up
  - for Wiki pages, using the Wikipedia Extractor in combination with a script to separate out single documents
  - removal of empty, undersized and oversized files with in-house scripts
- linguistic annotation
  - part-of-speech tagging with the ILC POS tagger from CNR Pisa, using the TANL fine-grained part-of-speech tagset
  - dependency parsing with the DeSR parser from the Università di Pisa, using the ISST-TANL dependency tagset
- metadata
  - the source URL is attached to each document of the corpus
  - indexing for use with the Open Corpus Workbench (CWB)/CQP using the cwb-encode tools
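As an illustration of the seed-list step, here is a minimal sketch of BootCaT-style tuple generation. The vocabulary file name, the tuple size of two words and the sampling details are assumptions made for the example, not a description of the exact PAISÀ scripts:

```python
# Minimal sketch of seed-tuple generation (BootCaT-style).
# "vocabolario_di_base.txt" is a hypothetical file with one VdB word per line.
import random

def build_seed_tuples(vocab_path, n_tuples=50_000, words_per_tuple=2, seed=0):
    """Randomly combine frequent words into unique search-engine queries."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = [line.strip() for line in f if line.strip()]
    rng = random.Random(seed)
    tuples = set()
    while len(tuples) < n_tuples:
        # Sample without replacement so a query never repeats the same word.
        tuples.add(tuple(rng.sample(vocab, words_per_tuple)))
    return [" ".join(t) for t in tuples]

# Example: queries = build_seed_tuples("vocabolario_di_base.txt")
```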
Collection of the documents
The PAISÀ documents were selected in two different ways. Part of the corpus was constructed using a method inspired by the WaCky project: we created 50,000 word pairs by randomly combining terms from an Italian basic vocabulary list, and used the pairs as queries to the Yahoo! search engine in order to retrieve candidate pages. We limited hits to pages in Italian with a Creative Commons license of type CC Attribution, CC Attribution-ShareAlike, CC Attribution-NonCommercial-ShareAlike, or CC Attribution-NonCommercial. Pages wrongly tagged as CC-licensed were eliminated using a blacklist populated by manual inspection of earlier versions of the corpus. The retrieved pages were automatically cleaned using the KrdWrd system.
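The blacklist step amounts to a simple filter over the retrieved URL list. The sketch below assumes a hypothetical plain-text blacklist with one host name per line; the actual PAISÀ blacklists and their matching rules are not reproduced here:

```python
# Hedged sketch: drop URLs whose host appears on a manually created
# blacklist of sites wrongly reported as CC-licensed by the search engine.
# "blacklist.txt" (one host per line) is an assumed, hypothetical format.
from urllib.parse import urlparse

def load_blacklist(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_urls(urls, blacklist):
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Match the host itself or any parent domain on the blacklist.
        parts = host.split(".")
        candidates = {".".join(parts[i:]) for i in range(len(parts))}
        if candidates.isdisjoint(blacklist):
            kept.append(url)
    return kept
```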
The remaining pages in the PAISÀ corpus come from the Italian versions of various Wikimedia Foundation projects, namely Wikipedia, Wikinews, Wikisource, Wikibooks, Wikiversity and Wikivoyage. The official Wikimedia Foundation dumps were used, and the text was extracted with the Wikipedia Extractor.
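The Wikipedia Extractor writes plain text in which each article is wrapped in a <doc ...> element, so separating out single documents essentially reduces to splitting on those markers. A hedged sketch, with the regular expression and the output naming as assumptions rather than the actual PAISÀ script:

```python
# Hedged sketch of the "separate out single documents" step: split one
# Wikipedia Extractor output file into per-document text files.
import os
import re

DOC_RE = re.compile(r'<doc id="(?P<id>[^"]+)"[^>]*>\n(?P<body>.*?)\n</doc>',
                    re.DOTALL)

def split_documents(extractor_output, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(extractor_output, encoding="utf-8") as f:
        text = f.read()
    for match in DOC_RE.finditer(text):
        out_path = os.path.join(out_dir, f"{match['id']}.txt")
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(match["body"])
```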
Once all materials were downloaded, the collection was filtered to discard empty documents and documents containing fewer than 150 words.
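A minimal sketch of this filter, assuming whitespace tokenisation (the in-house scripts may count words differently):

```python
def keep_document(text, min_words=150):
    # Empty documents have zero words and are discarded as well.
    return len(text.split()) >= min_words
```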
The corpus contains approximately 380,000 documents coming from about 1,000 different websites, for a total of about 250 million words. Approximately 260,000 documents are from Wikipedia, approx. 5,600 from other Wikimedia Foundation projects. About 9,300 documents come from Indymedia, and we estimate that about 65,000 documents come from blog services.
Documents are marked in the corpus by an XML "text" tag with "id" and "url" attributes, the first corresponding to a unique numeric code assigned to each document, the second providing the original URL of the document.
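Under that convention, documents can be pulled out of a corpus file with a small reader. The sketch below assumes the attributes appear in the order id, url and are double-quoted, which matches the description above but should be checked against the actual files:

```python
# Hedged sketch: iterate over corpus documents via their <text> tags.
import re

TEXT_RE = re.compile(r'<text id="(?P<id>\d+)" url="(?P<url>[^"]*)">'
                     r'(?P<body>.*?)</text>', re.DOTALL)

def iter_documents(path):
    with open(path, encoding="utf-8") as f:
        data = f.read()
    for m in TEXT_RE.finditer(data):
        yield m["id"], m["url"], m["body"]
```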
More details on the construction process are given in the section on construction steps, and on the partners' contributions in the section on the partnership. Online access to the corpus is available via a dedicated interface. Additionally, the corpus is provided for download in different versions.
Data format
Distributed data adhere to the following rules:
- data files contain one or more sentences separated by a blank line;
- a sentence consists of one or more tokens, each one starting on a new line;
- a token consists of fields described in the table below;
- fields are separated by one tab character;
- all data files will contain the fields documented in the following table;
- data files are UTF-8 encoded (Unicode).
Field 1 | ID | Token counter, starting at 1 for each new sentence |
Field 2 | FORM | Word form or punctuation symbol |
Field 3 | LEMMA | Lemma of the word form |
Field 4 | CPOSTAG | Coarse-grained part-of-speech tag |
Field 5 | POSTAG | Fine-grained part-of-speech tag |
Field 6 | FEATS | Morpho-syntactic features |
Field 7 | HEAD | Head of the current token, which is either a value of ID or zero ('0') |
Field 8 | DEPREL | Dependency relation linking the token to its head, which is 'ROOT' when the value of HEAD is zero (see the ISST-TANL dependency tagset for more information) |
Field 9 | not used | |
Field 10 | not used | |
The morpho-syntactic and dependency tagsets were jointly developed by the Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC-CNR) and the University of Pisa in the framework of the TANL (Text Analytics and Natural Language processing) project, and were used for the annotation of the ISST-TANL dependency-annotated corpus.
An annotation example follows:
ID | FORM | LEMMA | CPOSTAG | POSTAG | FEATS | HEAD | DEPREL |
1 | Gli | il | R | RD | num=p|gen=m | 2 | det |
2 | stati | stati | S | S | num=p|gen=m | 4 | subj |
3 | membri | membro | S | S | num=p|gen=m | 2 | mod |
4 | provvedono | provvedere | V | V | num=p|per=3|mod=i|ten=p | 0 | ROOT |
5 | affinché | affinché | C | CS | _ | 4 | mod |
6 | il | il | R | RD | num=s|gen=m | 7 | det |
7 | gestore | gestore | S | S | num=s|gen=m | 9 | subj_pass |
8 | sia | essere | V | VA | num=s|per=3|mod=c|ten=p | 9 | aux |
9 | obbligato | obbligare | V | V | num=s|mod=p|gen=m | 5 | sub |
10 | a | a | E | E | _ | 9 | arg |
11 | trasmettere | trasmettere | V | V | mod=f | 10 | prep |
12 | all' | a | E | EA | num=s|gen=n | 11 | comp_ind |
13 | autorità | autorità | S | S | num=n|gen=f | 12 | prep |
14 | competente | competente | A | A | num=s|gen=n | 13 | mod |
15 | una | una | R | RI | num=s|gen=f | 16 | det |
16 | notifica | notifica | S | S | num=s|gen=f | 11 | obj |
17 | entro | entro | E | E | _ | 11 | comp_temp |
18 | i | il | R | RD | num=p|gen=m | 20 | det |
19 | seguenti | seguente | A | A | num=p|gen=n | 20 | mod |
20 | termini | termine | S | S | num=p|gen=m | 17 | prep |
21 | . | . | F | FS | _ | 4 | punc |
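A minimal reader for this format, assuming UTF-8 files with tab-separated fields and blank-line sentence boundaries as specified above:

```python
# Hedged sketch of a reader for the tab-separated annotation format:
# blank lines separate sentences; each token line carries the fields
# documented in the table (fields 9 and 10 are unused).
from collections import namedtuple

Token = namedtuple("Token",
                   "id form lemma cpostag postag feats head deprel")

def read_sentences(path):
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line: sentence boundary
                if sentence:
                    yield sentence
                    sentence = []
                continue
            fields = line.split("\t")
            # Keep only the eight documented fields; 9 and 10 are unused.
            sentence.append(Token(*fields[:8]))
    if sentence:
        yield sentence
```

For instance, the root of each sentence can then be found as the token whose HEAD value is '0'.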