Corpus Italiano

Filters

The Filter interface retrieves full-text corpus documents according to a set of criteria specified by the user. The documents can be inspected one by one, or downloaded as zipped .txt files.

The filter interface also allows for the creation of named subcorpora composed of filtered texts. After their creation, subcorpora show up in the corpus drop-down menu and can be queried like the full PAISÀ corpus.
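Since filtered documents can be downloaded as zipped .txt files, a short script can load them for further processing. The following is a minimal sketch, not part of the PAISÀ tooling: the function name `read_zipped_texts` is hypothetical, and the UTF-8 encoding and flat archive layout are assumptions about the download.

```python
import io
import zipfile


def read_zipped_texts(source, encoding="utf-8"):
    """Return {file name: text} for every .txt member of a zip archive.

    `source` may be a path or a file-like object.
    Assumption: the files are UTF-8 encoded.
    """
    texts = {}
    with zipfile.ZipFile(source) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".txt"):
                texts[name] = archive.read(name).decode(encoding)
    return texts


def make_demo_archive():
    """Build a small in-memory archive standing in for a real download."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("doc1.txt", "Le ferie estive sono finite.")
        archive.writestr("notes.md", "not a text document")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for name, text in read_zipped_texts(make_demo_archive()).items():
        print(name, len(text.split()))
```

With a real download, the path of the saved archive would be passed to `read_zipped_texts` in place of the in-memory demo archive.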

Filter criteria

a keyword - it has to be entered into the "Keyword" search box; only documents containing the keyword are retrieved
the number of tokens within a text - i.e. the number of running words of a text document
the number of tokens of non-basic vocabulary within a text - words that are not contained in the basic vocabulary according to these lists
the number of sentences within a text
the type-token ratio of a text, for details see here
the Gulpease index of a text, for details see here
the top-level domain - i.e., the final domain tag, e.g., ".it", ".org" or ".com" of the URL the text is taken from
the core URL - i.e. the name of a website from which more than 500 texts were taken to build the PAISÀ corpus
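To make the numeric criteria above more concrete, the sketch below computes rough equivalents for a single text. It is an illustration under stated assumptions, not the code PAISÀ uses: the regex-based tokenizer and sentence splitter are simplifications, all function names are hypothetical, and the Gulpease index follows the commonly cited formula 89 + (300 × sentences − 10 × letters) / words.

```python
import re
from urllib.parse import urlparse


def tokenize(text):
    # Assumption: a token is a run of word characters.
    # The corpus's own tokenizer may segment text differently.
    return re.findall(r"\w+", text)


def split_sentences(text):
    # Naive splitter: sentences end at '.', '!' or '?'.
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def type_token_ratio(text):
    # Distinct word forms (case-folded) divided by running words.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len({t.lower() for t in tokens}) / len(tokens)


def gulpease(text):
    # Commonly cited Gulpease formula; characters inside tokens
    # are counted as letters here.
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    letters = sum(len(t) for t in tokens)
    sentences = len(split_sentences(text))
    return 89 + (300 * sentences - 10 * letters) / len(tokens)


def top_level_domain(url):
    # Final domain tag of the host name, e.g. ".it".
    host = urlparse(url).hostname or ""
    return "." + host.rsplit(".", 1)[-1] if "." in host else ""


if __name__ == "__main__":
    sample = "Il gatto dorme. Il cane corre."
    print(len(tokenize(sample)), len(split_sentences(sample)))
    print(type_token_ratio(sample), gulpease(sample))
    print(top_level_domain("http://www.example.it/pagina"))
```

Values computed this way will not necessarily match those shown in the Filter interface, since tokenization and sentence splitting strongly affect every one of these measures.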

Filtered results are provided in three forms:

lists of text documents
named subcorpus
word cloud

Lists of text documents

The list of texts satisfying the filtering criteria can be paged through by clicking on arrow icons (see screenshot below); single texts can be opened in a separate tab by clicking on the file name or icon.

Named subcorpora

A named subcorpus containing all corpus texts that satisfy the filtering criteria can be stored by entering a name for the subcorpus in the appropriate field (see screenshot below) before clicking "submit". The name has to start with a capital letter and may contain only letters, digits and underscores.

User-defined subcorpora show up in the corpus dropdown menu and can be used for subsequent querying. The subcorpus called "Last" always stores the results of the most recent query or filtering carried out by the user.
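The naming rule for subcorpora can be checked before submitting a name. The sketch below is one possible reading of the rule, assuming that "letters" means unaccented ASCII letters; the function name `is_valid_subcorpus_name` is hypothetical and the interface itself may accept or reject other characters.

```python
import re

# One capital letter, followed by any number of letters, digits or underscores.
# Assumption: only ASCII letters are allowed.
NAME_PATTERN = re.compile(r"[A-Z][A-Za-z0-9_]*")


def is_valid_subcorpus_name(name):
    """Return True if `name` follows the documented naming rule."""
    return NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["Ferie_2012", "ferie", "Ferie-estive", ""]:
        print(repr(candidate), is_valid_subcorpus_name(candidate))
```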

Word Cloud

A word cloud is built based on the word frequencies of 80 of the documents that satisfy the filtering criteria. Words are displayed in alphabetical order and are scaled according to their frequencies.

The screenshot below shows a Word Cloud for documents filtered by the keyword "ferie".

The word cloud is built on the Google Visualization API.
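The idea behind the word cloud (count word frequencies, order words alphabetically, scale each word by its frequency) can be sketched as follows. This is an illustration only and does not use the Google Visualization API: the function name `word_cloud_entries`, the linear scaling and the font size range are all assumptions.

```python
import re
from collections import Counter


def word_cloud_entries(documents, min_size=10, max_size=40):
    """Return (word, frequency, font size) tuples in alphabetical order.

    Assumption: font size grows linearly with frequency between
    `min_size` and `max_size`; the real interface may scale differently.
    """
    counts = Counter()
    for doc in documents:
        counts.update(w.lower() for w in re.findall(r"\w+", doc))
    if not counts:
        return []

    lowest, highest = min(counts.values()), max(counts.values())
    span = highest - lowest
    entries = []
    for word in sorted(counts):
        freq = counts[word]
        # If every word is equally frequent, show all at maximum size.
        scale = (freq - lowest) / span if span else 1.0
        entries.append((word, freq, round(min_size + scale * (max_size - min_size))))
    return entries


if __name__ == "__main__":
    docs = ["ferie ferie estate", "ferie mare"]
    for word, freq, size in word_cloud_entries(docs):
        print(word, freq, size)
```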

Need more help? See here for an overview of our help pages.