Corpus Italiano

CQP Search

The CQP Search provides support for queries in the Corpus Query Processor query language (CQP) of the IMS Open Corpus Workbench. The search box serves as a command line for entering search commands.

As in the Simple and Advanced Search modes, results can be restricted to "easy sentences", and dependency diagrams can be displayed for the results. Easy sentences are selected based on predefined readability criteria. Both options are activated by selecting the respective checkboxes. If dependency diagrams are displayed for all hits, the context defaults to one sentence.

Examples of queries in CQP syntax

plain words (with or without quotes), e.g. the word "Bolzano":
- "Bolzano"
words containing regular expressions (within quotes), e.g. the words "qui" and "qua":
- "qu(a|i)"
or any word starting with "sopra":
- "sopra.*"
flags %c (ignore case) and %d (ignore diacritics), placed directly after the quoted pattern they apply to, e.g. "italiano" and "Italiano" and "ITALIANO":
- "Italiano" %c
POS and lemma annotations (that can also contain regular expressions), e.g. words with part of speech "DQ", all words of the lemma "andare":
- [pos="DQ"]
- [lemma="andare"]
Boolean expressions combining multiple requirements for a single token, e.g. the word "sei" belonging to the lemma "essere" or the word "sei" not belonging to the lemma "essere":
- [(word="sei") & (lemma="essere")]
- [(word="sei") & ! (lemma="essere")]
combination of multiple requirements for a single token, including feature values, e.g. any noun in neuter plural:
- [(pos="S") & (feats contains "gen=n") & (feats contains "num=p")];
sequences of tokens (described by word, lemma or pos), e.g. a word with part of speech "PQ" followed by a word of the lemma "andare":
- [pos="PQ"] [lemma="andare"]
[] denotes an arbitrary token, e.g. the word "io", followed by any word, followed by "ringrazio":
- "io" [] "ringrazio"
repetition of a token (possibly described by a pattern) can be expressed by the operators ? (0 or 1 occurrence), * (0 or more), + (1 or more), {n} (exactly n occurrences), {m,n} (between m and n occurrences), e.g. a word of any part of speech starting with "D", followed by zero or more words of part of speech "A", followed by the word "strada":
- [pos="D.*"] [pos="A"]* "strada"
sequences can be structured by grouping them within parentheses interspersed by the disjunction operator (|), e.g. any of the words "quattro", "cinque" or "sei" followed by the word "giorni":
- "(quattro|cinque|sei)" "giorni"
conditions between search tokens can be introduced by using labels (see section 4.1. of the CQP tutorial for more information), e.g. to search for the lemma "mangiare" followed by its direct object with any number of intervening words:
- a:[lemma="mangiare"][]*b:".*"::b.head=a.id & b.deprel="obj" within s;
- (see here to find out how to search for this example in the "Advanced Search" interface)
to retrieve examples with "mangiare" preceding OR following its "obj", take the union of two separately created subcorpora (in this example the object is additionally restricted to words beginning with "c"):
- PREC = a:[lemma="mangiare"][]*b:[word="c.*"]::b.head=a.id & b.deprel="obj" within s;
- FOLL = b:[word="c.*"][]*a:[lemma="mangiare"]::b.head=a.id & b.deprel="obj" within s;
- union PREC FOLL;
several conditions on a particular token can be combined with relations between tokens, e.g. to search for two nouns with one intervening word, where the nouns begin with a vowel and are related to each other by the dependency relation "conjunct linked by con" (conj):
- a:[pos="S" & word="[aeiou].*"][]b:[pos="S" & word="[aeiou].*"]::b.head=a.id & b.deprel="conj" within s;
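
To make the label conditions above more concrete, the following Python sketch mimics on one hand-annotated sentence what the PREC and FOLL queries and their union look for. The token dictionaries, the toy annotation and the helper `find_verb_objects` are illustrative assumptions; CQP itself evaluates such constraints on the indexed corpus.

```python
# Toy sentence; ids, heads and deprels are hand-made for illustration only.
SENTENCE = [
    {"id": 1, "word": "Mario", "lemma": "Mario", "head": 2, "deprel": "subj"},
    {"id": 2, "word": "mangia", "lemma": "mangiare", "head": 0, "deprel": "ROOT"},
    {"id": 3, "word": "spesso", "lemma": "spesso", "head": 2, "deprel": "mod"},
    {"id": 4, "word": "pasta", "lemma": "pasta", "head": 2, "deprel": "obj"},
]


def find_verb_objects(sentence, lemma, deprel="obj"):
    """Return (verb, dependent, order) triples where dependent.head == verb.id.

    order is "verb-first" (like the PREC query) or "object-first" (like FOLL);
    collecting both corresponds to the union of the two subcorpora.
    """
    hits = []
    for a_pos, a in enumerate(sentence):
        if a["lemma"] != lemma:
            continue
        for b_pos, b in enumerate(sentence):
            # Same test as "b.head=a.id & b.deprel=..." in the CQP query.
            if b["head"] == a["id"] and b["deprel"] == deprel:
                order = "verb-first" if a_pos < b_pos else "object-first"
                hits.append((a["word"], b["word"], order))
    return hits


print(find_verb_objects(SENTENCE, "mangiare"))
# [('mangia', 'pasta', 'verb-first')]
```

The adverb "spesso" also has "mangia" as its head, but it is skipped because its dependency relation is not "obj".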

To search for unspecified tokens (e.g. "any word") the regular expression ".*" can be used. To learn more about how to use regular expressions, see here.
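
As a rough illustration of how the quoted patterns above behave, the Python sketch below tests single tokens against a pattern. It assumes that a pattern has to match the whole token (which is why "sopra.*" is needed for words starting with "sopra") and approximates the %c and %d flags. The helper names are hypothetical, and Python's `re` module is only similar to, not identical with, the regular expression engine used by CQP.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. 'perché' -> 'perche'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def token_matches(pattern, token, ignore_case=False, ignore_diacritics=False):
    """Approximate a CQP string test: the pattern must match the whole token."""
    if ignore_diacritics:  # rough analogue of %d
        pattern, token = strip_diacritics(pattern), strip_diacritics(token)
    flags = re.IGNORECASE if ignore_case else 0  # rough analogue of %c
    return re.fullmatch(pattern, token, flags) is not None


print(token_matches("qu(a|i)", "qui"))                              # True
print(token_matches("qu(a|i)", "quindi"))                           # False
print(token_matches("sopra.*", "soprattutto"))                      # True
print(token_matches("Italiano", "ITALIANO", ignore_case=True))      # True
print(token_matches("perche", "perché", ignore_diacritics=True))    # True
```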

Display settings

Using the CQP commands "set" and "show" (see CQP tutorial section 2.3) the user can customize the display of search results.

Examples:

set Context 2 s shows the sentence containing the search hit plus two sentences to the left and two sentences to the right.
show +feats shows feature values for each word (e.g. for "oggetti", "|gen=m|num=p|")

Search results show the words followed by their attributes, separated by slashes.

The current settings are displayed in an area called "Settings", below the search options.
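
For readers who want to post-process copied results, here is a minimal Python sketch that splits a displayed token into the word and its attributes and turns a feats value into a dictionary. It assumes that a token shown with `show +feats` looks like `oggetti/|gen=m|num=p|`; the exact display string and both helper names are assumptions, not part of the interface.

```python
def parse_feats(feats):
    """Turn a feats string such as '|gen=m|num=p|' into a dict."""
    pairs = (item.split("=", 1) for item in feats.split("|") if "=" in item)
    return {key: value for key, value in pairs}


def split_displayed_token(text, attribute_names):
    """Split 'word/attr1/attr2' from the right, so slashes inside the word survive."""
    parts = text.rsplit("/", len(attribute_names))
    word, values = parts[0], parts[1:]
    return {"word": word, **dict(zip(attribute_names, values))}


token = split_displayed_token("oggetti/|gen=m|num=p|", ["feats"])
print(token["word"], parse_feats(token["feats"]))
# oggetti {'gen': 'm', 'num': 'p'}
```

The list passed as `attribute_names` has to mirror the attributes currently switched on with "show", in the order in which they are displayed.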

Word-level attributes

The following word-level attributes are supported:

id: unique identifier of a word within its text
lemma: base form of the word
coapos: coarse part of speech
pos: part of speech
head: identifier of the head word of the dependency relation (unique within the text)
feats: feature values of the word (e.g. case, gender, number)
deprel: dependency relation of the word

Structural attributes

Furthermore, searches can be restricted by structural attributes of the text. The following structural attributes are supported:

text_id: unique identifier of the text
text_url: source URL that the text was taken from
text_tok: number of words (tokens) in text
text_ttr: type-token ratio within text
text_advvoc: number of words (tokens) in text that are non-basic vocabulary
text_sent: number of sentences in text
text_gulpidx: readability index 'indice Gulpease' of the text
s_advvoc: number of words (tokens) in sentence that are non-basic vocabulary

Examples:

"gatto"::match.text_dom="org"; returns only examples from web pages in the ".org" domain.
"casa"::int(match.text_tok)>4551; returns examples from source texts with more than 4551 words.
"casa"::int(match.s_advvoc)=0; returns examples from sentences that contain no advanced vocabulary.

For details on how to use structural attributes in CQP queries, see CQP tutorial section 4.2
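
The three example queries follow one pattern: the query, then "::", then a condition on an attribute of the match. The Python sketch below assembles such query strings. The function `with_structural_filter` is a hypothetical helper; it only covers the two comparison operators that appear in the examples above, and it produces text to paste into the search box rather than talking to the server.

```python
def with_structural_filter(query, attribute, operator, value):
    """Append a '::' constraint on a structural attribute of the match."""
    if operator not in {"=", ">"}:
        # Only the operators shown in the examples are covered in this sketch.
        raise ValueError(f"unsupported operator: {operator}")
    if isinstance(value, int):
        # Numeric comparison, as in the text_tok and s_advvoc examples.
        condition = f"int(match.{attribute}){operator}{value}"
    else:
        # String comparison, as in the text_dom example.
        condition = f'match.{attribute}{operator}"{value}"'
    return f"{query}::{condition};"


print(with_structural_filter('"casa"', "text_tok", ">", 4551))
# "casa"::int(match.text_tok)>4551;
print(with_structural_filter('"gatto"', "text_dom", "=", "org"))
# "gatto"::match.text_dom="org";
```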

Named subcorpora

Search results can be stored in named subcorpora. This is done by typing "NAME = " in front of the query. Names must start with a capital letter and may consist of letters, digits and underscores.

Example:

CAP = "capodanno" expand to s; will save all sentences containing the word "capodanno" to a subcorpus called "CAP"
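
The naming rule can be checked before a query is submitted. The following Python sketch is one possible reading of the rule, assuming that only ASCII letters are allowed; `is_valid_subcorpus_name` and `name_query` are hypothetical helpers and not part of the search interface.

```python
import re

# Capital letter first, then letters, digits or underscores (ASCII assumed).
SUBCORPUS_NAME = re.compile(r"[A-Z][A-Za-z0-9_]*")


def is_valid_subcorpus_name(name):
    """Check a name against the rule stated in the help text."""
    return SUBCORPUS_NAME.fullmatch(name) is not None


def name_query(name, query):
    """Prefix a query with 'NAME = ' so that its results are stored."""
    if not is_valid_subcorpus_name(name):
        raise ValueError(f"invalid subcorpus name: {name!r}")
    return f"{name} = {query};"


print(name_query("CAP", '"capodanno" expand to s'))
# CAP = "capodanno" expand to s;
print(is_valid_subcorpus_name("cap"))
# False
```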

Named subcorpora created by the user show up in the corpus dropdown menu and can be used for subsequent querying. The list of subcorpora is available in the Simple, Advanced and CQP search as well as in the Filter interface. The subcorpus called "Last" always stores the results of the most recent query or filtering carried out by the user. Be aware that only the search match itself is stored in the subcorpus, not its surrounding context, unless the query expands the match (for example with "expand to s").

Examples:

CASA1 = "casa" "di" ".*": The subcorpus CASA1 will only contain instances of the noun pattern "casa di X", for example "casa di legno", "casa di Dio", etc.
CASA2 = "casa" "di" ".*" expand to s: The subcorpus CASA2 will contain all sentences containing a string matching the search pattern, for example "Una vecchia donna, Mara, viene raffigurata in una piccola casa di legno."

Subcorpora can also be defined in the Filter interface. For details see here.

All subcorpora expire after 24 hours or once the browser is closed.

Examples of complex queries

We provide a list of precompiled examples of linguistically-motivated queries. These examples include queries for:

Complex noun phrases, e.g. "la più recente evoluzione"
Different question types, e.g. "Quali sono le sue abitudini?" or "Cosa volete di più?"
Passive constructions, e.g. "In alcuni posti, il puma viene chiamato coguaro, leone di montagna, lince, o gatto dipinto."

By clicking on a query example, the corresponding search is submitted. The CQP search box displays the corresponding query, which can be modified according to the user's needs.

Limitations on CQP search

For security reasons, some general CQP functionality is not available: the count, sort, group, tabulate, dump and reduce commands cannot be used.
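
Before pasting a longer sequence of commands into the search box, it can help to check it for the unavailable commands. The Python sketch below does this in a deliberately simple way: it splits the input at ";" and looks at the first word of each statement. How the server itself detects these commands is not documented here, so `disabled_commands_in` is only a hypothetical pre-check.

```python
DISABLED_COMMANDS = {"count", "sort", "group", "tabulate", "dump", "reduce"}


def disabled_commands_in(text):
    """List disabled commands that start any ';'-separated statement.

    Simplification: a ';' inside a quoted string would be split as well.
    Command names are compared in lower case as a conservative choice.
    """
    found = []
    for statement in text.split(";"):
        words = statement.split()
        if words and words[0].lower() in DISABLED_COMMANDS:
            found.append(words[0].lower())
    return found


print(disabled_commands_in('CAP = "capodanno"; count CAP by word;'))
# ['count']
print(disabled_commands_in('"casa" "di" ".*";'))
# []
```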

Need more help? See here for an overview of our help pages.