The National Corpus of Polish Cheatsheet

Adam Przepiórkowski, Aleksander Buczyński, Jakub Wilk

May 26, 2010

This document contains excerpts from the publication The IPI PAN Corpus: Preliminary version. (This version was modified in March 2006 in order to take into account changes in the 2nd edition of the IPI PAN Corpus and in April 2010 to take into account changes in the National Corpus of Polish.)

1 Segmentation

(Author: Adam Przepiórkowski)

Tags are assigned to segments (tokens, roughly – words). Segments are not longer than orthographic words (‘from space to space’), but sometimes segments are shorter than orthographic words:

The segmentation principles given above lead to the segmentation of 1. (translated into English in 2.) that is presented in 3.

  1. Pojechalibyśmy z Janem M. Rokitą i Janem Nowakiem-Jeziorańskim na sesję polsko-amerykańską, gdyby nas zaprosił George W. Byłaby to nasza już 2. doń podróż od czasów PRL-u, a może i 3., czy nawet 4.
  2. ‘We would go with Jan M. Rokita and Jan Nowak-Jeziorański to the Polish-American session, if we were invited by George W. That would already be our 2nd trip to him since the times of PRL, and perhaps 3rd, or even 4th.’
  3. [Pojechali][by][śmy] [z] [Janem] [M.] [Rokitą] [i] [Janem] [Nowakiem][-][Jeziorańskim] [na] [sesję] [polsko][-][amerykańską][,] [gdyby] [nas] [zaprosił] [George] [W][.] [Była][by] [to] [nasza] [już] [2.] [do][ń] [podróż] [od] [czasów] [PRL-u][,] [a] [może] [i] [3.][,] [czy] [nawet] [4][.]

2 Tagset

(Author: Adam Przepiórkowski)

Each morphosyntactic tag is a sequence of colon-separated values, e.g.: subst:sg:nom:m1 for the segment chłopiec ‘boy’. The first value, e.g., subst, determines the grammatical class (cf. §2.2), while the values that follow it, e.g., sg, nom and m1, are the values of grammatical categories (cf. §2.1) appropriate for that grammatical class.
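The structure of a tag can be illustrated with a minimal Python sketch that splits it into the grammatical class and the category values. The function name split_tag is hypothetical and is not part of Poliqarp or of any corpus tool.

```python
def split_tag(tag: str) -> tuple[str, list[str]]:
    """Split a colon-separated tag into (grammatical class, category values)."""
    # The first value is the grammatical class; the rest are category values.
    grammatical_class, *values = tag.split(":")
    return grammatical_class, values


print(split_tag("subst:sg:nom:m1"))
```

Which categories the values belong to depends on the grammatical class, so the sketch leaves them unlabelled.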

2.1 Grammatical categories

The following table presents the repertoire of grammatical categories used in the National Corpus of Polish:

Number: (2 values)
singular sg oko
plural pl oczy
Case: (7 values)
nominative nom woda
genitive gen wody
dative dat wodzie
accusative acc wodę
instrumental inst wodą
locative loc wodzie
vocative voc wodo
Gender: (5 values)
human masculine (virile) m1 papież, kto, wujostwo
animate masculine m2 baranek, walc, babsztyl
inanimate masculine m3 stół
feminine f stuła
neuter n dziecko, okno, co, skrzypce, spodnie
Person: (3 values)
first pri bredzę, my
second sec bredzisz, wy
third ter bredzi, oni
Degree: (3 values)
positive pos cudny
comparative comp cudniejszy
superlative sup najcudniejszy
Aspect: (2 values)
imperfective imperf iść
perfective perf zajść
Negation: (2 values)
affirmative aff pisanie, czytanego
negative neg niepisanie, nieczytanego
Accentability: (2 values)
accented (strong) akc jego, niego, tobie
non-accented (weak) nakc go, -ń, ci
Post-prepositionality: (2 values)
post-prepositional praep niego, -ń
non-post-prepositional npraep jego, go
Accommodability: (2 values)
agreeing congr dwaj, pięcioma
governing rec dwóch, dwu, pięciorgiem
Agglutination: (2 values)
non-agglutinative nagl niósł
agglutinative agl niosł-
Vocalicity: (2 values)
vocalic wok -em
non-vocalic nwok -m

2.2 Grammatical classes

The scope of traditional parts of speech such as verb, noun, numeral or pronoun is fuzzy and, hence, controversial. For example, are gerundial forms such as picie ‘drinking’ and palenie ‘smoking’ verbs (they have the category of aspect and they are productively related to verbal forms such as pić ‘to drink’ and palić ‘to smoke’), or are they nouns (they decline for case, and they have the lexical category of gender)? Are ordinal numerals such as piąty ‘fifth’ numerals (semantically, they are numerals), or are they adjectives (they have adjectival inflection)? Are adjectival pronouns such as taki ‘such’ pronouns (semantics) or adjectives (inflection)?

Grammatical classes used in the National Corpus of Polish are more precisely delimited and, overall, finer-grained than traditional parts of speech. The classes assumed here are based on the notion of flexeme, narrower than the notion of lexeme.

The following table contains the rough morphosyntactic characteristics of all flexemic classes assumed in the present tagset. One symbol in the table marks a grammatical category that is morphological for a given flexemic class (flexemes belonging to this class normally inflect for that category), while a second symbol marks a category that is lexical (for each flexeme belonging to this class, all forms of that flexeme have the same value of that category, although that value may differ between flexemes, as in the case of the gender of nouns).

number case gender person degree aspect negation accentability post-prep. accom. agglt. vocalicity
noun
depreciative form
main numeral
collective numeral
adjective
ad-adj. adjective
post-prep. adjective
adverb
pronoun (non-3rd person)
pronoun (3rd person)
pronoun siebie
non-past form
future być
agglut. być
l-participle
imperative form
impersonal form
infinitive
adv. contemp. prtcp.
adv. anter. prtcp.
gerund
adj. act. prtcp.
adj. pass. prtcp.
winien-like verb
predicative
preposition
conjunction
particle-adverb
alien (nominal)
alien (other)
unknown form
punctuation

The following table provides the information about base forms for all grammatical classes, as well as the abbreviations of these classes as used in the National Corpus of Polish.

flexeme abbreviation base form example
noun subst singular nominative profesor
depreciative form depr singular nominative form of the corresponding noun profesor
main numeral num inanimate masculine nominative form pięć, dwa
collective numeral numcol inanimate masculine nominative form of the main numeral pięć, dwa
adjective adj singular nominative masculine positive form polski
ad-adjectival adjective adja singular nominative masculine positive form of the adjective polski
post-prepositional adjective adjp singular nominative masculine positive form of the adjective polski
adverb adv positive form dobrze, bardzo
non-3rd person pronoun ppron12 singular nominative ja
3rd-person pronoun ppron3 singular nominative on
pronoun siebie siebie accusative siebie
non-past form fin infinitive czytać
future być bedzie infinitive być
agglutinate być aglt infinitive być
l-participle praet infinitive czytać
imperative impt infinitive czytać
impersonal imps infinitive czytać
infinitive inf infinitive czytać
contemporary adv. participle pcon infinitive czytać
anterior adv. participle pant infinitive czytać
gerund ger infinitive czytać
active adj. participle pact infinitive czytać
passive adj. participle ppas infinitive czytać
winien winien singular masculine form powinien, rad
predicative pred the only form of that flexeme warto
preposition prep the non-vocalic form of that flexeme na, przez, w
conjunction conj the only form of that flexeme oraz
particle-adverb qub the only form of that flexeme nie, -że, się
nominal alien xxs singular nominative form de, l’Hospital
other alien xxx the only form of that flexeme bene
unknown form ign the only form of that flexeme
punctuation interp the only form of that flexeme ;, ., (, ]

3 Query Language

(Authors: Adam Przepiórkowski, Jakub Wilk)

Poliqarp’s query syntax is based on that of the Corpus Query Processor (CQP), perhaps the most popular program of this kind, created at the University of Stuttgart, but it adds a number of features and improvements. The present section describes the syntax of Poliqarp queries and illustrates it with numerous examples.

3.1 Searching for orthographic forms

In the simplest case, a query is just a sequence of segments, e.g.:

There are three segments in the latter query above, corresponding to two words: przyszedłem and rano. In the case of simple queries like the two queries above, Poliqarp attempts to identify those words which might consist of smaller segments and to handle them properly, so also the following queries will give the expected results:

In case of the latter query, Poliqarp will find all occurrences of the three-segment sequence [długo][m] [szedł], interpretable as an adverb (długo ‘long’), an agglutinate (-m ‘be’), and an l-participle (szedł ‘walk, go’), as well as all occurrences of the two-segment sequence [długom] [szedł], where the first segment is interpreted as a dative nominal form (długom ‘debts’), and the second – again, as an l-participle.

By default, queries are interpreted in a case-sensitive manner, so the following queries will produce different results:

In order to find all occurrences of the form przyszedł, regardless of case, the flag /i should be used. Thus, the two queries below will produce the same results, which will in particular contain all results of both queries above.

Both in the graphical version and in the text version of Poliqarp, case sensitivity can be set globally, for a whole query or a series of queries.

Queries may contain standard regular expressions over characters, specified with the help of the following special characters: ?, *, +, ., ,, |, {, }, [, ], (, ), as well as natural numbers; segment specifications containing regular expressions must be enclosed in quotes ". Since the formal introduction of regular expressions lies far outside the scope of the current publication, we will be content with discussing just a few examples, which, nevertheless, should allow the user to understand the syntax and semantics of such regular expressions.

  1. "Ala|Ela"

    the character | introduces the alternative of two expressions, so the query above can be used to find all occurrences of segments of the form Ala or Ela,

  2. "[AE]la"

    square brackets denote the alternative of characters within them, so the query above can be used to find those segments whose first character is A or E, and the following two characters are la, i.e., this query is equivalent to the previous query,

  3. "beza?"

    the question mark signals the optionality of the character or the expression in parentheses which immediately precedes it, so the query above can be used to find all occurrences of the segments bez and beza,

  4. "bez."

    the period denotes any character, so the results of this query will include beza, bezy, bezą, etc., but not bez or bezami,

  5. "bez.?"

    bez, beza, bezy, bezą, etc., but not bezami,

  6. ".z.z."

    5-character segments, where 2nd and 4th characters are z (e.g., czczą and rzezi),

  7. ".z.z..?"

    segments of length 5 or 6, where 2nd and 4th characters are z (e.g., czczą, rzezi and szczyt),

  8. "a*by"

    the asterisk denotes any number of occurrences of the character or the expression in parentheses which immediately precedes it, so this query can be used to find segments beginning with any number of as, followed by by, e.g., by (zero occurrences of a), aby, aaaaby, etc.,

  9. "Ala.*"

    segments beginning with Ala, e.g., Ala and Alabama,

  10. "ala.*"/i

    segments beginning with ala, Ala, aLa, ALA, etc., e.g., Ala, alabaster and ALABAMA,

  11. ".*al+"

    the plus has an interpretation similar to that of the asterisk: it denotes one or more occurrences of the character or the expression in parentheses which immediately precedes it, so this query can be used to find segments ending in al, all, alll, etc., but not in a, e.g., dal, robal and Gall,

  12. "a{1,3}b.*"/i

    an expression of the form {n,m} denotes from n to m occurrences of the character or the expression in parentheses which immediately precedes it; in this case, the query above can be used to find segments beginning with 1 to 3 occurrences of a or A, followed by b or B, and then followed by any sequence of characters, e.g., Aby, aaaby, absolutnie, ABBA,

  13. ".*(la){3,}.*"

    {n,} means at least n occurrences, so this query will help to find segments which contain at least three occurrences of the sequence la in a row, e.g., tralalala, sialalala,

  14. "[bcćdfghjklłmnńprsśtwzźż]{4,}[aąeęioóuy]"/i

    segments consisting of at least 4 consonants followed by exactly 1 vowel, e.g., źdźbła and Chrzczę,

  15. "([bcćdfghjklłmnńprsśtwzźż]{3}[aąeęioóuy]){2,}"/i

    segments consisting of at least two sequences of the type CCCV, where C is a consonant, and V is a vowel, e.g., wszystko, Zdmuchnąwszy and Szmajdziński; {n} means exactly n occurrences,

  16. "([^aąeęioóuy]{3}[aąeęioóuy]){2,}"/i

    as above,

  17. "(pod|na|za)jecha.*"

    segments beginning with podjecha, najecha or zajecha, e.g., podjechał, zajechawszy.

The specifications of segments given above must match complete segments, rather than only their parts, hence the necessity of flanking the sequence (la){3,} in query 13. above with the regular expression .*, matching any sequence of characters (also the empty sequence). The same effect can be achieved with the help of the flag /x, which means that the given specification must be matched by a subsequence of the segment, not necessarily by the complete segment:
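The difference between the default complete-segment matching and the /x flag can be imitated with Python's re module, as in the sketch below. This is only an analogy: Poliqarp's regular expression dialect may differ from Python's in details, and the helper matches_segment and its keyword arguments i and x are hypothetical names standing in for the flags /i and /x.

```python
import re


def matches_segment(pattern: str, segment: str, *, i: bool = False, x: bool = False) -> bool:
    """Imitate a quoted Poliqarp segment specification with Python's re module."""
    flags = re.IGNORECASE if i else 0
    if x:
        # /x: the pattern only has to match somewhere inside the segment
        return re.search(pattern, segment, flags) is not None
    # default: the pattern has to match the complete segment
    return re.fullmatch(pattern, segment, flags) is not None


print(matches_segment("(la){3,}", "tralalala"))          # no flanking .*, no /x
print(matches_segment(".*(la){3,}.*", "tralalala"))      # as in query 13
print(matches_segment("(la){3,}", "tralalala", x=True))  # substring match, as with /x
```

The first call fails because the segment starts with tra, which the bare pattern does not cover; the other two succeed.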

3.2 Searching for base forms

The following query may be used in order to find all forms of the lexeme korpus:

The base attribute is one of many attributes that may be used in a query. The value of this attribute should specify the base form (the lemma), so a query like [base=pisać] can be used to find forms such as pisać ‘write’ (infinitive), piszę (non-past form), pisała (l-participle), piszcie (imperative), pisanie (gerund), pisano (impersonal), pisane (adjectival participle), etc.

Another attribute that may be used in queries is orth. The values of this attribute specify segments, so each of the following pairs contains queries which are equivalent.

On the other hand, the two queries below are not equivalent:

In the first case, Poliqarp will guess that the word przyszedłem may consist of two segments, przyszedł and em, and will expand the query accordingly, as described in §3.1. In contrast, the value of orth is always interpreted as the specification of a single segment.

The values of base and orth may contain regular expressions of the kind described in §3.1 above, e.g.:

3.3 Higher order queries

Queries about segments and about base forms may be combined. For example, the following query may be used to find all occurrences of the segment minę understood as a form of the lexeme mina ‘mine, face’ (and not, say, as a form of the lexeme mijać, ‘to pass’):

A similar effect can be achieved with the help of the following query, about those occurrences of the segment minę which are not interpreted as forms of mijać.

The condition that the base form be different from mijać may also be specified by putting the negation (the exclamation mark) before the name of the attribute, so the query below is equivalent to the query above.

Just as in the propositional calculus, double negation is equivalent to no negation, so the following queries about the segment nie understood as a form of the pronoun on are fully equivalent:

In Poliqarp queries, the operator & plays the role of logical conjunction. The operator dual to & is |, which plays the role of logical disjunction, e.g.:

In order to better understand the difference between the operators & and |, let us compare the effect of the following two queries:

The result of the former query will consist of those segments which simultaneously (conjunction) have the orthographic form minę and are interpreted as a form of the lexeme mina. On the other hand, the result of the latter query will consist of segments which either (disjunction) have the orthographic form minę, regardless of the interpretation of this segment, or are a form of the lexeme mina (e.g., mina, miny, minami). Hence, the latter query should return many more results than the former query.
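The contrast between conjunction and disjunction can be mimicked on a toy list of segments in Python. The data below are invented for illustration, and each segment is reduced to a single (orth, base) pair, which ignores ambiguity.

```python
# Toy corpus: (orthographic form, base form) pairs; invented data.
segments = [
    ("minę", "mina"),
    ("minę", "mijać"),
    ("miny", "mina"),
    ("dom", "dom"),
]

# Conjunction: both conditions must hold for the same segment.
conjunction = [s for s in segments if s[0] == "minę" and s[1] == "mina"]

# Disjunction: it is enough that one of the conditions holds.
disjunction = [s for s in segments if s[0] == "minę" or s[1] == "mina"]

print(len(conjunction), len(disjunction))
```

Every segment satisfying the conjunction also satisfies the disjunction, which is why the second list can never be shorter than the first.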

As the examples above show, specifications of corpus positions, enclosed in square brackets, may contain any number of conditions of the type attribute=value, combined with the operators !, & and |. It is also possible to completely omit any conditions – the query below could be used to find all segments in the corpus.

This trivial specification of corpus positions, matching any segment, may be useful for finding two forms in a certain distance from each other, e.g., two segments separated by two other segments, as in the following query:

The result of this query will include sequences such as się nikogo nie bać, się Boga nie boicie, etc.

It would perhaps be more interesting to specify an upper limit on the number of segments which may intervene between two forms, not just the exact number of such intervening positions. Poliqarp makes such queries possible, as it also allows regular expressions over corpus positions. For example, the following query may be used to find a form of the lexeme bać occurring two, three or four positions after the segment się:

The result of this query will contain all the sequences found by the previous query, as well as sequences such as się każdy następny Rywin będzie bał.

A more accurate query concerning various occurrences of the inherently reflexive verb bać się should find się within a certain window before a form of the lexeme bać, but without any intervening punctuation (intervening punctuation will often indicate clause boundary), or immediately after a form of bać, separated from that form by at most a single personal pronoun:

3.4 Searching for tags

The rather baroque query above can be simplified by replacing the condition orth!="[.!?,:]" with a direct reference to the ‘grammatical class’ interp:

In general, the values of the pos attribute are the abbreviations of names of grammatical classes discussed in §2.2 (cf. the table in 2.2). For example, a query about a sequence of two nominal forms beginning with an a may be formulated as follows:

The specifications of the values of pos may, just as in case of orth and base, contain regular expressions. For example, taking into account the fact that personal pronouns are split between the class of 3rd person pronouns ppron3 and non-3rd person pronouns ppron12, the following queries may be used to find any form of any personal pronoun:

That means that the query about bać się may be further simplified:

Apart from the specifications of segments (with the help of orth), base forms (base) and grammatical classes (pos), queries may contain specifications of particular grammatical categories, such as case or gender. The following attributes may be used to this end (cf. §2.1):

attribute possible values
number sg pl
case nom gen dat acc inst loc voc
gender m1 m2 m3 f n
person pri sec ter
degree pos comp sup
aspect imperf perf
negation aff neg
accentability akc nakc
post-prepositionality npraep praep
accommodability congr rec
agglutination agl nagl
vocalicity nwok wok

Hence, it is possible to pose the following queries:

  1. [number=sg]

    find singular forms

  2. [pos=subst & number=sg]

    find singular nominal forms

  3. [pos=subst & gender!=f]

    find masculine and neuter nominal forms

  4. [number=sg & case="nom|acc" & gender="m[123]"]

    find singular masculine forms in the nominative or in the accusative case

The following three-letter abbreviations may be used instead of the full names of the attributes:

attribute abbreviation
number nmb
case cas
gender gnd
person per
degree deg
aspect asp
negation neg
accommodability acm
accentability acn
post-prepositionality ppr
agglutination agg
vocalicity vcl

For example, the query below is equivalent to 4. above:

In the graphical and text versions of Poliqarp, it is possible to define so-called aliases, i.e., abbreviations for alternative values of a given attribute, which may themselves be used as if they were possible values of attributes. The current version of the National Corpus of Polish has four such aliases already pre-defined:

alias definition
masc m1 m2 m3
noun subst depr ger xxs ppron12 ppron3
pron ppron12 ppron3 siebie
verb fin praet aglt bedzie inf imps impt pact ppas pcon pant ger winien

With the definitions of the aliases noun and masc given above, the following two queries are equivalent:

The values of grammatical classes and categories may be specified jointly, with the use of the tag attribute. For example, the following query may be used to find singular nominative neuter nouns:

The values of the tag attribute have the form kl:kat1:kat2:...:katn, where kl is the name of a grammatical class and each of kat1, ..., katn is the value of a grammatical category appropriate for that class, in the order specified in the table in §2.2.

Just as in case of other attributes, also the specification of the value of tag may contain regular expressions, e.g.:

Specification of grammatical classes and grammatical categories may contain variables (having the form $n, where n is a single digit), whose values will be set only during execution of the query. For example, the following query for an adjective and a following noun agreeing in case:

can be simplified to:

3.5 Ambiguities

One of the features that distinguish the National Corpus of Polish and Poliqarp from other corpora and search tools is the representation and processing of ambiguities. There are cases where it is impossible to tell which of a number of interpretations is the right one, as in 1. below.

  1. Pamiętam ją pijaną.
    remember.1st her.acc drunk.acc/ins

    ‘I remember her drunk.’

Since it is impossible to resolve the grammatical case of pijaną in 1., both interpretations, accusative and instrumental, should be marked in the corpus as correct in this context.

However, given that after disambiguation a single segment may contain more than one interpretation, the question arises whether such ambiguous segments, e.g., pijaną in 1., should be included in the result of a query which matches only some of these interpretations, e.g., in the result of the query [case=acc]. On the one hand, the segment pijaną should be included in the result of [case=acc], as accusative is one of the correct interpretations of this segment in this context, but on the other hand, this segment should not be included, as it is not absolutely certain that this is an accusative form.

Instead of choosing between these interpretations of a query like [case=acc], Poliqarp allows the user to pose both kinds of queries. When a single equality sign is used, as in [case=acc], all segments with at least one interpretation matching the given condition will be returned, so both ją and pijaną in 1. will be included in the result of this query. On the other hand, when two equality signs are used, as in [case==acc], only those segments will be returned all of whose interpretations satisfy the condition expressed with ==, i.e., in 1., only the form ją will match the query.

With this distinction in hand, it is possible to search for forms which, e.g., may in a given context be interpreted as either accusative or genitive, so – given a properly tagged corpus – the following query should give non-empty results.

Conversely, the query below matches those segments all of whose interpretations in the given context are at the same time accusative and genitive, so it will necessarily produce empty results.

The queries above pertain to interpretations which are the result of morphosyntactic disambiguation. The National Corpus of Polish also contains all other interpretations assigned to a given segment by the morphological analyser. In some situations it is useful to have access to such interpretations rejected by the disambiguator, e.g., for the task of finding all syncretic forms of a certain kind in the corpus, or when investigating disambiguation errors. For example, in order to find all syncretic accusative/genitive forms in the corpus, regardless of their interpretation in contexts in which they occur, the following query may be posed:

The final equality operator available in Poliqarp queries is ~~. The following query may be used for finding those forms which are unambiguously accusative, again, regardless of the context in which they occur.

The table below summarises the four equality operators put at the user’s disposal in Poliqarp.

                             in the results of        in the results of
                             morphological analysis   disambiguation
at least one interpretation  ~                        =
each interpretation          ~~                       ==
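The four operators can be modelled in a few lines of Python. This is a sketch under stated assumptions, not Poliqarp's implementation: each segment is reduced to two sets of case values, and the interpretations kept by the disambiguator are assumed to form a non-empty subset of those proposed by the morphological analyser. The class Segment, the helper functions and the example values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    orth: str
    analysed: frozenset       # case values over all interpretations from the analyser
    disambiguated: frozenset  # values kept after disambiguation (assumed non-empty subset)


def at_least_one(values, wanted):
    return wanted in values


def each(values, wanted):
    return all(v == wanted for v in values)


# One entry per operator, mirroring the rows and columns of the table.
OPERATORS = {
    "~": lambda seg, v: at_least_one(seg.analysed, v),
    "~~": lambda seg, v: each(seg.analysed, v),
    "=": lambda seg, v: at_least_one(seg.disambiguated, v),
    "==": lambda seg, v: each(seg.disambiguated, v),
}

# Illustrative values only: an unambiguous form and a form left ambiguous.
ja = Segment("ją", frozenset({"acc"}), frozenset({"acc"}))
pijana = Segment("pijaną", frozenset({"acc", "inst"}), frozenset({"acc", "inst"}))

for op in ("~", "~~", "=", "=="):
    print(op, [s.orth for s in (ja, pijana) if OPERATORS[op](s, "acc")])
```

Under the subset assumption, a segment matched by ~~ is also matched by ==, one matched by == is also matched by =, and one matched by = is also matched by ~.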

It should be clear that the following implications hold:

3.6 Constraining matches to sentences or paragraphs

Texts contained in the National Corpus of Polish are divided into sentences and paragraphs. This information may be taken into account in queries, in order to constrain a query to a sentence or a paragraph, as in the query below, which may be used to find the form się separated from a form of the verb bać by any positive number of (non-się) segments, but within a sentence.

Similarly, the qualifier ‘within p’ constrains the scope of a query to a paragraph.

3.7 Constraining matches with metadata

Each text in the National Corpus of Polish comes with a set of data about that text, such as its title and author, publisher, date of publication, etc. Some of these metadata are accessible through Poliqarp and may be used to constrain the scope of a query, e.g., to texts by a given author or published between certain dates.

The following meta-attributes are available in the 3rd demo of the National Corpus of Polish:

Usually only some of these attributes will have a value defined, e.g., when only the date of the publication is known, not the date of the first publication or the date of origin, or in case of short newspaper notes, which might lack information about the author or even the title.

In order to constrain the scope of a query with metadata, the keyword meta should be placed at the end of the query and it should be followed by specifications of values of meta-attributes. In case the scope of the query is also constrained to a sentence or to a paragraph, the specification of metadata should follow the structural constraint, e.g.:

Just as in case of ordinary attributes such as orth or pos, also the specifications of values of the meta-attributes author and title may contain regular expressions. For example, the query below may be used to find forms of the lexeme wirus in those texts whose title contains one of the sequences: windows or microsoft.

By default, the specifications of values of author and title are taken to be case-insensitive and they are interpreted as matching (at least) parts of values of appropriate meta-attributes, so the following query will find sequences of nominal forms in works by, inter alia, Pol, Polkowski and Rampolski:

To change that default behaviour, the flags /X and /I may be used. The effect of these flags is dual to the effect of the flags /x and /i described above: the effect of /X is that a given specification of the value of an attribute is understood as matching the complete value of that attribute, while the flag /I enforces the case-sensitive interpretation, as in examples below:

Regular expressions are not allowed in case of the date-valued attributes created, acquired, recorded, first_published and published. On the other hand, it is possible to use the lesser/greater signs < and >, e.g.:

Constraints on meta-attributes may be combined with the operators &, | and !, e.g.:

Former demos of the National Corpus of Polish and editions of the IPI PAN Corpus use different metadata schemes. Please refer to The IPI PAN Corpus Cheatsheet for details.

3.8 Aligning matches

In order to make the results of a query more readable, it is possible to place within the query proper, i.e., before the qualifiers within and meta, a special alignment marker, ^, as in:

Instead of the usual three columns containing the left context of the match, the match itself, and the right context, the results of this query will be split into four columns, containing, respectively, the left context, the left match, i.e., the sequence of segments matching the part of the query before the alignment marker ^ (here, a non-empty sequence of nominative adjectives), the right match (here, a non-empty sequence of nominative nouns), and the right context.

4 Statistical queries

(Author: Aleksander Buczyński)

Statistical queries are supported only by the graphical version of Poliqarp. They are not available in the web interface.

Originally, Poliqarp was designed as a concordancer, responding to every query with a list of matches with contexts of selected width. This provides the user with examples of usage of specific constructions, but one can imagine many corpus-related problems for which browsing through hundreds of occurrences is neither convenient nor efficient. The statistical extension (currently available only for the stand-alone version of Poliqarp) introduces the possibility to easily find answers to questions like:

The extension also provides several statistical measures for collocation detection, and for investigating correlations between individual attributes.

4.1 Syntax synopsis

A statistical query has the following syntax (square brackets denote optional parts):

<pattern> group by <attr list> [; <attr list>] [interp <method>] [sort <order>] [min <cmin>] [count <nmax>]

where:

<pattern> is a Poliqarp query as described in previous chapter; only segment sequences matching <pattern> will be taken into account in the statistics;

<attr list> is a list of attribute specifications (for example base or 2.case), separated by commas; each attribute specification consists of an optional segment specification (for example 2. or -1.), and an obligatory attribute name (for example base or case); see 4.3 for details;

<method> is an interpretation selection method (random or combine), as described in 4.4;

<order> is a sorting order, as described in 4.6 (simple queries) and 4.8 (queries with partial grouping);

<cmin> is a minimum frequency threshold; only results which occurred at least <cmin> times in the matches should be displayed;

<nmax> is a number (or all) of samples selected from the results to create the statistics; see 4.5 for explanation.

4.2 Simple statistical queries

The pattern matches can be grouped according to a set of segment attributes, specified in grouping rules. The simplest grouping rule consists of one attribute name. For example, to find the frequencies of the forms of the word woda (water), one could write:

[base=woda] group by orth

The result of the query is a table. Each different value of the specified attribute (orth) encountered in the matches of the first part of the query ([base=woda]) corresponds to one row in the results. Each row displays a value of the specified attribute, and the number of matches that contain this particular value.

It is possible to include more attributes in grouping rules, separated by commas, e.g.:

[base=woda] group by orth, number, case

In the results of this query, each row corresponds to a unique combination of values of the specified attributes (woda sg nom, wody pl nom, wody sg gen, etc.). This takes into account the distinction between homonymous forms (wody may be sg gen, pl nom, etc.).
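The grouping itself amounts to counting distinct combinations of attribute values. The Python sketch below illustrates the idea on invented data; the function group_by and the dictionary layout of a match are hypothetical and say nothing about how Poliqarp stores its results.

```python
from collections import Counter

# Invented single-segment matches of [base=woda], one dictionary per match.
matches = [
    {"orth": "woda", "number": "sg", "case": "nom"},
    {"orth": "wody", "number": "sg", "case": "gen"},
    {"orth": "wody", "number": "pl", "case": "nom"},
    {"orth": "wody", "number": "sg", "case": "gen"},
]


def group_by(matches, attrs):
    """Count matches per unique combination of the listed attributes."""
    return Counter(tuple(m[a] for a in attrs) for m in matches)


# Grouping by orth alone merges homonymous forms ...
print(group_by(matches, ["orth"]))
# ... while adding number and case keeps them apart.
print(group_by(matches, ["orth", "number", "case"]))
```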

4.3 Multiword patterns

Patterns can return matches longer than a single segment. To specify the segment whose attribute will be used for grouping, one should add the segment number (with a dot) before the name of the attribute. For example, to find all verbs occurring immediately after the word woda (water), one can type:

[base=woda][pos=verb] group by 2.base

Specification 2.base refers to the base form of the second segment of the match (the verb after woda). Note that subsequent numbers refer to the subsequent segments of the match, not segment specifications in the query. For example, to find the frequency distribution of three subsequent adverbs, the user should type (1. can be skipped):

[pos=adv]{3} group by 1.base, 2.base, 3.base

To make it possible to address segments in matches of possibly variable length, negative numbers can be used as segment specifications. Such numbers mean counting from the end of the match. For example, to allow an optional adverb between woda and the verb, the query should be modified as follows:

[base=woda][pos=adv]?[pos=verb] group by -1.base

The specification -1.base refers to the base form of the last segment of the match. Similarly, -2. would refer to the second-to-last segment, -3. to the third-to-last, etc.
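The addressing scheme (positive numbers counted from the start of the match, negative numbers from its end) can be sketched in Python as follows. The helper segment_attribute and the two sample matches are hypothetical; the sketch only mirrors the numbering convention described in this section.

```python
def segment_attribute(match, position, attribute):
    """Return an attribute of the segment at a 1-based or negative position."""
    if position == 0:
        raise ValueError("positions start at 1, or at -1 counting from the end")
    # Positive positions are 1-based; negative ones count from the end.
    index = position - 1 if position > 0 else position
    return match[index][attribute]

# Invented matches of [base=woda][pos=adv]?[pos=verb],
# one without and one with the optional adverb.
without_adverb = [{"base": "woda"}, {"base": "płynąć"}]
with_adverb = [{"base": "woda"}, {"base": "szybko"}, {"base": "płynąć"}]

# -1.base always addresses the verb, whatever the match length.
last_short = segment_attribute(without_adverb, -1, "base")
last_long = segment_attribute(with_adverb, -1, "base")

# 2.base addresses the verb in one match and the adverb in the other.
second_short = segment_attribute(without_adverb, 2, "base")
second_long = segment_attribute(with_adverb, 2, "base")
```

This shows why 2.base would be unsuitable for the query with the optional adverb, while -1.base picks out the verb in both cases.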

4.4 Ambiguities

By default, one random interpretation of each segment is chosen for grouping. If interp combine is added after an attribute specification, however, the value of the attribute is calculated as a concatenation of all the unique values of the attribute across all the interpretations, separated by a vertical bar. For example, to find all combinations of interpretations for word forms that may be forms of the verb mieć, one could write:

[base~mieć] group by base interp combine

The results of such a query will include word pairs like mama/mieć (mom/to have), mieć/mienie (to have/property), maić/mieć (to decorate with leaves and flowers/to have), mielić/mieć (to grind/to have), etc. The results will vary significantly depending on the value of the “Show only disambiguated results” option in Poliqarp configuration; if it is checked, the interpretations discarded by the tagger will not appear in the results.
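A minimal Python sketch of the interp combine rule, as described above, is given here. The function name combine_interpretations and the sample interpretations are hypothetical, and the alphabetical ordering of the combined values is an assumption made only to keep the output deterministic.

```python
def combine_interpretations(interpretations, attribute):
    """Join the unique values of an attribute across interpretations with '|'."""
    # Assumption: unique values are sorted; the source does not specify an order.
    unique = sorted({interp[attribute] for interp in interpretations})
    return "|".join(unique)

# Invented interpretations of one ambiguous segment (illustration only).
interpretations = [
    {"base": "mieć", "pos": "fin"},
    {"base": "mama", "pos": "subst"},
    {"base": "mieć", "pos": "fin"},
]

combined_base = combine_interpretations(interpretations, "base")
```

Repeated values are collapsed, so a segment whose interpretations all share one base form yields just that base form.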

4.5 Sample size

To return results quickly, the statistics are by default calculated on the basis of 1000 randomly chosen samples from the pattern matches. If a more precise number is needed, the sample size can be adjusted with the keyword count, for example:

[base=woda] group by orth count 5000

To count all pattern matches simply use count all, for example:

[base=woda] group by orth count all

Note: as of version 1.2 there is a hardcoded limit of 500 000 matches. If you really need a bigger sample, you can split your query into several more specific ones (for example, divide it by meta style).
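The effect of the count keyword can be approximated with the Python sketch below. It is only an illustration under stated assumptions: the function grouped_counts is hypothetical, and uniform sampling without replacement is assumed, since the exact sampling procedure used by Poliqarp is not described here.

```python
import random
from collections import Counter

def grouped_counts(values, count=1000, seed=None):
    """Count values in a random sample of `count` matches ('all' = every match)."""
    values = list(values)
    if count == "all" or count >= len(values):
        sample = values
    else:
        # Assumption: uniform sampling without replacement.
        sample = random.Random(seed).sample(values, count)
    return Counter(sample)

# Invented orth values of 10 000 matches of [base=woda].
orths = ["woda"] * 6000 + ["wody"] * 4000

default_table = grouped_counts(orths, seed=0)             # like the default
larger_table = grouped_counts(orths, count=5000, seed=0)  # like "count 5000"
full_table = grouped_counts(orths, count="all")           # like "count all"
```

The row counts in the sampled tables add up to the sample size, not to the total number of matches, so they should be read as estimates of relative frequency.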

4.6 Results sorting and selection

Results can be sorted in alphabetical order (sort a fronte) or by frequency (sort by freq). If partial grouping is used, the results can also be sorted according to a collocation function; see 4.8 for details.

Result selection is currently limited to a frequency threshold (min n).

4.7 Partial grouping

It is quite easy to find the most frequent bigrams using only the basic syntax:

[][] group by 1.base, 2.base sort by freq

However, such results are often insufficient. For example, in collocation detection, not only the bigram frequency, but also the frequencies of its constituents have to be taken into account. Therefore, a special separator — semicolon “;” — has been introduced, which makes it possible to split the grouping rules into two parts. For example:

[][] group by 1.base; 2.base sort by freq

will cause the program to group the results in three ways: by 1.base (the part before the semicolon), by 2.base (the part after the semicolon), and by 1.base, 2.base (both parts together). For each row of the last grouping, the results of the two partial groupings are also displayed. The sort and min modifiers are always applied to the last grouping.

Each of the grouping parts may include more than one attribute, but the grouping may have no more than two parts. In other words: the grouping rules can include any number of commas, but not more than one semicolon.
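The three groupings produced by a rule with a semicolon can be sketched in Python as follows. The function partial_grouping and the sample bigrams are hypothetical; the sketch only shows how each row of the combined grouping can carry the counts of both partial groupings.

```python
from collections import Counter

def partial_grouping(matches, first, second):
    """Return rows (w1, w2, c(w1w2), c(w1), c(w2)) sorted by joint frequency."""
    c1 = Counter(first(m) for m in matches)                 # part before ';'
    c2 = Counter(second(m) for m in matches)                # part after ';'
    c12 = Counter((first(m), second(m)) for m in matches)   # both parts
    return [(w1, w2, n, c1[w1], c2[w2]) for (w1, w2), n in c12.most_common()]

# Invented two-segment matches, given as pairs of base forms.
bigrams = [
    ("zimny", "woda"),
    ("zimny", "woda"),
    ("zimny", "dzień"),
    ("czysty", "woda"),
]

# Analogue of "group by 1.base; 2.base sort by freq".
rows = partial_grouping(bigrams, lambda m: m[0], lambda m: m[1])
```

Each row holds the joint count together with the two partial counts, which are the quantities needed by the collocation functions.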

The syntax does not necessarily have to be used for bigrams. Different grouping parts may even include references to the same segment, for example:

[base=woda] group by number; case

4.8 Collocation functions

A few dependency measures for statistical detection of collocations have been added as possible sorting parameters (for example sort by cp or sort by dice) in queries with partial grouping (see 4.7). All the currently implemented measures are functions of the following parameters:

c(w1) — number of occurrences of w1, where w1 is the combination of values of the attributes defined by the first part of the grouping rules;

c(w2) — number of occurrences of w2, where w2 is the combination of values of the attributes defined by the second part of the grouping rules;

c(w1w2) — number of occurrences of the combination of w1 and w2.

The functions are:

Because dependency measures tend to favor rare bigrams, a minimum frequency threshold is recommended, for example:

[pos!=interp]{2} group by base; 2.base sort by scp min 2

An alternative approach is to add some frequency bias to the dependency test value, for example:

[pos!=interp]{2} group by base; 2.base sort by scp bias 0.5

bias b means “before sorting, multiply the function results by power b of frequency”:

$x \text{ bias } b = x \cdot c(w_1 w_2)^{b}$

For example:

$\text{scp bias } 0.5 = \dfrac{c(w_1 w_2)^{2.5}}{c(w_1)\, c(w_2)}$

Of course, bias and min keywords can be combined, for example:

[]{2} group by base; 2.base sort by scp bias 0.5 min 2

For consistency, the minimum threshold is still applied to the bare frequency, not the biased collocation function.
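The interplay of bias and min can be sketched in Python as follows. This is an illustration under stated assumptions, not Poliqarp's implementation: the unbiased scp is taken to be c(w1w2)^2 / (c(w1) c(w2)), as implied by the bias 0.5 example above, min n is assumed to keep rows with c(w1w2) >= n, and all function names and counts are hypothetical.

```python
def scp(c1, c2, c12):
    """Assumed unbiased scp: c(w1w2)^2 / (c(w1) * c(w2))."""
    return c12 ** 2 / (c1 * c2)

def biased(value, c12, bias=0.0):
    """Multiply a function value by the joint frequency raised to power `bias`."""
    return value * c12 ** bias

def rank(rows, bias=0.0, min_freq=1):
    """Filter by bare joint frequency, then sort by the biased scp value."""
    # The threshold is applied to c(w1w2) itself, not to the biased value.
    kept = [r for r in rows if r["c12"] >= min_freq]
    return sorted(
        kept,
        key=lambda r: biased(scp(r["c1"], r["c2"], r["c12"]), r["c12"], bias),
        reverse=True,
    )

# Invented counts: a pair seen once and a pair seen 90 times.
rows = [
    {"pair": "rare", "c1": 1, "c2": 1, "c12": 1},
    {"pair": "frequent", "c1": 120, "c2": 100, "c12": 90},
]

plain = rank(rows)                  # like "sort by scp"
with_bias = rank(rows, bias=0.5)    # like "sort by scp bias 0.5"
with_min = rank(rows, min_freq=2)   # like "sort by scp min 2"
```

With these invented counts the pair seen once ranks first under plain scp, while either the bias or the threshold brings the frequent pair to the top.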