
PL196x

Polish language of the 1960s

This page is dedicated to the corpus of the frequency dictionary of contemporary Polish. The corpus was originally compiled in order to create a general frequency dictionary of contemporary Polish. Work started in 1967. Partial results were published between 1972 and 1977, and the completed dictionary in 1990. The corpus was later augmented in various respects, both by manual editing and by automated procedures.

The corpus data contain 10,000 samples divided into 5 parts: essays, news, scientific texts, fiction and plays. Every sample is approximately 50 words long, comes from a text published between 1963 and 1967, and includes a bibliographic description of its source. Each word is tagged with its base form and some morphological properties. Sentence boundaries are also marked.
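The structure described above can be pictured as a small data model. The following Python sketch is only an illustration: the class names, field names and the `approximate_corpus_size` helper are hypothetical and do not come from the corpus files, and the example sentence and its tags are made up for demonstration.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical names throughout; the corpus itself does not define these classes.
STYLES = ("essays", "news", "scientific texts", "fiction", "plays")


@dataclass
class Token:
    orth: str  # word form as it appears in the text
    base: str  # base form of the word
    tags: str  # morphological properties, in whatever tagset the file uses


@dataclass
class Sample:
    style: str   # one of STYLES
    source: str  # bibliographic description of the source text
    # Sentence boundaries are kept by storing each sentence as its own list.
    sentences: List[List[Token]] = field(default_factory=list)

    def word_count(self) -> int:
        """Number of tokens in the sample, summed over all sentences."""
        return sum(len(sentence) for sentence in self.sentences)


def approximate_corpus_size(n_samples: int = 10_000, words_per_sample: int = 50) -> int:
    """Rough corpus size implied by the sample count and the typical sample length."""
    return n_samples * words_per_sample


# Made-up example sentence ("Ala ma kota"); the tags are illustrative only.
example = Sample(
    style="fiction",
    source="(bibliographic description goes here)",
    sentences=[[
        Token("Ala", "Ala", "subst:sg:nom:f"),
        Token("ma", "mieć", "fin:sg:ter:imperf"),
        Token("kota", "kot", "subst:sg:acc:m2"),
    ]],
)
```

With the figures given above, `approximate_corpus_size()` yields about 500,000 words for the whole corpus; the actual count differs slightly because sample lengths are only approximately 50 words.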

In 2001 the corpus authors agreed to publish the data on the Internet under the GNU licence. This site presents the corpus data in base and extended (enhanced) versions, as well as additional materials and corpus documentation.

The corpus was manually annotated using the IPI PAN tagset. Each sentence was annotated by a single annotator, and there was no superannotation stage.

In 2012, under the ICT-PSP project CESAR, the corpus annotated with the IPI PAN tagset was automatically converted to annotation adhering to the NKJP tagset. The output of this automatic conversion was then used in a simulation of the NKJP manual annotation procedure, where it served as the data from one of the two annotators. The second annotator was a human, who annotated each sentence from scratch. As in NKJP manual annotation, the human annotator inspected the interpretations of the segments in a given sentence that were generated automatically by the morphological analyser, and could either choose one of these interpretations or enter a different one manually. If the human annotator selected a different annotation than the one resulting from automatic conversion, the sentence was returned to the human annotator for inspection. Without knowing the result of the automatic conversion, the human annotator could either change their earlier annotation or keep it. If conflicting interpretations remained after this stage, the superannotator decided on the correct interpretation.
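The three-stage agreement procedure described above can be sketched in code. This is a simplified illustration, not the tooling used in the project: the function name and its parameters are hypothetical, and it resolves one interpretation string at a time, whereas the description above speaks of whole sentences being returned for inspection.

```python
from typing import Callable


def resolve_interpretation(
    converted: str,
    human_choice: str,
    reinspect: Callable[[str], str],
    superannotate: Callable[[str, str], str],
) -> str:
    """Return the final interpretation of one segment.

    converted     -- result of the automatic IPI PAN -> NKJP conversion
    human_choice  -- interpretation chosen by the human annotator
    reinspect     -- hypothetical callback: the human's second look; it receives
                     only the human's own earlier choice, never `converted`
    superannotate -- hypothetical callback: final decision between the two
    """
    # Stage 1: the two "annotators" agree, so nothing needs to be resolved.
    if human_choice == converted:
        return human_choice

    # Stage 2: the human re-inspects without seeing the converted result
    # and may either keep or change the earlier annotation.
    revised = reinspect(human_choice)
    if revised == converted:
        return revised

    # Stage 3: the conflict persists, so the superannotator decides.
    return superannotate(converted, revised)
```

Passing the second look and the superannotator's decision as callbacks keeps the sketch independent of any particular annotation interface. The `reinspect` callback deliberately takes only the human's earlier choice, mirroring the statement that the annotator does not know the result of the automatic conversion.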

NOTE: If you are looking for more recent frequency data, please take a look at NKJP ngrams.

Corpus documentation

Selected bibliography

Corpus licence

Corpus data

Current version

The current version of the corpus features:

  • manually updated annotation
  • TEI P5 data format
  • both annotation and data format are compatible with the NKJP corpus

Downloads:

Original version

| Cluster | samples | "Raw" | Enhanced, without codes | Enhanced, with codes | TEI P4 XML version |
|---|---|---|---|---|---|
| Style A: Scientific texts | 1 MB | 1.5 MB | 1.1 MB | 4.0 MB | 10 MB |
| Style B: News | 1 MB | 1.5 MB | 1.2 MB | 3.9 MB | 9 MB |
| Style C: Essays | 1 MB | 1.5 MB | 1.2 MB | 4.0 MB | 10 MB |
| Style D: Fiction | 1 MB | 1.5 MB | 1.1 MB | 4.1 MB | 11 MB |
| Style E: Plays | 1 MB | 1.5 MB | 1.1 MB | 4.4 MB | 12 MB |

Auxiliary files for the TEI P4-encoded XML version:

ISO image of the CD-ROM with most of the materials.

Concordances