National Corpus of Polish

The National Corpus of Polish (Narodowy Korpus Języka Polskiego, NKJP) is a joint initiative of four institutions: the Institute of Computer Science at the Polish Academy of Sciences (coordinator), the Institute of Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź. It has been carried out as a research and development project of the Ministry of Science and Higher Education.

These four institutions have joined forces to build a reference corpus of the Polish language containing over 1.5 billion words. The sources for the corpus include classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of ephemeral and internet texts. For a corpus to be reliable, it must not only contain a large number of words but also cover a diversity of texts with respect to subject and genre. The recorded conversations should represent both male and female speakers, of various age groups, from various regions of Poland.

Official project website

Manually annotated subcorpus

The manually annotated 1-million-word subcorpus of the NKJP, available under the GNU GPL v3:

NKJP-PodkorpusMilionowy-1.2.tgz -- gzip compression,
NKJP-PodkorpusMilionowy-1.2.tar.bzip2 -- bzip2 compression.
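
Either archive can be inspected without unpacking it first. The following is a minimal sketch, assuming one of the files above has already been downloaded to the current directory; the helper name list_archive_members and the limit parameter are illustrative only and not part of any NKJP tooling. It relies on Python's standard tarfile module, which detects gzip or bzip2 compression on its own.

```python
import tarfile
from pathlib import Path


def list_archive_members(archive_path, limit=10):
    """Return the names of up to `limit` members of a tar archive.

    Hypothetical helper for illustration; not part of the NKJP distribution.
    """
    names = []
    # Mode "r:*" asks tarfile to detect the compression (gzip, bzip2, ...)
    # transparently, so the same call handles both downloads.
    with tarfile.open(archive_path, mode="r:*") as archive:
        for member in archive:
            names.append(member.name)
            if len(names) >= limit:
                break
    return names


if __name__ == "__main__":
    # Assumption: the archive was saved under its original file name.
    path = Path("NKJP-PodkorpusMilionowy-1.2.tgz")
    if path.exists():
        for name in list_archive_members(path):
            print(name)
    else:
        print(f"{path} not found; download it first")
```

Listing only the first few members keeps the check quick on a large archive; the internal layout of the subcorpus is not assumed here and is simply printed as found.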
