Diff for "PPC"

Differences between revisions 43 and 44
Revision 43 as of 2019-06-25 09:04:23
Size: 2381
Revision 44 as of 2019-10-18 10:54:55
Size: 2303
Deletions are marked with a leading "-". Additions are marked with a leading "+".
Line 12: Line 12:
-  * syntactic words,
-  * syntactic groups,
Line 29: Line 27:
+ == Searching the corpus ==

+  * [[http://sejm.nlp.ipipan.waw.pl/|using the search engine]]
+  * [[http://ngram.sejm.nlp.ipipan.waw.pl/|using the ngram viewer]]
Line 37: Line 40:
- == Other information ==
Line 41: Line 42:

- == Searching the corpus ==

-  * [[http://sejm.nlp.ipipan.waw.pl/|using search engine]]
-  * [[http://ngram.sejm.nlp.ipipan.waw.pl/|using ngram viewer]]

- == Contact information ==
+ == Contact ==

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (the Sejm and the Senate). It is based on the Polish Sejm Corpus, co-funded by the CESAR project, and is currently being updated by the CLARIN-PL infrastructure.

Corpus format

Corpus files are made available in the TEI P5 format, compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotations of:

  • utterance-level segmentation,
  • tokenization,
  • lemmatization,
  • disambiguated morphosyntactic description,
  • named entities.
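
The utterance-level segmentation listed above could be read with a few lines of Python. This is a minimal sketch, assuming that utterances are encoded as TEI `<u>` elements carrying `xml:id` and `who` attributes. The sample document, the speaker identifiers and the `read_utterances` helper are hypothetical and not taken from the corpus files, so the element names should be checked against the downloaded data.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace is standard; everything else below is illustrative.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample, NOT copied from the corpus: a tiny TEI document
# with two utterances marked up as <u> elements.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u xml:id="u-1" who="#Marszalek">Otwieram posiedzenie.</u>
        <u xml:id="u-2" who="#PoselA">Dziękuję, panie marszałku.</u>
      </div>
    </body>
  </text>
</TEI>
"""


def read_utterances(xml_text):
    """Return (id, speaker, text) tuples for every <u> element found."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    utterances = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        # Join nested text nodes and normalise whitespace.
        text = " ".join("".join(u.itertext()).split())
        utterances.append((u.get(XML_ID), u.get("who"), text))
    return utterances


for utt_id, speaker, text in read_utterances(SAMPLE):
    print(utt_id, speaker, text)
```

For real corpus files, the same function could be applied to the contents of a file opened with `encoding="utf-8"`. The other annotation layers (tokenization, lemmatization, morphosyntax, named entities) would need their own element names, which this sketch does not assume.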

Corpus data

The Polish Parliamentary Corpus contains:

  • Sejm sittings from 1919 to the present (including the Legislative Sejm and the State National Council),
  • Sejm committee sittings from 1993 to the present,
  • Sejm interpellations and questions from 1997 to the present,
  • Senate sittings from 1922–1939 and from 1989 to the present,
  • Senate committee sittings from 2015 to the present.


At present, you can download the unannotated TEI version with sources (15 GB). The linguistically annotated version will be added soon.

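Searching the corpus

  • using the search engine: http://sejm.nlp.ipipan.waw.pl/
  • using the ngram viewer: http://ngram.sejm.nlp.ipipan.waw.pl/
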
The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.


List of publications

Maciej Ogrodniczuk. Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pages 15–19, Paris, France, 2018. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

See also the slides from the CLARIN-PLUS workshop "Working with Parliamentary Records", Sofia, 27–29 March 2017.


Contact

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences