The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (the Sejm and the Senate). It is based on the Polish Sejm Corpus, co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.
Corpus files are made available in the TEI P5 format, compatible with the annotation scheme used by the National Corpus of Polish. The resource contains automatically created annotation of:
- utterance-level segmentation,
- disambiguated morphosyntactic description,
- syntactic words,
- syntactic groups,
- named entities.
The Polish Parliamentary Corpus contains:
- Sejm sittings from 1919–present (including the Legislative Sejm and the State National Council)
- Sejm committee sittings from 1993–present
- Sejm interpellations and questions from 1997–present
- Senate sittings from 1922–1939 and 1989–present
- Senate committee sittings from 2015–present
At present you can download the unannotated TEI version with sources (15 GB). A linguistically annotated version will be added soon.
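As a rough illustration of working with the TEI files, the sketch below lists the speakers and utterances found in a TEI document. It assumes that speech is encoded with the standard TEI <u> element carrying a "who" attribute, inside the TEI namespace. The sample document, the speaker identifiers and the function name iter_utterances are invented for illustration, and the actual layout of the corpus files may differ.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def iter_utterances(xml_text):
    """Yield (speaker, text) pairs for every <u> element in a TEI document.

    Hypothetical helper: assumes utterances are plain TEI <u> elements
    with an optional "who" attribute pointing at the speaker.
    """
    root = ET.fromstring(xml_text)
    for u in root.iter(f"{{{TEI_NS}}}u"):
        speaker = u.get("who", "")
        # Join all nested text and collapse runs of whitespace.
        text = " ".join("".join(u.itertext()).split())
        yield speaker, text


# Invented sample, not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#Marszalek">Otwieram posiedzenie.</u>
    <u who="#PoselA">Dziękuję,   panie marszałku.</u>
  </div></body></text>
</TEI>"""

utterances = list(iter_utterances(SAMPLE))
for speaker, text in utterances:
    print(speaker, text, sep="\t")
```

For a file on disk, the same loop can be applied to the root returned by ET.parse(path).getroot(); with an archive of this size it is probably wiser to process one file at a time rather than load everything into memory.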
The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.
Please also see the slides from the CLARIN-PLUS workshop "Working with Parliamentary Records" (Sofia, 27–29 March 2017).
Searching the corpus
Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences