The Polish Parliamentary Corpus / Polski Korpus Parlamentarny
The Polish Parliamentary Corpus (PPC) is planned as a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (Sejm and Senate).
The resource was collated from the following corpora:
Format
Corpus files are made available in TEI P5 format. The resource contains automatically created annotation of:
- utterance-level segmentation,
- tokenization,
- lemmatization,
- disambiguated morphosyntactic description,
- syntactic words,
- syntactic groups,
- named entities.
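As a rough illustration of working with TEI P5 files, the sketch below reads utterances from a small in-memory document using only the Python standard library. The sample XML, the speaker identifiers, and the helper name read_utterances are hypothetical and made up for this example; the actual file layout and element inventory of the corpus may differ, so check the distributed files before relying on it. The only fixed points assumed here are the standard TEI namespace and the TEI `u` (utterance) element with its `who` attribute.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace and the xml:id attribute in Clark notation.
TEI = "{http://www.tei-c.org/ns/1.0}"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical sample; real corpus files may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u xml:id="u-1" who="#Marszalek">Otwieram posiedzenie Sejmu.</u>
        <u xml:id="u-2" who="#PoselKowalski">Dziękuję, panie marszałku.</u>
      </div>
    </body>
  </text>
</TEI>"""


def read_utterances(xml_text):
    """Return a list of (id, speaker, text) tuples, one per <u> element."""
    root = ET.fromstring(xml_text)
    utterances = []
    for u in root.iter(TEI + "u"):
        # itertext() also collects text nested inside child elements.
        text = "".join(u.itertext()).strip()
        utterances.append((u.get(XML_ID), u.get("who"), text))
    return utterances


for utt_id, speaker, text in read_utterances(SAMPLE):
    print(utt_id, speaker, text)
```

For files on disk, `ET.parse(path).getroot()` can replace `ET.fromstring`. The other annotation layers (tokens, lemmas, named entities) would need their own element names, which are not specified here.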
Data
Please use the Polish Sejm Corpus data until a newer version is made available.
Whole corpus
Divided by term and document type
Interpellations and questions, Term 3 (1997–2001), 1.7 GB
Interpellations and questions, Term 4 (2001–05), 2.4 GB
Interpellations and questions, Term 5 (2005–07), 2.0 GB
Interpellations and questions, Term 6 (2007–11), 4.8 GB
Sittings, Term 1 (1991–93), 0.9 GB
Sittings, Term 2 (1993–97), 2.6 GB
Sittings, Term 3 (1997–2001), 2.8 GB
Sittings, Term 4 (2001–05), 3.4 GB
Sittings, Term 5 (2005–07), 1.4 GB
Sittings, Term 6 (2007–11), 3.1 GB
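To plan disk space before downloading, the sizes listed above can be totalled per document type. The following is a minimal sketch that simply transcribes the figures from the list; the names FILES and total_size_by_type are made up for this example and are not part of any corpus tooling.

```python
# (document type, term, size in GB), transcribed from the list above.
FILES = [
    ("Interpellations and questions", 3, 1.7),
    ("Interpellations and questions", 4, 2.4),
    ("Interpellations and questions", 5, 2.0),
    ("Interpellations and questions", 6, 4.8),
    ("Sittings", 1, 0.9),
    ("Sittings", 2, 2.6),
    ("Sittings", 3, 2.8),
    ("Sittings", 4, 3.4),
    ("Sittings", 5, 1.4),
    ("Sittings", 6, 3.1),
]


def total_size_by_type(files):
    """Sum sizes per document type, rounded to one decimal place."""
    totals = {}
    for doc_type, _term, size_gb in files:
        totals[doc_type] = totals.get(doc_type, 0.0) + size_gb
    return {doc_type: round(size, 1) for doc_type, size in totals.items()}


totals = total_size_by_type(FILES)
for doc_type, size in totals.items():
    print(f"{doc_type}: {size} GB")
print(f"All files: {round(sum(totals.values()), 1)} GB")
```

With the listed figures this gives 10.9 GB for interpellations and questions, 14.2 GB for sittings, and 25.1 GB in total.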
Publications
For now, please cite the Polish Sejm Corpus paper:
Searching the corpus
Online search of the corpus will be available soon.