The Polish Sejm Corpus / Polski Korpus Sejmowy
The Polish Sejm Corpus (PSC) is a large (300M-segment) collection of documents (stenographic transcripts, interpellations and questions) from Polish Sejm sittings of terms of office 1 to 8.
The first edition of the PSC was prepared in October 2011 and was co-funded by the CESAR European project. The current edition is co-funded by CLARIN-PL.
Corpus data
The corpus contains transcripts of Sejm sessions stored in TEI P5 format. The resource includes automatically created annotation of:
- utterance-level segmentation,
- tokenization,
- lemmatization,
- disambiguated morphosyntactic description,
- syntactic words,
- syntactic groups,
- named entities.
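As a rough illustration of reading TEI P5 data, the sketch below collects speaker and text pairs from utterance elements using only the Python standard library. The TEI namespace URI is the standard one; the use of the TEI `u` (utterance) element with a `who` attribute, the sample document, and the helper name `extract_utterances` are assumptions made for illustration and may not match the actual layout of the PSC files.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature TEI document; real PSC files may be structured differently.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u who="#Marszalek">Otwieram posiedzenie Sejmu.</u>
        <u who="#PoselA">Dziękuję, panie marszałku.</u>
      </div>
    </body>
  </text>
</TEI>"""


def extract_utterances(xml_text):
    """Return (speaker, text) pairs for every <u> element in a TEI document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    pairs = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        speaker = u.get("who", "")
        # Join nested text nodes and normalise whitespace.
        text = " ".join("".join(u.itertext()).split())
        pairs.append((speaker, text))
    return pairs


for speaker, text in extract_utterances(SAMPLE):
    print(speaker, text)
```

For a file on disk, the same function could be applied to the file's contents once the actual element names have been checked against the corpus header.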
Whole corpus
Divided by term and document type
|| 3 || 1997–2001 || 2.9 GB || 1.7 GB ||
|| 4 || 2001–05 || 3.4 GB || 2.4 GB ||
|| 5 || 2005–07 || 1.4 GB || 2.0 GB ||
|| 6 || 2007–11 || 3.2 GB || 4.8 GB ||
|| 7 || 2011–15 || 2.7 GB || 8.0 GB ||
|| 8 || 2015– || 0.7 GB || 1.1 GB ||
Publications
Searching the corpus
Online search of the corpus is available at http://sejm.nlp.ipipan.waw.pl/. You can also use the Poliqarp image of the corpus: http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz (826 MB).
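A minimal sketch of fetching and unpacking the Poliqarp image with the Python standard library follows. The URL is the one given above; the helper names `download` and `extract_archive`, the local file names, and the path-safety check are illustrative assumptions, not part of any PSC tooling. The internal layout of the archive is not documented here, so the sketch only lists what it extracts.

```python
import os
import tarfile
import urllib.request

PSC_POLIQARP_URL = "http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz"


def download(url, target_path):
    """Download url to target_path (about 826 MB for the PSC image)."""
    urllib.request.urlretrieve(url, target_path)
    return target_path


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            # Refuse entries that would be written outside dest_dir.
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest_root, members=members)
    return [m.name for m in members]
```

A typical call would be `extract_archive(download(PSC_POLIQARP_URL, "PSC_poliqarp.tar.gz"), "psc_image")`, after creating the hypothetical `psc_image` directory; the extracted image can then be opened with a Poliqarp client.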