Revision 16 as of 2018-05-07 06:41:10

PPC

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (the Sejm and the Senate). It is based on the Polish Sejm Corpus, which was co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.

Corpus format

Corpus files are available in TEI P5 format, compatible with the annotation scheme used by the National Corpus of Polish. The resource contains automatically generated annotation at the following levels:

  • utterance-level segmentation,
  • tokenization,
  • lemmatization,
  • disambiguated morphosyntactic description,
  • syntactic words,
  • syntactic groups,
  • named entities.
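
For readers who want to process the annotation programmatically, the following is a minimal sketch of reading token-level information (orthographic form, lemma, tag) from a TEI-style file using the Python standard library. The embedded sample is a hand-written, simplified fragment and not an excerpt from the corpus: the feature names `orth`, `base` and `ctag`, the flat feature-structure layout, and the helper `read_tokens` are assumptions made for illustration. The actual files may nest interpretations more deeply and split layers across several files, so the paths would need adjusting.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Hypothetical, simplified fragment written for this example only.
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>
        <s>
          <seg>
            <fs type="morph">
              <f name="orth"><string>Wysoka</string></f>
              <f name="base"><string>wysoki</string></f>
              <f name="ctag"><symbol value="adj"/></f>
            </fs>
          </seg>
          <seg>
            <fs type="morph">
              <f name="orth"><string>Izbo</string></f>
              <f name="base"><string>izba</string></f>
              <f name="ctag"><symbol value="subst"/></f>
            </fs>
          </seg>
        </s>
      </p>
    </body>
  </text>
</TEI>
"""


def read_tokens(xml_text):
    """Return a list of sentences, each a list of (orth, base, tag) tuples."""
    root = ET.fromstring(xml_text)
    sentences = []
    # Each <s> element is treated as one sentence-level segment.
    for s in root.iter(f"{{{TEI_NS}}}s"):
        tokens = []
        for fs in s.iterfind(".//tei:fs[@type='morph']", NS):
            orth = fs.findtext("tei:f[@name='orth']/tei:string", namespaces=NS)
            base = fs.findtext("tei:f[@name='base']/tei:string", namespaces=NS)
            sym = fs.find("tei:f[@name='ctag']/tei:symbol", NS)
            tag = sym.get("value") if sym is not None else None
            tokens.append((orth, base, tag))
        sentences.append(tokens)
    return sentences


if __name__ == "__main__":
    for sentence in read_tokens(SAMPLE):
        for orth, base, tag in sentence:
            print(orth, base, tag, sep="\t")
```

The same pattern (namespace-aware lookups on feature structures) should carry over to the other layers, such as syntactic groups or named entities, once the real element and feature names are checked against the distributed files.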

Corpus data

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.

Publications

Ogrodniczuk M. (2012). The Polish Sejm Corpus. In: D. Fišer, M. Eskevich, F. de Jong (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2219–2223, European Language Resources Association (ELRA).

Ogrodniczuk M. (2018). Polish Parliamentary Corpus. In: D. Fišer, M. Eskevich, F. de Jong (eds.) Proceedings of the LREC 2018 Workshop “ParlaCLARIN: Creating and Using Parliamentary Corpora”. European Language Resources Association (ELRA). ISBN 979-10-95546-02-3.

Other information

See also the slides from the CLARIN-PLUS Workshop "Working with Parliamentary Records" (Sofia, 27–29 March 2017).

Searching the corpus

Contact information

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences