
Diff for "PPC"

Differences between revisions 2 and 3
Revision 2 as of 2017-02-28 11:45:47
Size: 1644
Comment:
Revision 3 as of 2017-02-28 11:46:20
Size: 1643
Comment:
Line 3 (revision 2):
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] prepared in October 2011 and co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being developed by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure.

Line 3 (revision 3):
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] prepared in October 2011 and co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being extended by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure.

The Polish Parliamentary Corpus / Polski Korpus Parlamentarny

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament: the Sejm and the Senate. It is based on the Polish Sejm Corpus, which was prepared in October 2011 and co-funded by the CESAR project, and is currently being extended by the CLARIN-PL infrastructure.

Corpus format

Corpus files are available in a TEI P5 format compatible with the annotation scheme used by the National Corpus of Polish. The resource contains automatically generated annotation of:

  • utterance-level segmentation,
  • tokenization,
  • lemmatization,
  • disambiguated morphosyntactic description,
  • syntactic words,
  • syntactic groups,
  • named entities.
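
As a rough illustration of how TEI P5 annotation of this kind might be read, the Python sketch below parses a small hand-made fragment and lists its segments. The sample XML, its xml:id values, and the function name list_segments are hypothetical and not taken from the corpus; only the TEI namespace URI is standard. Actual corpus files may organise the annotation layers differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"          # standard TEI namespace
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace of xml:id

# Hypothetical, hand-made fragment; not copied from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <s xml:id="s1">
        <seg xml:id="seg1">Wysoka</seg>
        <seg xml:id="seg2">Izbo</seg>
        <seg xml:id="seg3">!</seg>
      </s>
    </body>
  </text>
</TEI>
"""


def list_segments(xml_text):
    """Return (sentence id, segment id, segment text) triples in document order."""
    root = ET.fromstring(xml_text)
    id_key = f"{{{XML_NS}}}id"
    rows = []
    for sent in root.iter(f"{{{TEI_NS}}}s"):
        for seg in sent.iter(f"{{{TEI_NS}}}seg"):
            rows.append((sent.get(id_key), seg.get(id_key), seg.text))
    return rows


for row in list_segments(SAMPLE):
    print(row)
```

Higher annotation levels (lemmas, morphosyntactic tags, named entities) would need additional lookups that depend on the actual file layout, which is not described here.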

Corpus data

Publications

For now, please cite the Polish Sejm Corpus paper:

List of publications

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

Searching the corpus

Online search of the full corpus will be available soon. A search tool for a 200M sample of the Sejm data is available at http://sejm.nlp.ipipan.waw.pl/. You can also download the Poliqarp image of this smaller corpus: http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz (826 MB).
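
Once the Poliqarp image archive has been downloaded, it can be unpacked with standard tools. The Python sketch below is one cautious way to do this. The function name extract_archive and the target directory are assumptions, and the internal layout of the archive is not described here, so the code simply returns the names of whatever it extracts.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the member names."""
    dest_root = os.path.realpath(dest_dir)
    os.makedirs(dest_root, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        members = tar.getmembers()
        # Refuse members whose resolved path would fall outside dest_dir.
        for member in members:
            target = os.path.realpath(os.path.join(dest_root, member.name))
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest_root, members=members)
    return [m.name for m in members]
```

A call such as extract_archive("PSC_poliqarp.tar.gz", "psc_poliqarp") would unpack the image into a local directory (the directory name is arbitrary); the result can then be opened with a Poliqarp client.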