Diff for "PPC"

Differences between revisions 2 and 19 (spanning 17 versions)
Revision 2 as of 2017-02-28 11:45:47
Size: 1644
Comment:
Revision 19 as of 2018-05-07 06:44:09
Size: 2983
Comment:
Line 1: Line 1:
= The Polish Parliamentary Corpus / Polski Korpus Parlamentarny = = The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego =
Line 3: Line 3:
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] prepared in October 2011 and co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being developed by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure. The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being updated by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure.
Line 17: Line 17:
   * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Sejm: Interpellations and questions terms 3-8, sittings terms 1-8]], 37.2 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar|Senate: Sittings terms 2-9]], 5.2 GB
Line 18: Line 21:
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Interpellations and questions terms 3-8, sittings terms 1-8]], 37.2 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar|Sittings terms 2-9]], 5.2 GB
== Licence ==
Line 21: Line 23:
The parliamentary data is public domain. The corpus annotations are available on CC-BY (attribution) licence.
Line 24: Line 27:
So far please cite the Polish Sejm Corpus paper:
Line 26: Line 28:
<<BibMate(key,"ogro:12:lrec",omitYears=true)>>  * Ogrodniczuk M. (2012). ''The Polish Sejm Corpus''. In: D. Fišer, M. Eskevich, F. de Jong (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2219–2223, European Language Resources Association (ELRA). [[http://lrec-conf.org/workshops/lrec2018/W2/summaries/11_W2.html|{{attachment:bibtex.png|alt text|align="bottom"}}]] [[http://www.lrec-conf.org/proceedings/lrec2012/pdf/653_Paper.pdf|{{attachment:pdf.png}}]].

 * Ogrodniczuk M. (2018). ''Polish Parliamentary Corpus''. In: D. Fišer, M. Eskevich, F. de Jong (eds.) Proceedings of the LREC 2018 Workshop “ParlaCLARIN: Creating and Using Parliamentary Corpora”. European Language Resources Association (ELRA). ISBN 979-10-95546-02-3. [[http://lrec-conf.org/workshops/lrec2018/W2/summaries/11_W2.html|{{attachment:bibtex.png}}]] [[http://lrec-conf.org/workshops/lrec2018/W2/pdf/11_W2.pdf|{{attachment:pdf.png}}]].


== Other information ==

Please see also [[https://www.clarin.eu/sites/default/files/2-ogrodniczuk.pdf|the slides]] from the [[https://www.clarin.eu/event/2017/clarin-plus-workshop-working-parliamentary-records|CLARIN-PLUS Workshop "Working with Parliamentary Records"]] (Sofia, 27–29 March 2017).
Line 30: Line 40:
Online search of the full corpus will be available soon. Search tool for a 200M sample of the Sejm data is available at http://sejm.nlp.ipipan.waw.pl/. You can also use the Poliqarp image of this smaller corpus: http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz (826 MB).  * [[http://sejm.nlp.ipipan.waw.pl/|using Poliqarp search engine]]
 * [[http://smyrna.sejm.nlp.ipipan.waw.pl/|using Smyrna search engine]]
 * [[http://ngram.sejm.nlp.ipipan.waw.pl/|using ngram viewer]]


== Contact information ==

[[http://zil.ipipan.waw.pl/MaciejOgrodniczuk|Maciej Ogrodniczuk]], Institute of Computer Science, Polish Academy of Sciences

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of both chambers of the Polish Parliament, the Sejm and the Senate. It is based on the Polish Sejm Corpus, which was co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.

Corpus format

Corpus files are made available in TEI P5 format compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotation of:

  • utterance-level segmentation,
  • tokenization,
  • lemmatization,
  • disambiguated morphosyntactic description,
  • syntactic words,
  • syntactic groups,
  • named entities.
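
The token-level layers listed above can be read with any XML parser. The following is a minimal sketch, not the corpus's official tooling: it assumes an NKJP-style morphosyntactic layer in which each token is a `seg` element holding a feature structure with an `orth` feature, all in the TEI namespace. The sample document is invented for illustration, and real corpus files may differ in detail.

```python
import xml.etree.ElementTree as ET

# TEI P5 namespace; the element layout below is an assumption modelled on
# the NKJP annotation scheme, not verified against the corpus files.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def iter_tokens(xml_text):
    """Yield the orthographic form of every token segment in a TEI document."""
    root = ET.fromstring(xml_text)
    for seg in root.iterfind(".//tei:seg", TEI_NS):
        for feature in seg.iterfind(".//tei:f[@name='orth']", TEI_NS):
            string = feature.find("tei:string", TEI_NS)
            if string is not None and string.text:
                yield string.text


# Invented two-token sample used only to exercise the function.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><text><body><p><s>
    <seg><fs type="morph"><f name="orth"><string>Wysoka</string></f></fs></seg>
    <seg><fs type="morph"><f name="orth"><string>Izbo</string></f></fs></seg>
  </s></p></body></text></TEI>
</teiCorpus>"""

if __name__ == "__main__":
    print(list(iter_tokens(SAMPLE)))
```

The same pattern would extend to other features of the feature structure (for example a lemma or a disambiguated tag), provided their names in the actual files are checked first.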

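Corpus data

  • Sejm: Interpellations and questions terms 3-8, sittings terms 1-8 (http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar), 37.2 GB
  • Senate: Sittings terms 2-9 (http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar), 5.2 GB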
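
The corpus is distributed as large tar archives (the Sejm archive is listed at 37.2 GB), so it can be useful to inspect an archive without unpacking it. The sketch below uses Python's standard `tarfile` module in streaming mode to list the first few file names. The helper names and the member paths in the demo archive are hypothetical; the internal layout of the real archives is not described on this page.

```python
import io
import tarfile


def list_members(fileobj, limit=10):
    """Return up to `limit` regular-file names, reading the tar as a stream."""
    names = []
    # "r|*" reads sequentially with transparent compression detection,
    # so the archive is never loaded into memory or unpacked to disk.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names


def build_demo_archive():
    """Build a tiny in-memory tar with hypothetical file names for testing."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in ("sitting-001/text_structure.xml",
                     "sitting-001/ann_segmentation.xml"):
            data = b"<TEI/>"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_members(build_demo_archive()))
```

For a downloaded archive, the same function could be called with an opened file, for example `list_members(open("PSC-Senat.tar", "rb"))`.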

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.

Publications

  • Ogrodniczuk M. (2012). The Polish Sejm Corpus. In: D. Fišer, M. Eskevich, F. de Jong (eds.) Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012), pp. 2219–2223, European Language Resources Association (ELRA).

  • Ogrodniczuk M. (2018). Polish Parliamentary Corpus. In: D. Fišer, M. Eskevich, F. de Jong (eds.) Proceedings of the LREC 2018 Workshop “ParlaCLARIN: Creating and Using Parliamentary Corpora”. European Language Resources Association (ELRA). ISBN 979-10-95546-02-3.

Other information

Please see also the slides from the CLARIN-PLUS Workshop "Working with Parliamentary Records" (Sofia, 27–29 March 2017).

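Searching the corpus

  • using Poliqarp search engine (http://sejm.nlp.ipipan.waw.pl/)
  • using Smyrna search engine (http://smyrna.sejm.nlp.ipipan.waw.pl/)
  • using ngram viewer (http://ngram.sejm.nlp.ipipan.waw.pl/)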

Contact information

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences