Diff for "PPC"

Differences between revisions 1 and 14 (spanning 13 versions)
Revision 1 as of 2016-10-20 16:52:25
Size: 2343
Revision 14 as of 2018-05-07 06:28:55
Size: 2347
Line 1: Line 1:
= The Polish Parliamentary Corpus / Polski Korpus Parlamentarny =
= The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego =
Line 3: Line 3:
The Polish Parliamentary Corpus (PPC) is planned to be a large collection of linguistically analysed documents from proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and Senate.
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being updated by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure.
Line 5: Line 5:
The resources was collated based on the following corpora:
 * [[PSC|Polish Sejm Corpus]].
== Corpus format ==
Line 8: Line 7:
== Format ==

Corpus files are made available in TEI P5 format. The resource contains automatically created annotation of:
Corpus files are made available in TEI P5 format compatible with the annotation used by the [[http://nkjp.pl/index.php?page=0&lang=1|National Corpus of Polish]]. The resource contains automatically created annotation of:
Line 19: Line 16:
== Data ==
== Corpus data ==
Line 21: Line 18:
Please use the Polish Sejm Corpus data until newer version is made available.
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Sejm: Interpellations and questions terms 3-8, sittings terms 1-8]], 37.2 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar|Senate: Sittings terms 2-9]], 5.2 GB
Line 23: Line 21:
=== Whole corpus ===
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Interpellations and questions terms 3-6, sittings terms 1-6]], 25 GB
== Licence ==
Line 26: Line 23:
=== Divided by term and document type ===
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad3.tar|Interpellations and questions Term 3]], (1997–2001), 1.7 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad4.tar|Interpellations and questions Term 4]], (2001–05), 2.4 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad5.tar|Interpellations and questions Term 5]], (2005–07), 2.0 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad6.tar|Interpellations and questions Term 6]], (2007–11), 4.8 GB

 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad1.tar|Sittings Term 1]], (1991–93), 0.9 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad2.tar|Sittings Term 2]], (1993–97), 2.6 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad3.tar|Sittings Term 3]], (1997–2001), 2.8 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad4.tar|Sittings Term 4]], (2001–05), 3.4 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad5.tar|Sittings Term 5]], (2005–07), 1.4 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad6.tar|Sittings Term 6]], (2007–11), 3.1 GB
The parliamentary data is public domain. The corpus annotations are available on CC-BY (attribution) licence.
Line 45: Line 31:
== Publications ==

{{attachment:closed.gif}} Ogrodniczuk M. (2018). ''Polish Parliamentary Corpus''. In: D. Fišer, M. Eskevich, F. de Jong (eds.) Proceedings of the LREC 2018 Workshop “ParlaCLARIN: Creating and Using Parliamentary Corpora”. [[http://lrec-conf.org/workshops/lrec2018/W2/summaries/11_W2.html|{{attachment:bibtex.png}}]] [[http://lrec-conf.org/workshops/lrec2018/W2/pdf/11_W2.pdf|{{attachment:pdf.png}}]]. European Language Resources Association (ELRA). ISBN 979-10-95546-02-3.

Line 47: Line 38:
Online search of the corpus will be available soon.
 * [[http://sejm.nlp.ipipan.waw.pl/|using Poliqarp search engine]]
 * [[http://smyrna.sejm.nlp.ipipan.waw.pl/|using Smyrna search engine]]
 * [[http://ngram.sejm.nlp.ipipan.waw.pl/|using ngram viewer]]


== Contact information ==

[[http://zil.ipipan.waw.pl/MaciejOgrodniczuk|Maciej Ogrodniczuk]], Institute of Computer Science, Polish Academy of Sciences

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (Sejm and Senate). It is based on the Polish Sejm Corpus, co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.

Corpus format

Corpus files are made available in TEI P5 format compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotation of:

  • utterance-level segmentation,
  • tokenization,
  • lemmatization,
  • disambiguated morphosyntactic description,
  • syntactic words,
  • syntactic groups,
  • named entities.
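
As an illustration of working with such annotation, the following is a minimal Python sketch that reads token forms from a TEI P5 fragment. It assumes the layout used in National Corpus of Polish style files, where each token is a `seg` element holding a feature structure with an `orth` feature; the sample fragment and the function name `extract_orths` are hypothetical, and the actual PPC files may be organised differently.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; the prefix "tei" is only a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment modelled on NKJP-style morphosyntactic annotation.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI><text><body><p><s>
    <seg><fs type="morph"><f name="orth"><string>Wysoka</string></f></fs></seg>
    <seg><fs type="morph"><f name="orth"><string>Izbo</string></f></fs></seg>
  </s></p></body></text></TEI>
</teiCorpus>"""


def extract_orths(xml_text):
    """Return the orthographic form of every token (seg) in document order."""
    root = ET.fromstring(xml_text)
    orths = []
    for seg in root.iterfind(".//tei:seg", TEI_NS):
        # Assumed path: seg > fs > f[name=orth] > string
        node = seg.find("tei:fs/tei:f[@name='orth']/tei:string", TEI_NS)
        if node is not None and node.text:
            orths.append(node.text)
    return orths


if __name__ == "__main__":
    print(extract_orths(SAMPLE))
```

The same pattern would extend to other layers (lemmas, named entities) by changing the feature name or element path, once the real file structure has been checked.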

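Corpus data

  • Sejm: Interpellations and questions terms 3-8, sittings terms 1-8, 37.2 GB (http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar)
  • Senate: Sittings terms 2-9, 5.2 GB (http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar)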
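
Because the corpus is distributed as large tar archives, it can be useful to inspect an archive before unpacking it. The sketch below streams through a locally downloaded tar file and counts files and bytes per top-level directory without extracting anything. The function name `summarize_tar` and the local file name are assumptions for illustration, and the internal directory layout of the archives is not documented here.

```python
import tarfile
from collections import defaultdict


def summarize_tar(path):
    """Map each top-level directory to (file count, total bytes)."""
    totals = defaultdict(lambda: [0, 0])
    # "r|*" reads the archive as a forward-only stream, so nothing is
    # extracted and no random access to the file is needed.
    with tarfile.open(path, mode="r|*") as archive:
        for member in archive:
            if not member.isfile():
                continue
            name = member.name
            if name.startswith("./"):
                name = name[2:]
            top = name.split("/", 1)[0]
            totals[top][0] += 1
            totals[top][1] += member.size
    return {key: tuple(value) for key, value in totals.items()}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical local path): python summarize.py PSC-Senat.tar
    if len(sys.argv) > 1:
        for top, (count, size) in sorted(summarize_tar(sys.argv[1]).items()):
            print(f"{top}: {count} files, {size} bytes")
```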

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.

Publications

For now, please cite the Polish Sejm Corpus paper:

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

Ogrodniczuk M. (2018). Polish Parliamentary Corpus. In: D. Fišer, M. Eskevich, F. de Jong (eds.) Proceedings of the LREC 2018 Workshop “ParlaCLARIN: Creating and Using Parliamentary Corpora”. European Language Resources Association (ELRA). ISBN 979-10-95546-02-3. BibTeX: http://lrec-conf.org/workshops/lrec2018/W2/summaries/11_W2.html, PDF: http://lrec-conf.org/workshops/lrec2018/W2/pdf/11_W2.pdf

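Searching the corpus

  • using Poliqarp search engine: http://sejm.nlp.ipipan.waw.pl/
  • using Smyrna search engine: http://smyrna.sejm.nlp.ipipan.waw.pl/
  • using ngram viewer: http://ngram.sejm.nlp.ipipan.waw.pl/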

Contact information

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences