Differences between revisions 2 and 46 (spanning 44 versions)

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament, that is, the Sejm and the Senate. It is based on the Polish Sejm Corpus, which was co-funded by the CESAR project, and is currently being updated by the CLARIN-PL infrastructure.

Corpus format

Corpus files are available in a TEI P5 format compatible with the annotation scheme used by the National Corpus of Polish. The resource contains automatically created annotation at the following levels:

utterance-level segmentation,
tokenization,
lemmatization,
disambiguated morphosyntactic description,
named entities.
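
As a rough illustration of how utterance-level segmentation in a TEI P5 file might be read, the following Python sketch parses a small, made-up fragment. The TEI namespace is the standard one; the sample content, the `who` values and the helper name `iter_utterances` are assumptions made for illustration, and the actual PPC files may organise these annotation layers differently.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, mapped to a prefix for XPath-style queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical fragment written for this example; not taken from the corpus.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u who="#speaker1">Otwieram posiedzenie.</u>
        <u who="#speaker2">Zamykam posiedzenie.</u>
      </div>
    </body>
  </text>
</TEI>"""


def iter_utterances(xml_text):
    """Yield (speaker, text) pairs for every TEI <u> element in the document."""
    root = ET.fromstring(xml_text)
    for u in root.iterfind(".//tei:u", TEI_NS):
        # itertext() also collects text nested inside child elements.
        yield u.get("who"), "".join(u.itertext()).strip()


if __name__ == "__main__":
    for speaker, text in iter_utterances(SAMPLE):
        print(speaker, text)
```

The same pattern should carry over to other layers, provided the element names used in the real files are substituted for the assumed ones.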

Corpus data

The Polish Parliamentary Corpus contains:

Sejm sittings from 1919–present (including Legislative Sejm and State National Council)
Sejm committee sittings from 1993–present
Sejm interpellations and questions from 1997–present
Senate sittings from 1922–1939 and 1989–present
Senate committee sittings from 2015–present

Download

At present you can download the unannotated TEI version (1.7 GB). A linguistically annotated version will be added soon.
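
To inspect an archive of this kind without unpacking all of it, the members of a `.tar.gz` file can be listed with Python's standard `tarfile` module. This is only a sketch: the local path `ppc-nanno.tar.gz` assumes the download was saved under that name in the working directory, the internal directory layout of the archive is not described here, and `list_xml_members` is a hypothetical helper.

```python
import os
import tarfile


def list_xml_members(archive_path, limit=10):
    """Return the names of up to `limit` XML files stored in a .tar.gz archive."""
    names = []
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # Iterating over the archive reads member headers one at a time.
        for member in archive:
            if member.isfile() and member.name.endswith(".xml"):
                names.append(member.name)
                if len(names) >= limit:
                    break
    return names


if __name__ == "__main__":
    path = "ppc-nanno.tar.gz"  # assumed local path of the downloaded archive
    if os.path.exists(path):
        for name in list_xml_members(path):
            print(name)
```

Stopping after a small number of members keeps the check quick even for a large archive.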

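Searching the corpus

using the search engine (http://sejm.nlp.ipipan.waw.pl/)
using the ngram viewer (http://ngram.sejm.nlp.ipipan.waw.pl/)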

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.

Publications

List of publications

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

See also the slides from the CLARIN-PLUS Workshop "Working with Parliamentary Records" (Sofia, 27–29 March 2017).

Contact

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences

- Revision 2 as of 2017-02-28 11:45:47
  Size: 1644
  Editor: MaciejOgrodniczuk
  Comment:
+ Revision 46 as of 2020-02-28 20:34:59
  Size: 2294
  Editor: BartlomiejNiton
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-= The Polish Parliamentary Corpus / Polski Korpus Parlamentarny =
+= The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego =
 Line 3:
-The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] prepared in October 2011 and co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being developed by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure.
+The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being updated by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure.
 Line 12:
- * syntactic words,
 * syntactic groups,
 * named entities.
+ * named entities.
-Line 17:
+Line 15:
+  The Polish Parliamentary Corpus contains:
 * Sejm sittings from 1919–present (including Legislative Sejm and State National Council)
 * Sejm committee sittings from 1993–present
 * Sejm interpellations and questions from 1997–present 
 * Senate sittings from 1922–1939 and 1989–present
 * Senate committee sittings from 2015–present
-Line 18:
+Line 23:
- * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Interpellations and questions terms 3-8, sittings terms 1-8]], 37.2 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar|Sittings terms 2-9]], 5.2 GB
+== Download ==
-Line 21:
+Line 25:
+At present you can download [[http://manage.legis.nlp.ipipan.waw.pl/download/ppc-nanno.tar.gz|the unannotated TEI version]] (1.7 GB). Linguistically annotated version will be added soon.

== Searching the corpus ==

 * [[http://sejm.nlp.ipipan.waw.pl/|using the search engine]]
 * [[http://ngram.sejm.nlp.ipipan.waw.pl/|using the ngram viewer]]

== Licence ==

The parliamentary data is public domain. The corpus annotations are available on CC-BY (attribution) licence.
-Line 24:
+Line 38:
-So far please cite the Polish Sejm Corpus paper:
+<<BibMate(key, "ogr:2018:parlaclarin", "ogro:12:lrec", omitYears=true)>>
-Line 26:
+Line 40:
-<<BibMate(key,"ogro:12:lrec",omitYears=true)>>
+Please see also [[https://www.clarin.eu/sites/default/files/2-ogrodniczuk.pdf|the slides]] from [[https://www.clarin.eu/event/2017/clarin-plus-workshop-working-parliamentary-records|CLARIN-PLUS Workshop "Working with Parliamentary Records"]]. Sofia, 27–29 March 2017.
-Line 28:
+Line 42:
-== Searching the corpus ==
+== Contact ==
-Line 30:
+Line 44:
-Online search of the full corpus will be available soon. Search tool for a 200M sample of the Sejm data is available at http://sejm.nlp.ipipan.waw.pl/. You can also use the Poliqarp image of this smaller corpus: http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz (826 MB).
+[[http://zil.ipipan.waw.pl/MaciejOgrodniczuk|Maciej Ogrodniczuk]], Institute of Computer Science, Polish Academy of Sciences
