Size: 1644
Comment:
|
Size: 2121
Comment:
|
Line 1: | Line 1: |
= The Polish Parliamentary Corpus / Polski Korpus Parlamentarny = | = The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego = |
Line 3: | Line 3: |
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] prepared in October 2011 and co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being developed by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure. | The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being updated by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure. |
Line 14: | Line 14: |
* named entities. | * named entities. |
Line 17: | Line 17: |
* [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Sejm: Interpellations and questions terms 3-8, sittings terms 1-8]], 38.3 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar|Senate: Sittings terms 2-9]], 5.6 GB |
|
Line 18: | Line 21: |
* [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Interpellations and questions terms 3-8, sittings terms 1-8]], 37.2 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar|Sittings terms 2-9]], 5.2 GB |
== Licence == |
Line 21: | Line 23: |
The parliamentary data is public domain. The corpus annotations are available on CC-BY (attribution) licence. | |
Line 24: | Line 27: |
So far please cite the Polish Sejm Corpus paper: | <<BibMate(key, "ogr:2018:parlaclarin", "ogro:12:lrec", omitYears=true)>> |
Line 26: | Line 29: |
<<BibMate(key,"ogro:12:lrec",omitYears=true)>> | == Other information == Please see also [[https://www.clarin.eu/sites/default/files/2-ogrodniczuk.pdf|the slides]] from [[https://www.clarin.eu/event/2017/clarin-plus-workshop-working-parliamentary-records|CLARIN-PLUS Workshop "Working with Parliamentary Records"]]. Sofia, 27–29 March 2017. |
Line 30: | Line 36: |
Online search of the full corpus will be available soon. Search tool for a 200M sample of the Sejm data is available at http://sejm.nlp.ipipan.waw.pl/. You can also use the Poliqarp image of this smaller corpus: http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz (826 MB). | * [[http://sejm.nlp.ipipan.waw.pl/|using Poliqarp search engine]] * [[http://smyrna.sejm.nlp.ipipan.waw.pl/|using Smyrna search engine]] * [[http://ngram.sejm.nlp.ipipan.waw.pl/|using ngram viewer]] == Contact information == [[http://zil.ipipan.waw.pl/MaciejOgrodniczuk|Maciej Ogrodniczuk]], Institute of Computer Science, Polish Academy of Sciences |
The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (Sejm and Senate). It is based on the Polish Sejm Corpus, which was co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.
Corpus format
Corpus files are made available in the TEI P5 format, compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotation of:
- utterance-level segmentation,
- tokenization,
- lemmatization,
- disambiguated morphosyntactic description,
- syntactic words,
- syntactic groups,
- named entities.
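As a rough illustration of working with TEI P5 files, the sketch below counts elements in a TEI document using only the Python standard library. The sample fragment, the element names `s` and `seg`, and the helper `count_elements` are assumptions made for illustration; the actual file layout of the corpus is not described on this page and may differ.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical, heavily simplified fragment for illustration only;
# real corpus files may be organised differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <p>
        <s xml:id="s1">
          <seg xml:id="seg1"/>
          <seg xml:id="seg2"/>
        </s>
        <s xml:id="s2">
          <seg xml:id="seg3"/>
        </s>
      </p>
    </body>
  </text>
</TEI>
"""


def count_elements(xml_text, local_name):
    """Count TEI-namespaced elements with the given local name."""
    root = ET.fromstring(xml_text)
    # ElementTree addresses namespaced tags as "{namespace}localname".
    return sum(1 for _ in root.iter(f"{{{TEI_NS}}}{local_name}"))


print(count_elements(SAMPLE, "s"))
print(count_elements(SAMPLE, "seg"))
```

The same function can be applied to the contents of a real annotation file once the element names used for a given annotation layer have been checked in the data.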
Corpus data
- Sejm: interpellations and questions (terms 3-8), sittings (terms 1-8), 38.3 GB
- Senate: sittings (terms 2-9), 5.6 GB
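Because the archives are large, it may be convenient to inspect them without unpacking. The sketch below streams through a tar archive and counts regular files by extension. The function name `count_by_extension` is hypothetical, and nothing is assumed here about which file types the archives actually contain.

```python
import io
import tarfile
from collections import Counter


def count_by_extension(fileobj):
    """Count regular files in a tar stream by extension, without extracting."""
    counts = Counter()
    # "r|*" reads the archive sequentially as a stream, so memory use
    # stays small even for a multi-gigabyte file.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            if member.isfile():
                name = member.name.rsplit("/", 1)[-1]
                ext = name.rsplit(".", 1)[-1] if "." in name else ""
                counts[ext] += 1
    return counts
```

For a downloaded archive this could be called as `count_by_extension(open("PSC-Senat.tar", "rb"))`, assuming the file has been saved locally under that name.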
Licence
The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.
Publications
Other information
Please see also the slides from the CLARIN-PLUS Workshop "Working with Parliamentary Records" (Sofia, 27–29 March 2017).
Searching the corpus
- using Poliqarp search engine: http://sejm.nlp.ipipan.waw.pl/
- using Smyrna search engine: http://smyrna.sejm.nlp.ipipan.waw.pl/
- using ngram viewer: http://ngram.sejm.nlp.ipipan.waw.pl/
Contact information
Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences