Line 1: | Line 1: |
= The Polish Parliamentary Corpus / Polski Korpus Parlamentarny = | = The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego = |
Line 3: | Line 3: |
The Polish Parliamentary Corpus (PPC) is planned to be a large collection of linguistically analysed documents from proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and Senate. | The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and is currently being updated by [[http://clip.ipipan.waw.pl/CLARIN-PL-2|CLARIN-PL]] infrastructure. |
Line 5: | Line 5: |
The resources was collated based on the following corpora: * [[PSC|Polish Sejm Corpus]]. |
== Corpus format == |
Line 8: | Line 7: |
== Format == Corpus files are made available in TEI P5 format. The resource contains automatically created annotation of: |
Corpus files are made available in TEI P5 format compatible with the annotation used by the [[http://nkjp.pl/index.php?page=0&lang=1|National Corpus of Polish]]. The resource contains automatically created annotation of: |
Line 15: | Line 12: |
* syntactic words, * syntactic groups, * named entities. |
* named entities. |
Line 19: | Line 14: |
== Data == | == Corpus data == The Polish Parliamentary Corpus contains: * Sejm sittings from 1919–present (including Legislative Sejm and State National Council) * Sejm committee sittings from 1993–present * Sejm interpellations and questions from 1997–present * Senate sittings from 1922–1939 and 1989–present * Senate committee sittings from 2015–present |
Line 21: | Line 23: |
Please use the Polish Sejm Corpus data until newer version is made available. | == Download == |
Line 23: | Line 25: |
=== Whole corpus === * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Interpellations and questions terms 3-6, sittings terms 1-6]], 25 GB |
At present you can download [[http://manage.legis.nlp.ipipan.waw.pl/download/ppc-nanno.tar.gz|the unannotated TEI version]] (1.7 GB). Linguistically annotated version will be added soon. |
Line 26: | Line 27: |
=== Divided by term and document type === * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad3.tar|Interpellations and questions Term 3]], (1997–2001), 1.7 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad4.tar|Interpellations and questions Term 4]], (2001–05), 2.4 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad5.tar|Interpellations and questions Term 5]], (2005–07), 2.0 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad6.tar|Interpellations and questions Term 6]], (2007–11), 4.8 GB |
== Searching the corpus == |
Line 32: | Line 29: |
* [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad1.tar|Sittings Term 1]], (1991–93), 0.9 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad2.tar|Sittings Term 2]], (1993–97), 2.6 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad3.tar|Sittings Term 3]], (1997–2001), 2.8 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad4.tar|Sittings Term 4]], (2001–05), 3.4 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad5.tar|Sittings Term 5]], (2005–07), 1.4 GB * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad6.tar|Sittings Term 6]], (2007–11), 3.1 GB |
* [[http://sejm.nlp.ipipan.waw.pl/|using the search engine]] * [[http://ngram.sejm.nlp.ipipan.waw.pl/|using the ngram viewer]] == Licence == The parliamentary data is public domain. The corpus annotations are available on CC-BY (attribution) licence. |
Line 41: | Line 38: |
So far please cite the Polish Sejm Corpus paper: | <<BibMate(key, "ogr:2018:parlaclarin", "ogro:12:lrec", omitYears=true)>> |
Line 43: | Line 40: |
<<BibMate(key,"ogro:12:lrec",omitYears=true)>> | Please see also [[https://www.clarin.eu/sites/default/files/2-ogrodniczuk.pdf|the slides]] from [[https://www.clarin.eu/event/2017/clarin-plus-workshop-working-parliamentary-records|CLARIN-PLUS Workshop "Working with Parliamentary Records"]]. Sofia, 27–29 March 2017. |
Line 45: | Line 42: |
== Searching the corpus == | == Contact == |
Line 47: | Line 44: |
Online search of the corpus will be available soon. | [[http://zil.ipipan.waw.pl/MaciejOgrodniczuk|Maciej Ogrodniczuk]], Institute of Computer Science, Polish Academy of Sciences |
The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (Sejm and Senate). It is based on the Polish Sejm Corpus, co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.
Corpus format
Corpus files are available in the TEI P5 format, compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotations of:
- utterance-level segmentation,
- tokenization,
- lemmatization,
- disambiguated morphosyntactic description,
- named entities.
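As a rough illustration of working with the utterance-level layer, the sketch below reads speaker turns from a TEI P5 document using only the Python standard library. The TEI namespace is the standard one; the sample document, the use of `<u>` elements with a `who` attribute, and the speaker identifiers are assumptions made for this example and may not match the actual corpus files.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document; real corpus files may be structured differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u who="#Marszalek">Otwieram posiedzenie.</u>
        <u who="#PoselKowalski">Dziękuję, panie marszałku.</u>
      </div>
    </body>
  </text>
</TEI>"""


def iter_utterances(xml_text):
    """Yield (speaker, text) pairs for every <u> element in a TEI document."""
    root = ET.fromstring(xml_text)
    for u in root.iterfind(".//tei:u", TEI_NS):
        speaker = u.get("who", "")
        # Collapse whitespace so nested markup does not leave stray line breaks.
        text = " ".join("".join(u.itertext()).split())
        yield speaker, text


utterances = list(iter_utterances(SAMPLE))
for speaker, text in utterances:
    print(speaker, text)
```

The same function could be applied to the contents of a corpus file once its actual element names have been checked against the data.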
Corpus data
The Polish Parliamentary Corpus contains:
- Sejm sittings from 1919–present (including Legislative Sejm and State National Council)
- Sejm committee sittings from 1993–present
- Sejm interpellations and questions from 1997–present
- Senate sittings from 1922–1939 and 1989–present
- Senate committee sittings from 2015–present
Download
At present you can download the unannotated TEI version (1.7 GB). A linguistically annotated version will be added soon.
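Once downloaded, the archive can be inspected without unpacking all of it. The following is a minimal sketch using Python's `tarfile` module; it builds a tiny in-memory archive as a stand-in for the real download, and the directory and file names inside it are hypothetical, since the internal layout of the corpus archive is not described here.

```python
import io
import tarfile


def list_tei_members(archive, suffix=".xml"):
    """Return names of regular files in an open tar archive that end in suffix."""
    return [
        member.name
        for member in archive.getmembers()
        if member.isfile() and member.name.endswith(suffix)
    ]


def build_demo_archive():
    """Create a tiny in-memory .tar.gz standing in for the real download."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        # Hypothetical member names, chosen only for this demonstration.
        for name in ("demo/sitting-001/text_structure.xml", "demo/README.txt"):
            data = b"<TEI/>"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


with tarfile.open(fileobj=build_demo_archive(), mode="r:gz") as demo:
    demo_members = list_tei_members(demo)
print(demo_members)
```

For the actual corpus, the archive would be opened with `tarfile.open("ppc-nanno.tar.gz", "r:gz")` instead, assuming the file has been saved under that name in the working directory.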
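Searching the corpus
The corpus can be searched online:
- using the search engine (http://sejm.nlp.ipipan.waw.pl/)
- using the ngram viewer (http://ngram.sejm.nlp.ipipan.waw.pl/)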
Licence
The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.
Publications
Please see also the slides from the CLARIN-PLUS Workshop "Working with Parliamentary Records" (Sofia, 27–29 March 2017).
Contact
Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences