The Polish Parliamentary Corpus / Polski Korpus Parlamentarny
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (the Sejm and the Senate). It is based on the Polish Sejm Corpus, which was prepared in October 2011 and co-funded by the CESAR project, and it is currently being developed within the CLARIN-PL infrastructure.
Corpus format
Corpus files are made available in TEI P5 format compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotation of:
- utterance-level segmentation,
- tokenization,
- lemmatization,
- disambiguated morphosyntactic description,
- syntactic words,
- syntactic groups,
- named entities.
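As a rough illustration of reading one of these annotation layers, the sketch below counts sentences and segments in a TEI P5 document using only the Python standard library. The inline sample, the element layout (`s` elements containing `seg` elements) and the function name are assumptions loosely modelled on NKJP-style segmentation files; the actual corpus files may be organised differently.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; elements in TEI files are qualified with it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, heavily simplified sample; it is not copied from the corpus.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI>
    <text>
      <body>
        <p>
          <s xml:id="s1">
            <seg xml:id="seg1"/>
            <seg xml:id="seg2"/>
          </s>
          <s xml:id="s2">
            <seg xml:id="seg3"/>
          </s>
        </p>
      </body>
    </text>
  </TEI>
</teiCorpus>
"""


def count_sentences_and_segments(xml_text):
    """Return (number of <s> elements, number of <seg> elements)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    sentences = root.findall(".//tei:s", TEI_NS)
    segments = root.findall(".//tei:seg", TEI_NS)
    return len(sentences), len(segments)


if __name__ == "__main__":
    print(count_sentences_and_segments(SAMPLE))
```

For a real file, the same function could be given the contents of a segmentation file read from disk, provided the file follows the layout assumed here.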
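Corpus data
- Interpellations and questions terms 3-8, sittings terms 1-8 (37.2 GB): http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar
- Sittings terms 2-9 (5.2 GB): http://sejm.nlp.ipipan.waw.pl/static/PSC-Senat.tar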
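Corpus archives of this kind can be very large, so it may be convenient to inspect a downloaded tar file as a stream instead of unpacking it in full. The following is a minimal sketch, assuming an uncompressed tar archive; the function names and the demo member names are hypothetical and say nothing about the real directory layout.

```python
import io
import tarfile


def list_members(fileobj, suffix=".xml"):
    """Return names of regular files ending in `suffix`, reading the tar as a stream."""
    names = []
    # Mode "r|" reads an uncompressed tar sequentially without seeking.
    with tarfile.open(fileobj=fileobj, mode="r|") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(suffix):
                names.append(member.name)
    return names


def build_demo_archive():
    """Build a tiny in-memory tar archive so the sketch runs without a download."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name in ("demo/doc1/text.xml", "demo/doc1/notes.txt"):
            payload = b"<TEI/>"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(list_members(build_demo_archive()))
```

With a downloaded archive such as PSC-Senat.tar, the file would be opened with `open(path, "rb")` and passed to `list_members` in place of the demo buffer.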
Publications
For now, please cite the Polish Sejm Corpus paper:
Searching the corpus
Online search of the full corpus will be available soon. A search tool for a 200M sample of the Sejm data is available at http://sejm.nlp.ipipan.waw.pl/. The Poliqarp image of this smaller corpus can also be downloaded: http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz (826 MB).