The Polish Sejm Corpus / Polski Korpus Sejmowy
The Polish Sejm Corpus (PSC) is a large (300M-segment) collection of documents (stenographic transcripts, interpellations and questions) from Polish Sejm sittings during terms of office 1–8.
The first edition of the PSC was prepared in October 2011 and was co-funded by the CESAR European project. The current edition is being co-funded by CLARIN-PL.
Corpus data
The corpus contains transcripts of Sejm sessions saved in TEI P5 format. The resource contains automatically created annotations of:
- utterance-level segmentation,
- tokenization,
- lemmatization,
- disambiguated morphosyntactic description,
- syntactic words,
- syntactic groups,
- named entities.
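As a first look at a TEI P5 file, it can help to count which elements it contains. The following is a minimal sketch using only the Python standard library. The sample document and the element names in it (`s`, `seg`) are hypothetical and are not taken from the corpus; only the TEI namespace URI is standard. The actual file layout and element inventory should be checked against the downloaded data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI P5 namespace.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_tei_elements(xml_text):
    """Count elements in the TEI namespace, keyed by local tag name."""
    root = ET.fromstring(xml_text)
    prefix = "{" + TEI_NS + "}"
    counts = Counter()
    for elem in root.iter():
        # Skip anything whose tag is not a namespaced string.
        if isinstance(elem.tag, str) and elem.tag.startswith(prefix):
            counts[elem.tag[len(prefix):]] += 1
    return counts


# Hypothetical sample for illustration; not an excerpt from the PSC.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <p><s><seg>Dziękuję</seg><seg>bardzo</seg><seg>.</seg></s></p>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    for name, n in sorted(count_tei_elements(SAMPLE).items()):
        print(name, n)
```

For a real file, the text can be read with `open(path, encoding="utf-8").read()` and passed to `count_tei_elements`; for very large files, `xml.etree.ElementTree.iterparse` may be a better fit than loading the whole document.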
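Whole corpus
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Interpellations and questions terms 3-8, sittings terms 1-8]], 25 GB
Divided by term and document type
|| Term || Years || Sittings || Interpellations and questions ||
|| 1 || 1991–93 || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad1.tar|1.0 GB]] || ||
|| 2 || 1993–97 || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad2.tar|2.6 GB]] || ||
|| 3 || 1997–2001 || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad3.tar|2.9 GB]] || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad3.tar|1.7 GB]] ||
|| 4 || 2001–05 || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad4.tar|3.4 GB]] || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad4.tar|2.4 GB]] ||
|| 5 || 2005–07 || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad5.tar|1.4 GB]] || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad5.tar|2.0 GB]] ||
|| 6 || 2007–11 || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad6.tar|3.2 GB]] || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad6.tar|4.8 GB]] ||
|| 7 || 2011–15 || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad7.tar|2.7 GB]] || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad7.tar|8.0 GB]] ||
|| 8 || 2015– || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad8.tar|0.7 GB]] || [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad8.tar|1.1 GB]] ||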
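Since the per-term downloads are several gigabytes each, it may be convenient to inspect an archive without unpacking it. The sketch below assumes the downloads are ordinary tar archives and makes no assumption about their internal directory layout; the function name `summarize_archive` is illustrative only.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(fileobj):
    """Return (file_count, total_bytes, suffix_counts) for a tar stream."""
    count, total, suffixes = 0, 0, Counter()
    # Mode "r|*" reads the archive sequentially, so nothing is extracted
    # and the file object does not need to support seeking.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                count += 1
                total += member.size
                suffixes[PurePosixPath(member.name).suffix] += 1
    return count, total, suffixes
```

For example, after downloading one of the term archives (here `PSC-Posiedzenia-kad8.tar`, used purely as an example file name), `with open("PSC-Posiedzenia-kad8.tar", "rb") as f: print(summarize_archive(f))` would print the number of files, their combined uncompressed size and a breakdown by file extension.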
Publications
Searching the corpus
Online search of the corpus is available at http://sejm.nlp.ipipan.waw.pl/. You can also use the Poliqarp image of the corpus: http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz (826 MB).