The Polish Parliamentary Corpus / Polski Korpus Parlamentarny
The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of both chambers of the Polish Parliament, the Sejm and the Senate. It is based on the Polish Sejm Corpus, which was prepared in October 2011 and co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.
Corpus format
Corpus files are available in a TEI P5 format compatible with the annotation scheme used by the National Corpus of Polish. The resource contains automatically generated annotation of:
- utterance-level segmentation,
- tokenization,
- lemmatization,
- disambiguated morphosyntactic description,
- syntactic words,
- syntactic groups,
- named entities.
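As a rough illustration of how the token-level layers listed above might be read, the sketch below parses a small TEI P5 fragment with Python's standard library. The fragment is a simplified, hypothetical stand-in written in the feature-structure style (`fs`/`f` elements carrying `orth` and `base` features) associated with the National Corpus of Polish; the element layout, feature names, and the `read_tokens` helper are assumptions for illustration and should be checked against the distributed corpus files.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace, bound to the prefix "tei" for path queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, simplified fragment; real PPC files may be laid out differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p><s>
    <seg>
      <fs type="morph">
        <f name="orth"><string>posłowie</string></f>
        <f name="base"><string>poseł</string></f>
      </fs>
    </seg>
    <seg>
      <fs type="morph">
        <f name="orth"><string>głosują</string></f>
        <f name="base"><string>głosować</string></f>
      </fs>
    </seg>
  </s></p></body></text>
</TEI>"""


def read_tokens(xml_text):
    """Return (orthographic form, lemma) pairs for every <seg> element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for seg in root.iterfind(".//tei:seg", TEI_NS):
        features = {}
        # Collect each named feature whose value is stored in a <string> child.
        for feature in seg.iterfind("tei:fs/tei:f", TEI_NS):
            value = feature.find("tei:string", TEI_NS)
            if value is not None:
                features[feature.get("name")] = value.text
        tokens.append((features.get("orth"), features.get("base")))
    return tokens
```

Calling `read_tokens(SAMPLE)` yields one pair per segment; a missing feature shows up as `None` rather than raising an error, which is convenient when a layer is only partially annotated.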
Corpus data
Sejm: interpellations and questions (terms 3-8), sittings (terms 1-8); 37.2 GB
Senate: sittings (terms 2-9); 5.2 GB
Licence
CC-BY (Attribution)
Publications
For now, please cite the Polish Sejm Corpus paper:
Searching the corpus
Online search of the full corpus will be available soon. In the meantime, a search tool for a 200M sample of the Sejm data is available at http://sejm.nlp.ipipan.waw.pl/. You can also download the Poliqarp image of this smaller corpus: http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz (826 MB).
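Fetching and unpacking the Poliqarp image can be scripted; the following is a minimal sketch using only Python's standard library. The helper names `download` and `unpack` and the target paths are hypothetical, and the internal layout of the archive is not documented here, so the code simply lists whatever files it extracts.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL of the Poliqarp image as given on this page.
IMAGE_URL = "http://sejm.nlp.ipipan.waw.pl/static/PSC_poliqarp.tar.gz"


def download(url, archive_path):
    """Save the file at `url` to `archive_path` (826 MB for the image above)."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def unpack(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the extracted file paths."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        # Basic check that no member name points outside dest_dir
        # (it does not inspect symbolic links inside the archive).
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return sorted(p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file())
```

A typical call would be `unpack(download(IMAGE_URL, "PSC_poliqarp.tar.gz"), "psc_poliqarp")`; the resulting directory can then be opened with a locally installed Poliqarp client.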
Contact information
Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences