Differences between revisions 1 and 76 (spanning 75 versions)

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (the Sejm and the Senate). It is based on the Polish Sejm Corpus, co-funded by the CESAR project, and was later extended with the support of the CLARIN-PL, MARCELL and ParlaMint projects.

Corpus data

As of 29 November 2022, the corpus contains over 800M segments; their detailed distribution over houses, periods, and document types is presented below. Apart from the stenographic records of plenary sittings (271M segments) and committee sittings (330M segments), the corpus contains 199M segments of interpellations and questions.

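|| Years || Sejm period || Sejm sittings (docs) || Sejm sittings (segments) || Sejm committees (docs) || Sejm committees (segments) || Sejm interpellations (docs) || Sejm interpellations (segments) || Senate period || Senate sittings (docs) || Senate sittings (segments) || Senate committees (docs) || Senate committees (segments) ||
|| 1919–1922 || Legislative Sejm || 312 || 6 945 162 || – || – || – || – || – || – || – || – || – ||
|| 1922–1927 || 1st term of office || 277 || 7 338 355 || – || – || – || – || 1st term || 96 || 1 979 541 || – || – ||
|| 1928–1930 || 2nd || 58 || 2 139 835 || – || – || – || – || 2nd || 3 || 171 345 || – || – ||
|| 1930–1935 || 3rd || 72 || 2 404 267 || – || – || – || – || 3rd || 64 || 1 804 635 || – || – ||
|| 1935–1938 || 4th || 73 || 2 133 181 || – || – || – || – || 4th || 29 || 724 687 || – || – ||
|| 1938–1939 || 5th || 23 || 610 455 || – || – || – || – || 5th || 20 || 347 430 || – || – ||
|| 1943–1947 || State National Council || 6 || 234 514 || – || – || – || – || – || – || – || – || – ||
|| 1947–1952 || Legislative Sejm || 107 || 2 575 136 || – || – || – || – || – || – || – || – || – ||
|| 1952–1956 || 1st term of office || 39 || 1 172 333 || – || – || – || – || – || – || – || – || – ||
|| 1957–1961 || 2nd || 59 || 2 502 936 || – || – || – || – || – || – || – || – || – ||
|| 1961–1965 || 3rd || 32 || 1 388 862 || – || – || – || – || – || – || – || – || – ||
|| 1965–1969 || 4th || 23 || 1 163 336 || – || – || – || – || – || – || – || – || – ||
|| 1969–1972 || 5th || 17 || 526 277 || – || – || – || – || – || – || – || – || – ||
|| 1972–1976 || 6th || 32 || 1 176 712 || – || – || – || – || – || – || – || – || – ||
|| 1976–1980 || 7th || 29 || 918 993 || – || – || – || – || – || – || – || – || – ||
|| 1980–1985 || 8th || 70 || 3 377 139 || – || – || – || – || – || – || – || – || – ||
|| 1985–1989 || 9th || 45 || 2 641 788 || – || – || – || – || – || – || – || – || – ||
|| 1989–1991 || 10th || 77 || 6 674 111 || – || – || – || – || 1st term || 60 || 3 170 293 || – || – ||
|| 1991–1993 || 1st term of office || 142 || 7 739 147 || – || – || – || – || 2nd || 48 || 1 459 440 || – || – ||
|| 1993–1997 || 2nd || 317 || 22 134 682 || 3 858 || 41 756 476 || – || – || 3rd || 125 || 5 051 677 || – || – ||
|| 1997–2001 || 3rd || 320 || 24 138 142 || 4 690 || 42 510 604 || 23 507 || 12 101 453 || 4th || 187 || 8 255 897 || – || – ||
|| 2001–2005 || 4th || 337 || 28 743 846 || 4 942 || 49 302 521 || 30 986 || 17 519 177 || 5th || 175 || 6 485 347 || – || – ||
|| 2005–2007 || 5th || 148 || 11 737 186 || 2 359 || 18 970 036 || 26 689 || 14 777 377 || 6th || 74 || 3 571 293 || – || – ||
|| 2007–2011 || 6th || 298 || 22 415 708 || 5 565 || 51 752 446 || 59 353 || 36 412 001 || 7th || 167 || 8 819 116 || – || – ||
|| 2011–2015 || 7th || 292 || 22 488 262 || 5 126 || 44 645 569 || 85 599 || 61 565 989 || 8th || 159 || 7 100 841 || – || – ||
|| 2015–2019 || 8th || 239 || 19 431 789 || 4 828 || 46 735 363 || 79 194 || 56 720 590 || 9th || 204 || 10 240 444 || 2 156 || 16 011 501 ||
|| 2019– || 9th || 159 || 11 426 833 || 2 731 || 23 818 878 || – || – || 10th || 121 || 6 359 450 || 1 423 || 12 990 378 ||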
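
As a quick sanity check, the Sejm interpellation segment counts from the table can be summed and compared with the 199M figure quoted in the corpus description. The sketch below is only illustrative: the helper name `parse_segments` is made up for this example, and the values are copied by hand from the table (terms 3 to 8).

```python
# Segment counts copied from the Sejm "Interpellations" column of the table.
INTERPELLATION_SEGMENTS = {
    "1997–2001": "12 101 453",
    "2001–2005": "17 519 177",
    "2005–2007": "14 777 377",
    "2007–2011": "36 412 001",
    "2011–2015": "61 565 989",
    "2015–2019": "56 720 590",
}


def parse_segments(cell: str) -> int:
    """Convert a space-grouped count such as '12 101 453' to an int."""
    # str.split() with no arguments also handles non-breaking spaces.
    return int("".join(cell.split()))


total = sum(parse_segments(cell) for cell in INTERPELLATION_SEGMENTS.values())
print(f"{total:,} segments (~{total / 1e6:.0f}M)")
```

The total comes to 199,096,587 segments, which is consistent with the rounded 199M figure for interpellations and questions.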

Corpus format

Corpus files are made available in XML TEI P5 format compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotation of:

utterance-level segmentation, tokenization and lemmatization produced with Morfeusz2
disambiguated morphosyntactic description produced with Concraft2
named entities produced with Liner2
dependency structures produced with COMBO parser.
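
A minimal sketch of how a TEI file might be inspected with Python's standard library. The TEI namespace URI is the standard one; everything else is an assumption made for illustration. The inline sample is a simplified fragment (not a schema-valid corpus file), the use of `<seg>` elements for tokens and the helper name `count_segments` are not taken from the corpus documentation, so the actual files should be checked before relying on this.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace

# Hypothetical miniature document; real corpus files are larger and richer.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <text><body>
    <p><s><seg>Wysoka</seg><seg>Izbo</seg><seg>!</seg></s></p>
  </body></text>
</TEI>"""


def count_segments(xml_text: str, tag: str = "seg") -> int:
    """Count elements with the given local name in the TEI namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree addresses namespaced tags as "{namespace-uri}localname".
    return sum(1 for _ in root.iter(f"{{{TEI_NS}}}{tag}"))


print(count_segments(SAMPLE))
```

Passing a different `tag` (for example `"s"`) counts another element type with the same function.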

Download

Unannotated TEI version (1.7 GB)
Linguistically annotated version (34.4 GB)
A small sample with data from different periods (39 MB)
PPC data on GitLab
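
Given the archive sizes, it may be preferable to scan a downloaded archive sequentially rather than unpack it first. The following sketch uses Python's `tarfile` module in streaming mode. The function names and the assumption that corpus documents are stored as `.xml` members are illustrative, not documented properties of the archives; the demo archive is built in memory so the example runs without a download.

```python
import io
import tarfile
from typing import BinaryIO, Iterator, Tuple


def iter_xml_members(fileobj: BinaryIO) -> Iterator[Tuple[str, bytes]]:
    """Yield (member name, content) for each .xml file in a gzipped tar stream."""
    # "r|gz" reads the archive sequentially, without seeking or unpacking to disk.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as archive:
        for member in archive:
            if member.isfile() and member.name.endswith(".xml"):
                extracted = archive.extractfile(member)
                if extracted is not None:
                    yield member.name, extracted.read()


def build_demo_archive() -> io.BytesIO:
    """Create a tiny in-memory .tar.gz (hypothetical contents) for the demo."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in [("doc/header.xml", b"<TEI/>"), ("README", b"demo")]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


for name, content in iter_xml_members(build_demo_archive()):
    print(name, len(content))
```

For a real archive, the same generator could be fed a file opened in binary mode, for instance `open(path, "rb")` with `path` pointing at a locally saved download.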

Searching the corpus

using the search engine
using the ngram viewer

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.

Publications

List of publications

Maciej Ogrodniczuk, Michał Rudolf, Beata Wójtowicz, and Sonia Janicka. Error correction environment for the Polish Parliamentary Corpus. In Proceedings of The Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pages 35–38, Marseille, France, 2022. European Language Resources Association.

Maciej Ogrodniczuk and Bartłomiej Nitoń. New developments in the Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the Second ParlaCLARIN Workshop, pages 1–4, Marseille, France, 2020. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pages 15–19, Paris, France, 2018. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

Contact

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences

- Revision 1 as of 2016-10-20 16:52:25
  Size: 2343
  Editor: MaciejOgrodniczuk
+ Revision 76 as of 2024-03-22 16:40:46
  Size: 10858
  Editor: MaciejOgrodniczuk
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-= The Polish Parliamentary Corpus / Polski Korpus Parlamentarny =
+= The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego =
 Line 3:
-The Polish Parliamentary Corpus (PPC) is planned to be a large collection of linguistically analysed documents from proceedings of Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and Senate.
+The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament, [[http://www.sejm.gov.pl/english.html|Sejm]] and [[http://www.senat.gov.pl/en/|Senate]]. It is based on the [[PSC|Polish Sejm Corpus]] co-funded by project [[http://clip.ipipan.waw.pl/CESAR|CESAR]] and later extended using the support of [[http://clip.ipipan.waw.pl/CLARIN-PL|CLARIN-PL]], [[http://clip.ipipan.waw.pl/MARCELL|MARCELL]] and [[http://zil.ipipan.waw.pl/ParlaMint|ParlaMint]] projects.
 Line 5:
-The resources was collated based on the following corpora:
 * [[PSC|Polish Sejm Corpus]].
+== Corpus data ==
 
The current size of the corpus (as of 29 November 2022) amounts to over 800M segments with detailed distribution over houses, periods, and document types presented below. Apart from the stenographic records of plenary sittings
(271M segments) and committee sittings (330M segments), the corpus contains 199M segments of interpellations and questions.
-Line 8:
+Line 10:
-== Format ==
+||<tablewidth="100%">              ||<-7> '''Sejm'''                                                                                                         || ||<-6> '''Senate'''                                                                           ||
||              ||                        ||<-2> '''Sittings'''           ||<-2> '''Committees'''         ||<-2> '''Interpellations'''    || ||            ||              ||<-2> '''Sittings'''           ||<-2> '''Committees'''         ||
|| '''Years ''' || '''Period'''           || '''docs''' ||  '''segments'''|| '''docs''' || '''segments''' || '''docs''' || '''segments''' || ||            || '''Period''' || '''docs''' || '''segments''' || '''docs''' || '''segments''' ||
|| 1919–1922    || Legislative Sejm       ||<)>    312  ||<)>   6 945 162 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || || 1919–1922  ||<:>    –      ||<:>      –  ||<:>   –         ||<:>     –   ||<:>      –      ||
|| 1922–1927    || 1st term of office     ||<)>    277  ||<)>   7 338 355 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || || 1922–1927  ||    1st term  ||<)>     96  ||<)>  1 979 541  ||<:>     –   ||<:>      –      ||
|| 1928–1930    || 2nd                    ||<)>     58  ||<)>   2 139 835 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || || 1928–1930  ||    2nd       ||<)>      3  ||<)>    171 345  ||<:>     –   ||<:>      –      ||
|| 1930–1935    || 3rd                    ||<)>     72  ||<)>   2 404 267 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || || 1930–1935  ||    3rd       ||<)>     64  ||<)>  1 804 635  ||<:>     –   ||<:>      –      ||
|| 1935–1938    || 4th                    ||<)>     73  ||<)>   2 133 181 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || || 1935–1938  ||    4th       ||<)>     29  ||<)>    724 687  ||<:>     –   ||<:>      –      ||
|| 1938–1939    || 5th                    ||<)>     23  ||<)>     610 455 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || || 1938–1939  ||    5th       ||<)>     20  ||<)>    347 430  ||<:>     –   ||<:>      –      ||
|| 1943–1947    || State National Council ||<)>      6  ||<)>     234 514 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1947–1952    || Legislative Sejm       ||<)>    107  ||<)>   2 575 136 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1952–1956    || 1st term of office     ||<)>     39  ||<)>   1 172 333 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1957–1961    || 2nd                    ||<)>     59  ||<)>   2 502 936 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1961–1965    || 3rd                    ||<)>     32  ||<)>   1 388 862 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1965–1969    || 4th                    ||<)>     23  ||<)>   1 163 336 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1969–1972    || 5th                    ||<)>     17  ||<)>     526 277 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1972–1976    || 6th                    ||<)>     32  ||<)>   1 176 712 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1976–1980    || 7th                    ||<)>     29  ||<)>     918 993 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1980–1985    || 8th                    ||<)>     70  ||<)>   3 377 139 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1985–1989    || 9th                    ||<)>     45  ||<)>   2 641 788 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || ||            ||<:>     –     ||<:>     –   ||<:>     –       ||<:>     –   ||<:>      –      ||
|| 1989–1991    || 10th                   ||<)>     77  ||<)>   6 674 111 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || || 1989–1991  ||    1st term  ||<)>     60  ||<)>  3 170 293  ||<:>     –   ||<:>      –      ||
|| 1991–1993    || 1st term of office     ||<)>    142  ||<)>   7 739 147 ||<:>    –    ||<:>       –     ||<:>     –   ||<:>      –      || || 1991–1993  ||    2nd       ||<)>     48  ||<)>  1 459 440  ||<:>     –   ||<:>      –      ||
|| 1993–1997    || 2nd                    ||<)>    317  ||<)>  22 134 682 ||<)>  3 858  ||<)>  41 756 476 ||<:>     –   ||<:>      –      || || 1993–1997  ||    3rd       ||<)>    125  ||<)>  5 051 677  ||<:>     –   ||<:>      –      ||
|| 1997–2001    || 3rd                    ||<)>    320  ||<)>  24 138 142 ||<)>  4 690  ||<)>  42 510 604 ||<)>  23 507 ||<)>  12 101 453 || || 1997–2001  ||    4th       ||<)>    187  ||<)>  8 255 897  ||<:>     –   ||<:>      –      ||
|| 2001–2005    || 4th                    ||<)>    337  ||<)>  28 743 846 ||<)>  4 942  ||<)>  49 302 521 ||<)>  30 986 ||<)>  17 519 177 || || 2001–2005  ||    5th       ||<)>    175  ||<)>  6 485 347  ||<:>     –   ||<:>      –      ||
|| 2005–2007    || 5th                    ||<)>    148  ||<)>  11 737 186 ||<)>  2 359  ||<)>  18 970 036 ||<)>  26 689 ||<)>  14 777 377 || || 2005–2007  ||    6th       ||<)>     74  ||<)>  3 571 293  ||<:>     –   ||<:>      –      ||
|| 2007–2011    || 6th                    ||<)>    298  ||<)>  22 415 708 ||<)>  5 565  ||<)>  51 752 446 ||<)>  59 353 ||<)>  36 412 001 || || 2007–2011  ||    7th       ||<)>    167  ||<)>  8 819 116  ||<:>     –   ||<:>      –      ||
|| 2011–2015    || 7th                    ||<)>    292  ||<)>  22 488 262 ||<)>  5 126  ||<)>  44 645 569 ||<)>  85 599 ||<)>  61 565 989 || || 2011–2015  ||    8th       ||<)>    159  ||<)>  7 100 841  ||<:>     –   ||<:>      –      ||
|| 2015–2019    || 8th                    ||<)>    239  ||<)>  19 431 789 ||<)>  4 828  ||<)>  46 735 363 ||<)>  79 194 ||<)>  56 720 590 || || 2015–2019  ||    9th       ||<)>    204  ||<)> 10 240 444  ||<)>  2 156  ||<)> 16 011 501  ||
|| 2019–        || 9th                    ||<)>    159  ||<)>  11 426 833 ||<)>  2 731  ||<)>  23 818 878 ||<)>     –   ||<:>      –      || || 2019–      ||    10th      ||<)>    121  ||<)>  6 359 450  ||<)>   1 423  ||<)> 12 990 378 ||
-Line 10:
+Line 41:
-Corpus files are made available in TEI P5 format. The resource contains automatically created annotation of:
 * utterance-level segmentation,
 * tokenization, 
 * lemmatization, 
 * disambiguated morphosyntactic description,
 * syntactic words,
 * syntactic groups,
 * named entities.
+== Corpus format ==
-Line 19:
+Line 43:
-== Data ==
+Corpus files are made available in XML TEI P5 format compatible with the annotation used by the [[http://nkjp.pl/index.php?page=0&lang=1|National Corpus of Polish]]. The resource contains automatically created annotation of:
 * utterance-level segmentation, tokenization and lemmatization produced with [[http://morfeusz.sgjp.pl/|Morfeusz2]]
 * disambiguated morphosyntactic description produced with [[http://zil.ipipan.waw.pl/Concraft|Concraft2]]
 * named entities produced with [[http://nlp.pwr.wroc.pl/narzedzia-i-zasoby/narzedzia/liner2|Liner2]]
 * dependency structures produced with [[http://zil.ipipan.waw.pl/PDB/PDBparser|COMBO parser]].
-Line 21:
+Line 49:
-Please use the Polish Sejm Corpus data until newer version is made available.
-Line 23:
+Line 50:
-=== Whole corpus ===
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-all.tar|Interpellations and questions terms 3-6, sittings terms 1-6]], 25 GB
+== Download ==
-Line 26:
+Line 52:
-=== Divided by term and document type ===
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad3.tar|Interpellations and questions Term 3]], (1997–2001), 1.7 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad4.tar|Interpellations and questions Term 4]], (2001–05), 2.4 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad5.tar|Interpellations and questions Term 5]], (2005–07), 2.0 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Interpelacje-kad6.tar|Interpellations and questions Term 6]], (2007–11), 4.8 GB
+ * [[http://manage.legis.nlp.ipipan.waw.pl/download/ppc-nanno.tar.gz|Unannotated TEI version]] (1.7 GB) 
 * [[https://manage.legis.nlp.ipipan.waw.pl/download/ppc-anno.tar.gz|Linguistically annotated version]] (34.4 GB)
 * [[attachment:ppc-sample.zip|A small sample with data from different periods]] (39 MB)
 * [[http://git.nlp.ipipan.waw.pl/PPC/ppc|PPC data on GitLab]]
-Line 32:
+Line 57:
- * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad1.tar|Sittings Term 1]], (1991–93), 0.9 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad2.tar|Sittings Term 2]], (1993–97), 2.6 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad3.tar|Sittings Term 3]], (1997–2001), 2.8 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad4.tar|Sittings Term 4]], (2001–05), 3.4 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad5.tar|Sittings Term 5]], (2005–07), 1.4 GB
 * [[http://sejm.nlp.ipipan.waw.pl/static/PSC-Posiedzenia-kad6.tar|Sittings Term 6]], (2007–11), 3.1 GB
+== Searching the corpus ==

 * [[http://sejm.nlp.ipipan.waw.pl/|using the search engine]]
 * [[http://ngram.sejm.nlp.ipipan.waw.pl/|using the ngram viewer]]

== Licence ==

The parliamentary data is public domain. The corpus annotations are available on CC-BY (attribution) licence.
-Line 41:
+Line 68:
-So far please cite the Polish Sejm Corpus paper:
+<<BibMate(key, "ogr:etal:22:parlaclarin:kdp", "ogr:nit:20:parlaclarin", "ogr:18:parlaclarin", "ogro:12:lrec", omitYears=true)>>
-Line 43:
+Line 70:
-<<BibMate(key,"ogro:12:lrec",omitYears=true)>>
+== See also ==
-Line 45:
+Line 72:
-== Searching the corpus ==
+ * [[https://www.clarin.eu/sites/default/files/2-ogrodniczuk.pdf|The slides]] from [[https://www.clarin.eu/event/2017/clarin-plus-workshop-working-parliamentary-records|CLARIN-PLUS Workshop "Working with Parliamentary Records"]], Sofia, 27–29 March 2017.
 * [[https://www.youtube.com/watch?v=KEG_6WsTT5I|Webinar on PPC]] from CLARIN-PL workshop session (November 2020).
 * [[https://www.clarin.eu/content/parlamint-towards-comparable-parliamentary-corpora|ParlaMint]] project reusing a portion of data from the Polish Parliamentary Corpus in a multilingual setting.
-Line 47:
+Line 76:
-Online search of the corpus will be available soon.
+== Contact ==

[[http://zil.ipipan.waw.pl/MaciejOgrodniczuk|Maciej Ogrodniczuk]], Institute of Computer Science, Polish Academy of Sciences
