
PPC

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of both houses of the Polish Parliament, the Sejm and the Senate. It is based on the Polish Sejm Corpus, co-funded by the CESAR project, and was later extended with the support of the CLARIN-PL, MARCELL and ParlaMint projects.

Corpus data

As of 29 November 2022, the corpus contains over 800M segments; their distribution over houses, periods, and document types is presented below. Apart from the stenographic records of plenary sittings (271M segments) and committee sittings (330M segments), the corpus contains 199M segments of interpellations and questions.

| Years | Sejm period | Sejm sittings (docs) | Sejm sittings (segments) | Sejm committees (docs) | Sejm committees (segments) | Sejm interpellations (docs) | Sejm interpellations (segments) | Senate period | Senate sittings (docs) | Senate sittings (segments) | Senate committees (docs) | Senate committees (segments) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1919–1922 | Legislative Sejm | 312 | 6 945 162 | | | | | | | | | |
| 1922–1927 | 1st term of office | 277 | 7 338 355 | | | | | 1st term | 96 | 1 979 541 | | |
| 1928–1930 | 2nd | 58 | 2 139 835 | | | | | 2nd | 3 | 171 345 | | |
| 1930–1935 | 3rd | 72 | 2 404 267 | | | | | 3rd | 64 | 1 804 635 | | |
| 1935–1938 | 4th | 73 | 2 133 181 | | | | | 4th | 29 | 724 687 | | |
| 1938–1939 | 5th | 23 | 610 455 | | | | | 5th | 20 | 347 430 | | |
| 1943–1947 | State National Council | 6 | 234 514 | | | | | | | | | |
| 1947–1952 | Legislative Sejm | 107 | 2 575 136 | | | | | | | | | |
| 1952–1956 | 1st term of office | 39 | 1 172 333 | | | | | | | | | |
| 1957–1961 | 2nd | 59 | 2 502 936 | | | | | | | | | |
| 1961–1965 | 3rd | 32 | 1 388 862 | | | | | | | | | |
| 1965–1969 | 4th | 23 | 1 163 336 | | | | | | | | | |
| 1969–1972 | 5th | 17 | 526 277 | | | | | | | | | |
| 1972–1976 | 6th | 32 | 1 176 712 | | | | | | | | | |
| 1976–1980 | 7th | 29 | 918 993 | | | | | | | | | |
| 1980–1985 | 8th | 70 | 3 377 139 | | | | | | | | | |
| 1985–1989 | 9th | 45 | 2 641 788 | | | | | | | | | |
| 1989–1991 | 10th | 77 | 6 674 111 | | | | | 1st term | 60 | 3 170 293 | | |
| 1991–1993 | 1st term of office | 142 | 7 739 147 | | | | | 2nd | 48 | 1 459 440 | | |
| 1993–1997 | 2nd | 317 | 22 134 682 | 3 858 | 41 756 476 | | | 3rd | 125 | 5 051 677 | | |
| 1997–2001 | 3rd | 320 | 24 138 142 | 4 690 | 42 510 604 | 23 507 | 12 101 453 | 4th | 187 | 8 255 897 | | |
| 2001–2005 | 4th | 337 | 28 743 846 | 4 942 | 49 302 521 | 30 986 | 17 519 177 | 5th | 175 | 6 485 347 | | |
| 2005–2007 | 5th | 148 | 11 737 186 | 2 359 | 18 970 036 | 26 689 | 14 777 377 | 6th | 74 | 3 571 293 | | |
| 2007–2011 | 6th | 298 | 22 415 708 | 5 565 | 51 752 446 | 59 353 | 36 412 001 | 7th | 167 | 8 819 116 | | |
| 2011–2015 | 7th | 292 | 22 488 262 | 5 126 | 44 645 569 | 85 599 | 61 565 989 | 8th | 159 | 7 100 841 | | |
| 2015–2019 | 8th | 239 | 19 431 789 | 4 828 | 46 735 363 | 79 194 | 56 720 590 | 9th | 204 | 10 240 444 | 2 156 | 16 011 501 |
| 2019– | 9th | 159 | 11 426 833 | 2 731 | 23 818 878 | | | 10th | 121 | 6 359 450 | 1 423 | 12 990 378 |
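The segment counts above are written with spaces as thousands separators, so they need a small conversion step before any arithmetic. The following is a minimal sketch, not part of the corpus tooling: it copies the six interpellation segment counts for the Sejm terms covering 1997–2019 from the figures above and sums them, which reproduces the roughly 199M segments of interpellations and questions quoted in the text. The helper name `parse_count` and the dictionary are illustrative assumptions.

```python
# Interpellation segment counts per Sejm term, copied from the figures above.
INTERPELLATION_SEGMENTS = {
    "1997–2001": "12 101 453",
    "2001–2005": "17 519 177",
    "2005–2007": "14 777 377",
    "2007–2011": "36 412 001",
    "2011–2015": "61 565 989",
    "2015–2019": "56 720 590",
}


def parse_count(text: str) -> int:
    """Convert a space-grouped figure such as '12 101 453' to an int.

    str.split() with no argument splits on any whitespace, so ordinary
    and non-breaking spaces are both handled.
    """
    return int("".join(text.split()))


total = sum(parse_count(value) for value in INTERPELLATION_SEGMENTS.values())
print(f"{total:,}")  # 199,096,587
```

The same conversion can be applied to any other column, for example to recompute the plenary or committee totals per house.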

Corpus format

Corpus files are available in the TEI P5 XML format, compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotation of:

  • utterance-level segmentation, tokenization and lemmatization produced with Morfeusz2

  • disambiguated morphosyntactic description produced with Concraft2

  • named entities produced with Liner2

  • dependency structures produced with the COMBO parser.
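To give an idea of how such annotation might be read, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes the morphosyntactic layer follows the feature-structure layout used by the National Corpus of Polish: `seg` elements containing `f name="orth"` for the word form and `f name="interpretation"` for the disambiguated `lemma:tag` string. The embedded fragment is hypothetical and not taken from the corpus; the actual files may differ in detail, so check them before relying on these element and feature names.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical two-token fragment in an NKJP-style layout (an assumption,
# not copied from the corpus).
SAMPLE = """
<body xmlns="http://www.tei-c.org/ns/1.0">
  <s>
    <seg>
      <fs type="morph">
        <f name="orth"><string>Sejm</string></f>
        <f name="disamb">
          <fs type="tool_report">
            <f name="interpretation"><string>sejm:subst:sg:nom:m3</string></f>
          </fs>
        </f>
      </fs>
    </seg>
    <seg>
      <fs type="morph">
        <f name="orth"><string>obraduje</string></f>
        <f name="disamb">
          <fs type="tool_report">
            <f name="interpretation"><string>obradować:fin:sg:ter:imperf</string></f>
          </fs>
        </f>
      </fs>
    </seg>
  </s>
</body>
"""


def read_tokens(xml_text):
    """Return one dict per <seg> with its word form, lemma and tag."""
    root = ET.fromstring(xml_text)
    tokens = []
    for seg in root.iter(TEI_NS + "seg"):
        values = {}
        # Collect every feature that directly wraps a <string> value.
        for feature in seg.iter(TEI_NS + "f"):
            string = feature.find(TEI_NS + "string")
            if string is not None:
                values[feature.get("name")] = string.text
        # The interpretation is assumed to be 'lemma:tag'; a lemma that
        # itself contains ':' would need more careful handling.
        lemma, _, tag = values.get("interpretation", "").partition(":")
        tokens.append({"orth": values.get("orth"), "lemma": lemma, "tag": tag})
    return tokens


for token in read_tokens(SAMPLE):
    print(token["orth"], token["lemma"], token["tag"])
```

For real corpus files, `ET.parse(path).getroot()` could replace `ET.fromstring`, and `ET.iterparse` may be preferable for large documents to keep memory use low.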

Download

Searching the corpus

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.

Publications

List of publications

Maciej Ogrodniczuk, Michał Rudolf, Beata Wójtowicz, and Sonia Janicka. Error correction environment for the Polish Parliamentary Corpus. In Proceedings of The Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pages 35–38, Marseille, France, 2022. European Language Resources Association.

Maciej Ogrodniczuk and Bartłomiej Nitoń. New developments in the Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the Second ParlaCLARIN Workshop, pages 1–4, Marseille, France, 2020. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pages 15–19, Paris, France, 2018. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

See also

Contact

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences