The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of both houses of the Polish Parliament, the Sejm and the Senate. It is based on the Polish Sejm Corpus, which was co-funded by the CESAR project, and was later extended with the support of the CLARIN-PL, MARCELL and ParlaMint projects.

Corpus data

As of 22 May 2025, the corpus contains over 870M segments; their distribution over houses, periods, and document types is presented below. Apart from the stenographic records of plenary sittings (271M segments) and committee sittings (330M segments), the corpus contains 199M segments of interpellations and questions.

Where no document count is listed, the docs cell is left empty.

Sejm

Years	Period	Sittings docs	Sittings segments	Committees docs	Committees segments	Interpellations docs	Interpellations segments
1919–1922	Legislative Sejm	312	6 945 162	–	–	–	–
1922–1927	1st term of office	277	7 338 355	–	–	–	–
1928–1930	2nd		2 139 835	–	–	–	–
1930–1935	3rd		2 404 267	–	–	–	–
1935–1938	4th		2 133 181	–	–	–	–
1938–1939	5th		610 455	–	–	–	–
1943–1947	State National Council		234 514	–	–	–	–
1947–1952	Legislative Sejm	107	2 575 136	–	–	–	–
1952–1956	1st term of office		1 172 333	–	–	–	–
1957–1961	2nd		2 502 936	–	–	–	–
1961–1965	3rd		1 388 862	–	–	–	–
1965–1969	4th		1 163 336	–	–	–	–
1969–1972	5th		526 277	–	–	–	–
1972–1976	6th		1 176 712	–	–	–	–
1976–1980	7th		918 993	–	–	–	–
1980–1985	8th		3 377 139	–	–	–	–
1985–1989	9th		2 641 788	–	–	–	–
1989–1991	10th		6 674 111	–	–	–	–
1991–1993	1st term of office	142	7 739 147	–	–	–	–
1993–1997	2nd	317	22 134 682	3 858	41 756 476	–	–
1997–2001	3rd	320	24 138 142	4 690	42 510 604	23 507	12 101 453
2001–2005	4th	337	28 743 846	4 942	49 302 521	30 986	17 519 177
2005–2007	5th	148	11 737 186	2 359	18 970 036	26 689	14 777 377
2007–2011	6th	298	22 415 708	5 565	51 752 446	59 353	36 412 001
2011–2015	7th	292	22 488 262	5 126	44 645 569	85 599	61 565 989
2015–2019	8th	239	19 431 789	4 828	46 735 363	79 194	56 720 590
2019–2023	9th	198	15 074 054	3 768	32 117 982	–	–
2023–	10th		7 994 439	1 876	13 209 520	–	–

Senate

Years	Period	Sittings docs	Sittings segments	Committees docs	Committees segments
1919–1922	–	–	–	–	–
1922–1927	1st term		1 979 541	–	–
1928–1930	2nd		171 345	–	–
1930–1935	3rd		1 804 635	–	–
1935–1938	4th		724 687	–	–
1938–1939	5th		347 430	–	–
1989–1991	1st term		3 170 293	–	–
1991–1993	2nd		1 459 440	–	–
1993–1997	3rd	125	5 051 677	–	–
1997–2001	4th	187	8 255 897	–	–
2001–2005	5th	175	6 485 347	–	–
2005–2007	6th		3 571 293	–	–
2007–2011	7th	167	8 819 116	–	–
2011–2015	8th	159	7 100 841	–	–
2015–2019	9th	204	10 240 444	2 156	16 012 609
2019–2023	10th	148	7 992 356	1 859	16 777 361
2023–	11th		1 609 769	506	4 238 073
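
The per-term figures can be cross-checked against the rounded totals quoted above. The following is a minimal Python sketch that sums the interpellation segment counts of the six Sejm terms for which they are listed (1997–2019); the dictionary and the helper name `parse_count` are illustrative only and not part of any corpus tooling.

```python
# Interpellation segment counts per Sejm term, copied from the figures above.
INTERPELLATION_SEGMENTS = {
    "1997–2001": "12 101 453",
    "2001–2005": "17 519 177",
    "2005–2007": "14 777 377",
    "2007–2011": "36 412 001",
    "2011–2015": "61 565 989",
    "2015–2019": "56 720 590",
}


def parse_count(text: str) -> int:
    """Convert a space-grouped figure such as '12 101 453' to an int."""
    # split() without arguments drops any whitespace, including
    # non-breaking spaces that often survive copy and paste.
    return int("".join(text.split()))


total = sum(parse_count(v) for v in INTERPELLATION_SEGMENTS.values())
print(f"{total:,}")  # 199,096,587
```

The result agrees with the 199M segments of interpellations and questions stated in the summary.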

Corpus format

Corpus files are made available in TEI P5 XML format, compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotation of:

utterance-level segmentation, tokenization and lemmatization produced with Morfeusz2
disambiguated morphosyntactic description produced with Concraft2
named entities produced with Liner2
dependency structures produced with the COMBO parser.
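
As a rough illustration of working with TEI-encoded files, the Python sketch below counts elements by name in a TEI document using only the standard library. The sample fragment is hypothetical and heavily simplified, and the use of `seg` for segments and `s` for sentences is an assumption based on National Corpus of Polish conventions; the actual corpus files may be organised differently, so inspect a downloaded file before relying on these names.

```python
import xml.etree.ElementTree as ET
from collections import Counter

TEI_NS = "http://www.tei-c.org/ns/1.0"  # standard TEI P5 namespace

# Hypothetical, simplified fragment; real corpus files are much richer.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <text><body>
    <p><s><seg>Wysoka</seg><seg>Izbo</seg><seg>!</seg></s></p>
    <p><s><seg>Dziękuję</seg><seg>.</seg></s></p>
  </body></text>
</TEI>"""


def count_elements(xml_text: str) -> Counter:
    """Count elements in the TEI namespace by local name."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for el in root.iter():
        # ElementTree writes qualified names as '{namespace}local'.
        if el.tag.startswith(f"{{{TEI_NS}}}"):
            counts[el.tag.split("}", 1)[1]] += 1
    return counts


counts = count_elements(SAMPLE)
print(counts["seg"], counts["s"])  # 5 2
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`, with the same counting loop.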

Download

Please use the links for individual terms of office. The XML version contains TEI-encoded source data, metadata, and linguistic annotation in CCL format; the PDFs are the original source files (often non-searchable).

1919–1922	XML	PDF
1922–1927	XML	PDF
1928–1930	XML	PDF
1930–1935	XML	PDF
1935–1938	XML	PDF
1938–1939	XML	PDF
1943–1947	XML	PDF
1947–1952	XML	PDF
1952–1956	XML	PDF
1957–1961	XML	PDF
1961–1965	XML	PDF
1965–1969	XML	PDF
1969–1972	XML	PDF
1972–1976	XML	PDF
1976–1980	XML	PDF
1980–1985	XML	PDF
1985–1989	XML	PDF
1989–1991	XML	PDF
1991–1993	XML	PDF
1993–1997	XML	PDF
1997–2001	XML	PDF
2001–2005	XML	PDF
2005–2007	XML	PDF
2007–2011	XML	PDF
2011–2015	XML	PDF
2015–2019	XML	PDF
2019–2023	XML	PDF
2023–2027	XML	PDF

You can also take a look at:

Searching the corpus

Korpus Dyskursu Parlamentarnego search engine (in Polish)

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (Attribution) licence.

Publications

List of publications

Maciej Ogrodniczuk, Michał Rudolf, Beata Wójtowicz, and Sonia Janicka. Error correction environment for the Polish Parliamentary Corpus. In Proceedings of The Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference, pages 35–38, Marseille, France, 2022. European Language Resources Association.

Maciej Ogrodniczuk and Bartłomiej Nitoń. New developments in the Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the Second ParlaCLARIN Workshop, pages 1–4, Marseille, France, 2020. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. Polish Parliamentary Corpus. In Darja Fišer, Maria Eskevich, and Franciska de Jong, editors, Proceedings of the LREC 2018 Workshop ParlaCLARIN: Creating and Using Parliamentary Corpora, pages 15–19, Paris, France, 2018. European Language Resources Association (ELRA).

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

Contact

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences
