The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (Sejm and Senate). It is based on the Polish Sejm Corpus, co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.

Corpus format

Corpus files are available in the TEI P5 format, compatible with the annotation scheme used by the National Corpus of Polish. The resource contains automatically created annotation of:

utterance-level segmentation,
tokenization,
lemmatization,
disambiguated morphosyntactic description,
named entities.
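
As an illustration of working with the utterance level, the following minimal sketch reads TEI P5 `<u>` (utterance) elements with the Python standard library. The sample document, its nesting, the speaker identifiers and the function name `iter_utterances` are assumptions made for this example; the actual corpus files may be organised differently, so the XPath expression may need adjusting.

```python
import xml.etree.ElementTree as ET

# TEI P5 elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written sample; not copied from the corpus.
SAMPLE = """<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
  <TEI>
    <text>
      <body>
        <div>
          <u who="#speaker1">Otwieram posiedzenie.</u>
          <u who="#speaker2">Dziękuję, panie marszałku.</u>
        </div>
      </body>
    </text>
  </TEI>
</teiCorpus>
"""


def iter_utterances(xml_text):
    """Yield (speaker reference, utterance text) pairs from a TEI document."""
    root = ET.fromstring(xml_text)
    for u in root.iterfind(".//tei:u", TEI_NS):
        speaker = u.get("who")  # pointer to the speaker, if present
        text = "".join(u.itertext()).strip()  # flatten any nested markup
        yield speaker, text


if __name__ == "__main__":
    for speaker, text in iter_utterances(SAMPLE):
        print(speaker, text, sep="\t")
```

For large files, `ET.iterparse` could replace `ET.fromstring` to avoid loading a whole document into memory.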

Corpus data

The corpus currently comprises 749M segments; their detailed distribution over houses, periods, and document types is presented below. Apart from the stenographic records of plenary sittings (261M segments) and committee sittings (288M segments), the corpus contains 199M segments of interpellations and questions.

Years	Period	Sittings (docs)	Sittings (segments)	Committees (docs)	Committees (segments)	Interpellations (docs)	Interpellations (segments)
1919–1922	Legislative Sejm	312	6 945 162	–	–	–	–
1922–1927	1st term of office	277	7 338 355	–	–	–	–
1928–1930	2nd	58	2 139 835	–	–	–	–
1930–1935	3rd	72	2 404 267	–	–	–	–
1935–1938	4th	73	2 133 181	–	–	–	–
1938–1939	5th	23	610 455	–	–	–	–
1943–1947	State National Council	6	234 441	–	–	–	–
1947–1952	Legislative Sejm	107	2 575 136	–	–	–	–
1952–1956	1st term of office	39	1 172 333	–	–	–	–
1957–1961	2nd	59	2 502 936	–	–	–	–
1961–1965	3rd	32	1 388 862	–	–	–	–
1965–1969	4th	23	1 163 336	–	–	–	–
1969–1972	5th	17	526 277	–	–	–	–
1972–1976	6th	32	1 176 712	–	–	–	–
1976–1980	7th	29	918 993	–	–	–	–
1980–1985	8th	70	3 377 139	–	–	–	–
1985–1989	9th	45	2 641 788	–	–	–	–
1989–1991	10th	77	6 674 111	–	–	–	–
1991–1993	1st term of office	142	7 739 147	–	–	–	–
1993–1997	2nd	317	22 134 682	3 858	41 756 476	–	–
1997–2001	3rd	320	24 138 142	4 691	42 510 604	23 507	12 101 453
2001–2005	4th	337	28 743 846	4 945	49 302 521	30 986	17 519 177
2005–2007	5th	148	11 737 186	2 359	18 970 036	26 689	14 777 377
2007–2011	6th	298	22 415 708	5 565	44 363 063	59 353	36 412 001
2011–2015	7th	292	20 765 505	5 126	38 541 083	85 679	61 565 989
2015–2019	8th	239	19 131 000	4 561	36 708 873	79 194	56 720 590
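
To aggregate figures from rows like those above, the counts first need to be parsed: they use spaces as thousands separators and "–" where no data is available. The sketch below does this for four rows copied from the table and sums each column. The column names and the helper names `parse_count` and `column_totals` are invented for this example, and the totals cover only the rows included here.

```python
# Column names are labels chosen for this sketch.
COLUMNS = [
    "sittings_docs", "sittings_segments",
    "committees_docs", "committees_segments",
    "interpellations_docs", "interpellations_segments",
]

# Four tab-separated rows copied from the table above.
ROWS = (
    "1991–1993\t1st term of office\t142\t7 739 147\t–\t–\t–\t–\n"
    "2007–2011\t6th\t298\t22 415 708\t5 565\t44 363 063\t59 353\t36 412 001\n"
    "2011–2015\t7th\t292\t20 765 505\t5 126\t38 541 083\t85 679\t61 565 989\n"
    "2015–2019\t8th\t239\t19 131 000\t4 561\t36 708 873\t79 194\t56 720 590\n"
)


def parse_count(cell):
    """Turn '22 415 708' into 22415708; return None for '–' (no data)."""
    cell = cell.strip()
    if cell in {"–", "-", ""}:
        return None
    # Remove ordinary and non-breaking spaces used as thousands separators.
    return int(cell.replace(" ", "").replace("\u00a0", ""))


def column_totals(table_text):
    """Sum every numeric column, skipping cells without data."""
    totals = dict.fromkeys(COLUMNS, 0)
    for line in table_text.splitlines():
        if not line.strip():
            continue
        cells = line.split("\t")
        # The first two cells hold the years and the period name.
        for name, cell in zip(COLUMNS, cells[2:]):
            value = parse_count(cell)
            if value is not None:
                totals[name] += value
    return totals


if __name__ == "__main__":
    for name, total in column_totals(ROWS).items():
        print(f"{name}: {total:,}")
```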

Download

At present, you can download the unannotated TEI version (1.7 GB). A linguistically annotated version will be added soon.

Searching the corpus

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.

Publications

List of publications

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).