Diff for "PPC"

Differences between revisions 46 and 47
Revision 46 as of 2020-02-28 20:34:59
Size: 2294
Comment:
Revision 47 as of 2020-03-05 18:08:25
Size: 6454
Comment:
The Polish Parliamentary Corpus contains:
 * Sejm sittings from 1919–present (including Legislative Sejm and State National Council)
 * Sejm committee sittings from 1993–present
 * Sejm interpellations and questions from 1997–present
 * Senate sittings from 1922–1939 and 1989–present
 * Senate committee sittings from 2015–present
The current size of the corpus amounts to 749M segments with detailed distribution over houses, periods, and document types presented below. Apart from the stenographic records of plenary sittings (261M segments) and committee sittings (288M segments), the corpus contains 199M segments of interpellations and questions.

|| || ||<-2> '''Sittings''' ||<-2> '''Committees''' ||<-2> '''Interpellations''' ||
|| '''Years''' || '''Period''' || '''docs''' || '''segments''' || '''docs''' || '''segments''' || '''docs''' || '''segments''' ||
|| 1919–1922 || Legislative Sejm ||<)> 312 ||<)> 6 945 162 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1922–1927 || 1st term of office ||<)> 277 ||<)> 7 338 355 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1928–1930 || 2nd ||<)> 58 ||<)> 2 139 835 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1930–1935 || 3rd ||<)> 72 ||<)> 2 404 267 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1935–1938 || 4th ||<)> 73 ||<)> 2 133 181 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1938–1939 || 5th ||<)> 23 ||<)> 610 455 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1943–1947 || State National Council ||<)> 6 ||<)> 234 441 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1947–1952 || Legislative Sejm ||<)> 107 ||<)> 2 575 136 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1952–1956 || 1st term of office ||<)> 39 ||<)> 1 172 333 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1957–1961 || 2nd ||<)> 59 ||<)> 2 502 936 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1961–1965 || 3rd ||<)> 32 ||<)> 1 388 862 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1965–1969 || 4th ||<)> 23 ||<)> 1 163 336 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1969–1972 || 5th ||<)> 17 ||<)> 526 277 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1972–1976 || 6th ||<)> 32 ||<)> 1 176 712 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1976–1980 || 7th ||<)> 29 ||<)> 918 993 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1980–1985 || 8th ||<)> 70 ||<)> 3 377 139 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1985–1989 || 9th ||<)> 45 ||<)> 2 641 788 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1989–1991 || 10th ||<)> 77 ||<)> 6 674 111 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1991–1993 || 1st term of office ||<)> 142 ||<)> 7 739 147 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1993–1997 || 2nd ||<)> 317 ||<)> 22 134 682 ||<)> 3 858 ||<)> 41 756 476 ||<:> – ||<:> – ||
|| 1997–2001 || 3rd ||<)> 320 ||<)> 24 138 142 ||<)> 4 691 ||<)> 42 510 604 ||<)> 23 507 ||<)> 12 101 453 ||
|| 2001–2005 || 4th ||<)> 337 ||<)> 28 743 846 ||<)> 4 945 ||<)> 49 302 521 ||<)> 30 986 ||<)> 17 519 177 ||
|| 2005–2007 || 5th ||<)> 148 ||<)> 11 737 186 ||<)> 2 359 ||<)> 18 970 036 ||<)> 26 689 ||<)> 14 777 377 ||
|| 2007–2011 || 6th ||<)> 298 ||<)> 22 415 708 ||<)> 5 565 ||<)> 44 363 063 ||<)> 59 353 ||<)> 36 412 001 ||
|| 2011–2015 || 7th ||<)> 292 ||<)> 20 765 505 ||<)> 5 126 ||<)> 38 541 083 ||<)> 85 679 ||<)> 61 565 989 ||
|| 2015–2019 || 8th ||<)> 239 ||<)> 19 131 000 ||<)> 4 561 ||<)> 36 708 873 ||<)> 79 194 ||<)> 56 720 590 ||

The Polish Parliamentary Corpus / Korpus Dyskursu Parlamentarnego

The Polish Parliamentary Corpus (PPC) is a large collection of linguistically analysed documents from the proceedings of the Polish Parliament (the Sejm and the Senate). It is based on the Polish Sejm Corpus, co-funded by the CESAR project, and is currently being updated within the CLARIN-PL infrastructure.

Corpus format

Corpus files are made available in the TEI P5 format, compatible with the annotation used by the National Corpus of Polish. The resource contains automatically created annotation at the following levels:

  • utterance-level segmentation,
  • tokenization,
  • lemmatization,
  • disambiguated morphosyntactic description,
  • named entities.
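
As an illustration of working with TEI P5 data, the following minimal sketch groups utterances by speaker using Python's standard library. The sample document, the speaker identifiers, and the function name are hypothetical; the sketch assumes that utterances are encoded with the standard TEI `u` element carrying a `who` attribute, which may differ from the actual layout of the corpus files.

```python
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical miniature document; real corpus files may be organised differently.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div>
        <u who="#speaker1">Wysoka Izbo!</u>
        <u who="#speaker2">Dziękuję, panie marszałku.</u>
        <u who="#speaker1">Otwieram posiedzenie.</u>
      </div>
    </body>
  </text>
</TEI>"""


def utterances_by_speaker(xml_text):
    """Map each value of the `who` attribute to the list of its utterance texts."""
    root = ET.fromstring(xml_text)
    result = {}
    for u in root.iterfind(".//tei:u", TEI_NS):
        speaker = u.get("who", "unknown")
        # itertext() also collects text nested in child elements, if any.
        result.setdefault(speaker, []).append("".join(u.itertext()).strip())
    return result


if __name__ == "__main__":
    for speaker, texts in utterances_by_speaker(SAMPLE).items():
        print(speaker, len(texts), texts)
```

For real files, `ET.parse(path).getroot()` could replace `ET.fromstring`; the element and attribute names should be checked against the downloaded data first.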

Corpus data

The current size of the corpus amounts to 749M segments with detailed distribution over houses, periods, and document types presented below. Apart from the stenographic records of plenary sittings (261M segments) and committee sittings (288M segments), the corpus contains 199M segments of interpellations and questions.

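|| || ||<-2> '''Sittings''' ||<-2> '''Committees''' ||<-2> '''Interpellations''' ||
|| '''Years''' || '''Period''' || '''docs''' || '''segments''' || '''docs''' || '''segments''' || '''docs''' || '''segments''' ||
|| 1919–1922 || Legislative Sejm ||<)> 312 ||<)> 6 945 162 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1922–1927 || 1st term of office ||<)> 277 ||<)> 7 338 355 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1928–1930 || 2nd ||<)> 58 ||<)> 2 139 835 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1930–1935 || 3rd ||<)> 72 ||<)> 2 404 267 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1935–1938 || 4th ||<)> 73 ||<)> 2 133 181 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1938–1939 || 5th ||<)> 23 ||<)> 610 455 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1943–1947 || State National Council ||<)> 6 ||<)> 234 441 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1947–1952 || Legislative Sejm ||<)> 107 ||<)> 2 575 136 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1952–1956 || 1st term of office ||<)> 39 ||<)> 1 172 333 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1957–1961 || 2nd ||<)> 59 ||<)> 2 502 936 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1961–1965 || 3rd ||<)> 32 ||<)> 1 388 862 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1965–1969 || 4th ||<)> 23 ||<)> 1 163 336 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1969–1972 || 5th ||<)> 17 ||<)> 526 277 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1972–1976 || 6th ||<)> 32 ||<)> 1 176 712 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1976–1980 || 7th ||<)> 29 ||<)> 918 993 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1980–1985 || 8th ||<)> 70 ||<)> 3 377 139 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1985–1989 || 9th ||<)> 45 ||<)> 2 641 788 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1989–1991 || 10th ||<)> 77 ||<)> 6 674 111 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1991–1993 || 1st term of office ||<)> 142 ||<)> 7 739 147 ||<:> – ||<:> – ||<:> – ||<:> – ||
|| 1993–1997 || 2nd ||<)> 317 ||<)> 22 134 682 ||<)> 3 858 ||<)> 41 756 476 ||<:> – ||<:> – ||
|| 1997–2001 || 3rd ||<)> 320 ||<)> 24 138 142 ||<)> 4 691 ||<)> 42 510 604 ||<)> 23 507 ||<)> 12 101 453 ||
|| 2001–2005 || 4th ||<)> 337 ||<)> 28 743 846 ||<)> 4 945 ||<)> 49 302 521 ||<)> 30 986 ||<)> 17 519 177 ||
|| 2005–2007 || 5th ||<)> 148 ||<)> 11 737 186 ||<)> 2 359 ||<)> 18 970 036 ||<)> 26 689 ||<)> 14 777 377 ||
|| 2007–2011 || 6th ||<)> 298 ||<)> 22 415 708 ||<)> 5 565 ||<)> 44 363 063 ||<)> 59 353 ||<)> 36 412 001 ||
|| 2011–2015 || 7th ||<)> 292 ||<)> 20 765 505 ||<)> 5 126 ||<)> 38 541 083 ||<)> 85 679 ||<)> 61 565 989 ||
|| 2015–2019 || 8th ||<)> 239 ||<)> 19 131 000 ||<)> 4 561 ||<)> 36 708 873 ||<)> 79 194 ||<)> 56 720 590 ||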
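
The 199M figure quoted for interpellations and questions can be cross-checked against the per-term counts in the table above. The following is a minimal sketch of that arithmetic; the dictionary and function names are illustrative only, and the counts are copied by hand from the table rather than read from the corpus.

```python
# Interpellation and question segment counts per Sejm term,
# copied by hand from the table above (thousands separated by spaces).
INTERPELLATION_SEGMENTS = {
    "1997–2001": "12 101 453",
    "2001–2005": "17 519 177",
    "2005–2007": "14 777 377",
    "2007–2011": "36 412 001",
    "2011–2015": "61 565 989",
    "2015–2019": "56 720 590",
}


def parse_count(text):
    """Convert a count written with whitespace-separated digit groups to an int."""
    # str.split() with no argument also splits on non-breaking spaces.
    return int("".join(text.split()))


def total_segments(counts):
    """Sum all counts in a {period: formatted_count} mapping."""
    return sum(parse_count(value) for value in counts.values())


if __name__ == "__main__":
    total = total_segments(INTERPELLATION_SEGMENTS)
    print(f"{total:,} segments, about {round(total / 1e6)}M")
    # prints: 199,096,587 segments, about 199M
```

The same helper could be applied to the sittings and committees columns, keeping in mind that the rows shown here cover the Sejm only, so their sums need not match the overall totals for plenary and committee sittings.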

Download

At present you can download the unannotated TEI version (1.7 GB). The linguistically annotated version will be added soon.

Searching the corpus

Licence

The parliamentary data is in the public domain. The corpus annotations are available under the CC-BY (attribution) licence.

Publications

List of publications

Maciej Ogrodniczuk. The Polish Sejm Corpus. In Proceedings of the Eighth International Conference on Language Resources and Evaluation, LREC 2012, pages 2219–2223, Istanbul, Turkey, 2012. European Language Resources Association (ELRA).

Please see also the slides from the CLARIN-PLUS workshop "Working with Parliamentary Records", Sofia, 27–29 March 2017.

Contact

Maciej Ogrodniczuk, Institute of Computer Science, Polish Academy of Sciences