
Diff for "PL196x"

Differences between revisions 3 and 43 (spanning 40 versions)
Revision 3 as of 2011-04-11 12:05:42
Size: 5654
Comment:
Revision 43 as of 2021-03-22 14:06:57
Size: 13252
Comment:
Line 1: Line 1:
= Polish language of the XX century sixties = ## page was renamed from Polish language of the XX century sixties
= Polish language of the 1960s
=
Line 3: Line 4:
This page is dedicated to the corpus of frequency dictionary of contemporary Polish. The original purpose of the corpus was to create a general frequency dictionary of contemporary Polish. The work started in 1967. Partial results were published between 1972 and 1977, the completed dictionary in 1990. The corpus was later augmented in various respects, both by manual editing and automated procedures. This page is dedicated to the corpus of [[http://rcin.org.pl/dlibra/docmetadata?from=rss&id=2054|frequency dictionary of contemporary Polish]]. The original purpose of the corpus was to create a general frequency dictionary of contemporary Polish. The work started in 1967. Partial results were published between 1972 and 1977, the completed dictionary in 1990. The corpus was later augmented in various respects, both by manual editing and automated procedures.
Line 9: Line 10:
The corpus was manually annotated using the IPI PAN tagset — each sentence by one annotator, there was no superannotation stage.

In 2012, under ICT-PSP project [[CESAR]], the corpus with IPI PAN tagset annotation was automatically converted to annotation adhering to the NKJP tagset. The data from the automatic conversion into the NKJP tagset was used when simulating the manual annotation of NKJP – it served as data from one of the two annotators. The second annotator was a human annotator, who annotated the sentence from scratch. As in NKJP manual annotation, the human annotator inspected interpretations of segments in a given sentence generated automatically by the morphological analyser – the human annotator could either choose one of these interpretations, or enter a different interpretation manually. If the human annotator selected a different annotation than the one resulting from automatic conversion, the sentence was returned to the human annotator for inspection. Without knowing the result of automatic conversion, the human annotator could either change their earlier annotation or keep it. After this stage, if there were still conflicting interpretations, the superannotator would decide about the correct interpretation.

NOTE: If you are looking for a more fresh frequency data, please take a look at [[http://zil.ipipan.waw.pl/NKJPNGrams|NKJP ngrams]].

== Citing the electronic version of the corpus ==

 * Ogrodniczuk, Maciej. ''[[attachment:ogrodniczuk-newedition.pdf|Nowa edycja wzbogaconego korpusu słownika frekwencyjnego]]''. (In Polish, EN: ''New edition of the Enhanced corpus of the Frequency dictionary''). [In:] Językoznawstwo w Polsce. Stan i perspektywy. Stanisław Gajda (ed.) Institute of Polish Philology, Polish Academy of Sciences – Linguistics Committee, Opole University. Opole 2003, pp. 181–190. ISBN 83-86881-36-4.

Line 11: Line 23:
 * Bień, Janusz S.; Woliński, Marcin. Enhanced corpus of the frequency dictionary of contemporary Polish. (In Polish). December 17th, 2001.
 * Bień, Janusz S.; Woliński, Marcin. Numerical grammatical codes in enhanced corpus of the frequency dictionary. (In Polish). December 17th, 2001.
 * Głowińska, Katarzyna. Morphological taxonomy for the frequency dictionary. (In Polish)
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Szafran, Krzysztof; Woronczak, Jerzy. Polish language in the sixties (introduction to the printed edition of the frequency dictionary).
 * Ogrodniczuk, Maciej. Enhancing the corpus of the frequency dictionary with new grammatical codes. (In Polish)
 * Bień, Janusz S.; Woliński, Marcin. ''[[attachment:wksf.pdf|Wzbogacony korpus Słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''Enhanced corpus of the frequency dictionary of contemporary Polish''). December 17th, 2001.
 * Bień, Janusz S.; Woliński, Marcin. ''[[attachment:kodynum.pdf|Numeryczne kody gramatyczne we wzbogaconym korpusie Słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''Numerical grammatical codes in enhanced corpus of the frequency dictionary''). December 17th, 2001.
 * Głowińska, Katarzyna. ''[[attachment:taksonomia.pdf|Taksonomia morfologiczna dla Słownika frekwencyjnego]]''. (In Polish, EN: ''Morphological taxonomy for the frequency dictionary'').
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Szafran, Krzysztof; Woronczak, Jerzy. ''[[attachment:pl196x-doc-en.pdf|Polish language in the sixties]]'' (introduction to the printed edition of the frequency dictionary, available also [[attachment:pl196x-doc-pl.pdf|in Polish]]).
 * Ogrodniczuk, Maciej. ''[[attachment:operacje.pdf|Wzbogacenie korpusu słownika frekwencyjnego o nowe kody gramatyczne]]''. (In Polish, EN: ''Enhancing the corpus of the frequency dictionary with new grammatical codes'').
Line 19: Line 31:
 * Bień, Janusz S.; Woliński, Marcin. Enhanced corpus of the Frequency dictionary of contemporary Polish. (In Polish) [In:] Prace lingwistyczne dedykowane prof. Jadwidze Sambor. Jadwiga Linde-Usiekniewicz (ed.), pp. 6-10, Warszawa 2003, Faculty of Polish Philology, Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. Vocabulary of contemporary Polish. Frequency lists. Volume I. Scientific texts. (In Polish) Warszawa, 1974. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. Vocabulary of contemporary Polish. Frequency lists. Volume II. News. (In Polish) Warszawa, 1974. Warsaw University.
 * Lewicki, Andrzej; Masłowski, Władysław; Sambor, Jadwiga; Woronczak, Jerzy. Vocabulary of contemporary Polish. Frequency lists. Volume III. Essays. (In Polish) Warszawa, 1975. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. Vocabulary of contemporary Polish. Frequency lists. Volume IV. Fiction. (In Polish) Warszawa, 1976. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. Vocabulary of contemporary Polish. Frequency lists. Volume V. Plays. (In Polish) Warszawa, 1977. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Szafran, Krzysztof; Woronczak, Jerzy. Frequency dictionary of contemporary Polish. (In Polish) Kraków, 1990. Institute of Polish Philology, Polish Academy of Sciences.
 * Nazarczuk, Marta. Initial preparation of the corpus of Frequency dictionary of contemporary Polish for CD-ROM distribution. (In Polish) Master thesis prepared under supervision of prof. Janusz S. Bień. Warsaw, 1997. Institute of Polish Philology, Warsaw University. 59 pages, CD-ROM.
 * Ogrodniczuk, Maciej. New edition of the Enhanced corpus of the Frequency dictionary.. (In Polish) [In:] Językoznawstwo w Polsce. Stan i perspektywy. Stanisław Gajda (ed.) Institute of Polish Philology, Polish Academy of Sciences - Linguistics Committee, Opole University. Opole 2003, pp. 181-190. ISBN 83-86881-36-4.
 * Ogrodniczuk, Maciej. Augmenting the morphological description in the corpus of Frequency dictionary of contemporary Polish. (In Polish) [In:] Prace lingwistyczne dedykowane prof. Jadwidze Sambor. Jadwiga Linde-Usiekniewicz (ed.), pp. 164-168, Warszawa 2003, Faculty of Polish Philology, Warsaw University.
 * Ogrodniczuk, Maciej. Encoding of Polish linguistic data with SGML and TEI. (In Polish) Master thesis prepared under supervision of prof. Janusz S. Bień. Warsaw, 2000. Institute of Informatics, Warsaw University. 83 pages, CD-ROM.
 * Saloni, Zygmunt. Frequency dictionary of contemporary Polish. (In Polish) ComputerWorld, November 4th 1991, pp. 16-17.
 * Awramiuk Elżbieta. ''Wpływ odstępstw od segmentacji ortograficznej na wyniki statystyczne Słownika frekwencyjnego polszczyzny współczesnej''. (In Polish, EN: ''Influence of deviations from orthographic segmentation on the statistical results in the frequency dictionary of contemporary Polish''). Roczniki Humanistyczne Uniwersytetu w Białymstoku 2001–2002, t. 49–50, z. 6, s. 31–43. Białystok 2001.
 * Bień, Janusz S.; Woliński, Marcin. ''[[attachment:wksf2.pdf|Wzbogacony korpus Słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''Enhanced corpus of the Frequency dictionary of contemporary Polish''). [In:] Prace lingwistyczne dedykowane prof. Jadwidze Sambor. Jadwiga Linde-Usiekniewicz (ed.), pp. 6-10, Warszawa 2003, Faculty of Polish Philology, Warsaw University.
 * Czerepowicka, Monika; Saloni, Zygmunt. ''[[attachment:saloni.pdf|Co skreślano i co dopisywano w korpusie Słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''What was deleted and what was added in the frequency dictionary of contemporary Polish''), pp. 381–391.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=3279|Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom I. Teksty popularnonaukowe – część I]] i [[http://rcin.org.pl/dlibra/docmetadata?id=3282|II]]''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume I. Scientific texts''). Warszawa, 1974. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=3280|Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom II. Drobne wiadomości prasowe – część I]]''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume II. News''). Warszawa, 1974. Warsaw University.
 * Lewicki, Andrzej; Masłowski, Władysław; Sambor, Jadwiga; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=3284|Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom III. Publicystyka – część I]] i [[http://rcin.org.pl/dlibra/docmetadata?id=3285|II]]''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume III. Essays''). Warszawa, 1975. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=3286|Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom IV. Proza artystyczna – część I]], [[http://rcin.org.pl/dlibra/docmetadata?id=3288|II]] i [[http://rcin.org.pl/dlibra/docmetadata?id=3292|III]]''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume IV. Fiction''). Warszawa, 1976. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom V. Dramat artystyczny''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume V. Plays''). Warszawa, 1977. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Szafran, Krzysztof; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=2054|Słownik frekwencyjny polszczyzny współczesnej – tom I]] i [[http://rcin.org.pl/dlibra/docmetadata?id=2093|II]]''. (In Polish, EN: ''Frequency dictionary of contemporary Polish''). Kraków, 1990. Institute of Polish Philology, Polish Academy of Sciences.
 * Nazarczuk, Marta. ''[[attachment:nazarczuk-msc.pdf|Wstepne przygotowanie korpusu «Słownika frekwencyjnego polszczyzny współczesnej» do dystrybucji na CD-ROM]]''. (In Polish, EN: ''Initial preparation of the corpus of Frequency dictionary of contemporary Polish for CD-ROM distribution''). Master thesis prepared under supervision of prof. Janusz S. Bień. Warsaw, 1997. Institute of Polish Philology, Warsaw University. 59 pages, CD-ROM.
 * Ogrodniczuk, Maciej. ''[[attachment:SFPW.pdf|Rozszerzenie opisów morfologicznych w tekstach korpusu słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''Augmenting the morphological description in the corpus of Frequency dictionary of contemporary Polish''). [In:] Prace lingwistyczne dedykowane prof. Jadwidze Sambor. Jadwiga Linde-Usiekniewicz (ed.), pp. 164–168, Warszawa 2003, Faculty of Polish Philology, Warsaw University.
 * Ogrodniczuk, Maciej. ''[[attachment:ogrodniczuk-msc.pdf|Wykorzystanie SGML i TEI do zapisu polskich danych lingwistycznych]]''. (In Polish, EN: ''Encoding of Polish linguistic data with SGML and TEI''). Master thesis prepared under supervision of prof. Janusz S. Bień. Warsaw, 2000. Institute of Informatics, Warsaw University. 83 pages, CD-ROM.
 * Saloni, Zygmunt. ''[[http://www.computerworld.pl/artykuly/315352/Slownik.frekwencyjny.polszczyzny.wspolczesnej.html|Słownik frekwencyjny polszczyzny współczesnej]]'' (In Polish, EN: ''Frequency dictionary of contemporary Polish''). !ComputerWorld, November 4th 1991, pp. 16–17.
Line 35: Line 47:
 * GNU Free Documentation Licence for corpus documentation.
 * GNU General Public Licence for corpus data.
 * [[attachment:gfdl.txt|GNU Free Documentation Licence]] for corpus documentation.
 * [[attachment:gpl.txt|GNU GPL]] / CC-BY for corpus data.
Line 40: Line 52:
=== Cluster samples === <<Anchor(newest)>>
=== Current version ===
Line 42: Line 55:
|| || Without codes || With codes || "Raw" version || Enhanced version || TEI P4 XML version || The current version of the corpus features:
 * manually updated annotation
 * TEI P5 data format
 * both annotation and data format are compatible with NKJP corpus
Line 44: Line 60:
|| Style A: Scientific texts || 1,1 MB || 4,0 MB || 10 MB ||
|| Style B: News || 1,2 MB || 3,9 MB || 9 MB ||
|| Style C: Essays || 1,2 MB || 4,0 MB || 10 MB ||
|| Style D: Fiction || 1,1 MB || 4,1 MB || 11 MB ||
|| Style E: Plays || 1,1 MB || 4,4 MB || 12 MB ||
Downloads:
 * [[attachment:KF.tar.gz|TEI P5 XML corpus files]]
 * [[attachment:KF_poliqarp.tar.gz|Poliqarp files]]

=== Original version ===

||<style="border: none"> ||<style="border-right: none; text-align: right; width=80px">Cluster ||<style="border-left: none; text-align: left; width: 80px">samples ||<style="text-align:center; width: 80px"> "Raw" ||<style="text-align:center; width: 80px"> Enhanced ||<style="text-align:center; width: 80px"> TEI P4 XML ||
||<style="border: none"> ||<style="text-align:center"> without codes ||<style="text-align:center"> with codes ||<style="border-right:none"> ||<style="text-align:center; border-left: none; border-right: none"> version ||<style="border-left:none"> ||
|| Style A: Scientific texts ||<style="text-align:right">[[attachment:fiszki-bez-a-publi.pdf|1 MB]] ||<style="text-align:right"> [[attachment:fiszki-z-a-publi.pdf|1,5 MB]] ||<style="text-align:right">[[attachment:surowy-a-publi.txt|1,1 MB]] ||<style="text-align:right">[[attachment:wzbogacony-a-publi.txt|4,0 MB]] ||<style="text-align:right"> [[attachment:a-publi.xml|10 MB]] ||
|| Style B: News ||<style="text-align:right">[[attachment:fiszki-bez-b-prasa.pdf|1 MB]] ||<style="text-align:right">[[attachment:fiszki-z-b-prasa.pdf|1,5 MB]] ||<style="text-align:right">[[attachment:surowy-b-prasa.txt|1,2 MB]] ||<style="text-align:right">[[attachment:wzbogacony-b-prasa.txt|3,9 MB]] ||<style="text-align:right"> [[attachment:b-prasa.xml|9 MB]] ||
|| Style C: Essays ||<style="text-align:right">[[attachment:fiszki-bez-c-popul.pdf|1 MB]] ||<style="text-align:right">[[attachment:fiszki-z-c-popul.pdf|1,5 MB]] ||<style="text-align:right">[[attachment:surowy-c-popul.txt|1,2 MB]] ||<style="text-align:right">[[attachment:wzbogacony-c-popul.txt|4,0 MB]] ||<style="text-align:right"> [[attachment:c-popul.xml|10 MB]] ||
|| Style D: Fiction ||<style="text-align:right">[[attachment:fiszki-bez-d-proza.pdf|1 MB]] ||<style="text-align:right">[[attachment:fiszki-z-d-proza.pdf|1,5 MB]] ||<style="text-align:right">[[attachment:surowy-d-proza.txt|1,1 MB]] ||<style="text-align:right">[[attachment:wzbogacony-d-proza.txt|4,1 MB]] ||<style="text-align:right"> [[attachment:d-proza.xml|11 MB]] ||
|| Style E: Plays ||<style="text-align:right">[[attachment:fiszki-bez-e-dramat.pdf|1 MB]]||<style="text-align:right">[[attachment:fiszki-z-e-dramat.pdf|1,5 MB]] ||<style="text-align:right">[[attachment:surowy-e-dramat.txt|1,1 MB]] ||<style="text-align:right">[[attachment:wzbogacony-e-dramat.txt|4,4 MB]] ||<style="text-align:right"> [[attachment:e-dramat.xml|12 MB]] ||
Line 52: Line 76:
 * Master file with TEI header for the corpus
 * Feature library (assembles library of feature elements)
 * Feature structure library (assembles library of feature structure elements) 
 * Writing system declaration 
 * Feature structures representing morphological descriptions for Polish
 * [[attachment:pl196x.xml|Master file]] with TEI header for the corpus
 * [[attachment:flib.xml|Feature library]] (assembles library of feature elements)
 * [[attachment:fslib.xml|Feature structure library]] (assembles library of feature structure elements)
 * [[attachment:iso88592.wsd|Writing system declaration]]
 * [[attachment:morf.fsd|Feature structures representing morphological descriptions for Polish]]
 * [[attachment:dtd.zip|TEI P4 DTD]]

[[attachment:wksf.iso.bz2|ISO image]] of the CD-ROM with most of the materials.
Line 59: Line 86:
=== Concordances === == Concordances ==
Line 61: Line 88:
 * Concordances with location [17.9 MB]
 * Concordances without location [24 MB]
 * [[attachment:konkordancja-z.zip|Concordances with location ]] [17.9 MB]
 * [[attachment:konkordancja-bez.zip|Concordances without location ]] [24 MB]

Polish language of the 1960s

This page is dedicated to the corpus of the frequency dictionary of contemporary Polish. The corpus was originally compiled to create a general frequency dictionary of contemporary Polish. Work started in 1967. Partial results were published between 1972 and 1977, and the completed dictionary in 1990. The corpus was later augmented in various respects, both by manual editing and by automated procedures.

The corpus data contain 10,000 samples divided into 5 parts: essays, news, scientific texts, fiction and plays. Each sample is approximately 50 words long, comes from a text published between 1963 and 1967, and carries a bibliographic description of its source. Each word is tagged with its base form and some morphological properties. Sentence boundaries are also marked.
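
The structure described above can be pictured with a small Python sketch. The class and field names (`Token`, `Sample`, `orth`, `base`, `tags`) and the example tag strings are hypothetical and chosen only for illustration; they are not the corpus's actual file format. The size estimate simply multiplies the two figures quoted above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    orth: str  # word form as it appears in the text
    base: str  # base form of the word
    tags: str  # morphological properties (illustrative notation only)


@dataclass
class Sample:
    style: str   # one of the five parts, e.g. "fiction"
    source: str  # bibliographic description of the source text
    sentences: List[List[Token]] = field(default_factory=list)

    def word_count(self) -> int:
        # Sentence boundaries are kept, so words are counted per sentence.
        return sum(len(sentence) for sentence in self.sentences)


def estimated_corpus_size(samples: int = 10_000, words_per_sample: int = 50) -> int:
    # Approximate only: each sample is "approximately 50 words long".
    return samples * words_per_sample


example = Sample(
    style="fiction",
    source="(bibliographic description of the source)",
    sentences=[[Token("Ala", "Ala", "noun"), Token("ma", "mieć", "verb"),
                Token("kota", "kot", "noun")]],
)
```

Under these figures the whole corpus comes to roughly half a million words.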

In 2001 the corpus authors agreed to publish the data on the Internet under a GNU licence. This site presents the corpus data in the base and extended (enhanced) versions, together with additional materials and corpus documentation.

The corpus was manually annotated using the IPI PAN tagset: each sentence was annotated by one annotator, and there was no superannotation stage.

In 2012, under the ICT-PSP project CESAR, the IPI PAN tagset annotation of the corpus was automatically converted to annotation adhering to the NKJP tagset. The converted data was then used to simulate the manual annotation procedure of NKJP, where it served as the output of one of the two annotators. The second annotator was a human, who annotated each sentence from scratch. As in NKJP manual annotation, the human annotator inspected the interpretations of the segments in a given sentence generated automatically by the morphological analyser, and could either choose one of these interpretations or enter a different one manually. If the human annotation differed from the result of automatic conversion, the sentence was returned to the human annotator for inspection. Without knowing the result of the automatic conversion, the annotator could either change their earlier annotation or keep it. If conflicting interpretations remained after this stage, a superannotator decided on the correct interpretation.
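
The decision flow of this procedure can be sketched in Python as follows. This is only an illustration of the steps described above, not the tooling actually used in the project: the function name, the callbacks standing in for the human annotator and the superannotator, and the representation of a sentence annotation as a list of tag strings are all assumptions.

```python
from typing import Callable, List

# Assumed representation: one tag string per segment of the sentence.
Annotation = List[str]


def resolve_sentence(
    converted: Annotation,
    human_first: Annotation,
    human_review: Callable[[Annotation], Annotation],
    superannotate: Callable[[Annotation, Annotation], Annotation],
) -> Annotation:
    """Return the final annotation of one sentence."""
    # Both "annotators" agree: nothing more to do.
    if human_first == converted:
        return human_first
    # Disagreement: the sentence goes back to the human annotator, who sees
    # only their own earlier annotation, not the automatic conversion.
    human_second = human_review(human_first)
    if human_second == converted:
        return human_second
    # Still conflicting: the superannotator decides.
    return superannotate(converted, human_second)
```

Passing the reviewer and the superannotator as callbacks keeps the sketch independent of how those decisions are collected in practice.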

NOTE: If you are looking for more recent frequency data, please take a look at NKJP ngrams.

Citing the electronic version of the corpus

  • Ogrodniczuk, Maciej. Nowa edycja wzbogaconego korpusu słownika frekwencyjnego. (In Polish, EN: New edition of the Enhanced corpus of the Frequency dictionary). [In:] Językoznawstwo w Polsce. Stan i perspektywy. Stanisław Gajda (ed.) Institute of Polish Philology, Polish Academy of Sciences – Linguistics Committee, Opole University. Opole 2003, pp. 181–190. ISBN 83-86881-36-4.

Corpus documentation

Selected bibliography

Corpus licence

  • GNU Free Documentation Licence for corpus documentation.
  • GNU GPL / CC-BY for corpus data.

Corpus data

Current version

The current version of the corpus features:

  • manually updated annotation
  • TEI P5 data format
  • both annotation and data format are compatible with NKJP corpus

Downloads:

  • TEI P5 XML corpus files (KF.tar.gz)
  • Poliqarp files (KF_poliqarp.tar.gz)
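
The TEI P5 corpus files are distributed as a gzip-compressed tar archive (KF.tar.gz). As a minimal sketch, assuming only that the archive contains XML files (its internal layout and element names are not described on this page), the following Python function counts element names across all XML files in such an archive using the standard library. The function name `count_xml_tags` is hypothetical.

```python
import tarfile
import xml.etree.ElementTree as ET
from collections import Counter


def count_xml_tags(archive_path: str) -> Counter:
    """Count XML element names in every .xml file of a .tar.gz archive."""
    counts: Counter = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".xml")):
                continue
            stream = tar.extractfile(member)
            if stream is None:
                continue
            # iterparse avoids loading a whole file into memory at once.
            for _event, elem in ET.iterparse(stream):
                # Drop the "{namespace}" prefix that ElementTree adds to names.
                counts[elem.tag.rsplit("}", 1)[-1]] += 1
                elem.clear()
    return counts
```

A call such as `count_xml_tags("KF.tar.gz")` gives a quick overview of which elements the files use before any corpus-specific processing is written.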

Original version

|| || Cluster samples without codes || Cluster samples with codes || "Raw" version || Enhanced version || TEI P4 XML version ||
|| Style A: Scientific texts || 1 MB || 1.5 MB || 1.1 MB || 4.0 MB || 10 MB ||
|| Style B: News || 1 MB || 1.5 MB || 1.2 MB || 3.9 MB || 9 MB ||
|| Style C: Essays || 1 MB || 1.5 MB || 1.2 MB || 4.0 MB || 10 MB ||
|| Style D: Fiction || 1 MB || 1.5 MB || 1.1 MB || 4.1 MB || 11 MB ||
|| Style E: Plays || 1 MB || 1.5 MB || 1.1 MB || 4.4 MB || 12 MB ||

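Auxiliary files for the TEI P4-encoded XML version:

  • Master file with TEI header for the corpus
  • Feature library (assembles library of feature elements)
  • Feature structure library (assembles library of feature structure elements)
  • Writing system declaration
  • Feature structures representing morphological descriptions for Polish
  • TEI P4 DTD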

ISO image of the CD-ROM with most of the materials.

Concordances

  • Concordances with location [17.9 MB]
  • Concordances without location [24 MB]