Diff for "PL196x"

Differences between revisions 8 and 31 (spanning 23 versions)
Revision 8 as of 2011-04-11 14:45:30
Size: 9824
Comment:
Revision 31 as of 2017-06-29 14:35:57
Size: 12193
Comment:
Line 1: Line 1:
= Polish language of the XX century sixties =
## page was renamed from Polish language of the XX century sixties
= Polish language of the 1960s =
Line 3: Line 4:
This page is dedicated to the corpus of frequency dictionary of contemporary Polish. The original purpose of the corpus was to create a general frequency dictionary of contemporary Polish. The work started in 1967. Partial results were published between 1972 and 1977, the completed dictionary in 1990. The corpus was later augmented in various respects, both by manual editing and automated procedures.
This page is dedicated to the corpus of [[http://rcin.org.pl/dlibra/docmetadata?from=rss&id=2054|frequency dictionary of contemporary Polish]]. The original purpose of the corpus was to create a general frequency dictionary of contemporary Polish. The work started in 1967. Partial results were published between 1972 and 1977, the completed dictionary in 1990. The corpus was later augmented in various respects, both by manual editing and automated procedures.
Line 8: Line 9:

In 2012 all corpus data were manually verified by two independent annotators; please see the most current data [[#newest|below]]. This work was financed by the ICT-PSP project [[CESAR]].

NOTE: If you are looking for more recent frequency data, please see [[http://zil.ipipan.waw.pl/NKJPNGrams|NKJP ngrams]].
Line 19: Line 24:
 * Awramiuk Elżbieta. ''Wpływ odstępstw od segmentacji ortograficznej na wyniki statystyczne Słownika frekwencyjnego polszczyzny współczesnej''. (In Polish). Roczniki Humanistyczne Uniwersytetu w Białymstoku 2001-2002, t. 49-50, z. 6, s. 31–43. Białystok 2001.
 * Awramiuk Elżbieta. ''Wpływ odstępstw od segmentacji ortograficznej na wyniki statystyczne Słownika frekwencyjnego polszczyzny współczesnej''. (In Polish, EN: ''Influence of deviations from orthographic segmentation on the statistical results in the frequency dictionary of contemporary Polish''). Roczniki Humanistyczne Uniwersytetu w Białymstoku 2001-2002, t. 49-50, z. 6, s. 31–43. Białystok 2001.
Line 21: Line 26:
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''Vocabulary of contemporary Polish. Frequency lists''. Volume I. Scientific texts. (In Polish). Warszawa, 1974. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''Vocabulary of contemporary Polish. Frequency lists''. Volume II. News. (In Polish). Warszawa, 1974. Warsaw University.
 * Lewicki, Andrzej; Masłowski, Władysław; Sambor, Jadwiga; Woronczak, Jerzy. ''Vocabulary of contemporary Polish. Frequency lists''. Volume III. Essays. (In Polish). Warszawa, 1975. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''Vocabulary of contemporary Polish. Frequency lists''. Volume IV. Fiction. (In Polish). Warszawa, 1976. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''Vocabulary of contemporary Polish. Frequency lists''. Volume V. Plays. (In Polish) Warszawa, 1977. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Szafran, Krzysztof; Woronczak, Jerzy. ''Frequency dictionary of contemporary  Polish''. (In Polish). Kraków, 1990. Institute of Polish Philology, Polish Academy of Sciences.
 * Czerepowicka, Monika; Saloni, Zygmunt. ''[[attachment:saloni.pdf|Co skreślano i co dopisywano w korpusie Słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''What was deleted and what was added in the frequency dictionary of contemporary Polish''), pp. 381-391.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=3279|Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom I. Teksty popularnonaukowe – część I]] i [[http://rcin.org.pl/dlibra/docmetadata?id=3282|II]]''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume I. Scientific texts''). Warszawa, 1974. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=3280|Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom II. Drobne wiadomości prasowe – część I]]''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume II. News''). Warszawa, 1974. Warsaw University.
 * Lewicki, Andrzej; Masłowski, Władysław; Sambor, Jadwiga; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=3284|Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom III. Publicystyka – część I]] i [[http://rcin.org.pl/dlibra/docmetadata?id=3285|II]]''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume III. Essays''). Warszawa, 1975. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=3286|Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom IV. Proza artystyczna – część I]], [[http://rcin.org.pl/dlibra/docmetadata?id=3288|II]] i [[http://rcin.org.pl/dlibra/docmetadata?id=3292|III]]''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume IV. Fiction''). Warszawa, 1976. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Woronczak, Jerzy. ''Słownictwo współczesnego języka polskiego. Listy frekwencyjne. Tom V. Dramat artystyczny''. (In Polish, EN: ''Vocabulary of contemporary Polish. Frequency lists. Volume V. Plays''). Warszawa, 1977. Warsaw University.
 * Kurcz, Ida; Lewicki, Andrzej; Sambor, Jadwiga; Szafran, Krzysztof; Woronczak, Jerzy. ''[[http://rcin.org.pl/dlibra/docmetadata?id=2054|Słownik frekwencyjny polszczyzny współczesnej – tom I]] i [[http://rcin.org.pl/dlibra/docmetadata?id=2093|II]]''. (In Polish, EN: ''Frequency dictionary of contemporary Polish''). Kraków, 1990. Institute of Polish Philology, Polish Academy of Sciences.
Line 29: Line 35:
 * Ogrodniczuk, Maciej. ''[[attachment:SFPW.pdf|Rozszerzenie opisów morfologicznych w tekstach korpusu
słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''Augmenting the morphological description in the corpus of Frequency dictionary of contemporary Polish''). [In:] Prace lingwistyczne dedykowane prof. Jadwidze Sambor. Jadwiga Linde-Usiekniewicz (ed.), pp. 164-168, Warszawa 2003, Faculty of Polish Philology, Warsaw University.
 * Ogrodniczuk, Maciej. ''[[attachment:SFPW.pdf|Rozszerzenie opisów morfologicznych w tekstach korpusu słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''Augmenting the morphological description in the corpus of Frequency dictionary of contemporary Polish''). [In:] Prace lingwistyczne dedykowane prof. Jadwidze Sambor. Jadwiga Linde-Usiekniewicz (ed.), pp. 164-168, Warszawa 2003, Faculty of Polish Philology, Warsaw University.
Line 32: Line 37:
 * Saloni, Zygmunt. ''Frequency dictionary of contemporary Polish''. (In Polish). ComputerWorld, November 4th 1991, pp. 16-17.
 * Saloni, Zygmunt. ''[[attachment:saloni.pdf|Co skreślano i co dopisywano w korpusie Słownika frekwencyjnego polszczyzny współczesnej]]''. (In Polish, EN: ''What was deleted and what was added in the frequency dictionary of contemporary Polish''). pp. 381-391.
 * Saloni, Zygmunt. ''[[http://www.computerworld.pl/artykuly/315352/Slownik.frekwencyjny.polszczyzny.wspolczesnej.html|Słownik frekwencyjny polszczyzny współczesnej]]'' (In Polish, EN: ''Frequency dictionary of contemporary Polish''). !ComputerWorld, November 4th 1991, pp. 16-17.
Line 38: Line 42:
 * [[attachment:gpl.txt|GNU General Public Licence]] for corpus data.
 * [[attachment:gpl.txt|GNU GPL]] for corpus data.
Line 41: Line 45:

<<Anchor(newest)>>
=== Latest version ===

The latest version of the corpus features:
 * manually updated annotation
 * TEI P5 data format
 * annotation and data format compatible with the NKJP corpus

Downloads: temporarily disabled, please check back after July 30.
{{{#!wiki comment
''' * [[attachment:KF.tar.gz|TEI P5 XML corpus files]]
''' * [[attachment:KF_poliqarp.tar.gz|Poliqarp files]]
}}}
=== Original version ===

Polish language of the 1960s

This page is dedicated to the corpus of the frequency dictionary of contemporary Polish. The original purpose of the corpus was to create a general frequency dictionary of contemporary Polish. The work started in 1967. Partial results were published between 1972 and 1977, and the completed dictionary in 1990. The corpus was later augmented in various respects, both by manual editing and by automated procedures.

The corpus data contain 10,000 samples divided into 5 parts: essays, news, scientific texts, fiction and plays. Every sample is approximately 50 words long. All samples come from texts published between 1963 and 1967, and each includes a bibliographic description of its source. Each word is tagged with its base form and some morphological properties. Sentence boundaries are also marked.
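
The figures quoted above imply a rough overall corpus size, which can be worked out as a quick sanity check. The sketch below is only an illustration: the function name is hypothetical, and the even split of samples across the five parts is an assumption that this page does not state.

```python
# Rough corpus-size estimate from the figures quoted on this page.
# ASSUMPTION: samples are spread evenly over the five parts; the page
# itself only gives the total number of samples and the sample length.
SAMPLES = 10_000
WORDS_PER_SAMPLE = 50  # "approximately 50 words", so results are approximate
PARTS = ["essays", "news", "scientific texts", "fiction", "plays"]


def estimate_corpus_size(samples=SAMPLES, words_per_sample=WORDS_PER_SAMPLE):
    """Return the approximate number of running words in the corpus."""
    return samples * words_per_sample


total_words = estimate_corpus_size()
words_per_part = total_words // len(PARTS)  # valid only under the even-split assumption

print(f"approx. total words: {total_words}")
print(f"approx. words per part: {words_per_part}")
```

Because each sample is only approximately 50 words long, both numbers should be read as orders of magnitude rather than exact counts.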

In 2001 the corpus authors agreed to publish the data on the Internet under the GNU General Public Licence. This site presents the corpus data in base and extended (enhanced) versions, as well as additional materials and corpus documentation.

In 2012 all corpus data were manually verified by two independent annotators; please see the most current data below. This work was financed by the ICT-PSP project CESAR.

NOTE: If you are looking for more recent frequency data, please see NKJP ngrams.

Corpus documentation

Selected bibliography

Corpus licence

Corpus data

Latest version

The latest version of the corpus features:

  • manually updated annotation
  • TEI P5 data format
  • annotation and data format compatible with the NKJP corpus

Downloads: temporarily disabled, please check back after July 30.

Original version

|| || Cluster samples || "Raw" without codes || "Raw" with codes || Enhanced version || TEI P4 XML ||
|| Style A: Scientific texts || 1 MB || 1.5 MB || 1.1 MB || 4.0 MB || 10 MB ||
|| Style B: News || 1 MB || 1.5 MB || 1.2 MB || 3.9 MB || 9 MB ||
|| Style C: Essays || 1 MB || 1.5 MB || 1.2 MB || 4.0 MB || 10 MB ||
|| Style D: Fiction || 1 MB || 1.5 MB || 1.1 MB || 4.1 MB || 11 MB ||
|| Style E: Plays || 1 MB || 1.5 MB || 1.1 MB || 4.4 MB || 12 MB ||

Auxiliary files for the TEI P4-encoded XML version:

ISO image of the CD-ROM with most of the materials.

Concordances