
Diff for "NationalCorpusOfPolish"

Differences between revisions 3 and 18 (spanning 15 versions)
Revision 3 as of 2014-01-30 15:15:35
Size: 777
Comment:
Revision 18 as of 2015-04-16 09:53:08
Size: 3668
Comment:
Line 2: Line 2:
== Official project website ==
 * [[http://nkjp.pl/index.php?page=0&lang=1|National Corpus of Polish]] (NKJP),
The National Corpus of Polish is a shared initiative of four institutions: the Institute of Computer Science at the Polish Academy of Sciences (coordinator), the Institute of Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź. It has been carried out as a research and development project of the Ministry of Science and Higher Education.

These four institutions began cooperating to build a reference corpus of the Polish language containing over fifteen hundred million words. The sources for the corpus include classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. For a corpus to be reliable, it must not only contain a large number of words but also cover a diversity of texts with respect to subject and genre. The recorded conversations should represent both male and female speakers, of various age groups and from various regions of Poland.

[[http://nkjp.pl/index.php?page=0&lang=1|Official project website]]
Line 6: Line 9:
 * [[attachment:NKJP-PodkorpusMilionowy-1.2.tgz]], the manually annotated 1-million word subcorpus of the NJKP, available on GNU GPL v.3,
The manually annotated 1-million-word subcorpus of the National Corpus of Polish, available under the GNU GPL v.3:
 * [[attachment:NKJP-PodkorpusMilionowy-1.2.tar.gz]] -- GZip compressed tar archive,
 * [[attachment:NKJP-PodkorpusMilionowy-1.2.tar.bz2]] -- BZip2 compressed tar archive.
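
Either archive above can be unpacked with standard tools. As a minimal sketch in Python, assuming the archive has already been downloaded to a local path: the helper name `extract_archive` and the destination directory are hypothetical and are not part of the NKJP distribution.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir):
    """Extract a (possibly compressed) tar archive into dest_dir.

    Returns the list of member names found in the archive.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect gzip, bzip2, or no compression on its own.
    with tarfile.open(archive_path, mode="r:*") as tar:
        members = tar.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse entries that would be written outside dest_dir.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return [m.name for m in members]


# Example call (the local file name is assumed to match the attachment above):
# names = extract_archive("NKJP-PodkorpusMilionowy-1.2.tar.bz2", "nkjp1m")
```

The same call should work for the GZip and the BZip2 variant, since the compression type is detected when the file is opened.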

==== XCES encoded version ====
The latest version of the manually annotated subcorpus, encoded using the XCES format.
 * [[attachment:nkjp1m-1.2-xces.xml.bz2]] -- XCES encoded corpus; the order of the source directories in the combined file is listed [[attachment:morph-order.txt|here]],
 * [[attachment:nkjp1m-1.2-xces-reanalysed-sgjp.xml.bz2]] -- XCES encoded corpus, reanalysed according to the procedure described [[http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki/Training|here]], using Morfeusz SGJP,
 * [[attachment:nkjp1m-1.2-xces-reanalysed-polimorf.xml.bz2]] -- XCES encoded corpus, reanalysed according to the procedure described [[http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki/Training|here]], using Morfeusz Polimorf.
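
The XCES files above are large BZip2-compressed XML documents, so it may be convenient to stream them instead of loading them whole. The following is a rough sketch only. It assumes the layout commonly used for Polish XCES corpora: `<tok>` elements containing an `<orth>` child and one or more `<lex>` children, each with `<base>` and `<ctag>`, where the chosen interpretation carries `disamb="1"`. Check these element names against the actual files. The function names `iter_tokens` and `iter_tokens_bz2` are hypothetical.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_tokens(stream):
    """Yield (orth, base, ctag) for each <tok> element read from a binary stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "tok":
            continue
        orth = elem.findtext("orth", default="")
        # Prefer the interpretation marked disamb="1"; fall back to the first <lex>.
        lexes = elem.findall("lex")
        chosen = next((lx for lx in lexes if lx.get("disamb") == "1"),
                      lexes[0] if lexes else None)
        base = chosen.findtext("base", default="") if chosen is not None else ""
        ctag = chosen.findtext("ctag", default="") if chosen is not None else ""
        yield orth, base, ctag
        elem.clear()  # release the token's children to keep memory use low


def iter_tokens_bz2(path_or_file):
    """Same as iter_tokens, but decompress BZip2 data on the fly."""
    with bz2.open(path_or_file, "rb") as stream:
        yield from iter_tokens(stream)


# Tiny in-memory sample in the assumed layout (not taken from the corpus).
SAMPLE = (
    b'<chunkList><chunk type="s">'
    b'<tok><orth>Ala</orth>'
    b'<lex disamb="1"><base>Ala</base><ctag>subst:sg:nom:f</ctag></lex></tok>'
    b'<tok><orth>.</orth>'
    b'<lex disamb="1"><base>.</base><ctag>interp</ctag></lex></tok>'
    b'</chunk></chunkList>'
)

for orth, base, ctag in iter_tokens(io.BytesIO(SAMPLE)):
    print(orth, base, ctag, sep="\t")

# With a downloaded file (local name assumed to match the attachment):
# for orth, base, ctag in iter_tokens_bz2("nkjp1m-1.2-xces.xml.bz2"):
#     ...
```

Streaming with `iterparse` trades convenience for memory: tokens are processed one at a time, so sentence or paragraph boundaries would need extra handling if they matter.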

==== Tagger models, trained on the latest version of the subcorpus ====
 * [[attachment:concraft-model-nkjp1m-1.2.gz]] -- model for Concraft 0.7.1 tagger,
 * [[attachment:pantera-model-nkjp1m-1.2.bz2]] -- model for PANTERA 0.9.1 tagger,
 * [[attachment:wcrft-model-nkjp1m-1.2.tar.gz]] -- model for WCRFT1 and WCRFT2 taggers,
 * [[attachment:wmbt-model-nkjp1m-1.2.tar.bz2]] -- model for WMBT tagger.

The training data for all the models above was preprocessed using the Morfeusz SGJP morphological analyzer.
Line 9: Line 28:
 * [[http://zil.ipipan.waw.pl/DistrNKJP|Distributable version of NKJP]],
 * [[http://zil.ipipan.waw.pl/DistrNKJP|Distributable versions of NKJP]],
Line 13: Line 32:

== Reporting errors ==
Please use the [[https://docs.google.com/spreadsheet/viewform?hl=pl&formkey=dERoLWhzYWNveXlvS09ZMDlRNmcydVE6MQ#gid=0|linked form]] to report any errors found in the corpus.

== Older files ==
For reference and comparison purposes, older versions of the manually annotated subcorpus are available:
 * [[attachment:NKJP-PodkorpusMilionowy-1.1.tgz]],
 * [[attachment:NKJP-PodkorpusMilionowy-1.0.tgz]].

National Corpus of Polish

The National Corpus of Polish is a shared initiative of four institutions: the Institute of Computer Science at the Polish Academy of Sciences (coordinator), the Institute of Polish Language at the Polish Academy of Sciences, Polish Scientific Publishers PWN, and the Department of Computational and Corpus Linguistics at the University of Łódź. It has been carried out as a research and development project of the Ministry of Science and Higher Education.

These four institutions began cooperating to build a reference corpus of the Polish language containing over fifteen hundred million words. The sources for the corpus include classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts. For a corpus to be reliable, it must not only contain a large number of words but also cover a diversity of texts with respect to subject and genre. The recorded conversations should represent both male and female speakers, of various age groups and from various regions of Poland.

Official project website

Manually annotated subcorpus

The manually annotated 1-million-word subcorpus of the National Corpus of Polish, available under the GNU GPL v.3:

XCES encoded version

The latest version of the manually annotated subcorpus, encoded using the XCES format.

Tagger models, trained on the latest version of the subcorpus

The training data for all the models above was preprocessed using the Morfeusz SGJP morphological analyzer.

Additional resources

Reporting errors

Please use the linked form to report any errors found in the corpus.

Older files

For reference and comparison purposes, older versions of the manually annotated subcorpus are available: