Differences between revisions 78 and 115 (spanning 37 versions)

Language Tools and Resources for Polish

This page contains a list of publicly available language tools and resources.

Written corpora and corpus-related tools

National Corpus of Polish (NKJP)
- Poliqarp search engine for NKJP data, a search engine for the National Corpus of Polish,
- PELCRA search engine for NKJP data, a search engine for the National Corpus of Polish,
- Kolokator, a collocation extraction tool for NKJP data,
- TEI4NKJP, a collection of XML schemata used in NKJP,
- NKJP-PodkorpusMilionowy-1.0.tgz, the manually annotated 1-million-word subcorpus of the NKJP, available under GNU GPL v.3,
- gramatyka_Spejd_NKJP_RC1.0.zip, a release candidate of a shallow Spejd grammar for NKJP, available under GNU GPL v.3,
- Nerf, a tool for named entity recognition, available under GNU GPL v.3,
IPI PAN Corpus,
PWN Corpus,
PELCRA Corpus,
Dictionaries as Corpora,
Polish language of the 1960s,
Old Polish corpus,
PICLE corpus, the Polish sub-corpus of the International Corpus of Learner English (ICLE),
Poliqarp, a corpus indexing and search engine,
Anotatornia, a system for multi-level manual annotation of corpora,
Smyrna, a simple, light-weight Polish concordancer.

Parallel corpora

ParaSol, a parallel corpus of Slavic and other languages,
PolUKR, a Polish-Ukrainian parallel corpus,
OPUS, an open source parallel corpus (European Parliament, EMEA, KDE, movie subtitles),
"1984", an annotated parallel corpus of George Orwell's "1984" in 15 languages, MULTEXT-East, v.4 (licensed download),
InterCorp, a multilingual parallel corpus,
Leeds collection of Internet corpora,
LAGUN corpus,
JRC-Acquis Multilingual Parallel Corpus,
PSI collection of parallel corpora, a growing collection of parallel corpora pairing Polish with other European languages.

Spoken corpora

The annotated corpus of spoken dialogues (LUNA project, corpus data available at the end of the page).

Translation memories

MyMemory, freely available multilingual TM,
TAUS Data, a multilingual TM from the members of TAUS Data Association.

Morphological tools and resources

Morfeusz SGJP, morphological analyser (Z. Saloni, W. Gruszczyński, M. Woliński, R. Wołosz),
Index a tergo of Polish word forms (J. Tokarski, Z. Saloni),
Morfologik, morphological analyser (M. Miłkowski, D. Weiss),
SAM, morphological analyser (K. Szafran),
UAM Text Tools (P. Obrębski, Z. Vetulani; see also http://utt.wmi.amu.edu.pl/trac/wiki/),
MULTEXT-East, v.4, morphosyntactic specifications and documentation for 16 languages,
KIPI->MTE, a converter from TaKIPI to MULTEXT-East morphosyntactic format (A. Radziszewski, N. Kotsyba),
MACA, Morphological Analysis Converter and Aggregator (A. Radziszewski, T. Śniatowski),
Lexical analyser and a Polish proof-reader (S. Galus),
Neurosoft Gram (demo of a morphological analyser),
Baza fleksyjna języka polskiego, inflection database of Polish words (W. Lubaszewski, B. Moskal, P. Pietras, P. Pisarek, T. Rokicka),
Finite state utilities (J. Daciuk),
Stempel, another stemmer (A. Białecki),
WCCL, toolkit for morphosyntactic feature generation (A. Radziszewski, A. Wardyński, T. Śniatowski, P. Kędzia).

Taggers

TaKIPI, a morphosyntactic tagger for Polish,
PANTERA, a morphosyntactic tagger for Polish,
WMBT, a morphosyntactic tagger for Polish.

Parsers, grammars, treebanks

Świgra, a DCG parser,
Spejd, a shallow parsing and disambiguation system,
Dendrarium, a treebank development system (under development),
A Treebank / Test Suite for Polish,
Visualisation of parsing tree forests (Świdziński's grammar, Świgra, Morfeusz, Bień's syntactic spreadsheets) by Andrzej Zaborowski,
Analizator syntaktyczny AS (M. Woliński),
Formalny opis składniowy zdań polskich (S. Szpakowicz),
Serwer LAS / Linguistic Analysis Server.

Machine-readable dictionaries

plWordNet, Polish WordNet (M. Piasecki),
Polish OpenThesaurus, a crowdsourced Polish thesaurus (M. Miłkowski),
Słownik języka polskiego (d. alternatywny), Polish ispell dictionaries, along with some definitions and online form display,
Sample morphosyntactic Polish lexicon, the MULTEXT-East morphosyntactic lexicons,
Słownik składniowy języka polskiego (Z. Greń),
N-gram model of Polish (AGH DSP),
Nowy słownik angielsko-polski (T. Piotrowski, Z. Saloni),
Polish OpenCYC (A. Pohl),
Polish machine-generated dictionaries, available under a Creative Commons licence (J. Kazojć),
List of all Polish surnames, licence unknown, see further information on this resource,
Gazetteer for Polish Named Entities (A. Savary, J. Piskorski).

Human-readable dictionaries

Wielki Słownik Języka Polskiego,
Wikisłownik,
Słownik wyrazów obcych i zwrotów obcojęzycznych Władysława Kopalińskiego,
Słownik synonimów i antonimów Piotra Żmigrodzkiego,
Poliqarp for DjVu search engine for J. Karłowicz, A. Kryński, W. Niedźwiedzki. Dictionary of Polish. Warsaw 1900–1927,
Poliqarp for DjVu search engine for S. Bąk, M. R. Mayenowa, F. Pepłowski (eds.). Dictionary of the 16th century Polish. Wrocław — Warszawa, 1966-???? (work in progress),
Poliqarp for DjVu search engine for M. Samuel Bogumił Linde. Dictionary of Polish (2nd edition). Lwów 1854-1861,
Poliqarp for DjVu search engine for B. Chlebowski, F. Sulimierski, W. Walewski (eds.), The Geographical Dictionary of the Polish Kingdom and other Slavic Countries, Warszawa 1880-1902.

Speech analysis and synthesis tools

Skrybot, commercial speech recognition system (L. Pawlaczyk, P. Bosky),
Ivona, commercial text-to-speech system (Expressivo),
Acapela, text-to-speech demo,
Synteza mowy polskiej, automatic speech recognition and speech synthesis demos, with background information (K. Szklanny),
System syntezy mowy ciągłej (G. Demenko, S. Grocholewski),
Polish MBROLA database (K. Szklanny, K. Marasek),
SynTalk, commercial speech synthesis system (NeuroSoft),
PrimeSpeech, commercial speech recognition systems,
OrtFon, phonetic transcriber (AGH DSP),
ASR, automatic speech recognition system for Polish (AGH DSP),
Anotator, a speech corpus annotator dedicated to Polish and focused on connecting existing resources (AGH DSP),
Speech corpus, a word-annotated Polish speech corpus of around 9 hours (AGH DSP),
AV corpus, an audiovisual corpus of Polish speech (AGH DSP).

Machine translation demonstrations

iTranslate4.eu (multiple languages, allows comparing translation engines),
Translatica (EN-PL-EN, DE-PL-DE, RU-PL-RU), see also Poleng website with an experimental FR-PL-FR version,
Bing Translator (multilingual),
Google Translate (multilingual),
InterTran (multilingual),
LingvoBit (EN-PL-EN),
Systran (EN-PL, PL-FR and some more),
Esperantilo (integrated Esperanto editor, with MT for EO-PL-DE-EN-SV),
Thetos (PL-Sign language).

Other

Kolokacje, a Web crawler and collocation finder (A. Buczyński),
WSDDE, a system for designing and performing Word Sense Disambiguation experiments (R. Młodzki et al.),
Frazeo, a search engine and clusterer of news in Polish (P. Pęzik),
Segment, a rule-based sentence tokenizer supporting SRX standard (J. Lipski; the Polish rules are available in LanguageTool project),
Translatica SRX sentence segmentation rules for Polish (LGPL),
Lakon, a system for news summarization (master's thesis by A. Dudczak),
SyMGIZA++, an extension of Giza++ that computes symmetric word alignment models,
Multiservice, a sample interface for running NLP Web services for Polish,
Hipisek, an experimental question answering system (M. Walas).

- Revision 78 as of 2011-05-19 16:20:12
  Size: 10233
  Editor: MariuszZiolko
  Comment:
+ Revision 115 as of 2011-11-21 14:17:21
  Size: 12608
  Editor: MariuszZiolko
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 6:
- * [[http://nkjp.pl/index.php?page=0&amp;lang=1|National Corpus of Polish]] (under development),
+ * [[http://nkjp.pl/index.php?page=0&lang=1|National Corpus of Polish]] (NKJP)
  * [[http://nkjp.pl/poliqarp/|Poliqarp search engine for NKJP data]], a search engine for the National Corpus of Polish,
  * [[http://nkjp.uni.lodz.pl/|PELCRA search engine for NKJP data]], a search engine for the National Corpus of Polish,
  * [[http://www.nkjp.uni.lodz.pl/collocations.jsp|Kolokator]], a collocation extraction tool for NKJP data,
  * [[http://nlp.ipipan.waw.pl/TEI4NKJP/|TEI4NKJP]], a collection of XML schemata used in NKJP,
  * [[attachment:NKJP-PodkorpusMilionowy-1.0.tgz]], the manually annotated 1-million word subcorpus of the NJKP, available on GNU GPL v.3,
  * [[attachment:gramatyka_Spejd_NKJP_RC1.0.zip]], a release candidate of a shallow [[Spejd]] grammar for NKJP, available on GNU GPL v.3,
  * [[Nerf]], a tool for named entity recognition, available on GNU GPL v.3,
-Line 11:
+Line 18:
- * [[Polish language of the XX century sixties]],
+ * [[PL196x|Polish language of the 1960s]],
-Line 28:
+Line 35:
+== Spoken corpora ==

 * [[http://clip.ipipan.waw.pl/LUNA|The annotated corpus of spoken dialogues]] (LUNA project, corpus data available at the end of the page)
-Line 47:
+Line 58:
- * [[http://getopt.org/stempel/|Stempel]], another stemmer (A. Białecki).
+ * [[http://getopt.org/stempel/|Stempel]], another stemmer (A. Białecki),
 * [[http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki/|WCCL]], toolkit for morphosyntactic feature generation (A. Radziszewski, A. Wardyński, T. Śniatowski, P. Kędzia).
-Line 51:
+Line 63:
- * [[http://code.google.com/p/pantera-tagger/|PANTERA]], a morphosyntactic tagger for Polish.
+ * [[http://code.google.com/p/pantera-tagger/|PANTERA]], a morphosyntactic tagger for Polish,
 * [[http://nlp.pwr.wroc.pl/redmine/projects/wmbt/wiki|WMBT]], a morphosyntactic tagger for Polish.
-Line 58:
+Line 71:
- * [[http://amos.klf.uw.edu.pl/| Visualisation of parsing tree forests]] (Świdziński's grammar,Świgra, Morfeusz, Bień's syntactic spreadsheets) by Andrzej Zaborowski,
+ * [[http://amos.klf.uw.edu.pl/| Visualisation of parsing tree forests]] (Świdziński's grammar, Świgra, Morfeusz, Bień's syntactic spreadsheets) by Andrzej Zaborowski,
-Line 68:
+Line 81:
- * [[http://www.ispan.waw.pl/zakjez/pracjcz/slowniki/slowniki.html|Słownik składniowy języka polskiego]] (Z. Greń).
 * [[http://home.agh.edu.pl/~ziolko/doku.php?id=pl:resources:ngram|N-gram model of Polish]] (AGH DSP)
+ * [[http://www.ispan.waw.pl/zakjez/pracjcz/slowniki/slowniki.html|Słownik składniowy języka polskiego]] (Z. Greń),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:ngram|N-gram model of Polish]] (AGH DSP),
 * [[Nowy_slownik_angielsko-polski|Nowy słownik angielsko-polski]] (T. Piotrowski, Z. Saloni),
 * [[https://github.com/apohllo/polish-cyc|Polish OpenCYC]] (A. Pohl),
 * [[http://www.slowniki.org.pl/pol.html|Polish machine-generated dictionaries]], available on Creative Commons (J. Kazojć),
 * [[http://futrega.org/etc/nazwiska.zip|List of all Polish surnames]], licence unknown, see [[http://futrega.org/etc/nazwiska.html|further information on this resource]],
 * [[http://clip.ipipan.waw.pl/Gazetteer|Gazetteer for Polish Named Entities]] (A. Savary, J. Piskorski).
-Line 90:
+Line 107:
- * [[http://home.agh.edu.pl/~ziolko/doku.php?id=pl:resources:ortfon|OrtFon]], phonetic transcriber (AGH DSP).
 * [[http://home.agh.edu.pl/~ziolko/doku.php?id=pl:resources:asr|ASR]], automatic speech recognition system for Polish  (AGH DSP). 
 * [[http://home.agh.edu.pl/~ziolko/doku.php?id=pl:resources:anotator|Anotator]], speech corpora anotator dedicated for Polish and focused on connecting existing resources (AGH DSP).
+ * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:ortfon|OrtFon]], phonetic transcriber (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:asr|ASR]], automatic speech recognition system for Polish  (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:anotator|Anotator]], speech corpora anotator dedicated for Polish and focused on connecting existing resources (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:korpusmowy|Speech corpus]], Around 9 hours, word-annotated Polish speech corpus (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=en:resources:korpusav|AV corpus]], Audiovideo corpus of Polish speech (AGH DSP),
-Line 95:
+Line 114:
+ * [[http://itranslate4.eu/|iTranslate4.eu]] (multiple languages, allows comparing translation engines),
-Line 112:
+Line 132:
- * [[http://psi.amu.edu.pl/en/index.php?title=SyMGIZA%2B%2B|SyMGIZA++]], an extension of Giza++ that computes symmetric word alignment models.
+ * [[http://psi.amu.edu.pl/en/index.php?title=SyMGIZA%2B%2B|SyMGIZA++]], an extension of Giza++ that computes symmetric word alignment models,
 * [[http://chopin.ipipan.waw.pl/multiservice/|Multiservice]], a sample interface for running NLP Web services for Polish,
 * [[http://hipisek.pl|Hipisek]], an experimental question answering system (M. Walas).

Diff for "LRT"
