Differences between revisions 226 and 228 (spanning 2 versions)

Language Tools and Resources for Polish

This page contains a list of publicly available language tools and resources.

Written corpora and corpus-related tools

National Corpus of Polish (NKJP),
- NKJP-PodkorpusMilionowy-1.1.tgz, the manually annotated 1-million word subcorpus of the NKJP, available under GNU GPL v.3,
- Distributable version of NKJP,
- N-grams extracted from balanced National Corpus of Polish,
- Economy-related subcorpus of the National Corpus of Polish, with manually created sense annotation layer,
- A Java library for parsing NKJP-compatible TEI P5 files,
IPI PAN Corpus,
PWN Corpus,
PELCRA Corpus,
Dictionaries as Corpora,
Polish language of the 1960s,
PICLE corpus, the Polish sub-corpus of the International Corpus of Learner English (ICLE),
IMPACT ground-truth data for selected Polish historical documents from PIONIER Digital Libraries Federation,
- Now also available as corpora in the Poliqarp for DjVu search engine,
Poliqarp, a corpus indexing and search engine,
Anotatornia, a system for multi-level manual annotation of corpora,
Inforex, a web-based system designed for managing and annotating text corpora on the semantic level,
Smyrna, a simple, light-weight Polish concordancer,
KPWr, the Polish Corpus of Wrocław University of Technology, a collection of documents available under a Creative Commons license, annotated with syntactic chunks, proper names, semantic relations, anaphora and word senses,
Polish Corpus of Suicide Notes,
Polish Wikipedia Corpus,
gpwEcono, a corpus of stock market reports, with manual word sense annotation,
plWikiEcono, a corpus of Polish Wikipedia articles from the domain of economy,
Polish Coreference Corpus, a corpus of Polish coreference relations, created as part of the CORE project,
ArgDB-pl, a Polish corpus of arguments in natural contexts.

Spoken corpora

The annotated corpus of spoken dialogues (LUNA project, corpus data available at the end of the page),
AGH speech corpus, a word-annotated Polish speech corpus of around 9 hours (AGH DSP),
Audiovideo corpus of Polish speech (AGH DSP),
The PELCRA conversational corpus of Polish. TEI P5-encoded transcriptions of 1.8 million words of conversational spoken Polish collected in the years 2001-2011 (within the PELCRA and NKJP projects) available under CC-BY-NC,
NKJP search engine for spoken-conversational data,
Acoustic database for Polish unit selection speech synthesis (ELRA resources),
Acoustic database for Polish concatenative speech synthesis (ELRA resources),
Corpus of emotions in speech (AGH DSP).

Parallel corpora and translation memories

ParaSol, a parallel corpus of Slavic and other languages,
PolUKR, a Polish-Ukrainian parallel corpus,
OPUS, an open source parallel corpus (European Parliament, EMEA, KDE, movie subtitles),
"1984", an annotated parallel corpus of George Orwell's "1984" in 15 languages, MULTEXT-East, v.4 (licensed download),
InterCorp, a multilingual parallel corpus,
Leeds collection of Internet corpora,
LAGUN corpus,
PELCRA Parallel corpora, a collection of downloadable parallel corpora available under the CC-BY and CC-BY-NC licenses, developed by the PELCRA team within the CESAR project,
JRC-Acquis Multilingual Parallel Corpus,
PSI collection of parallel corpora, a growing collection of parallel corpora pairing Polish with other European languages,
Polish-Russian Parallel Corpus,
MyMemory, freely available multilingual TM,
TAUS Data, a multilingual TM from the members of TAUS Data Association.

Morphological tools and resources

PoliMorf, the ultimate inflectional dictionary of Polish (under development),
Morfeusz SGJP, morphological analyser (Z. Saloni, W. Gruszczyński, M. Woliński, R. Wołosz),
Index a tergo of Polish word forms (J. Tokarski, Z. Saloni),
Morfologik, morphological analyser (M. Miłkowski, D. Weiss),
SAM, morphological analyser (K. Szafran),
UAM Text Tools (P. Obrębski, Z. Vetulani; see also http://utt.wmi.amu.edu.pl/trac/wiki/),
MULTEXT-East, v.4, morphosyntactic specifications and documentation for 16 languages,
KIPI->MTE, a converter from TaKIPI to MULTEXT-East morphosyntactic format (A. Radziszewski, N. Kotsyba),
MACA, Morphological Analysis Converter and Aggregator (A. Radziszewski, T. Śniatowski),
Lexical analyser and a Polish proof-reader (S. Galus),
Neurosoft Gram (demo of a morphological analyser),
Baza fleksyjna języka polskiego, inflection database of Polish words (W. Lubaszewski, B. Moskal, P. Pietras, P. Pisarek, T. Rokicka),
Finite state utilities (J. Daciuk),
Stempel, another stemmer (A. Białecki),
WCCL, toolkit for morphosyntactic feature generation (A. Radziszewski, A. Wardyński, T. Śniatowski, P. Kędzia),
LemmaGen, Multilingual Open Source Lemmatisation for 11 EU languages, including Polish (M. Jursic, T. Erjavec et al.).

Taggers

TaKIPI, a morphosyntactic tagger for Polish (Decision Trees),
PANTERA, a morphosyntactic tagger for Polish (Transformation-Based Learning),
WMBT, a morphosyntactic tagger for Polish (Memory-Based Learning),
TaCo, a statistical morphosyntactic tagset converter for positional tagsets (e.g. Polish),
WCRFT, a morphosyntactic tagger for Polish (Conditional Random Fields),
Concraft, a morphosyntactic disambiguation tool for Polish (Constrained Conditional Random Fields),
NKJP model for TnT Tagger, a trained model usable on Morfeusz-segmented text with TnT Tagger.

Parsers, grammars, treebanks

Składnica, a hybrid constituency/dependency treebank of Polish (under development),
TAG grammar of Polish,
Świgra, a DCG parser,
- version 1.0 (2005),
- version 1.5 as used in Składnica (2011),
Spejd, a shallow parsing and disambiguation system,
- the current version of the system,
- Spejd grammar of Polish (version 1.0), developed by K. Głowińska within NKJP, available under GNU GPL v.3,
- SEJFEK4Spejd - a Spejd grammar version of SEJFEK and a converter from dictionary to grammar,
Dendrarium, a treebank development system,
A Treebank / Test Suite for Polish,
Visualisation of parsing tree forests (Świdziński's grammar, Świgra, Morfeusz, Bień's syntactic spreadsheets) by Andrzej Zaborowski,
Analizator syntaktyczny AS (M. Woliński),
Formalny opis składniowy zdań polskich (S. Szpakowicz),
Serwer LAS / Linguistic Analysis Server,
Disaster (DISAmbiguator and STatistical chunkER) – a Python module for chunking and morphosyntactic disambiguation,
Iobber, a CRF chunker for Polish.

Machine-readable dictionaries

plWordNet, Polish WordNet (M. Piasecki),
Polish OpenThesaurus, słownik synonimów – a crowdsourced Polish thesaurus (M. Miłkowski),
Słownik języka polskiego (d. alternatywny), Polish ispell dictionaries, along with some definitions and online form display,
Sample morphosyntactic Polish lexicon, the MULTEXT-East morphosyntactic lexicons,
Słowniki składniowe języka czeskiego i polskiego (Z. Greń),
N-gram model of Polish (AGH DSP),
Nowy słownik angielsko-polski (T. Piotrowski, Z. Saloni),
Polish OpenCYC (A. Pohl),
Polish machine-generated dictionaries, available under a Creative Commons license (J. Kazojć),
List of all Polish surnames, licence unknown, see further information on this resource,
Gazetteer for Polish Named Entities (A. Savary, M. Lenart, J. Piskorski),
Triggers for Polish Named Entities (M. Baron, L. Manicki, A. Savary),
NELexicon, a lexicon of more than 1.4 million proper names (M. Marcińczuk, A. Musiał, M. Janicki),
Walenty, the Polish Valence Dictionary (F. Skwarski, M. Świdziński, W. Kieraś, E. Hajnicz, A. Patejuk, A. Przepiórkowski, M. Woliński),
Syntactic-generative dictionary of Polish verbs (K. Polański),
SAWA, the Grammatical Lexicon of Warsaw Urban Proper Names (M. Marciniak, C. Heliasz, J. Rabiega-Wiśniewska, P. Sikora, M. Woliński, A. Savary),
SEJF, the Grammatical Lexicon of Polish Phraseology (M. Czerepowicka, A. Savary),
SEJFEK, the Grammatical Lexicon of Polish Economical Phraseology (F. Makowiecki, A. Savary),
WikiTopoPl, a multilingual lexicon of 155,000 Polish geographical proper names extracted from Wikipedia and their equivalents in Bulgarian, Croatian, English, German, modern Greek, Hungarian, Romanian, Serbian and Slovak (L. Manicki),
Prolexbase 2.0, a multilingual relational dictionary of proper names in Polish, English and French (M. Baron, B. Bouchou-Markhoff, L. Manicki, D. Maurel, A. Savary, M. Tran).

Human-readable dictionaries

Wielki Słownik Języka Polskiego,
Słownik języka polskiego PAN pod red. W. Doroszewskiego,
Wikisłownik,
Słownik wyrazów obcych i zwrotów obcojęzycznych Władysława Kopalińskiego,
Słownik synonimów i antonimów Piotra Żmigrodzkiego,
Poliqarp for DjVu search engine for J. Karłowicz, A. Kryński, W. Niedźwiedzki. Dictionary of Polish. Warsaw 1900–1927,
Poliqarp for DjVu search engine for S. Bąk, M. R. Mayenowa, F. Pepłowski (eds.). Dictionary of the 16th century Polish. Wrocław — Warszawa, 1966-???? (work in progress),
Poliqarp for DjVu search engine for M. Samuel Bogumił Linde. Dictionary of Polish (2nd edition). Lwów 1854-1861,
Poliqarp for DjVu search engine for B. Chlebowski, F. Sulimierski, W. Walewski (eds.), The Geographical Dictionary of the Polish Kingdom and other Slavic Countries, Warszawa 1880-1902,
Edycja elektroniczna Słownika wileńskiego,
PELCRA HASK Collocation Dictionaries.

Speech analysis and synthesis tools

Skrybot, commercial speech recognition system (L. Pawlaczyk, P. Bosky),
Ivona, commercial text-to-speech system (Expressivo),
Acapela, text to speech demo,
Synteza mowy polskiej, automatic speech recognition and speech synthesis demos, with background information (K. Szklanny),
System syntezy mowy ciągłej (G. Demenko, S. Grocholewski),
Polish MBROLA database (K. Szklanny, K. Marasek),
SynTalk, commercial speech synthesis system (NeuroSoft),
PrimeSpeech, commercial speech recognition systems,
OrtFon, phonetic transcriber (AGH DSP),
Sarmata, automatic speech recognition system for Polish (AGH DSP),
Anotator, a speech corpora annotator dedicated to Polish and focused on connecting existing resources (AGH DSP),
System rozpoznawania mówcy (AGH DSP).

Machine translation demonstrations

iTranslate4.eu (multiple languages, allows comparing translation engines),
Bing Translator (multilingual),
Google Translate (multilingual),
InterTran (multilingual),
LingvoBit (EN-PL-EN),
Systran (EN-PL, PL-FR and some more),
Esperantilo (integrated Esperanto editor, with MT for EO-PL-DE-EN-SV),
Thetos (PL-Sign language).

Other

Kolokacje, a Web crawler and collocation finder (A. Buczyński),
WSDDE, a system for designing and performing Word Sense Disambiguation experiments (R. Młodzki et al.),
Frazeo, a search engine and clusterer of news in Polish (P. Pęzik),
Segment, a rule-based sentence tokenizer supporting SRX standard (J. Lipski; the Polish rules are available in LanguageTool project, see here for short instructions on how to use the tool),
Toki, a tokenizer supporting SRX standard, C++ library and toolkit (T. Śniatowski and A. Radziszewski),
Translatica SRX sentence segmentation rules for Polish (LGPL),
Lakon, a system for news summarization (master's thesis by A. Dudczak),
SyMGIZA++, an extension of Giza++ that computes symmetric word alignment models,
Multiservice, a sample interface for running NLP Web services for Polish,
Hipisek, an experimental question answering system (M. Walas),
Text digitization tools: Poliqarp for DjVu and other programs,
Nerf, a tool for named entity recognition, available under GNU GPL v.3,
Liner2, a named entity recognizer released under the GNU GPL, with models to recognize 5 and 56 categories of proper names (M. Marcińczuk and M. Janicki),
PSI-Toolkit, a chain of publicly available tools for automatic processing of Polish,
Fextor, a feature extraction framework,
LexCSD, a system for semi-automatic sense disambiguation,
SuperMatrix, a general tool for lexical semantic knowledge acquisition,
WordnetLoom, a wordnet editor application,
Toposław, a tool for the creation of electronic inflectional dictionaries of multi-word units,
CorpCor, a web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NCP),
Polish Coreference Tools, a suite of Polish coreference resolution tools, created as part of the CORE project.

- Revision 226 as of 2013-06-05 21:23:59
  Size: 20679
  Editor: MaciejOgrodniczuk
  Comment: comma only
+ Revision 228 as of 2013-06-08 20:01:15
  Size: 20672
  Editor: AdamPrzepiorkowski
  Comment:
 Line 38:
-== Parallel corpora ==
+== Spoken corpora ==

 * [[http://clip.ipipan.waw.pl/LUNA|The annotated corpus of spoken dialogues]] (LUNA project, corpus data available at the end of the page),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=en:resources:korpusmowy|AGH speech corpus]], around 9 hours, word-annotated Polish speech corpus (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=en:resources:korpusav|Audiovideo corpus]] of Polish speech (AGH DSP),
 *  [[http://pelcra.pl/corpora/spoken|The PELCRA conversational corpus of Polish]]. TEI P5-encoded transcriptions of 1.8 million words of conversational spoken Polish collected in the years 2001-2011 (within the PELCRA and NKJP projects) available under CC-BY-NC,
 * [[http://nkjp.uni.lodz.pl/spoken.jsp|NKJP search engine for  spoken-conversational data]],
 * [[http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1164|Acoustic database for Polish unit selection speech synthesis]] (ELRA resources),
 * [[http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1168|Acoustic database for Polish concatenative speech synthesis]] (ELRA resources),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:korpusemo|Corpus of emotions in speech]] (AGH DSP).

== Parallel corpora and translation memories ==
-Line 49:
+Line 60:
- * [[http://www.pol-ros.polon.uw.edu.pl/|Polish-Russian Parallel Corpus]].

== Spoken corpora ==

 * [[http://clip.ipipan.waw.pl/LUNA|The annotated corpus of spoken dialogues]] (LUNA project, corpus data available at the end of the page),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=en:resources:korpusmowy|AGH speech corpus]], around 9 hours, word-annotated Polish speech corpus (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=en:resources:korpusav|Audiovideo corpus]] of Polish speech (AGH DSP),
 *  [[http://pelcra.pl/corpora/spoken|The PELCRA conversational corpus of Polish]]. TEI P5-encoded transcriptions of 1.8 million words of conversational spoken Polish collected in the years 2001-2011 (within the PELCRA and NKJP projects) available under CC-BY-NC,
 * [[http://nkjp.uni.lodz.pl/spoken.jsp|NKJP search engine for  spoken-conversational data]],
 * [[http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1164|Acoustic database for Polish unit selection speech synthesis]] (ELRA resources),
 * [[http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1168|Acoustic database for Polish concatenative speech synthesis]] (ELRA resources),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:korpusemo|Corpus of emotions in speech]] (AGH DSP),

== Translation memories ==
+ * [[http://www.pol-ros.polon.uw.edu.pl/|Polish-Russian Parallel Corpus]],
