Diff for "LRT"

Differences between revisions 6 and 357 (spanning 351 versions)
Revision 6 as of 2011-03-07 13:39:53
Size: 3372
Comment:
Revision 357 as of 2017-11-06 18:25:51
Size: 27407
Comment:
Line 3: Line 3:
== Written corpora and corpus-related tools ==
 * [[http://nkjp.pl/index.php?page=0&lang=1|National Corpus of Polish]] (under development),
 * [[http://www.korpus.pl/index.php?lang=en|IPI PAN Corpus]],
This page contains a list of ''publicly available'' language tools and resources.

/* * [[attachment:NKJP-PodkorpusMilionowy-1.0.tgz]], the manually annotated 1-million word subcorpus of the NKJP, available on GNU GPL v.3, */
/* * [[attachment:NKJP-PodkorpusMilionowy-1.1.tgz]], the manually annotated 1-million word subcorpus of the NKJP, available on GNU GPL v.3, */
/* * [[attachment:NKJP-PodkorpusMilionowy-1.0-poliqarp-bin.tgz]], the binary version of corpus to be used with standalone Poliqarp tool. */

== Written corpora of contemporary Polish ==
 * [[NationalCorpusOfPolish|National Corpus of Polish]] (NKJP),
 * [[http://nfjp.pl/|National Photocorpus of Polish]] (NFJP),
Line 7: Line 13:
 * [[http://korpus.ia.uni.lodz.pl/|PELCRA Corpus]],
 * [[http://www.mimuw.edu.pl/polszczyzna/pl196x/index_en.htm|Polish language of the XX century sixties]],
 * [[http://ifa.amu.edu.pl/~ifaconc/blog/?page_id=60|PICLE corpus]] (the Polish sub-corpus of the [[http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/icle.htm|International Corpus of Learner English]] (ICLE),
 * [[http://poliqarp.sourceforge.net/|Poliqarp]] – a corpus indexing and search engine,
 * [[http://nlp.ipipan.waw.pl/Anotatornia/|Anotatornia]] – a system for multi-level manual annotation of corpora.

== Parallel corpora ==
 * [[http://opus.lingfil.uu.se/index.php|OPUS]] – an open source parallel corpus (European Parliament, EMEA, KDE, movie subtitles),
 * [[http://poliqarp.wbl.klf.uw.edu.pl/|Dictionaries as Corpora]],
 * [[PSC|Polish Sejm Corpus]],
 * [[http://zil.ipipan.waw.pl/PolishSummariesCorpus|Polish Summaries Corpus]],
 * [[http://ifa.amu.edu.pl/~ifaconc/blog/?page_id=60|PICLE corpus]], the Polish sub-corpus of the [[http://www.fltr.ucl.ac.be/fltr/germ/etan/cecl/Cecl-Projects/Icle/icle.htm|International Corpus of Learner English]] (ICLE),
 * [[http://dl.psnc.pl/activities/projekty/impact/results/| IMPACT ground-truth data]] for selected Polish historical documents from PIONIER Digital Libraries Federation,
  * Now available also as corpora in the Poliqarp for !DjVu [[http://poliqarp.wbl.klf.uw.edu.pl|search engine]],
 * [[http://nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/kpwr|KPWr]], the Polish Corpus of Wrocław University of Technology, a collection of documents available under a Creative Commons license and annotated with syntactic chunks, proper names, semantic relations, anaphora and word senses,
 * [[http://www.pcsn.uni.wroc.pl/|Polish Corpus of Suicide Notes]],
 * [[PolishWikipediaCorpus|Polish Wikipedia Corpus]],
 * [[http://zil.ipipan.waw.pl/gpwEcono|gpwEcono]], a corpus of stock market reports, with manual word sense annotation,
 * [[http://zil.ipipan.waw.pl/plWikiEcono|plWikiEcono]], a corpus of Polish Wikipedia articles from the domain of economy,
 * [[http://zil.ipipan.waw.pl/PolishCoreferenceCorpus|Polish Coreference Corpus]], a corpus of Polish coreference relations, created as part of the [[http://core.ipipan.waw.pl/About|CORE project]],
 * [[http://argumentacja.pdg.pl/argdbpl/|ArgDB-pl]], a Polish corpus of arguments in natural contexts,
 * [[http://www.staff.amu.edu.pl/~romang/wiki_errors_pl.php|PIEWiC]], a Polish corpus of errors automatically extracted from Wikipedia revisions,
 * [[http://clip.ipipan.waw.pl/PolEval|PolEval]], corpora and other text resources created for [[http://www.poleval.pl|PolEval]] shared tasks,

== Written corpora of historical Polish ==
 * [[PL196x|Polish language of the 1960s / Frequency corpus]] (I. Kurcz, A. Lewicki, J. Sambor, J. Woronczak, K. Szafran, J. S. Bień, M. Woliński),
 * [[http://chronopress.clarin-pl.eu/|ChronoPress]], corpus of press texts from 1945–1954 (A. Pawłowski),
 * [[http://www.f19.uw.edu.pl/|Microcorpus of Polish: 1830-1918]], (M. Derwojedowa),
 * [[http://korba.edu.pl|KORBA]], electronic corpus of 17th and 18th century Polish texts (W. Gruszczyński),
 * [[http://fedora.clarin-d.uni-saarland.de/poldilemma/|PolDiLemma]], the Middle Polish Diachrone Lemmatised Corpus (R. Meyer),
 * [[http://www.spxvi.edu.pl/korpus/|Corpus of 16th century Polish]] (IBL PAN),
 * [[https://www.ijp-pan.krakow.pl/publikacje-elektroniczne/korpus-tekstow-staropolskich|Corpus of old Polish (up to 1500)]] (IJP PAN).


== Corpus-related tools and resources ==
 * [[http://poliqarp.sourceforge.net/|Poliqarp]], a corpus indexing and search engine (please see also [[http://nlp.ipipan.waw.pl/Poliqarp/|the beta version of Poliqarp 1.1]] and [[http://clip.ipipan.waw.pl/Poliqarp|1.3]] with statistical extensions and [[http://liszt.ipipan.waw.pl/|several corpora indexed with Poliqarp 2]]),
 * [[http://zil.ipipan.waw.pl/Anotatornia|Anotatornia]], a system for multi-level manual annotation of corpora,
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/inforex|Inforex]], a web-based system designed for managing and annotating text corpora on the semantic level,
 * [[http://smyrna.danieljanus.pl/|Smyrna]], a simple, light-weight Polish concordancer,
 * [[http://korpusy.net/|korpusy.net]], a corpus research-related website (B. Gałkowski),
 * [[http://korpusomat.nlp.ipipan.waw.pl/|Korpusomat]], a tool for creating your own searchable corpora (Ł. Kobyliński, W. Kieraś).

== Spoken corpora ==
 * The PELCRA conversational corpus of Polish: approx. 2.2 million words of casual conversational spoken Polish collected and processed in the years 2001-2015 in a number of research projects, including PELCRA, NKJP, CESAR and CLARIN-PL. Available under CC-BY-NC. All transcriptions can be accessed through the [[http://spokes.clarin-pl.eu|Spokes web interface]] and programmatically through [[http://clarin.pelcra.pl/apidocs/spokes| a REST API]].
 * [[http://clip.ipipan.waw.pl/LUNA|The annotated corpus of spoken dialogues]] (LUNA project, corpus data available at the end of the page),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=en:resources:korpusmowy|AGH speech corpus]], a word-annotated Polish speech corpus of around 9 hours (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=en:resources:korpusav|Audiovideo corpus]] of Polish speech (AGH DSP),
 * [[http://nkjp.uni.lodz.pl/spoken.jsp|NKJP search engine for spoken-conversational data]],
 * [[http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1164|Acoustic database for Polish unit selection speech synthesis]] (ELRA resources),
 * [[http://catalog.elra.info/product_info.php?cPath=37_39&products_id=1168|Acoustic database for Polish concatenative speech synthesis]] (ELRA resources),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:korpusemo|Corpus of emotions in speech]] (AGH DSP).

== Language models ==
 * [[http://zil.ipipan.waw.pl/NKJPNGrams|N-grams from the balanced subcorpus of the National Corpus of Polish]],
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:ngram|N-gram model of Polish]] (AGH DSP),
 * [[http://mozart.ipipan.waw.pl/~axw/models/|Distributional semantic models]] trained on orthographical, lemmatized word forms (A. Wawer),
 * Even more [[http://dsmodels.nlp.ipipan.waw.pl|distributional semantic models]] based on NKJP (A. Mykowiecka, M. Marciniak, P. Rychlik).


== Parallel corpora and translation memories ==
 * [[http://parasolcorpus.org/|ParaSol]], a parallel corpus of Slavic and other languages,
 * [[http://www.domeczek.pl/~polukr/index.php?option=search|PolUKR]], a Polish-Ukrainian parallel corpus,
 * [[http://opus.lingfil.uu.se/index.php|OPUS]], an open source parallel corpus (European Parliament, EMEA, KDE, movie subtitles),
 * [[http://nl.ijs.si/ME/V4|"1984"]], an annotated parallel corpus of George Orwell's "1984" in 15 languages, MULTEXT-East, v.4 (licensed download),
 * [[http://www.korpus.cz/intercorp/?req=page:info|InterCorp]], a multilingual parallel corpus,
Line 17: Line 72:
 * [[http://pelcra.pl/new/cesar|PELCRA Parallel corpora]], a collection of downloadable parallel corpora developed by the PELCRA team and available under the CC-BY and CC-BY-NC licenses,
 * [[http://paralela.clarin-pl.eu|Paralela Polish-English corpus]]
Line 18: Line 75:
 * [[http://psi.amu.edu.pl/en/index.php?title=Parallel_Corpora|PSI collection of parallel corpora]], a growing collection of parallel corpora pairing Polish with other European languages,
 * [[http://www.pol-ros.polon.uw.edu.pl/|Polish-Russian Parallel Corpus]],
 * [[http://mymemory.translated.net/|MyMemory]], freely available multilingual TM,
 * [[http://www.tausdata.org/|TAUS Data]], a multilingual TM from the members of TAUS Data Association,
 * [[http://glosbe.com/|Glosbe]], an open source TM.

== Machine-readable dictionaries ==
 * [[http://plwordnet.pwr.wroc.pl/wordnet|plWordNet, Polish WordNet, Słowosieć]] (M. Piasecki),
 * [[http://www.ltc.amu.edu.pl/polnet/|POLNET, another Polish Wordnet]] (Z. Vetulani),
 * [[http://synonimy.ux.pl/|Polish OpenThesaurus]], słownik synonimów – a crowdsourced Polish thesaurus (M. Miłkowski),
 * [[http://www.sjp.pl/|Słownik języka polskiego (d. alternatywny)]], Polish ispell dictionaries, along with some definitions and online form display.
 * [[Nowy_slownik_angielsko-polski|Nowy słownik angielsko-polski]] (T. Piotrowski, Z. Saloni),
 * [[http://zil.ipipan.waw.pl/OpenCYCPL|Polish OpenCYC]] (A. Pohl),
 * [[http://www.slowniki.org.pl/pol.html|Polish machine-generated dictionaries]], available on Creative Commons (J. Kazojć),
 * [[http://futrega.org/etc/nazwiska.zip|List of all Polish surnames]], licence unknown, see [[http://futrega.org/etc/nazwiska.html|further information on this resource]],
 * [[http://clip.ipipan.waw.pl/Gazetteer|Gazetteer for Polish Named Entities]] (A. Savary, M. Lenart, J. Piskorski),
 * [[http://zil.ipipan.waw.pl/PNET|Triggers for Polish Named Entities]] (M. Baron, L. Manicki, A. Savary),
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/nelexicon|NELexicon]] contains more than 1.4 million proper names (M. Marcińczuk, A. Musiał, M. Janicki),
 * [[http://zil.ipipan.waw.pl/Walenty|Walenty]], the Polish Valence Dictionary (E. Hajnicz, W. Kieraś, A. Patejuk, A. Przepiórkowski, F. Skwarski, M. Świdziński, M. Woliński),
 * [[http://zil.ipipan.waw.pl/SGDPV|Syntactic-generative dictionary of Polish verbs]] (K. Polański),
 * [[http://zil.ipipan.waw.pl/SAWA|SAWA]], the Grammatical Lexicon of Warsaw Urban Proper Names (M. Marciniak, C. Heliasz, J. Rabiega-Wiśniewska, P. Sikora, M. Woliński, A. Savary),
 * [[http://zil.ipipan.waw.pl/SEJF|SEJF]], the Grammatical Lexicon of Polish Phraseology (M. Czerepowicka, A. Savary),
 * [[http://zil.ipipan.waw.pl/SEJFEK|SEJFEK]], the Grammatical Lexicon of Polish Economical Phraseology (F. Makowiecki, A. Savary),
 * [[http://zil.ipipan.waw.pl/WikiTopoPl|WikiTopoPl]], a multilingual lexicon of 155,000 Polish geographical proper names extracted from Wikipedia and their equivalents in Bulgarian, Croatian, English, German, modern Greek, Hungarian, Romanian, Serbian and Slovak (L. Manicki).
 * [[http://zil.ipipan.waw.pl/Prolexbase|Prolexbase 2.0]], a multilingual relational dictionary of proper names in Polish, English and French (M. Baron, B. Bouchou-Markhoff, L. Manicki, D. Maurel, A. Savary, M. Tran).
 * [[http://clip.ipipan.waw.pl/DeepEREntityLibrary|DeepER Entity Library]], a database containing around 900,000 entities, each described by its textual representations in Polish (names) and `WordNet` synsets.
 * [[http://publications.it.p.lodz.pl/2016/word_embeddings/|Word embeddings for Polish - Wikipedia based]], (M. Rogalski, P. Szczepaniak).

== Human-readable dictionaries ==
 * [[http://www.wsjp.pl/|Wielki Słownik Języka Polskiego]],
 * [[http://doroszewski.pwn.pl|Słownik języka polskiego PAN pod red. W. Doroszewskiego]],

 * [[http://pl.wiktionary.org|Wikisłownik]],
 * [[http://www.slownik-online.pl/index.php|Słownik wyrazów obcych i zwrotów obcojęzycznych Władysława Kopalińskiego]],
 * [[http://leksykony.interia.pl/synonim|Słownik synonimów i antonimów Piotra Żmigrodzkiego]],
 * [[http://kpbc.umk.pl/dlibra/publication?id=17781|Słownik polszczyzny XVI wieku]],
 * [[http://xvii-wiek.ijp-pan.krakow.pl/pan_klient/|Elektroniczny Słownik Języka Polskiego XVII i XVIII wieku]],
 * [[http://poliqarp.wbl.klf.uw.edu.pl/slownik-warszawski/| Poliqarp for DjVu search engine]] for J. Karłowicz, A. Kryński, W. Niedźwiedzki. Dictionary of Polish. Warsaw 1900–1927,
 * [[http://poliqarp.wbl.klf.uw.edu.pl/slownik-polszczyzny-xvi-wieku/| Poliqarp for DjVu search engine]] for S. Bąk, M. R. Mayenowa, F. Pepłowski (eds.). Dictionary of the 16th century Polish. Wrocław — Warszawa, 1966-???? (work in progress),
 * [[http://poliqarp.wbl.klf.uw.edu.pl/slownik-lindego/| Poliqarp for DjVu search engine]] for M. Samuel Bogumił Linde. Dictionary of Polish (2nd edition). Lwów 1854-1861,
 * [[http://poliqarp.wbl.klf.uw.edu.pl/slownik-geograficzny/| Poliqarp for DjVu search engine]] for B. Chlebowski, F. Sulimierski, W. Walewski (eds.), The Geographical Dictionary of the Polish Kingdom and other Slavic Countries, Warszawa 1880-1902,
 * [[http://eswil.ijp-pan.krakow.pl/|Edycja elektroniczna Słownika wileńskiego]],
 * PELCRA HASK Collocation Dictionaries generated for [[http://pelcra.pl/hask_pl|Polish]] and [[http://pelcra.pl/hask_en|English]],
 * [[http://clip.ipipan.waw.pl/UkrPolDict|Słownik ukraińsko-polski]] (Ukrainian-Polish dictionary), edited by Janusz A. Rieger. Materials for the dictionary: letter „O”.
Line 20: Line 121:
 * [[http://sgjp.pl/morfeusz/|Morfeusz SGJP]] – morphological analyser (Z. Saloni, W. Gruszczyński, M. Woliński, R. Wołosz),
 * [[http://morfologik.blogspot.com/|Morfologik]] – morphological analyser (M. Miłkowski),
 * [[http://sgalus.republika.pl/indexe.html]] – lexical analyser and a Polish proof-reader (S. Galus),
 * [[http://zil.ipipan.waw.pl/PoliMorf|PoliMorf]], the ultimate inflectional dictionary of Polish (under development),
 * [[http://sgjp.pl/morfeusz/|Morfeusz SGJP]], morphological analyser (Z. Saloni, W. Gruszczyński, M. Woliński, R. Wołosz),
 * [[http://sgjp.pl/siat/|Index a tergo of Polish word forms]] (J. Tokarski, Z. Saloni),
 * [[http://morfologik.blogspot.com/|Morfologik]], morphological analyser (M. Miłkowski, D. Weiss),
 * [[http://duch.mimuw.edu.pl/~kszafran/index.php?option=com_docman&task=cat_view&gid=49&Itemid=93|SAM]], morphological analyser (K. Szafran),
 * [[http://utt.amu.edu.pl/|UAM Text Tools]] (P. Obrębski, Z. Vetulani; see also [[http://utt.wmi.amu.edu.pl/trac/wiki/]]),
 * [[http://nl.ijs.si/ME/V4/msd/html|MULTEXT-East, v.4 ]], morphosyntactic specifications and documentation for 16 languages,
 * [[http://nl.ijs.si/ME/V4/doc/index.html#sec-lex|Sample morphosyntactic Polish lexicon]], the MULTEXT-East morphosyntactic lexicons,
 * [[http://www.domeczek.pl/~polukr/mte-conv|KIPI->MTE]], a converter from TaKIPI to MULTEXT-East morphosyntactic format (A. Radziszewski, N. Kotsyba),
 * [[http://nlp.pwr.wroc.pl/redmine/projects/libpltagger/wiki|MACA]], Morphological Analysis Converter and Aggregator (A. Radziszewski, T. Śniatowski),
 * [[http://sgalus.republika.pl/indexe.html|Lexical analyser and a Polish proof-reader]] (S. Galus),
Line 24: Line 133:
 * [[http://winnie.ics.agh.edu.pl/proj_uk/fleksbaz/|Baza fleksyjna języka polskiego]], inflection database of Polish words (W. Lubaszewski, B. Moskal, P. Pietras, P. Pisarek, T. Rokicka),
Line 25: Line 135:
 * [[http://www.cs.put.poznan.pl/dweiss/xml/projects/lametyzator/index.xml?lang=en|Stemming engine for Polish]] (D. Weiss),
 * [[http://getopt.org/stempel/|Stempel]], another stemmer (A. Białecki).
 * [[http://getopt.org/stempel/|Stempel]], another stemmer (A. Białecki),
 * [[http://nlp.pwr.wroc.pl/redmine/projects/joskipi/wiki/|WCCL]], toolkit for morphosyntactic feature generation (A. Radziszewski, A. Wardyński, T. Śniatowski, P. Kędzia),
 * [[http://lemmatise.ijs.si/Services|LemmaGen]], Multilingual Open Source Lemmatisation for 11 EU languages, including Polish (M. Jursic, T. Erjavec et al.),
 * [[http://zil.ipipan.waw.pl/LemmaPL|LemmaPL]], a lemmatization tool for Polish.
Line 29: Line 141:
 * [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]] – a morphosyntactic tagger for Polish,
 * [[http://code.google.com/p/pantera-tagger/|PANTERA]] – a morphosyntactic tagger for Polish,
 * [[http://nlp.pwr.wroc.pl/takipi/|TaKIPI]], a morphosyntactic tagger for Polish (Decision Trees),
 * [[http://zil.ipipan.waw.pl/PANTERA|PANTERA]], a morphosyntactic tagger for Polish (Transformation-Based Learning),
 * [[http://nlp.pwr.wroc.pl/redmine/projects/wmbt/wiki|WMBT]], a morphosyntactic tagger for Polish (Memory-Based Learning),
 * [[http://zil.ipipan.waw.pl/TaCo|TaCo]], a statistical morphosyntactic tagset converter for positional tagsets (e.g. Polish),
 * [[http://nlp.pwr.wroc.pl/redmine/projects/wcrft/wiki|WCRFT]], a morphosyntactic tagger for Polish (Conditional Random Fields),
 * [[http://zil.ipipan.waw.pl/Concraft|Concraft]], a morphosyntactic disambiguation tool for Polish (Constrained Conditional Random Fields),
 * [[http://zil.ipipan.waw.pl/PoliTa|PoliTa]], a morphosyntactic meta-tagger,
 * [[http://zil.ipipan.waw.pl/NKJP%20model%20for%20TnT%20Tagger|NKJP model for TnT Tagger]], a trained model usable on Morfeusz-segmented text with [[http://www.coli.uni-saarland.de/~thorsten/tnt/|TnT Tagger]],
 * [[http://clarin.pelcra.pl/tools/tagger|A PoS tagger trained on the 1M NKJP corpus and Morfeusz]] with a [[http://clarin.pelcra.pl/tools/api/tagger/tag?text=Ala%20ma%20kota.%20&tagger=openNLP&tagset=standard&format=JSON&lang=pl|REST API]].
Line 33: Line 152:
 * [[http://nlp.ipipan.waw.pl/~wolinski/swigra/|Świgra]] – a DCG parser,
 * [[http://nlp.ipipan.waw.pl/Spejd/|Spejd]] – a shallow parsing and disambiguation system,
 * [[http://sourceforge.net/projects/dendrarium/|Dendrarium]] – a treebank development system (under development),
 * [[http://nlp.ipipan.waw.pl/CRIT2/|A Treebank / Test Suite for Polish]].
 * [[http://zil.ipipan.waw.pl/PolishDependencyParser|Polish Dependency Parser]] (A. Wróblewska),
 * [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|Składnica]], a hybrid constituency/dependency treebank of Polish (under development),
 * [[http://zil.ipipan.waw.pl/Sk%C5%82adnicaMWE|SkładnicaMWE]], a constituency version of Składnica with multiword expression annotations (J. Waszczuk, A. Savary),
 * [[http://zil.ipipan.waw.pl/plTAG|TAG grammar of Polish]],
 * [[http://zil.ipipan.waw.pl/LFG|POLFIE, an LFG grammar of Polish]]
   * [[http://iness.mozart.ipipan.waw.pl/iness/xle-web|POLFIE as a web service]].
 * Świgra, a DCG parser,
   * [[http://nlp.ipipan.waw.pl/~wolinski/swigra/|version 1.0]] (2005),
   * [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|version 1.5]] as used in Składnica (2011),
   * [[http://swigra.nlp.ipipan.waw.pl/|Świgra 2.0 demo]] (2015, please use Firefox).
 * Spejd, a shallow parsing and disambiguation system,
   * the [[http://zil.ipipan.waw.pl/Spejd|current version]] of the system,
   * Spejd [[attachment:gramatyka_Spejd_NKJP_1.0.zip|grammar of Polish]] (version 1.0), developed by K. Głowińska within [[http://nkjp.pl/|NKJP]], available on GNU GPL v.3,
   * Spejd [[http://clip.ipipan.waw.pl/SpejdLemmatizingGrammar|grammar of Polish with lemmatisation of Polish nominal syntactic groups]],
   * [[http://zil.ipipan.waw.pl/SEJFEK4Spejd|SEJFEK4Spejd]] - a Spejd grammar version of [[http://zil.ipipan.waw.pl/SEJFEK|SEJFEK]] and a converter from dictionary to grammar,
 * [[http://sourceforge.net/projects/dendrarium/|Dendrarium]], a treebank development system,
 * [[http://nlp.ipipan.waw.pl/CRIT2/|A Treebank / Test Suite for Polish]],
 * [[ftp://ftp.mimuw.edu.pl/pub/People/polszczyzna/AS/index.html|Analizator syntaktyczny AS]] (M. Woliński),
 * [[http://www.site.uottawa.ca/~szpak/oldStuff/|Formalny opis składniowy zdań polskich]] (S. Szpakowicz),
 * [[http://las.aei.polsl.pl/las2/|Serwer LAS / Linguistic Analysis Server]],
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/disaster|Disaster]] (DISAmbiguator and STatistical chunkER) – a Python module for chunking and morphosyntactic disambiguation,
 * [[http://nlp.pwr.wroc.pl/redmine/projects/iobber/wiki|Iobber]], a CRF chunker for Polish,
 * [[http://zil.ipipan.waw.pl/Krzaki|Krzaki (bushes)]], a 20k-sentence corpus of Polish manually annotated for dependency structure,
 * [[http://zil.ipipan.waw.pl/ENIAM|ENIAM]] (W. Jaworski).

== Sentiment analysis ==
 * [[http://zil.ipipan.waw.pl/SlownikWydzwieku|Polish sentiment dictionary]], with sentiment scores computed using supervised methods (A. Wawer),
 * [[http://zil.ipipan.waw.pl/HateSpeech|HateSpeech corpus]], 2000 manually annotated documents representing various types and degrees of offensive language expressed toward minorities,
 * [[http://zil.ipipan.waw.pl/Korpus%20Szczerosci|Sincerity Corpus (Korpus Szczerości)]], a collection of fake and real reviews,
 * you can also test Sentipejd – sentiment analysis tool in the [[http://multiservice.nlp.ipipan.waw.pl/|Multiservice]] (please select a tagger first).

== Coreference ==
 * [[http://zil.ipipan.waw.pl/PolishCoreferenceCorpus|Polish Coreference Corpus]], a 500 M corpus of general nominal coreference in Polish (M. Ogrodniczuk),
 * [[http://zil.ipipan.waw.pl/PolishCoreferenceTools|Polish Coreference Tools]], a suite of Polish coreference resolution tools, created as part of the [[http://zil.ipipan.waw.pl/CORE|CORE project]].

== Speech analysis and synthesis tools ==
 * [[http://skrybot.pl/en/products/skrybot-home-speech-recognition/|Skrybot]], commercial speech recognition system (L. Pawlaczyk, P. Bosky),
 * [[http://www.ivona.com/|Ivona]], commercial text-to-speech system (Expressivo),
 * [[http://techmo.pl/index.php?option=com_content&view=article&id=54&Itemid=166&lang=pl|Techmo]] TTS demo (Techmo),
 * [[http://www.nuance.com/landing-pages/playground/Vocalizer_Demo2/vocalizer_modal.html?demo=true|Vocalizer]], commercial text-to-speech system (Nuance),
 * [[http://www.acapela-group.com/text-to-speech-interactive-demo.html|Acapela]], text to speech demo,
 * [[http://www.syntezamowy.pjwstk.edu.pl/index.html|Synteza mowy polskiej]], automatic speech recognition and speech synthesis demos, with background information (K. Szklanny),
 * [[http://www.staff.amu.edu.pl/~fonetyka/synteza/index.htm|System syntezy mowy ciągłej]] (G. Demenko, S. Grocholewski),
 * [[http://www.tcts.fpms.ac.be/synthesis/mbrola/|Polish MBROLA database]] (K. Szklanny, K. Marasek),
 * [[http://www.neurosoft.pl/?page_name=Produkty_SynTalk|SynTalk]], commercial speech synthesis system (!NeuroSoft),
 * [[http://www.primespeech.pl/|PrimeSpeech]], commercial speech recognition systems,
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:ortfon|OrtFon]], phonetic transcriber (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:asr|Sarmata]], automatic speech recognition system for Polish (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:anotator|Anotator]], a speech corpora annotator dedicated to Polish and focused on connecting existing resources (AGH DSP),
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:spkreco|System rozpoznawania mówcy]], (AGH DSP).
Line 39: Line 204:
 * [[http://www.translatica.pl/|Translatica]] (EN-PL-EN),
 * [[http://itranslate4.eu/|iTranslate4.eu]] (multiple languages, allows comparing translation engines),
 * [[http://www.microsofttranslator.com/|Bing Translator]] (multilingual),
 * [[http://translate.google.com/|Google Translate]] (multilingual),
Line 42: Line 209:
 * [[http://www.systran.co.uk/|Systran]] (EN-PL, PL-FR and some more).
 * [[http://www.systran.co.uk/|Systran]] (EN-PL, PL-FR and some more),
 * [[http://www.xdobry.de/esperantoedit/index_pl.html|Esperantilo]] (integrated Esperanto editor, with MT for EO-PL-DE-EN-SV),
 * [[http://thetos.polsl.pl/|Thetos]] (PL-Sign language).

== Summarizers ==
 * [[http://www.cs.put.poznan.pl/dweiss/research/lakon/|Lakon]], a system for news summarization (A. Dudczak),
 * [[http://las.aei.polsl.pl/PolSum/#/Home|PolSum]] (S. Kulików),
 * [[http://clip.ipipan.waw.pl/Summar|Summar]] (Ł. Pawluczuk),
 * [[http://clip.ipipan.waw.pl/Summarizer|Summarizer]] (J. Świetlicka),
 * you can also test Lakon, Open Text Summarizer and Summarizer in [[http://multiservice.nlp.ipipan.waw.pl/|Multiservice]]
 * and take a look at the [[http://zil.ipipan.waw.pl/PolishSummariesCorpus|Polish Summaries Corpus]].

== Diacritization ==
 * [[http://www.gzegzolka.com/poliszynel/|Poliszynel]] (P. Sawicki),
 * [[http://www.spolszcz.pl/|spolszcz.pl]] (P. Sawicki),
 * [[http://www.polszczyzna.info/polonizator|Polonizator]] (TiP),
 * [[http://slowniki.zoni.pl/?s=ogonki|Polonizer]],
 * [[http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/man/fsa_accent.1.html|fsa_accent]] (J. Daciuk),
 * [[http://wm.ite.pl/proj/pliterki/index.html|pliterki]] (W. Muła).

== Named Entity Recognition ==
 * [[http://zil.ipipan.waw.pl/Nerf|Nerf]], a tool for named entity recognition, available on GNU GPL v.3 (J. Waszczuk),
 * [[http://nlp.pwr.wroc.pl/narzedzia-i-zasoby/narzedzia/liner2|Liner2]], named entity recognizer released on GNU GPL with models to recognize 5 and 56 categories of proper names (M. Marcińczuk and M. Janicki),
 * [[https://clarin-pl.eu/dspace/handle/11321/302|TIMEX]], a model for Liner2 to recognize and normalize temporal expressions (J. Kocoń and M. Marcińczuk).

== Aggregating services ==
 * [[http://multiservice.nlp.ipipan.waw.pl/|Multiservice]], a sample interface for running NLP Web services for Polish (see also [[http://redmine.nlp.ipipan.waw.pl/redmine/projects/multiserwis/wiki/Usage|usage]] and [[http://redmine.nlp.ipipan.waw.pl/redmine/projects/multiserwis/wiki/InOut|format]]),
 * [[http://ws.clarin-pl.eu/|Online demos of tools for processing Polish texts]] (CLARIN-PL),
 * [[http://psi-toolkit.wmi.amu.edu.pl/index.html|PSI-Toolkit]], a chain of publicly available tools for automatic processing of Polish.
Line 45: Line 240:
 * [[http://plwordnet.pwr.wroc.pl/wordnet|plWordNet, Polish WordNet]] (M. Piasecki),
 * [[http://www.mimuw.edu.pl/polszczyzna/kolokacje/index.htm|Kolokacje]], a Web crawler and collocation finder (A. Buczyński)
 * [[http://nlp.ipipan.waw.pl/WSDDE/|WSDDE]] – a system for designing and performing Word Sense Disambiguation experiments (forthcoming),
 * [[http://nlp.ipipan.waw.pl/PPJP/|etc.]]
 * [[https://play.google.com/store/apps/details?id=com.pwr.plwordnet|Mobile plWordNet]], free mobile application for plWordNet browsing (J. Kocoń),
 * [[http://www.mimuw.edu.pl/polszczyzna/kolokacje/index.htm|Kolokacje]], a Web crawler and collocation finder (A. Buczyński),
 * [[http://zil.ipipan.waw.pl/WSDDE|WSDDE]], a system for designing and performing Word Sense Disambiguation experiments (R. Młodzki ''et al.''),
 * [[http://frazeo.pl/|Frazeo]], a search engine and clusterer of news in Polish (P. Pęzik),
 * [[http://segment.sourceforge.net/|Segment]], a rule-based sentence tokenizer supporting SRX standard (J. Lipski; the Polish rules are available in [[http://sourceforge.net/p/languagetool/code/HEAD/tree/trunk/languagetool/languagetool-core/src/main/resources/org/languagetool/resource/segment.srx?format=raw|LanguageTool project]], see [[http://zil.ipipan.waw.pl/Segment|here]] for short instructions on how to use the tool),
 * [[http://nlp.pwr.wroc.pl/redmine/projects/toki/wiki|Toki]], a tokenizer supporting SRX standard, C++ library and toolkit (T. Śniatowski and A. Radziszewski),
 * [[http://poleng.pl/translatica-pl.srx|Translatica SRX sentence segmentation rules for Polish (LGPL)]],
 * [[http://psi.amu.edu.pl/en/index.php?title=SyMGIZA%2B%2B|SyMGIZA++]], an extension of Giza++ that computes symmetric word alignment models,
 * [[http://hipisek.pl|Hipisek]], an experimental question answering system (M. Walas),
 * [[https://bitbucket.org/jsbien/ndt|Narzędzia dygitalizacji tekstów]] (text digitisation tools), Poliqarp for !DjVu and other programs,
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/fextor|Fextor]], a feature extraction framework,
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/lexcsd|LexCSD]], a system for semi-automatic sense disambiguation,
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/supermatrix|SuperMatrix]], a general tool for lexical semantic knowledge acquisition,
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/wordnetloom|WordnetLoom]], a wordnet editor application,
 * [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]], a tool for the creation of electronic inflectional dictionaries of multi-word units,
 * [[http://zil.ipipan.waw.pl/CorpCor|CorpCor]], a web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NKJP).
 * [[http://ws.clarin-pl.eu/demo/stylo2.html|Stylo 2]], stylometry demo,
 * [[http://zil.ipipan.waw.pl/TermoPL|TermoPL]], multiword expression extraction tool.
 * [[http://clip.ipipan.waw.pl/DeepEvents|DeepEvents]], event extraction in Polish, based on deep neural networks.
 * [[http://dsmodels.nlp.ipipan.waw.pl/sim1.html|Word similarity]], calculation of the similarity of words based on word embeddings, on-line service.

Language Tools and Resources for Polish

This page contains a list of publicly available language tools and resources.

Written corpora of contemporary Polish

Written corpora of historical Polish

Spoken corpora

Language models

Parallel corpora and translation memories

Machine-readable dictionaries

Human-readable dictionaries

Morphological tools and resources

Taggers

Parsers, grammars, treebanks

Sentiment analysis

Coreference

Speech analysis and synthesis tools

Machine translation demonstrations

Summarizers

Diacritization

Named Entity Recognition

  • Nerf, a tool for named entity recognition, available on GNU GPL v.3 (J. Waszczuk),

  • Liner2, named entity recognizer released on GNU GPL with models to recognize 5 and 56 categories of proper names (M. Marcińczuk and M. Janicki),

  • TIMEX, a model for Liner2 to recognize and normalize temporal expressions (J. Kocoń and M. Marcińczuk).

Aggregating services

Other

  • Mobile plWordNet, free mobile application for plWordNet browsing (J. Kocoń),

  • Kolokacje, a Web crawler and collocation finder (A. Buczyński),

  • WSDDE, a system for designing and performing Word Sense Disambiguation experiments (R. Młodzki et al.),

  • Frazeo, a search engine and clusterer of news in Polish (P. Pęzik),

  • Segment, a rule-based sentence tokenizer supporting SRX standard (J. Lipski; the Polish rules are available in LanguageTool project, see here for short instructions on how to use the tool),

  • Toki, a tokenizer supporting SRX standard, C++ library and toolkit (T. Śniatowski and A. Radziszewski),

  • Translatica SRX sentence segmentation rules for Polish (LGPL),

  • SyMGIZA++, an extension of Giza++ that computes symmetric word alignment models,

  • Hipisek, an experimental question answering system (M. Walas),

  • Narzędzia dygitalizacji tekstów (text digitisation tools), Poliqarp for DjVu and other programs,

  • Fextor, a feature extraction framework,

  • LexCSD, a system for semi-automatic sense disambiguation,

  • SuperMatrix, a general tool for lexical semantic knowledge acquisition,

  • WordnetLoom, a wordnet editor application,

  • Toposław, a tool for the creation of electronic inflectional dictionaries of multi-word units,

  • CorpCor, a web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NKJP).

  • Stylo 2, stylometry demo,

  • TermoPL, multiword expression extraction tool.

  • DeepEvents, event extraction in Polish, based on deep neural networks.

  • Word similarity, calculation of the similarity of words based on word embeddings, on-line service.