Diff for "LRT"

Differences between revisions 248 and 429 (spanning 181 versions)
Revision 248 as of 2014-04-29 09:48:41
Size: 20540
Comment:
Revision 429 as of 2021-07-28 10:34:26
Size: 31735
Comment:
Line 5: Line 5:
/* * [[attachment:NKJP-PodkorpusMilionowy-1.0.tgz]], the manually annotated 1-million word subcorpus of the NJKP, available on GNU GPL v.3, */ /* * [[attachment:NKJP-PodkorpusMilionowy-1.0.tgz]], the manually annotated 1-million worsd subcorpus of the NJKP, available on GNU GPL v.3, */
Line 9: Line 9:
== Written corpora and corpus-related tools == == Written corpora of contemporary Polish ==
Line 11: Line 11:
 * [[http://www.korpus.pl/index.php?lang=en|IPI PAN Corpus]],
Line 13: Line 12:
 * [[http://korpus.ia.uni.lodz.pl/|PELCRA Corpus]],  * [[http://nfjp.pl/|National Photocorpus of Polish]] (NFJP),
Line 15: Line 14:
 * [[PL196x|Polish language of the 1960s]],  * [[PPC|Polish Parliamentary Corpus]],
 * [[http://zil.ipipan.waw.pl/PolishSummariesCorpus|Polish Summaries Corpus]],
Line 18: Line 18:
  * Now available also as corpora in  the Poliqarp for !DjVu [[http://poliqarp.wbl.klf.uw.edu.pl|search engine]],
 * [[http://poliqarp.sourceforge.net/|Poliqarp]], a corpus indexing and search engine,
 * [[http://zil.ipipan.waw.pl/Anotatornia|Anotatornia]], a system for multi-level manual annotation of corpora,
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/inforex|Inforex]], a web-based system designed for managing and annotating text corpora on the semantic level,
 * [[http://smyrna.danieljanus.pl/|Smyrna]], a simple, light-weight Polish concordancer,
 * [[http://nlp.pwr.wroc.pl/kpwr|KPWr]], Polish Corpus of Wrocław University of Technology, a collection of documents available under a Creative Commons license, annotated with syntactic chunks, proper names, semantic relations, anaphora and word senses,
  * Now available also as corpora in the Poliqarp for !DjVu [[http://poliqarp.wbl.klf.uw.edu.pl|search engine]],
 * [[http://nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/kpwr|KPWr]], Polish Corpus of Wrocław University of Technology, a collection of documents available under a Creative Commons license, annotated with syntactic chunks, proper names, semantic relations, anaphora and word senses,
Line 28: Line 24:
 * [[http://zil.ipipan.waw.pl/PolishCoreferenceCorpus|Polish Coreference Corpus]], a corpus of Polish coreference relations, created as part of the [[http://zil.ipipan.waw.pl/CORE|CORE project]],  * [[http://zil.ipipan.waw.pl/PolishCoreferenceCorpus|Polish Coreference Corpus]], a corpus of Polish coreference relations, created as part of the [[http://core.ipipan.waw.pl/About|CORE project]],
Line 30: Line 26:
 * [[http://www.staff.amu.edu.pl/~romang/wiki_errors_pl.php|PIEWiC]], a Polish corpus of errors automatically extracted from Wikipedia revisions.  * [[http://www.staff.amu.edu.pl/~romang/wiki_errors_pl.php|PIEWiC]], a Polish corpus of errors automatically extracted from Wikipedia revisions,
 * [[http://clip.ipipan.waw.pl/PolEval|PolEval]], corpora and other text resources created for [[http://www.poleval.pl|PolEval]] shared tasks,
 * [[http://clip.ipipan.waw.pl/PARSEME-PL|Polish PARSEME corpus]], annotated manually for verbal multiword expressions in 20 languages including Polish, used in the [[http://multiword.sourceforge.net/sharedtask2018/|PARSEME shared task 1.1]]; the Polish subcorpus is aligned with automatic dependency annotations in the [[http://universaldependencies.org/guidelines.html|UD]] format (A. Savary),
 * [[http://clip.ipipan.waw.pl/MweLitRead|MweLitRead]], a corpus of literal readings of Polish verbal MWEs stemming from the [[http://clip.ipipan.waw.pl/PARSEME-PL|Polish PARSEME corpus]] (A. Savary, S. Cordeiro),
 * [[http://pelcra.pl/plec/downloads|PELCRA Learner English Corpus]] (PLEC),
 * [[https://www.sketchengine.eu/user-guide/user-manual/corpora/by-language/polish-text-corpora/|Polish text corpora]] included in Sketch Engine,
 * [[http://synamet.polon.uw.edu.pl/|Microcorpus of Synesthetic Metaphors]].

== Written corpora of historical Polish ==
 * [[http://scriptores.pl/efontes/|eFontes Mediae et Infimae Latinitatis Polonorum]] (1000–1550, IJP PAN)
 * [[https://www.ijp-pan.krakow.pl/publikacje-elektroniczne/korpus-tekstow-staropolskich|Corpus of old Polish (up to 1500)]] (IJP PAN)
 * [[http://stnt.ijp.pan.pl/|15th-century New Testament translations]] (IJP PAN)
 * [[https://szukajwslownikach.uw.edu.pl/IMPACT_GT_1/|IMPACT project corpus]] (1570–1756, KLF UW)
 * [[http://spxvi.edu.pl/korpus/|Corpus of 16th-century Polish]] (IBL PAN)
 * [[http://fedora.clarin-d.uni-saarland.de/poldilemma/|PolDiLemma]], the Middle Polish Diachrone Lemmatised Corpus (16–18th c., R. Meyer)
 * [[http://rhssl1.uni-regensburg.de/SlavKo/korpus/poldi|PolDi]], a Polish Diachronic Online Corpus (R. Meyer)
 * [[http://korba.edu.pl|KORBA]], electronic corpus of 17th and 18th century Polish texts (1601–1772, IJP PAN)
 * [[http://www.f19.uw.edu.pl/|Corpus of 19th-century Polish]] (1830–1918, IJP UW)
 * [[http://korpus19.nlp.ipipan.waw.pl/|Manually annotated and transcribed corpus of the 19th century Polish]], (1830–1918, IPI PAN)
 * [[http://chronopress.clarin-pl.eu/|ChronoPress]], corpus of press texts from 1945–1954 (A. Pawłowski),
 * [[PL196x|Polish language of the 1960s / Frequency corpus]] (I. Kurcz, A. Lewicki, J. Sambor, J. Woronczak, K. Szafran, J. S. Bień, M. Woliński).

== Corpus-related tools and resources ==
 * [[http://poliqarp.sourceforge.net/|Poliqarp]], a corpus indexing and search engine (please see also [[http://nlp.ipipan.waw.pl/Poliqarp/|the beta version of Poliqarp 1.1]] and [[http://clip.ipipan.waw.pl/Poliqarp|1.3]] with statistical extensions and [[http://liszt.ipipan.waw.pl/|several corpora indexed with Poliqarp 2]]),
 * [[http://zil.ipipan.waw.pl/Anotatornia|Anotatornia]], a system for multi-level manual annotation of corpora,
 * [[http://zil.ipipan.waw.pl/Anotatornia2|Anotatornia2]], new version of Anotatornia geared towards annotation of historical corpora,
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/inforex|Inforex]], a web-based system designed for managing and annotating text corpora on the semantic level,
 * [[http://smyrna.danieljanus.pl/|Smyrna]], a simple, light-weight Polish concordancer,
 * [[http://korpusy.net/|korpusy.net]], a corpus research-related website (B. Gałkowski),
 * [[http://zil.ipipan.waw.pl/Korpusomat|Korpusomat]], a tool for creating your own searchable corpora (Ł. Kobyliński, W. Kieraś).
Line 33: Line 58:
 * The PELCRA conversational corpus of Polish: approx. 2.2 million words of casual conversational spoken Polish collected and processed in the years 2001-2015 in a number of research projects, including PELCRA, NKJP, CESAR and CLARIN-PL. Available under CC-BY-NC. All transcriptions can be accessed through the [[http://spokes.clarin-pl.eu|Spokes web interface]] and programmatically through [[http://clarin.pelcra.pl/apidocs/spokes| a REST API]].
Line 37: Line 62:
 * [[http://pelcra.pl/corpora/spoken|The PELCRA conversational corpus of Polish]]. TEI P5-encoded transcriptions of 1.8 million words of conversational spoken Polish collected in the years 2001-2011 (within the PELCRA and NKJP projects) available under CC-BY-NC,
Line 43: Line 67:
== Language models ==
 * [[http://zil.ipipan.waw.pl/NKJPNGrams|N-grams from the balanced subcorpus of the National Corpus of Polish]],
 * [[http://mozart.ipipan.waw.pl/~axw/models/|Distributional semantic models]] trained on orthographical, lemmatized word forms (A. Wawer),
 * Even more [[http://dsmodels.nlp.ipipan.waw.pl|distributional semantic models]] based on NKJP (A. Mykowiecka, M. Marciniak, P. Rychlik),
 * [[http://publications.it.p.lodz.pl/2016/word_embeddings/|Wikipedia-based word embeddings for Polish]] (M. Rogalski, P. Szczepaniak),
 * [[https://wikipedia2vec.github.io/wikipedia2vec/pretrained/|Wikipedia2Vec – pretrained embeddings for Polish]] (I. Yamada, A. Asai, H. Shindo, H. Takeda, Y. Takefuji),
 * [[https://github.com/deepmipt/Slavic-BERT-NER|Slavic BERT NER]] (see also [[http://docs.deeppavlov.ai/en/master/features/pretrained_vectors.html#bert|Deep Pavlov website]]),
 * [[https://github.com/sdadas/polish-nlp-resources|RoBERTa]] and links to many other useful resources (S. Dadas),
 * [[https://github.com/kldarek/polbert|Polbert]] (D. Kłeczek).


Line 44: Line 80:
 * [[http://parasol.unibe.ch|ParaSol]], a parallel corpus of Slavic and other languages,  * [[http://parasolcorpus.org/|ParaSol]], a parallel corpus of Slavic and other languages,
Line 51: Line 87:
 * [[http://pelcra.pl/new/cesar|PELCRA Parallel corpora]], a collection of downloadable parallel corpora available under the CC-BY and CC-BY-NC licenses, developed by the PELCRA team within the CESAR project,  * [[http://pelcra.pl/new/cesar|PELCRA Parallel corpora]], a collection of downloadable parallel corpora available under the CC-BY and CC-BY-NC licenses, developed by the PELCRA team,
 * [[http://paralela.clarin-pl.eu|Paralela Polish-English corpus]]
Line 56: Line 93:
 * [[http://www.tausdata.org/|TAUS Data]], a multilingual TM from the members of TAUS Data Association.  * [[http://www.tausdata.org/|TAUS Data]], a multilingual TM from the members of TAUS Data Association,
 * [[http://glosbe.com/|Glosbe]], an open source TM,
 * [[https://github.com/poethan/AlphaMWE|AlphaMWE]], a parallel English-Chinese, English-Polish and English-German corpus annotated with multiword expressions.
Line 59: Line 98:
 * [[http://plwordnet.pwr.wroc.pl/wordnet|plWordNet, Polish WordNet]] (M. Piasecki),  * [[http://plwordnet.pwr.wroc.pl/wordnet|plWordNet, Polish WordNet, Słowosieć]] (M. Piasecki),
 * [[http://www.ltc.amu.edu.pl/polnet/|POLNET, another Polish Wordnet]] (Z. Vetulani),
Line 61: Line 101:
 * [[http://www.sjp.pl/|Słownik języka polskiego (d. alternatywny)]], Polish ispell dictionaries, along with some definitions and online form display.
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:ngram|N-gram model of Polish]] (AGH DSP),
 * [[http://www.sjp.pl/|Słownik języka polskiego (d. alternatywny)]], Polish ispell dictionaries, along with some definitions and online form display,
Line 69: Line 108:
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/nelexicon|NELexicon]] contains more than 1.4 million proper names (M. Marcińczuk, A. Musiał, M. Janicki),  * [[http://www.nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/nelexicon|NELexicon]] contains more than 1.4 million proper names (M. Marcińczuk, A. Musiał, M. Janicki),
Line 75: Line 114:
 * [[http://zil.ipipan.waw.pl/WikiTopoPl|WikiTopoPl]], a multilingual lexicon of 155,000 Polish geographical proper names extracted from Wikipedia and their equivalents in Bulgarian, Croatian, English, German, modern Greek, Hungarian, Romanian, Serbian and Slovak (L. Manicki).
 * [[http://zil.ipipan.waw.pl/Prolexbase|Prolexbase 2.0]], a multilingual relational dictionary of proper names in Polish, English and French (M. Baron, B. Bouchou-Markhoff, L. Manicki, D. Maurel, A. Savary, M. Tran).
 * [[http://zil.ipipan.waw.pl/WikiTopoPl|WikiTopoPl]], a multilingual lexicon of 155,000 Polish geographical proper names extracted from Wikipedia and their equivalents in Bulgarian, Croatian, English, German, modern Greek, Hungarian, Romanian, Serbian and Slovak (L. Manicki),
 * [[http://zil.ipipan.waw.pl/Prolexbase|Prolexbase 2.0]], a multilingual relational dictionary of proper names in Polish, English and French (M. Baron, B. Bouchou-Markhoff, L. Manicki, D. Maurel, A. Savary, M. Tran),
 * [[http://clip.ipipan.waw.pl/DeepEREntityLibrary|DeepER Entity Library]], a database containing around 900,000 entities, each described by its textual representations in Polish (names) and `WordNet` synsets.
Line 79: Line 119:
 * [[http://sgjp.pl|Słownik gramatyczny języka polskiego]],
Line 81: Line 122:
Line 84: Line 126:
 * [[http://kpbc.umk.pl/dlibra/publication?id=17781|Słownik polszczyzny XVI wieku]],
 * [[https://sxvii.pl/|Elektroniczny słownik języka polskiego XVII i XVIII wieku]],
Line 89: Line 133:
 * PELCRA HASK Collocation Dictionaries generated for [[http://pelcra.pl/hask_pl|Polish]] and [[http://pelcra.pl/hask_en|English]].  * PELCRA HASK Collocation Dictionaries generated for [[http://pelcra.pl/hask_pl|Polish]] and [[http://pelcra.pl/hask_en|English]],
 * [[http://clip.ipipan.waw.pl/UkrPolDict|Słownik ukraińsko-polski]] (Ukrainian-Polish dictionary), edited by Janusz A. Rieger; materials for the dictionary: the letter "O".
Line 92: Line 137:
 * [[http://zil.ipipan.waw.pl/PoliMorf|PoliMorf]], the ultimate inflectional dictionary of Polish (under development),
 * [[http://sgjp.pl/morfeusz/|Morfeusz SGJP]], morphological analyser (Z. Saloni, W. Gruszczyński, M. Woliński, R. Wołosz),
 * [[http://sgjp.pl|SGJP]], Grammatical Dictionary of Polish (the list of inflected forms is available with [[http://morfeusz.sgjp.pl/download/|Morfeusz]]),
 * [[http://zil.ipipan.waw.pl/PoliMorf|PoliMorf]], an inflectional dictionary of Polish,
 * [[http://morfeusz.sgjp.pl/|Morfeusz SGJP]], morphological analyser,
Line 96: Line 142:
 * [[ftp://ftp.mimuw.edu.pl/pub/users/polszczyzna/SAM-95/|SAM]], morphological analyser (K. Szafran),  * [[http://duch.mimuw.edu.pl/~kszafran/index.php?option=com_docman&task=cat_view&gid=49&Itemid=93|SAM]], morphological analyser (K. Szafran),
Line 108: Line 154:
 * [[http://lemmatise.ijs.si/Services|LemmaGen]], Multilingual Open Source Lemmatisation for 11 EU languages, including Polish (M. Jursic, T. Erjavec et al.)  * [[http://lemmatise.ijs.si/Services|LemmaGen]], Multilingual Open Source Lemmatisation for 11 EU languages, including Polish (M. Jursic, T. Erjavec et al.),
 * [[http://zil.ipipan.waw.pl/LemmaPL|LemmaPL]], a lemmatization tool for Polish.
Line 117: Line 164:
 * [[http://zil.ipipan.waw.pl/NKJP%20model%20for%20TnT%20Tagger|NKJP model for TnT Tagger]], a trained model usable on Morfeusz-segmented text with [[http://www.coli.uni-saarland.de/~thorsten/tnt/|TnT Tagger]].
 * [[http://clarin.pelcra.pl/tools/tagger|OpenNLP-based PoS tagger trained on the 1M NKJP corpus]] with a [[http://clarin.pelcra.pl/tools/api/hask/application.wadl|REST API]]
 * [[http://zil.ipipan.waw.pl/PoliTa|PoliTa]], a morphosyntactic meta-tagger,
 * [[https://github.com/kwrobel-nlp/krnnt|KRNNT]], a morphological tagger for Polish based on recurrent neural networks,
 * [[http://zil.ipipan.waw.pl/NKJP%20model%20for%20TnT%20Tagger|NKJP model for TnT Tagger]], a trained model usable on Morfeusz-segmented text with [[http://www.coli.uni-saarland.de/~thorsten/tnt/|TnT Tagger]],
 * [[http://clarin.pelcra.pl/tools/tagger|A PoS tagger trained on the 1M NKJP corpus and using Morfeusz]] [[http://ltc.amu.edu.pl/book/papers/PolEval1-3.pdf|(Pęzik & Laskowski 2017)]] with a [[http://clarin.pelcra.pl/apt_pl/?sentences=%5B%22Ala%20lubi%20kota.%22%2C%22Jurek%20ma%20worek.%22%5D|REST API]].
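The REST API link in the last tagger entry above passes its input as a `sentences` query parameter holding a percent-encoded JSON array of strings. A minimal Python sketch of how such a request URL could be assembled is given below. The function name `build_tagger_url` is hypothetical, and the sketch only builds the URL: the page does not describe the response format or whether the service is still online, so no request is sent.

```python
import json
from urllib.parse import quote

# Base address copied from the REST API link in the tagger entry above.
BASE_URL = "http://clarin.pelcra.pl/apt_pl/"


def build_tagger_url(sentences):
    """Return a GET URL whose `sentences` parameter is a percent-encoded JSON array.

    Hypothetical helper: it mirrors the shape of the example link only.
    """
    # Compact separators give '["a","b"]' with no spaces between items.
    payload = json.dumps(sentences, ensure_ascii=False, separators=(",", ":"))
    # safe="" makes quote() escape every reserved character, including ',' and '"'.
    return BASE_URL + "?sentences=" + quote(payload, safe="")


url = build_tagger_url(["Ala lubi kota.", "Jurek ma worek."])
print(url)
```

For the two sample sentences used in the link, the sketch reproduces the linked URL character for character.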
Line 121: Line 170:
 * [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|Składnica]], a hybrid constituency/dependency treebank of Polish (under development),  * [[http://zil.ipipan.waw.pl/PDB|PDB 2.0]], a dependency treebank of Polish (A. Wróblewska),
 * [[http://git.nlp.ipipan.waw.pl/alina/PDBUD|PDB-UD]], a version of PDB 2.0 in Universal Dependencies format (A. Wróblewska),
 * [[http://zil.ipipan.waw.pl/PDB/PDBparser|PDBparser]], a Polish dependency parser (A. Wróblewska),
 * Składnica, a hybrid constituency/dependency treebank of Polish,
   * [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|Składnica main page]],
   * [[http://zil.ipipan.waw.pl/Sk%C5%82adnicaMWE|SkładnicaMWE]], a constituency version of Składnica with multiword expression annotations (J. Waszczuk, A. Savary),
   * [[http://treebank.nlp.ipipan.waw.pl/|Składnica search engine]] (M. Woliński),
Line 123: Line 178:
 * Świgra, a DCG parser,
   * [[http://nlp.ipipan.waw.pl/~wolinski/swigra/|version 1.0]] (2005),
   * [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|version 1.5]] as used in Składnica (2011),
 * [[http://zil.ipipan.waw.pl/LFG|POLFIE, an LFG grammar of Polish]]
   * [[http://iness.mozart.ipipan.waw.pl/iness/xle-web|POLFIE as a web service]].
 * [[http://zil.ipipan.waw.pl/%C5%9Awigra|Świgra]], a DCG parser,
   * [[http://swigra.nlp.ipipan.waw.pl/|On-line demo]],
Line 129: Line 185:
   * Spejd [[http://clip.ipipan.waw.pl/SpejdLemmatizingGrammar|grammar of Polish with lemmatisation of Polish nominal syntactic groups]],
Line 132: Line 189:
 * [[http://amos.klf.uw.edu.pl/| Visualisation of parsing tree forests]] (Świdziński's grammar, Świgra, Morfeusz, Bień's syntactic spreadsheets) by Andrzej Zaborowski,
Line 134: Line 190:
 * [[ftp://ftp.mimuw.edu.pl/pub/People/polszczyzna/Szpakowicz/|Formalny opis składniowy zdań polskich]] (S. Szpakowicz),  * [[http://www.site.uottawa.ca/~szpak/oldStuff/|Formalny opis składniowy zdań polskich]] (S. Szpakowicz),
Line 139: Line 195:
 * [[http://zil.ipipan.waw.pl/ENIAM|ENIAM]] (W. Jaworski).

== Semantic resources ==
 * [[http://zil.ipipan.waw.pl/Scwad/CDSCorpus|CDSCorpus]], a dataset of 10k pairs of Polish sentences manually annotated for semantic relatedness and entailment (A. Wróblewska)
 * [[http://git.nlp.ipipan.waw.pl/Scwad/SCWAD-probing-data|Probing datasets]], Polish and English probing datasets for linguistic verification of sentence embeddings (A. Wróblewska)

== Sentiment analysis, opinion mining ==
 * [[http://zil.ipipan.waw.pl/SlownikWydzwieku|Polish sentiment dictionary]], with sentiment scores computed using supervised methods (A. Wawer),
 * [[http://zil.ipipan.waw.pl/LCM-PL|Polish Linguistic Category Model]] following a typology of verb categorization in terms of their abstractness, also a tool to measure language abstraction (A. Wawer),
 * [[http://zil.ipipan.waw.pl/TreebankWydzwieku|Polish dependency treebank with sentiment annotations]] (A. Wawer),
 * [[http://zil.ipipan.waw.pl/HateSpeech|HateSpeech corpus]], 2000 manually annotated documents representing various types and degrees of offensive language expressed toward minorities,
 * [[http://zil.ipipan.waw.pl/Korpus%20Szczerosci|Sincerity Corpus (Korpus Szczerości)]], a collection of fake and real reviews,
 * you can also test Sentipejd – sentiment analysis tool in the [[http://multiservice.nlp.ipipan.waw.pl/|Multiservice]] (please select a tagger first),
 * [[https://exp.lobi.nencki.gov.pl/nawl-analysis|Nencki Affective Word List]].

== Coreference ==
 * [[http://zil.ipipan.waw.pl/PolishCoreferenceCorpus|Polish Coreference Corpus]], a 500 M corpus of general nominal coreference in Polish (M. Ogrodniczuk),
 * [[http://zil.ipipan.waw.pl/PolishCoreferenceTools|Polish Coreference Tools]], a suite of Polish coreference resolution tools, created as part of the [[http://zil.ipipan.waw.pl/CORE|CORE project]].
Line 143: Line 217:
 * [[http://techmo.pl/index.php?option=com_content&view=article&id=54&Itemid=166&lang=pl|Techmo]] TTS demo (Techmo),
 * [[http://www.nuance.com/landing-pages/playground/Vocalizer_Demo2/vocalizer_modal.html?demo=true|Vocalizer]], commercial text-to-speech system (Nuance),
Line 152: Line 228:
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:spkreco|System rozpoznawania mówcy]], (AGH DSP).  * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:spkreco|System rozpoznawania mówcy]] (AGH DSP).
Line 165: Line 241:
 * [[http://las.aei.polsl.pl/PolSum/#/Home|PolSum]] by S. Kulików,
 * [[http://www.cs.put.poznan.pl/dweiss/research/lakon/|Lakon]], a system for news summarization (master's thesis by A. Dudczak).
 * [[http://www.cs.put.poznan.pl/dweiss/research/lakon/|Lakon]], a system for news summarization (A. Dudczak),
 * [[http://las.aei.polsl.pl/PolSum/#/Home|PolSum]] (S. Kulików),
 * [[http://clip.ipipan.waw.pl/Summar|Summar]] (Ł. Pawluczuk),
 * [[http://clip.ipipan.waw.pl/Summarizer|Summarizer]] (J. Świetlicka),
 * you can also test Lakon, Open Text Summarizer and Summarizer in [[http://multiservice.nlp.ipipan.waw.pl/|Multiservice]]
 * and take a look at the [[http://zil.ipipan.waw.pl/PolishSummariesCorpus|Polish Summaries Corpus]].

== Diacritization ==
 * [[http://www.gzegzolka.com/poliszynel/|Poliszynel]] (P. Sawicki),
 * [[http://www.spolszcz.pl/|spolszcz.pl]] (P. Sawicki),
 * [[http://www.polszczyzna.info/polonizator|Polonizator]] (TiP),
 * [[http://slowniki.zoni.pl/?s=ogonki|Polonizer]],
 * [[http://galaxy.eti.pg.gda.pl/katedry/kiw/pracownicy/Jan.Daciuk/personal/man/fsa_accent.1.html|fsa_accent]] (J. Daciuk),
 * [[http://wm.ite.pl/proj/pliterki/index.html|pliterki]] (W. Muła).

== Named entity recognition ==
 * [[http://zil.ipipan.waw.pl/Nerf|Nerf]], a tool for named entity recognition, available on GNU GPL v.3 (J. Waszczuk),
 * [[http://nlp.pwr.wroc.pl/narzedzia-i-zasoby/narzedzia/liner2|Liner2]], named entity recognizer released on GNU GPL with models to recognize 5 and 56 categories of proper names (M. Marcińczuk and M. Janicki),
 * [[https://clarin-pl.eu/dspace/handle/11321/302|TIMEX]], a model for Liner2 to recognize and normalize temporal expressions (J. Kocoń and M. Marcińczuk).

== Multiword expression software ==
 * [[http://zil.ipipan.waw.pl/TermoPL|TermoPL]], multiword expression extraction tool,
 * [[http://multiword.sourceforge.net/sharedtaskresults2018/|VMWE identifiers]], systems that participated in the PARSEME shared task on automatic identification of verbal MWEs; 13 of the 17 systems submitted results for Polish,
 * [[https://mwedemonstrator.atilf.fr/mwetools/accueil/|PARSEME-FR demonstrator]], including the ATILF-LLF multiword expression identifier for Polish.

== Aggregating services ==
 * [[http://multiservice.nlp.ipipan.waw.pl/|Multiservice]], a sample interface for running NLP Web services for Polish (see also [[http://redmine.nlp.ipipan.waw.pl/redmine/projects/multiserwis/wiki/Usage|usage]] and [[http://redmine.nlp.ipipan.waw.pl/redmine/projects/multiserwis/wiki/InOut|format]]),
 * [[http://ws.clarin-pl.eu/|Online demos of tools for processing Polish texts]] (CLARIN-PL),
 * [[http://psi-toolkit.wmi.amu.edu.pl/index.html|PSI-Toolkit]], a chain of publicly available tools for automatic processing of Polish.
Line 169: Line 272:
 * [[http://www.mimuw.edu.pl/polszczyzna/kolokacje/index.htm|Kolokacje]], a Web crawler and collocation finder (A. Buczyński),  * [[https://play.google.com/store/apps/details?id=com.pwr.plwordnet|Mobile plWordNet]], free mobile application for plWordNet browsing (J. Kocoń), /* * [[http://www.mimuw.edu.pl/polszczyzna/kolokacje/index.htm|Kolokacje]], a Web crawler and collocation finder (A. Buczyński),*/
Line 173: Line 276:
 * [[http://nlp.pwr.wroc.pl/redmine/projects/toki/wiki|Toki]], a tokenizer supporting SRX standard, C++ library and toolkit (T. Śniatowski and A. Radziszewski)
 * [[http://poleng.pl/translatica-pl.srx|Translatica SRX sentence segmentation rules for Polish (LGPL)]] 
 * [[http://nlp.pwr.wroc.pl/redmine/projects/toki/wiki|Toki]], a tokenizer supporting SRX standard, C++ library and toolkit (T. Śniatowski and A. Radziszewski),
 * [[http://poleng.pl/translatica-pl.srx|Translatica SRX sentence segmentation rules for Polish (LGPL)]],
Line 176: Line 279:
 * [[http://glass.ipipan.waw.pl/multiservice/|Multiservice]], a sample interface for running NLP Web services for Polish,
Line 179: Line 281:
 * [[http://zil.ipipan.waw.pl/Nerf|Nerf]], a tool for named entity recognition, available on GNU GPL v.3.
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/liner2|Liner2]], named entity recognizer released on GNU GPL with models to recognize 5 and 56 categories of proper names (M. Marcińczuk and M. Janicki).
 * [[http://psi-toolkit.wmi.amu.edu.pl/index.html|PSI-Toolkit]], a chain of publicly available tools for automatic processing of Polish.
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/fextor|Fextor]], a feature extraction framework.
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/lexcsd|LexCSD]], a system for semi-automatic sense disambiguation.
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/supermatrix|SuperMatrix]], a general tool for lexical semantic knowledge acquisition
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/wordnetloom|WordnetLoom]], a wordnet editor application.
 * [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]], tool for the creation of electronic inflectional dictionaries of multi-word units.
 * [[http://zil.ipipan.waw.pl/CorpCor|CorpCor]], a web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NCP).
 * [[http://zil.ipipan.waw.pl/PolishCoreferenceTools|Polish Coreference Tools]], a suite of Polish coreference resolution tools, created as part of the [[http://zil.ipipan.waw.pl/CORE|CORE project]].
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/fextor|Fextor]], a feature extraction framework,
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/lexcsd|LexCSD]], a system for semi-automatic sense disambiguation,
 * [[http://www.nlp.pwr.wroc.pl/en/tools-and-resources/supermatrix|SuperMatrix]], a general tool for lexical semantic knowledge acquisition,
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/wordnetloom|WordnetLoom]], a wordnet editor application,
 * [[http://zil.ipipan.waw.pl/Toposlaw|Toposław]], tool for the creation of electronic inflectional dictionaries of multi-word units,
 * [[http://zil.ipipan.waw.pl/CorpCor|CorpCor]], a web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NKJP),
 * [[http://ws.clarin-pl.eu/demo/stylo2.html|Stylo 2]], stylometry demo,
 * [[http://clip.ipipan.waw.pl/DeepEvents|DeepEvents]], event extraction in Polish, based on deep neural networks,
 * [[http://dsmodels.nlp.ipipan.waw.pl/sim1.html|Word similarity]], calculation of the similarity of words based on word embeddings, on-line service,
 * [[http://baltoslav.eu/?mova=pl|Baltoslav]], with several script converters (Romanizer, Cyrillizer, IPA Converter etc.),
 * [[http://zil.ipipan.waw.pl/SpacyPL|SpacyPL]], Polish language models and resources for [[https://spacy.io|Spacy]],
 * [[https://jasnopis.pl/|Jasnopis]], analyzer of text obscurity level,
 * [[http://zil.ipipan.waw.pl/Scwad/AIDe|AIDe]], corpus of image descriptions in Polish (A. Wróblewska).

Language Tools and Resources for Polish

This page contains a list of publicly available language tools and resources.

Written corpora of contemporary Polish

Written corpora of historical Polish

Spoken corpora

Language models

Parallel corpora and translation memories

Machine-readable dictionaries

Human-readable dictionaries

Morphological tools and resources

Taggers

Parsers, grammars, treebanks

Semantic resources

  • CDSCorpus, a dataset of 10k pairs of Polish sentences manually annotated for semantic relatedness and entailment (A. Wróblewska)

  • Probing datasets, Polish and English probing datasets for linguistic verification of sentence embeddings (A. Wróblewska)

Sentiment analysis, opinion mining

Coreference

Speech analysis and synthesis tools

Machine translation demonstrations

Summarizers

Diacritization

Named entity recognition

  • Nerf, a tool for named entity recognition, available on GNU GPL v.3 (J. Waszczuk),

  • Liner2, named entity recognizer released on GNU GPL with models to recognize 5 and 56 categories of proper names (M. Marcińczuk and M. Janicki),

  • TIMEX, a model for Liner2 to recognize and normalize temporal expressions (J. Kocoń and M. Marcińczuk).

Multiword expression software

  • TermoPL, multiword expression extraction tool,

  • VMWE identifiers, systems that participated in the PARSEME shared task on automatic identification of verbal MWEs; 13 of the 17 systems submitted results for Polish,

  • PARSEME-FR demonstrator, including the ATILF-LLF multiword expression identifier for Polish.

Aggregating services

Other

  • Mobile plWordNet, free mobile application for plWordNet browsing (J. Kocoń),

  • WSDDE, a system for designing and performing Word Sense Disambiguation experiments (R. Młodzki et al.),

  • Frazeo, a search engine and clusterer of news in Polish (P. Pęzik),

  • Segment, a rule-based sentence tokenizer supporting SRX standard (J. Lipski; the Polish rules are available in LanguageTool project, see here for short instructions on how to use the tool),

  • Toki, a tokenizer supporting SRX standard, C++ library and toolkit (T. Śniatowski and A. Radziszewski),

  • Translatica SRX sentence segmentation rules for Polish (LGPL),

  • SyMGIZA++, an extension of Giza++ that computes symmetric word alignment models,

  • Hipisek, an experimental question answering system (M. Walas),

  • Text digitization tools, Poliqarp for DjVu and other programs,

  • Fextor, a feature extraction framework,

  • LexCSD, a system for semi-automatic sense disambiguation,

  • SuperMatrix, a general tool for lexical semantic knowledge acquisition,

  • WordnetLoom, a wordnet editor application,

  • Toposław, tool for the creation of electronic inflectional dictionaries of multi-word units,

  • CorpCor, a web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NKJP),

  • Stylo 2, stylometry demo,

  • DeepEvents, event extraction in Polish, based on deep neural networks,

  • Word similarity, calculation of the similarity of words based on word embeddings, on-line service,

  • Baltoslav, with several script converters (Romanizer, Cyrillizer, IPA Converter etc.),

  • SpacyPL, Polish language models and resources for Spacy,

  • Jasnopis, analyzer of text obscurity level,

  • AIDe, corpus of image descriptions in Polish (A. Wróblewska).
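
The Word similarity entry in the list above refers to computing the similarity of words from word embeddings. A common way to do this is cosine similarity between the two word vectors. Whether the listed service uses exactly this measure is an assumption, and the three-dimensional vectors below are invented toy values, not taken from any of the listed models.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; real models typically use hundreds of dimensions.
toy_vectors = {
    "kot": [1.0, 2.0, 0.0],
    "pies": [2.0, 4.0, 0.0],   # same direction as "kot"
    "worek": [0.0, 0.0, 3.0],  # orthogonal to "kot"
}

print(cosine_similarity(toy_vectors["kot"], toy_vectors["pies"]))
print(cosine_similarity(toy_vectors["kot"], toy_vectors["worek"]))
```

Vectors pointing in the same direction score close to 1 and orthogonal vectors score 0, regardless of vector length, which is why the measure is popular for comparing embeddings.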