
Diff for "LRT"

Differences between revisions 368 and 429 (spanning 61 versions)
Revision 368 as of 2017-12-06 22:12:33
Size: 28185
Editor: AgataSavary
Comment:
Revision 429 as of 2021-07-28 10:34:26
Size: 31735
Comment:
Line 5: Line 5:
/* * [[attachment:NKJP-PodkorpusMilionowy-1.0.tgz]], the manually annotated 1-million word subcorpus of the NJKP, available on GNU GPL v.3, */ /* * [[attachment:NKJP-PodkorpusMilionowy-1.0.tgz]], the manually annotated 1-million worsd subcorpus of the NJKP, available on GNU GPL v.3, */
Line 11: Line 11:
 * [[http://korpus.pwn.pl/|PWN Corpus]],
Line 12: Line 13:
 * [[http://korpus.pwn.pl/|PWN Corpus]],
Line 14: Line 14:
 * [[PSC|Polish Sejm Corpus]],  * [[PPC|Polish Parliamentary Corpus]],
Line 28: Line 28:
 * [[http://clip.ipipan.waw.pl/PARSEME-PL|Polish PARSEME corpus]], annotated manually for verbal multiword expressions in 18 languages including Polish, used in the [[http://multiword.sourceforge.net/sharedtask2017|PARSEME shared task 1.0]]; the Polish subcorpus is aligned with automatic dependency annotations in the [[http://universaldependencies.org/guidelines.html|UD]] format (A. Savary)
 * [[http://clip.ipipan.waw.pl/PARSEME-PL|Literal readings]] of Polish verbal MWEs from the [[http://clip.ipipan.waw.pl/PARSEME-PL|Polish PARSEME corpus] (A. Savary, S. Cordeiro)
 * [[http://clip.ipipan.waw.pl/PARSEME-PL|Polish PARSEME corpus]], annotated manually for verbal multiword expressions in 20 languages including Polish, used in the [[http://multiword.sourceforge.net/sharedtask2018/|PARSEME shared task 1.1]]; the Polish subcorpus is aligned with automatic dependency annotations in the [[http://universaldependencies.org/guidelines.html|UD]] format (A. Savary),
 * [[http://clip.ipipan.waw.pl/MweLitRead|MweLitRead]] - a corpus of literal readings of Polish verbal MWEs stemming from the [[http://clip.ipipan.waw.pl/PARSEME-PL|Polish PARSEME corpus]] (A. Savary, S. Cordeiro),
 * [[http://pelcra.pl/plec/downloads|PELCRA Learner English Corpus]] (PLEC),
 * [[https://www.sketchengine.eu/user-guide/user-manual/corpora/by-language/polish-text-corpora/|Polish text corpora]] included in Sketch Engine,
 * [[http://synamet.polon.uw.edu.pl/|Microcorpus of Synesthetic Metaphors]].
Line 32: Line 35:
 * [[PL196x|Polish language of the 1960s / Frequency corpus]] (I. Kurcz, A. Lewicki, J. Sambor, J. Woronczak, K. Szafran, J. S. Bień, M. Woliński),  * [[http://scriptores.pl/efontes/|eFontes Mediae et Infimae Latinitatis Polonorum]] (1000–1550, IJP PAN)
 * [[https://www.ijp-pan.krakow.pl/publikacje-elektroniczne/korpus-tekstow-staropolskich|Corpus of old Polish (up to 1500)]] (IJP PAN)
 * [[http://stnt.ijp.pan.pl/|15. century New Testament translations]] (IJP PAN)
 * [[https://szukajwslownikach.uw.edu.pl/IMPACT_GT_1/|IMPACT project corpus]] (1570–1756, KLF UW)
 * [[http://spxvi.edu.pl/korpus/|Corpus of 16. century Polish]] (IBL PAN)
 * [[http://fedora.clarin-d.uni-saarland.de/poldilemma/|PolDiLemma]], the Middle Polish Diachrone Lemmatised Corpus (16–18th c., R. Meyer)
 * [[http://rhssl1.uni-regensburg.de/SlavKo/korpus/poldi|PolDi]], a Polish Diachronic Online Corpus (R. Meyer)
 * [[http://korba.edu.pl|KORBA]], electronic corpus of 17th and 18th century Polish texts (1601–1772, IJP PAN)
 * [[http://www.f19.uw.edu.pl/|Corpus of the 19. century Polish]] (1830–1918, IJP UW)
 * [[http://korpus19.nlp.ipipan.waw.pl/|Manually annotated and transcribed corpus of the 19th century Polish]], (1830–1918, IPI PAN)
Line 34: Line 46:
 * [[http://www.f19.uw.edu.pl/|Microcorpus of Polish: 1830-1918]], (M. Derwojedowa),
 * [[http://korba.edu.pl|KORBA]], electronic corpus of 17th and 18th century Polish texts (W. Gruszczyński),
 * [[http://fedora.clarin-d.uni-saarland.de/poldilemma/|PolDiLemma]], the Middle Polish Diachrone Lemmatised Corpus (R. Meyer),
 * [[http://www.spxvi.edu.pl/korpus/|Corpus of 16. century Polish]] (IBL PAN),
 * [[https://www.ijp-pan.krakow.pl/publikacje-elektroniczne/korpus-tekstow-staropolskich|Corpus of old Polish (up to 1500)]] (IJP PAN).
 * [[PL196x|Polish language of the 1960s / Frequency corpus]] (I. Kurcz, A. Lewicki, J. Sambor, J. Woronczak, K. Szafran, J. S. Bień, M. Woliński).
Line 44: Line 51:
 * [[http://zil.ipipan.waw.pl/Anotatornia2|Anotatornia2]], new version of Anotatornia geared towards annotation of historical corpora,
Line 47: Line 55:
 * [[http://korpusomat.nlp.ipipan.waw.pl/|Korpusomat]], a tool for creation of searchable own corpora (Ł. Kobyliński, W. Kieraś).  * [[http://zil.ipipan.waw.pl/Korpusomat|Korpusomat]], a tool for creation of searchable own corpora (Ł. Kobyliński, W. Kieraś).
Line 50: Line 58:
 * The PELCRA conversational corpus of Polish: approx. 2.2 million words of casual conversational spoken Polish collected and processed in the years 2001-2015 in a number of research projects, including PELCRA, NKJP, CESAR and CLARIN-PL. Available under CC-BY-NC. All transcriptions can be accessedthrough the [[http://spokes.clarin-pl.eu|Spokes web interface]] and programmatically through [[http://clarin.pelcra.pl/apidocs/spokes| a REST API]].  * The PELCRA conversational corpus of Polish: approx. 2.2 million words of casual conversational spoken Polish collected and processed in the years 2001-2015 in a number of research projects, including PELCRA, NKJP, CESAR and CLARIN-PL. Available under CC-BY-NC. All transcriptions can be accessed through the [[http://spokes.clarin-pl.eu|Spokes web interface]] and programmatically through [[http://clarin.pelcra.pl/apidocs/spokes| a REST API]].
Line 61: Line 69:
 * [[http://www.dsp.agh.edu.pl/doku.php?id=pl:resources:ngram|N-gram model of Polish]] (AGH DSP),
Line 63: Line 70:
 * Even more [[http://dsmodels.nlp.ipipan.waw.pl|distributional semantic models]] based on NKJP (A. Mykowiecka, M. Marciniak, P. Rychlik).  * Even more [[http://dsmodels.nlp.ipipan.waw.pl|distributional semantic models]] based on NKJP (A. Mykowiecka, M. Marciniak, P. Rychlik),
 * [[http://publications.it.p.lodz.pl/2016/word_embeddings/|Wikipedia-based word embeddings for Polish]] (M. Rogalski, P. Szczepaniak),
 * [[https://wikipedia2vec.github.io/wikipedia2vec/pretrained/|Wikipedia2Vec – pretrained embeddings for Polish]] (I. Yamada, A. Asai, H. Shindo, H. Takeda, Y. Takefuji),
 * [[https://github.com/deepmipt/Slavic-BERT-NER|Slavic BERT NER]] (see also [[http://docs.deeppavlov.ai/en/master/features/pretrained_vectors.html#bert|Deep Pavlov website]]),
 * [[https://github.com/sdadas/polish-nlp-resources|RoBERTa]] and links to many other useful resources (S. Dadas),
 * [[https://github.com/kldarek/polbert|Polbert]] (D. Kłeczek),
Line 82: Line 95:
 * [[https://github.com/poethan/AlphaMWE|AlphaMWE]], a parallel English-Chinese, English-Polish and English-German corpus annotated with multiword expressions
Line 94: Line 108:
 * [[http://nlp.pwr.wroc.pl/en/tools-and-resources/nelexicon|NELexicon]] contains more than 1.4 millions of proper names (M. Marcińczuk, A. Musiał, M. Janicki),  * [[http://www.nlp.pwr.wroc.pl/narzedzia-i-zasoby/zasoby/nelexicon|NELexicon]] contains more than 1.4 millions of proper names (M. Marcińczuk, A. Musiał, M. Janicki),
Line 102: Line 116:
 * [[http://clip.ipipan.waw.pl/DeepEREntityLibrary|DeepER Entity Library]], a database containing around 900,000 entities, each described by its textual representations in Polish (names) and `WordNet` synsets,
 * [[http://publications.it.p.lodz.pl/2016/word_embeddings/|Word embeddings for Polish - Wikipedia based]] (M. Rogalski, P. Szczepaniak).
 * [[http://clip.ipipan.waw.pl/DeepEREntityLibrary|DeepER Entity Library]], a database containing around 900,000 entities, each described by its textual representations in Polish (names) and `WordNet` synsets.
Line 106: Line 119:
 * [[http://sgjp.pl|Słownik gramatyczny języka polskiego]],
Line 113: Line 127:
 * [[http://xvii-wiek.ijp-pan.krakow.pl/pan_klient/|Słownik Elektroniczny Słownik Języka Polskiego XVII i XVIII wieku]],  * [[https://sxvii.pl/|Elektroniczny słownik języka polskiego XVII i XVIII wieku]],
Line 123: Line 137:
 * [[http://zil.ipipan.waw.pl/PoliMorf|PoliMorf]], the ultimate inflectional dictionary of Polish (under development),
 * [[http://sgjp.pl/morfeusz/|Morfeusz SGJP]], morphological analyser (Z. Saloni, W. Gruszczyński, M. Woliński, R. Wołosz),
 * [[http://sgjp.pl|SGJP]], Grammatical Dictionary of Polish (the list of inflected forms is available with [[http://morfeusz.sgjp.pl/download/|Morfeusz]]),
 * [[http://zil.ipipan.waw.pl/PoliMorf|PoliMorf]], an inflectional dictionary of Polish,
 * [[http://morfeusz.sgjp.pl/|Morfeusz SGJP]], morphological analyser,
Line 150: Line 165:
 * [[https://github.com/kwrobel-nlp/krnnt|KRNNT]], a morphological tagger for Polish based on recurrent neural networks,
Line 151: Line 167:
 * [[http://clarin.pelcra.pl/tools/tagger|A PoS tagger trained on the 1M NKJP corpus and Morfeusz]] with a [[http://clarin.pelcra.pl/tools/api/tagger/tag?text=Ala%20ma%20kota.%20&tagger=openNLP&tagset=standard&format=JSON&lang=pl|REST API]].  * [[http://clarin.pelcra.pl/tools/tagger|A PoS tagger trained on the 1M NKJP corpus and using Morfeusz]] [[http://ltc.amu.edu.pl/book/papers/PolEval1-3.pdf|(Pęzik & Laskowski 2017)]] with a [[http://clarin.pelcra.pl/apt_pl/?sentences=%5B%22Ala%20lubi%20kota.%22%2C%22Jurek%20ma%20worek.%22%5D|REST API]].
Line 154: Line 170:
 * [[http://zil.ipipan.waw.pl/PolishDependencyParser|Polish Dependency Parser]] (A. Wróblewska),
 * [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|Składnica]], a hybrid constituency/dependency treebank of Polish (under development),
 * [[http://zil.ipipan.waw.pl/Sk%C5%82adnicaMWE|SkładnicaMWE]], a constituency version of Składnica with multiword expression annotations (J. Waszczuk, A. Savary),
 * [[http://zil.ipipan.waw.pl/PDB|PDB 2.0]], a dependency treebank of Polish (A. Wróblewska),
 * [[http://git.nlp.ipipan.waw.pl/alina/PDBUD|PDB-UD]], a version of PDB 2.0 in Universal Dependencies format (A. Wróblewska),
 * [[http://zil.ipipan.waw.pl/PDB/PDBparser|PDBparser]], a Polish dependency parser (A. Wróblewska),
 * Składnica, a hybrid constituency/dependency treebank of Polish,
   * [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|Składnica main page]],
   * [[http://zil.ipipan.waw.pl/Sk%C5%82adnicaMWE|SkładnicaMWE]], a constituency version of Składnica with multiword expression annotations (J. Waszczuk, A. Savary),
   * [[http://treebank.nlp.ipipan.waw.pl/|Składnica search engine]] (M. Woliński),
Line 160: Line 180:
 * Świgra, a DCG parser,
   * [[http://nlp.ipipan.waw.pl/~wolinski/swigra/|version 1.0]] (2005),
   * [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|version 1.5]] as used in Składnica (2011),
   * [[http://swigra.nlp.ipipan.waw.pl/|Świgra 2.0 demo]] (2015, please use Firefox).
 * [[http://zil.ipipan.waw.pl/%C5%9Awigra|Świgra]], a DCG parser,
   * [[http://swigra.nlp.ipipan.waw.pl/|On-line demo]],
Line 179: Line 197:
== Sentiment analysis == == Semantic resources ==
 * [[http://zil.ipipan.waw.pl/Scwad/CDSCorpus|CDSCorpus]], a dataset of 10k pairs of Polish sentences manually annotated for semantic relatedness and entailment (A. Wróblewska)
 * [[http://git.nlp.ipipan.waw.pl/Scwad/SCWAD-probing-data|Probing datasets]], Polish and English probing datasets for linguistic verification of sentence embeddings (A. Wróblewska)

== Sentiment analysis, opinion mining ==
Line 182: Line 204:
 * [[http://zil.ipipan.waw.pl/TreebankWydzwieku|Polish dependency treebank with sentiment annotations]] (A. Wawer),
Line 184: Line 207:
 * you can also test Sentipejd – sentiment analysis tool in the [[http://multiservice.nlp.ipipan.waw.pl/|Multiservice]] (please select a tagger first).  * you can also test Sentipejd – sentiment analysis tool in the [[http://multiservice.nlp.ipipan.waw.pl/|Multiservice]] (please select a tagger first),
 * [[https://exp.lobi.nencki.gov.pl/nawl-analysis|Nencki Affective Word List]].
Line 232: Line 256:
== Named Entity Recognition == == Named entity recognition ==
Line 236: Line 260:

== Multiword expression software ==
 * [[http://zil.ipipan.waw.pl/TermoPL|TermoPL]], multiword expression extraction tool,
 * [[http://multiword.sourceforge.net/sharedtaskresults2018/|VMWE identifiers]], systems having participated in the PARSEME shared task for automatic identification of verbal MWEs, 13 out of 17 systems submitted results for Polish,
 * [[https://mwedemonstrator.atilf.fr/mwetools/accueil/|PARSEME-FR demonstrator]], including the ATILF-LLF multiword expression identifier for Polish,
Line 243: Line 272:
 * [[https://play.google.com/store/apps/details?id=com.pwr.plwordnet|Mobile plWordNet]], free mobile application for plWordNet browsing (J. Kocoń),
* [[http://www.mimuw.edu.pl/polszczyzna/kolokacje/index.htm|Kolokacje]], a Web crawler and collocation finder (A. Buczyński),
 * [[https://play.google.com/store/apps/details?id=com.pwr.plwordnet|Mobile plWordNet]], free mobile application for plWordNet browsing (J. Kocoń), /* * [[http://www.mimuw.edu.pl/polszczyzna/kolokacje/index.htm|Kolokacje]], a Web crawler and collocation finder (A. Buczyński),*/
Line 260: Line 288:
 * [[http://zil.ipipan.waw.pl/TermoPL|TermoPL]], multiword expression extraction tool.
Line 262: Line 289:
 * [[http://dsmodels.nlp.ipipan.waw.pl/sim1.html|Word similarity]], calculation of the similarity of words based on word embeddings, on-line service.  * [[http://dsmodels.nlp.ipipan.waw.pl/sim1.html|Word similarity]], calculation of the similarity of words based on word embeddings, on-line service,
 * [[http://baltoslav.eu/?mova=pl|Baltoslav]], with several script converters (Romanizer, Cyrillizer, IPA Converter etc.),
 * [[http://zil.ipipan.waw.pl/SpacyPL|SpacyPL]], Polish language models and resources for [[https://spacy.io|Spacy]]
 * [[https://jasnopis.pl/|Jasnopis]], analyzer of text obscurity level
 * [[http://zil.ipipan.waw.pl/Scwad/AIDe|AIDe]], corpus of image descriptions in Polish (A. Wróblewska)

Language Tools and Resources for Polish

This page contains a list of publicly available language tools and resources.

Written corpora of contemporary Polish

Written corpora of historical Polish

Spoken corpora

Language models

Parallel corpora and translation memories

Machine-readable dictionaries

Human-readable dictionaries

Morphological tools and resources

Taggers
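
The REST API link given for the PELCRA PoS tagger in revision 429 passes its input as a `sentences` query parameter whose value is a URL-encoded JSON array of strings. A minimal Python sketch of how such a URL could be assembled follows. The function name `build_tagger_url` is hypothetical, the encoding is inferred only from that one example link, and the format of the service's response is not described on this page, so the sketch sends no request.

```python
import json
from urllib.parse import quote

# Base URL copied from the tagger's REST API link on this page.
TAGGER_ENDPOINT = "http://clarin.pelcra.pl/apt_pl/"


def build_tagger_url(sentences):
    """Return a request URL carrying the sentences as a URL-encoded JSON array.

    Hypothetical helper: the parameter name and encoding are assumptions
    inferred from the single example link, not from official API documentation.
    """
    # Compact JSON (no spaces after commas), keeping Polish letters as-is
    # so that they are percent-encoded as UTF-8 in the next step.
    payload = json.dumps(sentences, ensure_ascii=False, separators=(",", ":"))
    # safe="" makes quote() escape every reserved character, including "/".
    return TAGGER_ENDPOINT + "?sentences=" + quote(payload, safe="")


if __name__ == "__main__":
    print(build_tagger_url(["Ala lubi kota.", "Jurek ma worek."]))
```

For the two sentences shown, the function reproduces the example link from the page character for character. Whether the service accepts percent-encoded UTF-8 for Polish diacritics is an assumption that should be checked against the service itself.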

Parsers, grammars, treebanks

Semantic resources

  • CDSCorpus, a dataset of 10k pairs of Polish sentences manually annotated for semantic relatedness and entailment (A. Wróblewska),

  • Probing datasets, Polish and English probing datasets for linguistic verification of sentence embeddings (A. Wróblewska).

Sentiment analysis, opinion mining

Coreference

Speech analysis and synthesis tools

Machine translation demonstrations

Summarizers

Diacritization

Named entity recognition

  • Nerf, a tool for named entity recognition, available under GNU GPL v.3 (J. Waszczuk),

  • Liner2, a named entity recognizer released under the GNU GPL, with models that recognize 5 and 56 categories of proper names (M. Marcińczuk and M. Janicki),

  • TIMEX, a model for Liner2 to recognize and normalize temporal expressions (J. Kocoń and M. Marcińczuk).

Multiword expression software

  • TermoPL, multiword expression extraction tool,

  • VMWE identifiers, systems that participated in the PARSEME shared task on automatic identification of verbal MWEs; 13 out of 17 systems submitted results for Polish,

  • PARSEME-FR demonstrator, including the ATILF-LLF multiword expression identifier for Polish.

Aggregating services

Other

  • Mobile plWordNet, free mobile application for plWordNet browsing (J. Kocoń),

  • WSDDE, a system for designing and performing Word Sense Disambiguation experiments (R. Młodzki et al.),

  • Frazeo, a search engine and clusterer of news in Polish (P. Pęzik),

  • Segment, a rule-based sentence tokenizer supporting the SRX standard (J. Lipski; the Polish rules are available in the LanguageTool project; see here for short instructions on how to use the tool),

  • Toki, a tokenizer supporting the SRX standard, C++ library and toolkit (T. Śniatowski and A. Radziszewski),

  • Translatica SRX sentence segmentation rules for Polish (LGPL),

  • SyMGIZA++, an extension of Giza++ that computes symmetric word alignment models,

  • Hipisek, an experimental question answering system (M. Walas),

  • Narzędzia dygitalizacji tekstów (text digitization tools), Poliqarp for DjVu and other programs,

  • Fextor, a feature extraction framework,

  • LexCSD, a system for semi-automatic sense disambiguation,

  • SuperMatrix, a general tool for lexical semantic knowledge acquisition,

  • WordnetLoom, a wordnet editor application,

  • Toposław, a tool for the creation of electronic inflectional dictionaries of multi-word units,

  • CorpCor, a web-based tool for correcting morphosyntactic annotation in TEI XML encoded corpora (e.g. NKJP),

  • Stylo 2, stylometry demo,

  • DeepEvents, event extraction in Polish, based on deep neural networks,

  • Word similarity, calculation of the similarity of words based on word embeddings, on-line service,

  • Baltoslav, with several script converters (Romanizer, Cyrillizer, IPA Converter etc.),

  • SpacyPL, Polish language models and resources for Spacy,

  • Jasnopis, analyzer of text obscurity level,

  • AIDe, corpus of image descriptions in Polish (A. Wróblewska).