Language Tools and Resources for Polish
This page contains a list of publicly available language tools and resources.
Written corpora of contemporary Polish
National Corpus of Polish (NKJP),
National Photocorpus of Polish (NFJP),
Now available also as corpora in the Poliqarp for DjVu search engine,
gpwEcono, a corpus of stock market reports, with manual word sense annotation,
plWikiEcono, a corpus of Polish Wikipedia articles from the domain of economy,
ArgDB-pl, a Polish corpus of arguments in natural contexts,
PIEWiC, a Polish corpus of errors automatically extracted from Wikipedia revisions,
Polish PARSEME corpus, annotated manually for verbal multiword expressions in 18 languages including Polish, used in the PARSEME shared task 1.0; the Polish subcorpus is aligned with automatic dependency annotations in the UD format (A. Savary),
PELCRA Learner English Corpus (PLEC).
Written corpora of historical Polish
eFontes Mediae et Infimae Latinitatis Polonorum (1000–1550, IJP PAN)
Corpus of old Polish (up to 1500) (IJP PAN)
15th-century New Testament translations (IJP PAN)
IMPACT project corpus (1570–1756, KLF UW)
Corpus of 16th-century Polish (IBL PAN)
PolDiLemma, the Middle Polish Diachrone Lemmatised Corpus (16–18th c., R. Meyer)
KORBA, electronic corpus of 17th and 18th century Polish texts (1601–1772, IJP PAN)
Corpus of 19th-century Polish (1830–1918, IJP UW)
ChronoPress, corpus of press texts from 1945–1954 (A. Pawłowski),
Polish language of the 1960s / Frequency corpus (I. Kurcz, A. Lewicki, J. Sambor, J. Woronczak, K. Szafran, J. S. Bień, M. Woliński)
Corpus-related tools and resources
Anotatornia, a system for multi-level manual annotation of corpora,
Inforex, a web-based system designed for managing and annotating text corpora on the semantic level,
Smyrna, a simple, light-weight Polish concordancer,
korpusy.net, a corpus research-related website (B. Gałkowski),
Korpusomat, a tool for creating your own searchable corpora (Ł. Kobyliński, W. Kieraś).
The PELCRA conversational corpus of Polish: approx. 2.2 million words of casual conversational spoken Polish collected and processed in the years 2001-2015 in a number of research projects, including PELCRA, NKJP, CESAR and CLARIN-PL. Available under CC-BY-NC. All transcriptions can be accessed through the Spokes web interface and programmatically through a REST API.
AGH speech corpus, around 9 hours, word-annotated Polish speech corpus (AGH DSP),
Audiovideo corpus of Polish speech (AGH DSP),
Acoustic database for Polish unit selection speech synthesis (ELRA resources),
Acoustic database for Polish concatenative speech synthesis (ELRA resources),
Corpus of emotions in speech (AGH DSP).
N-gram model of Polish (AGH DSP),
Distributional semantic models trained on orthographical, lemmatized word forms (A. Wawer),
Even more distributional semantic models based on NKJP (A. Mykowiecka, M. Marciniak, P. Rychlik).
Parallel corpora and translation memories
ParaSol, a parallel corpus of Slavic and other languages,
PolUKR, a Polish-Ukrainian parallel corpus,
OPUS, an open source parallel corpus (European Parliament, EMEA, KDE, movie subtitles),
InterCorp, a multilingual parallel corpus,
MyMemory, freely available multilingual TM,
TAUS Data, a multilingual TM from the members of TAUS Data Association,
Glosbe, an open source TM.
plWordNet, Polish WordNet, Słowosieć (M. Piasecki),
POLNET, another Polish Wordnet (Z. Vetulani),
Polish OpenThesaurus, słownik synonimów – a crowdsourced Polish thesaurus (M. Miłkowski),
Słownik języka polskiego (d. alternatywny), Polish ispell dictionaries, along with some definitions and online form display,
Nowy słownik angielsko-polski (T. Piotrowski, Z. Saloni),
Polish OpenCYC (A. Pohl),
Polish machine-generated dictionaries, available on Creative Commons (J. Kazojć),
Gazetteer for Polish Named Entities (A. Savary, M. Lenart, J. Piskorski),
Triggers for Polish Named Entities (M. Baron, L. Manicki, A. Savary),
NELexicon, a lexicon of more than 1.4 million proper names (M. Marcińczuk, A. Musiał, M. Janicki),
Syntactic-generative dictionary of Polish verbs (K. Polański),
SEJF, the Grammatical Lexicon of Polish Phraseology (M. Czerepowicka, A. Savary),
SEJFEK, the Grammatical Lexicon of Polish Economical Phraseology (F. Makowiecki, A. Savary),
WikiTopoPl, a multilingual lexicon of 155,000 Polish geographical proper names extracted from Wikipedia and their equivalents in Bulgarian, Croatian, English, German, modern Greek, Hungarian, Romanian, Serbian and Slovak (L. Manicki),
Prolexbase 2.0, a multilingual relational dictionary of proper names in Polish, English and French (M. Baron, B. Bouchou-Markhoff, L. Manicki, D. Maurel, A. Savary, M. Tran),
DeepER Entity Library, a database containing around 900,000 entities, each described by its textual representations in Polish (names) and WordNet synsets,
Word embeddings for Polish - Wikipedia based (M. Rogalski, P. Szczepaniak).
Słownik ukraińsko-polski pod redakcją Janusza A. Riegera. Materiały do słownika: Litera „O”.
Morphological tools and resources
PoliMorf, an inflectional dictionary of Polish,
Morfeusz SGJP, morphological analyser (Z. Saloni, W. Gruszczyński, M. Woliński, R. Wołosz),
Index a tergo of Polish word forms (J. Tokarski, Z. Saloni),
Morfologik, morphological analyser (M. Miłkowski, D. Weiss),
SAM, morphological analyser (K. Szafran),
MULTEXT-East, v.4, morphosyntactic specifications and documentation for 16 languages,
Sample morphosyntactic Polish lexicon, the MULTEXT-East morphosyntactic lexicons,
MACA, Morphological Analysis Converter and Aggregator (A. Radziszewski, T. Śniatowski),
Lexical analyser and a Polish proof-reader (S. Galus),
Neurosoft Gram (demo of a morphological analyser),
Finite state utilities (J. Daciuk),
Stempel, another stemmer (A. Białecki),
LemmaPL, a lemmatization tool for Polish.
TaKIPI, a morphosyntactic tagger for Polish (Decision Trees),
PANTERA, a morphosyntactic tagger for Polish (Transformation-Based Learning),
WMBT, a morphosyntactic tagger for Polish (Memory-Based Learning),
TaCo, a statistical morphosyntactic tagset converter for positional tagsets (e.g. Polish),
WCRFT, a morphosyntactic tagger for Polish (Conditional Random Fields),
Concraft, a morphosyntactic disambiguation tool for Polish (Constrained Conditional Random Fields),
PoliTa, a morphosyntactic meta-tagger,
Parsers, grammars, treebanks
PDBparser, a Polish dependency parser (A. Wróblewska),
Składnica, a hybrid constituency/dependency treebank of Polish,
Searchable Składnica (M. Woliński),
- Świgra, a DCG parser,
- Spejd, a shallow parsing and disambiguation system,
the current version of the system,
Dendrarium, a treebank development system,
Analizator syntaktyczny AS (M. Woliński),
Formalny opis składniowy zdań polskich (S. Szpakowicz),
Iobber, a CRF chunker for Polish,
Krzaki (bushes), a 20k-sentence corpus of Polish manually annotated for dependency structure,
ENIAM (W. Jaworski).
Polish sentiment dictionary, with sentiment scores computed using supervised methods (A. Wawer),
Polish Linguistic Category Model following a typology of verb categorization in terms of their abstractness, also a tool to measure language abstraction (A. Wawer),
Sincerity Corpus (Korpus Szczerości), a collection of fake and real reviews,
Polish Coreference Corpus, a 500 M corpus of general nominal coreference in Polish (M. Ogrodniczuk),
Speech analysis and synthesis tools
Skrybot, commercial speech recognition system (L. Pawlaczyk, P. Bosky),
Ivona, commercial text-to-speech system (Expressivo),
Techmo TTS demo (Techmo),
Vocalizer, commercial text-to-speech system (Nuance),
Acapela, text to speech demo,
System syntezy mowy ciągłej (G. Demenko, S. Grocholewski),
Polish MBROLA database (K. Szklanny, K. Marasek),
SynTalk, commercial speech synthesis system (NeuroSoft),
PrimeSpeech, commercial speech recognition systems,
OrtFon, phonetic transcriber (AGH DSP),
Sarmata, automatic speech recognition system for Polish (AGH DSP),
System rozpoznawania mówcy (AGH DSP).
Machine translation demonstrations
iTranslate4.eu (multiple languages, allows comparing translation engines),
Bing Translator (multilingual),
Google Translate (multilingual),
Systran (EN-PL, PL-FR and some more),
Esperantilo (integrated Esperanto editor, with MT for EO-PL-DE-EN-SV),
Thetos (PL-Sign language).
Lakon, a system for news summarization (A. Dudczak),
PolSum (S. Kulików),
Summar (Ł. Pawluczuk),
Summarizer (J. Świetlicka),
You can also test Lakon, Open Text Summarizer and Summarizer in Multiservice and take a look at the Polish Summaries Corpus.
Poliszynel (P. Sawicki),
spolszcz.pl (P. Sawicki),
fsa_accent (J. Daciuk),
pliterki (W. Muła).
Named Entity Recognition
Nerf, a tool for named entity recognition, available on GNU GPL v.3 (J. Waszczuk),
Online demos of tools for processing Polish texts (CLARIN-PL),
PSI-Toolkit, a chain of publicly available tools for automatic processing of Polish.
Mobile plWordNet, free mobile application for plWordNet browsing (J. Kocoń),
Frazeo, a search engine and clusterer of news in Polish (P. Pęzik),
SyMGIZA++, an extension of Giza++ that computes symmetric word alignment models,
Hipisek, an experimental question answering system (M. Walas),
Narzędzia dygitalizacji tekstów (text digitisation tools), Poliqarp for DjVu and other programs,
Fextor, a feature extraction framework,
LexCSD, a system for semi-automatic sense disambiguation,
SuperMatrix, a general tool for lexical semantic knowledge acquisition,
WordnetLoom, a wordnet editor application,
Toposław, a tool for creating electronic inflectional dictionaries of multi-word units,
Stylo 2, stylometry demo,
TermoPL, a multiword expression extraction tool,
DeepEvents, event extraction in Polish, based on deep neural networks,
Word similarity, calculation of the similarity of words based on word embeddings, on-line service,
Baltoslav, with several script converters (Romanizer, Cyrillizer, IPA Converter etc.)