Polish Wikipedia Corpus
The full textual content of Polish Wikipedia as of 03.03.2013, annotated using NLP tools. The corpus was originally created as a knowledge base for the Polish question answering system RAFAEL [1,2].
Download plain text: Wikipedia_PL.tar.gz
Download annotations: part1.tar part2.tar part3.tar part4.tar part5.tar part6.tar
Wikipedia's text content is released under the Creative Commons Attribution-Share-Alike License 3.0.
Downloaded, processed and made available by Piotr Przybyła.
Plain text
The main directory contains subdirectories named AA to CF. Each of them contains subdirectories 00 to 99, and each of these holds approximately 200 kB of text, stored as one Wikipedia article per file.
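The layout described above can be traversed with a few lines of Python. This is a minimal sketch: the function name iter_article_files and the corpus_root argument are illustrative, and the code assumes only the two directory levels named above, without relying on any particular file naming scheme.

```python
from pathlib import Path
from typing import Iterator


def iter_article_files(corpus_root: str) -> Iterator[Path]:
    """Yield every article file below corpus_root in sorted order.

    Assumes the layout <root>/<AA..CF>/<00..99>/<article file>.
    """
    root = Path(corpus_root)
    # First level: AA, AB, ..., CF
    for letter_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # Second level: 00, 01, ..., 99
        for number_dir in sorted(p for p in letter_dir.iterdir() if p.is_dir()):
            # Third level: one file per Wikipedia article
            for article in sorted(p for p in number_dir.iterdir() if p.is_file()):
                yield article
```

Because the function is a generator, the articles can be processed one at a time without holding the whole corpus in memory.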
Each article file has the following format:
<doc id="2854972" url="http://pl.wikipedia.org/wiki/?curid=2854972" title="Kamionka (powiat krośnieński)">
Kamionka (powiat krośnieński)
Kamionka – wieś w Polsce położona w województwie podkarpackim, w powiecie krośnieńskim, w gminie Dukla. W latach 1975-1998 miejscowość należała administracyjnie do województwa krośnieńskiego.
</doc>
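An article in this format can be read with a regular expression rather than an XML parser, since the body is plain text and is not guaranteed to be well-formed XML. The sketch below is one possible approach; the names DOC_RE and parse_article are illustrative and not part of the corpus tooling.

```python
import re
from typing import Dict

# Matches the <doc ...> header with its three attributes and captures
# everything up to the closing </doc> as the article body.
DOC_RE = re.compile(
    r'<doc id="(?P<id>\d+)" url="(?P<url>[^"]*)" title="(?P<title>.*?)">'
    r"\s*(?P<body>.*?)\s*</doc>",
    re.DOTALL,
)


def parse_article(raw: str) -> Dict[str, str]:
    """Return the id, url, title and body of a single article."""
    match = DOC_RE.search(raw)
    if match is None:
        raise ValueError("no <doc> element found")
    return match.groupdict()
```

When reading from disk, open the file with encoding="utf-8", as the corpus text is UTF-8 encoded. Note that the body starts with a repetition of the article title, as in the example above.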
The corpus contains 895,486 articles with 169 million words, 1.06 GB of text in total:
- Only ordinary articles are present: no stubs, templates, disambiguation pages, change histories, etc.
- All the multimedia, tables, references, links, and other non-plaintext elements have been removed.
- Text is encoded as UTF-8.
The corpus was created by applying the Wikipedia Extractor 2.2 script to a Wikipedia dump and then splitting the resulting 200 kB files into individual articles using custom code.
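The original splitting code is not distributed with the corpus. As an illustration only, the following sketch shows how a file containing several consecutive <doc> elements could be divided into individual articles; split_articles is a hypothetical helper, not the author's implementation, and it assumes the string "<doc" does not occur inside article text.

```python
import re
from typing import List

# A non-greedy match from each opening <doc ...> to the nearest </doc>.
ARTICLE_RE = re.compile(r"<doc\b.*?</doc>", re.DOTALL)


def split_articles(extractor_output: str) -> List[str]:
    """Split multi-article text into one string per <doc> element."""
    return ARTICLE_RE.findall(extractor_output)
```

Each returned string could then be written to its own file, which yields the one-article-per-file layout described earlier.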
Annotations
The annotation archives have the same structure as the plain text archive, but each of the 00-99 folders contains a single archive named 'annotationsMSNL.tar.gz'. Its subdirectories, each corresponding to one plain text file, contain the following elements:
- Text structure, dividing the document into paragraphs at newline characters, produced by a custom Python script (text_structure.xml),
- Segmentation created using Morfeusz Polimorf 0.82 [3] (ann_segmentation.xml),
- Morphosyntactic analysis obtained from Morfeusz Polimorf and disambiguated using PANTERA 0.9.1 [4], including the guesser Odgadywacz from the tagger TaKIPI 1.8 [5] (ann_morphosyntax.xml),
- Syntactic words and groups generated with the shallow parser Spejd 1.3.7 [6], using an improved version of the lemmatization grammar [7], based on the NKJP grammar from 14.02.2012 [8] (ann_words.xml and ann_groups.xml),
- Named entities recognized by Nerf 0.1 [9] (ann_named.xml),
- Named entities recognized by Liner2 2.3 [10], converted from CCL to NKJP format using a custom Python script (ann_named_liner.xml).
All annotations are stored in a variant of the XML-based TEI P5 format created for the National Corpus of Polish [8].
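To check which annotation layers are available for each article, the inner archive can be inspected without unpacking it. The sketch below uses Python's standard tarfile module; list_annotation_layers is an illustrative name, and the code assumes that each annotation file sits directly inside the subdirectory of its article.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath
from typing import Dict, List


def list_annotation_layers(archive_path: str) -> Dict[str, List[str]]:
    """Map each article subdirectory to the annotation files it contains."""
    layers = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():
                path = PurePosixPath(member.name)
                # The immediate parent directory identifies the article.
                layers[path.parent.name].append(path.name)
    return {doc: sorted(files) for doc, files in layers.items()}
```

A single layer can afterwards be read with archive.extractfile(member) and passed to an XML parser of choice.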
References
[1] Przybyła, P. (2015). Gathering Knowledge for Question Answering Beyond Named Entities. Proceedings of the 20th International Conference on Application of Natural Language to Information Systems (NLDB 2015).
[2] Przybyła, P. (2014). Odpowiadanie na pytania w języku polskim z użyciem głębokiego rozpoznawania nazw. Doctoral thesis, Institute of Computer Science, Polish Academy of Sciences.
[3] Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., and Szałkiewicz, L. (2012). PoliMorf: a (not so) new open morphological dictionary for Polish. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).
[4] Acedański, S. (2010). A morphosyntactic Brill Tagger for inflectional languages. Proceedings of the 7th international conference on Advances in Natural Language Processing (IceTAL’10).
[5] Piasecki, M. (2007). Polish Tagger TaKIPI: Rule Based Construction and Optimisation. Task Quarterly, 11(1-2):151–167.
[6] Przepiórkowski, A. (2008). Powierzchniowe przetwarzanie języka polskiego. Akademicka Oficyna Wydawnicza EXIT.
[7] Degórski, Ł. (2012). Towards the Lemmatisation of Polish Nominal Syntactic Groups Using a Shallow Grammar. Proceedings of the International Joint Conference on Security and Intelligent Information Systems.
[8] Przepiórkowski, A., Bańko, M., Górski, R. L., and Lewandowska-Tomaszczyk, B. (2012). Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN.
[9] Savary, A. and Waszczuk, J. (2010). Towards the Annotation of Named Entities in the National Corpus of Polish. Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010).
[10] Marcińczuk, M. and Janicki, M. (2012). Optimizing CRF-based Model for Proper Name Recognition in Polish Texts. Proceedings of CICLing 2012.