Polish Wikipedia Corpus

Full textual content of Polish Wikipedia as of 3 March 2013, annotated using NLP tools. The corpus was originally created as a knowledge base for the Polish question answering system RAFAEL [1,2].

Download plain text: Wikipedia_PL.tar.gz

Download annotations: part1.tar part2.tar part3.tar part4.tar part5.tar part6.tar

Wikipedia's text content is released under the Creative Commons Attribution-Share-Alike License 3.0.

Downloaded, processed and made available by Piotr Przybyła.

Plain text

The main directory contains subdirectories named AA to CF. Each of them contains subdirectories 00 to 99, and each of those holds approximately 200 kB of text, with one Wikipedia article per file.

Each article file has the following format:

<doc id="2854972" url="http://pl.wikipedia.org/wiki/?curid=2854972" title="Kamionka (powiat krośnieński)">
Kamionka (powiat krośnieński)

Kamionka – wieś w Polsce położona w województwie podkarpackim, w powiecie krośnieńskim, w gminie Dukla.
W latach 1975-1998 miejscowość należała administracyjnie do województwa krośnieńskiego.
</doc>
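A minimal sketch of reading one article file in the format shown above. It assumes the header attributes always appear in the order id, url, title, as in the sample; the names `Article` and `parse_article` are hypothetical and not part of the corpus tooling.

```python
import re
from dataclasses import dataclass

# Header pattern modelled on the sample above; attribute order is assumed fixed.
_HEADER = re.compile(
    r'<doc id="(?P<id>\d+)" url="(?P<url>[^"]*)" title="(?P<title>.*)">\s*$'
)


@dataclass
class Article:
    doc_id: int
    url: str
    title: str
    text: str  # everything between the header line and </doc>


def parse_article(raw: str) -> Article:
    """Parse the contents of a single <doc ...> ... </doc> article file."""
    lines = raw.strip().splitlines()
    if len(lines) < 2:
        raise ValueError("article file is too short")
    match = _HEADER.match(lines[0])
    if match is None or lines[-1].strip() != "</doc>":
        raise ValueError("not a <doc> ... </doc> article file")
    body = "\n".join(lines[1:-1]).strip()
    return Article(int(match["id"]), match["url"], match["title"], body)
```

Files should be opened with `encoding="utf-8"` before passing their contents to `parse_article`, since the text contains Polish diacritics. Note that the body starts with the article title, as in the sample.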

The corpus comprises 895,486 articles containing 169 million words, 1.06 GB of text in total.

The corpus was created by applying the Wikipedia Extractor 2.2 script to a Wikipedia dump and splitting the resulting 200 kB files into individual articles with custom code.
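The splitting code itself is not published here. As an illustration only, the sketch below shows one way such a split could be done, assuming every article in an extractor output file starts on a line beginning with `<doc ` and ends on a line containing only `</doc>`. The function name `split_docs` is hypothetical.

```python
from typing import Iterator


def split_docs(extractor_output: str) -> Iterator[str]:
    """Yield each <doc ...> ... </doc> block found in one extractor output file."""
    current = None  # lines of the article being collected, or None between articles
    for line in extractor_output.splitlines(keepends=True):
        if current is None:
            if line.startswith("<doc "):
                current = [line]
        else:
            current.append(line)
            if line.strip() == "</doc>":
                yield "".join(current)
                current = None
```

Each yielded string can then be written to its own file, which gives the one-article-per-file layout described above.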

Annotations

The annotations archive has the same structure as the plain text archive, but each of the 00-99 folders contains a single archive named 'annotationsMSNL.tar.gz'. Its subdirectories, each corresponding to a plain text file, contain the following elements:

All annotations are stored in a variant of the XML-based TEI P5 format, created for the National Corpus of Polish [8].
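Because the annotations are packed in nested archives, it can help to list what one 'annotationsMSNL.tar.gz' contains before extracting it. The sketch below uses only Python's standard `tarfile` module and groups file names by their parent subdirectory. It assumes that the parent directory name identifies the corresponding plain text file; the function name `list_annotation_files` is hypothetical.

```python
import tarfile
from collections import defaultdict
from pathlib import PurePosixPath


def list_annotation_files(archive_path: str) -> dict:
    """Map each subdirectory in the archive to the annotation files it holds."""
    grouped = defaultdict(list)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directory entries and links
            path = PurePosixPath(member.name)
            # Assumption: the immediate parent directory corresponds to one article.
            grouped[path.parent.name].append(path.name)
    return dict(grouped)
```

The outer part1.tar to part6.tar files are plain (uncompressed) tar archives and can be opened the same way with mode `"r:"`.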

References

[1] Przybyła, P. (2015). Gathering Knowledge for Question Answering Beyond Named Entities. Proceedings of the 20th International Conference on Application of Natural Language to Information Systems (NLDB 2015).

[2] Przybyła, P. (2014). Odpowiadanie na pytania w języku polskim z użyciem głębokiego rozpoznawania nazw. Doctoral thesis, Institute of Computer Science, Polish Academy of Sciences.

[3] Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., and Szałkiewicz, L. (2012). PoliMorf: a (not so) new open morphological dictionary for Polish. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012).

[4] Acedański, S. (2010). A morphosyntactic Brill Tagger for inflectional languages. Proceedings of the 7th international conference on Advances in Natural Language Processing (IceTAL’10).

[5] Piasecki, M. (2007). Polish Tagger TaKIPI: Rule Based Construction and Optimisation. Task Quarterly, 11(1-2):151–167.

[6] Przepiórkowski, A. (2008). Powierzchniowe przetwarzanie języka polskiego. Akademicka Oficyna Wydawnicza EXIT.

[7] Degórski, Ł. (2012). Towards the Lemmatisation of Polish Nominal Syntactic Groups Using a Shallow Grammar. Proceedings of the International Joint Conference on Security and Intelligent Information Systems.

[8] Przepiórkowski, A., Bańko, M., Górski, R. L., and Lewandowska-Tomaszczyk, B. (2012). Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN.

[9] Savary, A. and Waszczuk, J. (2010). Towards the Annotation of Named Entities in the National Corpus of Polish. Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010).

[10] Marcińczuk, M. and Janicki, M. (2012). Optimizing CRF-based Model for Proper Name Recognition in Polish Texts. Proceedings of CICLing 2012.