<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article  PUBLIC '-//OASIS//DTD DocBook XML V4.4//EN'  'http://www.docbook.org/xml/4.4/docbookx.dtd'><article><articleinfo><title>PolishWikipediaCorpus</title><revhistory><revision><revnumber>12</revnumber><date>2015-04-13 11:31:55</date><authorinitials>MateuszKopec</authorinitials></revision><revision><revnumber>11</revnumber><date>2015-04-13 11:30:53</date><authorinitials>MateuszKopec</authorinitials></revision><revision><revnumber>10</revnumber><date>2015-04-13 11:30:28</date><authorinitials>MateuszKopec</authorinitials></revision><revision><revnumber>9</revnumber><date>2012-10-02 17:20:22</date><authorinitials>AdamPrzepiorkowski</authorinitials></revision><revision><revnumber>8</revnumber><date>2012-10-02 15:44:41</date><authorinitials>AdamPrzepiorkowski</authorinitials></revision><revision><revnumber>7</revnumber><date>2012-10-02 15:44:10</date><authorinitials>AdamPrzepiorkowski</authorinitials></revision><revision><revnumber>6</revnumber><date>2012-10-02 15:43:51</date><authorinitials>AdamPrzepiorkowski</authorinitials></revision><revision><revnumber>5</revnumber><date>2012-10-02 14:47:24</date><authorinitials>MichalLenart</authorinitials><revremark>Renamed from 'Wikipedia'.</revremark></revision><revision><revnumber>4</revnumber><date>2012-10-02 14:46:31</date><authorinitials>MichalLenart</authorinitials></revision><revision><revnumber>3</revnumber><date>2012-10-02 14:40:54</date><authorinitials>MichalLenart</authorinitials></revision><revision><revnumber>2</revnumber><date>2012-10-02 14:40:44</date><authorinitials>MichalLenart</authorinitials></revision><revision><revnumber>1</revnumber><date>2012-10-02 14:39:45</date><authorinitials>MichalLenart</authorinitials></revision></revhistory></articleinfo><section><title>Polish Wikipedia Corpus</title><para>Full textual content of <ulink url="http://pl.wikipedia.org/">Polish Wikipedia</ulink> on 03.03.2013, annotated using NLP tools. 
The corpus was originally created as a knowledge base for the Polish question answering system RAFAEL [1,2]. </para><para><emphasis role="strong">Download plain text</emphasis>: <ulink url="http://clip.ipipan.waw.pl/wiki_static/large_data/clip/PolishWikipediaCorpus/Wikipedia_PL.tar.gz">Wikipedia_PL.tar.gz</ulink> </para><para><emphasis role="strong">Download annotations</emphasis>: <ulink url="http://clip.ipipan.waw.pl/wiki_static/large_data/clip/PolishWikipediaCorpus/Wikipedia_PL_annotated/part1.tar">part1.tar</ulink> <ulink url="http://clip.ipipan.waw.pl/wiki_static/large_data/clip/PolishWikipediaCorpus/Wikipedia_PL_annotated/part2.tar">part2.tar</ulink> <ulink url="http://clip.ipipan.waw.pl/wiki_static/large_data/clip/PolishWikipediaCorpus/Wikipedia_PL_annotated/part3.tar">part3.tar</ulink> <ulink url="http://clip.ipipan.waw.pl/wiki_static/large_data/clip/PolishWikipediaCorpus/Wikipedia_PL_annotated/part4.tar">part4.tar</ulink> <ulink url="http://clip.ipipan.waw.pl/wiki_static/large_data/clip/PolishWikipediaCorpus/Wikipedia_PL_annotated/part5.tar">part5.tar</ulink> <ulink url="http://clip.ipipan.waw.pl/wiki_static/large_data/clip/PolishWikipediaCorpus/Wikipedia_PL_annotated/part6.tar">part6.tar</ulink> </para><para>Wikipedia's text content is released under the <ulink url="http://creativecommons.org/licenses/by-sa/3.0/">Creative Commons Attribution-Share-Alike License 3.0</ulink>. </para><para>Downloaded, processed and made available by Piotr Przybyła. </para><section><title>Plain text</title><para>The main directory contains subdirectories named AA to CF. Each of them contains subdirectories 00 to 99, and each of those holds approximately 200 kB of text, with one Wikipedia article per file. </para><para>Each article file has the following format: </para><screen><![CDATA[<doc id="2854972" url="http://pl.wikipedia.org/wiki/?curid=2854972" title="Kamionka (powiat krośnieński)">
Kamionka (powiat krośnieński)
]]><![CDATA[
Kamionka – wieś w Polsce położona w województwie podkarpackim, w powiecie krośnieńskim, w gminie Dukla.
W latach 1975-1998 miejscowość należała administracyjnie do województwa krośnieńskiego.
</doc>]]></screen><para>The corpus contains 895,486 articles with 169 million words, 1.06 GB of text in total: </para><itemizedlist><listitem><para>Only ordinary articles are present: no stubs, templates, disambiguation pages, revision histories, etc. </para></listitem><listitem><para>All multimedia, tables, references, links, and other non-plaintext elements have been removed. </para></listitem><listitem><para>Text is encoded as UTF-8. </para></listitem></itemizedlist><para>The corpus was created by applying the <ulink url="http://medialab.di.unipi.it/wiki/Wikipedia_Extractor">Wikipedia Extractor 2.2 script</ulink> to a <ulink url="http://dumps.wikimedia.org/backup-index.html">Wikipedia dump</ulink> and splitting the resulting 200 kB files into individual articles with custom code. </para></section><section><title>Annotations</title><para>The annotations archive has the same structure as the plain-text archive, but each of the 00-99 directories contains a single archive named 'annotationsMSNL.tar.gz'. Its subdirectories, each corresponding to one plain-text file, contain the following elements: </para><itemizedlist><listitem><para>Text structure, dividing the document into paragraphs at newline characters, produced by a custom Python script (text_structure.xml), </para></listitem><listitem><para>Segmentation created using Morfeusz Polimorf 0.82 [3] (ann_segmentation.xml), </para></listitem><listitem><para>Morphosyntactic analysis obtained from Morfeusz Polimorf and disambiguated using PANTERA 0.9.1 [4], including the guesser Odgadywacz from the tagger TaKIPI 1.8 [5] (ann_morphosyntax.xml), </para></listitem><listitem><para>Syntactic words and groups generated with the shallow parser Spejd 1.3.7 [6], using an improved version of the lemmatization grammar [7], based on the NKJP grammar from 14.02.2012 [8] (ann_words.xml and ann_groups.xml), </para></listitem><listitem><para>Named entities recognized by Nerf 0.1 [9] (ann_named.xml), </para></listitem><listitem><para>Named entities recognized by Liner2 2.3 [10] converted 
from CCL to NKJP format using a custom Python script (ann_named_liner.xml). </para></listitem></itemizedlist><para>All annotations are stored in a variant of the XML-based TEI P5 format, created for the National Corpus of Polish [8]. </para></section><section><title>References</title><para>[1] Przybyła, P. (2015). Gathering Knowledge for Question Answering Beyond Named Entities. Proceedings of the 20th International Conference on Application of Natural Language to Information Systems (NLDB 2015). </para><para>[2] Przybyła, P. (2014). Odpowiadanie na pytania w języku polskim z użyciem głębokiego rozpoznawania nazw. Doctoral thesis, Institute of Computer Science, Polish Academy of Sciences. </para><para>[3] Woliński, M., Miłkowski, M., Ogrodniczuk, M., Przepiórkowski, A., and Szałkiewicz, L. (2012). PoliMorf: a (not so) new open morphological dictionary for Polish. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC 2012). </para><para>[4] Acedański, S. (2010). A morphosyntactic Brill Tagger for inflectional languages. Proceedings of the 7th International Conference on Advances in Natural Language Processing (IceTAL’10). </para><para>[5] Piasecki, M. (2007). Polish Tagger TaKIPI: Rule Based Construction and Optimisation. Task Quarterly, 11(1-2):151–167. </para><para>[6] Przepiórkowski, A. (2008). Powierzchniowe przetwarzanie języka polskiego. Akademicka Oficyna Wydawnicza EXIT. </para><para>[7] Degórski, Ł. (2012). Towards the Lemmatisation of Polish Nominal Syntactic Groups Using a Shallow Grammar. Proceedings of the International Joint Conference on Security and Intelligent Information Systems. </para><para>[8] Przepiórkowski, A., Bańko, M., Górski, R. L., and Lewandowska-Tomaszczyk, B. (2012). Narodowy Korpus Języka Polskiego. Wydawnictwo Naukowe PWN. </para><para>[9] Savary, A. and Waszczuk, J. (2010). Towards the Annotation of Named Entities in the National Corpus of Polish. 
Proceedings of the 7th International Conference on Language Resources and Evaluation (LREC 2010). </para><para>[10] Marcińczuk, M. and Janicki, M. (2012). Optimizing CRF-based Model for Proper Name Recognition in Polish Texts. Proceedings of CICLing 2012. </para></section></section></article>