
Diff for "PolishWikipediaCorpus"

Differences between revisions 5 and 6
Revision 5 as of 2012-10-02 14:47:24
Size: 1035
Editor: MichalLenart
Comment:
Revision 6 as of 2012-10-02 15:43:51
Size: 1099
Comment:
Line 23: Line 23:

Downloaded, processed and made available by Piotr Przybyła.

Polish Wikipedia Corpus

Full textual content of Polish Wikipedia (http://pl.wikipedia.org/) as of 28.04.2012.

Download: Wikipedia_PL.tar.gz

The main directory contains subdirectories named AA to DQ. Each of those contains subdirectories 00 to 99, and each such subdirectory holds approximately 100 kB of text, stored as one Wikipedia article per file.

Article files start with a title, followed by a blank line.

  • Only ordinary articles are present: no stubs, templates, disambiguation pages, revision histories, etc.
  • All the multimedia, tables, references, links, and other non-plaintext elements have been removed.
  • Text is encoded as UTF-8.
  • The corpus contains 839 269 articles with 127 million words, 918 MB of text in total.
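
The layout described above (two levels of subdirectories, one UTF-8 article per file, title on the first line followed by a blank line) can be read with a short script. This is a minimal sketch, not code shipped with the corpus: the function names `read_article` and `iter_articles` are hypothetical, and it assumes nothing about the article file names beyond their position two directory levels below the unpacked root.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_article(path: Path) -> Tuple[str, str]:
    """Return (title, body) for one article file."""
    text = path.read_text(encoding="utf-8")
    # The first line is the title; a blank line separates it from the body.
    title, _, body = text.partition("\n")
    return title.strip(), body.lstrip("\n")


def iter_articles(root: Path) -> Iterator[Tuple[str, str]]:
    """Walk root/<AA..DQ>/<00..99>/<file> and yield (title, body) pairs."""
    # File names are an assumption-free wildcard: any regular file
    # two directory levels below the root is treated as an article.
    for path in sorted(root.glob("*/*/*")):
        if path.is_file():
            yield read_article(path)
```

For example, `sum(1 for _ in iter_articles(Path("Wikipedia_PL")))` would count the articles, assuming the archive was unpacked into a directory called `Wikipedia_PL`.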

The corpus was created by applying the WikiExtractor script (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to a Wikipedia dump (http://dumps.wikimedia.org/backup-index.html) and then dividing the resulting 100 kB files into individual articles using custom code.
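
The splitting step might look roughly like the sketch below. The author's own code is not shown here, so this is only an illustration under stated assumptions: each extractor output file is assumed to wrap every article in a `<doc ...>` line and a closing `</doc>` line, and the names `split_articles` and `write_articles`, as well as the numbered `.txt` output names, are hypothetical.

```python
import re
from pathlib import Path
from typing import List

# Assumed wrapper: a "<doc ...>" line, the article text, then "</doc>".
DOC_RE = re.compile(r"<doc\b[^>]*>\n(.*?)\n?</doc>", re.DOTALL)


def split_articles(chunk: str) -> List[str]:
    """Return the text of each <doc>...</doc> block in one output file."""
    return [m.group(1).strip() + "\n" for m in DOC_RE.finditer(chunk)]


def write_articles(chunk_file: Path, out_dir: Path) -> int:
    """Write each article of chunk_file to out_dir/<n>.txt; return the count."""
    out_dir.mkdir(parents=True, exist_ok=True)
    articles = split_articles(chunk_file.read_text(encoding="utf-8"))
    for n, article in enumerate(articles):
        # Output naming is hypothetical; the corpus may use another scheme.
        (out_dir / f"{n}.txt").write_text(article, encoding="utf-8")
    return len(articles)
```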

Downloaded, processed and made available by Piotr Przybyła.