
Diff for "PolishWikipediaCorpus"

Differences between revisions 2 and 3
Revision 2 as of 2012-10-02 14:40:44
Size: 947
Editor: MichalLenart
Comment:
Revision 3 as of 2012-10-02 14:40:54
Size: 945
Editor: MichalLenart
Comment:
Deletions are prefixed with "-" and additions with "+".
Line 11:
-  * Only ordinary articles are present - no stubs, templates,
- disambiguation pages, history of changes etc.
-  * All the multimedia, tables, references, links, and other
- non-plaintext elements have been removed.
+  * Only ordinary articles are present - no stubs, templates, disambiguation pages, history of changes etc.
+  * All the multimedia, tables, references, links, and other non-plaintext elements have been removed.

Polish Wikipedia Corpus

The full textual content of Polish Wikipedia (http://pl.wikipedia.org/) as of 28 April 2012.

The main directory contains subdirectories named AA to DQ. Each of those contains subdirectories named 00 to 99, each holding approximately 100 kB of text, with one Wikipedia article per file.

Article files start with a title, followed by a blank line.

  • Only ordinary articles are present - no stubs, templates, disambiguation pages, history of changes etc.
  • All the multimedia, tables, references, links, and other non-plaintext elements have been removed.
  • Text is encoded as UTF-8.
  • The corpus contains 839 269 articles with 127 million words, 918 MB of text in total.
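
The layout described above (two-letter directories, two-digit subdirectories, one UTF-8 article per file, title first, then a blank line) can be traversed with a short script. The following is a minimal sketch in Python. The function names parse_article and iter_articles and the corpus path are hypothetical and are not part of the corpus distribution.

```python
from pathlib import Path
from typing import Iterator, Tuple


def parse_article(raw: str) -> Tuple[str, str]:
    """Split the raw content of one article file into (title, body).

    Assumes the first line is the title and the body follows a blank line.
    """
    title, _, rest = raw.partition("\n")
    return title.strip(), rest.strip()


def iter_articles(corpus_root: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, body) for every article file under corpus_root.

    Assumes the layout <root>/<AA..DQ>/<00..99>/<article file>.
    """
    root = Path(corpus_root)
    for path in sorted(root.glob("[A-Z][A-Z]/[0-9][0-9]/*")):
        if path.is_file():
            yield parse_article(path.read_text(encoding="utf-8"))


if __name__ == "__main__":
    # "polish-wikipedia-corpus" is a placeholder path, not a name from the source.
    for title, body in iter_articles("polish-wikipedia-corpus"):
        print(title, len(body.split()))
```

Reading each file explicitly as UTF-8 avoids depending on the platform's default encoding, which matters for Polish diacritics.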

The corpus was created by applying the WikiExtractor script (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to a Wikipedia dump (http://dumps.wikimedia.org/backup-index.html) and then splitting the resulting 100 kB files into individual articles using custom code.
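
The splitting code itself is not published on this page. As an illustration only, the sketch below shows one way such a step could look in Python, assuming each article in the WikiExtractor output is wrapped in a <doc id="..." url="..." title="..."> ... </doc> element. The names DOC_RE, split_docs and to_article_file are hypothetical, and the actual extractor output may differ in details (for example, it may repeat the title as the first line of the body).

```python
import re
from typing import Iterator, Tuple

# Assumed wrapper format: <doc id="..." url="..." title="..."> text </doc>
DOC_RE = re.compile(
    r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>\n?(.*?)</doc>',
    re.DOTALL,
)


def split_docs(chunk: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, body) for every <doc> element found in one output file."""
    for match in DOC_RE.finditer(chunk):
        yield match.group(1), match.group(2).strip()


def to_article_file(title: str, body: str) -> str:
    """Render one article as title, blank line, then text."""
    return f"{title}\n\n{body}\n"
```

A caller would read each 100 kB file as UTF-8, pass its content to split_docs, and write the result of to_article_file for each pair to a separate file.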