Diff for "PolishWikipediaCorpus"

Differences between revisions 1 and 2
Revision 1 as of 2012-10-02 14:39:45
Size: 936
Editor: MichalLenart
Comment:
Revision 2 as of 2012-10-02 14:40:44
Size: 947
Editor: MichalLenart
Comment:

Polish Wikipedia Corpus

Full textual content of the Polish Wikipedia (http://pl.wikipedia.org/) as of 28.04.2012.

The main directory contains subdirectories named AA to DQ. Each of these contains subdirectories 00 to 99, each holding approximately 100 kB of text, with one Wikipedia article per file.

Article files start with a title, followed by a blank line.

  • Only ordinary articles are present - no stubs, templates, disambiguation pages, history of changes etc.
  • All the multimedia, tables, references, links, and other non-plaintext elements have been removed.
  • Text is encoded as UTF-8.
  • The corpus contains 839 269 articles and 127 million words, 918 MB of text in total.
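
The layout and file format described above can be read with a few lines of Python. This is only a minimal sketch: the function names `read_article` and `iter_articles` and the `corpus_root` argument are hypothetical, and the glob pattern assumes that every two-letter directory holds two-digit subdirectories with article files directly inside them.

```python
from pathlib import Path
from typing import Iterator, Tuple


def read_article(path: Path) -> Tuple[str, str]:
    """Return (title, body) for one article file."""
    text = path.read_text(encoding="utf-8")
    # The first line is the title; a blank line separates it from the body.
    title, _, body = text.partition("\n")
    return title.strip(), body.lstrip("\n")


def iter_articles(corpus_root: Path) -> Iterator[Tuple[str, str]]:
    """Yield (title, body) for every article file under corpus_root."""
    # Assumed layout: <corpus_root>/AA..DQ/00..99/<article file>
    for path in sorted(corpus_root.glob("[A-Z][A-Z]/[0-9][0-9]/*")):
        if path.is_file():
            yield read_article(path)
```

Calling `iter_articles(Path("corpus"))`, where `corpus` stands for wherever the main directory was unpacked, yields the articles one at a time, so the whole 918 MB never has to be held in memory.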

The corpus was created by applying the WikiExtractor script (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to a Wikipedia dump (http://dumps.wikimedia.org/backup-index.html) and then splitting the resulting 100 kB files into individual articles using custom code.
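
The splitting code itself is not published on this page. The sketch below shows one way such a step could look and is not the code actually used. It assumes the extractor output wraps each article in a `<doc ... title="...">` ... `</doc>` element, as WikiExtractor output usually does; the name `split_extractor_output` is hypothetical.

```python
import re
from typing import Iterator, Tuple

# Assumption: each article is wrapped as <doc id="..." url="..." title="..."> ... </doc>.
DOC_RE = re.compile(r'<doc\b[^>]*\btitle="([^"]*)"[^>]*>\n(.*?)</doc>', re.DOTALL)


def split_extractor_output(chunk: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, text) for each <doc> element found in one output file."""
    for match in DOC_RE.finditer(chunk):
        title, text = match.group(1), match.group(2)
        # Strip the newlines that surround the article text inside the element.
        yield title, text.strip()
```

Each yielded text could then be written to its own UTF-8 file. Any HTML escaping in the `title` attribute is left untouched in this sketch.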