Differences between revisions 1 and 2
Revision 1 as of 2012-10-02 14:39:45
Polish Wikipedia Corpus
Full textual content of the Polish Wikipedia (http://pl.wikipedia.org/) as of 28.04.2012.
The main directory contains subdirectories named AA to DQ. Each of these contains subdirectories 00 to 99, and each such subdirectory holds approximately 100 kB of text, stored as one Wikipedia article per file.
Article files start with a title, followed by a blank line.
- Only ordinary articles are present: no stubs, templates,
disambiguation pages, revision histories, etc.
- All the multimedia, tables, references, links, and other
non-plaintext elements have been removed.
- Text is encoded as UTF-8.
- The 839,269 articles contain 127 million words, 918 MB of text in total.
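The layout described above (two levels of subdirectories, one UTF-8 file per article, title on the first line followed by a blank line) can be read with a short script. The following is only a minimal sketch: the function names `parse_article` and `iter_articles` and the `corpus_root` argument are hypothetical, and it assumes that every regular file two directory levels below the root is an article.

```python
from pathlib import Path
from typing import Iterator, Tuple


def parse_article(raw: str) -> Tuple[str, str]:
    """Split the raw text of one article file into (title, body).

    Assumes the first line is the title and that it is followed by
    a blank line, as described in the corpus notes.
    """
    title, _, rest = raw.partition("\n")
    # Drop the blank separator line(s) between the title and the body.
    return title.strip(), rest.lstrip("\r\n")


def iter_articles(corpus_root: str) -> Iterator[Tuple[str, str]]:
    """Yield (title, body) for every article under corpus_root.

    Assumed layout: <corpus_root>/<AA..DQ>/<00..99>/<article file>.
    """
    root = Path(corpus_root)
    for path in sorted(root.glob("*/*/*")):
        if path.is_file():
            yield parse_article(path.read_text(encoding="utf-8"))
```

A caller might, for example, count articles with `sum(1 for _ in iter_articles("path/to/corpus"))`, where the path is a placeholder for wherever the corpus was unpacked.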
The corpus was created by applying the WikiExtractor script (http://medialab.di.unipi.it/wiki/Wikipedia_Extractor) to a Wikipedia dump (http://dumps.wikimedia.org/backup-index.html) and then splitting the resulting 100 kB files into individual articles with custom code.
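The splitting code itself is not published here, so the following is only an illustrative sketch of how such a step could look, not the method actually used. It assumes that each 100 kB file produced by WikiExtractor wraps every article in a `<doc ...> ... </doc>` element whose text begins with the title line; the name `split_extractor_output` is hypothetical.

```python
import re
from typing import Iterator

# Assumption: each article is wrapped as <doc id=".." url=".." title="..">...</doc>
# and the attributes contain no literal ">" character.
DOC_RE = re.compile(r"<doc\b[^>]*>\n?(.*?)</doc>", re.DOTALL)


def split_extractor_output(chunk: str) -> Iterator[str]:
    """Yield the plain text of each article found in one extractor output file."""
    for match in DOC_RE.finditer(chunk):
        # Trim surrounding whitespace and end each article with a single newline.
        yield match.group(1).strip() + "\n"
```

Each yielded string could then be written to its own UTF-8 file, which would give the one-article-per-file layout described above.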