This archive contains the LMF version of the Polish Named Entity Gazetteer used within the SProUT platform, initially for information extraction from Polish texts and later for the automatic pre-annotation of the National Corpus of Polish (NKJP) at the level of named entities.

Authors: 
Jakub Piskorski
Agata Savary
Michał Lenart

July 11, 2012

List of files:

* polish-ne-gazetteer-LMF-format.pdf - LMF format definition and conversion guidelines

* PNEG-complete-entries.xml - grammatically complete gazetteer entries: 9,060 lemmas and 95,359 word forms (see Section 2.4, p. 8 of the guidelines file)

* PNEG-incomplete-entries.xml - grammatically incomplete gazetteer entries: 35,884 lemmas and 40,612 word forms (see Section 2.4, p. 8 of the guidelines file)

* schema.rng - RELAX NG schema for both XML files

* morphosyntax.cfg - morphosyntax configuration file needed for validation

* validateLMF.py, validateLMF.sh - validation scripts
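
The entry and word-form counts given for the two XML files can be spot-checked with a short script. The following is only a minimal sketch and is not part of the archive: it assumes that the files use the standard LMF element names LexicalEntry and WordForm (the guidelines PDF is the authoritative reference for the actual names) and that each entry carries exactly one lemma. The function name count_lmf_elements and the inline SAMPLE document are hypothetical and serve illustration only.

```python
import io
import xml.etree.ElementTree as ET


def count_lmf_elements(source, entry_tag="LexicalEntry", form_tag="WordForm"):
    """Count entries and word forms in an LMF file, streaming through it.

    The tag names are assumptions based on standard LMF; adjust them
    if the gazetteer files use different element names.
    """
    entries = 0
    forms = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Drop a namespace prefix of the form "{uri}tag", if present.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == form_tag:
            forms += 1
        elif tag == entry_tag:
            entries += 1
            # The entry is complete; release its children to save memory.
            elem.clear()
    return entries, forms


# Tiny made-up document for illustration; not taken from the gazetteer.
SAMPLE = b"""<LexicalResource>
  <Lexicon>
    <LexicalEntry><Lemma/><WordForm/><WordForm/></LexicalEntry>
    <LexicalEntry><Lemma/><WordForm/></LexicalEntry>
  </Lexicon>
</LexicalResource>"""

sample_entries, sample_forms = count_lmf_elements(io.BytesIO(SAMPLE))
```

To run it on the real data, pass a file name instead of the in-memory sample, for example count_lmf_elements("PNEG-complete-entries.xml"), and compare the result with the figures in the file list. A mismatch may simply mean that the assumed element names differ from those in the files.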
