Revision 13 as of 2012-07-11 14:22:17, editor: AgataSavary

Gazetteer for Polish Named Entities

The Gazetteer for Polish Named Entities is a textual resource used within the SProUT platform, initially for information extraction from Polish texts and later for the automatic pre-annotation of the National Corpus of Polish (NKJP) at the level of named entities. Its construction, contents and use are described in:

  • SAVARY, A., PISKORSKI, J. (2011). Language Resources for Named Entity Annotation in the National Corpus of Polish, to appear in Control and Cybernetics.

  • SAVARY, A., PISKORSKI, J. (2010). Lexicons and Grammars for Named Entity Annotation in the National Corpus of Polish, in Proceedings of the 18th International Conference Intelligent Information Systems (IIS'10), Siedlce, Poland.

  • PISKORSKI, J. (2005). Named-Entity Recognition for Polish with SProUT, in LNCS Vol 3490: Proceedings of IMTCI 2004, Warsaw, Poland.

The file contains 153,477 inflected entries of Polish (and some foreign) proper names and named entity components:

  • forenames and surnames,
  • city, country, mountain, region and river names,
  • institution names,
  • relational adjectives and inhabitant names stemming from country names,
  • named entity triggers (months, days, positions, etc.).

The file DOES NOT contain inhabitant names and relational adjectives stemming from the names of Polish settlements. These data are owned by the PWN publisher, were used within the NKJP project under a specific licence, and are subject to copyright.

The data are available under the 2-clause BSD licence.

Available resources

  • [[attachment:gazetteer-nkjp-no-pwn.zip|Text version as used with SProUT for NKJP annotation]]

  • [[attachment:gazetteer-nkjp-no-pwn-lmf.zip|LMF-compliant version]], in the LMF format defined in the [[attachment:gazetteer-nkjp-lmf-format.zip|conversion instructions]]
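
As a quick sanity check after downloading the text version, the archive contents can be listed and the lines counted. The sketch below is only an illustration: this page does not specify the internal file layout, so the one-entry-per-line reading and the UTF-8 encoding are assumptions, and count_entries is a hypothetical helper name, not part of SProUT.

```python
import zipfile


def count_entries(archive_path, encoding="utf-8"):
    """Return {member name: number of non-empty lines} for each file in a zip archive.

    Assumptions (not stated on this page): one gazetteer entry per line,
    UTF-8 encoded text.
    """
    counts = {}
    with zipfile.ZipFile(archive_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip directory entries, count only files
            with archive.open(info) as member:
                # Members are read as bytes; undecodable bytes are replaced
                # so that a wrong encoding guess does not abort the count.
                counts[info.filename] = sum(
                    1
                    for raw in member
                    if raw.decode(encoding, errors="replace").strip()
                )
    return counts


# Example call, assuming the attachment was saved locally under its wiki name:
# print(count_entries("gazetteer-nkjp-no-pwn.zip"))
```

If the one-entry-per-line assumption holds, the total should be close to the 153,477 entries stated above. A large mismatch would suggest that the file uses a different layout, for example header lines or multi-line entries.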