Diff for "Gazetteer"

Differences between revisions 2 and 33 (spanning 31 versions)
Revision 2 as of 2011-09-29 00:26:08
Size: 1553
Comment:
Revision 33 as of 2013-07-26 09:31:02
Size: 2932
Editor: AgataSavary
Comment:
Line 1: Line 1:
= The Polish Gazetteer =
= Gazetteer for Polish Named Entities =
Line 3: Line 3:
The Polish Gazetteer is the textual source used within the SProUT (http://sprout.dfki.de/) platform for the automatic pre-annotation of the National Corpus of Polish (NKJP) on the level of named entities. Its construction, contents and use have been described in:
The Gazetteer for Polish Named Entities was used within the ''[[http://sprout.dfki.de/|SProUT]]'' platform, initially for information extraction from Polish texts, and then for the automatic pre-annotation of the ''[[http://nkjp.pl/index.php?page=0&lang=1 |National Corpus of Polish]]'' (NKJP) on the level of named entities. Its construction, contents and use have been described in:
 
 * SAVARY, A., PISKORSKI, J. (2011). ''Language Resources for Named Entity Annotation in the National Corpus of Polish'', in Control and Cybernetics 40(2), Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland, pp. 361-391.
 * SAVARY, A., PISKORSKI, J. (2010). ''[[http://iis.ipipan.waw.pl/2010/proceedings/iis10-15.pdf|Lexicons and Grammars for Named Entity Annotation in the National Corpus of Polish]]'', in Proceedings of the 18th International Conference Intelligent Information Systems (IIS'10), Siedlce, Poland.
 * PISKORSKI, J. (2005). ''Named-Entity Recognition for Polish with SProUT'', in LNCS Vol 3490: Proceedings of IMTCI 2004, Warsaw, Poland.
Line 5: Line 9:
 * SAVARY, A., PISKORSKI, J. (2011). ''Language Resources for Named Entity Annotation in the National Corpus of Polish'', to appear in Control and Cybernetics.
 * SAVARY, A., PISKORSKI, J. (2010). ''[[http://iis.ipipan.waw.pl/2010/proceedings/iis10-15.pdf|Lexicons and Grammars for Named Entity Annotation in the National Corpus of Polish]]'', in Proceedings of the 18th International Conference Intelligent Information Systems (IIS'10), Siedlce, Poland.

The file contains 153,477 inflected entries of Polish (and some foreign) proper names and named entity components:
 * forenames and surnames
 * city, country, mountain, region and river names
 * institution names
 * named entity triggers (months, days, positions, etc.)
The gazetteer contains 153,477 inflected entries of Polish (and some foreign) proper names and named entity components:
 * forenames and surnames,
 * city, country, mountain, region and river names,
 * institution names,
 * relational adjectives and inhabitant names stemming from country names,
 * named entity triggers (months, days, positions, etc.).
Line 16: Line 18:
== Copyright information ==
== Authors ==
 * [[http://www.info.univ-tours.fr/~savary/English/indexgb.html|Agata Savary]] - NKJP version of the gazetteer; LMF format definition
 * [[http://zil.ipipan.waw.pl/MichalLenart|Michał Lenart]] - LMF conversion and validation
 * [[http://zil.ipipan.waw.pl/JakubPiskorski|Jakub Piskorski]] - earlier version of the gazetteer used for information extraction from Polish texts
Line 18: Line 23:
The data is available under [[http://en.wikipedia.org/wiki/BSD_licenses#2-clause_license_.28.22Simplified_BSD_License.22_or_.22FreeBSD_License.22.29|2-clause BSD licence]].
== License ==

The data are available under the [[http://en.wikipedia.org/wiki/BSD_licenses#2-clause_license_.28.22Simplified_BSD_License.22_or_.22FreeBSD_License.22.29|2-clause BSD licence]].
Line 22: Line 29:
[[attachment:gazetteer-nkjp-no-pwn.zip]]
 * [[attachment:gazetteer-nkjp-no-pwn.zip|Text version]] as used with SProUT for NKJP pre-annotation

 * [[attachment:polish-ne-gazetteer-LMF-format.pdf|LMF format definition and conversion guidelines]]

 * [[attachment:PNEG-LMF-v1.tar.gz|LMF-compliant version]] containing:
   * LMF format definition and conversion guidelines,
   * Relax NG schema, morphosyntax configuration file and validation scripts,
   * grammatically complete gazetteer entries (9,060 lemmas and 95,359 word forms),
   * grammatically incomplete gazetteer entries (35,884 lemmas and 40,612 word forms).

Gazetteer for Polish Named Entities

The Gazetteer for Polish Named Entities was used within the SProUT platform, initially for information extraction from Polish texts, and then for the automatic pre-annotation of the National Corpus of Polish (NKJP) on the level of named entities. Its construction, contents and use have been described in:

  • SAVARY, A., PISKORSKI, J. (2011). Language Resources for Named Entity Annotation in the National Corpus of Polish, in Control and Cybernetics 40(2), Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland, pp. 361-391.

  • SAVARY, A., PISKORSKI, J. (2010). Lexicons and Grammars for Named Entity Annotation in the National Corpus of Polish, in Proceedings of the 18th International Conference Intelligent Information Systems (IIS'10), Siedlce, Poland.

  • PISKORSKI, J. (2005). Named-Entity Recognition for Polish with SProUT, in LNCS Vol 3490: Proceedings of IMTCI 2004, Warsaw, Poland.

The gazetteer contains 153,477 inflected entries of Polish (and some foreign) proper names and named entity components:

  • forenames and surnames,
  • city, country, mountain, region and river names,
  • institution names,
  • relational adjectives and inhabitant names stemming from country names,
  • named entity triggers (months, days, positions, etc.).

The file DOES NOT contain inhabitant names and relational adjectives derived from the names of Polish settlements. These data, owned by the PWN publisher, were used within the NKJP project under a separate licence and are protected by copyright.
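
A minimal sketch of how inflected gazetteer entries of this kind can support named-entity pre-annotation, assuming a simplified in-memory representation. The entry layout (inflected form, lemma, type), the type labels, the sample entries and the helper names build_index and lookup are all hypothetical; they are not taken from the distributed file or from SProUT.

```python
from collections import defaultdict

# Hypothetical sample entries: (inflected form, lemma, entity type).
# The real gazetteer's file format and type labels may differ.
SAMPLE_ENTRIES = [
    ("Warszawa", "Warszawa", "city"),
    ("Warszawy", "Warszawa", "city"),
    ("Warszawie", "Warszawa", "city"),
    ("Wisła", "Wisła", "river"),
    ("Wisły", "Wisła", "river"),
]


def build_index(entries):
    """Map each inflected form to the (lemma, type) pairs it may realise."""
    index = defaultdict(list)
    for form, lemma, ne_type in entries:
        index[form].append((lemma, ne_type))
    return index


def lookup(index, tokens):
    """Return (token, lemma, type) for every token found in the index."""
    hits = []
    for token in tokens:
        # .get() avoids creating empty entries for unknown tokens.
        for lemma, ne_type in index.get(token, []):
            hits.append((token, lemma, ne_type))
    return hits


index = build_index(SAMPLE_ENTRIES)
matches = lookup(index, "Mieszkam w Warszawie nad Wisłą".split())
```

Listing inflected forms explicitly means a form such as "Warszawie" is matched by exact lookup and mapped back to its lemma, while a form absent from the sample ("Wisłą" here) is simply not annotated.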

Authors

  • Agata Savary - NKJP version of the gazetteer; LMF format definition

  • Michał Lenart - LMF conversion and validation

  • Jakub Piskorski - earlier version of the gazetteer used for information extraction from Polish texts

License

The data are available under the 2-clause BSD licence.

Available resources

  • Text version as used with SProUT for NKJP pre-annotation

  • LMF format definition and conversion guidelines

  • LMF-compliant version containing:

    • LMF format definition and conversion guidelines,
    • Relax NG schema, morphosyntax configuration file and validation scripts,
    • grammatically complete gazetteer entries (9,060 lemmas and 95,359 word forms),
    • grammatically incomplete gazetteer entries (35,884 lemmas and 40,612 word forms).
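
As a quick check of the figures above, the following sketch adds up the lemma and word-form counts of the two parts of the LMF-compliant version. Only the four counts come from this page; the variable names are illustrative, and no relation between these totals and the entry count of the text version is implied.

```python
# Counts as stated in the list of resources above.
complete = {"lemmas": 9_060, "word_forms": 95_359}
incomplete = {"lemmas": 35_884, "word_forms": 40_612}

# Sum the two parts key by key.
totals = {key: complete[key] + incomplete[key] for key in complete}
print(totals)  # {'lemmas': 44944, 'word_forms': 135971}
```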