DeepER Entity Library

DeepER Entity Library is a database containing over 800,000 entities, each described by its textual representations in Polish (names) and by WordNet synsets. The resource was originally created for deep entity recognition (DeepER) in RAFAEL, a Polish question answering system, by analysing definitions in the Polish Wikipedia [1,2]. A simplified version, which contains nominal groups instead of synsets, is also available.

Download main library: entities.txt.gz

Download simplified library: entitiesD.txt.gz

Created and made available by Piotr Przybyła.

Main library

The library contains 809,786 entities with 1,169,452 names (972,592 unique) and 1,264,918 synsets (31,545 unique). Each entity consists of the following elements, shown here for entity #9751, which describes Bronisław Komorowski:

  • Main name: Bronisław Komorowski,
  • Other names (aliases): Bronisław Maria Komorowski, Komorowski,
  • Description URL: http://pl.wikipedia.org/wiki/?curid=121267,
  • plWordNet 2.1 [3] synsets:
    • <podsekretarz1, podsekretarz stanu1, wiceminister1> (vice-minister, undersecretary),
    • <wicemarszałek1> (vice-speaker of the Sejm, the Polish parliament),
    • <polityk1> (politician),
    • <wysłannik1, poseł1, posłaniec2, wysłaniec1, posłannik1> (member of a parliament),
    • <marszałek1> (speaker of the Sejm),
    • <historyk1> (historian),
    • <minister1> (minister),
    • <prezydent1, prezydent miasta1> (president of a city, mayor).

Each line of the file corresponds to a single entity and has the following format:

<main_name><tab><article_name><tab><URL><tab><names_number(n)><tab><synsets_number(m)><tab><name_1><tab><name_2>...<tab><name_n><tab><synset_id_1><tab><synset_id_2>...<tab><synset_id_m><tab><synset_repr_1><tab><synset_repr_2>...<tab><synset_repr_m>

where:

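  • names_number (n) and synsets_number (m) give the number of name fields and the number of synsets that follow,
  • synset_id is the identifier of the synset in plWordNet 2.1,
  • synset_repr is a human-readable representation of the synset.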
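
The format above can be read field by field. What follows is a minimal Python sketch, not code distributed with the library: the names Entity, parse_entity_line and iter_entities are hypothetical, and the sketch assumes that the file is UTF-8 encoded and that every line holds exactly 5 + n + 2m tab-separated fields.

```python
import gzip
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Entity:
    """One line of the main library (hypothetical container type)."""
    main_name: str
    article_name: str
    url: str
    names: List[str]
    synset_ids: List[str]    # kept as strings; no ID format is assumed
    synset_reprs: List[str]  # human-readable synset representations


def parse_entity_line(line: str) -> Entity:
    """Split one tab-separated line into its documented fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    main_name, article_name, url = fields[:3]
    n, m = int(fields[3]), int(fields[4])
    # Assumption: n names, then m synset ids, then m synset representations.
    expected = 5 + n + 2 * m
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    names_end = 5 + n
    ids_end = names_end + m
    return Entity(
        main_name=main_name,
        article_name=article_name,
        url=url,
        names=fields[5:names_end],
        synset_ids=fields[names_end:ids_end],
        synset_reprs=fields[ids_end:],
    )


def iter_entities(path: str) -> Iterator[Entity]:
    """Yield entities from a gzip-compressed library file, skipping blank lines."""
    # Assumption: the file is UTF-8 encoded.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield parse_entity_line(line)
```

Under these assumptions, iterating over iter_entities("entities.txt.gz") would yield one Entity per line of the downloaded file. The strict field-count check is a design choice that makes malformed lines fail loudly; it can be relaxed if the real file turns out to deviate from the documented layout.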

Simplified library

Instead of WordNet synsets, the simplified version of the library contains the nominal groups from which the synsets were extracted. For example, the list for Bronisław Komorowski is the following:

  • wicemarszałek i marszałek Sejmu
  • minister obrony narodowej
  • wiceminister i minister obrony narodowej
  • marszałek Sejmu RP
  • polski polityk
  • poseł
  • prezydent RP
  • historyk

Therefore, each line has the following format:

<main_name><tab><article_name><tab><URL><tab><names_number(n)><tab><groups_number(m)><tab><name_1><tab><name_2>...<tab><name_n><tab><group_1><tab><group_2>...<tab><group_m>
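
A line of the simplified library can be split in the same way. The Python sketch below is illustrative only: SimplifiedEntity and parse_simplified_line are hypothetical names, and the code assumes exactly 5 + n + m tab-separated fields per line.

```python
from typing import List, NamedTuple


class SimplifiedEntity(NamedTuple):
    """One line of the simplified library (hypothetical container type)."""
    main_name: str
    article_name: str
    url: str
    names: List[str]
    groups: List[str]  # nominal groups, e.g. "polski polityk"


def parse_simplified_line(line: str) -> SimplifiedEntity:
    """Split one tab-separated line into its documented fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}")
    n, m = int(fields[3]), int(fields[4])
    # Assumption: n names followed by m nominal groups, and nothing else.
    expected = 5 + n + m
    if len(fields) != expected:
        raise ValueError(f"expected {expected} fields, got {len(fields)}")
    return SimplifiedEntity(
        main_name=fields[0],
        article_name=fields[1],
        url=fields[2],
        names=fields[5:5 + n],
        groups=fields[5 + n:],
    )
```

Because the nominal groups are plain text, the groups list can be searched directly, for example with "historyk" in entity.groups.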

References

[1] Przybyła, P. (2015). Gathering Knowledge for Question Answering Beyond Named Entities. Proceedings of the 20th International Conference on Application of Natural Language to Information Systems (NLDB 2015).

[2] Przybyła, P. (2014). Odpowiadanie na pytania w języku polskim z użyciem głębokiego rozpoznawania nazw. Doctoral thesis, Institute of Computer Science, Polish Academy of Sciences.

[3] Maziarz, M., Piasecki, M., and Szpakowicz, S. (2012). Approaching plWordNet 2.0. Proceedings of the 6th Global Wordnet Conference.