Diff for "POSMAC"

Differences between revisions 1 and 22 (spanning 21 versions)
Revision 1 as of 2021-12-14 20:11:00
Size: 930
Comment:
Revision 22 as of 2023-04-21 11:50:51
Size: 2402
Comment:
Line 1: Line 1:
## page was renamed from POSMC
Line 3: Line 4:
The Polish Open Science Metadata Corpus (POSMC) is a collection of 216,214 abstracts of scientific publications compiled in the [[CURLICAT]] project. The Polish subset of CURLICAT contains data acquired from the [[https://bibliotekanauki.pl/|Library of Science]], a platform providing open access to full texts of articles published in over 900 Polish scientific journals and full texts of selected scientific books together with extensive bibliographic metadata.

The Polish Open Science Metadata Corpus (POSMAC) is a subset of the [[CURLICAT]] corpus. It contains data acquired from the [[https://bibliotekanauki.pl/|Library of Science]] (LoS), a platform providing open access to full texts of articles published in over 900 Polish scientific journals and full texts of selected scientific books together with extensive bibliographic metadata. More than 70 percent of the metadata records included in the resulting corpus contain keywords describing the content of the indexed articles. This makes POSMAC a particularly valuable source of data for training keyword generation models and semantic indexing systems.

== Domains ==

Top 10 scientific domains represented in POSMAC:

|| '''Domains''' || '''Documents''' || '''With keywords''' ||
|| Engineering and technical sciences ||<)> 58 974 ||<)> 57 165 ||
|| Social sciences ||<)> 58 166 ||<)> 41 799 ||
|| Agricultural sciences ||<)> 29 811 ||<)> 15 492 ||
|| Humanities ||<)> 22 755 ||<)> 11 497 ||
|| Exact and natural sciences ||<)> 13 579 ||<)> 9 185 ||
|| Humanities, Social sciences ||<)> 12 809 ||<)> 7 063 ||
|| Medical and health sciences ||<)> 6 030 ||<)> 3 913 ||
|| Medical and health sciences, Social sciences ||<)> 828 ||<)> 571 ||
|| Humanities, Medical and health sciences, Social sciences ||<)> 601 ||<)> 455 ||
|| Engineering and technical sciences, Humanities ||<)> 312 ||<)> 312 ||
Line 7: Line 25:
 * [[http://curlicat.nlp.ipipan.waw.pl/download/latest/pl-raw.zip|Raw data]]
{{{#!wiki comment
* [[http://curlicat.nlp.ipipan.waw.pl/download/latest/pl-raw.zip|Raw data]]
}}}
Line 14: Line 34:
== Publications ==
== Presentation ==
Line 16: Line 36:
Pęzik P., Mikołajczyk A., Wawrzyński A., Nitoń B., Ogrodniczuk M. ''Keyword Extraction from Short Texts with a Text-to-text Transfer Transformer (T5)'' (in preparation).

[[http://zil.ipipan.waw.pl/seminarium-archiwum?action=AttachFile&do=view&target=2021-12-20.pdf|Keyword Extraction with a Text-to-text Transfer Transformer]] (presentation in Polish at NLP seminar, December 2021)

== Citation ==

<<BibMate(key, "pez:etal:22:aciids", omitYears=true)>>

The Polish Open Science Metadata Corpus

The Polish Open Science Metadata Corpus (POSMAC) is a subset of the CURLICAT corpus. It contains data acquired from the Library of Science (LoS), a platform providing open access to full texts of articles published in over 900 Polish scientific journals and full texts of selected scientific books together with extensive bibliographic metadata. More than 70 percent of the metadata records included in the resulting corpus contain keywords describing the content of the indexed articles. This makes POSMAC a particularly valuable source of data for training keyword generation models and semantic indexing systems.

Domains

Top 10 scientific domains represented in POSMAC:

|| '''Domains''' || '''Documents''' || '''With keywords''' ||
|| Engineering and technical sciences ||<)> 58 974 ||<)> 57 165 ||
|| Social sciences ||<)> 58 166 ||<)> 41 799 ||
|| Agricultural sciences ||<)> 29 811 ||<)> 15 492 ||
|| Humanities ||<)> 22 755 ||<)> 11 497 ||
|| Exact and natural sciences ||<)> 13 579 ||<)> 9 185 ||
|| Humanities, Social sciences ||<)> 12 809 ||<)> 7 063 ||
|| Medical and health sciences ||<)> 6 030 ||<)> 3 913 ||
|| Medical and health sciences, Social sciences ||<)> 828 ||<)> 571 ||
|| Humanities, Medical and health sciences, Social sciences ||<)> 601 ||<)> 455 ||
|| Engineering and technical sciences, Humanities ||<)> 312 ||<)> 312 ||
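
As a rough illustration of how the keyword coverage figures relate to the counts listed for the top 10 domains, the following Python sketch computes the share of documents with keywords per domain and across these ten rows. The counts are copied from the figures above; the names `DOMAIN_COUNTS`, `keyword_coverage` and `overall_coverage` are illustrative assumptions and not part of any POSMAC tooling. The result covers only the ten listed domains, not the whole corpus.

```python
# Counts copied from the top-10 domain figures above: (documents, with keywords).
# The names used here are hypothetical helpers, not part of the corpus distribution.
DOMAIN_COUNTS = {
    "Engineering and technical sciences": (58974, 57165),
    "Social sciences": (58166, 41799),
    "Agricultural sciences": (29811, 15492),
    "Humanities": (22755, 11497),
    "Exact and natural sciences": (13579, 9185),
    "Humanities, Social sciences": (12809, 7063),
    "Medical and health sciences": (6030, 3913),
    "Medical and health sciences, Social sciences": (828, 571),
    "Humanities, Medical and health sciences, Social sciences": (601, 455),
    "Engineering and technical sciences, Humanities": (312, 312),
}


def keyword_coverage(documents: int, with_keywords: int) -> float:
    """Return the fraction of documents that carry keywords."""
    return with_keywords / documents


def overall_coverage(counts: dict[str, tuple[int, int]]) -> float:
    """Return the keyword coverage pooled over all listed domains."""
    total_documents = sum(docs for docs, _ in counts.values())
    total_with_keywords = sum(kw for _, kw in counts.values())
    return total_with_keywords / total_documents


if __name__ == "__main__":
    for domain, (docs, kw) in DOMAIN_COUNTS.items():
        print(f"{domain}: {keyword_coverage(docs, kw):.1%}")
    print(f"Top 10 domains combined: {overall_coverage(DOMAIN_COUNTS):.1%}")
```

Pooled over these ten rows, 147 452 of 203 865 documents carry keywords, which is in line with the "more than 70 percent" figure quoted for the corpus. Coverage varies widely by domain, so a keyword generation model trained on POSMAC would see far more engineering and social science examples than medical ones.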

Download

Licence

CC-BY 4.0

Presentation

Keyword Extraction with a Text-to-text Transfer Transformer (presentation in Polish at NLP seminar, December 2021)

Citation

List of publications

Piotr Pęzik, Agnieszka Mikołajczyk, Adam Wawrzyński, Bartłomiej Nitoń, and Maciej Ogrodniczuk. Keyword extraction from short texts with a text-to-text Transfer Transformer. In Edward Szczerbicki, Krystian Wojtkiewicz, Sinh Van Nguyen, Marcin Pietranik, and Marek Krótkiewicz, editors, ACIIDS 2022: Recent Challenges in Intelligent Information and Database Systems, number 1716 in Communications in Computer and Information Science (CCIS), pages 530–542. Springer Nature Singapore, 2022.