The Polish Open Science Metadata Corpus
The Polish Open Science Metadata Corpus (POSMAC) is a subset of the CURLICAT corpus. It contains data acquired from the Library of Science (LoS), a platform providing open access to full texts of articles published in over 900 Polish scientific journals and full texts of selected scientific books together with extensive bibliographic metadata. More than 70 percent of the metadata records included in the resulting corpus contain keywords describing the content of the indexed articles. This makes POSMAC a particularly valuable source of data for training keyword generation models and semantic indexing systems.
Domains
The ten most frequent scientific domains in POSMAC:
Domains | Documents | With keywords |
Engineering and technical sciences | 58 974 | 57 165 |
Social sciences | 58 166 | 41 799 |
Agricultural sciences | 29 811 | 15 492 |
Humanities | 22 755 | 11 497 |
Exact and natural sciences | 13 579 | 9 185 |
Humanities, Social sciences | 12 809 | 7 063 |
Medical and health sciences | 6 030 | 3 913 |
Medical and health sciences, Social sciences | 828 | 571 |
Humanities, Medical and health sciences, Social sciences | 601 | 455 |
Engineering and technical sciences, Humanities | 312 | 312 |
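The keyword coverage implied by the table above can be checked with a few lines of Python. This is only an illustrative sketch: the figures are copied by hand from the table, the names `DOMAINS` and `coverage` are made up for this example, and the combined value covers the ten listed domains only, not the whole corpus.

```python
# Figures copied from the table above:
# (domain, number of documents, number of documents with keywords).
DOMAINS = [
    ("Engineering and technical sciences", 58974, 57165),
    ("Social sciences", 58166, 41799),
    ("Agricultural sciences", 29811, 15492),
    ("Humanities", 22755, 11497),
    ("Exact and natural sciences", 13579, 9185),
    ("Humanities, Social sciences", 12809, 7063),
    ("Medical and health sciences", 6030, 3913),
    ("Medical and health sciences, Social sciences", 828, 571),
    ("Humanities, Medical and health sciences, Social sciences", 601, 455),
    ("Engineering and technical sciences, Humanities", 312, 312),
]


def coverage(documents: int, with_keywords: int) -> float:
    """Return the share of documents that have keywords, as a percentage."""
    return 100.0 * with_keywords / documents


# Totals over the ten listed domains only (not the full corpus).
total_docs = sum(docs for _, docs, _ in DOMAINS)
total_kw = sum(kw for _, _, kw in DOMAINS)

for name, docs, kw in DOMAINS:
    print(f"{name}: {coverage(docs, kw):.1f}%")
print(f"Ten domains combined: {coverage(total_docs, total_kw):.1f}%")
```

Across these ten domains the combined coverage comes out at roughly 72 percent, in line with the "more than 70 percent" figure quoted for the corpus, while the per-domain values vary widely.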
Download
Licence
CC-BY 4.0
Presentation
Keyword Extraction with a Text-to-text Transfer Transformer (presentation in Polish at the NLP seminar, December 2021)
Publications
Pęzik P., Mikołajczyk A., Wawrzyński A., Nitoń B., Ogrodniczuk M. Keyword Extraction from Short Texts with a Text-to-text Transfer Transformer (T5). In: Szczerbicki, E. et al. (eds.) ACIIDS 2022 Proceedings. Springer Nature Switzerland AG. (forthcoming, see its arXiv copy)
Pęzik P., Mikołajczyk A., Wawrzyński A., Żarnecki F., Nitoń B., Ogrodniczuk M. Transferable Keyword Extraction and Generation from Scholarly Documents with Text-to-text Language Models. Poster presented at the Third Workshop on Scholarly Document Processing (SDP 2022).