Diff for "PARSEME-PL"

Differences between revisions 24 and 31 (spanning 7 versions)
Revision 24 as of 2017-11-17 15:05:30
Size: 3835
Editor: AgataSavary
Comment:
Revision 31 as of 2019-03-07 18:42:55
Size: 4543
Editor: AgataSavary
Comment:
Line 3: Line 3:
The PARSEME corpus is a multilingual corpus annotated manually for verbal multiword expressions (VMWEs) in '''18 languages''' including Polish. It was used in the [[http://multiword.sourceforge.net/sharedtask2017/|PARSEME shared task]] on automatic identification of verbal multiword expressions. It was created through a collective effort of the IC1207 COST action [[http://www.parseme.eu|PARSEME]]. It contains in total 274,376 sentences, 5,439,204 tokens and 62,218 annotated VMWEs in 18 languages. It is openly available via [[http://hdl.handle.net/11372/LRT-2282|LINDAT/CLARIN]] under various flavours of the Creative Commons licence.
The PARSEME corpus is a multilingual corpus annotated manually for verbal multiword expressions (VMWEs) in '''20 languages''' including Polish. It was used in the [[http://multiword.sourceforge.net/sharedtask2018/|PARSEME shared task]] on automatic identification of verbal multiword expressions. It was created through a collective effort of the IC1207 COST action [[http://www.parseme.eu|PARSEME]]. It contains in total 280,838 sentences, 6,072,331 tokens and 79,326 annotated VMWEs in 20 languages. It is openly available via [[http://hdl.handle.net/11372/LRT-2842|LINDAT/CLARIN]] under various flavours of the Creative Commons licence.
Line 6: Line 6:
 * Agata Savary, Marie Candito, Verginica Barbu Mititelu, Eduard Bejček, Fabienne Cap, Slavomír Čéplö, Silvio Ricardo Cordeiro, Gülşen Eryiğit, Voula Giouli, Maarten van Gompel, Yaakov HaCohen-Kerner, Jolanta Kovalevskaite, Simon Krek, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Lonneke van der Plas, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Ivelina Stoyanova, Veronika Vincze (2018) "PARSEME multilingual corpus of verbal multiword expressions", to appear in a special volume of Phraseology and Multiword Expressions, Language Science Press.
 * SAVARY, A., RAMISCH, C., RICARDO CORDEIRO, S., SANGATI, F., VINCZE, V., QASEMIZADEH, B., CANDITO, M., CAP, F., GIOULI, V., STOYANOVA, I., DOUCET, A. (2017): "[[http://aclweb.org/anthology/W/W17/W17-1704.pdf|The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions]]", in the Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), 4 April 2017, Valencia, Spain. ([[http://aclweb.org/anthology/W/W17/W17-1704.bib|bibtex]])
 * Carlos Ramisch, Silvio Ricardo Cordeiro, Agata Savary, Veronika Vincze, Verginica Barbu Mititelu, Archna Bhatia, Maja Buljan, Marie Candito, Polona Gantar, Voula Giouli, Tunga Güngör, Abdelati Hawwari, Uxoa Iñurrieta, Jolanta Kovalevskaitė, Simon Krek, Timm Lichte, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Behrang QasemiZadeh, Renata Ramisch, Nathan Schneider, Ivelina Stoyanova, Ashwini Vaidya, Abigail Walsh (2018) [[https://aclanthology.info/papers/W18-4925/w18-4925|Edition 1.1 of the PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions]], in the Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018), Santa Fe, USA.
 * Agata Savary, Marie Candito, Verginica Barbu Mititelu, Eduard Bejček, Fabienne Cap, Slavomír Čéplö, Silvio Ricardo Cordeiro, Gülşen Eryiğit, Voula Giouli, Maarten van Gompel, Yaakov HaCohen-Kerner, Jolanta Kovalevskaite, Simon Krek, Chaya Liebeskind, Johanna Monti, Carla Parra Escartín, Lonneke van der Plas, Behrang QasemiZadeh, Carlos Ramisch, Federico Sangati, Ivelina Stoyanova, Veronika Vincze (2018) [[http://langsci-press.org/catalog/view/204/1344/1319-1|PARSEME multilingual corpus of verbal multiword expressions]], in Markantonatou, S., Ramisch, C., Savary, A., Vincze, V. (Eds.) [[http://langsci-press.org/catalog/book/204|Multiword expressions at length and in depth: Extended papers from the MWE 2017 workshop]], Language Science Press, Berlin, pp. 87-147.
 * Savary, A., Ramisch, C., Ricardo Cordeiro, S., Sangati, F., Vincze, V., QasemiZadeh, B., Candito, M., Cap, F., Giouli, V., Stoyanova, I., Doucet, A. (2017): "[[http://aclweb.org/anthology/W/W17/W17-1704.pdf|The PARSEME Shared Task on Automatic Identification of Verbal Multiword Expressions]]", in the Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), 4 April 2017, Valencia, Spain. ([[http://aclweb.org/anthology/W/W17/W17-1704.bib|bibtex]])
Line 9: Line 10:
The Polish subcorpus contains 13,606 sentences, 220,934 tokens and 3,649 VMWE annotations of 3 categories:
 * 383 idioms (IDs), e.g. ''bujać w obłokach'' (lit. to swing in the clouds) 'to fantasise', ''kości zostały rzucone'' (lit. dice were cast) 'the die is cast'
 * 1,813 inherently reflexive verbs (IReflVs), e.g. ''śmiać się'' (lit. to laugh self) 'to laugh', ''bać się'' (lit. to fear self) 'to be afraid'
 * 1,453 light verb constructions (LVCs), e.g. ''odnieść sukces'' (lit. to carry back a success) 'to be successful', ''sprawować patronat'' (lit. to perform patronage) 'to dispense patronage'.

The VMWE annotations are aligned with morphological and syntactic annotations in the [[http://universaldependencies.org/format.html|CoNLL-U format]]. The morphological data (lemmas, POS, morphological features) stem from the original corpora. The syntactic data (dependencies) were generated automatically with [[http://ufal.mff.cuni.cz/udpipe|UDPipe]] using a pre-trained Polish model.
The Polish subcorpus contains 27,904 sentences, 638,002 tokens and 5,536 VMWE annotations of 3 categories:
 * 503 verbal idioms (VIDs), e.g. ''bujać w obłokach'' (lit. to swing in the clouds) 'to fantasise', ''kości zostały rzucone'' (lit. dice were cast) 'the die is cast'
 * 2,279 inherently reflexive verbs (IRVs), e.g. ''śmiać się'' (lit. to laugh self) 'to laugh', ''bać się'' (lit. to fear self) 'to be afraid'
 * 1,833 light verb constructions (LVCs), e.g. ''odnieść sukces'' (lit. to carry back a success) 'to be successful', ''sprawować patronat'' (lit. to perform patronage) 'to dispense patronage'.

The VMWE annotations are aligned with morphological and syntactic annotations in the [[http://universaldependencies.org/format.html|CoNLL-U format]]. The morphological data (lemmas, POS, morphological features) stem from the original corpora. The syntactic data (dependencies) stem partly from the manual annotation in [[http://zil.ipipan.waw.pl/Sk%C5%82adnica|Składnica]] and partly from automatic annotation with [[https://ufal.mff.cuni.cz/udpipe|UDPipe]].
Line 21: Line 21:
The data are distributed under the terms of the [[https://creativecommons.org/licenses/by/4.0/|CC-BY v4]] license.
The lemmas, POS-tags, morphological features, and dependency relations, contained in CoNLL-U files, are distributed under the terms of the [[https://www.gnu.org/licenses/gpl.html|GNU GPL v.3]] licence.
The VMWE data are distributed under the terms of the [[https://creativecommons.org/licenses/by/4.0/|CC-BY v4]] license.
The lemmas, POS-tags, morphological features, and dependency relations are distributed under the terms of the [[https://creativecommons.org/licenses/by-sa/4.0/|CC-BY-SA 4.0]] and [[https://www.gnu.org/licenses/gpl.html|GNU GPL v.3]] licences.
Line 25: Line 25:
 * [[http://hdl.handle.net/11372/LRT-2282|PARSEME corpus in 18 languages]] including Polish at LINDAT/CLARIN
 * [[attachment:SEJF-1.1-Slownik.tar.gz|Polish subcorpus]] of PARSEME in the [[http://proycon.github.io/folia/|Folia]] format (XML)
 * PARSEME corpus including Polish at LINDAT/CLARIN
   * [[http://hdl.handle.net/11372/LRT-2842|version 1.1]]
   * [[http://hdl.handle.net/11372/LRT-2282|version 1.0]]
Line 29: Line 30:
Updating and enlarging the corpus in line with the [[http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/|PARSEME guidelines version 1.1]]. Corpus release for the [[https://typo.uni-konstanz.de/parseme/index.php/2-general/202-parseme-shared-task-on-automatic-identification-of-verbal-mwes-edition-1-1|PARSEME shared task edition 1.1]].
 * Extending the annotation to other types of MWEs

Polish PARSEME corpus

The PARSEME corpus is a multilingual corpus annotated manually for verbal multiword expressions (VMWEs) in 20 languages including Polish. It was used in the PARSEME shared task on automatic identification of verbal multiword expressions. It was created through a collective effort of the IC1207 COST action PARSEME. It contains in total 280,838 sentences, 6,072,331 tokens and 79,326 annotated VMWEs in 20 languages. It is openly available via LINDAT/CLARIN under various flavours of the Creative Commons licence.

Reference publications:

The Polish subcorpus contains 27,904 sentences, 638,002 tokens and 5,536 VMWE annotations of 3 categories:

  • 503 verbal idioms (VIDs), e.g. bujać w obłokach (lit. to swing in the clouds) 'to fantasise', kości zostały rzucone (lit. dice were cast) 'the die is cast'

  • 2,279 inherently reflexive verbs (IRVs), e.g. śmiać się (lit. to laugh self) 'to laugh', bać się (lit. to fear self) 'to be afraid'

  • 1,833 light verb constructions (LVCs), e.g. odnieść sukces (lit. to carry back a success) 'to be successful', sprawować patronat (lit. to perform patronage) 'to dispense patronage'.

The VMWE annotations are aligned with morphological and syntactic annotations in the CoNLL-U format. The morphological data (lemmas, POS, morphological features) stem from the original corpora. The syntactic data (dependencies) stem partly from the manual annotation in Składnica and partly from automatic annotation with UDPipe.
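
To illustrate how such aligned annotations might be read, here is a minimal sketch that counts VMWE annotations per category. It assumes the VMWE codes are stored in an eleventh tab-separated column appended to the ten standard CoNLL-U columns, with `*` on tokens outside any VMWE, `id:CATEGORY` on the first token of an expression and a bare `id` on its remaining tokens. That column layout, the names `row` and `count_vmwes`, and the sample sentence are illustrative assumptions and should be checked against the released files.

```python
from collections import Counter


def row(idx, form, lemma, upos, mwe):
    """Build one tab-separated token line: the ten CoNLL-U columns
    (unused ones filled with "_") plus an assumed VMWE column."""
    return "\t".join([str(idx), form, lemma, upos,
                      "_", "_", "_", "_", "_", "_", mwe])


def count_vmwes(lines, mwe_column=10):
    """Count VMWE annotations per category.

    Only codes of the form "id:CATEGORY" are counted, so each
    expression is counted once, on its first token.
    """
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        fields = line.split("\t")
        if len(fields) <= mwe_column:
            continue  # line has no VMWE column
        tag = fields[mwe_column]
        if tag in ("*", "_"):
            continue  # token outside any VMWE, or left unspecified
        for code in tag.split(";"):  # a token may belong to several VMWEs
            if ":" in code:
                _, category = code.split(":", 1)
                counts[category] += 1
    return counts


# Hypothetical sentence built from two examples listed above.
sample = [
    "# text = Ona śmieje się i buja w obłokach.",
    row(1, "Ona", "on", "PRON", "*"),
    row(2, "śmieje", "śmiać", "VERB", "1:IRV"),
    row(3, "się", "się", "PRON", "1"),
    row(4, "i", "i", "CCONJ", "*"),
    row(5, "buja", "bujać", "VERB", "2:VID"),
    row(6, "w", "w", "ADP", "2"),
    row(7, "obłokach", "obłok", "NOUN", "2"),
    row(8, ".", ".", "PUNCT", "*"),
    "",
]

counts = count_vmwes(sample)
for category, n in sorted(counts.items()):
    print(category, n)
```

Counting only the first token of each expression keeps the totals comparable to per-category figures such as those listed above, rather than counting every token that takes part in a VMWE.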

Author

License

The VMWE data are distributed under the terms of the CC-BY v4 license. The lemmas, POS-tags, morphological features, and dependency relations are distributed under the terms of the CC-BY-SA 4.0 and GNU GPL v.3 licences.

Available resources

Future work

  • Extending the annotation to other types of MWEs