Corpus of literal readings of Polish verbal multiword expressions
This dataset contains occurrences of verbal multiword expressions (VMWEs) and their literal readings, stemming from the Polish PARSEME corpus. For instance, for the VMWE być w stanie (lit. to be in the state of) 'to be able to' as in:
Więcej nie jestem już w stanie dokonać.
the following is a true literal reading:
Dwóch rannych żołnierzy jest w stanie krytycznym.
while the following is a false literal reading:
Wystarczy tę kwotę przyrównać do płacy minimalnej, aby zrozumieć dlaczego stan czytelnictwa w Polsce jest opłakany.
The VMWE occurrences have been manually annotated. The literal readings have been automatically extracted following several heuristics, and then manually validated as true or false literal readings. The dataset contains:
- 3149 occurrences of VMWEs
- 72 true literal readings
- 344 false literal readings
The dataset allows us to calculate the idiomaticity rate of Polish VMWEs, i.e. the ratio of a VMWE's occurrences with idiomatic readings to the total of its idiomatic and literal occurrences in a corpus.
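The idiomaticity rate can be illustrated with a minimal Python sketch. It uses the aggregate counts listed above and assumes that only true literal readings count as literal occurrences, so false literal readings are left out of the denominator. The function and variable names are hypothetical and are not part of the dataset or its tooling.

```python
def idiomaticity_rate(idiomatic: int, literal: int) -> float:
    """Return idiomatic / (idiomatic + literal) for one VMWE or a whole corpus."""
    total = idiomatic + literal
    if total == 0:
        raise ValueError("no idiomatic or literal occurrences to compute a rate from")
    return idiomatic / total


# Aggregate counts reported for this dataset.
N_IDIOMATIC = 3149      # manually annotated VMWE occurrences
N_TRUE_LITERAL = 72     # candidates validated as true literal readings
# Assumption: the 344 false literal readings are not literal occurrences,
# so they are excluded from the computation.

overall_rate = idiomaticity_rate(N_IDIOMATIC, N_TRUE_LITERAL)
print(f"Overall idiomaticity rate: {overall_rate:.4f}")
```

The same function could be applied per expression, given the idiomatic and true literal counts of a single VMWE such as być w stanie.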
Reference publication:
Agata Savary and Silvio Ricardo Cordeiro (2018) Literal readings of multiword expressions: as scarce as hen's teeth, in the Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories (TLT-16), Prague, Czech Republic.
Authors
License
The data are distributed under the terms of the Creative Commons Attribution 4.0 (CC BY 4.0) license.
Available resources
- a README file
- true and false literal readings of VMWEs