MweLitRead

Corpus of literal readings of Polish verbal multiword expressions

This dataset contains occurrences of verbal multiword expressions (VMWEs) and their literal readings, stemming from the Polish PARSEME corpus. For instance, for the VMWE być w stanie (lit. to be in the state of) 'to be able to' as in:

  • Więcej nie jestem już w stanie dokonać.

the following is a true literal reading:

  • Dwóch rannych żołnierzy jest w stanie krytycznym.

while the following is a false literal reading:

  • Wystarczy tę kwotę przyrównać do płacy minimalnej, aby zrozumieć dlaczego stan czytelnictwa w Polsce jest opłakany.

The VMWE occurrences have been manually annotated. The literal readings have been automatically extracted following several heuristics, and then manually validated as true or false literal readings. The dataset contains:

  • 3149 occurrences of VMWEs
  • 72 true literal readings
  • 344 false literal readings

The dataset allows us to calculate the idiomaticity rate of Polish VMWEs, i.e. the ratio of a VMWE's occurrences with idiomatic readings to all of its occurrences, both idiomatic and literal, in a corpus.
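As a rough illustration of this ratio, the sketch below applies it to the dataset-wide counts listed above. It assumes that only the true literal readings count as literal occurrences (the false literal readings are excluded) and that pooling all VMWEs into one aggregate figure is meaningful; the function name `idiomaticity_rate` and the constants are hypothetical and are not part of the released resources.

```python
def idiomaticity_rate(idiomatic: int, literal: int) -> float:
    """Return idiomatic occurrences divided by idiomatic plus literal occurrences."""
    total = idiomatic + literal
    if total == 0:
        raise ValueError("no occurrences to compute a rate from")
    return idiomatic / total


# Counts taken from the dataset summary above.
# Assumption: false literal readings are neither idiomatic nor literal
# occurrences, so they are left out of the denominator.
VMWE_OCCURRENCES = 3149
TRUE_LITERAL_READINGS = 72

rate = idiomaticity_rate(VMWE_OCCURRENCES, TRUE_LITERAL_READINGS)
print(f"Aggregate idiomaticity rate: {rate:.1%}")  # 3149 / 3221, about 97.8%
```

The same function can be applied per expression, given the idiomatic and true literal counts of a single VMWE, which is closer to the definition above than the pooled figure.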

Reference publication:

  • Agata Savary and Silvio Ricardo Cordeiro (2018) Literal readings of multiword expressions: as scarce as hen's teeth, in the Proceedings of the 16th International Workshop on Treebanks and Linguistic Theories (TLT-16), Prague, Czech Republic.

Authors

License

The data are distributed under the terms of the CC BY 4.0 license.

Available resources