
Diff for "PARSEME-PL"

Differences between revisions 30 and 31
Revision 30 as of 2019-03-07 18:40:18
Size: 4671
Editor: AgataSavary
Revision 31 as of 2019-03-07 18:42:55
Size: 4543
Editor: AgataSavary
Line 29: Line 29:
-  * [[attachment:SEJF-1.1-Slownik.tar.gz|Polish subcorpus]] of PARSEME in the [[http://proycon.github.io/folia/|Folia]] format (XML)
Line 32: Line 30:
-  * Extending the corpus for other types of MWEs
+  * Extending the the annotation to other types of MWEs

Polish PARSEME corpus

The PARSEME corpus is a multilingual corpus manually annotated for verbal multiword expressions (VMWEs) in 20 languages, including Polish. It was used in the PARSEME shared task on automatic identification of verbal multiword expressions. It was created through a collective effort of the IC1207 COST action PARSEME. In total, it contains 280,838 sentences, 6,072,331 tokens and 79,326 annotated VMWEs. It is openly available via LINDAT/CLARIN under various flavours of the Creative Commons licence.

Reference publications:

The Polish subcorpus contains 27,904 sentences, 638,002 tokens and 5,536 VMWE annotations of 3 categories:

  • 503 verbal idioms (VIDs), e.g. bujać w obłokach (lit. to swing in the clouds) 'to fantasise', kości zostały rzucone (lit. the dice were cast) 'the die is cast'

  • 2,279 inherently reflexive verbs (IRVs), e.g. śmiać się (lit. to laugh self) 'to laugh', bać się (lit. to fear self) 'to be afraid'

  • 1,833 light verb constructions (LVCs), e.g. odnieść sukces (lit. to carry back a success) 'to be successful', sprawować patronat (lit. to perform patronage) 'to dispense patronage'.

The VMWE annotations are aligned with morphological and syntactic annotations in the CoNLL-U format. The morphological data (lemmas, POS, morphological features) stem from the original corpora. The syntactic data (dependencies) stem partly from the manual annotation in Składnica and partly from automatic annotation with UDPipe.
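
As an illustration of how such aligned annotations might be read, here is a minimal Python sketch. It assumes a layout that this page does not itself specify: tab-separated CoNLL-U-style rows with an extra 11th column holding the VMWE codes, where "1:LVC" opens expression 1 with its category, a bare "1" continues it, "*" marks tokens outside any VMWE, and ";" separates several codes on one token. The function name read_vmwes, the helper row, the sample sentence and the category label are all hypothetical; check the layout against the actual files before relying on it.

```python
def read_vmwes(lines):
    """Yield one dict per sentence: VMWE id -> [category, [token forms]].

    Assumption: the VMWE codes sit in the 11th tab-separated column.
    """
    sentence, in_sentence = {}, False
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():                 # a blank line closes a sentence
            if in_sentence:
                yield sentence
            sentence, in_sentence = {}, False
            continue
        if line.startswith("#"):             # comment / metadata line
            continue
        cols = line.split("\t")
        in_sentence = True
        if len(cols) < 11 or cols[10] in ("*", "_"):
            continue                         # token is not part of any VMWE
        for code in cols[10].split(";"):
            mwe_id, _, category = code.partition(":")
            entry = sentence.setdefault(mwe_id, [None, []])
            if category:                     # category appears on the first token only
                entry[0] = category
            entry[1].append(cols[1])         # column 2 is the word form
    if in_sentence:                          # file may lack a trailing blank line
        yield sentence


def row(idx, form, lemma, mwe):
    """Build a hypothetical 11-column row; unused columns are filled with '_'."""
    return "\t".join([str(idx), form, lemma] + ["_"] * 7 + [mwe])


# Hypothetical sample sentence built around the LVC 'odnieść sukces'.
SAMPLE = [
    "# text = Zespół odniósł sukces.",
    row(1, "Zespół", "zespół", "*"),
    row(2, "odniósł", "odnieść", "1:LVC"),
    row(3, "sukces", "sukces", "1"),
    row(4, ".", ".", "*"),
    "",
]

if __name__ == "__main__":
    for vmwes in read_vmwes(SAMPLE):
        print(vmwes)
```

Keying each expression by its identifier, rather than assuming adjacent tokens, lets the sketch also handle discontinuous VMWEs, where other words occur between the verb and its dependent.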



The VMWE data are distributed under the terms of the CC-BY v4 licence. The lemmas, POS tags, morphological features and dependency relations are distributed under the terms of the CC-BY-SA 4.0 and GNU GPL v.3 licences.

Available resources

Future work

  • Extending the annotation to other types of MWEs