This is the README file of the Polish XML subcorpus of the PARSEME corpus

Agata Savary, 17 November 2017

The official release in the [[http://typo.uni-konstanz.de/parseme/index.php/2-general/184-parseme-shared-task-format-of-the-final-annotation|parseme-tsv]] format, aligned with morphological and syntactic annotations in the [[http://universaldependencies.org/format.html|CoNLL-U]] format, is available via [[http://hdl.handle.net/11372/LRT-2282|LINDAT/CLARIN]] (see README.md of the Polish subcorpus).

This README describes the '''XML version''' of the Polish corpus. 

The Polish data stem from
 * the [[http://clip.ipipan.waw.pl/NationalCorpusOfPolish|National Corpus of Polish]] - all texts from daily newspapers are included, i.e. those whose identifiers start with 130-2, 130-3 or 130-5; the texts with identifiers starting with 130-5 were merged into bigger files for easier file management:
    * from 130-5-000000001 to 130-5-000000099 - merged into 130-5-0000000
    * from 130-5-000000100 to 130-5-000000199 - merged into 130-5-0000001
    * from 130-5-000000200 to 130-5-000000299 - merged into 130-5-0000002
    * etc.
    * from 130-5-000001900 to 130-5-000001999 - merged into PL-NKJP-130-5-0000019
    * from 130-5-000001999 to 130-5-000002000 - merged into PL-NKJP-130-5-0000020
 * the [[http://zil.ipipan.waw.pl/PolishCoreferenceCorpus|Polish Coreference Corpus]] - the 21 "long" texts from this corpus are included (36,000 tokens, all from the Rzeczpospolita newspaper)
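
The merging scheme for the 130-5 texts can be sketched as follows. This is only an illustration, not the script actually used to build the corpus: the function name `merged_file_id` is hypothetical, and the rule (integer-divide the 9-digit text number by 100 and pad the result to 7 digits) is an assumption inferred from the examples in the list above. Some merged names in that list carry a PL-NKJP- prefix, which this sketch does not add.

```python
import re

# Assumed identifier shape, e.g. "130-5-000000100" (9-digit text number).
_ID_PATTERN = re.compile(r"^(130-5)-(\d{9})$")


def merged_file_id(text_id: str) -> str:
    """Return the identifier of the merged file that would hold text_id.

    Hypothetical helper: texts are assumed to be grouped in blocks of 100,
    so the block index is the text number divided by 100 (rounded down).
    """
    match = _ID_PATTERN.match(text_id)
    if match is None:
        raise ValueError(f"not a 130-5 text identifier: {text_id!r}")
    prefix, number = match.groups()
    block = int(number) // 100
    # The merged identifiers in the list above use a 7-digit block number.
    return f"{prefix}-{block:07d}"


if __name__ == "__main__":
    for text_id in ("130-5-000000001", "130-5-000000150", "130-5-000001900"):
        print(text_id, "->", merged_file_id(text_id))
```

Under this assumed rule, 130-5-000000150 would fall into 130-5-0000001, in line with the second entry of the list.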

Verbal multiword expressions (VMWEs) have been annotated by a single annotator per file. The following [[http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.0/?page=030_Categories_of_VMWEs|categories]] are used: ID, IReflV, LVC, OTH.

All VMWE annotations were performed by Agata Savary.

The VMWE annotations are distributed under the terms of the [[https://creativecommons.org/licenses/by/4.0/|CC-BY v4]] license.

Contact: agata.savary@univ-tours.fr