MARCELL project

Project factsheet

English name:

Multilingual Resources for CEF.AT in the legal domain

Polish name:

Zasoby wielojęzyczne dla CEF.AT w domenie prawnej

Project type:

A CEF-TC-2017-3 – eTranslation grant

Grant agreement number:



Project duration:

1 October 2018 – 30 September 2020 (extended to 31 March 2021)

Principal investigator:

Tamás Váradi

Polish participation:

Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences

Polish PI:

Maciej Ogrodniczuk



Project summary

The project focuses on the body of national legislation (laws, decrees, regulations) in the seven countries making up the consortium: Bulgaria, Croatia, Hungary, Poland, Romania, Slovakia and Slovenia. We start from the premise that national legislative texts are currently not automatically available to CEF.AT, and that existing MT systems could be improved if they had access to them. Furthermore, advances in MT technology emphasize the need for clean, validated domain-specific datasets. Groundbreaking MT research results reported at this year’s WMT Competition employed back-translated training data, which underscores the importance of domain-specific monolingual corpora for MT technology.

The project aims to process two resources available in all seven languages concerned: the multilingual ontology-based thesaurus EUROVOC on the one hand, and the corpora of all national legislation in the respective languages on the other. The project aims to produce the following deliverables: 1) Seven large-scale, suitably preprocessed (tokenized and morphologically tagged) monolingual corpora of national legislative documents, classified by EUROVOC topics/descriptors and annotated with the EUROVOC and IATE terms identified in them. 2) A comparable corpus of the seven languages, aligned at the level of topic domains identified by EUROVOC descriptors. 3) A Croatian-English parallel corpus consisting of ca. 1800 legislative documents.

The envisaged deliverables are all directly useful for CEF.AT as training material for MT systems. As the partners of the consortium represent less-resourced languages, such resources are in short supply for these languages. In addition to the expected overall improvement of MT systems in the seven languages concerned, the project will have an impact on both the e-justice and the Online Dispute Resolution Digital Service Infrastructures (DSIs), as the resources focus on national legislation, which is of direct relevance to both. Sustainability of the results will be achieved by creating a workflow that automatically monitors changes in national legislation and updates the respective national corpora.
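The monitoring step of such a workflow could, for illustration, compare content digests of the currently published documents against those stored with the corpus. The sketch below is not the project's actual implementation: the function names, the document IDs and the use of SHA-256 digests are all assumptions made for this example.

```python
from __future__ import annotations

import hashlib


def digest(text: str) -> str:
    """Return a SHA-256 hex digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def detect_changes(known: dict[str, str], fetched: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored digests with freshly fetched documents.

    known   maps a document ID to the digest stored with the corpus.
    fetched maps a document ID to the text currently published.
    """
    added = sorted(doc_id for doc_id in fetched if doc_id not in known)
    removed = sorted(doc_id for doc_id in known if doc_id not in fetched)
    modified = sorted(
        doc_id
        for doc_id, text in fetched.items()
        if doc_id in known and known[doc_id] != digest(text)
    )
    return {"added": added, "modified": modified, "removed": removed}


# Hypothetical example data; real IDs and texts would come from a national source.
index = {
    "act-1": digest("Article 1. Old wording."),
    "act-2": digest("Article 1. Unchanged."),
}
current = {
    "act-2": "Article 1. Unchanged.",
    "act-1": "Article 1. New wording.",
    "act-3": "Article 1. A newly published act.",
}
changes = detect_changes(index, current)
print(changes)
```

Only the documents reported as added or modified would then need to be preprocessed again and merged into the corresponding national corpus.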


MARCELL legislative subcorpus