Revision 1 as of 2020-03-10 13:42:42



CURLICAT project

Project factsheet

English name:

Curated Multilingual Language Resources for CEF AT

Polish name:

Wielojęzyczne zasoby językowe na potrzeby tłumaczenia maszynowego

Project type:

A CEF-TC-2019-1 – Automated Translation grant

Action number:


Grant agreement number:



Project duration:

1 June 2020 – 31 May 2022

Principal investigator:

Tamás Váradi

Polish participation:

Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences

Polish PI:

Maciej Ogrodniczuk

Project summary

The present proposal focuses on the body of national legislation (laws, decrees, regulations) in the seven countries making up the consortium: Bulgaria, Croatia, Hungary, Poland, Romania, Slovakia and Slovenia. We start from the premise that national legislative texts are currently not automatically available to CEF.AT, and that existing MT systems could be improved if they had access to them. Furthermore, advances in MT technology emphasize the need for clean, validated domain-specific datasets. Groundbreaking MT research results reported at this year’s WMT competition employed back-translated training data, which underscores the importance of domain-specific monolingual corpora for MT technology.

The envisioned project aims to process two resources available in all seven languages concerned: the multilingual ontology-based thesaurus EUROVOC on the one hand, and the corpora of all national legislation in the respective languages on the other. As a result, the project will produce the following deliverables:

1) Seven large-scale, suitably preprocessed (tokenized and morphologically tagged) monolingual corpora of national legislation documents, classified by EUROVOC topics/descriptors and annotated with the EUROVOC and IATE terms identified in them.

2) A comparable corpus of the seven languages, aligned at the level of topic domains identified by EUROVOC descriptors.

3) A Croatian-English parallel corpus consisting of ca. 1800 legislative documents.
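To make the term-enrichment step in deliverable 1 more concrete, here is a minimal sketch of how thesaurus terms might be marked in tokenized text using greedy longest-match lookup. It illustrates the general idea only and does not describe the project's actual pipeline. The function name `annotate_terms`, the `max_len` parameter, the sample term list and the identifiers `term-A` and `term-B` are all hypothetical placeholders. A real system for inflected languages would presumably match on lemmas from the morphological tagging rather than on lowercased surface forms.

```python
def annotate_terms(tokens, term_ids, max_len=5):
    """Greedy longest-match lookup of (multiword) terms in a token list.

    Returns a list of (start, end, term_id) spans, with `end` exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate first so multiword terms win over their parts.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in term_ids:
                match = (i, i + n, term_ids[key])
                break
        if match:
            spans.append(match)
            i = match[1]  # continue after the matched term
        else:
            i += 1
    return spans


# Hypothetical mini term list; the identifiers are placeholders,
# not real EUROVOC or IATE IDs.
TERMS = {
    ("public", "procurement"): "term-A",
    ("procurement",): "term-B",
}

tokens = "The act regulates public procurement and procurement audits".split()
spans = annotate_terms(tokens, TERMS)
for start, end, term_id in spans:
    print(term_id, " ".join(tokens[start:end]))
```

Matching the longest candidate first means that "public procurement" is annotated as a single term and is not split into its component "procurement".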

The envisaged deliverables are all directly useful for CEF.AT as training material for MT systems. As the partners of the consortium represent less-resourced languages, such resources are in short supply for these languages. In addition to the expected overall improvement of MT systems in the seven languages concerned, the project will have an impact on both the e-Justice and the Online Dispute Resolution Digital Service Infrastructures (DSIs), as the resources focus on national legislation, which is of direct relevance to both DSIs. Sustainability of the results will be achieved by creating a workflow that automatically monitors changes in the national legislation and updates the respective national corpora.