CURLICAT

CURLICAT project

Project factsheet

English name:

Curated Multilingual Language Resources for CEF AT

Polish name:

Wielojęzyczne zasoby językowe na potrzeby tłumaczenia maszynowego

Project type:

A CEF-TC-2019-1 – Automated Translation grant

Action number:

2019-EU-IA-0034

Grant agreement number:

INEA/CEF/ICT/A2019/1926831

Duration:

1 June 2020 – 31 May 2022, extended to 30 November 2022

Principal investigator:

Tamás Váradi

Project website:

https://curlicat.eu/

Polish participation:

Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences

Polish PI:

Maciej Ogrodniczuk

Project summary

The project aims to compile curated datasets in the seven languages of the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains relevant to European Digital Service Infrastructures (DSIs), with a view to enhancing the eTranslation automated translation system. The data will come primarily from the national corpora of these languages and will cover domains relevant to CEF DSIs, such as eHealth, Europeana or eGovernment. The corpus will contain at least 14 million sentences (an estimated 185 million words) from domains including culture, education, health and science.

The data will be technically and legally cleaned. For legal reasons it will be anonymised by replacing named entities with other entities of the same kind and of similar phonological and graphemic structure. Terms from the IATE database will be identified and annotated so that the language models built from these corpora cover not only single words but also multi-word expressions. Because the quality of the language model is an important factor in today’s neural machine translation technology, the seven envisaged corpora, although monolingual datasets in themselves, will improve the quality of eTranslation through the enhanced language models built with them.

Sustainability beyond the duration of the Action will be addressed through a series of seven workshops aimed at representatives of the digital publishing industry, textual repositories, digital libraries, etc., where we will present our innovative solutions for preserving IPR while sharing e-texts for designated tasks. We see these participants as key stakeholders in building the future culture of e-text sharing, which we expect to help the development of the EU digital economy.

Citation

List of publications

Tamás Váradi, Bence Nyéki, Svetla Koeva, Marko Tadić, Vanja Štefanec, Maciej Ogrodniczuk, Bartłomiej Nitoń, Piotr Pęzik, Verginica Barbu Mititelu, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, and Andraž Repar. Introducing the CURLICAT corpora: Seven-language domain specific annotated corpora from curated sources. In Proceedings of the Language Resources and Evaluation Conference, pages 100–108, Marseille, France, 2022. European Language Resources Association.