CURLICAT project

Project factsheet

English name:	Curated Multilingual Language Resources for CEF AT
Polish name:	Wielojęzyczne zasoby językowe na potrzeby tłumaczenia maszynowego
Project type:	A CEF-TC-2019-1 – Automated Translation grant
Action number:	2019-EU-IA-0034
Grant agreement number:	INEA/CEF/ICT/A2019/1926831
Duration:	1 June 2020 – 31 May 2022, extended to 30 November 2022
Principal investigator:	Tamás Váradi
Project website:	https://curlicat.eu/
Polish participation:	Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences
Polish PI:	Maciej Ogrodniczuk

Project summary

The aim of the project is to compile curated datasets in the seven languages of the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains of relevance to European Digital Service Infrastructures (DSIs), with a view to enhancing the eTranslation automated translation system. The primary source of data will be the national corpora of these languages, covering domains relevant for CEF DSIs such as eHealth, Europeana or eGovernment. The corpus will contain at least 14 million sentences (an estimated 185 million words) from domains including culture, education, health and science.

The data will be technically and legally cleaned. For legal reasons it will be anonymised by replacing named entities with other entities of the same kind and of similar phonological and graphemic structure. Terms from the IATE database will be identified and annotated so that the language models built with the help of these corpora will cover not only single words but also multi-word expressions. Since the quality of the language model is an important aspect of today’s neural machine translation technology, the seven envisaged language corpora, although monolingual datasets in themselves, will improve the quality of eTranslation through the enhanced language models built with them.

Sustainability beyond the duration of the Action will be addressed through a series of seven workshops aimed at representatives of the digital publishing industry, textual repositories, digital libraries, etc., where we will present our innovative solutions for preserving IPR while sharing e-texts for designated tasks. We see these participants as key stakeholders in building the future culture of e-text sharing, which we expect to help the development of the EU digital economy.

Citation

List of publications

Tamás Váradi, Bence Nyéki, Svetla Koeva, Marko Tadić, Vanja Štefanec, Maciej Ogrodniczuk, Bartłomiej Nitoń, Piotr Pęzik, Verginica Barbu Mititelu, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, and Andraž Repar. Introducing the CURLICAT corpora: Seven-language domain specific annotated corpora from curated sources. In Proceedings of the Language Resources and Evaluation Conference, pages 100–108, Marseille, France, 2022. European Language Resources Association.
