CURLICAT


CURLICAT project

Project factsheet

English name: Curated Multilingual Language Resources for CEF AT
Polish name: Wielojęzyczne zasoby językowe na potrzeby tłumaczenia maszynowego
Project type: A CEF-TC-2019-1 – Automated Translation grant
Action number: 2019-EU-IA-0034
Grant agreement number: INEA/CEF/ICT/A2019/1926831
Duration: 1 June 2020 – 31 May 2022, extended to 30 November 2022
Principal investigator: Tamás Váradi
Project website:
Polish participation: Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences
Polish PI: Maciej Ogrodniczuk

Project summary

The aim of the project is to compile curated datasets in the seven languages of the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains relevant to European Digital Service Infrastructures (DSIs), with a view to enhancing the eTranslation automated translation system. The data will come primarily from the national corpora of these languages and will cover domains relevant to CEF DSIs, such as eHealth, Europeana or eGovernment. The corpus will contain at least 14 million sentences (an estimated 185 million words) from domains including culture, education, health and science.

The data will be technically and legally cleaned. For legal reasons it will be anonymised by replacing named entities with others of the same kind and of similar phonological and graphemic structure. Terms from the IATE database will be identified and annotated, so that the language models built from these corpora will cover not only single words but also multi-word expressions. Because the quality of the language model is an important factor in today's neural machine translation technology, the seven envisaged corpora, although monolingual in themselves, will improve the quality of eTranslation through the enhanced language models built with them.

Sustainability beyond the duration of the Action will be addressed through a series of seven workshops aimed at representatives of the digital publishing industry, textual repositories, digital libraries and similar organisations, where we will present our innovative solutions for preserving IPR while sharing e-texts for designated tasks. We see these participants as key stakeholders in building a future culture of e-text sharing, which we expect to support the development of the EU digital economy.

Citation