<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article  PUBLIC '-//OASIS//DTD DocBook XML V4.4//EN'  'http://www.docbook.org/xml/4.4/docbookx.dtd'><article><articleinfo><title>CURLICAT</title><revhistory><revision><revnumber>10</revnumber><date>2023-04-21 11:50:36</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>9</revnumber><date>2023-04-21 11:50:15</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>8</revnumber><date>2022-10-26 15:40:21</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>7</revnumber><date>2021-08-23 12:42:29</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>6</revnumber><date>2021-08-23 12:41:05</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>5</revnumber><date>2021-08-23 12:40:57</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>4</revnumber><date>2020-10-07 09:42:34</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>3</revnumber><date>2020-10-07 09:42:19</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>2</revnumber><date>2020-03-10 13:46:03</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision><revision><revnumber>1</revnumber><date>2020-03-10 13:42:42</date><authorinitials>MaciejOgrodniczuk</authorinitials></revision></revhistory></articleinfo><section><title>CURLICAT project</title><section><title>Project factsheet</title><informaltable><tgroup cols="2"><colspec colname="col_0"/><colspec colname="col_1"/><tbody><row rowsep="1"><entry colsep="1" rowsep="1"><para> English name:           </para></entry><entry colsep="1" rowsep="1"><para> Curated Multilingual Language Resources for CEF AT </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Polish name:            </para></entry><entry colsep="1" 
rowsep="1"><para> Wielojęzyczne zasoby językowe na potrzeby tłumaczenia maszynowego </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Project type:           </para></entry><entry colsep="1" rowsep="1"><para> A CEF-TC-2019-1 – Automated Translation grant </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Action number:          </para></entry><entry colsep="1" rowsep="1"><para> 2019-EU-IA-0034 </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Grant agreement number: </para></entry><entry colsep="1" rowsep="1"><para> INEA/CEF/ICT/A2019/1926831 </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Duration:               </para></entry><entry colsep="1" rowsep="1"><para> 1 June 2020 – 31 May 2022, extended to 30 November 2022 </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Principal investigator: </para></entry><entry colsep="1" rowsep="1"><para> Tamás Váradi </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Project website:        </para></entry><entry colsep="1" rowsep="1"><para> <ulink url="https://curlicat.eu/"/> </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Polish participation:   </para></entry><entry colsep="1" rowsep="1"><para> <ulink url="http://zil.ipipan.waw.pl/">Linguistic Engineering Group</ulink>, Institute of Computer Science, Polish Academy of Sciences </para></entry></row><row rowsep="1"><entry colsep="1" rowsep="1"><para> Polish PI:              </para></entry><entry colsep="1" rowsep="1"><para> <ulink url="http://zil.ipipan.waw.pl/MaciejOgrodniczuk">Maciej Ogrodniczuk</ulink> </para></entry></row></tbody></tgroup></informaltable></section><section><title>Project summary</title><para>The aim of the project is to compile curated datasets in seven languages of the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains of relevance to European Digital 
Service Infrastructures (DSIs) with a view to enhancing the eTranslation automated translation system. The data will come primarily from the national corpora of the above languages and will cover domains relevant to CEF DSIs, such as eHealth, Europeana or eGovernment. The corpus will contain at least 14 million sentences (an estimated 185 million words) from domains including culture, education, health and science. </para><para>The data will be technically and legally cleaned. For legal reasons it will be anonymised by replacing named entities with other entities of the same kind and of similar phonological and graphemic structure. Terms from the IATE database will be identified and annotated so that the language models built from these corpora cover not only single words but also multi-word expressions. Since the quality of the language model is an important aspect of today’s neural machine translation technology, the seven envisaged language corpora, although monolingual datasets in themselves, will improve the quality of eTranslation through the enhanced language models built with them. </para><para>Sustainability beyond the duration of the Action will be addressed through a series of seven workshops aimed at representatives of the digital publishing industry, textual repositories, digital libraries, etc., where we will present our innovative solutions for preserving intellectual property rights (IPR) while sharing e-texts for designated tasks. We see these participants as key stakeholders in building the future culture of e-text sharing, which we expect to help the development of the EU digital economy. </para></section><section><title>Citation</title></section></section></article>