Diff for "CURLICAT"

Differences between revisions 1 and 10 (spanning 9 versions)
Revision 1 as of 2020-03-10 13:42:42
Size: 3043
Comment:
Revision 10 as of 2023-04-21 11:50:36
Size: 2827
Comment:
Line 11: Line 11:
|| Duration: || 1 June 2020 – 31 May 2022 || || Duration: || 1 June 2020 – 31 May 2022, extended to 30 November 2022 ||
Line 13: Line 13:
|| Project website: || [[https://curlicat.eu/]] ||
Line 14: Line 15:
|| Polish PI: || Maciej Ogrodniczuk || || Polish PI: || [[http://zil.ipipan.waw.pl/MaciejOgrodniczuk|Maciej Ogrodniczuk]] ||
Line 18: Line 19:
The present proposal focuses on the body of national legislation (laws, decrees, regulations) in the seven countries making up the consortium: Bulgaria, Croatia, Hungary, Poland, Romania, Slovakia and Slovenia. We start from the premise that national legislation texts are currently not automatically available to CEF.AT and that present MT systems could be improved if they had access to them. Furthermore, advances in MT technology emphasize the need for clean, validated domain-specific datasets. Groundbreaking MT research results reported at this year’s WMT Competition employed back-translated training data, which underscores the relevance and importance of domain-specific monolingual corpora for MT technology.
The aim of the project is to compile curated datasets in the seven languages of the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains of relevance to European Digital Service Infrastructures (DSIs) with a view to enhancing the eTranslation automated translation system. The data will come primarily from national corpora of these languages and will cover domains relevant to CEF DSIs, such as eHealth, Europeana or eGovernment. The corpus will contain at least 14 million sentences (an estimated 185 million words) from domains including culture, education, health and science.
Line 20: Line 22:
The envisioned project aims to process two resources available in all seven languages concerned, i.e. the multilingual ontology-based thesaurus EUROVOC on the one hand and the corpora of all national legislation in the respective languages on the other. As a result, the project will produce the following deliverables:
1) Seven large-scale, suitably preprocessed (tokenized and morphologically tagged) monolingual corpora of national legislation documents, classified into EUROVOC topics/descriptors and annotated with the EUROVOC and IATE terms identified in them.
2) A comparable corpus of the seven languages, aligned at the level of topic domains identified by EUROVOC descriptors.
3) A Croatian-English parallel corpus consisting of ca. 1800 legislative documents.
The data will be technically and legally cleaned. For legal reasons it will be anonymised by replacing named entities with others of the same kind and of similar phonological and graphemic structure. Terms from the IATE database will be identified and annotated so that the language models built with the help of these corpora will cover not only single words but also multi-word expressions. Since the quality of the language model is an important aspect of today’s neural machine translation technology, the envisaged seven language corpora, although monolingual datasets in themselves, will improve the quality of eTranslation through the enhanced language models built with them.
Line 25: Line 24:
The envisaged deliverables are all directly useful for CEF.AT as training material for MT systems. As the partners of the consortium represent less-resourced languages, such required resources are in short supply for these languages.
In addition to the expected overall improvement of MT systems in the seven languages concerned, the project will have an impact on both the e-justice and the Online Dispute Resolution Digital Service Infrastructures, as the resources focus on national legislation, which is of direct relevance to both DSIs. Sustainability of the results will be achieved by creating a workflow that will automatically monitor changes in national legislation and update the respective national corpora.
Sustainability beyond the duration of the Action will be addressed through a series of 7 workshops aimed at representatives of the digital publishing industry, textual repositories, digital libraries etc., where we will present our innovative solutions for preserving the IPR while sharing e-texts for designated tasks. We see these participants as key stakeholders in building the future culture of e-text sharing, which we expect to help develop the EU digital economy.

== Citation ==

<<BibMate(key, "var:etal:22:lrec", omitYears=true)>>

CURLICAT project

Project factsheet

|| English name: || Curated Multilingual Language Resources for CEF AT ||
|| Polish name: || Wielojęzyczne zasoby językowe na potrzeby tłumaczenia maszynowego ||
|| Project type: || A CEF-TC-2019-1 – Automated Translation grant ||
|| Action number: || 2019-EU-IA-0034 ||
|| Grant agreement number: || INEA/CEF/ICT/A2019/1926831 ||
|| Duration: || 1 June 2020 – 31 May 2022, extended to 30 November 2022 ||
|| Principal investigator: || Tamás Váradi ||
|| Project website: || https://curlicat.eu/ ||
|| Polish participation: || Linguistic Engineering Group, Institute of Computer Science, Polish Academy of Sciences ||
|| Polish PI: || Maciej Ogrodniczuk ||

Project summary

The aim of the project is to compile curated datasets in the seven languages of the consortium (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) in domains of relevance to European Digital Service Infrastructures (DSIs) with a view to enhancing the eTranslation automated translation system. The data will come primarily from national corpora of these languages and will cover domains relevant to CEF DSIs, such as eHealth, Europeana or eGovernment. The corpus will contain at least 14 million sentences (an estimated 185 million words) from domains including culture, education, health and science.

The data will be technically and legally cleaned. For legal reasons it will be anonymised by replacing named entities with others of the same kind and of similar phonological and graphemic structure. Terms from the IATE database will be identified and annotated so that the language models built with the help of these corpora will cover not only single words but also multi-word expressions. Since the quality of the language model is an important aspect of today’s neural machine translation technology, the envisaged seven language corpora, although monolingual datasets in themselves, will improve the quality of eTranslation through the enhanced language models built with them.

Sustainability beyond the duration of the Action will be addressed through a series of 7 workshops aimed at representatives of the digital publishing industry, textual repositories, digital libraries etc., where we will present our innovative solutions for preserving the IPR while sharing e-texts for designated tasks. We see these participants as key stakeholders in building the future culture of e-text sharing, which we expect to help develop the EU digital economy.

Citation

List of publications

Tamás Váradi, Bence Nyéki, Svetla Koeva, Marko Tadić, Vanja Štefanec, Maciej Ogrodniczuk, Bartłomiej Nitoń, Piotr Pęzik, Verginica Barbu Mititelu, Elena Irimia, Maria Mitrofan, Vasile Păiș, Dan Tufiș, Radovan Garabík, Simon Krek, and Andraž Repar. Introducing the CURLICAT corpora: Seven-language domain specific annotated corpora from curated sources. In Proceedings of the Language Resources and Evaluation Conference, pages 100–108, Marseille, France, 2022. European Language Resources Association.