KORBA project

Project factsheet

English name:

Electronic corpus of 17th and 18th century Polish texts

Polish name:

Elektroniczny korpus tekstów polskich z XVII i XVIII w. (do roku 1772)

Project type:

A Ministry of Science and Higher Education National Programme for the Development of Humanities grant 0036/NPRH2/H11/81/2012

Duration:

1 May 2013 ‒ 30 April 2018

Project Web page:

http://wiki.nlp.ipipan.waw.pl/korba (authorization required)

Principal investigator:

Włodzimierz Gruszczyński

Partners

Project description

The aim of the project is to create a corpus of 17th and 18th century Polish texts (up to 1772) and tools for processing it (searching, filtering, summarizing statistical data, etc.). The entire corpus will be annotated for text structure and language (all foreign elements, e.g. Latin intrusions, will be distinguished), and a portion of it will also be morphologically annotated. Since the corpus will mark another stage in the development of the National Corpus of Polish (Narodowy Korpus Języka Polskiego, NKJP, see: http://nkjp.pl/), we intend to cooperate with the institution and the people behind NKJP, i.e. the Linguistic Engineering Group of the Institute of Computer Science, Polish Academy of Sciences.

The existing corpus of contemporary (20th century) Polish texts was created by a consortium consisting of: the Institute of Computer Science, Polish Academy of Sciences (project coordinator); the Institute of Polish Studies, Polish Academy of Sciences; Polish Scientific Publishers PWN; and the Department of Computational and Corpus Linguistics of the University of Łódź. A corpus of old (pre-1500) Polish texts also exists, but it remains unannotated and has no search interface. The creation of a 17th and 18th century Polish corpus is an important step towards extending the National Corpus of Polish to written texts from all eras. The project is crucial for research on the history of the Polish language: in particular, it will accelerate ongoing work on the dictionary of 17th and early 18th century Polish, and it should also prove useful for other kinds of historical linguistic study (e.g. the evolution of Polish grammar, regional and social variation in historical Polish), as well as for literary studies and editorial work. An additional goal is to initiate a series of publications of (non-literary) texts from the selected period.

Since existing tools designed for the corpus of contemporary Polish will have to be adapted to the historical corpus, the project also contributes to the field of linguistic engineering in Poland as a whole. The corpus will be:

  1. open for use for a variety of purposes;
  2. considerably large (aiming for around 12 million tokens);
  3. structurally annotated, i.e. every concordance will be provided with source information, including page number and in-text status (marginalia, notes, errata, etc.);
  4. lexically annotated in its entirety (the process is expected to be partially automated: individual tokens will be automatically associated with the appropriate lexemes);
  5. morphosyntactically annotated to a certain extent (initially aiming for 0.5 million text segments and expanding coverage over time);
  6. freely available online (open access), as in the case of NKJP;
  7. equipped with tools for finding certain linguistic elements or establishing their frequency in text, allowing for searches restricted to a given time period, author, publisher, geographic area, prominence of foreign quotations in text, etc.
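
As a rough illustration of the kind of query described in points 3, 4 and 7, the sketch below counts lexeme frequencies over tokens that carry source metadata, restricted to a time period and optionally to one author. It is not the project's actual tooling: the `Token` record, its field names, the `lemma_frequency` function and the sample data are all hypothetical and chosen only to mirror the annotation layers listed above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Token:
    """Hypothetical annotated token; field names are assumptions."""
    form: str      # surface form as printed in the source
    lemma: str     # lexeme assigned by lexical annotation
    year: int      # publication year of the source text
    author: str    # author of the source text
    page: int      # page number in the source
    status: str    # in-text status, e.g. "main", "marginalia", "note", "errata"
    foreign: bool  # True for foreign elements such as Latin intrusions


def lemma_frequency(
    tokens: Iterable[Token],
    start_year: int,
    end_year: int,
    author: Optional[str] = None,
    include_foreign: bool = False,
) -> Counter:
    """Count lemmas among tokens matching the metadata filters."""
    counts: Counter = Counter()
    for token in tokens:
        if not (start_year <= token.year <= end_year):
            continue  # outside the requested time period
        if author is not None and token.author != author:
            continue  # restricted to a single author
        if token.foreign and not include_foreign:
            continue  # skip foreign elements unless requested
        counts[token.lemma] += 1
    return counts


# Made-up sample data, for illustration only.
sample = [
    Token("iest", "być", 1650, "Author A", 12, "main", False),
    Token("był", "być", 1651, "Author A", 13, "main", False),
    Token("et", "et", 1650, "Author A", 12, "marginalia", True),
    Token("iest", "być", 1780, "Author B", 3, "main", False),
]

polish_only = lemma_frequency(sample, 1601, 1772)
with_foreign = lemma_frequency(sample, 1601, 1772, include_foreign=True)
```

Keeping the source information (page, in-text status) on each token is what would let a concordance line point back to its exact place in the original text; a real search interface would likely rely on an indexed query engine rather than a linear scan like this one.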