bpng — Poliqarp corpus builder
bpng { -h | --help | -v | --version }
bpng [option...] base-name [xml-root-dir...]
bpng builds a binary corpus in the Poliqarp format from XML sources in the following formats:
IPI PAN variant of the XCES format;
NKJP variant of the TEI format.
-h, --help
Display help and exit.
-v, --version
Output version information and exit.
-c, --continue
Continue a partially-successful build or add more files to the existing corpus.
Each document consists of two files: a header file (typically named header.xml) and
a text with morphosyntactic annotations (typically named morph.xml or
ann_morphosyntax.xml). Files of separate documents need to reside in separate
directories.
Gzip-compressed files (with a .gz suffix) will be decompressed on the fly.
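The document layout described above can be illustrated with a short Python sketch. It is not part of bpng: the helper names open_maybe_gzip and find_document_file are hypothetical, and the assumption that a compressed file is named by appending .gz to the plain name (for example header.xml.gz) is only inferred from the text.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(path):
    """Open a file for binary reading, decompressing it if it ends in .gz."""
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")


def find_document_file(directory, candidates):
    """Return the first existing candidate (plain or .gz) in directory, else None."""
    for name in candidates:
        for suffix in ("", ".gz"):
            path = os.path.join(directory, name + suffix)
            if os.path.isfile(path):
                return path
    return None


if __name__ == "__main__":
    # Build a throwaway document directory holding a compressed header.
    with tempfile.TemporaryDirectory() as doc_dir:
        with gzip.open(os.path.join(doc_dir, "header.xml.gz"), "wb") as f:
            f.write(b"<header/>")
        path = find_document_file(doc_dir, ["header.xml"])
        with open_maybe_gzip(path) as f:
            print(f.read())
```

The order of candidates decides which name wins when a directory contains more than one of them; whether bpng resolves such a conflict the same way is not stated here.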
Text with morphosyntactic annotations should either:
follow the IPI PAN variant of the XCES format; or
follow the NKJP variant of the TEI format.
The base-name.bp.conf file is used to customize the corpus build
process. The file consists of sections, each led by a ‘[section]’
header and followed by ‘keyword = setting’ entries.
Empty lines and lines starting with # are comments.
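The file syntax can be sketched as a small Python reader. This is an illustration under stated assumptions, not bpng's own parser: the function name parse_bp_conf is hypothetical, leading whitespace before # is tolerated (the text only mentions lines starting with #), and sections are kept as an ordered list so that a section name or keyword may occur more than once.

```python
def parse_bp_conf(text):
    """Parse text into a list of (section, [(keyword, setting), ...]) pairs."""
    sections = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comments are skipped
        if line.startswith("[") and line.endswith("]"):
            # A '[section]' header starts a new section.
            sections.append((line[1:-1].strip(), []))
        elif "=" in line and sections:
            # A 'keyword = setting' entry belongs to the latest section.
            keyword, setting = line.split("=", 1)
            sections[-1][1].append((keyword.strip(), setting.strip()))
        else:
            raise ValueError(f"cannot parse line: {raw!r}")
    return sections


# Hypothetical file content, made up for this illustration.
EXAMPLE = """\
# corpus settings
[locale]
locale = pl_PL

[filenames]
header = header.xml
"""

if __name__ == "__main__":
    for section, entries in parse_bp_conf(EXAMPLE):
        print(section, entries)
```

A list of pairs is used instead of a dictionary because a dictionary would silently merge repeated sections or keywords.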
[locale]
locale = locale-name
Specifies the corpus language and possibly other regional preferences. Currently, only string collation is affected by this setting.
A locale name is typically of the form language_territory.codeset,
where language is an ISO 639 language code, territory is an ISO 3166
country code, and codeset is a character set or encoding identifier like
ISO-8859-1 or UTF-8.
On Unix systems, you can use the ‘locale -a’ command to list all the
available locales.
Windows systems use a different locale naming convention. However, bpng is able to translate the usual locale name forms to the Windows-specific ones.
On Unix systems, the selected locale is required to support the UTF-8 encoding.
The ‘.UTF-8’ suffix may be omitted from the locale name.
This section and this entry are required.
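The structure of a locale name can be made concrete with a few lines of Python. The helper names split_locale_name and with_utf8 are hypothetical, and the sketch only mirrors the naming scheme described above: it ignores locale modifiers such as ‘@euro’ and does not claim to reproduce how bpng itself normalizes names.

```python
def split_locale_name(name):
    """Split language_territory.codeset into its parts; missing parts are None."""
    base, _, codeset = name.partition(".")
    language, _, territory = base.partition("_")
    return language, territory or None, codeset or None


def with_utf8(name):
    """Append the .UTF-8 suffix when the locale name carries no codeset."""
    return name if "." in name else name + ".UTF-8"


if __name__ == "__main__":
    print(split_locale_name("pl_PL.UTF-8"))
    print(with_utf8("pl_PL"))
```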
[filenames]
Each entry is in the form ‘file-type = file-names’, where
file-names is a whitespace-separated list of file names.
header = file-names
Specifies the possible file names of header files.
The default is ‘header.xml’.
morphosyntax = file-names
Specifies the possible file names of texts with morphosyntactic annotations.
The default is: ‘ann_morphosyntax.xml morph.xml’.
[xmlns]
Sets up namespaces for XPath 1.0 expressions.
prefix = uri
Bind prefix to the namespace uri.
By default:
the ‘tei’ prefix is bound to http://www.tei-c.org/ns/1.0;
the ‘nkjp’ prefix is bound to http://www.nkjp.pl/ns/1.0;
the ‘poliqarp’ prefix is bound to
http://poliqarp.sourceforge.net/ns/2009.
Note that it is not possible to declare a default (i.e., a prefix-less) namespace for XPath 1.0.
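The practical consequence of the last note is that elements placed in a default namespace by the document must still be addressed with a prefix in the expression. The sketch below shows this with Python's xml.etree.ElementTree, which supports only a subset of XPath 1.0 and is used here purely for illustration; which XPath engine bpng uses is not stated. The sample header and the function name find_texts are hypothetical.

```python
import xml.etree.ElementTree as ET

# The default prefix bindings listed above.
DEFAULT_NAMESPACES = {
    "tei": "http://www.tei-c.org/ns/1.0",
    "nkjp": "http://www.nkjp.pl/ns/1.0",
    "poliqarp": "http://poliqarp.sourceforge.net/ns/2009",
}

# Hypothetical header that declares TEI as its default namespace.
HEADER = """\
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc>
</teiHeader>
"""


def find_texts(xml_text, expression, namespaces=DEFAULT_NAMESPACES):
    """Return the text of every element matched by expression."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall(expression, namespaces)]


if __name__ == "__main__":
    # Without a prefix the name refers to no namespace and matches nothing.
    print(find_texts(HEADER, ".//title"))
    # With the 'tei' prefix the namespaced element is found.
    print(find_texts(HEADER, ".//tei:title"))
```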
[meta]
Multiple [meta] sections are allowed. Each one describes a metadata key.
name = name
Specifies the name of the key.
This entry is required.
type = string
Specifies that any string is a possible value for the key. This is the default.
type = date
Specifies that dates are possible values for the key.
type = enum
Specifies that the set of possible values for the key is a fixed set of strings.
values = values
Specifies the set of possible values for the key.
multiple = true-or-false
Specifies if a document can have more than one value for the key. The default is
false.
required = true-or-false
Specifies if each document is required to have a value for the key. The default is
false.
path = xpath-expression
Specifies where to look up metadata values for the key in
the header file. xpath-expression is an
XPath 1.0 expression.
More than one path entry is allowed. For each document, a path is inspected only if no values were found along all previously defined paths.
At least one entry is required.
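The fallback rule for multiple path entries can be sketched in Python as follows. This is an illustration, not bpng's implementation: the function name lookup_values, the sample header, and the two expressions are hypothetical, and xml.etree.ElementTree (which implements only a subset of XPath 1.0) stands in for a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Prefix binding taken from the [xmlns] defaults.
NAMESPACES = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical header containing a title but no author.
HEADER = """\
<teiHeader xmlns="http://www.tei-c.org/ns/1.0">
  <fileDesc>
    <titleStmt><title>Sample title</title></titleStmt>
  </fileDesc>
</teiHeader>
"""


def lookup_values(root, paths, namespaces=NAMESPACES):
    """Return values from the first path that yields any, trying paths in order."""
    for path in paths:
        values = [
            element.text.strip()
            for element in root.findall(path, namespaces)
            if element.text and element.text.strip()
        ]
        if values:
            return values  # later paths are not inspected
    return []


if __name__ == "__main__":
    root = ET.fromstring(HEADER)
    # The first path matches nothing, so the second one is consulted.
    print(lookup_values(root, [".//tei:author", ".//tei:title"]))
```

An empty result corresponds to a document with no value for the key, which matters only when the key is marked as required.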
*.poliqarp.corpus.image
a sequence of segments
*.poliqarp.chunk.image
a sequence of document ranges
*.poliqarp.subchunk.image, *.poliqarp.subchunk.offset, *.poliqarp.subchunk.item.*
a dictionary of possible subdocument types (e.g., paragraphs, sentences) and sequences of subdocument ranges
*.poliqarp.orth.image, *.poliqarp.orth.index.alpha, *.poliqarp.orth.index.atergo, *.poliqarp.orth.offset
a dictionary of possible orthographic forms
*.poliqarp.tag.image, *.poliqarp.tag.offset
a dictionary of possible morphosyntactic tags
*.poliqarp.base1.image, *.poliqarp.base1.offset
a dictionary of possible disambiguated base forms
*.poliqarp.base2.image, *.poliqarp.base2.offset
a dictionary of possible ambiguous base forms
*.poliqarp.interp1.image, *.poliqarp.interp1.offset
a dictionary of possible disambiguated interpretations
*.poliqarp.interp2.image, *.poliqarp.interp2.offset
a dictionary of possible ambiguous interpretations
*.meta.cfg, *.poliqarp.meta-key.image, *.poliqarp.meta-key.offset
a dictionary of possible metadata keys
*.poliqarp.meta-value.image, *.poliqarp.meta-value.offset, *.poliqarp.meta.image
a dictionary of possible metadata key-value pairs and a sequence of key-value pairs