Name

bpng — Poliqarp corpus builder

Synopsis

bpng { -h | --help | -v | --version }

bpng [option...] base-name [xml-root-dir...]

Description

bpng builds a binary corpus in the Poliqarp format from XML sources in the following formats:

  • the IPI PAN variant of the XCES format;

  • the NKJP variant of the TEI format.

Options

-h, --help

Display help and exit.

-v, --version

Output version information and exit.

-c, --continue

Continue a partially successful build, or add more files to an existing corpus.

Source format

Document files

Each document consists of two files: a header file (typically named header.xml) and a text with morphosyntactic annotations (typically named morph.xml or ann_morphosyntax.xml). Files of separate documents need to reside in separate directories. Gzip-compressed files (with a .gz suffix) will be decompressed on the fly.
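As an illustration of this layout, the following shell sketch creates a hypothetical source tree with two documents, the second one gzip-compressed. The names corpus-src, doc-001, doc-002 and the base name demo are assumptions made for the example. The placeholder files are empty and would have to be replaced with real XML before a build. The sketch also assumes that xml-root-dir is the directory whose subdirectories hold the documents.

```shell
# Hypothetical source tree: one directory per document under corpus-src/.
mkdir -p corpus-src/doc-001 corpus-src/doc-002

# Empty placeholder files; real files must hold XML in a supported format.
: > corpus-src/doc-001/header.xml
: > corpus-src/doc-001/ann_morphosyntax.xml
: > corpus-src/doc-002/header.xml
: > corpus-src/doc-002/ann_morphosyntax.xml

# Compress the second document's files; they now carry the .gz suffix.
gzip -f corpus-src/doc-002/header.xml corpus-src/doc-002/ann_morphosyntax.xml

# Show the resulting layout.
find corpus-src -type f | sort

# With real data in place, the build would be started as (per the synopsis):
#   bpng demo corpus-src
```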

Morphosyntax

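Text with morphosyntactic annotations should either:

  • conform to the IPI PAN variant of the XCES format; or

  • conform to the NKJP variant of the TEI format.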

Header

No particular header format is required. The way header information is converted to the binary format can be customized in the configuration file.

Configuration file

The file base-name.bp.conf customizes the corpus build process. It consists of sections, each introduced by a ‘[section]’ header and followed by ‘keyword = setting’ entries. Empty lines are ignored, and lines starting with # are comments.
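A minimal sketch of this syntax, written from the shell: it creates a configuration file for a hypothetical corpus whose base name is demo. The locale pl_PL is likewise only an example value.

```shell
# Create demo.bp.conf ("demo" is a hypothetical base name).
cat > demo.bp.conf <<'EOF'
# Lines starting with '#' are comments.

[locale]
locale = pl_PL
EOF
```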

[locale]

locale = locale-name

Specifies the corpus language and possibly other regional preferences. Currently, only string collation is affected by this setting.

A locale name is typically of the form language_territory.codeset where language is an ISO 639 language code, territory is an ISO 3166 country code, and codeset is a character set or encoding identifier like ISO-8859-1 or UTF-8.

On Unix systems, you can use the ‘locale -a’ command to list all the available locales.

Windows systems use a different locale naming convention. However, bpng is able to translate the usual locale name forms to the Windows-specific ones.

On Unix systems, the selected locale is required to support the UTF-8 encoding. The ‘.UTF-8’ suffix may be omitted from the locale name.

Both this section and this entry are required.
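Before choosing a locale name, it can help to check that a UTF-8 variant of it is installed. The shell function below is a sketch built on ‘locale -a’. The name has_utf8_locale is made up for this example, and the pattern assumes that the system lists UTF-8 locales with a ‘.UTF-8’ or ‘.utf8’ suffix.

```shell
# Succeeds if a UTF-8 variant of the given locale is listed by 'locale -a'.
# The argument may be given with or without a codeset suffix.
has_utf8_locale() {
    base=${1%%.*}   # drop any ".codeset" suffix, e.g. pl_PL.UTF-8 -> pl_PL
    locale -a 2>/dev/null | grep -i -E "^${base}"'\.utf-?8$' >/dev/null
}

# pl_PL is only an example value.
if has_utf8_locale pl_PL; then
    echo "pl_PL.UTF-8 is available"
else
    echo "pl_PL.UTF-8 is not installed"
fi
```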

[filenames]

Each entry is in the form ‘file-type = file-names’, where file-names is a whitespace-separated list of file names.

header = file-names

Specifies the possible file names of header files. The default is ‘header.xml’.

morphosyntax = file-names

Specifies the possible file names of texts with morphosyntactic annotations. The default is ‘ann_morphosyntax.xml morph.xml’.
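For example, a corpus in which some headers are stored as metadata.xml rather than header.xml could be configured as sketched below. The file name metadata.xml and the base name demo are assumptions for illustration. The command appends to demo.bp.conf, which must also contain the required [locale] section.

```shell
# Append a [filenames] section to demo.bp.conf (hypothetical base name "demo").
cat >> demo.bp.conf <<'EOF'

[filenames]
# Accept either name for header files ("metadata.xml" is a made-up example).
header = header.xml metadata.xml
# Same as the default, spelled out.
morphosyntax = ann_morphosyntax.xml morph.xml
EOF
```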

[xmlns]

Sets up namespace prefixes for XPath 1.0 expressions.

prefix = uri

Bind prefix to the namespace uri.

By default:

  • the ‘tei’ prefix is bound to http://www.tei-c.org/ns/1.0;

  • the ‘nkjp’ prefix is bound to http://www.nkjp.pl/ns/1.0;

  • the ‘poliqarp’ prefix is bound to http://poliqarp.sourceforge.net/ns/2009.

Note that it is not possible to declare a default (i.e., a prefix-less) namespace for XPath 1.0.
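As a sketch, an extra prefix can be bound as follows. The example assumes, purely for illustration, that the header files contain Dublin Core elements; the prefix name dc and the base name demo are arbitrary choices. The command appends to demo.bp.conf, which must also contain the required [locale] section.

```shell
# Append an [xmlns] section to demo.bp.conf (hypothetical base name "demo").
cat >> demo.bp.conf <<'EOF'

[xmlns]
# Bind the prefix "dc" to the Dublin Core elements namespace.
dc = http://purl.org/dc/elements/1.1/
EOF
```

Elements that sit in a document's default namespace still have to be addressed through a bound prefix, such as ‘tei’, in every XPath expression.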

[meta]

Multiple [meta] sections are allowed. Each one describes a metadata key.

name = name

Specifies the name of the key.

This entry is required.

type = string

Specifies that any string is a possible value for the key. This is the default.

type = date

Specifies that dates are possible values for the key.

type = enum

Specifies that the set of possible values for the key is a fixed set of strings.

values = values

Specifies the set of possible values for the key.

multiple = true-or-false

Specifies if a document can have more than one value for the key. The default is false.

required = true-or-false

Specifies if each document is required to have a value for the key. The default is false.

path = xpath-expression

Specifies where to look up metadata values for the key in the header file. xpath-expression is an XPath 1.0 expression.

More than one entry is allowed. For each document, a path is inspected only if no values were found along all previously defined paths.

At least one entry is required.
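The entries above can be combined as in the following sketch, which declares two metadata keys. The key names, the XPath expressions, and the header structure they imply are assumptions made for illustration, as is the literal spelling ‘true’ for a true-or-false setting. The second path of the title key is consulted only for documents in which the first path yields no value. The command appends to demo.bp.conf (demo being a hypothetical base name), which must also contain the required [locale] section.

```shell
# Append two [meta] sections to demo.bp.conf.
cat >> demo.bp.conf <<'EOF'

[meta]
name = title
type = string
required = true
# Hypothetical paths; the second is a fallback for the first.
path = /tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title
path = /tei:teiHeader/tei:fileDesc/tei:sourceDesc/tei:bibl/tei:title

[meta]
name = published
type = date
# Hypothetical path.
path = /tei:teiHeader/tei:fileDesc/tei:publicationStmt/tei:date
EOF
```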

Poliqarp format

Corpus configuration file

*.cfg

Corpus definition file

*.cdf

The only supported binary format version is 2.

Binary files created by bpng

*.poliqarp.corpus.image

a sequence of segments

*.poliqarp.chunk.image

a sequence of document ranges

*.poliqarp.subchunk.image, *.poliqarp.subchunk.offset, *.poliqarp.subchunk.item.*

a dictionary of possible subdocument types (e.g., paragraphs, sentences) and sequences of subdocument ranges

*.poliqarp.orth.image, *.poliqarp.orth.index.alpha, *.poliqarp.orth.index.atergo, *.poliqarp.orth.offset

a dictionary of possible orthographic forms

*.poliqarp.tag.image, *.poliqarp.tag.offset

a dictionary of possible morphosyntactic tags

*.poliqarp.base1.image, *.poliqarp.base1.offset

a dictionary of possible disambiguated base forms

*.poliqarp.base2.image, *.poliqarp.base2.offset

a dictionary of possible ambiguous base forms

*.poliqarp.interp1.image, *.poliqarp.interp1.offset

a dictionary of possible disambiguated interpretations

*.poliqarp.interp2.image, *.poliqarp.interp2.offset

a dictionary of possible ambiguous interpretations

*.meta.cfg, *.poliqarp.meta-key.image, *.poliqarp.meta-key.offset

a dictionary of possible metadata keys

*.poliqarp.meta-value.image, *.poliqarp.meta-value.offset, *.poliqarp.meta.image

a dictionary of possible metadata key-value pairs and a sequence of key-value pairs

Binary files created by bpindexer

*.poliqarp.rindex.*

See bpindexer(1) for details.

Bugs, limitations, missing features

bzip2 on-the-fly decompression is not supported.
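One possible workaround, sketched here rather than taken from bpng itself, is to recompress bzip2 files as gzip before the build. The helper name recompress_bz2 and the directory name corpus-src are hypothetical. The sketch requires the bzip2 and gzip tools and assumes that file names contain no newlines.

```shell
# Recompress every *.bz2 file under the given directory as *.gz,
# keeping the original .bz2 files in place.
recompress_bz2() {
    find "$1" -type f -name '*.bz2' | while IFS= read -r f; do
        bzip2 -dc "$f" | gzip > "${f%.bz2}.gz"
    done
}

# Usage (corpus-src is a hypothetical source directory):
#   recompress_bz2 corpus-src
```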

See also

bp(1), the legacy corpus converter; bpindexer(1)