bpng — Poliqarp corpus builder
bpng { -h | --help | -v | --version }
bpng [option ...] base-name [xml-root-dir ...]
bpng builds a binary corpus in the Poliqarp format from XML sources in the following formats:
IPI PAN variant of the XCES format;
NKJP variant of the TEI format.
-h, --help
Display help and exit.
-v, --version
Output version information and exit.
-c, --continue
Continue a partially successful build or add more files to the existing corpus.
Each document consists of two files: a header file (typically named header.xml) and a text with morphosyntactic annotations (typically named morph.xml or ann_morphosyntax.xml). Files of separate documents need to reside in separate directories.
Gzip-compressed files (with a .gz suffix) will be decompressed on the fly.
Text with morphosyntactic annotations should either:
follow the IPI PAN variant of the XCES format;
or follow the NKJP variant of the TEI format.
The base-name.bp.conf file is used to customize the corpus build process. The file consists of sections, each led by a ‘[section]’ header and followed by ‘keyword = setting’ entries.
Empty lines are ignored, and lines starting with # are comments.
[locale]
locale = locale-name
Specifies the corpus language and possibly other regional preferences. Currently, only string collation is affected by this setting.
A locale name is typically of the form language_territory.codeset, where language is an ISO 639 language code, territory is an ISO 3166 country code, and codeset is a character set or encoding identifier like ISO-8859-1 or UTF-8.
On Unix systems, you can use the ‘locale -a’ command to list all the available locales.
Windows systems use a different locale naming convention. However, bpng is able to translate the usual locale name forms to the Windows-specific ones.
On Unix systems, the selected locale is required to support the UTF-8 encoding. It is allowed to omit the ‘.UTF-8’ suffix from the locale name.
This section and this entry are required.
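As an illustration of the [locale] section, the fragment below is a minimal sketch of a configuration file for a Polish corpus. The locale name pl_PL.UTF-8 is only an assumed example; any locale available on the build machine that supports UTF-8 should work in its place.

```ini
# Hypothetical example: select Polish collation rules.
# The ".UTF-8" suffix may be omitted, so "pl_PL" would be equivalent.
[locale]
locale = pl_PL.UTF-8
```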
[filenames]
Each entry is in the form ‘file-type = file-names’.
file-names is a whitespace-separated list of file names.
header = file-names
Specifies the possible file names of header files.
The default is ‘header.xml’.
morphosyntax = file-names
Specifies the possible file names of texts with morphosyntactic annotations.
The default is ‘ann_morphosyntax.xml morph.xml’.
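For reference, a [filenames] section that simply spells out the documented defaults might look like the sketch below. Writing it is presumably unnecessary unless the corpus uses other file names; it is shown only to make the entry syntax concrete.

```ini
# Equivalent to the documented defaults.
[filenames]
header = header.xml
morphosyntax = ann_morphosyntax.xml morph.xml
```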
[xmlns]
Set up namespaces for XPath 1.0 expressions.
prefix = uri
Bind prefix to the namespace uri.
By default:
the ‘tei’ prefix is bound to http://www.tei-c.org/ns/1.0;
the ‘nkjp’ prefix is bound to http://www.nkjp.pl/ns/1.0;
the ‘poliqarp’ prefix is bound to http://poliqarp.sourceforge.net/ns/2009.
Note that it is not possible to declare a default (i.e., a prefix-less) namespace for XPath 1.0.
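A sketch of an [xmlns] section that binds one additional prefix is shown below. The prefix ‘dc’ and the Dublin Core namespace URI are chosen purely for illustration; whether a given corpus header actually uses that namespace is an assumption.

```ini
# Hypothetical example: make the "dc" prefix usable in XPath expressions.
[xmlns]
dc = http://purl.org/dc/elements/1.1/
```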
[meta]
Multiple [meta] sections are allowed. Each one describes a metadata key.
name = name
Specifies the name of the key.
This entry is required.
type = string
Specifies that any string is a possible value for the key. This is the default.
type = date
Specifies that dates are possible values for the key.
type = enum
Specifies that the set of possible values for the key is a fixed set of strings.
values = values
Specifies the set of possible values for the key.
multiple = true-or-false
Specifies whether a document can have more than one value for the key. The default is false.
required = true-or-false
Specifies whether each document is required to have a value for the key. The default is false.
path = xpath-expression
Specifies where to look up metadata values for the key in the header file. xpath-expression is an XPath 1.0 expression.
More than one entry is allowed. For each document, a path is inspected only if no values were found along all previously defined paths.
At least one entry is required.
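The sketch below shows how two [meta] sections might be written for a corpus with TEI headers. The key names, the XPath expressions, and the element layout they assume (tei:titleStmt, tei:author, tei:bibl) are hypothetical and would need to be adapted to the actual header files. The literal ‘true’ is assumed to be the counterpart of the documented default ‘false’.

```ini
# Hypothetical key "title": a single string value per document.
[meta]
name = title
type = string
required = true
# The second path is inspected only if the first one yields no values.
path = //tei:titleStmt/tei:title
path = //tei:sourceDesc/tei:bibl/tei:title

# Hypothetical key "author": a document may have several authors.
[meta]
name = author
multiple = true
path = //tei:titleStmt/tei:author
```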
*.poliqarp.corpus.image
a sequence of segments
*.poliqarp.chunk.image
a sequence of document ranges
*.poliqarp.subchunk.image, *.poliqarp.subchunk.offset, *.poliqarp.subchunk.item.*
a dictionary of possible subdocument types (e.g., paragraphs, sentences) and sequences of subdocument ranges
*.poliqarp.orth.image, *.poliqarp.orth.index.alpha, *.poliqarp.orth.index.atergo, *.poliqarp.orth.offset
a dictionary of possible orthographic forms
*.poliqarp.tag.image, *.poliqarp.tag.offset
a dictionary of possible morphosyntactic tags
*.poliqarp.base1.image, *.poliqarp.base1.offset
a dictionary of possible disambiguated base forms
*.poliqarp.base2.image, *.poliqarp.base2.offset
a dictionary of possible ambiguous base forms
*.poliqarp.interp1.image, *.poliqarp.interp1.offset
a dictionary of possible disambiguated interpretations
*.poliqarp.interp2.image, *.poliqarp.interp2.offset
a dictionary of possible ambiguous interpretations
*.meta.cfg, *.poliqarp.meta-key.image, *.poliqarp.meta-key.offset
a dictionary of possible metadata keys
*.poliqarp.meta-value.image, *.poliqarp.meta-value.offset, *.poliqarp.meta.image
a dictionary of possible metadata key-value pairs and a sequence of key-value pairs