What is Poliqarp?
Installation from the sources
2.1. Prerequisities
2.2. Installation
2.3. After installation
Basic usage
3.1. The concordancer
3.2. The corpus builder
3.3. The corpus indexer
For developers
Poliqarp is a universal suite of tools for large corpora processing, consisting of a concordancer and some useful tools. The concordancer supports corpora morphosyntactically tagged with positional tagsets, deals with ambiguities in annotation, has a rich query language, is independent of the tagset and can process corpora of several hundred million words. It has mostly been used within the IPI PAN Corpus project (see http://korpus.pl), but can be used with other corpora.
To compile and run Poliqarp, you will need:
It is also possible to build the suite on Windows, using the MinGW toolchain that can be found at http://www.mingw.org.
The installation is standard. Just extract the sources somewhere and do:
$ ./configure $ make # or gmake, if that's what GNU make is called on your system $ make install # ditto
This will figure out which tools can be compiled, then build and install them in /usr/local. To select the installation directory, pass the --prefix option to configure. See configure --help for details.
Another option to consider for configure is --with-tcl-regex. This makes Poliqarp use the bundled regex library, which always supports UTF-8 regardless of the LC_COLLATE environment setting. In addition, passing the option CFLAGS='-O2 -DNDEBUG' to configure will cause the resulting Poliqarp to run a little bit faster.
The concordancer is split into two programs (client and server) that communicate over local TCP sockets. By default, they use port 4567 for communication, but this can be overridden by creating a server configuration file named $HOME/.poliqarpd/poliqarpd.conf. If this file contains the line:
port = <insert port number here>
then the specified port number will be used for connections. Also, make sure that your firewall allows local connections on the specified port.
Proper documentation for Poliqarp is not yet written. However, there is a description of the query language (see gui/help/english/ in this tarball) that should be enough to get started. Just run it by typing:
$ poliqarp
in a shell, then open a corpus and enter your query in the text field.
This tool (called bp, for build Poliqarp) is useful when you want to build your own corpora searchable with Poliqarp from XCES files. An example corpus in XCES format (the corpus of "Frequency Dictionary of Contemporary Polish") can be found at http://www.korpus.pl.
A source corpus consists of:
For examples of these files, see the corpus of "Frequency Dictionary...".
Please note that:
To build the corpus, type:
$ bp yourcorpus
in your corpus' toplevel directory. This will create a bunch of *.poliqarp.* files, which along with .cfg and .meta.cfg constitute the resulting binary corpus. bp respects the value of the LC_COLLATE environment variable, so, e.g., a typical way of building a Polish corpus from UTF-8 sources is:
$ LC_COLLATE=pl_PL.utf8 bp yourcorpus
The indexer produces corpora that are readily available for use with Poliqarp, but the searching will be inefficient. This utility performs a second (optional) stage of corpus building: it creates inverted indices that will help in speeding up most queries. It is highly recommended to build the indices unless your corpus is (say) 1 million segments or less — the difference will then be negligible.
The indexer by default produces indices that provide an acceptable trade-off between size and speed, so in most cases you should be fine with just doing:
$ bpindexer yourcorpus
but feel free to experiment (as always, the -h option to indexer will show you a list of available options).
Please feel free to play around with the sources, modify them and post patches at http://poliqarp.sf.net/bugs/. There is (as yet) no document providing an introduction to the source code, but in the TODO file you can find several dozen ideas for things that should be done.