Spejd 0.8.4
Copyright (C) IPI PAN, 2007-2010. All rights reserved.
Available under the terms of the GNU General Public License;
see the file doc/gpl.txt for details.

ABOUT

Spejd is a shallow parser that performs syntactic parsing and morphological
disambiguation simultaneously. It was developed at the Institute of Computer
Science, Polish Academy of Sciences, Warsaw.

Spejd homepage: http://nlp.ipipan.waw.pl/Spejd/

Latest releases:

0.8.4: bugfix release
0.8.3: bugfix release
0.8.2: bugfix release
0.8.1: Compared to the previous release, major changes in this version include:
  - Integrated plain text mode processing module based on the morphological
    analyzer Morfologik (http://morfologik.blogspot.com/). This module
    requires appropriately encoded input, as defined by the inputEncoding
    config parameter. The plain text module is enabled by the inputType
    parameter (auto or txt).
  - Parallel processing (benefits are immediate on multicore CPUs). The
    number of processing threads is defined by the maxThreads parameter.
  - A simple spelling correction module that restores missing Polish
    diacritics. Possible transformations are listed in ogonkifier.ini.
  - Changes listed in doc/changes0_5.txt.

REQUIREMENTS

Sun Java Runtime Environment version 1.5 or higher.

Note: it may be possible to run the program on an alternative Java
implementation, but because of differences in regular expression
implementations, we cannot guarantee its behaviour.

INSTALLATION

Unzip the file spade.zip. Installation finished!

SYNOPSIS

    java -jar spejd.jar path [options]

where:

- path - a single file or a folder with XML CES (see doc/xcesIPIAna.dtd) or
  plain text files (.txt, encoding defined by the inputEncoding parameter)
  to parse; the parser looks for files matching the pattern defined in
  config.ini (inputFiles parameter) and recursively checks subdirectories.
- options - an optional list of assignments var=value; var has to be one of
  the variables from config.ini; values passed as invocation arguments
  override the default values from the file.

Examples:

    java -jar spejd.jar corpus nullAgreement=1
    java -jar spejd.jar corpus rules=rules2.sr logDir=log2
    java -jar spejd.jar corpus discardDeleted=true outputSuffix=.sh2.xml

RESULTS

For XML input, in each directory where a file filename.xml(.gz) has been
found, a new file filenameSh.xml is created. It is a copy of the
corresponding .xml file with additional annotation: token identifiers,
disambiguation attributes, syntactic words and groups.

For plain text input filename.txt, a new XML file (its name ends with Sh.xml)
is created for each corresponding .txt file.

A few additional files are generated in the logs subdirectory of the spade
directory:

rules.compiled    - a compiled set of rules
rules.matched.csv - rule statistics: for each rule, the number of completed
                    matches (those evaluated to true), the number of matches,
                    matching time, evaluation time and total time
tagdict.ini       - tag dictionary, translating the tagset defined in the
                    configuration file to the internal positional tagset

DOCUMENTATION

doc/spade.pdf      - a paper about Spejd
doc/xcesAnaIPI.dtd - DTD of the input format
api/               - technical documentation

EXAMPLE

./sample-morfeusz.cfg   - example Morfeusz tagset file
./sample-morfologik.cfg - example Morfologik tagset file (for plain text input)
./rules.sr              - example set of rules
doc/morph.xml           - example XML input to the parser
doc/morphSh.xml         - example output
doc/display.*           - stylesheets and example output

WHAT'S NEW IN THIS VERSION FOR DEVELOPERS

Please feel free to play around with the sources, modify them and post
patches on Spejd's bugtracker at sourceforge (linked from the homepage)!
See api/ for a brief introduction to the code structure.
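
SCRIPTING EXAMPLE

For batch runs it can be convenient to assemble the command line from the
SYNOPSIS in a script. The following is only a minimal sketch and not part of
Spejd: the function names build_spejd_command and run_spejd are hypothetical,
the jar is assumed to be in the current directory, and the sketch does not
check that a variable name actually exists in config.ini.

```python
import subprocess
from typing import List, Mapping


def build_spejd_command(path: str,
                        overrides: Mapping[str, str],
                        jar: str = "spejd.jar") -> List[str]:
    """Return the argument list for: java -jar spejd.jar path [var=value ...].

    Hypothetical helper; it only formats the arguments described in SYNOPSIS.
    """
    command = ["java", "-jar", jar, path]
    for name, value in overrides.items():
        # A variable name must be non-empty and must not contain '=',
        # otherwise the var=value assignment would be ambiguous.
        if not name or "=" in name:
            raise ValueError(f"invalid variable name: {name!r}")
        command.append(f"{name}={value}")
    return command


def run_spejd(path: str, overrides: Mapping[str, str]) -> int:
    """Run the parser and return its exit code (assumes java is on PATH)."""
    return subprocess.run(build_spejd_command(path, overrides)).returncode


if __name__ == "__main__":
    # Only prints the command; call run_spejd(...) to actually execute it.
    print(" ".join(build_spejd_command(
        "corpus", {"rules": "rules2.sr", "logDir": "log2"})))
```

Run as a script, the sketch prints the second command from the Examples
above; passing the arguments as a list avoids shell quoting problems with
paths that contain spaces.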