DataparkSearch Engine 4.54: Reference manual
Chapter 3. Indexing

3.4. Stopwords

Stopwords are the most frequently used words, i.e. words that appear in almost every document searched. Stopwords are filtered out before the index is built, which reduces the total size of the index without any significant loss in search quality.
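To illustrate why filtering shrinks the index, here is a minimal Python sketch of a toy inverted index built with and without a stopword set. It is not DataparkSearch's implementation; the names STOPWORDS and build_index and the sample documents are assumptions made up for this example.

```python
from collections import defaultdict

# Hypothetical stopword set for illustration only; real lists come from stopword files.
STOPWORDS = {"the", "a", "of", "and", "is"}

def build_index(docs, stopwords=frozenset()):
    """Map each word that is not a stopword to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            if word not in stopwords:
                index[word].add(doc_id)
    return dict(index)

# Made-up sample documents.
docs = {
    1: "the history of the search engine",
    2: "a search engine is a program",
}

full = build_index(docs)                 # every word is indexed
filtered = build_index(docs, STOPWORDS)  # stopwords are dropped before indexing
print(len(full), len(filtered))
```

The filtered index has fewer entries, while the words a user is likely to search for (here "search", "engine", "history", "program") are all still present.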

3.4.1. StopwordFile command

Loads stopwords from the given text file. You may specify either an absolute file name or a name relative to the DataparkSearch /etc directory. You may use several StopwordFile commands.

StopwordFile stopwords/en.sl

You must use the same set of StopwordFile commands in indexer.conf and search.htm (searchd.conf if searchd is used).

3.4.2. Format of stopword file

You may create your own stopword lists. As an example, you may take the English stopword file etc/stopwords/en.sl. At the beginning of the list, specify the following two commands:

Language: en
Charset:  us-ascii

Language - the standard (ISO 639) two-letter language abbreviation.
Charset - any charset supported by DataparkSearch (see Section 7.1).

The list of stopwords then follows, one word per line. Each word is written in the character set specified above by the Charset: command.

You may use the optional Match: command to specify a pattern; any word that matches it is treated as a stopword. E.g.:

Match: regex ^\$##

According to this command, any word beginning with $## will be considered a stopword.

The options of the Match: command are the same as for Allow (see Section 3.10.14). Arguments are written in the character set specified by the Charset: command. Regular expression support is limited at the moment (e.g. intervals aren't supported).
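The file format described above can be sketched as a small Python parser. This is an illustration under stated assumptions, not the engine's own loader: parse_stopword_list and is_stopword are hypothetical names, only the "regex" form of Match: shown in this section is handled, and Python's re module is used, which accepts more syntax than the limited regular expressions DataparkSearch supports. The sketch works on already decoded text; a real reader would decode the file using the declared Charset.

```python
import re

def parse_stopword_list(text):
    """Parse stopword-list text into (language, charset, words, patterns)."""
    language = charset = None
    words, patterns = set(), []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith("Language:"):
            language = line.split(":", 1)[1].strip()
        elif line.startswith("Charset:"):
            charset = line.split(":", 1)[1].strip()
        elif line.startswith("Match:"):
            # Assumption: only the "regex" form from the manual's example is handled.
            kind, _, pattern = line.split(":", 1)[1].strip().partition(" ")
            if kind.lower() == "regex":
                patterns.append(re.compile(pattern.strip()))
        else:
            words.add(line)  # any other line is a stopword, one per line
    return language, charset, words, patterns

def is_stopword(word, words, patterns):
    """A word is a stopword if it is listed or matches any Match: pattern."""
    return word in words or any(p.search(word) for p in patterns)

# Made-up sample list following the format described above.
SAMPLE = r"""Language: en
Charset:  us-ascii
Match: regex ^\$##
a
the
of
"""

language, charset, words, patterns = parse_stopword_list(SAMPLE)
print(language, charset, sorted(words))
print(is_stopword("$##tmp", words, patterns), is_stopword("engine", words, patterns))
```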

3.4.3. FillDictionary command

With the command "FillDictionary yes" in indexer.conf, you can enable storing all indexed words in the "dict" table for dbmode cache. This is useful for tracking down which words are stopwords for your installation.

3.4.4. StopwordsLoose command

With the command "StopwordsLoose yes" in indexer.conf and search.htm, only stopwords in the same language as the document being indexed, or as the search request, are taken into account as stopwords; stopwords from other languages are processed as regular words for that document or search request.
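The effect of StopwordsLoose can be sketched in Python as follows. This is a hedged illustration, not the engine's code: STOPWORDS_BY_LANG, is_stopword and filter_words are hypothetical names, the word lists are made up, and the behaviour with the option off (a word is a stopword if it appears in any loaded list) is an assumption inferred from the description above.

```python
# Hypothetical per-language stopword lists, keyed by the Language: code of each file.
STOPWORDS_BY_LANG = {
    "en": {"the", "of", "and"},
    "de": {"die", "der", "und"},
}

def is_stopword(word, lang, loose=False):
    """Check a word against the stopword lists for a document or query in `lang`."""
    if loose:
        # StopwordsLoose yes: only the list for the same language applies.
        return word in STOPWORDS_BY_LANG.get(lang, set())
    # Assumed behaviour without the option: every loaded list applies.
    return any(word in words for words in STOPWORDS_BY_LANG.values())

def filter_words(words, lang, loose=False):
    """Drop stopwords from a list of words."""
    return [w for w in words if not is_stopword(w, lang, loose)]

tokens = ["the", "die", "is", "cast"]
print(filter_words(tokens, "en"))              # "die" is dropped as a German stopword
print(filter_words(tokens, "en", loose=True))  # "die" is kept as a regular English word
```

In this sketch the English word "die" survives only in loose mode, because it collides with a German stopword.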
