8.8. Fuzzy search

8.8.1. Ispell

When DataparkSearch is used with ispell support enabled, it automatically extends the search query with all grammatical forms of the query words. For example, the search front-end will also try to find the word "test" if "testing" or "tests" is given in the search query.

8.8.1.1. Two types of ispell files

DataparkSearch understands two types of ispell files: affix files and dictionaries. An ispell affix file contains rules for building word forms and has approximately the following format:

Flag V:
       E   > -E, IVE      # As in create > creative
      [^E] > IVE          # As in prevent > preventive
Flag *N:
       E   > -E, ION      # As in create > creation
       Y   > -Y, ICATION  # As in multiply > multiplication
     [^EY] > EN           # As in fall > fallen

An ispell dictionary file contains the words themselves and has the following format:

wop/S
word/DGJMS
wordage/S
wordbook
wordily
wordless/P
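
To illustrate how flags in a dictionary entry combine with affix rules, here is a minimal Python sketch. It is not DataparkSearch's implementation: it only handles suffix rules in the format shown above, and the entries passed to `expand` (such as `create/VN`) are hypothetical, chosen to match the V and N flags from the affix excerpt.

```python
import re

AFFIX_TEXT = """\
Flag V:
       E   > -E, IVE      # As in create > creative
      [^E] > IVE          # As in prevent > preventive
Flag *N:
       E   > -E, ION      # As in create > creation
       Y   > -Y, ICATION  # As in multiply > multiplication
     [^EY] > EN           # As in fall > fallen
"""

def parse_suffix_rules(text):
    """Return {flag: [(condition_regex, strip, append), ...]}."""
    rules = {}
    flag = None
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()      # drop trailing comments
        if not line:
            continue
        header = re.fullmatch(r"Flag\s+\*?(\S):", line)
        if header:
            flag = header.group(1)
            rules[flag] = []
            continue
        cond, _, action = line.partition(">")
        parts = [p.strip() for p in action.split(",")]
        strip = parts[0][1:] if parts[0].startswith("-") else ""
        append = parts[-1]
        # The condition describes the ending of the base word.
        pattern = re.compile(cond.replace(" ", "") + "$")
        rules[flag].append((pattern, strip, append))
    return rules

def expand(entry, rules):
    """Expand a 'word/FLAGS' dictionary entry into its word forms."""
    word, _, flags = entry.partition("/")
    base = word.upper()                          # rules are written in upper case
    forms = [word]
    for flag in flags:
        for pattern, strip, append in rules.get(flag, []):
            if pattern.search(base):
                stem = base[: len(base) - len(strip)] if strip else base
                forms.append((stem + append).lower())
    return forms

rules = parse_suffix_rules(AFFIX_TEXT)
print(expand("create/VN", rules))   # hypothetical entry
print(expand("fall/N", rules))      # hypothetical entry
```

At search time the same rules are used to relate a query word to its other grammatical forms; the sketch only shows the forward direction, from base word to derived forms.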

8.8.1.2. Using Ispell

To make DataparkSearch use ispell, you must specify the Affix and Spell commands in the search.htm file. The format of the commands is:

Affix [lang] [charset] [ispell affix file name]
Spell [lang] [charset] [ispell dictionary filename]

The first parameter of both commands is a two-letter language abbreviation. The second is the charset of the ispell files. The third is the file name. File names are relative to the etc directory of the DataparkSearch installation. Absolute paths can also be specified.

Note: Simultaneous loading of several languages is supported, e.g.:

Affix en iso-8859-1 en.aff
Spell en iso-8859-1 en.dict
Affix de iso-8859-1 de.aff
Spell de iso-8859-1 de.dict

These commands load support for both English and German.

If you use searchd, add the same commands to searchd.conf.

When DataparkSearch is used with ispell support, it is recommended to use searchd, especially when several languages are loaded. Otherwise the start-up time of search.cgi increases.

8.8.1.3. Customizing dictionary

Your site may contain rare words that are not in the ispell dictionaries. In that case, the dictionary entry with the longest matching suffix is used to produce word forms.
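
The idea of picking the entry with the longest matching suffix can be sketched in Python as follows. This is an illustration only: the helper names and the sample dictionary are hypothetical, and the way DataparkSearch actually breaks ties or limits the match is not described here.

```python
def common_suffix_len(a, b):
    """Number of trailing characters that a and b share."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def closest_entry(word, dictionary):
    """Return the known word whose ending matches `word` best.

    `dictionary` maps a known word to its ispell flags.
    """
    return max(dictionary, key=lambda known: common_suffix_len(word, known))

# Hypothetical dictionary fragment: word -> flags
dictionary = {"postmaster": "MS", "wordless": "P", "wordbook": ""}

known = closest_entry("webmaster", dictionary)
print(known, dictionary[known])   # the flags of this entry would be borrowed
```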

You can also list such words in a plain text file, one word per line:

rare.dict:
----------
webmaster
intranet
.......
www
http
---------

You may also use ispell flags in this file (refer to the ispell documentation for the flags). This avoids having to list the same word with different endings, for example "webmaster" and "webmasters". You can pick a word from an existing ispell dictionary that follows the same inflection rules and simply copy its flags. For example, the English dictionary has this line:

postmaster/MS

So webmaster with the MS flags will probably be OK:

webmaster/MS

Then copy this file to the etc directory of DataparkSearch and add it with a Spell command in the DataparkSearch configuration, for example (following the command format above, for an English list saved as rare.dict):

Spell en iso-8859-1 rare.dict

During the next reindexing of all documents, the new words will be treated as correctly spelled. Only genuinely misspelled words will remain.

8.8.1.4. Where to get Ispell files

You may find ispell files for many languages at this page.

For Japanese there exist quasi-ispell files suitable for use with DataparkSearch only. You may get this data from our web site or from one of our mirrors. See Section 1.2.

8.8.1.5. Query words modification

Quffix [lang] [charset] [ispell-like suffix file name]

The Quffix command is similar to the Affix command described above, except that its rules are applied to the query words, but not to the normal word forms as is done for the Affix command. The file loaded with this command must contain only suffix rules (in terms of ispell affix files).

This command is suitable, for example, for specifying rules that switch from one part of speech to another in Russian where appropriate.

8.8.2. Aspell

With Aspell support compiled in, it is possible to automatically extend the search query with spelling suggestions for the query words. To enable this feature, you need to install Aspell on your system before building DataparkSearch. Then place the AspellExtensions yes command into your indexer.conf and search.htm (or searchd.conf, if searchd is used) files to activate it.

Automatic spelling suggestions for query words are made only if the sp search parameter is set, see Section 8.1.2.

8.8.3. Synonyms

DataparkSearch also supports synonym-based fuzzy search.

Synonym files are installed into the etc/synonym subdirectory of the DataparkSearch installation. Large synonym files must be downloaded separately from our web site or from one of our mirrors, see Section 1.2.

To enable synonyms, add commands of the form Synonym <filename> to the search.htm search template, e.g.:

Synonym synonym/english.syn
Synonym synonym/russian.syn

File names are relative to the etc directory of the DataparkSearch installation, or absolute if they begin with /.

If you use searchd, add the same commands to searchd.conf.

You may create your own synonym lists, taking the English synonym file as an example. At the beginning of the list, specify the following two commands:

Language: en
Charset:  us-ascii

You can use the '\' character to escape a '#' character inside an entry; otherwise '#' is treated as the beginning of a comment.

Optionally, you may specify the following command in the list:

Thesaurus: yes

This command enables thesaurus mode for the synonym list. In this mode, only words on the same line are treated as synonyms.
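
The header commands and the '#' escaping rule described above can be illustrated with a small Python reader. This is a sketch under assumptions, not the parser DataparkSearch uses: in particular, it assumes that each entry line is a whitespace-separated group of words, and the sample content is made up.

```python
import re

SAMPLE = """\
Language: en
Charset:  us-ascii
Thesaurus: yes
# a full-line comment
car automobile   # trailing comment
c\\# csharp
"""

def parse_list(text):
    """Return (header, entries) for a synonym-style list."""
    header = {}
    entries = []
    for raw in text.splitlines():
        # Cut the line at the first '#' that is not preceded by a backslash.
        line = re.split(r"(?<!\\)#", raw, maxsplit=1)[0]
        # Turn the escaped form back into a literal '#'.
        line = line.replace("\\#", "#").strip()
        if not line:
            continue
        command = re.fullmatch(r"(Language|Charset|Thesaurus):\s*(\S+)", line)
        if command:
            header[command.group(1).lower()] = command.group(2)
        else:
            # Assumption: one group of words per line.
            entries.append(line.split())
    return header, entries

header, entries = parse_list(SAMPLE)
print(header)
print(entries)
```

With Thesaurus: yes, each inner list in `entries` would be one self-contained group of synonyms.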

8.8.4. Accent insensitive search

Since version 4.17, DataparkSearch also supports accent-insensitive search.

To enable this extension, use the AccentExtensions command in your search.htm (or in searchd.conf, if searchd is used) to automatically make accent-free copies of the query words, and in your indexer.conf to produce accent-free copies of words for storing in the database.

AccentExtensions yes

If the AccentExtensions command is placed before the Spell and Affix commands, accent-free copies of that data are also loaded automatically.
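
What an "accent-free copy" means can be sketched in Python using Unicode decomposition. This is only an approximation for illustration; the exact character mapping DataparkSearch applies is not specified here, and the function names are hypothetical.

```python
import unicodedata

def strip_accents(word):
    """Return the word with combining accent marks removed."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def with_accent_free_copies(words):
    """Keep every word and add an accent-free copy where it differs."""
    result = []
    for word in words:
        result.append(word)
        plain = strip_accents(word)
        if plain != word:
            result.append(plain)
    return result

print(with_accent_free_copies(["r\u00e9sum\u00e9", "test"]))
```

Applying the same step to both indexed words and query words is what lets an unaccented query match an accented document word.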

8.8.5. Acronyms and abbreviations

Since version 4.30, DataparkSearch also supports fuzzy search based on acronyms and abbreviations.

Acronym files are installed into the etc/acronym subdirectory of the DataparkSearch installation.

To enable acronyms, add commands of the form Acronym <filename> to the search.htm search template, e.g.:

Acronym acronym/en.fido.acr
Acronym acronym/en.acr

File names are relative to the etc directory of the DataparkSearch installation, or absolute if they begin with /.

If you use searchd, add the same commands to searchd.conf.

You may create your own acronym lists, taking the English acronym file as an example. At the beginning of the list, specify the following two commands:

Language: en
Charset:  us-ascii

You can use the '\' character to escape a '#' character in your acronyms or their expansions; otherwise '#' is treated as the beginning of a comment.

You can also extend queries with special comments that specify regular expression modifications, e.g.:

#* regex last "([0-9]{2})[- \.]?([0-9]{2})[- \.]?([0-9]{2})" "+78622$1$2$3"

This specifies a transformation from a widely used format of local phone numbers, 99-99-99, into the canonical format +78622XXXXXX, so phone numbers become searchable regardless of the format in which they were written. The last option means that regex processing stops after this rule is applied.
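
The effect of this rule can be tried out with Python's re module. The sketch below is an illustration, not DataparkSearch code: the rule table and the `rewrite` function are hypothetical, the `$1` style references are rewritten as `\1` for Python, and only the transformation itself is shown, not how the result is merged into the query.

```python
import re

# Each rule: (compiled pattern, replacement, is_last)
RULES = [
    (
        re.compile(r"([0-9]{2})[- \.]?([0-9]{2})[- \.]?([0-9]{2})"),
        r"+78622\1\2\3",
        True,   # corresponds to the "last" option
    ),
]

def rewrite(query, rules=RULES):
    """Apply the rules in order; stop after a matching rule marked last."""
    for pattern, replacement, is_last in rules:
        new_query, count = pattern.subn(replacement, query)
        if count:
            query = new_query
            if is_last:
                break
    return query

for sample in ("12-34-56", "12 34 56", "12.34.56", "123456"):
    print(sample, "->", rewrite(sample))
```

All four spellings of the number map to the same canonical string, which is what makes them interchangeable in a search.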

If you want to share your own acronym files with other users, please send them to the DataparkSearch developers.