DataparkSearch stores every word found in any defined section of a document. The number of times a word appears in the document does not affect its weight. However, whether the word appears in more important parts of the document (title, description, etc.) is taken into account.
DataparkSearch currently supports several word storage modes: "single", "multi", "crc", "crc-multi", and "cache". The default mode is "cache". The mode is selected with the dbmode parameter of the DBAddr command in both the indexer.conf and search.htm files.
Examples:
DBAddr mysql://localhost/search/?dbmode=single
DBAddr mysql://localhost/search/?dbmode=multi
DBAddr mysql://localhost/search/?dbmode=crc
DBAddr mysql://localhost/search/?dbmode=crc-multi
When "single" is specified, all words are stored in one table with the structure (url_id,word,weight), where url_id is the ID of the document and refers to the rec_id field in the "url" table. The word column has the variable-length SQL type varchar(32).
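As a rough illustration of the "single" layout, the following Python sketch builds an in-memory SQLite table with the (url_id,word,weight) structure and looks up one word. The table name dict, the sample rows, and the weight values are assumptions made for this example; SQLite stands in for the real back-end, and the actual table definitions are the ones shipped in the create directory.

```python
import sqlite3

def build_single_index(rows):
    """Create an in-memory table shaped like the "single" mode layout."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table name; the columns follow (url_id, word, weight).
    conn.execute(
        "CREATE TABLE dict ("
        "url_id INTEGER NOT NULL, "
        "word VARCHAR(32) NOT NULL, "
        "weight INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO dict (url_id, word, weight) VALUES (?, ?, ?)", rows
    )
    return conn

def find_documents(conn, word):
    """Return (url_id, weight) pairs for an exact word match, ordered by url_id."""
    cur = conn.execute(
        "SELECT url_id, weight FROM dict WHERE word = ? ORDER BY url_id",
        (word,),
    )
    return cur.fetchall()

# Illustrative data only: the weights are made up for the example.
SAMPLE_ROWS = [(1, "search", 2), (1, "engine", 1), (2, "search", 1)]
conn = build_single_index(SAMPLE_ROWS)
print(find_documents(conn, "search"))
```

Because every word lives in the same table, a lookup is a single query against one index, whatever the length of the word.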
If "multi" is selected, words are distributed across 13 different tables depending on their length. The structure of these tables is the same as in "single" mode, but a fixed-length char type is used, which is usually faster in most databases. As a result, "multi" mode is usually faster than "single" mode.
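The routing of a word to a table by its length can be sketched as follows. The bucket limits (2 through 12, then 16 and 32) and the dictN table names are assumptions chosen to give 13 tables for this example; check the table definitions for your back-end for the real names and widths.

```python
import bisect

# Assumed maximum word length per table: 13 buckets in total.
BUCKET_LIMITS = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 16, 32]

def table_for_word(word):
    """Pick the hypothetical table whose fixed char width fits the word."""
    length = len(word)
    if length == 0:
        raise ValueError("empty word")
    if length > BUCKET_LIMITS[-1]:
        raise ValueError("word longer than the widest table allows")
    # bisect_left returns the index of the first limit >= length.
    limit = BUCKET_LIMITS[bisect.bisect_left(BUCKET_LIMITS, length)]
    return f"dict{limit}"

for sample in ("a", "search", "documentation"):
    print(sample, "->", table_for_word(sample))
```

With this kind of routing, a query for a given word only has to touch the one table that matches its length.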
If "crc" mode is selected, DataparkSearch stores 32-bit integer word IDs calculated by the HASH32 algorithm instead of the words themselves. This mode requires less disk space and is faster than the "single" and "multi" modes. DataparkSearch relies on the fact that HASH32 produces nearly unique checksums for different words. According to our tests, only 250 pairs of words share the same HASH32 value in a list of about 1,600,000 unique words, and most of these pairs (>90%) contain at least one misspelled word. Word information is stored in the structure (url_id,word_id,weight), where word_id is the 32-bit integer ID calculated by the HASH32 algorithm. This mode is recommended for big search engines.
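The idea of replacing words with 32-bit IDs, and of counting how many distinct words end up sharing an ID, can be sketched as below. HASH32 itself is not reproduced here: zlib.crc32 is used purely as a stand-in 32-bit hash, and the function names word_id and find_collisions are hypothetical.

```python
import zlib
from collections import defaultdict

def word_id(word):
    """Map a word to an unsigned 32-bit integer (stand-in for HASH32)."""
    return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF

def find_collisions(words, hash_fn=word_id):
    """Group distinct words by ID and keep only the IDs shared by several words."""
    by_id = defaultdict(set)
    for word in words:
        by_id[hash_fn(word)].add(word)
    return {wid: sorted(group) for wid, group in by_id.items() if len(group) > 1}

# A stored row would then look like (url_id, word_id, weight).
row = (1, word_id("search"), 1)
print(row)
```

Running find_collisions over a full word list is one way to repeat the kind of collision count quoted above for a different hash function or vocabulary.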
When "crc-multi" mode is selected, DataparkSearch stores HASH32 word IDs in several tables that have the same structure as in "crc" mode, split by word length as in "multi" mode. This mode is usually the fastest and is recommended for big search engines.
Please note that we develop DataparkSearch with PostgreSQL as the back-end and are often unable to test each version with all of the other supported databases. So, if there is no table definition in the create/your_database directory, you can find the PostgreSQL definition for the same table and adapt it for your back-end. PostgreSQL table definitions are always up-to-date.
"single" and "multi" modes support substring search. Because "crc" and "crc-multi" do not store the words themselves and use integer values generated by the HASH32 algorithm instead, substring search is not possible in these modes.
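The difference can be seen in a small sketch that stores the same words once as text and once as 32-bit IDs. The table names dict_words and dict_ids are hypothetical, SQLite stands in for the real back-end, and zlib.crc32 is only a stand-in for HASH32.

```python
import sqlite3
import zlib

def word_id(word):
    """Stand-in 32-bit hash; the real HASH32 algorithm is not reproduced here."""
    return zlib.crc32(word.encode("utf-8")) & 0xFFFFFFFF

conn = sqlite3.connect(":memory:")
# Text storage, as in "single"/"multi"; hashed storage, as in "crc"/"crc-multi".
conn.execute("CREATE TABLE dict_words (url_id INTEGER, word VARCHAR(32), weight INTEGER)")
conn.execute("CREATE TABLE dict_ids (url_id INTEGER, word_id INTEGER, weight INTEGER)")
for url_id, word, weight in [(1, "search", 1), (2, "research", 1), (3, "engine", 1)]:
    conn.execute("INSERT INTO dict_words VALUES (?, ?, ?)", (url_id, word, weight))
    conn.execute("INSERT INTO dict_ids VALUES (?, ?, ?)", (url_id, word_id(word), weight))

def substring_search(conn, fragment):
    """Match any stored word containing the fragment (needs the word text)."""
    cur = conn.execute(
        "SELECT url_id FROM dict_words WHERE word LIKE ? ORDER BY url_id",
        (f"%{fragment}%",),
    )
    return [row[0] for row in cur]

def exact_search_by_id(conn, word):
    """Match only rows whose stored ID equals the hash of the whole word."""
    cur = conn.execute(
        "SELECT url_id FROM dict_ids WHERE word_id = ? ORDER BY url_id",
        (word_id(word),),
    )
    return [row[0] for row in cur]

print(substring_search(conn, "sear"))
print(exact_search_by_id(conn, "search"))
print(exact_search_by_id(conn, "sear"))
```

The hash of a fragment is unrelated to the hash of any word that contains it, so the hashed table can only answer whole-word lookups.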