Chapter 5. Storing data

Table of Contents
5.1. SQL storage types
5.2. Cache mode storage
5.3. DataparkSearch performance issues
5.4. SearchD support
5.5. Oracle notes

5.1. SQL storage types

5.1.1. General storage information

DataparkSearch stores every words found in any defined section of document. The count of word appearance in the document does not affect it's weight. But the fact whether the word appears in more important parts of the document (title, description, etc.) is taken in account however.

5.1.2. Various modes of words storage

There are different modes of word storage which are currently supported by DataparkSearch: "single", "multi", "crc", "crc-multi", "cache". Default mode is "cache". Mode is to be selected by dbmode parameter of DBAddr command in both indexer.conf and search.htm files.

Examples:
DBAddr mysql://localhost/search/?dbmode=single
DBAddr mysql://localhost/search/?dbmode=multi
DBAddr mysql://localhost/search/?dbmode=crc
DBAddr mysql://localhost/search/?dbmode=crc-multi

5.1.3. Storage mode - single

When "single" is specified, all words are stored in one table with structure (url_id,word,weight), where url_id is the ID of the document which is referenced by rec_id field in "url" table. Word has variable char(32) SQL type.

5.1.4. Storage mode - multi

If "multi" is selected, words will be located in different 13 tables depending of their lengths. Structures of these tables are the same with "single" mode, but fixed length char type is used, which is usually faster in most databases. This fact makes "multi" mode usually faster comparing with "single" mode.

5.1.5. Storage mode - crc

If "crc" mode is selected, DataparkSearch will store 32 bit integer word IDs calculated by HASH32 algorithm instead of words. This mode requires less disc space and is faster than "single" and "multi" modes. DataparkSearch uses the fact that HASH32 calculates quite unique check sums for different words. According to our tests there are only 250 pairs of words have the same HASH32 value in the list of about 1.600.000 unique words. Most of these pairs (>90%) have at least one misspelled word. Words information is stored in the structure (url_id,word_id,weight), where word_id is 32 bit integer ID calculated by HASH32 algorithm. This mode is recommended for big search engines.

5.1.6. Storage mode - crc-multi

When "crc-multi" mode is selected, DataparkSearch stores HASH32 word IDs in several tables with the same to "crc" structures depending on word lengths like in "multi" mode. This mode usually is the most fast and recommended for big search engines.

5.1.7. SQL structure notes

Please note that we develop DataparkSearch with PostgreSQL as back-end and often have no possibility to test each version with all of other supported databases. So, if there is no table definition in create/you_database directory, you may found PostgreSQL definition for the same table and just adopt it for your back-end. PostgreSQL table definitions are always up-to-date.

5.1.8. Additional features of non-CRC storage modes

"single" and "multi" modes support substring search. As far as "crc" and "crc-multi" do not store words themselves and use integer values generated by HASH32 algorithm instead, there is no possibility of substring search in these modes.