DataparkSearch Engine 4.54

Reference manual


Table of Contents
1. Introduction
1.1. DataparkSearch Features
1.2. Where to get DataparkSearch
1.3. Disclaimer
1.4. Authors
1.4.1. Contributors
2. Installation
2.1. SQL database requirements
2.2. Supported operating systems
2.3. Tools required for installation
2.4. Installing DataparkSearch
2.5. Possible installation problems
2.6. Creating a binary distribution
2.7. Quick usage tour
3. Indexing
3.1. Indexing in general
3.1.1. Configuration
3.1.2. Running indexer
3.1.3. How to create SQL table structure
3.1.4. How to drop SQL table structure
3.1.5. Subsection control
3.1.6. How to clear database
3.1.7. Database Statistics
3.1.8. Link validation
3.1.9. Parallel indexing
3.2. Supported HTTP response codes
3.3. Content-Encoding support
3.4. Stopwords
3.4.1. StopwordFile command
3.4.2. Format of stopword file
3.4.3. FillDictionary command
3.4.4. StopwordsLoose command
3.5. Clones
3.5.1. DetectClones command
3.6. Specifying the web space to be indexed
3.6.1. Server command
3.6.2. Realm command
3.6.3. Subnet command
3.6.4. Using different parameters for a server and its subsections
3.6.5. Default indexer behavior
3.6.6. Using indexer -f <filename>
3.6.7. URL command
3.6.8. ServerDB, RealmDB, SubnetDB and URLDB commands
3.6.9. ServerFile, RealmFile, SubnetFile and URLFile commands
3.6.10. Robots exclusion standard
3.7. Aliases
3.7.1. Alias indexer.conf command
3.7.2. Different aliases for server parts
3.7.3. Using aliases in Server commands
3.7.4. Using aliases in Realm commands
3.7.5. AliasProg command
3.7.6. ReverseAlias command
3.7.7. ReverseAliasProg command
3.7.8. Alias command in search.htm search template
3.8. Servers Table
3.8.1. Loading servers table
3.8.2. Servers table structure
3.8.3. Flushing Servers Table
3.9. External parsers
3.9.1. Supported parser types
3.9.2. Setting up parsers
3.9.3. Avoiding indexer hangs on parser execution
3.9.4. Pipes in parser's command line
3.9.5. Charsets and parsers
3.9.6. DPS_URL environment variable
3.9.7. Some third-party parsers
3.9.8. libextractor library
3.10. Other commands used in indexer.conf
3.10.1. Include command
3.10.2. DBAddr command
3.10.3. VarDir command
3.10.4. NewsExtensions command
3.10.5. SyslogFacility command
3.10.6. Word length commands
3.10.7. MaxDocSize command
3.10.8. MinDocSize command
3.10.9. IndexDocSizeLimit command
3.10.10. URLSelectCacheSize command
3.10.11. URLDumpCacheSize command
3.10.12. UseCRC32URLId command
3.10.13. HTTPHeader command
3.10.14. Allow command
3.10.15. Disallow command
3.10.16. CheckOnly command
3.10.17. HrefOnly command
3.10.18. CheckMp3 command
3.10.19. CheckMp3Only command
3.10.20. IndexIf command
3.10.21. NoIndexIf command
3.10.22. AllowIf command
3.10.23. DisallowIf command
3.10.24. HoldBadHrefs command
3.10.25. DeleteOlder command
3.10.26. UseRemoteContentType command
3.10.27. AddType command
3.10.28. Period command
3.10.29. PeriodByHops command
3.10.30. ExpireAt command
3.10.31. UseDateHeader command
3.10.32. LMDSection command
3.10.33. MaxHops command
3.10.34. TrackHops command
3.10.35. MaxDepth command
3.10.36. MaxDocsPerServer command
3.10.37. MaxHrefsPerServer command
3.10.38. MaxNetErrors command
3.10.39. ReadTimeOut command
3.10.40. DocTimeOut command
3.10.41. NetErrorDelayTime command
3.10.42. Cookies command
3.10.43. Section command
3.10.44. HrefSection command
3.10.45. FastHrefCheck command
3.10.46. Index command
3.10.47. ProxyAuthBasic command
3.10.48. Proxy command
3.10.49. AuthBasic command
3.10.50. ServerWeight command
3.10.51. OptimizeAtUpdate command
3.10.52. SkipUnreferred command
3.10.53. Bind command
3.10.54. ProvideReferer command
3.10.55. LongestTextItems command
3.10.56. MakePrefixes command
3.11. Extended indexing features
3.11.1. News extensions
3.11.2. Indexing SQL database tables (htdb: virtual URL scheme)
3.11.3. Indexing the output of binaries (exec: and cgi: virtual URL schemes)
3.11.4. Mirroring
3.11.5. Data acquisition
3.12. Using syslog
3.13. Storing compressed document copies
3.13.1. Configure stored
3.13.2. How stored works
3.13.3. Using stored during search
3.13.4. Document excerpts
4. DataparkSearch HTML parser
4.1. Tag parser
4.2. Special characters
4.3. META tags
4.4. Links
4.5. Comments
4.6. Body patterns
4.7. Sub-documents
5. Storing data
5.1. SQL storage types
5.1.1. General storage information
5.1.2. Various modes of words storage
5.1.3. Storage mode - single
5.1.4. Storage mode - multi
5.1.5. Storage mode - crc
5.1.6. Storage mode - crc-multi
5.1.7. SQL structure notes
5.1.8. Additional features of non-CRC storage modes
5.2. Cache mode storage
5.2.1. Introduction
5.2.2. Cache mode word indexes structure
5.2.3. Cache mode tools
5.2.4. Starting cache mode
5.2.5. Optional usage of several splitters
5.2.6. Using run-splitter script
5.2.7. Doing search
5.2.8. Using search limits
5.3. DataparkSearch performance issues
5.3.1. searchd usage recommendation
5.3.2. Search results caching
5.3.3. Memory based filesystem (mfs) usage recommendation
5.3.4. URLInfoSQL command
5.3.5. SRVInfoSQL command
5.3.6. MarkForIndex command
5.3.7. CheckInsertSQL command
5.3.8. MySQL performance
5.3.9. Asynchronous resolver library
5.4. SearchD support
5.4.1. Why use searchd
5.4.2. Starting searchd
5.5. Oracle notes
5.5.1. Introduction
5.5.2. Compilation, Installation and Configuration
6. Subsections
6.1. Tags
6.1.1. Tag command
6.1.2. TagIf command
6.1.3. Tags in SQL version
6.2. Categories
6.2.1. Category command
6.2.2. CategoryIf command
6.2.3. Loading categories table
6.2.4. FlushCategoryTable command
7. Language support
7.1. Character sets
7.1.1. Supported character sets
7.1.2. Character set aliases
7.1.3. Recoding
7.1.4. Recoding at search time
7.1.5. Document charset detection
7.1.6. Automatic charset guesser
7.1.7. Default charset
7.1.8. Default Language
7.1.9. LocalCharset command
7.1.10. ForceIISCharset1251 command
7.1.11. RemoteCharset command
7.1.12. URLCharset command
7.1.13. CharsToEscape command
7.2. Making multi-language search pages
7.2.1. How does it work?
7.2.2. Possible troubles
7.3. Segmenters for Chinese, Japanese, Korean and Thai languages
7.3.1. Japanese language phrase segmenter
7.3.2. Chinese language phrase segmenter
7.3.3. Thai language phrase segmenter
7.3.4. Korean language phrase segmenter
7.4. Multilingual servers support
8. Searching documents
8.1. Using search front-ends
8.1.1. Performing search
8.1.2. Search parameters
8.1.3. Changing the weights of different document parts at search time
8.1.4. Using the front-end with an shtml page
8.1.5. Using several templates
8.1.6. Search operators
8.1.7. Advanced boolean search
8.1.8. The Verity Query Language, VQL
8.1.9. How search handles expired documents
8.2. mod_dpsearch module for Apache httpd
8.2.1. Why use mod_dpsearch
8.2.2. Configuring mod_dpsearch
8.3. How to write search result templates
8.3.1. Template sections
8.3.2. Variables section
8.3.3. Includes in templates
8.3.4. Conditional template operators
8.3.5. Security issues
8.4. Designing search.html
8.4.1. How the results page is created
8.4.2. Your HTML
8.4.3. Forms considerations
8.4.4. Relative links in search.htm
8.4.5. Adding Search form to other pages
8.5. Relevance
8.5.1. Ordering documents
8.5.2. Relevance calculation
8.5.3. Popularity rank
8.5.4. Boolean search
8.5.5. Crosswords
8.5.6. The Summary Extraction Algorithm (SEA)
8.6. Search queries tracking
8.7. Search results cache
8.8. Fuzzy search
8.8.1. Ispell
8.8.2. Aspell
8.8.3. Synonyms
8.8.4. Accent insensitive search
8.8.5. Acronyms and abbreviations
9. Miscellaneous
9.1. Reporting bugs
9.1.1. Currently known bugs
9.1.2. Core dump reports
9.2. Using libdpsearch library
9.2.1. dps-config script
9.2.2. DataparkSearch API
9.3. Database schema
A. Donations
Index
List of Tables
3-1. Relationship between libextractor's keyword types and DataparkSearch section names
3-2. Verbose levels
5-1. Cache mode predefined limit types
5-2. SQL-based cache mode limit types
7-1. Language groups
7-2. Charset aliases
8-1. Available search parameters
8-2. VQL operators supported by DataparkSearch
8-3. Configure-time parameters to tune relevance calculation (switches for configure)
9-1. server table schema
9-2. Values of several server parameters in the srvinfo table