3.10. Other commands used in indexer.conf

3.10.1. Include command

You may include another configuration file at any place in indexer.conf using the Include <filename> command. The path is absolute if <filename> starts with "/":

Include /usr/local/dpsearch/etc/inc1.conf

Otherwise the path is relative:

Include inc1.conf

3.10.2. DBAddr command

The DBAddr command is a URL-style database description. It specifies the options (type, host, database name, port, user and password) used to connect to an SQL database. It should be used before any other commands. You may specify several DBAddr commands; in this case DataparkSearch will merge the results from every database specified. The command has global effect for the whole config file. Format:

DBAddr <Type>:[//[User[:Pass]@]Host[:Port]]/DBName/[?[dbmode=mode]{&<parameter name>=<parameter value>}]

Note: ODBC related. Use DBName to specify the ODBC data source name (DSN). Host does not matter; use "localhost".

Note: Solid related. Use Host to specify the Solid server. DBName does not matter for Solid.

You may use CGI-like encoding for User and Pass if you need to use special characters in the user name or password. For example, if you have ABC@DEF as the password, you should write it as ABC%40DEF.
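
As a minimal sketch of this encoding, the following Python snippet percent-encodes a user name and password before placing them in a DBAddr line. Python is used only for illustration; encode_credential and build_dbaddr are hypothetical helper names and are not part of DataparkSearch.

```python
from urllib.parse import quote


def encode_credential(value):
    """Percent-encode every reserved character in a user name or password."""
    # safe="" forces characters such as '@', ':' and '/' to be escaped too.
    return quote(value, safe="")


def build_dbaddr(db_type, user, password, host, db_name):
    """Assemble a DBAddr line from separate parts (illustrative helper)."""
    return "DBAddr {}://{}:{}@{}/{}/".format(
        db_type, encode_credential(user), encode_credential(password), host, db_name
    )


print(build_dbaddr("mysql", "foo", "ABC@DEF", "localhost", "dpsearch"))
```

The printed line can be pasted into indexer.conf; the password ABC@DEF appears in it as ABC%40DEF.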

Currently supported Type values are mysql, pgsql, msql, solid, mssql, oracle, ibase, sqlite. Actually, the value does not matter when native library support is used, but ODBC users should specify one of the supported values. If your database type is not supported, you may use "unknown" instead.

MySQL and PostgreSQL users can specify the path to a Unix socket when connecting to localhost: mysql://foo:bar@localhost/dpsearch/?socket=/tmp/mysql.sock

If you are using PostgreSQL and do not specify a hostname, e.g. pgsql://user:password@/dbname/, then PostgreSQL will not work via TCP but will use the default Unix socket.

dbmode parameter. You may also select the database mode used for word storage. When "single" is specified, all words are stored in the same table (file). If "multi" is selected, words are placed in different tables (files) depending on their lengths. "multi" mode is usually faster but requires more tables (files). If "crc" mode is selected, DataparkSearch stores 32-bit integer word IDs calculated by the HASH32 algorithm instead of the words themselves. This mode requires less disk space and is faster than the "single" and "multi" modes, but it does not support substring searches. "crc-multi" uses the same storage structure as "crc" mode, but also stores words in different tables (files) depending on word length, like "multi" mode. The default mode is "single".

stored parameter. Format: stored=StoredHost[:StoredPort]. This parameter specifies the host (and the port, if given) where the stored daemon is running. Use it if you plan to use document excerpts and cached copies.

cached parameter. Format: cached=CachedHost[:CachedPort]. Use the cached daemon at the given host and port, if specified. It is required for cache storage mode only (see Section 5.2). Each indexer will connect to cached at the given address at startup.

charset parameter. Format: charset=DBCharacterSet. This parameter can be used to specify the database connection charset. The charset specified by DBCharacterSet should be the same as the charset specified by the LocalCharset command.

label parameter. Format: label=DBAlabel. This parameter may be used to assign a label to a DBAddr command. If you pass the label CGI variable to DataparkSearch, only the DBAddr commands marked with that label value will be used to perform the search. Thus, you can use one searchd daemon to answer queries for several search databases, selectable by the label variable.

Note: If no label is passed as a CGI parameter, only DBAddr commands without a label will be used to perform the search query.

Example:

DBAddr          mysql://foo:bar@localhost/dpsearch/?dbmode=single
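
The label rule described above can be sketched in a few lines of Python. This is only a model of the documented behaviour, not DataparkSearch code: the function names and the second address (with label=news) are hypothetical and used for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def dbaddr_label(dbaddr_url):
    """Return the label= parameter of a DBAddr URL, or None if it has none."""
    params = parse_qs(urlsplit(dbaddr_url).query)
    values = params.get("label")
    return values[0] if values else None


def select_dbaddrs(dbaddr_urls, cgi_label=None):
    """Keep only the DBAddr entries whose label equals the CGI label.

    With no CGI label, only unlabeled entries are kept, as the note says.
    """
    return [url for url in dbaddr_urls if dbaddr_label(url) == cgi_label]


ADDRS = [
    "mysql://foo:bar@localhost/dpsearch/?dbmode=single",
    # Hypothetical second database, labeled "news":
    "mysql://foo:bar@localhost/news/?dbmode=single&label=news",
]

print(select_dbaddrs(ADDRS, "news"))  # only the labeled address
print(select_dbaddrs(ADDRS))          # only the unlabeled address
```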

3.10.3. VarDir command

You may choose an alternative working directory for cache mode:

VarDir /usr/local/dpsearch/var

3.10.4. NewsExtensions command

Whether to enable news extensions. Default value is no.

NewsExtensions yes

3.10.5. SyslogFacility command

This is used if DataparkSearch was compiled with syslog support and you don't like the default value. The argument is the same as used in the syslog.conf file. For a list of possible facilities see syslog.conf(5).

SyslogFacility local7

3.10.6. Word length commands

Word lengths. You may change the default length range of words stored in the database. By default, words with a length from 1 to 32 are stored.

MinWordLength 1
MaxWordLength 32

3.10.7. MaxDocSize command

This command is used to specify the maximal document size. The default value is 1048576 (1 megabyte). Takes global effect for the whole config file.

MaxDocSize 1048576

3.10.8. MinDocSize command

This command causes URLs whose content size is less than the specified value to be handled as check-only. The default value is 0. Takes global effect for the whole config file.

MinDocSize 1024

3.10.9. IndexDocSizeLimit command

Use this command to specify the maximal amount of data stored in the index per document. The default value is 0, which means no limit. Takes effect till the next IndexDocSizeLimit command.

IndexDocSizeLimit 65536

3.10.10. URLSelectCacheSize command

Select number of targets to index at once. Default value is 1024.

URLSelectCacheSize 10240

3.10.11. URLDumpCacheSize command

Number of URLs selected at once when writing cache mode indexes, preloading URL data or calculating the Popularity Rank. Default value is 100000.

URLDumpCacheSize 10240

3.10.12. UseCRC32URLId command

Switches ID generation for URLs using HASH32 on or off. Default value is "no".

UseCRC32URLId yes

Switching it on speeds up indexing a bit, but a small number of collisions is possible.
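
To get a rough feeling for how small that number is, the birthday estimate can be applied. The sketch below is written in Python for illustration and assumes that HASH32 spreads URLs uniformly over all 2^32 possible IDs; this section does not state that property, so treat the figures as an approximation only.

```python
import math


def expected_collisions(url_count, id_bits=32):
    """Expected number of colliding URL pairs for uniformly distributed IDs."""
    pairs = url_count * (url_count - 1) / 2
    return pairs / 2 ** id_bits


def collision_probability(url_count, id_bits=32):
    """Approximate probability that at least one pair of URLs shares an ID."""
    return 1.0 - math.exp(-expected_collisions(url_count, id_bits))


for n in (10_000, 100_000, 1_000_000):
    print(n, round(expected_collisions(n), 3), round(collision_probability(n), 3))
```

Under this assumption the expected number of colliding pairs grows with the square of the number of URLs, so the option is better suited to smaller collections.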

3.10.13. HTTPHeader command

You may add desired headers to indexer's HTTP request. You should not use the "If-Modified-Since" and "Accept-Charset" headers; these headers are composed by indexer itself. The "User-Agent: DataparkSearch/version" header is sent too, but you may override it. The command has global effect for the whole configuration file.

HTTPHeader "User-Agent: My_Own_Agent"
HTTPHeader "Accept-Language: ru, en"
HTTPHeader "From: webmaster@mysite.com"

3.10.14. Allow command

Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

Use this command to allow URLs that match (or do not match) the given argument. The first three optional parameters describe the type of comparison. Default values are Match, NoCase, String. Use NoCase or Case to choose case insensitive or case sensitive comparison. Use Regex to choose regular expression comparison. Use String to choose string comparison with wildcards. Wildcards are '*' for any number of characters and '?' for one character. Note that '?' and '*' have special meaning in the String match type, so please use Regex to describe documents with '?' and '*' signs in the URL. String match is much faster than Regex; use String wherever possible. You may use several arguments for one Allow command, and you may use this command any number of times. Takes global effect for the config file. Note that DataparkSearch automatically adds one "Allow regex .*" command after reading the config file. It means that everything that is not disallowed is allowed.

Examples

#  Allow everything:
Allow *
#  Allow everything but .php .cgi .pl extensions case insensitively using regex:
Allow NoMatch Regex \.php$|\.cgi$|\.pl$
#  Allow .HTM extension case sensitively:
Allow Case *.HTM
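
The comparison options can be modelled with a short Python sketch. It is an illustration of the rules described above, not the indexer's implementation; the function names are hypothetical, Python's re module stands in for the indexer's regex engine, and it is assumed that a String pattern must cover the whole URL while a Regex may match anywhere in it.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a String pattern ('*' and '?' wildcards) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any number of characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def rule_matches(url, arg, match=True, case=False, regex=False):
    """Evaluate one argument with the Match/NoMatch, Case/NoCase, String/Regex options."""
    flags = 0 if case else re.IGNORECASE
    if regex:
        found = re.search(arg, url, flags) is not None
    else:
        # Assumption: a String pattern must cover the whole URL.
        found = re.fullmatch(wildcard_to_regex(arg), url, flags | re.DOTALL) is not None
    return found if match else not found


# Case sensitive String match on the .HTM extension:
print(rule_matches("http://example.com/index.HTM", "*.HTM", case=True))
# NoMatch Regex: true for every URL that does not end in .php, .cgi or .pl:
print(rule_matches("http://example.com/a.html", r"\.php$|\.cgi$|\.pl$", match=False, regex=True))
```

The defaults of rule_matches mirror the documented defaults Match, NoCase, String.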

3.10.15. Disallow command

Disallow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

Use this command to disallow URLs that match (or do not match) the given argument. The meaning of the first three optional parameters is exactly the same as for the Allow command. You can use several arguments for one Disallow command. Takes global effect for the config file. Examples:

# Disallow URLs that are not in udm.net domains using "string" match:
Disallow NoMatch *.udm.net/*
# Disallow any except known extensions and directory index using "regex" match:
Disallow NoMatch Regex \/$|\.htm$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
# Exclude cgi-bin and non-parsed-headers using "string" match:
Disallow */cgi-bin/* *.cgi */nph-*
# Exclude anything with '?' sign in URL. Note that '?' sign has a 
# special meaning in "string" match, so we have to use "regex" match here:
Disallow Regex  \?

3.10.16. CheckOnly command

CheckOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

The meaning of the first three optional parameters is exactly the same as for the Allow command. Indexer will use the HEAD instead of the GET HTTP method for URLs that match (or do not match) the given arguments. It means that the file will only be checked for existence and will not be downloaded. This is useful for zip, exe, arj and other binary files. Note that you can also disallow those files with the Disallow command (see Section 3.10.15). You may use several arguments for one CheckOnly command. CheckOnly is useful, for example, for searching through URL names rather than contents (a la FTP-search). Takes global effect for the config file. Examples:

# Check some known non-text extensions using "string" match:
CheckOnly *.b	  *.sh   *.md5
# or check ANY except known text extensions using "regex" match:
CheckOnly NoMatch Regex \/$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$

3.10.17. HrefOnly command

HrefOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]

The meaning of the first three optional parameters is exactly the same as for the Allow command. Use this command to scan an HTML page for the "href" attributes of tags, without indexing the page contents, for URLs that match (or do not match) the given argument. The command has global effect for the whole configuration file. When indexing large mailing list archives, for example, the index and thread index pages (like mail.10.html, thread.21.html, etc.) should be scanned for links but should not be indexed:

HrefOnly */mail*.html */thread*.html

3.10.18. CheckMp3 command

CheckMp3 [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]

The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, indexer will download only a small part of the document and try to find MP3 tags in it. On success, indexer will parse the MP3 tags; otherwise it will download the whole document and then parse it as usual. Note: this works only with servers that support the HTTP/1.1 protocol. The "Range: bytes" header is used to download the MP3 tag.

CheckMp3 *.bin *.mp3

3.10.19. CheckMp3Only command

CheckMP3Only [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]

The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, indexer, as with the CheckMP3 command, will download only a small part of the document and try to find MP3 tags. On success, indexer will parse the MP3 tags; otherwise it will NOT download the whole document.

CheckMP3Only *.bin *.mp3

3.10.20. IndexIf command

IndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Use this command to allow indexing if the value of the section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).

Example

IndexIf regex Title Manual
IndexIf body "*important detail*"

3.10.21. NoIndexIf command

NoIndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

Use this command to disallow indexing if the value of the section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).

Example

NoIndexIf regex Title Sex
NoIndexIf body *xxx*

3.10.22. AllowIf command

AllowIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

This command is similar to the Allow command (see Section 3.10.14), but it is applicable to any section of the indexed document, and it is applied after the content of the document has been downloaded and indexed. Use this command to allow indexing if the value of the section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command.

Example

AllowIf regex Title Manual
AllowIf body "*important detail*"

3.10.23. DisallowIf command

DisallowIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]

This command is similar to the Disallow command (see Section 3.10.15), but it is applicable to any section of the indexed document, and it is applied after the content of the document has been downloaded and indexed. Use this command to delete the corresponding document from the database if the value of the section matches the given arg pattern. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).

Example

DisallowIf regex Title Sex
DisallowIf body *xxx*

3.10.24. HoldBadHrefs command

HoldBadHrefs <time>

How long to hold URLs with an erroneous status before deleting them from the database. For example, if a host is down, indexer will not delete pages from this site immediately and search will use the previous content of these pages. However, if the site doesn't respond for a month, it is probably time to remove these pages from the database. For the <time> format see the description of the Period command in Section 3.10.28.

HoldBadHrefs 30d

3.10.25. DeleteOlder command

DeleteOlder <time>

How long to hold URLs before deleting them from the database. For example, when indexing news sites, you may automatically delete old news articles after a specified period. For the <time> format see the description of the Period command in Section 3.10.28. The default value is 0, which means "do not check". You may specify several DeleteOlder commands, for example, one for every Server command.

DeleteOlder 7d

3.10.26. UseRemoteContentType command

UseRemoteContentType yes/no

This command specifies whether indexer should get the content type from the HTTP server headers (yes) or from its AddType settings (no). If set to 'no' and indexer cannot determine the content type using its AddType settings, it will use the HTTP header. Default: yes.

UseRemoteContentType yes
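
The decision described above can be written down as a small Python sketch. It only models the documented rule; pick_content_type is a hypothetical name, the (mime type, pattern) pairs stand in for AddType commands, and the standard fnmatchcase function is used as an approximation of the case sensitive String match (it also gives '[' a special meaning, which the String match described in this manual does not mention).

```python
from fnmatch import fnmatchcase


def pick_content_type(url, header_type, add_types, use_remote=True):
    """Choose a content type following the UseRemoteContentType description.

    add_types is a list of (mime_type, pattern) pairs standing in for
    AddType commands; header_type is the server's Content-Type value.
    """
    if use_remote:
        return header_type
    for mime_type, pattern in add_types:
        if fnmatchcase(url, pattern):
            return mime_type
    # No AddType rule matched: fall back to the HTTP header.
    return header_type


rules = [("image/x-xpixmap", "*.xpm")]
print(pick_content_type("http://example.com/icon.xpm", "text/plain", rules, use_remote=False))
```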

3.10.27. AddType command

AddType [String|Regex] [Case|NoCase] <mime type> <arg> [<arg>...]

This command associates filename extensions (for services that don't automatically include them) with their MIME types. Currently the "file:" protocol uses these commands. Use the optional first two parameters to choose the comparison type. The default type is "String" "Case" (case sensitive string match with '?' and '*' wildcards for one and several characters respectively).

AddType image/x-xpixmap	*.xpm

3.10.28. Period command

Period <time>

Set the reindex period. <time> is in the form 'xxxA[yyyB[zzzC]]' (spaces are allowed between xxx and A, between yyy and B, and so on), where xxx, yyy, zzz are numbers (which can be negative!) and A, B, C can each be one of the following:

 s - second
 M - minute
 h - hour
 d - day
 m - month
 y - year

(these letters are the same as in the strptime/strftime functions). Examples:

 15s - 15 seconds
 4h30M - 4 hours and 30 minutes
 1y6m-15d - 1 year and 6 months minus 15 days
 1h-10M+1s - 1 hour minus 10 minutes plus 1 second

If you specify only a number without any letter, the time is assumed to be given in seconds. Can be set many times before the Server command and takes effect till the end of the config file or till the next Period command.

Period 7d
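
The <time> format can be illustrated with a small parser. The following Python sketch is not taken from the indexer; parse_period is a hypothetical name, and the lengths used for a month (30 days) and a year (365 days) are assumptions, since this section does not define them.

```python
import re

# Seconds per unit. The month and year lengths are assumptions made for
# this sketch (30 and 365 days); the manual does not state them.
UNIT_SECONDS = {
    "s": 1,
    "M": 60,
    "h": 3600,
    "d": 86400,
    "m": 30 * 86400,
    "y": 365 * 86400,
}

# One term: optional sign, digits, optional spaces, unit letter.
_TERM = re.compile(r"\s*([+-]?\d+)\s*([sMhdmy])")


def parse_period(text):
    """Return the number of seconds described by a <time> string."""
    text = text.strip()
    if not text:
        raise ValueError("empty <time> value")
    # A bare number is taken as seconds.
    if re.fullmatch(r"[+-]?\d+", text):
        return int(text)
    total = 0
    pos = 0
    while pos < len(text):
        term = _TERM.match(text, pos)
        if term is None:
            raise ValueError("bad <time> value: {!r}".format(text))
        total += int(term.group(1)) * UNIT_SECONDS[term.group(2)]
        pos = term.end()
    return total


for example in ("15s", "4h30M", "1h-10M+1s", "7d"):
    print(example, parse_period(example))
```

Note that the letters are case sensitive: M is a minute and m is a month, so 1M30s and 1m30s describe very different periods.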

3.10.29. PeriodByHops command

PeriodByHops <hops> [ <time> ]

Set the reindex period on a per <hops> basis. The format for <time> is the same as for Period.

Can be set many times before the Server command and takes effect till the end of the config file or till the next PeriodByHops command with the same <hops> value. If the <time> parameter is omitted, the previously defined value is undefined.

If no PeriodByHops command is specified for a given <hops> value, the value defined by the Period command is used.

3.10.30. ExpireAt command

ExpireAt [ A [ B [ C [ D [ E ]]]]]

This command allows you to specify the exact expiration time for documents. It may be specified on a per Server/Realm basis and takes effect till the end of the config file or till the next ExpireAt command. ExpireAt specified without any arguments disables the previously specified value. The arguments are:

 A - minute, may be * or 0-59
 B - hour, may be * or 0-23
 C - day of month, may be * or 1-31
 D - month, may be * or 1-12
 E - day of week, may be * or 0-6 (0 is Sunday)

The ExpireAt command has higher priority than the Period or PeriodByHops commands.
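
One way to read the five fields is shown in the Python sketch below, which checks whether a given moment fits an ExpireAt argument list. It is only an illustration: the function names are hypothetical, and treating omitted trailing fields like '*' is an assumption that this section does not confirm.

```python
from datetime import datetime


def field_matches(field, value):
    """A field is either '*' (anything) or a single number."""
    return field == "*" or int(field) == value


def expire_at_matches(spec, moment):
    """Check whether a datetime fits an ExpireAt argument list."""
    fields = spec.split()
    if len(fields) > 5:
        raise ValueError("ExpireAt takes at most five fields")
    # Assumption: omitted trailing fields behave like '*'.
    fields += ["*"] * (5 - len(fields))
    minute, hour, day, month, weekday = fields
    return (
        field_matches(minute, moment.minute)
        and field_matches(hour, moment.hour)
        and field_matches(day, moment.day)
        and field_matches(month, moment.month)
        # isoweekday() is 1..7 with Sunday = 7; modulo 7 maps Sunday to 0.
        and field_matches(weekday, moment.isoweekday() % 7)
    )


# "0 3 * * 0" reads as 03:00 on Sundays; 7 January 2024 was a Sunday.
print(expire_at_matches("0 3 * * 0", datetime(2024, 1, 7, 3, 0)))
```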

3.10.31. UseDateHeader command

UseDateHeader yes|no|force

Use the Date header if no Last-Modified header is sent by the remote web server. The value "force" instructs indexer to use the Date header even if a Last-Modified header has been sent by the remote server. Default value: no.

3.10.32. LMDSection command

LMDSection <section name>

This command specifies the section which will be used as the document's last modification date instead of the Last-Modified header sent by the remote web server. Can be set many times before the Server command and takes effect till the end of the config file or till the next LMDSection command. The default value is undefined. Use this command without any argument to make its value undefined. If the section specified by this command has no value for the current document, the value of the Last-Modified header will be used.

3.10.33. MaxHops command

MaxHops <number>

Limits the length of the path from a seed URL to the indexed one, measured in "mouse clicks". Default value is 256. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxHops command.

MaxHops 256

3.10.34. TrackHops command

TrackHops yes|no

This command enables or disables hops tracking during reindexing. Default value is no. If enabled, the hops value for a URL is recalculated when reindexing. Otherwise the hops value is calculated only once, when the URL is inserted into the database.

TrackHops yes

3.10.35. MaxDepth command

MaxDepth <number>

Limits the directory depth of an indexed URL. Default value is 16. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxDepth command.

MaxDepth 2

3.10.36. MaxDocsPerServer command

MaxDocsPerServer <number>

Limits the number of documents retrieved from a Server. Default value is -1, which means no limit. If set to a positive value, no more than the given number of pages will be indexed from one server during this run of indexer. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxDocsPerServer command.

MaxDocsPerServer 100

3.10.37. MaxHrefsPerServer command

MaxHrefsPerServer <number>

Limits the number of hrefs accepted from a Server. Default value is -1, which means no limit. If set to a positive value, no more than the given number of hrefs will be picked up from one server during this run of indexer. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxHrefsPerServer command.

MaxHrefsPerServer 100

3.10.38. MaxNetErrors command

MaxNetErrors <number>

Maximum number of network errors for each server. Default value is 16. Use 0 for an unlimited number of errors. If there are too many network errors on some server (server is down, host unreachable, etc.), indexer will make no more than 'number' attempts to connect to this server. Takes effect till the end of the config file or till the next MaxNetErrors command.

MaxNetErrors 16

3.10.39. ReadTimeOut command

ReadTimeOut <time>

Connect timeout and stalled connections timeout. For the <time> format see Section 3.10.28. Default value is 30 seconds. Can be set any number of times before the Server command and takes effect till the end of the config file or till the next ReadTimeOut command.

ReadTimeOut 30s

3.10.40. DocTimeOut command

DocTimeOut <time>

Maximum amount of time indexer spends downloading one document. For the <time> format see Section 3.10.28. Default value is 90 seconds. Can be set any number of times before the Server command and takes effect till the end of the config file or till the next DocTimeOut command.

DocTimeOut 1M30s

3.10.41. NetErrorDelayTime command

NetErrorDelayTime <time>

Specify the document processing delay time if a network error has occurred. For the <time> format see Section 3.10.28. Default value is one day.

NetErrorDelayTime 1d

3.10.42. Cookies command

Cookies yes/no

Enables/Disables the support for HTTP cookies. Command may be used several times before Server command and takes effect till the end of config file or till next Cookies command. Default value is "no".

Cookies yes

3.10.43. Section command

Section <string> <number> <maxlen> [strict] [ <pattern> <replacement> ]

where <string> is a section name and <number> is a section ID between 0 and 255. Use 0 if you don't want to index a section. It is better to use different IDs for different sections: at search time you will then be able to give a different weight to each section, or even to exclude some sections. The <maxlen> argument is the maximum length of the section value that will be stored in the database. Use 0 for <maxlen> if you don't want to store this section. <pattern> and <replacement> are a regex-like pattern and replacement used to extract the section value from the document content.

You can specify the strict option to set strict string tokenization for a section, which means a word break at any non-character symbol regardless of context. This is useful, for example, when indexing URLs, where the hyphen character is used as a delimiter between words.

You can specify the single option for a single-value section, for which any second value will be skipped during processing. This is useful, for example, to clean up titles of pages with frames or to remove doubled titles when libextractor is used.

# Standard HTML sections: body, title
Section	body			1	256
Section title			2	128
# strict tokenization for URL
Section url                     3       0 strict
# regex-pattern for a section
Section GoodName                4       128 "<h1>([^<]*)</h1>" "<b>GoodName:</b> $1"
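
How the <pattern> and <replacement> pair of the GoodName example could work is sketched below in Python. This is an illustration under assumptions, not the indexer's code: extract_section is a hypothetical name, Python's re module stands in for the indexer's regex-like engine, only the first match is used, and the sample HTML is invented.

```python
import re


def extract_section(content, pattern, replacement):
    """Return the section value built from the first match, or None."""
    match = re.search(pattern, content)
    if match is None:
        return None
    # Convert $1-style references into Python's \g<1> form.
    template = re.sub(r"\$(\d)", r"\\g<\1>", replacement)
    return match.expand(template)


# Invented sample document for the illustration:
html = "<html><body><h1>Blue Widget</h1><p>Details</p></body></html>"
print(extract_section(html, "<h1>([^<]*)</h1>", "<b>GoodName:</b> $1"))
```

For this sample the section value becomes the replacement text with $1 filled in from the first <h1> element.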

3.10.44. HrefSection command

HrefSection <string> [ <pattern> <replacement> ]

where <string> is a section name, <pattern> and <replacement> are a regex-like pattern and replacement to extract section value from document content. Use this command to extract links from document content.

# Link sections: a plain one and one extracted with a regex pattern
HrefSection	link
HrefSection     NewLink "<newlink>([^<]*)</newlink>" "$1"

3.10.45. FastHrefCheck command

The "FastHrefCheck yes" command is useful to speed up indexing when you have a huge list of Server/Realm/Subnet commands, as it disables checking of hrefs against the server list during parsing.

3.10.46. Index command

Index yes/no

"Index no" prevents indexer from storing words in the database. This is useful, for example, for link validation. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next Index command. Default value is "yes".

Index no

3.10.47. ProxyAuthBasic command

ProxyAuthBasic login:passwd

Specify the username and password for HTTP proxy basic authorization and for SOCKS5 authorization. Can be used before every Server command and takes effect only for the next Server command! It should also be placed before the Proxy command. Example:

ProxyAuthBasic somebody:something  

3.10.48. Proxy command

Proxy [http|socks5] your.proxy.host[:port]

Use a proxy rather than connecting directly. You can specify either the HTTP or the SOCKS5 proxy type; the HTTP proxy type is used by default. One can index FTP servers when using an HTTP proxy. The default port, if not specified, is 3128 (Squid). If the proxy host is not specified, a direct connection will be used. Can be set before every Server command and takes effect till the end of the config file or till the next Proxy command. If no Proxy command is specified, indexer will use a direct connection. Examples:

#           Proxy on atoll.anywhere.com, port 3128:
Proxy atoll.anywhere.com
#           Proxy on lota.anywhere.com, port 8090:
Proxy lota.anywhere.com:8090
#	    Proxy on local Tor
Proxy socks5 localhost:9050
#           Disable proxy (direct connect):
Proxy

3.10.49. AuthBasic command

AuthBasic login:passwd

Use basic HTTP authorization. Can be set before every Server command and takes effect only for the next Server command! Examples:

AuthBasic somebody:something  

# If you have password protected directory(-ies), but the whole server is open, use:
AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/

3.10.50. ServerWeight command

ServerWeight <number>

Server weight for Popularity Rank calculation (see Section 8.5.3). Default value is 1.

ServerWeight 1

3.10.51. OptimizeAtUpdate command

OptimizeAtUpdate yes

Specify the word index optimization strategy. Default value: no. If enabled, this saves disk space but slows down indexing. May be placed in indexer.conf and cached.conf.

3.10.52. SkipUnreferred command

SkipUnreferred yes|no|del

Default value: no. Use this command to skip reindexing of, or to delete, unreferred documents. An unreferred document is a document with no links to it. This command requires links collection to be enabled (see Section 8.5.3).

3.10.53. Bind command

Bind 127.0.0.1

You may use this command to specify the local IP address if your system has several network interfaces.

3.10.54. ProvideReferer command

ProvideReferer yes

Use this command to provide Referer: request header for HTTP and HTTPS connections.

3.10.55. LongestTextItems command

LongestTextItems 4

Use this command to specify the number of longest text items to index.

3.10.56. MakePrefixes command

With the MakePrefixes yes command you can instruct indexer to automatically produce all prefixes for the words indexed. This is suitable, for example, for making search suggestions.