You may include another configuration file at any place in indexer.conf using the Include <filename> command. The path is absolute if <filename> starts with "/":
Include /usr/local/dpsearch/etc/inc1.conf
Otherwise the path is relative:
Include inc1.conf
The DBAddr command is a URL-style database description. It specifies the options (type, host, database name, port, user and password) used to connect to an SQL database. It should be used before any other commands. You may specify several DBAddr commands; in this case DataparkSearch will merge the results from every database specified. The command has global effect for the whole config file. Format:
DBAddr <Type>:[//[User[:Pass]@]Host[:Port]]/DBName/[?[dbmode=mode]{&<parameter name>=<parameter value>}]
Note: ODBC related. Use DBName to specify the ODBC data source name (DSN). Host does not matter; use "localhost".
Note: Solid related. Use Host to specify the Solid server. DBName does not matter for Solid.
You may use CGI-like encoding for User and Pass if you need to use special characters in the user name or password. For example, if your password is ABC@DEF, you should write it as ABC%40DEF.
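As a minimal sketch of this encoding, the line below reuses the password ABC@DEF from the text, with "@" written as %40. The user name foo, the host localhost and the database name dpsearch are placeholders only:

```
DBAddr mysql://foo:ABC%40DEF@localhost/dpsearch/
```

Without the encoding, the "@" inside the password could be confused with the "@" that separates the credentials from the host.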
Currently supported Type values are mysql, pgsql, msql, solid, mssql, oracle, ibase and sqlite. The value does not actually matter when native library support is used, but ODBC users should specify one of the supported values. If your database type is not supported, you may use "unknown" instead.
MySQL and PostgreSQL users can specify the path to a Unix socket when connecting to localhost:
mysql://foo:bar@localhost/dpsearch/?socket=/tmp/mysql.sock
If you are using PostgreSQL and do not specify a hostname, e.g. pgsql://user:password@/dbname/, then PostgreSQL will not connect via TCP but will use the default Unix socket.
dbmode parameter. You may also select the database mode of word storage.
When "single" is specified, all words are stored in the same table (file).
If "multi" is selected, words are placed in different tables (files) depending on their lengths. "multi" mode is usually faster but requires more tables (files).
If "crc" mode is selected, DataparkSearch stores 32-bit integer word IDs calculated by the HASH32 algorithm instead of the words themselves. This mode requires less disk space and is faster than the "single" and "multi" modes; however, it does not support substring searches.
"crc-multi" uses the same storage structure as the "crc" mode, but also stores words in different tables (files) depending on word length, like the "multi" mode. The default mode is "single".
stored parameter. Format: stored=StoredHost[:StoredPort]. This parameter specifies the host and, if given, the port where the stored daemon is running. Use it if you plan to use document excerpts and cached copies.
cached parameter. Format: cached=CachedHost[:CachedPort]. Use the cached daemon at the given host and, if specified, port. It is required for cache storage mode only (see Section 5.2). Each indexer will connect to cached at the given address at startup.
charset parameter. Format: charset=DBCharacterSet. This parameter can be used to specify the database connection charset. The charset specified by DBCharacterSet should match the charset specified by the LocalCharset command.
label parameter. Format: label=DBAlabel. This parameter may be used to assign a label to a DBAddr command. If you pass the label CGI variable to DataparkSearch, only the DBAddr commands marked with that label value will be used to perform the search. Thus, you can use one searchd daemon to answer queries for several search databases, selectable by the label variable.
Note: If no label is passed as a CGI parameter, only the DBAddr commands without a label will be used to perform the search query.
Example:
DBAddr mysql://foo:bar@localhost/dpsearch/?dbmode=single
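The label mechanism might be used as in the following sketch. The database name dpnews and the label value news are hypothetical, and the label parameter is appended with "&" as shown in the Format line for DBAddr:

```
# Used when no label CGI variable is passed
DBAddr mysql://foo:bar@localhost/dpsearch/?dbmode=single
# Used only when the search request carries label=news
DBAddr mysql://foo:bar@localhost/dpnews/?dbmode=single&label=news
```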
You may choose an alternative working directory for cache mode:
VarDir /usr/local/dpsearch/var
Whether to enable news extensions. Default value is no.
NewsExtensions yes
This is used if DataparkSearch was compiled with syslog support and you do not want the default facility. The argument is the same as used in the syslog.conf file. For the list of possible facilities see syslog.conf(5).
SyslogFacility local7
Word lengths. You may change default length range of words stored in database. By default, words with the length in the range from 1 to 32 are stored.
MinWordLength 1
MaxWordLength 32
This command is used to specify the maximal document size. Default value is 1048576 (1 Megabyte). Takes global effect for the whole config file.
MaxDocSize 1048576
This command is used to only check URLs whose content size is less than the value specified. Default value is 0. Takes global effect for the whole config file.
MinDocSize 1024
Use this command to specify the maximal amount of data stored in index per document. Default value 0. This means no limit. Takes effect till next IndexDocSizeLimit command.
IndexDocSizeLimit 65536
Select number of targets to index at once. Default value is 1024.
URLSelectCacheSize 10240
Select at once this number of urls to write cache mode indexes, to preload url data or to calculate the Popularity Rank. Default value is 100000.
URLDumpCacheSize 10240
Switch on or off the ID generation for URL using HASH32. Default value is "no".
UseCRC32URLId yes
Switching it on speeds up indexing a bit, but a small number of collisions is possible.
You may add desired headers to indexer's HTTP request. You should not use the "If-Modified-Since" and "Accept-Charset" headers; these headers are composed by indexer itself. The "User-Agent: DataparkSearch/version" header is sent too, but you may override it. The command has global effect for the whole configuration file.
HTTPHeader "User-Agent: My_Own_Agent"
HTTPHeader "Accept-Language: ru, en"
HTTPHeader "From: webmaster@mysite.com"
Allow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]
Use this command to allow URLs that match (or do not match) the given argument. The first three optional parameters describe the type of comparison. Default values are Match, NoCase, String. Use NoCase or Case to choose case insensitive or case sensitive comparison. Use Regex to choose regular expression comparison. Use String to choose string comparison with wildcards. Wildcards are '*' for any number of characters and '?' for one character. Note that '?' and '*' have special meaning in the String match type; please use Regex to describe documents with '?' and '*' signs in the URL. String match is much faster than Regex. Use String where possible.
You may use several arguments for one Allow command. You may use this command any number of times. Takes global effect for the config file. Note that DataparkSearch automatically adds one "Allow regex .*" command after reading the config file. This means that everything that is not disallowed is allowed.
Examples
# Allow everything:
Allow *
# Allow everything but .php .cgi .pl extensions case insensitively using regex:
Allow NoMatch Regex \.php$|\.cgi$|\.pl$
# Allow .HTM extension case sensitively:
Allow Case *.HTM
Disallow [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]
Use this command to disallow URLs that match (or do not match) the given argument. The meaning of the first three optional parameters is exactly the same as for the Allow command. You can use several arguments for one Disallow command. Takes global effect for the config file. Examples:
# Disallow URLs that are not in udm.net domains using "string" match:
Disallow NoMatch *.udm.net/*
# Disallow any except known extensions and directory index using "regex" match:
Disallow NoMatch Regex \/$|\.htm$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
# Exclude cgi-bin and non-parsed-headers using "string" match:
Disallow */cgi-bin/* *.cgi */nph-*
# Exclude anything with '?' sign in URL. Note that '?' sign has a
# special meaning in "string" match, so we have to use "regex" match here:
Disallow Regex \?
CheckOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]
The meaning of the first three optional parameters is exactly the same as for the Allow command. Indexer will use the HEAD instead of the GET HTTP method for URLs that match (or do not match) the given arguments. This means that the file will only be checked for existence and will not be downloaded. Useful for zip, exe, arj and other binary files. Note that you can also disallow those files with the Disallow command. You may use several arguments for one CheckOnly command. Useful, for example, for searching through URL names rather than contents (a la FTP-search). Takes global effect for the config file. Examples:
# Check some known non-text extensions using "string" match:
CheckOnly *.b *.sh *.md5
# or check ANY except known text extensions using "regex" match:
CheckOnly NoMatch Regex \/$|\.html$|\.shtml$|\.phtml$|\.php$|\.txt$
HrefOnly [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ... ]
The meaning of the first three optional parameters is exactly the same as for the Allow command. Use this command to scan an HTML page for the "href" attributes of tags without indexing the contents of the page, for URLs that match (or do not match) the given argument. The command has global effect for the whole configuration file. When indexing large mailing list archives, for example, the index and thread index pages (like mail.10.html, thread.21.html, etc.) should be scanned for links but should not be indexed:
HrefOnly */mail*.html */thread*.html
CheckMp3 [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, indexer will download only a small part of the document and try to find MP3 tags in it. On success, indexer will parse the MP3 tags; otherwise it will download the whole document and parse it as usual. Note: this works only with servers that support the HTTP/1.1 protocol. The "Range: bytes" header is used to download the MP3 tag.
CheckMp3 *.bin *.mp3
CheckMP3Only [Match|NoMatch] [NoCase|Case] [String|Regex] <arg> [<arg> ...]
The meaning of the first three optional parameters is exactly the same as for the Allow command. If a URL matches the given rules, indexer will, as with the CheckMp3 command, download only a small part of the document and try to find MP3 tags. On success, indexer will parse the MP3 tags; otherwise it will NOT download the whole document.
CheckMP3Only *.bin *.mp3
IndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]
Use this command to allow indexing if the value of section matches the arg pattern given. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).
Example
IndexIf regex Title Manual
IndexIf body "*important detail*"
NoIndexIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]
Use this command to disallow indexing if the value of section matches the arg pattern given. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).
Example
NoIndexIf regex Title Sex
NoIndexIf body *xxx*
AllowIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]
This command is similar to the Allow command (see Section 3.10.14), but is applicable to any section of the indexed document, and it is applied after the content of the document has been downloaded and indexed. Use this command to allow indexing if the value of section matches the arg pattern given. The meaning of the first three optional parameters is exactly the same as for the Allow command.
Example
AllowIf regex Title Manual
AllowIf body "*important detail*"
DisallowIf [Match|NoMatch] [NoCase|Case] [String|Regex] <section> <arg> [<arg> ... ]
This command is similar to the Disallow command (see Section 3.10.15), but is applicable to any section of the indexed document, and it is applied after the content of the document has been downloaded and indexed. Use this command to delete the corresponding document from the database if the value of section matches the arg pattern given. The meaning of the first three optional parameters is exactly the same as for the Allow command (see Section 3.10.14).
Example
DisallowIf regex Title Sex
DisallowIf body *xxx*
HoldBadHrefs <time>
How long to hold URLs with an erroneous status before deleting them from the database. For example, if a host is down, indexer will not delete pages from this site immediately and search will use the previous content of these pages. However, if the site does not respond for a month, it is probably time to remove these pages from the database. For the <time> format see the description of the Period command in Section 3.10.28.
HoldBadHrefs 30d
DeleteOlder <time>
How long to hold URLs before deleting them from the database. For example, when indexing news sites, you may automatically delete old news articles after a specified period. For the <time> format see the description of the Period command in Section 3.10.28. Default value is 0, which means "do not check". You may specify several DeleteOlder commands, for example, one for every Server command.
DeleteOlder 7d
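A sketch of using one DeleteOlder value per Server command. The host names are hypothetical, and it is assumed here (the text does not say so explicitly) that a DeleteOlder value applies to the Server commands that follow it, until the next DeleteOlder command:

```
# Delete news articles from the database after 7 days
DeleteOlder 7d
Server http://news.example.com/
# 0 means "do not check"
DeleteOlder 0
Server http://docs.example.com/
```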
UseRemoteContentType yes/no
This command specifies whether the indexer should get the content type from HTTP server headers (yes) or from its AddType settings (no). If set to 'no' and the indexer cannot determine the content type using its AddType settings, it will use the HTTP header. Default: yes
UseRemoteContentType yes
AddType [String|Regex] [Case|NoCase] <mime type> <arg> [<arg>...]
This command associates filename extensions (for services that don't automatically include them) with their mime types. Currently the "file:" protocol uses these commands. Use the optional first two parameters to choose the comparison type. Default type is "String" "Case" (case insensitive string match with '?' and '*' wildcards for one and several characters respectively).
AddType image/x-xpixmap *.xpm
Period <time>
Set reindex period. <time> is in the form 'xxxA[yyyB[zzzC]]' (spaces are allowed between xxx and A, yyy and so on), where xxx, yyy, zzz are numbers (can be negative!) and A, B, C can be one of the following:
s - second
M - minute
h - hour
d - day
m - month
y - year
(these letters are the same as in the strptime/strftime functions). Examples:
15s - 15 seconds
4h30M - 4 hours and 30 minutes
1y6m-15d - 1 year and six months minus 15 days
1h-10M+1s - 1 hour minus 10 minutes plus 1 second
If you specify only a number without any letter, the time is assumed to be given in seconds. Can be set many times before the Server command and takes effect till the end of the config file or till the next Period command.
Period 7d
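Because Period takes effect until the next Period command, different parts of a site can be given different reindex periods. The following is only a sketch: the /news/ path is hypothetical, and the more specific Server command is listed first, following the pattern of the AuthBasic example in this document:

```
# Reindex the news pages every 12 hours
Period 12h
Server http://my.server.com/news/
# Everything else on the server: once a week
Period 7d
Server http://my.server.com/
```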
PeriodByHops <hops> [ <time> ]
Set the reindex period on a per-<hops> basis. The format for <time> is the same as for Period.
Can be set many times before the Server command and takes effect till the end of the config file or till the next PeriodByHops command with the same <hops> value. If the <time> parameter is omitted, the previously defined value is undefined.
If no PeriodByHops command is specified for a given <hops> value, the value defined by the Period command is used.
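A possible combination of Period and PeriodByHops is sketched below. The second host name is hypothetical, and it is an assumption that the seed URL itself has a hops value of 0:

```
# Fallback period for hops values without their own setting
Period 7d
# Documents 0 and 1 hops away from the seed URL are reindexed more often
PeriodByHops 0 1d
PeriodByHops 1 3d
Server http://my.server.com/
# Undefine the value for hops 1; Period applies to it from here on
PeriodByHops 1
Server http://other.example.com/
```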
ExpireAt [ A [ B [ C [ D [ E ]]]]]
This command allows you to specify the exact expiration time for documents. It may be specified on a per Server/Realm basis and takes effect till the end of the config file or till the next ExpireAt command. ExpireAt specified without any arguments disables the previously specified value.
A - minute, may be * or 0-59;
B - hour, may be * or 0-23;
C - day of month, may be * or 1-31;
D - month, may be * or 1-12;
E - day of week, may be * or 0-6, where 0 is Sunday.
The ExpireAt command has higher priority than the Period or PeriodByHops commands.
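As an illustration of the five fields, the sketch below expires documents at 03:00 every Sunday for one server and then disables the setting for the next one. The second host name is hypothetical:

```
# minute=0, hour=3, any day of month, any month, day of week=0 (Sunday)
ExpireAt 0 3 * * 0
Server http://my.server.com/
# Disable the previously specified value
ExpireAt
Server http://other.example.com/
```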
UseDateHeader yes|no|force
Use the Date header if no Last-Modified header is sent by the remote web server. The value "force" instructs indexer to use the Date header even if a Last-Modified header has been sent by the remote server. Default value: no.
LMDSection <section name>
This command specifies the section which will be used as the document's last modification date instead of the Last-Modified header sent by the remote web server. Can be set many times before the Server command and takes effect till the end of the config file or till the next LMDSection command. Default value is undefined. Use this command without any argument to make its value undefined. If the value of the section specified by this command is not defined for the current document, the value of the Last-Modified header will be used.
MaxHops <number>
Limits the length of the path from a seed URL to the indexed one, in "mouse clicks". Default value is 256. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxHops command.
MaxHops 256
TrackHops yes|no
This command enables or disables hops tracking during reindexing. Default value is no. If enabled, the hops value for a URL is recalculated when reindexing. Otherwise the hops value is calculated only once, when the URL is inserted into the database.
TrackHops yes
MaxDepth <number>
It limits the directory depth of an URL indexed. Default value is 16. Can be set multiple times before "Server" command and takes effect till the end of config file or till next MaxDepth command.
MaxDepth 2
MaxDocsPerServer <number>
Limits the number of documents retrieved from a Server. Default value is -1, which means no limit. If set to a positive value, no more than the given number of pages will be indexed from one server during this run of indexer. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxDocsPerServer command.
MaxDocsPerServer 100
MaxHrefsPerServer <number>
Limits the number of hrefs accepted from a Server. Default value is -1, which means no limit. If set to a positive value, no more than the given number of hrefs will be picked up from one server during this run of indexer. Can be set multiple times before the Server command and takes effect till the end of the config file or till the next MaxHrefsPerServer command.
MaxHrefsPerServer 100
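The crawl limits above can be combined per server, since each one takes effect until its next occurrence. A sketch, with a hypothetical second host name and arbitrary limit values:

```
# Small crawl: at most 3 clicks from the seed, 2 directory levels, 100 pages
MaxHops 3
MaxDepth 2
MaxDocsPerServer 100
Server http://my.server.com/
# Back to the documented defaults for the next server
MaxHops 256
MaxDepth 16
MaxDocsPerServer -1
Server http://other.example.com/
```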
MaxNetErrors <number>
Maximum number of network errors for each server. Default value is 16. Use 0 for an unlimited number of errors. If there are too many network errors on some server (server is down, host unreachable, etc.), indexer will make no more than 'number' attempts to connect to this server. Takes effect till the end of the config file or till the next MaxNetErrors command.
MaxNetErrors 16
ReadTimeOut <time>
Connect timeout and stalled connections timeout. For the <time> format see Section 3.10.28. Default value is 30 seconds. Can be set any number of times before the Server command and takes effect till the end of the config file or till the next ReadTimeOut command.
ReadTimeOut 30s
DocTimeOut <time>
Maximum amount of time indexer spends downloading one document. For the <time> format see Section 3.10.28. Default value is 90 seconds. Can be set any number of times before the Server command and takes effect till the end of the config file or till the next DocTimeOut command.
DocTimeOut 1m30s
NetErrorDelayTime <time>
Specify the document processing delay time if a network error has occurred. For the <time> format see Section 3.10.28. Default value is one day.
NetErrorDelayTime 1d
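The network-related commands above might be combined for a slow or unreliable host as in this sketch. The host name and all values are illustrative assumptions, not recommendations:

```
# Make no more than 5 connection attempts after network errors
MaxNetErrors 5
# Connect and stalled-connection timeout
ReadTimeOut 10s
# Upper bound for downloading one document
DocTimeOut 1m
# Delay document processing for 12 hours after a network error
NetErrorDelayTime 12h
Server http://slow.example.com/
```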
Cookies yes/no
Enables/Disables the support for HTTP cookies. Command may be used several times before Server command and takes effect till the end of config file or till next Cookies command. Default value is "no".
Cookies yes
Section <string> <number> <maxlen> [strict] [ <pattern> <replacement> ]
where <string> is a section name and <number> is a section ID between 0 and 255. Use 0 if you don't want to index some of these sections. It is better to use different IDs for different sections; in this case you will be able to give a different weight to each section, or even disallow some sections, at search time.
The <maxlen> argument specifies the maximum length of the section that will be stored in the database. Use 0 for <maxlen> if you don't want to store this section.
<pattern> and <replacement> are a regex-like pattern and replacement used to extract the section value from the document content.
You can specify the strict option to set strict string tokenization for a section, which means a word break at any non-character symbol regardless of context. This is useful, for example, when indexing URLs, where the hyphen, normally treated as a word character, is used as a delimiter between words.
You can specify the single option for a single-value section, for which any second value will be skipped in processing. This is useful, for example, to clean up titles of pages with frames or to remove doubled titles when libextractor is used.
# Standard HTML sections: body, title
Section body 1 256
Section title 2 128
# strict tokenization for URL
Section url 3 0 strict
# regex-pattern for a section
Section GoodName 4 128 "<h1>([^<]*)</h1>" "<b>GoodName:</b> $1"
HrefSection <string> [ <pattern> <replacement> ]
where <string> is a section name, and <pattern> and <replacement> are a regex-like pattern and replacement used to extract the section value from the document content. Use this command to extract links from document content.
# Standard HTML sections: body, title
HrefSection link
HrefSection NewLink "<newlink>([^<]*)</newlink>" "$1"
The "FastHrefCheck yes" command is useful to speed-up the indexing when you have a huge list of Server/Realm/Subnet commands as it disables the href checking against server list during parsing.
Index yes/no
Prevent indexer from storing words into database. Useful for example for link validation. Can be set multiple times before Server command and takes effect till the end of config file or till next Index command. Default value is "yes".
Index no
ProxyAuthBasic login:passwd
Specify the username and password for HTTP proxy basic authorization and for SOCKS5 authorization. Can be used before every Server command and takes effect only for the next Server command! It should also come before the Proxy command. Example:
ProxyAuthBasic somebody:something
Proxy [http|socks5] your.proxy.host[:port]
Use a proxy rather than connecting directly. You can specify either the HTTP or the SOCKS5 proxy type; the HTTP proxy type is used by default. One can index FTP servers when using an HTTP proxy. The default port, if not specified, is 3128 (Squid). If the proxy host is not specified, a direct connection will be used. Can be set before every Server command and takes effect till the end of the config file or till the next Proxy command. If no Proxy command is specified, indexer will use a direct connection. Examples:
# Proxy on atoll.anywhere.com, port 3128:
Proxy atoll.anywhere.com
# Proxy on lota.anywhere.com, port 8090:
Proxy lota.anywhere.com:8090
# Proxy on local Tor
Proxy socks5 localhost:9050
# Disable proxy (direct connect):
Proxy
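Putting ProxyAuthBasic and Proxy together, the ordering described above might look like the following sketch. The credentials are the placeholder values from the ProxyAuthBasic example, and the second host name is hypothetical:

```
# Credentials come before the Proxy command
# and apply to the next Server command only
ProxyAuthBasic somebody:something
Proxy socks5 localhost:9050
Server http://my.server.com/
# Direct connection for the servers that follow
Proxy
Server http://other.example.com/
```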
AuthBasic login:passwd
Use basic HTTP authorization. Can be set before every Server command and takes effect only for the next Server command! Examples:
AuthBasic somebody:something
# If you have password protected directory(-ies), but whole server is open, use:
AuthBasic login1:passwd1
Server http://my.server.com/my/secure/directory1/
AuthBasic login2:passwd2
Server http://my.server.com/my/secure/directory2/
Server http://my.server.com/
ServerWeight <number>
Server weight for Popularity Rank calculation (see Section 8.5.3). Default value is 1.
ServerWeight 1
OptimizeAtUpdate yes
Specify the word index optimization strategy. Default value: no. If enabled, this saves disk space but slows down indexing. May be placed in indexer.conf and cached.conf.
SkipUnreferred yes|no|del
Default value: no. Use this command to skip reindexing of unreferred documents or to delete them. An unreferred document is a document with no links to it. This command requires links collection to be enabled (see Section 8.5.3).
Bind 127.0.0.1
You may use this command to specify the local IP address if your system has several network interfaces.
ProvideReferer yes
Use this command to provide Referer: request header for HTTP and HTTPS connections.
LongestTextItems 4
Use this command to specify the number of longest text items to index.
With the MakePrefixes yes command you can instruct indexer to automatically produce all prefixes of the words indexed. This is suitable, for example, for making search suggestions.