DataparkSearch supports aliases, which make it possible to index a site while taking its content from another location. For example, if you index a local web server, you can take pages directly from disk without involving the web server in the indexing process. Another example is building a search engine for a primary site while using its mirror for indexing. There are several ways to use aliases.
Format of the Alias command in indexer.conf:
Alias <masterURL> <mirrorURL>
For example, suppose you wish to index http://search.site.ru/ using the nearest German mirror, http://www.other.com/mirrors/Search/. Add these lines to your indexer.conf:
Server http://search.site.ru/
Alias http://search.site.ru/ http://www.other.com/mirrors/Search/
search.cgi will display URLs from the master site http://search.site.ru/, but indexer will take the corresponding pages from the mirror site http://www.other.com/mirrors/Search/.
Another example: suppose you want to index everything in the udm.net domain, and one of the servers, for example http://home.udm.net/, is stored on the local machine in the /home/httpd/htdocs/ directory. These commands will be useful:
Realm http://*.udm.net/
Alias http://home.udm.net/ file:/home/httpd/htdocs/
Indexer will take home.udm.net from the local disk and index the other sites using HTTP.
Aliases are searched in the order of their appearance in indexer.conf, so you can create different aliases for a server and its parts:
# First, create alias for example for /stat/ directory which
# is not under common location:
Alias http://home.udm.net/stat/ file:/usr/local/stat/htdocs/
# Then create alias for the rest of the server:
Alias http://home.udm.net/ file:/usr/local/apache/htdocs/
Note: if you change the order of these commands, the alias for the /stat/ directory will never be found, because the more general alias for http://home.udm.net/ matches first.
You may specify the location used by indexer as an optional argument to the Server command:
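The ordering rule can be illustrated with a small Python model. This is only a sketch of first-match prefix lookup as described above, not DataparkSearch's actual code; the function name resolve_alias and the list layout are hypothetical.

```python
# Illustrative model only: resolve_alias() and the list layout are
# hypothetical and do not reflect DataparkSearch internals.

# Aliases in the same order as they appear in the indexer.conf example.
ALIASES = [
    ("http://home.udm.net/stat/", "file:/usr/local/stat/htdocs/"),
    ("http://home.udm.net/", "file:/usr/local/apache/htdocs/"),
]


def resolve_alias(url, aliases):
    """Rewrite url using the first alias whose master prefix matches."""
    for master, mirror in aliases:
        if url.startswith(master):
            return mirror + url[len(master):]
    return url  # no alias applies; the URL is fetched as-is


# Specific alias listed first: /stat/ pages come from their own directory.
print(resolve_alias("http://home.udm.net/stat/index.html", ALIASES))
# Everything else on the server falls through to the general alias.
print(resolve_alias("http://home.udm.net/index.html", ALIASES))
# With the order reversed, the general alias shadows the /stat/ one.
print(resolve_alias("http://home.udm.net/stat/index.html", list(reversed(ALIASES))))
```

In this model the third call yields file:/usr/local/apache/htdocs/stat/index.html, which is why the more specific alias has to come first.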
Server http://home.udm.net/ file:/home/httpd/htdocs/
Aliases in the Realm command are a very powerful feature based on regular expressions. The implementation is similar to how the PHP preg_replace() function works. Aliases in the Realm command work only with the "regex" match type and do not work with the "string" match type.
Use this syntax for Realm aliases:
Realm regex <URL_pattern> <alias_pattern>
Indexer searches the URL for a match to URL_pattern and builds a URL alias using alias_pattern. alias_pattern may contain references of the form $n, where n is a number in the range 0-9. Every such reference is replaced by the text captured by the n'th parenthesized subpattern. $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing subpattern.
Example: your company hosts several hundred users with their domains in the form www.username.yourname.com. Every user's site is stored on disk in "htdocs" under the user's home directory: /home/username/htdocs/.
You may write this command into indexer.conf (note that the dot '.' character has a special meaning in regular expressions and must be escaped with a '\' sign when it is used in its literal meaning):
Realm regex (http://www\.)(.*)(\.yourname\.com/)(.*) file:/home/$2/htdocs/$4
Imagine indexer processes the page http://www.john.yourname.com/news/index.html. It will build the substitutions $0 to $4:
$0 = 'http://www.john.yourname.com/news/index.html' (whole pattern match)
$1 = 'http://www.' matched by subpattern '(http://www\.)'
$2 = 'john' matched by subpattern '(.*)'
$3 = '.yourname.com/' matched by subpattern '(\.yourname\.com/)'
$4 = 'news/index.html' matched by subpattern '(.*)'
Then indexer will compose the alias using $2 and $4:
file:/home/john/htdocs/news/index.html
and will use the result as document location to fetch it.
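The substitution can be reproduced with Python's re module as a rough check of the pattern. This is a sketch under assumptions: Python's regex engine stands in for the one indexer uses, build_alias is a hypothetical helper, and treating a reference to a missing group as empty text is a choice made for this example only.

```python
import re

# Patterns copied from the Realm example; everything else is illustrative.
URL_PATTERN = r"(http://www\.)(.*)(\.yourname\.com/)(.*)"
ALIAS_PATTERN = "file:/home/$2/htdocs/$4"


def build_alias(url, url_pattern, alias_pattern):
    """Expand $0..$9 in alias_pattern from the groups captured in url."""
    match = re.search(url_pattern, url)
    if match is None:
        return None  # the URL is not covered by this Realm

    def expand(ref):
        n = int(ref.group(1))
        # Assumption of this sketch: an unknown group expands to nothing.
        if n > match.re.groups:
            return ""
        return match.group(n) or ""

    return re.sub(r"\$(\d)", expand, alias_pattern)


url = "http://www.john.yourname.com/news/index.html"
for n, text in enumerate(re.search(URL_PATTERN, url).group(0, 1, 2, 3, 4)):
    print(f"${n} = {text!r}")
print(build_alias(url, URL_PATTERN, ALIAS_PATTERN))
```

Because the slash after yourname.com belongs to the third subpattern, the fourth group starts at "news/" and the composed path contains no doubled slash.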
You may also specify the AliasProg command for aliasing purposes. AliasProg is useful for major web hosting companies that want to index their web space by taking documents directly from disk, without involving the web server in the indexing process. The document layout may be too complex to describe with an alias in the Realm command. AliasProg names an external program that takes a URL and writes one string with the appropriate alias to stdout. Use $1 to pass the URL on the command line.
For example, this AliasProg command uses the 'replace' command from the MySQL distribution and replaces the URL substring http://www.apache.org/ with file:/usr/local/apache/htdocs/:
AliasProg "echo $1 | /usr/local/mysql/bin/mysql/replace http://www.apache.org/ file:/usr/local/apache/htdocs/"
You may also write your own program, as complex as needed, to process URLs.
The ReverseAlias indexer.conf command allows URL mapping before a URL is inserted into the database. Unlike the Alias command, which triggers mapping right before a document is downloaded, the ReverseAlias command triggers mapping right after the link is found.
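As a starting point, such a program could look like the Python sketch below. It follows only the contract described above (URL passed as a command-line argument, one alias string written to stdout); the script name, its install path, and the PREFIX_MAP table are hypothetical. Unlike 'replace', which substitutes the substring wherever it occurs, this sketch rewrites the URL prefix only.

```python
#!/usr/bin/env python3
# Hypothetical alias program; it could be referenced from indexer.conf as
#   AliasProg "/usr/local/bin/alias.py $1"
# where the install path is an assumption made for this example.
import sys

# Illustrative mapping from URL prefixes to on-disk locations.
PREFIX_MAP = [
    ("http://www.apache.org/", "file:/usr/local/apache/htdocs/"),
]


def map_url(url):
    """Return the alias for url, or url itself when no prefix matches."""
    for http_prefix, file_prefix in PREFIX_MAP:
        if url.startswith(http_prefix):
            return file_prefix + url[len(http_prefix):]
    return url


if __name__ == "__main__" and len(sys.argv) > 1:
    # Write exactly one line with the alias to stdout.
    print(map_url(sys.argv[1]))
```

Keeping the mapping in a function separate from the command-line handling makes it easy to replace the table with a database or directory lookup later.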
ReverseAlias http://name2/ http://name2.yourname.com/
Server http://name2.yourname.com/
All links with the short server name will be mapped to links with the full server name before they are inserted into the database.
One possible use is cutting various unnecessary strings such as PHPSESSION=XXXX.
For example, cutting it from a URL like http://www/a.php?PHPSESSION=XXX, where PHPSESSION is the only parameter. The question mark is deleted as well:
ReverseAlias regex (http://[^?]*)[?]PHPSESSION=[^&]*$ $1
Cutting it from a URL like http://www/a.php?PHPSESSION=xxx&.., i.e. when PHPSESSION is the first parameter but other parameters follow it. The '&' sign after PHPSESSION is deleted as well. The question mark is not deleted:
ReverseAlias regex (http://[^?]*[?])PHPSESSION=[^&]*&(.*) $1$2
Cutting it from a URL like http://www/a.php?a=b&PHPSESSION=xxx or http://www/a.php?a=b&PHPSESSION=xxx&c=d, where PHPSESSION is not the first parameter. The '&' sign before PHPSESSION is deleted:
ReverseAlias regex (http://.*)&PHPSESSION=[^&]*(.*) $1$2
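The three expressions can be checked against sample URLs with a short Python script. This is a sketch only: Python's re module stands in for the regex engine used by indexer, the helper names are hypothetical, and trying the rules in order and stopping at the first match is an assumption of this example. The first rule is written here with only $1, since its pattern has a single capturing group.

```python
import re

# (pattern, replacement) pairs modelled on the ReverseAlias commands above.
RULES = [
    (r"(http://[^?]*)[?]PHPSESSION=[^&]*$", "$1"),
    (r"(http://[^?]*[?])PHPSESSION=[^&]*&(.*)", "$1$2"),
    (r"(http://.*)&PHPSESSION=[^&]*(.*)", "$1$2"),
]


def expand(template, match):
    """Replace $n in template with the n'th captured group of match."""
    def ref(m):
        n = int(m.group(1))
        if n > match.re.groups:
            return ""  # assumption: a missing group expands to nothing
        return match.group(n) or ""
    return re.sub(r"\$(\d)", ref, template)


def strip_session(url):
    """Apply the first rule that matches; leave other URLs untouched."""
    for pattern, template in RULES:
        match = re.search(pattern, url)
        if match:
            return expand(template, match)
    return url


for sample in (
    "http://www/a.php?PHPSESSION=XXX",
    "http://www/a.php?PHPSESSION=xxx&a=b",
    "http://www/a.php?a=b&PHPSESSION=xxx",
    "http://www/a.php?a=b&PHPSESSION=xxx&c=d",
):
    print(sample, "->", strip_session(sample))
```

Running candidate expressions over a handful of real URLs from your site in this way is a cheap check before a full reindex.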
ReverseAliasProg is a command similar to both the AliasProg and ReverseAlias commands. It takes the same arguments as AliasProg but, like ReverseAlias, maps the URL before it is inserted into the database.
It is also possible to define aliases in the search template (search.htm). The Alias command in search.htm is identical to the one in indexer.conf, but it is active during searching, not indexing.
The syntax of the search.htm Alias command is the same as in indexer.conf:
Alias <find-prefix> <replace-prefix>
For example, suppose search.htm contains the following command:
Alias http://localhost/ http://www.site.ext/
Suppose a search returns a page with the following URL:
http://localhost/news/article10.html
As a result, the $(DU) variable will be replaced not with this URL:
http://localhost/news/article10.html
but with the following URL (the result of applying the Alias command):
http://www.site.ext/news/article10.html