By default, DataparkSearch sorts results first by relevance and second by popularity rank.
During indexing, DataparkSearch divides every document into sections. A section is any part of the document; for HTML documents, for example, this may be the TITLE or the META Description tag.
In addition to sections, some document factors are also taken into account for the relevance calculation: the average distance between query words, the number of query word occurrences, the position of the first occurrence of a query word, and the difference between the distribution of query word counts and the uniform distribution.
During searching, DataparkSearch compares every document found against an "ideal" document. The "ideal" document has query words in every section defined and also has the predefined values of the additional factors.
Since sections are defined only in the indexer.conf file, use the NumSections command in searchd.conf or in search.htm to specify the number of sections used. By default, this value is 256. Note that NumSections does not affect document ordering, only the relevance value.
Table 8-3. Configure-time parameters to tune relevance calculation (switches for configure)
--enable-rel | This option enables "full", "fast" or "ultra" version of relevance calculation. Value by default: full (i.e. full relevance calculation). |
--disable-reldistance | This option disables accounting of average word distance for relevance calculation. Value by default: enabled. |
--disable-relposition | This option disables accounting of first query word position for relevance calculation. Value by default: enabled. |
--disable-relwrdcount | This option disables accounting of word counts for relevance calculation. Value by default: enabled. |
--with-avgdist= | This option specifies the |
--with-bestpos= | This option specifies the |
--with-bestwrdcnt= | This option specifies the |
--with-distfactor= | This option specifies the |
--with-lessdistfactor= | This option specifies the |
--with-posfactor= | This option specifies the |
--with-lessposfactor= | This option specifies |
--with-wrdcntfactor= | This option specifies the |
--with-lesswrdcntfactor= | This option specifies |
--with-wrdunifactor= | This option specifies the |
Let x be the weighted sum of all sections. The weights for these sections are defined by the wf parameter (see Section 8.1.3). Let y be the weighted sum of the differences between the values of the additional factors of the document found and the corresponding values of the "ideal" document. And let xy be the weighted sum of the sections where at least one query word has been found. Then the relevance of a document found is calculated as: 0.5 * ( x + xy ) / ( x + y ).
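The formula above can be sketched in a few lines of Python. This is only an illustration, not the DataparkSearch implementation: the function name, the dictionary of section weights and the precomputed value of y are all assumptions made for the example.

```python
def full_relevance(section_weights, matched_sections, factor_penalty):
    """Relevance = 0.5 * (x + xy) / (x + y), as stated in the text.

    section_weights  -- hypothetical mapping: section name -> wf weight
    matched_sections -- names of sections holding at least one query word
    factor_penalty   -- y, the weighted sum of differences from the "ideal"
                        document's additional factors (assumed precomputed)
    """
    x = sum(section_weights.values())
    xy = sum(w for name, w in section_weights.items() if name in matched_sections)
    return 0.5 * (x + xy) / (x + factor_penalty)


weights = {"body": 1, "title": 2, "keywords": 1}
# Query words in every section and no factor differences: the "ideal" case.
ideal = full_relevance(weights, {"body", "title", "keywords"}, 0.0)
# Query words in the body only, with some factor differences.
partial = full_relevance(weights, {"body"}, 1.0)
print(ideal, partial)
```

With these made-up weights the "ideal" document gets the maximum value and the partial match gets a lower one, which is the behaviour the formula is meant to produce.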
For the bit-based variant of the calculation, let x be the number of bits used in the weighted values of all sections defined. Let y be the weighted sum of the differences between the additional factors of the document found and the corresponding values of the "ideal" document. And let xy be the number of bits in which the weighted section values of the "ideal" document differ from the weighted section values of the document found. Then the document relevance is calculated as: ( x - xy ) / ( x + y ).
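A minimal sketch of this bit-counting formula follows. It assumes, purely for illustration, that the weighted section values of a document are packed into one integer and that the total number of bits used (x) is passed in separately. The names are hypothetical and the real data layout may differ.

```python
def bit_relevance(ideal_bits, found_bits, total_bits, factor_penalty):
    """Relevance = (x - xy) / (x + y), as stated in the text.

    ideal_bits, found_bits -- packed weighted section values (assumed layout)
    total_bits             -- x, the number of bits used by all sections
    factor_penalty         -- y, assumed to be precomputed
    """
    # xy: number of bit positions where the two documents differ.
    xy = bin(ideal_bits ^ found_bits).count("1")
    return (total_bits - xy) / (total_bits + factor_penalty)


ideal_doc = 0b1111
found_doc = 0b1010
score = bit_relevance(ideal_doc, found_doc, 4, 0.0)
print(score)
```

Counting differing bits with XOR avoids summing weights section by section, so this form can be cheaper to evaluate than the weighted-sum form.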
DataparkSearch supports two methods of popularity rank calculation. The method used in previous versions is called "Goo", and the new method is called "Neo". By default, the Goo method is used. To select the desired PopRank calculation method, use the PopRankMethod command:
PopRankMethod Neo
For the Neo method, and for the full functionality of the Goo method, you need to enable link collection with the CollectLinks yes command in your indexer.conf file. This slows indexing down a little. By default, link collection is not enabled.
By default, only intersite links (i.e. links from a page on one site to a page on another site) are taken into account for the popularity rank calculation. If you place the PopRankSkipSameSite no command in the indexer.conf file, indexer takes all links into account for this purpose.
You may assign an initial value for a page's popularity rank using the DP.PopRank META tag (see Section 4.3).
In the Goo method, the popularity rank calculation is made in two stages. In the first stage, the value of the Weight parameter for every server is divided by the number of links from this server. This gives the weight of one link from this server.
In the second stage, for every page we find the sum of the weights of all links pointing to this page. This sum is the popularity rank of the page.
By default, the value of the Weight parameter is equal to 1 for all servers indexed. You may change this value with the ServerWeight command in the indexer.conf file, or directly in the server table if you load the server configuration from this table.
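The two-stage calculation described above can be sketched as follows. This is an illustration under stated assumptions, not the indexer's code: links are given as (source, target) pairs, the url-to-server mapping is supplied by the caller, and filtering of same-site links is left out.

```python
from collections import defaultdict


def two_stage_pop_rank(links, server_of, server_weight=None):
    """links: list of (source_url, target_url); server_of: url -> server name."""
    server_weight = server_weight or {}
    # Stage 1: weight of one link = server Weight / number of links from it.
    out_links = defaultdict(int)
    for src, _ in links:
        out_links[server_of[src]] += 1
    link_weight = {s: server_weight.get(s, 1.0) / n for s, n in out_links.items()}
    # Stage 2: a page's rank is the sum of the weights of links pointing to it.
    rank = defaultdict(float)
    for src, dst in links:
        rank[dst] += link_weight[server_of[src]]
    return dict(rank)


links = [
    ("a.example/1", "b.example/1"),
    ("a.example/2", "b.example/1"),
    ("b.example/1", "a.example/1"),
]
server_of = {
    "a.example/1": "a.example",
    "a.example/2": "a.example",
    "b.example/1": "b.example",
}
ranks = two_stage_pop_rank(links, server_of)
print(ranks)
```

Here a.example has two outgoing links, so each carries half of its default Weight of 1, while the single link from b.example carries the full weight.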
If you place the PopRankFeedBack yes command in the indexer.conf file, indexer will calculate site weights before the page rank calculation. To do that, indexer calculates the sum of the popularity ranks of all pages from the same site. If this sum is greater than 1, the site weight is set to this sum; otherwise, the site weight is set to 1.
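The feedback rule can be illustrated with a short sketch. The function and argument names are hypothetical. The only logic taken from the text is "sum the ranks per site, and never let the weight drop below 1".

```python
def feedback_site_weights(page_rank, site_of):
    """page_rank: url -> popularity rank; site_of: url -> site name."""
    totals = {}
    for url, rank in page_rank.items():
        site = site_of[url]
        totals[site] = totals.get(site, 0.0) + rank
    # A sum greater than 1 becomes the site weight; otherwise the weight is 1.
    return {site: total if total > 1 else 1.0 for site, total in totals.items()}


page_rank = {"a.example/1": 1.5, "a.example/2": 0.5, "b.example/1": 0.25}
site_of = {
    "a.example/1": "a.example",
    "a.example/2": "a.example",
    "b.example/1": "b.example",
}
site_weights = feedback_site_weights(page_rank, site_of)
print(site_weights)
```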
If you place the PopRankUseTracking yes command in the indexer.conf file, indexer will calculate the site weight as the number of tracked queries restricted to this site.
If you place the PopRankUseShowCnt yes command in the search.htm (or searchd.conf) file, then for every result shown to the user the corresponding url.shows value is increased by 1, provided the relevance of this result is greater than or equal to the value specified by the PopRankShowCntRatio command (default value is 25.0).
If you place PopRankUseShowCnt yes in the indexer.conf file, indexer will add to the URL's PopularityRank the value of url.shows multiplied by the value specified in the PopRankShowCntWeight command (default value is 0.01).
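The interplay of the two PopRankUseShowCnt settings can be sketched like this. The dictionaries stand in for the url.shows column and the stored popularity rank, and both function names are invented for the example. The defaults mirror the documented defaults of PopRankShowCntRatio and PopRankShowCntWeight.

```python
def record_shows(results, shows, ratio=25.0):
    """Search side: count a show for each result whose relevance >= ratio."""
    for url, relevance in results:
        if relevance >= ratio:
            shows[url] = shows.get(url, 0) + 1


def add_show_bonus(pop_rank, shows, weight=0.01):
    """Indexer side: add url.shows * weight to each URL's popularity rank."""
    return {url: rank + shows.get(url, 0) * weight for url, rank in pop_rank.items()}


shows = {}
record_shows([("u1", 80.0), ("u2", 10.0)], shows)  # u2 is below the ratio
record_shows([("u1", 25.0)], shows)                # equal to the ratio counts
boosted = add_show_bonus({"u1": 1.0, "u2": 0.5}, shows)
print(shows, boosted)
```

Results that are shown often and with high relevance therefore drift upward in popularity rank, at a pace controlled by the weight.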
The Neo method treats all pages as neurons and the links between pages as links between neurons. This makes it possible to use an error back-propagation algorithm to train this neural network. The popularity rank of a page is the activity level of the corresponding neuron. See the short description of The Neo Popularity Rank for web pages.
You may use the PopRankNeoIterations command to specify the number of iterations of the Neo Popularity Rank calculation. The default value is 3.
By default, the Neo Popularity Rank is calculated along with indexing. To speed up indexing, you may postpone the Popularity Rank calculation using the PopRankPostpone command:
PopRankPostpone yes
Then you may calculate the Neo Popularity Rank after indexing in the same way as for the Goo method, i.e.: indexer -TR
Please note that in boolean searches for two or more words, you have to enter operators (&, |, ~, AND, OR, NOT, NEAR, ALL, etc.). That is, it is necessary to enter a & book instead of a book. See also Section 8.1.7.
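If queries are built by a front-end script, the operators can be inserted before the query is sent. The helper below is a hypothetical client-side convenience, not part of DataparkSearch. It only joins plain words and does not parse queries that already contain operators.

```python
def to_boolean_query(words, operator="&"):
    """Join plain query words with an explicit boolean operator."""
    return f" {operator} ".join(words)


query = to_boolean_query(["a", "book"])
print(query)  # a & book
```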
The Crosswords feature assigns the words between <a href="xxx"> and </a> to the document the link leads to as well, and the words from the alt attribute of an img tag to the picture the tag points to. To enable crosswords, use the CrossWords yes command in indexer.conf and search.htm, and define a crosswords section in the sections.conf file.
With the CrossWordsSkipSameSite command you can manage the collection of crosswords from the same site. If it is set to yes (the default), crosswords from the same site are not collected. If you wish to collect such crosswords, you need to set it to no explicitly:
CrossWordsSkipSameSite no
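To make the idea concrete, here is a rough sketch of crossword collection using Python's standard HTML parser. It is not how indexer is implemented. The class name is invented, and applying the same-site check to img targets as well as link targets is an assumption of this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class CrossWordCollector(HTMLParser):
    def __init__(self, base_url, skip_same_site=True):
        super().__init__()
        self.base_url = base_url
        self.skip_same_site = skip_same_site
        self.crosswords = {}  # target URL -> list of words credited to it
        self._href = None     # target of the <a> element currently open

    def _add(self, target, text):
        url = urljoin(self.base_url, target)
        same_site = urlsplit(url).netloc == urlsplit(self.base_url).netloc
        if self.skip_same_site and same_site:
            return
        self.crosswords.setdefault(url, []).extend(text.split())

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
        elif tag == "img" and attrs.get("src") and attrs.get("alt"):
            self._add(attrs["src"], attrs["alt"])

    def handle_endtag(self, tag):
        if tag == "a":
            self._href = None

    def handle_data(self, data):
        if self._href:
            self._add(self._href, data)


page = (
    '<a href="http://b.example/doc">free book</a> '
    '<a href="/local">home</a> '
    '<img src="http://c.example/cat.png" alt="grey cat">'
)
collector = CrossWordCollector("http://a.example/index.html")
collector.feed(page)
print(collector.crosswords)
```

Passing skip_same_site=False to the constructor would also credit the word "home" to http://a.example/local, mirroring the effect of CrossWordsSkipSameSite no.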
The Summary Extraction Algorithm (SEA) builds a summary of the three or more most relevant sentences of each indexed document, if the document consists of six or more sentences. To enable this feature, add this command to your sections.conf file:
Section sea x y
where x is the number of the section and y is the maximum length of this section's value; leave it 0 if you do not want to show the summary in result pages. If you specify a non-zero y, you may use the $(sea) meta-variable in your search template to show the summary in result pages.
Related configuration directives:
The SEASentenceMinLength command specifies the minimal length of a sentence to be used in summary construction by the SEA. Default value: 64.
The SEASentences command specifies the maximal number of sentences, of length greater than or equal to the value specified by the SEASentenceMinLength command, that are used for summary construction in the SEA. Default value: 32. Since the cost of summary construction using the SEA grows nonlinearly with this number (it affects only indexing), you may adjust this value according to the desired indexing performance.
With the SEASections command you can specify the list of document sections used to construct the SEA summary. By default, only the "body" section is used for SEA summary construction.
SEASections "body, title"
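The effect of SEASentenceMinLength and SEASentences on the input to the algorithm can be sketched as below. Several details are assumptions made for the example: sentence length is measured in characters, sentences are taken in document order, and splitting the text into sentences is left to the caller.

```python
def sea_candidates(sentences, min_length=64, max_sentences=32):
    """Pick the sentences that would take part in summary construction."""
    # Documents with fewer than six sentences get no SEA summary.
    if len(sentences) < 6:
        return []
    long_enough = [s for s in sentences if len(s) >= min_length]
    return long_enough[:max_sentences]


document = ["short one"] * 3 + ["x" * 70] * 40
candidates = sea_candidates(document)
print(len(candidates))
```

Lowering max_sentences shrinks the sentence graph, which is where the indexing-time savings mentioned above would come from.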
This algorithm of automatic summary construction is based on ideas of Rada Mihalcea described in the paper Rada Mihalcea and Paul Tarau, An Algorithm for Language Independent Single and Multiple Document Summarization, in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Korea, October 2005.
Differences in DataparkSearch's SEA:
Initial weights for graph edges are calculated as a measure of similarity between the 3-gram distributions of the corresponding sentences.
The initial value of every graph vertex is equal to 1 / (number of sentences + 1) in the current implementation.
The Neo PopRank algorithm is used as ranking algorithm to iterate values assigned to vertexes.
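The initial state of the sentence graph can be sketched as follows. The text does not say which similarity measure is used or whether the 3-grams are taken over characters or words. This example assumes character 3-grams compared by cosine similarity, and leaves out the Neo PopRank iteration itself.

```python
from collections import Counter
from math import sqrt


def trigrams(sentence):
    """Character 3-gram counts of a sentence (an assumed tokenisation)."""
    text = sentence.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))


def similarity(a, b):
    """Cosine similarity of two 3-gram distributions (an assumed measure)."""
    ta, tb = trigrams(a), trigrams(b)
    dot = sum(count * tb[gram] for gram, count in ta.items())
    norm = sqrt(sum(c * c for c in ta.values())) * sqrt(sum(c * c for c in tb.values()))
    return dot / norm if norm else 0.0


def initial_graph(sentences):
    n = len(sentences)
    # Every vertex starts at 1 / (number of sentences + 1), as in the text.
    vertex = [1.0 / (n + 1)] * n
    edges = {(i, j): similarity(sentences[i], sentences[j])
             for i in range(n) for j in range(i + 1, n)}
    return vertex, edges


vertex, edges = initial_graph(["the cat sat", "the cat ran", "dogs bark loudly"])
print(vertex, edges)
```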
After indexing a document collection with this section defined, you may use the $(sea) meta-variable in your template to show the summary for a search result.