8.5. Relevance

8.5.1. Ordering documents

By default, DataparkSearch sorts results first by relevance and then by popularity rank.

8.5.2. Relevance calculation

During indexing, DataparkSearch divides every document into sections. A section is any part of the document; for HTML documents, for example, this may be the TITLE or the META Description tag.

In addition to sections, several document factors are also taken into account in the relevance calculation: the average distance between query words, the number of query word occurrences, the position of the first occurrence of a query word, and the difference between the distribution of query word counts and the uniform distribution.

During searching, DataparkSearch compares every document found against an "ideal" document. The "ideal" document has the query words in every defined section and also has the predefined values of the additional factors.

Since sections are defined only in the indexer.conf file, use the NumSections command in searchd.conf or in search.htm to specify the number of sections used. By default, this value is 256. Note that NumSections does not affect document ordering, only the relevance value.

Table 8-3. Configure-time parameters to tune relevance calculation (switches for configure)

--enable-rel

This option selects the "full", "fast" or "ultra" version of the relevance calculation. Default value: full (i.e. full relevance calculation).

--disable-reldistance

This option excludes the average word distance from the relevance calculation. Default value: enabled.

--disable-relposition

This option excludes the position of the first query word from the relevance calculation. Default value: enabled.

--disable-relwrdcount

This option excludes word counts from the relevance calculation. Default value: enabled.

--with-avgdist=NUM

This option specifies NUM as the best average distance between words in a document found. Default value: 464.

--with-bestpos=NUM

This option specifies NUM as the best position of the first query word in a document found. Default value: 4.

--with-bestwrdcnt=NUM

This option specifies NUM as the best number of occurrences of each query word in a document found. Default value: 11.

--with-distfactor=NUM

This option specifies NUM as the factor for the average word distance in the relevance calculation. Default value: 0.2.

--with-lessdistfactor=NUM

This option specifies NUM as the factor for the average word distance in the relevance calculation when the average distance is less than the value specified with --with-avgdist. Default value: the --with-distfactor value multiplied by 2.

--with-posfactor=NUM

This option specifies NUM as the factor for the difference between the first query word position in a document found and the best position specified with --with-bestpos. Default value: 0.5.

--with-lessposfactor=NUM

This option specifies NUM as the factor for the first word position in the relevance calculation when the position is less than the value specified with --with-bestpos. Default value: the --with-posfactor value multiplied by 4.

--with-wrdcntfactor=NUM

This option specifies NUM as the factor for the difference between the count of query words in a document found and the best value specified with --with-bestwrdcnt. Default value: 0.4.

--with-lesswrdcntfactor=NUM

This option specifies NUM as the factor for the word count in the relevance calculation when the word count is less than the value specified with --with-bestwrdcnt. Default value: the --with-wrdcntfactor value multiplied by 10.

--with-wrdunifactor=NUM

This option specifies NUM as the factor for the difference of the query word counts from the uniform distribution. Default value: 1.5.
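
The way the "best" values and their factors interact can be illustrated with a small Python sketch. It is not taken from the DataparkSearch sources: it assumes that each additional factor contributes the distance from its best value multiplied by the corresponding factor, with the "less" factor used when the value falls below the best value. The names penalty and additional_factors_penalty are hypothetical, and the uniform-distribution term (--with-wrdunifactor) is left out.

```python
# Default values copied from the configure switches in Table 8-3.
AVG_DIST, BEST_POS, BEST_WRD_CNT = 464, 4, 11
DIST_FACTOR, POS_FACTOR, WRD_CNT_FACTOR = 0.2, 0.5, 0.4
LESS_DIST_FACTOR = DIST_FACTOR * 2         # --with-lessdistfactor default
LESS_POS_FACTOR = POS_FACTOR * 4           # --with-lessposfactor default
LESS_WRD_CNT_FACTOR = WRD_CNT_FACTOR * 10  # --with-lesswrdcntfactor default


def penalty(value, best, factor, less_factor):
    """Assumed form of one term: the distance from the best value, scaled
    by a factor chosen according to the side of the best value."""
    if value < best:
        return (best - value) * less_factor
    return (value - best) * factor


def additional_factors_penalty(avg_dist, first_pos, wrd_cnt):
    """Hypothetical helper: sum of the three per-factor penalties."""
    return (
        penalty(avg_dist, AVG_DIST, DIST_FACTOR, LESS_DIST_FACTOR)
        + penalty(first_pos, BEST_POS, POS_FACTOR, LESS_POS_FACTOR)
        + penalty(wrd_cnt, BEST_WRD_CNT, WRD_CNT_FACTOR, LESS_WRD_CNT_FACTOR)
    )
```

Under these assumptions, a document whose factors all equal the best values gets a penalty of 0, and with the default factors a value below the best value is penalized more heavily than a value the same distance above it.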

8.5.2.1. A full method of relevance calculation.

Let x be the weighted sum of all sections. The weights for these sections are defined by the wf parameter (see Section 8.1.3). Let y be the weighted sum of the differences between the values of the additional factors for the document found and the corresponding values for the "ideal" document. Let xy be the weighted sum of the sections in which at least one query word has been found. The relevance of a document found is then calculated as: 0.5 * ( x + xy ) / ( x + y ).
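
As a worked illustration, the formula can be written as a short Python function. This is only a sketch of the arithmetic; the function name full_relevance and the handling of a zero denominator are assumptions, not part of DataparkSearch.

```python
def full_relevance(x, y, xy):
    """Relevance by the "full" method: 0.5 * (x + xy) / (x + y).

    x  - weighted sum of all sections
    y  - weighted sum of differences in the additional factors
    xy - weighted sum of the sections containing a query word
    """
    if x + y == 0:
        return 0.0  # assumption: nothing to compare against yields no score
    return 0.5 * (x + xy) / (x + y)
```

For a document that matches the "ideal" one, xy equals x and y equals 0, so the function returns 1.0. A larger y or a smaller xy lowers the result.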

8.5.2.2. A fast method of relevance calculation.

Let x be the number of bits used in the weighted values of all defined sections. Let y be the weighted sum of the differences between the additional factors of the document found and the corresponding values of the "ideal" document. Let xy be the number of bits in which the weighted section values of the "ideal" document differ from the weighted section values of the document found. The relevance of a document is then calculated as: ( x - xy ) / ( x + y ).
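
The bit counting can be sketched in Python as follows. The sketch assumes that the section values of a document are packed into one integer bit mask, that "bits used" means the bits set in the mask of the "ideal" document, and that the document found sets only bits that are also set in the "ideal" mask. The names popcount and fast_relevance are hypothetical.

```python
def popcount(n):
    """Number of set bits in a non-negative integer."""
    return bin(n).count("1")


def fast_relevance(ideal_bits, found_bits, y):
    """Relevance by the "fast" method: (x - xy) / (x + y)."""
    x = popcount(ideal_bits)                # assumption: bits set in the ideal mask
    xy = popcount(ideal_bits ^ found_bits)  # bits where the two masks differ
    if x + y == 0:
        return 0.0  # assumption: empty masks and no penalty yield no score
    return (x - xy) / (x + y)
```

With an "ideal" mask of 0b1111 and no penalty, a document with mask 0b1111 scores 1.0, and a document with mask 0b0011 scores 0.5.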

8.5.3. Popularity rank

DataparkSearch supports two methods of popularity rank calculation. The method used in previous versions is called "Goo", and the new method is called "Neo". By default, the Goo method is used. To select the desired PopRank calculation method, use the PopRankMethod command:

PopRankMethod Neo

The Neo method, and the full functionality of the Goo method, require links collection, which you enable with the CollectLinks yes command in your indexer.conf file. This slows indexing down slightly. By default, links collection is not enabled.

By default, only intersite links (i.e. links from a page on one site to a page on another site) are taken into account in the popularity rank calculation. If you place the PopRankSkipSameSite no command in the indexer.conf file, indexer takes all links into account.

You may assign an initial value for a page's popularity rank using the DP.PopRank META tag (see Section 4.3).

8.5.3.1. "Goo" popularity rank calculation method

The popularity rank is calculated in two stages. In the first stage, the value of the Weight parameter for every server is divided by the number of links from this server, which gives the weight of one link from this server. In the second stage, for every page the weights of all links pointing to this page are summed. This sum is the popularity rank of the page.

By default, the value of the Weight parameter is 1 for all indexed servers. You may change this value with the ServerWeight command in the indexer.conf file, or directly in the server table if you load the server configuration from this table.
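
The two stages can be sketched in Python as follows. This is an illustration of the description above, not the indexer's code: the function name goo_popularity_rank and the representation of links as (source server, target page) pairs are assumptions.

```python
from collections import defaultdict


def goo_popularity_rank(links, server_weight=None):
    """links: iterable of (source_server, target_page) pairs.
    server_weight: optional mapping of server to its Weight (default 1)."""
    server_weight = server_weight or {}
    links = list(links)

    # Stage 1: count the links from each server, so that the weight of
    # one link is the server's Weight divided by this count.
    out_count = defaultdict(int)
    for server, _target in links:
        out_count[server] += 1

    # Stage 2: sum the weights of all links pointing to each page.
    rank = defaultdict(float)
    for server, target in links:
        rank[target] += server_weight.get(server, 1.0) / out_count[server]
    return dict(rank)
```

For example, if server a.example links to two pages and server c.example links to one of them, the page linked from both servers receives 0.5 + 1.0 = 1.5, and the other page receives 0.5.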

If you place the PopRankFeedBack yes command in the indexer.conf file, indexer calculates site weights before the page rank calculation. To do that, indexer sums the popularity ranks of all pages from the same site. If this sum is greater than 1, the site weight is set to this sum; otherwise, the site weight is set to 1.

If you place the PopRankUseTracking yes command in the indexer.conf file, indexer calculates the site weight as the number of tracked queries restricted to this site.

If you place the PopRankUseShowCnt yes command in the search.htm (or searchd.conf) file, then for every result shown to the user the corresponding url.shows value is increased by 1 if the relevance of this result is greater than or equal to the value specified by the PopRankShowCntRatio command (default value: 25.0). If you place PopRankUseShowCnt yes in the indexer.conf file, indexer adds to the URL's PopularityRank the value of url.shows multiplied by the value specified by the PopRankShowCntWeight command (default value: 0.01).
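
The feedback rule amounts to taking the larger of 1 and the per-site sum, as the following Python sketch shows. The function name site_weights and the site_of callback, which maps a page URL to its site, are hypothetical.

```python
from urllib.parse import urlparse


def site_weights(page_rank, site_of=lambda url: urlparse(url).netloc):
    """page_rank: mapping of page URL to popularity rank.
    Returns a mapping of site to weight, never below 1."""
    totals = {}
    for page, rank in page_rank.items():
        site = site_of(page)
        totals[site] = totals.get(site, 0.0) + rank
    # A sum greater than 1 becomes the site weight; otherwise the weight is 1.
    return {site: max(1.0, total) for site, total in totals.items()}
```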
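
The two halves of this mechanism can be sketched in Python. The function names record_show and rank_with_shows are hypothetical; only the default values come from the text above.

```python
SHOW_CNT_RATIO = 25.0   # PopRankShowCntRatio default
SHOW_CNT_WEIGHT = 0.01  # PopRankShowCntWeight default


def record_show(shows, relevance, ratio=SHOW_CNT_RATIO):
    """Search side: count a shown result only if it is relevant enough."""
    return shows + 1 if relevance >= ratio else shows


def rank_with_shows(pop_rank, shows, weight=SHOW_CNT_WEIGHT):
    """Indexer side: add the weighted show count to the popularity rank."""
    return pop_rank + shows * weight
```

With the default values, a URL that has been shown 200 times with sufficient relevance gains 200 * 0.01 = 2 on top of its link-based popularity rank.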

8.5.3.2. "Neo" popularity rank calculation method

The Neo method treats all pages as neurons and the links between pages as links between neurons. This makes it possible to use an error back-propagation algorithm to train this neural network. The popularity rank of a page is the activity level of the corresponding neuron. See the short description of The Neo Popularity Rank for web pages.

You may use the PopRankNeoIterations command to specify the number of iterations of the Neo Popularity Rank calculation. Default value: 3.

By default, the Neo Popularity Rank is calculated along with indexing. To speed up indexing, you may postpone the Popularity Rank calculation using the PopRankPostpone command:

PopRankPostpone yes

You may then calculate the Neo Popularity Rank after indexing in the same way as for the Goo method, i.e.: indexer -TR

8.5.4. Boolean search

Please note that when searching for two or more words in boolean mode, you have to enter operators (&, |, ~, AND, OR, NOT, NEAR, ALL, etc.). That is, you must enter a & book instead of a book. See also Section 8.1.7.

8.5.5. Crosswords

This feature assigns the words between <a href="xxx"> and </a> also to the document that the link leads to, and the words from the alt attribute of an img tag to the picture that the tag points to. To enable Crosswords, use the CrossWords yes command in indexer.conf and search.htm, and define the crosswords section in the sections.conf file.

With the CrossWordsSkipSameSite command you can control the collection of crosswords from the same site. If the option is set to yes (the default), crosswords from the same site are not collected. If you wish to collect such crosswords, you need to set the option to no explicitly:

CrossWordsSkipSameSite no

8.5.6. The Summary Extraction Algorithm (SEA)

The Summary Extraction Algorithm (SEA) builds a summary of three or more of the most relevant sentences of each indexed document, provided the document consists of six or more sentences. To enable this feature, add this command to your sections.conf file:

Section sea x y
where x is the section number and y is the maximum length of this section's value; leave y as 0 if you do not want to show the summary in result pages. If you specify a non-zero y, you may use the $(sea) meta-variable in your search template to show the summary in result pages.

Related configuration directives:

The SEASentenceMinLength command specifies the minimum length of a sentence to be used in summary construction by the SEA. Default value: 64.

The SEASentences command specifies the maximum number of sentences, each with a length greater than or equal to the value specified by the SEASentenceMinLength command, that are used for summary construction by the SEA. Default value: 32. Since the cost of summary construction by the SEA grows nonlinearly with the number of sentences (this affects only indexing), you may adjust this value according to the desired indexing performance.

With the SEASections command you can specify the list of document sections that are used to construct the SEA summary. By default, only the "body" section is used for SEA summary construction.

SEASections "body, title"

This algorithm of automatic summary construction is based on the ideas of Rada Mihalcea described in the paper: Rada Mihalcea and Paul Tarau, An Algorithm for Language Independent Single and Multiple Document Summarization, in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), Korea, October 2005.
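
The general idea of graph-based sentence ranking can be sketched in Python as follows. This is a generic illustration and not DataparkSearch's implementation: the naive sentence splitter, the Jaccard word-overlap similarity, the damping value, the iteration count and the function names are all assumptions. Only the default limits (64 and 32) and the six-sentence threshold come from the text above.

```python
import re


def rank_sentences(sentences, iterations=20, damping=0.85):
    """Score sentences by iterating over a word-overlap similarity graph."""
    words = [set(re.findall(r"\w+", s.lower())) for s in sentences]
    n = len(sentences)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):  # every pair is compared: quadratic in n
        for j in range(i + 1, n):
            union = words[i] | words[j]
            if union:
                sim[i][j] = sim[j][i] = len(words[i] & words[j]) / len(union)
    out_sum = [sum(row) for row in sim]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, min_length=64, max_sentences=32, top_n=3):
    """Return up to top_n highest-scoring sentences in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if len(sentences) < 6:
        return []  # documents with fewer than six sentences get no summary
    candidates = [s for s in sentences if len(s) >= min_length][:max_sentences]
    scores = rank_sentences(candidates)
    best = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in sorted(best[:top_n])]
```

In this sketch every pair of candidate sentences is compared, so the work grows quadratically with the number of sentences kept, which suggests why a cap such as SEASentences has a direct effect on indexing time.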

Differences in DataparkSearch's SEA:

After indexing a document collection with this section defined, you may use the $(sea) meta-variable in your template to show the summary for a search result.