DataparkSearch Engine 4.54: Reference manual

7.3. Segmenters for Chinese, Japanese, Korean and Thai languages

Chinese, Japanese, Korean and Thai texts are written without spaces between the words of a phrase, unlike western languages. When indexing documents in these languages, phrases must therefore additionally be segmented into words.

Sometimes a text in Chinese, Japanese, Korean or Thai is typed with a space between every character for better readability. In this case, you may use the "ResegmentChinese yes", "ResegmentJapanese yes", "ResegmentKorean yes" or "ResegmentThai yes" commands to index a text typed this way. With resegmenting enabled, all spaces between characters are removed and the whole text is then segmented again using DataparkSearch's segmenters (see below).
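The first step of resegmenting, removing the spaces, can be illustrated with a minimal Python sketch. It is not DataparkSearch's code: the function name strip_spaces is hypothetical, and the sketch simply drops every whitespace character, which is a simplification for texts that mix in western words.

```python
def strip_spaces(text):
    """Remove all whitespace so the text can be segmented again from scratch.

    Hypothetical helper for illustration only; it drops every whitespace
    character, including spaces between any western words in the text.
    """
    # str.split() with no arguments splits on any run of whitespace.
    return "".join(text.split())


if __name__ == "__main__":
    print(strip_spaces("北 京 大 学"))
```

The result would then be passed to the segmenter for the corresponding language.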

7.3.1. Japanese language phrase segmenter

Japanese phrases are segmented using either ChaSen, a morphological analysis system for the Japanese language, or MeCab, a Japanese morphological analyser. One of these systems must therefore be installed before you configure and build DataparkSearch.

To enable Japanese phrase segmenting, pass the --enable-chasen or --enable-mecab switch to configure.

7.3.2. Chinese language phrase segmenter

Chinese phrases are segmented using a frequency dictionary of Chinese words. The segmenting itself is done by dynamic programming, choosing the split that maximizes the cumulative frequency of the produced words.
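The idea can be sketched in Python as follows. This is an illustration under stated assumptions, not DataparkSearch's actual implementation: "cumulative frequency" is taken to mean the sum of the word frequencies, unknown single characters are accepted with frequency 0 so that any text can be split, and the names segment, TOY_FREQ and max_word_len, as well as the frequency values, are invented for the example.

```python
# Toy frequency dictionary; the words and counts are made up for illustration.
TOY_FREQ = {"北京": 50, "大学": 40, "北京大学": 100, "大学生": 30, "生": 5}


def segment(text, freq, max_word_len=4):
    """Split text into words so that the sum of word frequencies is maximal.

    Assumptions: the score of a split is the sum of its word frequencies,
    and a single character missing from the dictionary scores 0.
    """
    n = len(text)
    # best[i] is the highest cumulative frequency of any split of text[:i].
    best = [0] + [-1] * n
    # back[i] is the start index of the last word in that best split.
    back = [0] * (n + 1)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in freq:
                score = freq[word]
            elif end - start == 1:
                score = 0  # unknown single character: always allowed
            else:
                continue  # unknown multi-character string: not a word
            total = best[start] + score
            if total > best[end]:
                best[end] = total
                back[end] = start

    # Walk the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    words.reverse()
    return words


if __name__ == "__main__":
    print(segment("北京大学生", TOY_FREQ))
```

With the toy dictionary above, the split "北京大学" + "生" scores 105 and wins over "北京" + "大学" + "生" (95) and "北京" + "大学生" (80).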

To enable Chinese phrase segmenting, you need to enable support for Chinese charsets when configuring DataparkSearch, and to specify the frequency dictionary of Chinese words with the LoadChineseList command in the indexer.conf file.

LoadChineseList [charset dictionaryfilename]

By default, the GB2312 charset and the mandarin.freq dictionary are used.

Note: You need to download the frequency dictionaries from our web site or from one of our mirrors; see Section 1.2.

7.3.3. Thai language phrase segmenter

Thai phrases are segmented using a frequency dictionary of Thai words. The segmenting itself is done in the same way as for Chinese.

To enable Thai phrase segmenting, you need to specify the frequency dictionary of Thai words with the LoadThaiList command in the indexer.conf file.

LoadThaiList [charset dictionaryfilename]

By default, the tis-620 charset and the thai.freq dictionary are used.

Note: You need to download the frequency dictionaries from our web site or from one of our mirrors; see Section 1.2.

7.3.4. Korean language phrase segmenter

Korean phrases are segmented using a frequency dictionary of Korean words. The segmenting itself is done in the same way as for Chinese.

To enable Korean phrase segmenting, you need to specify the frequency dictionary of Korean words with the LoadKoreanList command in the indexer.conf file.

LoadKoreanList [charset dictionaryfilename]

By default, the euc-kr charset and the korean.freq dictionary are used.

Note: You need to download the frequency dictionaries from our web site or from one of our mirrors; see Section 1.2.
