The DataparkSearch indexer can use external parsers to index various file types (MIME types).
A parser is an executable program that converts a document of one MIME type to text/plain or text/html. For example, if you have PostScript files, you can use the ps2ascii parser (filter), which reads a PostScript file from stdin and writes ASCII text to stdout.
Indexer supports four types of parsers, which can:
read data from stdin and send result to stdout
read data from file and send result to stdout
read data from file and send result to file
read data from stdin and send result to file
Configure mime types
Configure your web server to send the appropriate "Content-Type" header. For Apache, have a look at the mime.types file; most MIME types are already defined there.
If you want to index local files or files served via FTP, use the "AddType" command in indexer.conf to associate file name extensions with their MIME types. For example:
AddType text/html *.html
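The association that AddType sets up can be pictured as a list of glob patterns checked against the file name. The sketch below is an illustration only: the function name `guess_mime`, the first-match rule, the case-sensitive matching and the fallback type are all assumptions, not behavior confirmed by this manual.

```python
from fnmatch import fnmatchcase

# (MIME type, patterns) pairs mirroring AddType lines from indexer.conf.
ADD_TYPES = [
    ("text/html", ["*.html"]),
    ("application/x-gzipped-man", ["*.1.gz", "*.2.gz", "*.3.gz", "*.4.gz"]),
]


def guess_mime(file_name, add_types=ADD_TYPES, default="application/octet-stream"):
    """Return the MIME type of the first entry with a matching pattern."""
    for mime, patterns in add_types:
        if any(fnmatchcase(file_name, pattern) for pattern in patterns):
            return mime
    return default  # assumed fallback, not taken from the manual
```

With these entries, `guess_mime("index.html")` gives `text/html`, and a name that matches no pattern falls through to the assumed default.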
Add lines with parser definitions. Each line has the following format with three arguments:
Mime <from_mime> <to_mime> [<command line>]
For example, the following line defines a parser for man pages:
# Use deroff for parsing man pages ( *.man )
Mime application/x-troff-man text/plain deroff
This parser will take data from stdin and output result to stdout.
Many parsers cannot operate on stdin and require a file to read from. In this case indexer creates a temporary file in /tmp and removes it when the parser exits. Use the $1 macro in the parser command line to substitute the file name. For example, the Mime command for the "catdoc" MS Word to ASCII converter may look like this:
Mime application/msword text/plain "/usr/bin/catdoc -a $1"
If your parser writes its result to an output file, use the $2 macro. Indexer replaces $2 with a temporary file name, starts the parser, reads the result from this temporary file, and then removes it. For example:
Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"
The parser above reads data from the first temporary file and writes the result to the second one. Both temporary files are removed when the parser exits. Note that this parser produces exactly the same result as the previous one; the two differ only in execution mode: file->stdout and file->file, respectively.
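The link between the macros and the four execution modes can be sketched in a few lines. This is a reading of the description above, not the indexer's actual code: the function names and the temporary file names in the usage note are hypothetical.

```python
def execution_mode(command):
    """Name the mode implied by the macros present in a parser command line."""
    source = "file" if "$1" in command else "stdin"
    target = "file" if "$2" in command else "stdout"
    return f"{source}->{target}"


def expand_macros(command, input_path, output_path):
    """Replace $1 and $2 with the temporary file names."""
    return command.replace("$1", input_path).replace("$2", output_path)
```

For the last example, `execution_mode("/usr/bin/catdoc -a $1 >$2")` reports `file->file`, and `expand_macros` with the made-up names `/tmp/in.doc` and `/tmp/out.txt` yields `/usr/bin/catdoc -a /tmp/in.doc >/tmp/out.txt`.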
If the <command line> parameter is omitted, the two MIME types are treated as synonyms. For example, some sites incorrectly report MP3 files as application/mp3. You can map this type to the correct one, audio/mpeg, so that these files are processed properly:
Mime application/mp3 audio/mpeg
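To make the three-argument format concrete, here is a small sketch that splits a Mime line into its parts, with the command left empty for a synonym definition. It assumes shell-style quoting, which matches the quoted examples in this section but is not necessarily how indexer itself tokenizes indexer.conf; `parse_mime_line` is a hypothetical helper.

```python
import shlex


def parse_mime_line(line):
    """Split a Mime directive into (from_mime, to_mime, command or None)."""
    fields = shlex.split(line)
    if len(fields) not in (3, 4) or fields[0] != "Mime":
        raise ValueError(f"not a Mime directive: {line!r}")
    command = fields[3] if len(fields) == 4 else None  # None means synonym
    return fields[1], fields[2], command
```

Applied to `Mime application/mp3 audio/mpeg`, the command part is `None`, which is how a synonym would be recognized in this sketch.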
To prevent indexer from hanging while a parser runs, you can limit the parser execution time, in seconds, with the ParserTimeOut command in indexer.conf. For example:
ParserTimeOut 600
Default value is 300 seconds, i.e. 5 minutes.
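The effect of such a limit can be illustrated with a stdin->stdout parser run under a timeout. This is only a sketch of the idea using Python's standard `subprocess` module; `run_parser` is a hypothetical helper, and the two stand-in "parsers" are built from the Python interpreter purely for demonstration.

```python
import subprocess
import sys


def run_parser(argv, data, timeout=300):
    """Feed data to a stdin->stdout parser; return its output, or None on timeout."""
    try:
        done = subprocess.run(argv, input=data, stdout=subprocess.PIPE, timeout=timeout)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising the exception.
        return None
    return done.stdout


# Stand-in parsers (illustrative only): one converts its input, one hangs.
UPPERCASE = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read().upper())"]
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]
```

Here `run_parser(UPPERCASE, b"abc")` returns the converted bytes, while `run_parser(SLEEPER, b"", timeout=1)` gives up after one second and returns `None` instead of blocking.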
You can use pipes in a parser's command line. For example, these lines are useful for indexing gzipped man pages from a local disk:
AddType application/x-gzipped-man *.1.gz *.2.gz *.3.gz *.4.gz
Mime application/x-gzipped-man text/plain "zcat | deroff"
Some parsers produce output in a charset other than the one given in the LocalCharset command. Specify the parser's output charset so that indexer converts it to the proper one. For example, if your catdoc is configured to produce output in the windows-1251 charset but LocalCharset is koi8-r, use this command for parsing MS Word documents:
Mime application/msword "text/plain; charset=windows-1251" "catdoc -a $1"
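The conversion described here amounts to decoding the parser output from the declared charset and re-encoding it in the local one. The following sketch shows that step under stated assumptions: `split_content_type` and `recode` are hypothetical names, and treating undeclared output as already local is a guess, not documented behavior.

```python
def split_content_type(value):
    """Split 'type/subtype; charset=name' into (mime, charset or None)."""
    mime, _, params = value.partition(";")
    charset = None
    for param in params.split(";"):
        key, _, val = param.partition("=")
        if key.strip().lower() == "charset" and val.strip():
            charset = val.strip().strip('"')
    return mime.strip(), charset


def recode(output, to_mime, local_charset):
    """Convert parser output bytes from the declared charset to the local one."""
    _, charset = split_content_type(to_mime)
    if charset is None:
        return output  # nothing declared: assume the output is already local
    return output.decode(charset).encode(local_charset, errors="replace")
```

For the catdoc example, `recode(raw, "text/plain; charset=windows-1251", "koi8-r")` turns windows-1251 bytes into koi8-r bytes carrying the same text.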
When executing a parser, indexer sets the DPS_URL environment variable to the URL being processed. You can use this variable in parser scripts.
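As an illustration of using DPS_URL, a parser script could put the URL into the title of the HTML it emits. The sketch below assumes a text-to-HTML parser; the function names and the page layout are made up for the example, and only the DPS_URL variable itself comes from this manual.

```python
import html
import os


def to_html(text, url):
    """Wrap plain text in a minimal HTML page titled with the source URL."""
    return (
        "<html><head><title>{}</title></head>"
        "<body><pre>{}</pre></body></html>"
    ).format(html.escape(url), html.escape(text))


def render(text, environ=os.environ):
    """Build the page, taking the URL from DPS_URL (empty if it is not set)."""
    url = environ.get("DPS_URL", "")
    return to_html(text, url)
```

A real script would read the document from stdin or from the file named by $1 and write `render(...)` to stdout; passing a plain dict as `environ` makes the function easy to try outside indexer.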
RPM parser by Mario Lang <lang@zid.tu-graz.ac.at>
/usr/local/bin/rpminfo:
#!/bin/bash
/usr/bin/rpm -q --queryformat="<html><head><title>RPM: %{NAME} %{VERSION}-%{RELEASE} (%{GROUP})</title><meta name=\"description\" content=\"%{SUMMARY}\"></head><body> %{DESCRIPTION}\n</body></html>" -p $1
indexer.conf:
Mime application/x-rpm text/html "/usr/local/bin/rpminfo $1"
It produces RPM information like this in search results:
3. RPM: mysql 3.20.32a-3 (Applications/Databases) [4] Mysql is a SQL (Structured Query Language) database server. Mysql was written by Michael (monty) Widenius. See the CREDITS file in the distribution for more credits for mysql and related things.... (application/x-rpm) 2088855 bytes
catdoc MS Word to text converter
Home page, also listed on Freshmeat.
indexer.conf:
Mime application/msword text/plain "catdoc $1"
xls2csv MS Excel to text converter
It is supplied with catdoc.
indexer.conf:
Mime application/vnd.ms-excel text/plain "xls2csv $1"
pdftotext Adobe PDF converter
Supplied with xpdf project.
Homepage, also listed on Freshmeat.
indexer.conf:
Mime application/pdf text/plain "pdftotext $1 -"
unrtf RTF to html converter
indexer.conf:
Mime text/rtf* text/html "/usr/local/dpsearch/sbin/unrtf --html $1"
Mime application/rtf text/html "/usr/local/dpsearch/sbin/unrtf --html $1"
xlhtml XLS to html converter
indexer.conf:
Mime application/vnd.ms-excel text/html "/usr/local/dpsearch/sbin/xlhtml $1"
ppthtml PowerPoint (PPT) to html converter. Part of xlhtml 0.5.
indexer.conf:
Mime application/vnd.ms-powerpoint text/html "/usr/local/dpsearch/sbin/ppthtml $1"
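Using wvHtml (DOC to html).
/usr/local/dpsearch/sbin/0wvHtml.pl:
#!/usr/bin/perl -w
# $ARGV[0] is the input file ($1), $ARGV[1] is the output file ($2).
$p = $ARGV[1];
$f = $ARGV[1];
# Split the output path into its directory ($p) and file name ($f).
$p =~ s/(.*)\/([^\/]*)/$1\//;
$f =~ s/(.*)\/([^\/]*)/$2/;
system("/usr/local/bin/wvHtml --targetdir=$p $ARGV[0] $f");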
indexer.conf:
Mime application/msword text/html "/usr/local/dpsearch/sbin/0wvHtml.pl $1 $2"
Mime application/vnd.ms-word text/html "/usr/local/dpsearch/sbin/0wvHtml.pl $1 $2"
swf2html from Flash Search Engine SDK
indexer.conf:
Mime application/x-shockwave-flash text/html "/usr/local/dpsearch/sbin/swf2html $1"
djvutxt from djvuLibre
indexer.conf:
Mime image/djvu text/plain "/usr/local/bin/djvutxt $1 $2"
Mime image/x.djvu text/plain "/usr/local/bin/djvutxt $1 $2"
Mime image/x-djvu text/plain "/usr/local/bin/djvutxt $1 $2"
Mime image/vnd.djvu text/plain "/usr/local/bin/djvutxt $1 $2"
DataparkSearch can be built with the libextractor library. Using this library, DataparkSearch can index keywords from files of the following formats: PDF, PS, OLE2 (DOC, XLS, PPT), OpenOffice (sxw), StarOffice (sdw), DVI, MAN, FLAC, MP3 (ID3v1 and ID3v2), NSF(E) (NES music), SID (C64 music), OGG, WAV, EXIV2, JPEG, GIF, PNG, TIFF, DEB, RPM, TAR(.GZ), ZIP, ELF, S3M (Scream Tracker 3), XM (eXtended Module), IT (Impulse Tracker), FLV, REAL, RIFF (AVI), MPEG, QT and ASF.
To build DataparkSearch with libextractor library, install the library, and then configure and compile DataparkSearch.
The table below shows the relationship between the keyword types of libextractor versions prior to 0.6 and DataparkSearch section names:
Table 3-1. Relationship between libextractor's keyword types and DataparkSearch section names
Keyword Type | Section name |
---|---|
EXTRACTOR_FILENAME | Filename |
EXTRACTOR_MIMETYPE | Mimetype |
EXTRACTOR_TITLE | Title |
EXTRACTOR_AUTHOR | Author |
EXTRACTOR_ARTIST | Artist |
EXTRACTOR_DESCRIPTION | Description |
EXTRACTOR_COMMENT | Comment |
EXTRACTOR_DATE | Date |
EXTRACTOR_PUBLISHER | Publisher |
EXTRACTOR_LANGUAGE | Content-Language |
EXTRACTOR_ALBUM | Album |
EXTRACTOR_GENRE | Genre |
EXTRACTOR_LOCATION | Location |
EXTRACTOR_VERSIONNUMBER | VersionNumber |
EXTRACTOR_ORGANIZATION | Organization |
EXTRACTOR_COPYRIGHT | Copyright |
EXTRACTOR_SUBJECT | Subject |
EXTRACTOR_KEYWORDS | Meta.Keywords |
EXTRACTOR_CONTRIBUTOR | Contributor |
EXTRACTOR_RESOURCE_TYPE | Resource-Type |
EXTRACTOR_FORMAT | Format |
EXTRACTOR_RESOURCE_IDENTIFIER | Resource-Idendifier |
EXTRACTOR_SOURCE | Source |
EXTRACTOR_RELATION | Relation |
EXTRACTOR_COVERAGE | Coverage |
EXTRACTOR_SOFTWARE | Software |
EXTRACTOR_DISCLAIMER | Disclaimer |
EXTRACTOR_WARNING | Warning |
EXTRACTOR_TRANSLATED | Translated |
EXTRACTOR_CREATION_DATE | Creation-Date |
EXTRACTOR_MODIFICATION_DATE | Modification-Date |
EXTRACTOR_CREATOR | Creator |
EXTRACTOR_PRODUCER | Producer |
EXTRACTOR_PAGE_COUNT | Page-Count |
EXTRACTOR_PAGE_ORIENTATION | Page-Orientation |
EXTRACTOR_PAPER_SIZE | Paper-Size |
EXTRACTOR_USED_FONTS | Used-Fonts |
EXTRACTOR_PAGE_ORDER | Page-Order |
EXTRACTOR_CREATED_FOR | Created-For |
EXTRACTOR_MAGNIFICATION | Magnification |
EXTRACTOR_RELEASE | Release |
EXTRACTOR_GROUP | Group |
EXTRACTOR_SIZE | Size |
EXTRACTOR_SUMMARY | Summary |
EXTRACTOR_PACKAGER | Packager |
EXTRACTOR_VENDOR | Vendor |
EXTRACTOR_LICENSE | License |
EXTRACTOR_DISTRIBUTION | Distribution |
EXTRACTOR_BUILDHOST | BuildHost |
EXTRACTOR_OS | OS |
EXTRACTOR_DEPENDENCY | Dependency |
EXTRACTOR_HASH_MD4 | Hash-MD4 |
EXTRACTOR_HASH_MD5 | Hash-MD5 |
EXTRACTOR_HASH_SHA0 | Hash-SHA0 |
EXTRACTOR_HASH_SHA1 | Hash-SHA1 |
EXTRACTOR_HASH_RMD160 | Hash-RMD160 |
EXTRACTOR_RESOLUTION | Resolution |
EXTRACTOR_CATEGORY | Ext.Category |
EXTRACTOR_BOOKTITLE | BookTitle |
EXTRACTOR_PRIORITY | Priority |
EXTRACTOR_CONFLICTS | Conflicts |
EXTRACTOR_REPLACES | Replaces |
EXTRACTOR_PROVIDES | Provides |
EXTRACTOR_CONDUCTOR | Conductor |
EXTRACTOR_INTERPRET | Interpret |
EXTRACTOR_OWNER | Owner |
EXTRACTOR_LYRICS | Lyrics |
EXTRACTOR_MEDIA_TYPE | Media-Type |
EXTRACTOR_CONTACT | Contact |
EXTRACTOR_THUMBNAIL_DATA | Thumbnail-Data |
EXTRACTOR_PUBLICATION_DATE | Publication-Date |
EXTRACTOR_CAMERA_MAKE | Camera-Make |
EXTRACTOR_CAMERA_MODEL | Camera-Model |
EXTRACTOR_EXPOSURE | Exposure |
EXTRACTOR_APERTURE | Aperture |
EXTRACTOR_EXPOSURE_BIAS | Exposure-Bias |
EXTRACTOR_FLASH | Flash |
EXTRACTOR_FLASH_BIAS | Flash-Bias |
EXTRACTOR_FOCAL_LENGTH | Focal-Length |
EXTRACTOR_FOCAL_LENGTH_35MM | Focal-Length-35MM |
EXTRACTOR_ISO_SPEED | ISO-Speed |
EXTRACTOR_EXPOSURE_MODE | Exposure-Mode |
EXTRACTOR_METERING_MODE | Metering-Mode |
EXTRACTOR_MACRO_MODE | Macro-Mode |
EXTRACTOR_IMAGE_QUALITY | Image-Quality |
EXTRACTOR_WHITE_BALANCE | White-Balance |
EXTRACTOR_ORIENTATION | Orientation |
EXTRACTOR_TEMPLATE | Template |
EXTRACTOR_SPLIT | Split |
EXTRACTOR_PRODUCTVERSION | ProductVersion |
EXTRACTOR_LAST_SAVED_BY | Last-Saved-By |
EXTRACTOR_LAST_PRINTED | Last-Printed |
EXTRACTOR_WORD_COUNT | Word-Count |
EXTRACTOR_CHARACTER_COUNT | Character-Count |
EXTRACTOR_TOTAL_EDITING_TIME | Total-Editing-Time |
EXTRACTOR_THUMBNAILS | Thumbnails |
EXTRACTOR_SECURITY | Security |
EXTRACTOR_CREATED_BY_SOFTWARE | Created-By-Software |
EXTRACTOR_MODIFIED_BY_SOFTWARE | Modified-By-Software |
EXTRACTOR_REVISION_HISTORY | Revision-History |
EXTRACTOR_LOWERCASE | Lowercase |
EXTRACTOR_COMPANY | Company |
EXTRACTOR_GENERATOR | Generator |
EXTRACTOR_CHARACTER_SET | Meta-Charset |
EXTRACTOR_LINE_COUNT | Line-Count |
EXTRACTOR_PARAGRAPH_COUNT | Paragraph-Count |
EXTRACTOR_EDITING_CYCLES | Editing-Cycles |
EXTRACTOR_SCALE | Scale |
EXTRACTOR_MANAGER | Manager |
EXTRACTOR_MOVIE_DIRECTOR | Movie-Director |
EXTRACTOR_DURATION | Duration |
EXTRACTOR_INFORMATION | Information |
EXTRACTOR_FULL_NAME | Full-Name |
EXTRACTOR_CHAPTER | Chapter |
EXTRACTOR_YEAR | Year |
EXTRACTOR_LINK | Link |
EXTRACTOR_MUSIC_CD_IDENTIFIER | Music-CD-Identifier |
EXTRACTOR_PLAY_COUNTER | Play-Counter |
EXTRACTOR_POPULARITY_METER | Popularity-Meter |
EXTRACTOR_CONTENT_TYPE | Ext.Content-Type |
EXTRACTOR_ENCODED_BY | Encoded-By |
EXTRACTOR_TIME | Time |
EXTRACTOR_MUSICIAN_CREDITS_LIST | Musician-Credits-List |
EXTRACTOR_MOOD | Mood |
EXTRACTOR_FORMAT_VERSION | Format-Version |
EXTRACTOR_TELEVISION_SYSTEM | Television-System |
EXTRACTOR_SONG_COUNT | Song-Count |
EXTRACTOR_STARTING_SONG | Strting-Song |
EXTRACTOR_HARDWARE_DEPENDENCY | Hardware-Dependency |
EXTRACTOR_RIPPER | Ripper |
EXTRACTOR_FILE_SIZE | File-Size |
EXTRACTOR_TRACK_NUMBER | Track-Number |
EXTRACTOR_ISRC | ISRC |
EXTRACTOR_DISC_NUMBER | Disc-Number |
If a section name from the list above is not specified in sections.conf, the value of the corresponding keyword is written to the body section.
Keywords of unknown type are written to the body section as well.
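The fallback rule can be summarized in a short sketch. It copies only a few rows of the table and treats the set of sections configured in sections.conf as a plain set of names; `target_section` is a hypothetical helper, and exact-case name matching is an assumption.

```python
# A few rows of the table above (libextractor before 0.6).
SECTION_NAMES = {
    "EXTRACTOR_TITLE": "Title",
    "EXTRACTOR_AUTHOR": "Author",
    "EXTRACTOR_LANGUAGE": "Content-Language",
    "EXTRACTOR_KEYWORDS": "Meta.Keywords",
}


def target_section(keyword_type, configured_sections):
    """Pick the section that receives a keyword value, falling back to body."""
    name = SECTION_NAMES.get(keyword_type)
    if name is None or name not in configured_sections:
        return "body"  # unknown keyword type or section not configured
    return name
```

With only `Title` configured, a title keyword goes to `Title`, while an author keyword and a keyword of unknown type both end up in `body`.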
For libextractor 0.6.x, the values returned by EXTRACTOR_metatype_to_string function are used as section names.