public class ReplaceIndexer extends Object implements IndexingFilter
To use this plugin, add index-replace to your plugin.includes. Example:
    <property>
      <name>plugin.includes</name>
      <value>protocol-(http)|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata|replace)|urlnormalizer-(pass|regex|basic)|indexer-solr</value>
    </property>

Then add the index.replace.regexp property to conf/nutch-site.xml. This contains a list of replacement instructions per field name, one per line, e.g.

    fieldname=/regexp/replacement/[flags]

    <property>
      <name>index.replace.regexp</name>
      <value>
        hostmatch=.*\.com
        title=/search/replace/2
      </value>
    </property>
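To make the fieldname=/regexp/replacement/[flags] format concrete, the following is a minimal sketch of how one such line could be parsed and applied. It is not the plugin's actual code: the class ReplaceRuleSketch and its methods are hypothetical, and the sketch assumes that the first character after the = is the delimiter and that the optional flags value is an integer passed straight to java.util.regex.Pattern.compile (so 2 would correspond to Pattern.CASE_INSENSITIVE). Check conf/nutch-default.xml for the authoritative rules.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): one "fieldname=/regexp/replacement/[flags]" rule. */
final class ReplaceRuleSketch {
    final String field;
    final Pattern pattern;
    final String replacement;

    private ReplaceRuleSketch(String field, Pattern pattern, String replacement) {
        this.field = field;
        this.pattern = pattern;
        this.replacement = replacement;
    }

    /** Parses a single instruction line such as "title=/search/replace/2". */
    static ReplaceRuleSketch parse(String line) {
        String trimmed = line.trim();
        int eq = trimmed.indexOf('=');
        if (eq <= 0 || eq + 1 >= trimmed.length()) {
            throw new IllegalArgumentException("Expected fieldname=/regexp/replacement/[flags]: " + line);
        }
        String field = trimmed.substring(0, eq).trim();
        String spec = trimmed.substring(eq + 1);

        // Assumption: the first character after '=' is the delimiter ('/' in the documented example).
        String delimiter = spec.substring(0, 1);
        String[] parts = spec.substring(1).split(Pattern.quote(delimiter), -1);
        if (parts.length < 2) {
            throw new IllegalArgumentException("Missing replacement in: " + line);
        }

        // Assumption: flags is an integer bitmask understood by Pattern.compile; absent means 0.
        int flags = (parts.length > 2 && !parts[2].isEmpty()) ? Integer.parseInt(parts[2]) : 0;
        return new ReplaceRuleSketch(field, Pattern.compile(parts[0], flags), parts[1]);
    }

    /** Replaces every match of the rule's pattern in the given field value. */
    String apply(String value) {
        return pattern.matcher(value).replaceAll(replacement);
    }

    public static void main(String[] args) {
        ReplaceRuleSketch rule = parse("title=/search/replace/2");
        System.out.println(rule.field + ": " + rule.apply("Search the index"));
    }
}
```

Under these assumptions, the rule from the example property rewrites a title of "Search the index" to "replace the index", because flag value 2 makes the match case-insensitive.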
hostmatch= and urlmatch= lines indicate the match pattern for a host or url. The field replacements that follow such a line apply only to pages from the matching host or url. Replacements run in the order specified. Field names may appear multiple times if multiple replacements are needed.
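The scoping and ordering described here can be illustrated with a small, self-contained sketch. It is not taken from the plugin: ScopedReplaceSketch is a hypothetical class, it handles only hostmatch= lines (urlmatch= would be analogous), it assumes '/' as the delimiter, it assumes that rules appearing before any hostmatch= line apply to every host, and it tests the host with Matcher.find(), which may differ from what Nutch does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration (not Nutch code) of host-scoped, ordered field replacements. */
final class ScopedReplaceSketch {

    /** One replacement, valid only for hosts matched by hostPattern. */
    private static final class Rule {
        final Pattern hostPattern;
        final String field;
        final Pattern regex;
        final String replacement;

        Rule(Pattern hostPattern, String field, Pattern regex, String replacement) {
            this.hostPattern = hostPattern;
            this.field = field;
            this.regex = regex;
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Reads the multi-line property value, keeping rules in the order they appear. */
    ScopedReplaceSketch(String config) {
        // Assumption: rules listed before any hostmatch= line apply to all hosts.
        Pattern currentHost = Pattern.compile(".*");
        for (String raw : config.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("hostmatch=")) {
                // Every rule after this line is scoped to the new host pattern.
                currentHost = Pattern.compile(line.substring("hostmatch=".length()));
                continue;
            }
            int eq = line.indexOf('=');
            if (eq <= 0 || eq + 1 >= line.length() || line.charAt(eq + 1) != '/') {
                throw new IllegalArgumentException("Expected fieldname=/regexp/replacement/[flags]: " + line);
            }
            String[] parts = line.substring(eq + 2).split("/", -1);
            if (parts.length < 2) {
                throw new IllegalArgumentException("Missing replacement in: " + line);
            }
            // Assumption: flags is an integer bitmask for Pattern.compile; absent means 0.
            int flags = (parts.length > 2 && !parts[2].isEmpty()) ? Integer.parseInt(parts[2]) : 0;
            rules.add(new Rule(currentHost, line.substring(0, eq), Pattern.compile(parts[0], flags), parts[1]));
        }
    }

    /** Returns a copy of the fields with every rule for a matching host applied in order. */
    Map<String, String> apply(String host, Map<String, String> fields) {
        Map<String, String> out = new LinkedHashMap<>(fields);
        for (Rule rule : rules) {
            if (!rule.hostPattern.matcher(host).find()) {
                continue; // rule is scoped to a different host
            }
            String value = out.get(rule.field);
            if (value != null) {
                out.put(rule.field, rule.regex.matcher(value).replaceAll(rule.replacement));
            }
        }
        return out;
    }
}
```

Because the rules are kept in a list and applied sequentially, a second rule for the same field sees the output of the first, which is why a field name can usefully appear more than once.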
The property format is defined in greater detail in conf/nutch-default.xml.
Constructor and Description |
---|
ReplaceIndexer() |
Modifier and Type | Method and Description |
---|---|
NutchDocument | filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) Adds fields or otherwise modifies the document that will be indexed for a parse. |
Configuration | getConf() |
void | setConf(Configuration conf) |
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable

public Configuration getConf()
Specified by: getConf in interface Configurable

public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
Specified by: filter in interface IndexingFilter
Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
Throws:
IndexingException

Copyright © 2021 The Apache Software Foundation