public class LinksIndexingFilter extends Object implements IndexingFilter
IndexingFilter that adds outlinks and inlinks field(s) to the document.
To ignore outlinks that point to the same host as the URL being indexed,
add the following setting to your configuration file:
<property>
<name>index.links.outlinks.host.ignore</name>
<value>true</value>
</property>
The same configuration is available for inlinks:
<property>
<name>index.links.inlinks.host.ignore</name>
<value>true</value>
</property>
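The effect of the two host.ignore settings can be pictured with a small plain-Java sketch. It is only an illustration of the idea (drop links whose host equals the host of the page being indexed) and is not taken from the Nutch source; the class name SameHostLinkSketch and its helper methods are hypothetical, and the comparison of lower-cased hosts is an assumption.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustrative sketch only; not the Nutch implementation.
class SameHostLinkSketch {

    // Returns the lower-cased host of a URL, or null if it cannot be parsed
    // or has no host component.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // Keeps only the links whose host differs from the host of pageUrl.
    // Links whose host cannot be determined are kept (an assumption).
    static List<String> dropSameHost(String pageUrl, List<String> links) {
        String pageHost = hostOf(pageUrl);
        List<String> kept = new ArrayList<>();
        for (String link : links) {
            String linkHost = hostOf(link);
            if (pageHost == null || !pageHost.equals(linkHost)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> outlinks = List.of(
            "http://example.org/about",
            "http://other.example.com/page");
        // prints [http://other.example.com/page]
        System.out.println(dropSameHost("http://example.org/index.html", outlinks));
    }
}
```

With index.links.outlinks.host.ignore set to true, the outlink to example.org would be left out of the document for a page on example.org, while the link to other.example.com would still be indexed.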
To store only the host portion of each inlink or outlink URL, add the
following to your configuration file:
<property>
<name>index.links.hosts.only</name>
<value>true</value>
</property>
Modifier and Type | Field and Description |
---|---|
static String | LINKS_INLINKS_HOST |
static String | LINKS_ONLY_HOSTS |
static String | LINKS_OUTLINKS_HOST |
Fields inherited from interface IndexingFilter: X_POINT_ID
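For the index.links.hosts.only setting, the idea of storing hosts instead of full link URLs can be sketched in plain Java as follows. This is a hedged illustration, not the filter's actual code: the class HostsOnlySketch and the method hostsOf are hypothetical names, and both the lower-casing and the removal of duplicate hosts are assumptions of the sketch.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative sketch only; not the Nutch implementation.
class HostsOnlySketch {

    // Reduces each link URL to its lower-cased host, keeping first-seen order.
    // Duplicate hosts are collapsed (an assumption of this sketch).
    static Set<String> hostsOf(List<String> links) {
        Set<String> hosts = new LinkedHashSet<>();
        for (String link : links) {
            try {
                String host = URI.create(link).getHost();
                if (host != null) {
                    hosts.add(host.toLowerCase(Locale.ROOT));
                }
            } catch (IllegalArgumentException e) {
                // Skip links that are not valid URIs.
            }
        }
        return hosts;
    }

    public static void main(String[] args) {
        List<String> inlinks = List.of(
            "http://example.org/a",
            "http://example.org/b",
            "http://other.example.com/");
        // prints [example.org, other.example.com]
        System.out.println(hostsOf(inlinks));
    }
}
```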
Constructor and Description |
---|
LinksIndexingFilter() |
Modifier and Type | Method and Description |
---|---|
NutchDocument | filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) Adds fields or otherwise modifies the document that will be indexed for a parse. |
Configuration | getConf() |
void | setConf(Configuration conf) |
public static final String LINKS_OUTLINKS_HOST
public static final String LINKS_INLINKS_HOST
public static final String LINKS_ONLY_HOSTS
public NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks) throws IndexingException
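Description copied from interface: IndexingFilter
Adds fields or otherwise modifies the document that will be indexed for a parse.
Specified by: filter in interface IndexingFilter
Parameters:
doc - document instance for collecting fields
parse - parse data instance
url - page url
datum - crawl datum for the page (fetch datum from segment containing fetch status and fetch time)
inlinks - page inlinks
Throws:
IndexingException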
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
public Configuration getConf()
Specified by: getConf in interface Configurable
Copyright © 2021 The Apache Software Foundation