public class ExemptionUrlFilter extends RegexURLFilter implements URLExemptionFilter
This URLExemptionFilter implementation uses regex configuration to check whether a URL is eligible for exemption from 'db.ignore.external'.
When this filter is enabled, external URLs are checked against the configured sequence of regex rules.
The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath, but it can be overridden with the property "db.ignore.external.exemptions.file" in ./conf/nutch-*.xml.
See Also: URLExemptionFilter, RegexURLFilter
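The rule checking described above can be pictured with a small stand-alone sketch. It is not the Nutch implementation: the class `ExemptionRuleSketch`, its nested `Rule` type, and the method `isExempted` are hypothetical names, and the first-match-wins handling of '+' and '-' rules is an assumption based on the usual format of Nutch regex rule files.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration only; not part of Nutch.
class ExemptionRuleSketch {

    /** One rule: a sign ('+' exempts, '-' does not) and a compiled regex. */
    static final class Rule {
        final boolean exempt;
        final Pattern pattern;

        Rule(boolean exempt, String regex) {
            this.exempt = exempt;
            this.pattern = Pattern.compile(regex);
        }
    }

    /**
     * Walks the rules in order. The first rule whose regex is found in the
     * URL decides the outcome (assumed first-match-wins semantics).
     * A URL that matches no rule is treated as not exempted.
     */
    static boolean isExempted(List<Rule> rules, String toUrl) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(toUrl).find()) {
                return rule.exempt;
            }
        }
        return false;
    }
}
```

With a '-' rule for `\.pdf$` placed before a '+' rule for a hypothetical host such as `cdn.example.org`, images on that host would be exempted while PDF links on the same host would not, because rule order matters.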
Modifier and Type | Field and Description |
---|---|
static String | DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE |
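Fields inherited from class RegexURLFilter: URLFILTER_REGEX_FILE, URLFILTER_REGEX_RULES
Fields inherited from class RegexURLFilterBase: hasHostDomainRules
Fields inherited from interface URLExemptionFilter: X_POINT_ID
Fields inherited from interface URLFilter: X_POINT_ID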
Constructor and Description |
---|
ExemptionUrlFilter() |
Modifier and Type | Method and Description |
---|---|
boolean | filter(String fromUrl, String toUrl): Checks if toUrl is exempted when the ignore external is enabled |
List<Pattern> | getExemptions() |
protected Reader | getRulesReader(Configuration conf): Gets reader for regex rules |
static void | main(String[] args) |
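Methods inherited from class RegexURLFilter: createRule, createRule
Methods inherited from class RegexURLFilterBase: filter, getConf, main, setConf
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf, setConf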
public static final String DB_IGNORE_EXTERNAL_EXEMPTIONS_FILE
public boolean filter(String fromUrl, String toUrl)
Description copied from interface URLExemptionFilter: Checks if toUrl is exempted when the ignore external is enabled.
Specified by: filter in interface URLExemptionFilter
Parameters:
fromUrl - the source URL which generated the outlink
toUrl - the destination URL which needs to be checked for exemption

protected Reader getRulesReader(Configuration conf) throws IOException
Gets reader for regex rules.
Overrides: getRulesReader in class RegexURLFilter
Parameters:
conf - the current configuration
Throws:
IOException
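To illustrate what a consumer of such a rules reader might do, here is a simplified sketch that reads rule lines from any `java.io.Reader`. The class `RuleFileSketch` and the method `readRules` are hypothetical, and the line format (one rule per line, a leading '+' or '-', '#' for comments) is an assumption about the rule file; the actual Nutch parser may accept additional line types.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch.
class RuleFileSketch {

    /**
     * Reads rule lines such as "+regex" or "-regex" from the reader,
     * skipping blank lines and lines starting with '#'.
     * Any other leading character is rejected (assumed format).
     */
    static List<String> readRules(Reader reader) {
        List<String> rules = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(reader)) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty() || line.startsWith("#")) {
                    continue; // nothing to parse on this line
                }
                char sign = line.charAt(0);
                if (sign != '+' && sign != '-') {
                    throw new IllegalArgumentException("Invalid first character: " + line);
                }
                rules.add(line);
            }
        } catch (IOException e) {
            // Wrapped so callers of this sketch need no checked-exception handling.
            throw new UncheckedIOException(e);
        }
        return rules;
    }
}
```

Passing a `java.io.StringReader` makes the sketch easy to try without a file on the classpath.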
public static void main(String[] args)
Copyright © 2021 The Apache Software Foundation