Filters URLs based on a file of regular expressions, looking up the rules
by host or domain name first. The default policy is to accept a URL if no
matches are found.
Rule Format:
Host www.example.org
DenyPath /path/to/be/excluded
DenyPath /some/other/path/excluded
# Deny everything from *.example.com and example.com
Domain example.com
DenyPath .*
Domain example.org
DenyPathQuery /resource/.*?action=exclude
Host rules are evaluated before Domain rules. For Host rules the entire
host name of a URL must match, while the domain names in Domain rules are
considered as matches if the domain is a suffix of the host name
(consisting of complete host name parts). Longer domain suffixes are
checked first. A single dot "." as "domain name" can be used to specify
global rules applied to every URL.
E.g., for "www.example.com" the rules given above are looked up in the
following order:
- check "www.example.com" whether host-based rules exist and whether one of
them matches
- check "www.example.com" for domain-based rules
- check "example.com" for domain-based rules
- check "com" for domain-based rules
- check for global rules ("Domain .")
The first matching rule will reject the URL and no further rules are checked.
If no rule matches, the URL is accepted. URLs without a host name (e.g.,
file:/path/file.txt) are checked for global rules only. URLs which fail to
be parsed as URL are always rejected.
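
The lookup order and the special cases described above can be sketched as
follows. This is only an illustration, not the filter's actual code: the
class LookupOrderSketch and the method lookupOrder are hypothetical names,
and an empty result list is used here to stand for "rejected because the URL
could not be parsed".

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class LookupOrderSketch {

  /**
   * Returns the rule sections consulted for a URL, in lookup order.
   * An empty list stands for "rejected: the URL could not be parsed".
   */
  static List<String> lookupOrder(String url) {
    List<String> sections = new ArrayList<>();
    URL u;
    try {
      u = new URL(url);
    } catch (MalformedURLException e) {
      return sections; // unparsable URLs are rejected, no rule is consulted
    }
    String host = u.getHost();
    if (host != null && !host.isEmpty()) {
      sections.add("Host " + host);   // host-based rules come first
      sections.add("Domain " + host); // then domain rules for the full host name
      int start = 0;
      int dot;
      // strip one leading host name part at a time: example.com, com
      while ((dot = host.indexOf('.', start)) != -1) {
        start = dot + 1;
        if (start < host.length()) {
          sections.add("Domain " + host.substring(start));
        }
      }
    }
    sections.add("Domain ."); // global rules are checked last, for every URL
    return sections;
  }

  public static void main(String[] args) {
    String[] urls = {
      "http://www.example.com/index.html", "file:/path/file.txt", "not a url"
    };
    for (String url : urls) {
      System.out.println(url + " -> " + lookupOrder(url));
    }
  }
}
```

For "http://www.example.com/index.html" the sketch yields the five lookups
listed above, for "file:/path/file.txt" only the global section "Domain .".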
Each rule checks whether the given Java regular expression is found (see
Matcher.find()) in the URL path (DenyPath) or in the URL path and query
(DenyPathQuery).
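
The difference between the two rule types, and the effect of using
Matcher.find(), may be easier to see in a small sketch. The class
DenyRuleSketch and its methods are hypothetical, and it is an assumption
that "path and query" corresponds to the value returned by URL.getFile().

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.regex.Pattern;

class DenyRuleSketch {

  private static URL parse(String url) {
    try {
      return new URL(url);
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("not a valid URL: " + url, e);
    }
  }

  /** DenyPath: the regular expression is searched for in the URL path only. */
  static boolean denyPathMatches(String regex, String url) {
    return Pattern.compile(regex).matcher(parse(url).getPath()).find();
  }

  /** DenyPathQuery: the regular expression is searched for in path and query. */
  static boolean denyPathQueryMatches(String regex, String url) {
    // Assumption: getFile() (the path, plus "?query" if present) is the text searched.
    return Pattern.compile(regex).matcher(parse(url).getFile()).find();
  }

  public static void main(String[] args) {
    String url = "http://www.example.org/resource/item?action=exclude";
    System.out.println(denyPathMatches("/resource/", url));      // true
    System.out.println(denyPathMatches("action=exclude", url));  // false: query is not searched
    System.out.println(denyPathQueryMatches("/resource/.*?action=exclude", url)); // true
  }
}
```

Because find() is not anchored, an expression such as "/path" also matches
"/prefix/path". Start the expression with "^" to match only at the beginning
of the path.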
Rules are applied in the order of their definition. For better performance,
regular expressions which are simpler/faster or match more URLs should be
defined earlier.
Comments in the rule file start with the # character and reach until the
end of the line.
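
As an illustration of the rule format and the comment syntax, the following
sketch reads rule lines into sections while preserving the order of
definition. It is not the filter's own parser: the class RuleFileSketch, the
method parse and the error handling are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RuleFileSketch {

  /** Groups DenyPath/DenyPathQuery lines under their Host/Domain section. */
  static Map<String, List<String>> parse(List<String> lines) {
    Map<String, List<String>> sections = new LinkedHashMap<>();
    List<String> current = null;
    for (String raw : lines) {
      // a comment starts at '#' and reaches until the end of the line
      int hash = raw.indexOf('#');
      String line = (hash == -1 ? raw : raw.substring(0, hash)).trim();
      if (line.isEmpty()) {
        continue;
      }
      String[] parts = line.split("\\s+", 2);
      if (parts.length != 2) {
        throw new IllegalArgumentException("Missing argument: " + raw);
      }
      switch (parts[0]) {
        case "Host":
        case "Domain":
          // open (or reopen) the section for this host or domain
          current = sections.computeIfAbsent(
              parts[0] + " " + parts[1], k -> new ArrayList<>());
          break;
        case "DenyPath":
        case "DenyPathQuery":
          if (current == null) {
            throw new IllegalArgumentException("Rule without Host/Domain: " + raw);
          }
          current.add(parts[0] + " " + parts[1]); // keep definition order
          break;
        default:
          throw new IllegalArgumentException("Unknown directive: " + raw);
      }
    }
    return sections;
  }

  public static void main(String[] args) {
    List<String> lines = List.of(
        "Host www.example.org",
        "DenyPath /path/to/be/excluded",
        "# Deny everything from *.example.com and example.com",
        "Domain example.com",
        "DenyPath .*");
    System.out.println(parse(lines));
  }
}
```

Since this sketch cuts every line at the first #, a regular expression that
needs a literal # would have to be written differently, for example as \x23.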
The rules file is defined via the property urlfilter.fast.file; the default
name is fast-urlfilter.txt.