public abstract class RobotRulesParser extends Object implements Tool

This class handles the parsing of robots.txt files. It emits SimpleRobotRules objects, which describe the download permissions as described in SimpleRobotRulesParser. Protocol-specific implementations have to implement the method getRobotRulesSet.
Modifier and Type | Field and Description
---|---
protected String | agentNames
protected static Hashtable<String,crawlercommons.robots.BaseRobotRules> | CACHE
protected Configuration | conf
static crawlercommons.robots.BaseRobotRules | EMPTY_RULES: A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.
static crawlercommons.robots.BaseRobotRules | FORBID_ALL_RULES: A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.
protected Set<String> | whiteList: Set of host names or IPs to be explicitly excluded from robots.txt checking.
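The two BaseRobotRules constants encode the two fallback policies described above. A small sketch of their behavior, assuming crawler-commons' isAllowed(String) check (the example URL is arbitrary):

```java
import crawlercommons.robots.BaseRobotRules;
import org.apache.nutch.protocol.RobotRulesParser;

public class DefaultRulesDemo {
  public static void main(String[] args) {
    // robots.txt missing or empty: every request is allowed
    BaseRobotRules empty = RobotRulesParser.EMPTY_RULES;
    System.out.println(empty.isAllowed("http://example.com/page"));  // true

    // robots.txt answered with 403/Forbidden: every request is disallowed
    BaseRobotRules forbid = RobotRulesParser.FORBID_ALL_RULES;
    System.out.println(forbid.isAllowed("http://example.com/page")); // false
  }
}
```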
Constructor and Description |
---|
RobotRulesParser() |
RobotRulesParser(Configuration conf) |
Modifier and Type | Method and Description
---|---
Configuration | getConf(): Get the Configuration object.
crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent): Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
abstract crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent): Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).
boolean | isWhiteListed(URL url): Check whether a URL belongs to a whitelisted host.
static void | main(String[] args)
crawlercommons.robots.BaseRobotRules | parseRules(String url, byte[] content, String contentType, String robotName): Parses the robots content using the SimpleRobotRulesParser from crawler-commons.
int | run(String[] args)
void | setConf(Configuration conf): Set the Configuration object.
public static final crawlercommons.robots.BaseRobotRules EMPTY_RULES

A BaseRobotRules object appropriate for use when the robots.txt file is empty or missing; all requests are allowed.

public static crawlercommons.robots.BaseRobotRules FORBID_ALL_RULES

A BaseRobotRules object appropriate for use when the robots.txt file is not fetched due to a 403/Forbidden response; all requests are disallowed.

protected Configuration conf

protected String agentNames
public RobotRulesParser()
public RobotRulesParser(Configuration conf)
public void setConf(Configuration conf)

Set the Configuration object.

Specified by: setConf in interface Configurable

public Configuration getConf()

Get the Configuration object.

Specified by: getConf in interface Configurable
public boolean isWhiteListed(URL url)

Check whether a URL belongs to a whitelisted host.
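A sketch of exercising the whitelist. Both the concrete HttpRobotRulesParser subclass (from the HTTP protocol plugin) and the http.robot.rules.whitelist property name are assumptions here; verify them against your Nutch version:

```java
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class WhitelistDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "mybot"); // an agent name must be configured
    // Assumed property backing the whiteList field; verify against
    // the nutch-default.xml of your Nutch version.
    conf.set("http.robot.rules.whitelist", "example.org,192.168.0.1");

    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    // Whitelisted hosts bypass robots.txt checking entirely.
    System.out.println(parser.isWhiteListed(new URL("http://example.org/any/page")));
  }
}
```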
public crawlercommons.robots.BaseRobotRules parseRules(String url, byte[] content, String contentType, String robotName)

Parses the robots content using the SimpleRobotRulesParser from crawler-commons.

Parameters:
url - A string containing the URL
content - Contents of the robots file in a byte array
contentType - The content type of the robots file
robotName - A string containing all the robots agent names used by the parser for matching
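A minimal sketch of calling parseRules on an in-memory robots.txt. The agent name "mybot" and the example rules are hypothetical, and the concrete HttpRobotRulesParser subclass is assumed because RobotRulesParser itself is abstract:

```java
import java.nio.charset.StandardCharsets;
import crawlercommons.robots.BaseRobotRules;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

public class ParseRulesDemo {
  public static void main(String[] args) {
    byte[] robotsTxt = ("User-agent: *\n" +
                        "Disallow: /private/\n").getBytes(StandardCharsets.UTF_8);

    Configuration conf = NutchConfiguration.create();
    conf.set("http.agent.name", "mybot"); // an agent name must be configured

    HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
    BaseRobotRules rules = parser.parseRules(
        "http://example.com/robots.txt", // url of the robots file
        robotsTxt,                       // contents as a byte array
        "text/plain",                    // content type
        "mybot");                        // agent name(s) to match

    System.out.println(rules.isAllowed("http://example.com/private/x")); // false
    System.out.println(rules.isAllowed("http://example.com/page"));      // true
  }
}
```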
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent)

Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).

Parameters:
protocol - Protocol object
url - URL to check
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.

public abstract crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it, and return the set of robot rules applicable for the configured agent name(s).

Parameters:
protocol - Protocol object
url - URL to check
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed nothing is stored.