public class HttpRobotRulesParser extends RobotRulesParser

This class extends the generic RobotRulesParser class and contains the HTTP protocol specific implementation for obtaining the robots.txt file.

Modifier and Type | Field and Description
---|---
protected boolean | allowForbidden

Fields inherited from class RobotRulesParser: agentNames, CACHE, conf, EMPTY_RULES, FORBID_ALL_RULES, whiteList
Constructor and Description
---
HttpRobotRulesParser(Configuration conf)
Modifier and Type | Method and Description
---|---
protected void | addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse) Append the Content of robots.txt to robotsTxtContent.
protected static String | getCacheKey(URL url) Compose a unique key to store and access robot rules in the cache for a given URL.
crawlercommons.robots.BaseRobotRules | getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent) Get the rules from robots.txt which apply for the given url.
void | setConf(Configuration conf) Set the Configuration object.

Methods inherited from class RobotRulesParser: getConf, getRobotRulesSet, isWhiteListed, main, parseRules, run
public HttpRobotRulesParser(Configuration conf)
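For illustration, a typical instantiation (NutchConfiguration.create() is the conventional way to obtain a Nutch Configuration; it is assumed here, not shown on this page):

```java
// Hypothetical instantiation; NutchConfiguration is assumed, not part of this page.
Configuration conf = NutchConfiguration.create();
HttpRobotRulesParser parser = new HttpRobotRulesParser(conf);
```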
public void setConf(Configuration conf)

Set the Configuration object.

Specified by: setConf in interface Configurable
Overrides: setConf in class RobotRulesParser
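Nothing more is said here about what setConf reads, but given the protected allowForbidden field above, it plausibly initializes that flag from the job configuration. A minimal sketch under that assumption (the property name http.robots.403.allow is an assumption based on nutch-default.xml, not confirmed by this page):

```java
// Sketch only: initialize the allowForbidden field from the configuration.
// The property name "http.robots.403.allow" is an assumption.
@Override
public void setConf(Configuration conf) {
  super.setConf(conf); // inherited setup from RobotRulesParser
  this.allowForbidden = conf.getBoolean("http.robots.403.allow", false);
}
```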
protected static String getCacheKey(URL url)

Compose a unique key to store and access robot rules in the cache for the given URL.
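The key composition is not spelled out here, but the caching note under getRobotRulesSet below says rules are cached per protocol, host, and port. A minimal sketch along those lines (the normalization and default-port fallback are assumptions, not the documented implementation):

```java
// Sketch only: compose a cache key from the parts named in the
// getRobotRulesSet docs (protocol, host, port). Lower-casing and the
// default-port fallback are assumptions, not documented behavior.
protected static String getCacheKey(URL url) {
  String protocol = url.getProtocol().toLowerCase();
  String host = url.getHost().toLowerCase();
  int port = url.getPort() >= 0 ? url.getPort() : url.getDefaultPort();
  return protocol + ":" + host + ":" + port;
}
```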
public crawlercommons.robots.BaseRobotRules getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent)

Get the rules from robots.txt which apply for the given url. Robot rules are cached for a unique combination of host, protocol, and port. If no rules are found in the cache, an HTTP request is sent to fetch protocol://host:port/robots.txt. The robots.txt is then parsed and the rules are cached to avoid re-fetching and re-parsing it.

Specified by: getRobotRulesSet in class RobotRulesParser

Parameters:
http - The Protocol object
url - URL
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.

Returns:
BaseRobotRules object for the rules
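A hypothetical caller, to illustrate the flow described above. HttpRobotRulesParser, getRobotRulesSet, Protocol, and Content come from this page; NutchConfiguration and BaseRobotRules.isAllowed are assumptions drawn from the wider Nutch and crawler-commons APIs:

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.http.api.HttpRobotRulesParser;
import org.apache.nutch.util.NutchConfiguration;

import crawlercommons.robots.BaseRobotRules;

public class RobotsCheck {

  /** Returns true if robots.txt permits fetching the given URL. */
  static boolean mayFetch(Protocol http, URL url) {
    Configuration conf = NutchConfiguration.create();
    HttpRobotRulesParser robots = new HttpRobotRulesParser(conf);
    // Pass a list to capture the raw robots.txt response(s); null skips archiving.
    List<Content> robotsTxtContent = new ArrayList<>();
    BaseRobotRules rules = robots.getRobotRulesSet(http, url, robotsTxtContent);
    return rules.isAllowed(url.toString());
  }
}
```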
protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl, Response robotsResponse)

Append the Content of robots.txt to robotsTxtContent.

Parameters:
robotsTxtContent - container to store robots.txt response content
robotsUrl - robots.txt URL
robotsResponse - response object to be stored
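The page only says the response Content is appended, so here is one way such a hook could look. The Content constructor arguments and the Response accessors are assumptions about the Nutch API, not taken from this page:

```java
// Sketch only: wrap the robots.txt HTTP response in a Content record and
// append it to the caller-supplied list. The Content constructor and the
// Response accessors are assumptions, not documented on this page.
// Metadata here is org.apache.nutch.metadata.Metadata.
protected void addRobotsContent(List<Content> robotsTxtContent, URL robotsUrl,
    Response robotsResponse) {
  Content content = new Content(robotsUrl.toString(), robotsUrl.toString(),
      robotsResponse.getContent(),              // raw robots.txt bytes
      robotsResponse.getHeader("Content-Type"), // may be null
      new Metadata(), getConf());
  robotsTxtContent.add(content);
}
```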
Copyright © 2021 The Apache Software Foundation