Package | Description |
---|---|
org.apache.nutch.protocol | Classes related to the Protocol interface, see also org.apache.nutch.net.protocols. |
org.apache.nutch.protocol.file | Protocol plugin which supports retrieving local file resources. |
org.apache.nutch.protocol.ftp | Protocol plugin which supports retrieving documents via the ftp protocol. |
org.apache.nutch.protocol.htmlunit | Protocol plugin which supports retrieving documents via the http protocol. |
org.apache.nutch.protocol.http.api | Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.protocol.okhttp | Protocol plugin based on okhttp, supports http, https, http/2. |
Modifier and Type | Method and Description |
---|---|
Protocol | ProtocolFactory.getProtocol(String urlString): Returns the appropriate Protocol implementation for a url. |
Protocol | ProtocolFactory.getProtocol(URL url): Returns the appropriate Protocol implementation for a url. |
Protocol | ProtocolFactory.getProtocolById(String id) |
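
The lookup described above, "returns the appropriate Protocol implementation for a url", can be pictured as a registry keyed on the URL scheme. The following is only a minimal, self-contained sketch under that assumption; it does not use the Nutch classes. `SketchProtocol`, `SketchProtocolFactory`, `register`, and the plugin ids are hypothetical names chosen for illustration, and the real ProtocolFactory may select implementations differently.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a protocol plugin; NOT the Nutch Protocol interface.
interface SketchProtocol {
    String id();
}

// Hypothetical registry that resolves a plugin from the URL scheme (an assumption).
class SketchProtocolFactory {
    private final Map<String, SketchProtocol> byScheme = new HashMap<>();

    // Associates a scheme such as "ftp" with an illustrative plugin id.
    void register(String scheme, String id) {
        byScheme.put(scheme.toLowerCase(Locale.ROOT), () -> id);
    }

    // Returns the plugin registered for the scheme of the given URL string.
    SketchProtocol getProtocol(String urlString) {
        String scheme = URI.create(urlString).getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("no scheme in: " + urlString);
        }
        SketchProtocol protocol = byScheme.get(scheme.toLowerCase(Locale.ROOT));
        if (protocol == null) {
            throw new IllegalArgumentException("no protocol for scheme: " + scheme);
        }
        return protocol;
    }
}

class ProtocolLookupDemo {
    public static void main(String[] args) {
        SketchProtocolFactory factory = new SketchProtocolFactory();
        factory.register("file", "sketch-file");
        factory.register("ftp", "sketch-ftp");
        factory.register("https", "sketch-okhttp");
        System.out.println(factory.getProtocol("ftp://example.org/pub/readme.txt").id());
    }
}
```

Lower-casing the scheme before the lookup keeps `FTP://` and `ftp://` on the same plugin; an unknown scheme fails fast rather than returning null.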
Modifier and Type | Method and Description |
---|---|
crawlercommons.robots.BaseRobotRules | RobotRulesParser.getRobotRulesSet(Protocol protocol, Text url, List<Content> robotsTxtContent): Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s). |
abstract crawlercommons.robots.BaseRobotRules | RobotRulesParser.getRobotRulesSet(Protocol protocol, URL url, List<Content> robotsTxtContent): Fetch robots.txt (or its protocol-specific equivalent) which applies to the given URL, parse it and return the set of robot rules applicable for the configured agent name(s). |
Modifier and Type | Class and Description |
---|---|
class | File: This class is a protocol plugin used for file: scheme. |
Modifier and Type | Class and Description |
---|---|
class | Ftp: This class is a protocol plugin used for ftp: scheme. |
Modifier and Type | Method and Description |
---|---|
crawlercommons.robots.BaseRobotRules | FtpRobotRulesParser.getRobotRulesSet(Protocol ftp, URL url, List<Content> robotsTxtContent): For hosts whose robots rules are not yet cached, sends an FTP request to the host corresponding to the URL passed, gets the robots file, parses the rules, and caches the rules object to avoid re-work in future. |
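
The fetch-once-then-cache behaviour described for FtpRobotRulesParser can be illustrated with a per-host cache. This is a rough sketch under stated assumptions, not the Nutch implementation: `SketchRules`, `SketchRobotsCache`, and the `fetchAndParse` callback are hypothetical, the cache key (scheme plus authority) is an assumption, and the rules object is reduced to a list of disallowed path prefixes.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical, simplified rules object; NOT crawlercommons.robots.BaseRobotRules.
final class SketchRules {
    private final List<String> disallowedPrefixes;

    SketchRules(List<String> disallowedPrefixes) {
        this.disallowedPrefixes = disallowedPrefixes;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    boolean isAllowed(String path) {
        return disallowedPrefixes.stream().noneMatch(path::startsWith);
    }
}

// Hypothetical cache: rules are fetched and parsed once per host, then reused.
final class SketchRobotsCache {
    private final Map<String, SketchRules> cache = new ConcurrentHashMap<>();
    private final Function<String, SketchRules> fetchAndParse;

    // fetchAndParse stands in for "request the robots file and parse it".
    SketchRobotsCache(Function<String, SketchRules> fetchAndParse) {
        this.fetchAndParse = fetchAndParse;
    }

    SketchRules getRules(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getAuthority() == null) {
            throw new IllegalArgumentException("URL needs scheme and host: " + url);
        }
        // Assumed cache key: scheme plus authority, lower-cased.
        String key = (uri.getScheme() + "://" + uri.getAuthority()).toLowerCase(Locale.ROOT);
        // Only calls fetchAndParse when the key is not cached yet.
        return cache.computeIfAbsent(key, fetchAndParse);
    }
}

class RobotsCacheDemo {
    public static void main(String[] args) {
        SketchRobotsCache cache =
            new SketchRobotsCache(host -> new SketchRules(List.of("/private/")));
        SketchRules rules = cache.getRules("ftp://example.org/pub/a.txt");
        System.out.println(rules.isAllowed("/pub/a.txt"));
        System.out.println(rules.isAllowed("/private/b.txt"));
    }
}
```

`ConcurrentHashMap.computeIfAbsent` keeps the sketch safe when several fetcher threads ask for the same host at once, since the callback runs at most once per key.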
Modifier and Type | Class and Description |
---|---|
class | Http |
Modifier and Type | Class and Description |
---|---|
class | HttpBase |
Modifier and Type | Method and Description |
---|---|
crawlercommons.robots.BaseRobotRules | HttpRobotRulesParser.getRobotRulesSet(Protocol http, URL url, List<Content> robotsTxtContent): Get the rules from robots.txt which apply to the given url. |
Modifier and Type | Class and Description |
---|---|
class | OkHttp |
Copyright © 2021 The Apache Software Foundation