public class Ftp extends Object implements Protocol

Creates an FtpResponse object and gets the content of the url from it.
Configurable parameters are ftp.username, ftp.password, ftp.content.limit, ftp.timeout, ftp.server.timeout, ftp.keep.connection and ftp.follow.talk. For details see the "FTP properties" section in nutch-default.xml.

Modifier and Type | Field and Description |
---|---|
protected static org.slf4j.Logger | LOG |

Fields inherited from interface Protocol: X_POINT_ID
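
As a rough illustration of how the parameters named in the class description might be read, the sketch below uses `java.util.Properties` as a stand-in for Hadoop's `Configuration` so that it runs without Nutch on the classpath. The class name `FtpSettingsSketch`, the assumed value types for `ftp.server.timeout` and the string settings, and every fallback value are hypothetical; the authoritative defaults live in the "FTP properties" section of nutch-default.xml.

```java
import java.util.Properties;

// Hypothetical stand-in that collects the ftp.* settings named in the class description.
final class FtpSettingsSketch {
    final String username;
    final String password;
    final int contentLimit;
    final int timeout;
    final int serverTimeout;
    final boolean keepConnection;
    final boolean followTalk;

    FtpSettingsSketch(Properties props) {
        // Every fallback value here is a placeholder, not Nutch's real default.
        username = props.getProperty("ftp.username", "anonymous");
        password = props.getProperty("ftp.password", "anonymous@example.com");
        contentLimit = Integer.parseInt(props.getProperty("ftp.content.limit", "65536"));
        timeout = Integer.parseInt(props.getProperty("ftp.timeout", "60000"));
        serverTimeout = Integer.parseInt(props.getProperty("ftp.server.timeout", "100000"));
        keepConnection = Boolean.parseBoolean(props.getProperty("ftp.keep.connection", "false"));
        followTalk = Boolean.parseBoolean(props.getProperty("ftp.follow.talk", "false"));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("ftp.timeout", "30000");
        props.setProperty("ftp.follow.talk", "true");
        FtpSettingsSketch settings = new FtpSettingsSketch(props);
        // prints "timeout=30000, followTalk=true"
        System.out.println("timeout=" + settings.timeout + ", followTalk=" + settings.followTalk);
    }
}
```

In a real Nutch deployment these keys would normally be overridden in the site configuration and applied through setConf(Configuration conf) rather than parsed by hand.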
Constructor and Description |
---|
Ftp() |
Modifier and Type | Method and Description |
---|---|
protected void | finalize() |
int | getBufferSize() |
Configuration | getConf() - Get the Configuration object. |
ProtocolOutput | getProtocolOutput(Text url, CrawlDatum datum) - Creates an FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received. |
crawlercommons.robots.BaseRobotRules | getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent) - Get the robots rules for a given url. |
static void | main(String[] args) - For debugging. |
void | setConf(Configuration conf) - Set the Configuration object. |
void | setFollowTalk(boolean followTalk) - Set followTalk. |
void | setKeepConnection(boolean keepConnection) - Set keepConnection. |
void | setMaxContentLength(int length) - Set the point at which content is truncated. |
void | setTimeout(int to) - Set the timeout. |

public void setTimeout(int to)
Set the timeout.

public void setMaxContentLength(int length)
Set the point at which content is truncated.

public void setFollowTalk(boolean followTalk)
Set followTalk.

public void setKeepConnection(boolean keepConnection)
Set keepConnection.
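
To make "the point at which content is truncated" in setMaxContentLength(int length) concrete, here is a minimal sketch of length-limited content. It is not Nutch's implementation: the class name `ContentLimitSketch`, the helper `truncate`, and the convention that a negative limit means "no truncation" are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper illustrating a content-length limit.
final class ContentLimitSketch {

    // Returns at most maxLength bytes of content.
    // Assumption: a negative maxLength disables truncation.
    static byte[] truncate(byte[] content, int maxLength) {
        if (maxLength < 0 || content.length <= maxLength) {
            return content;
        }
        return Arrays.copyOf(content, maxLength);
    }

    public static void main(String[] args) {
        byte[] data = "0123456789".getBytes(StandardCharsets.US_ASCII);
        byte[] head = truncate(data, 4);
        System.out.println(new String(head, StandardCharsets.US_ASCII)); // prints "0123"
    }
}
```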

public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum)
Creates an FtpResponse object corresponding to the url and returns a ProtocolOutput object as per the content received.
Specified by: getProtocolOutput in interface Protocol
Parameters:
url - Text containing the ftp url
datum - The CrawlDatum object corresponding to the url
Returns: ProtocolOutput object for the url

public void setConf(Configuration conf)
Set the Configuration object.
Specified by: setConf in interface Configurable

public Configuration getConf()
Get the Configuration object.
Specified by: getConf in interface Configurable

public crawlercommons.robots.BaseRobotRules getRobotRules(Text url, CrawlDatum datum, List<Content> robotsTxtContent)
Get the robots rules for a given url.
Specified by: getRobotRules in interface Protocol
Parameters:
url - URL to check
datum - page datum
robotsTxtContent - container to store responses when fetching the robots.txt file for debugging or archival purposes. Instead of a robots.txt file, it may include redirects or an error page (404, etc.). Response Content is appended to the passed list. If null is passed, nothing is stored.

public int getBufferSize()
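
The contract described for the robotsTxtContent parameter of getRobotRules (append each response to the caller's list, store nothing when null is passed) can be sketched as follows. This is an illustration only: `RobotsCaptureSketch` and `record` are hypothetical names, and `String` stands in for Nutch's `Content` type so that the example needs no Nutch classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of an optional "capture list" parameter.
final class RobotsCaptureSketch {

    // Appends a response to the caller's list.
    // A null list means the caller does not want responses kept.
    static void record(List<String> robotsTxtContent, String response) {
        if (robotsTxtContent != null) {
            robotsTxtContent.add(response);
        }
    }

    public static void main(String[] args) {
        List<String> captured = new ArrayList<>();
        record(captured, "redirect response");
        record(captured, "robots.txt body");
        record(null, "error page"); // ignored: nothing is stored
        System.out.println(captured.size()); // prints 2
    }
}
```

Passing a list is useful when the fetched robots.txt responses should be archived or inspected; passing null avoids holding them in memory.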
Copyright © 2021 The Apache Software Foundation