public interface FetchSchedule extends Configurable
Modifier and Type | Field and Description |
---|---|
static int | SECONDS_PER_DAY |
static int | STATUS_MODIFIED: Page is known to have been modified since our last visit. |
static int | STATUS_NOTMODIFIED: Page is known to remain unmodified since our last visit. |
static int | STATUS_UNKNOWN: It is unknown whether page was changed since our last visit. |
Modifier and Type | Method and Description |
---|---|
long | calculateLastFetchTime(CrawlDatum datum): Calculates last fetch time of the given CrawlDatum. |
CrawlDatum | forceRefetch(Text url, CrawlDatum datum, boolean asap): This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching. |
CrawlDatum | initializeSchedule(Text url, CrawlDatum datum): Initialize fetch schedule related data. |
CrawlDatum | setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state): Sets the fetchInterval and fetchTime on a successfully fetched page. |
CrawlDatum | setPageGoneSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime): This method specifies how to schedule refetching of pages marked as GONE. |
CrawlDatum | setPageRetrySchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime): This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors. |
boolean | shouldFetch(Text url, CrawlDatum datum, long curTime): This method provides information whether the page is suitable for selection in the current fetchlist. |
Methods inherited from interface Configurable: getConf, setConf
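The selection check that shouldFetch is documented to perform can be sketched without any Nutch dependency. This is a minimal illustration, not the library's implementation: `DatumSketch` is a hypothetical stand-in for CrawlDatum, the millisecond and second units are assumptions, and capping the interval at `maxIntervalSec` is one assumed reading of "lowers the interval".

```java
// Hypothetical stand-in for CrawlDatum: only the two fields this sketch needs.
final class DatumSketch {
    long fetchTimeMs;      // next scheduled fetch time (assumed epoch millis)
    int fetchIntervalSec;  // re-fetch interval (assumed seconds)

    DatumSketch(long fetchTimeMs, int fetchIntervalSec) {
        this.fetchTimeMs = fetchTimeMs;
        this.fetchIntervalSec = fetchIntervalSec;
    }
}

final class ShouldFetchSketch {
    /**
     * Returns true when the datum is due at curTimeMs. A fetchTime that lies more
     * than maxIntervalSec in the future is treated as too remote: the interval is
     * capped and the page is made due immediately.
     */
    static boolean shouldFetch(DatumSketch datum, long curTimeMs, int maxIntervalSec) {
        long maxIntervalMs = maxIntervalSec * 1000L;
        if (datum.fetchTimeMs - curTimeMs > maxIntervalMs) {
            if (datum.fetchIntervalSec > maxIntervalSec) {
                // Assumption: "lowering" the interval means capping it at the maximum.
                datum.fetchIntervalSec = maxIntervalSec;
            }
            datum.fetchTimeMs = curTimeMs;
            return true;
        }
        // fetchTime later than curTime: not due yet; otherwise due.
        return datum.fetchTimeMs <= curTimeMs;
    }

    public static void main(String[] args) {
        int maxIntervalSec = 90 * 24 * 60 * 60; // hypothetical 90-day cap
        long now = 1_000_000L;
        DatumSketch due = new DatumSketch(now - 1, 3600);
        DatumSketch notYet = new DatumSketch(now + 60_000L, 3600);
        System.out.println(shouldFetch(due, now, maxIntervalSec));    // true
        System.out.println(shouldFetch(notYet, now, maxIntervalSec)); // false
    }
}
```

Passing `curTimeMs` in explicitly, rather than reading the clock inside the method, mirrors the `curTime` parameter of the interface and keeps the check deterministic for a whole fetchlist generation run.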
static final int STATUS_UNKNOWN
static final int STATUS_MODIFIED
static final int STATUS_NOTMODIFIED
static final int SECONDS_PER_DAY
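As a rough illustration of how an implementation might branch on the status constants listed above, the sketch below shortens the interval for modified pages and lengthens it for unmodified ones. Everything here is hypothetical: the class `StateSketch`, the numeric constant values, the halve/grow-by-half policy, and the min/max bounds are assumptions chosen for the example, not values or behavior taken from the library.

```java
final class StateSketch {
    // Local stand-ins for the interface constants; the numeric values are assumptions.
    static final int STATUS_UNKNOWN = 0;
    static final int STATUS_MODIFIED = 1;
    static final int STATUS_NOTMODIFIED = 2;
    static final int SECONDS_PER_DAY = 24 * 60 * 60; // 86400

    /** Picks the next fetch interval (seconds) from the observed page state. */
    static int nextIntervalSec(int currentSec, int state, int minSec, int maxSec) {
        long next;
        switch (state) {
            case STATUS_MODIFIED:
                next = currentSec / 2L;              // page changed: revisit sooner
                break;
            case STATUS_NOTMODIFIED:
                next = currentSec + currentSec / 2L; // page unchanged: back off
                break;
            default:
                next = currentSec;                   // unknown: keep the interval
        }
        // Clamp to the allowed range before narrowing back to int.
        return (int) Math.max(minSec, Math.min(maxSec, next));
    }

    /** Next fetch time is strictly later than the time of the fetch just completed. */
    static long nextFetchTimeMs(long fetchTimeMs, int intervalSec) {
        return fetchTimeMs + intervalSec * 1000L;
    }

    public static void main(String[] args) {
        int min = 3600;
        int max = 30 * SECONDS_PER_DAY;
        System.out.println(nextIntervalSec(SECONDS_PER_DAY, STATUS_MODIFIED, min, max));    // 43200
        System.out.println(nextIntervalSec(SECONDS_PER_DAY, STATUS_NOTMODIFIED, min, max)); // 129600
        System.out.println(nextIntervalSec(SECONDS_PER_DAY, STATUS_UNKNOWN, min, max));     // 86400
    }
}
```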
CrawlDatum initializeSchedule(Text url, CrawlDatum datum)
Initialize fetch schedule related data. Implementations should at least set the fetchTime and fetchInterval. The default implementation sets the fetchTime to now, using the default fetchInterval.
url - URL of the page.
datum - datum instance to be initialized.

CrawlDatum setFetchSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime, long modifiedTime, int state)
Sets the fetchInterval and fetchTime on a successfully fetched page. Implementations may use the supplied arguments to support different re-fetching schedules.
url - URL of the page.
datum - page description to be adjusted. NOTE: this instance, passed by reference, may be modified inside the method.
prevFetchTime - previous value of fetch time, or 0 if not available.
prevModifiedTime - previous value of modifiedTime, or 0 if not available.
fetchTime - the time at which the page was most recently re-fetched. Most FetchSchedule implementations should update the value in CrawlDatum to something greater than this value.
modifiedTime - last time the content was modified. This information comes from the protocol implementations, or is set to < 0 if not available. Most FetchSchedule implementations should update the value in CrawlDatum to this value.
state - if STATUS_MODIFIED, the content is considered to have changed before the fetchTime; if STATUS_NOTMODIFIED, the content is known to be unchanged. This information may be obtained by comparing page signatures before and after fetching. If this is set to STATUS_UNKNOWN, it is unknown whether the page was changed; implementations are free to follow a sensible default behavior.

CrawlDatum setPageGoneSchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
This method specifies how to schedule refetching of pages marked as GONE. If the fetch interval exceeds maxInterval, it calls forceRefetch(Text, CrawlDatum, boolean).
url - URL of the page.
datum - datum instance to be adjusted.

CrawlDatum setPageRetrySchedule(Text url, CrawlDatum datum, long prevFetchTime, long prevModifiedTime, long fetchTime)
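This method adjusts the fetch schedule if fetching needs to be re-tried due to transient errors.
url - URL of the page.
datum - page information.
prevFetchTime - previous fetch time.
prevModifiedTime - previous modified time.
fetchTime - current fetch time.

long calculateLastFetchTime(CrawlDatum datum)
Calculates last fetch time of the given CrawlDatum.
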
boolean shouldFetch(Text url, CrawlDatum datum, long curTime)
This method provides information whether the page is suitable for selection in the current fetchlist. It compares fetchTime with curTime: if fetchTime is later than curTime it returns false, and true otherwise. It will also check that fetchTime is not too remote (more than maxInterval), in which case it lowers the interval and returns true.
url - URL of the page.
datum - datum instance.
curTime - reference time (usually set to the time when the fetchlist generation process was started).

CrawlDatum forceRefetch(Text url, CrawlDatum datum, boolean asap)
This method resets fetchTime, fetchInterval, modifiedTime and page signature, so that it forces refetching.
url - URL of the page.
datum - datum instance.
asap - if true, force refetch as soon as possible; this sets the fetchTime to now. If false, force refetch whenever the next fetch time is set.

Copyright © 2021 The Apache Software Foundation