Package | Description |
---|---|
org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.scoring | The ScoringFilter interface. |
org.apache.nutch.scoring.depth | Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs). |
org.apache.nutch.scoring.link | Scoring filter used in conjunction with WebGraph. |
org.apache.nutch.scoring.opic | Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm. |
org.apache.nutch.scoring.similarity | |
org.apache.nutch.scoring.similarity.cosine | Implements the cosine similarity metric for scoring relevant documents. |
org.apache.nutch.scoring.tld | Top Level Domain Scoring plugin. |
org.apache.nutch.scoring.urlmeta | URL Meta Tag Scoring plugin. |
org.apache.nutch.segment | A segment stores all data from one generate/fetch/update cycle: fetch list, protocol status, raw content, parsed content, and extracted outgoing links. |
org.apache.nutch.tools | Miscellaneous tools. |
Modifier and Type | Method and Description |
---|---|
void |
LinkDb.LinkDbMapper.map(Text key,
ParseData parseData,
Mapper.Context context) |
Modifier and Type | Method and Description |
---|---|
ParseData |
Parse.getData()
Other data extracted from the page.
|
ParseData |
ParseImpl.getData() |
static ParseData |
ParseData.read(DataInput in) |
Modifier and Type | Method and Description |
---|---|
void |
ParseResult.put(String key,
ParseText text,
ParseData data)
Store a result of parsing.
|
void |
ParseResult.put(Text key,
ParseText text,
ParseData data)
Store a result of parsing.
|
Constructor and Description |
---|
ParseImpl(ParseText text,
ParseData data) |
ParseImpl(ParseText text,
ParseData data,
boolean isCanonical) |
ParseImpl(String text,
ParseData data) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum |
ScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Distribute score value from the current page to all its outlinked pages.
|
CrawlDatum |
ScoringFilters.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount) |
CrawlDatum |
AbstractScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum |
DepthScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum |
LinkAnalysisScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum |
OPICScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Get a float value from Fetcher.SCORE_KEY, divide it by the number of
outlinks and apply.
|
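The OPIC description above (read a score, divide it by the number of outlinks, apply it) can be illustrated with a minimal, self-contained sketch. It does not use the Nutch API: the class `OpicSketch`, the method `distributeScore`, the plain `Map` types, and the `"score"` key are all hypothetical stand-ins for `OPICScoringFilter`, `ParseData` metadata, the `targets` collection, and `Fetcher.SCORE_KEY`. The real filter may weight or count outlinks differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OpicSketch {
    // Hypothetical metadata key; stands in for Fetcher.SCORE_KEY.
    static final String SCORE_KEY = "score";

    /**
     * Reads the page score from the parse metadata, splits it evenly
     * across all targets, and adds each share to the target's score.
     * Does nothing if the score is missing or there are no targets.
     */
    static void distributeScore(Map<String, String> parseMeta,
                                Map<String, Float> targets) {
        String raw = parseMeta.get(SCORE_KEY);
        if (raw == null || targets.isEmpty()) {
            return;
        }
        float share = Float.parseFloat(raw) / targets.size();
        for (Map.Entry<String, Float> entry : targets.entrySet()) {
            entry.setValue(entry.getValue() + share);
        }
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put(SCORE_KEY, "1.0");

        Map<String, Float> targets = new LinkedHashMap<>();
        targets.put("http://example.org/a", 0.0f);
        targets.put("http://example.org/b", 0.5f);

        distributeScore(parseMeta, targets);
        System.out.println(targets);
    }
}
```

In this sketch every outlink receives the same share; a production filter would typically also handle malformed scores and internal versus external links.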
Modifier and Type | Method and Description |
---|---|
CrawlDatum |
SimilarityScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount) |
CrawlDatum |
SimilarityModel.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum |
CosineSimilarity.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum |
TLDScoringFilter.distributeScoreToOutlink(Text fromUrl,
Text toUrl,
ParseData parseData,
CrawlDatum target,
CrawlDatum adjust,
int allCount,
int validCount) |
CrawlDatum |
TLDScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount) |
Modifier and Type | Method and Description |
---|---|
CrawlDatum |
URLMetaScoringFilter.distributeScoreToOutlinks(Text fromUrl,
ParseData parseData,
Collection<Map.Entry<Text,CrawlDatum>> targets,
CrawlDatum adjust,
int allCount)
Takes the metatags listed in your "urlmeta.tags" property and looks for
them inside the parseData object.
|
Modifier and Type | Method and Description |
---|---|
boolean |
SegmentMergeFilters.filter(Text key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
Iterates over all SegmentMergeFilter extensions; if any of them returns
false, this method returns false as well. |
boolean |
SegmentMergeFilter.filter(Text key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given
key (URL).
|
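The relationship between `SegmentMergeFilters` and the individual `SegmentMergeFilter` extensions described above amounts to a logical AND over a list of filters. The following sketch shows that pattern in isolation. It is not the Nutch implementation: `MergeFilterChain`, `acceptAll`, and the use of `Predicate<String>` over a URL key are assumptions made for illustration, and whether the real class stops at the first rejection is not stated in the source.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class MergeFilterChain {
    /**
     * Returns false as soon as any filter rejects the key;
     * returns true only if every filter accepts it.
     * An empty filter list accepts everything.
     */
    static boolean acceptAll(List<Predicate<String>> filters, String key) {
        for (Predicate<String> filter : filters) {
            if (!filter.test(key)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(url -> url.startsWith("http"));   // hypothetical rule
        filters.add(url -> !url.endsWith(".tmp"));    // hypothetical rule

        System.out.println(acceptAll(filters, "http://example.org/page"));
        System.out.println(acceptAll(filters, "http://example.org/file.tmp"));
    }
}
```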
Modifier and Type | Method and Description |
---|---|
String |
CommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData)
Returns a string representation of the JSON structure of the URL content,
taking into account the parsed metadata about the URL.
|
String |
AbstractCommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData) |
String |
CommonCrawlFormatWARC.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData) |
Constructor and Description |
---|
CommonCrawlFormatWARC(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config,
ParseData parseData) |
Copyright © 2021 The Apache Software Foundation