Package | Description |
---|---|
org.apache.nutch.analysis.lang |
Text document language identifier.
|
org.apache.nutch.any23 |
This package uses the Apache Any23 library
for parsing and extracting structured data in RDF format from a
variety of Web documents.
|
org.apache.nutch.crawl |
Crawl control code and tools to run the crawler.
|
org.apache.nutch.microformats.reltag |
A microformats Rel-Tag
Parser/Indexer/Querier plugin.
|
org.apache.nutch.parse |
The
Parse interface and related classes. |
org.apache.nutch.parse.ext |
Parse wrapper to run external command to do the parsing.
|
org.apache.nutch.parse.feed |
Parse RSS feeds.
|
org.apache.nutch.parse.headings |
Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
|
org.apache.nutch.parse.html |
An HTML document parsing plugin.
|
org.apache.nutch.parse.js |
Parser and parse filter plugin to extract all (possible) links
from JavaScript files and embedded JavaScript code snippets.
|
org.apache.nutch.parse.metatags |
Parse filter to extract meta tags: keywords, description, etc.
|
org.apache.nutch.parse.swf |
Parse Flash SWF files.
|
org.apache.nutch.parse.tika |
Parse various document formats with help of
Apache Tika.
|
org.apache.nutch.parse.zip |
Parse ZIP files: embedded files are recursively passed to appropriate parsers.
|
org.apache.nutch.parsefilter.debug |
Adds serialized DOM to parse data, useful for debugging, to understand how
the parser implementation interprets a document (not only HTML).
|
org.apache.nutch.parsefilter.naivebayes |
HTML parse filter that classifies the outlinks from the parse result as
relevant or irrelevant based on the relevancy of the parse text (using a training
file with positive and negative example texts; see the
description of parsefilter.naivebayes.trainfile). If the text is found irrelevant,
a link gets a second chance if it contains any of the words from the
list given in parsefilter.naivebayes.wordlist.
|
org.apache.nutch.parsefilter.regex |
RegexParseFilter.
|
org.apache.nutch.protocol |
Classes related to the Protocol interface;
see also org.apache.nutch.net.protocols. |
org.apache.nutch.protocol.file |
Protocol plugin which supports retrieving local file resources.
|
org.apache.nutch.protocol.ftp |
Protocol plugin which supports retrieving documents via the ftp protocol.
|
org.apache.nutch.protocol.http.api |
Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.scoring |
The
ScoringFilter interface. |
org.apache.nutch.scoring.depth |
Scoring filter to stop crawling at a configurable depth
(number of "hops" from seed URLs).
|
org.apache.nutch.scoring.link |
Scoring filter used in conjunction with
WebGraph . |
org.apache.nutch.scoring.opic |
Scoring filter implementing a variant of the Online Page Importance Computation
(OPIC) algorithm.
|
org.apache.nutch.scoring.similarity | |
org.apache.nutch.scoring.similarity.cosine |
Implements the cosine similarity metric for scoring relevant documents.
|
org.apache.nutch.scoring.tld |
Top Level Domain Scoring plugin.
|
org.apache.nutch.scoring.urlmeta |
URL Meta Tag Scoring Plugin
|
org.apache.nutch.segment |
A segment stores all data from one generate/fetch/update cycle:
fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
|
org.apache.nutch.tools |
Miscellaneous tools.
|
org.apache.nutch.util |
Miscellaneous utility classes.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons metadata.
|
Modifier and Type | Method and Description |
---|---|
ParseResult |
HTMLLanguageParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the HTML document looking for possible indications of content language.
|
Modifier and Type | Method and Description |
---|---|
ParseResult |
Any23ParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
byte[] |
TextMD5Signature.calculate(Content content,
Parse parse) |
byte[] |
TextProfileSignature.calculate(Content content,
Parse parse) |
abstract byte[] |
Signature.calculate(Content content,
Parse parse) |
byte[] |
MD5Signature.calculate(Content content,
Parse parse) |
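
The `calculate` methods above all return a `byte[]` signature for a page. As a rough illustration of what a digest-style signature looks like, the sketch below hashes raw bytes with MD5 using only `java.security.MessageDigest`. The class `Md5SignatureSketch` and its methods are hypothetical stand-ins that take a plain `byte[]` instead of `Content` and `Parse`; this is not the code of `MD5Signature`.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in, not Nutch's MD5Signature: hashes raw bytes with MD5.
final class Md5SignatureSketch {

    /** Returns the 16-byte MD5 digest of the given raw content bytes. */
    static byte[] calculate(byte[] rawContent) {
        try {
            return MessageDigest.getInstance("MD5").digest(rawContent);
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on the Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Renders a digest as lowercase hex for logging or comparison. */
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] first = "same page body".getBytes(StandardCharsets.UTF_8);
        byte[] second = "same page body".getBytes(StandardCharsets.UTF_8);
        // Identical bytes give identical signatures, which is what makes
        // such a digest usable for exact-duplicate detection.
        System.out.println(toHex(calculate(first)).equals(toHex(calculate(second))));
    }
}
```

A digest over raw bytes only matches exact duplicates; signatures computed from parsed text, as the `Text...` class names suggest, would tolerate differences in markup.
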
Modifier and Type | Method and Description |
---|---|
ParseResult |
RelTagParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the HTML document looking for possible rel-tags
|
Modifier and Type | Method and Description |
---|---|
ParseResult |
HtmlParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse of HTML content, given the DOM
tree of a page.
|
ParseResult |
HtmlParseFilters.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Run all defined filters.
|
ParseResult |
Parser.getParse(Content c)
This method parses the given content and returns a map of <key,
parse> pairs.
|
static boolean |
ParseSegment.isTruncated(Content content)
Checks if the page's content is truncated.
|
void |
ParseSegment.ParseSegmentMapper.map(WritableComparable<?> key,
Content content,
Mapper.Context context) |
ParseResult |
ParseUtil.parse(Content content)
|
ParseResult |
ParseUtil.parseByExtensionId(String extId,
Content content)
|
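
The table above does not say how `ParseSegment.isTruncated` decides that content is truncated. One plausible approach, shown here purely as an assumption, is to compare a declared length (for example an HTTP `Content-Length` value) with the number of bytes actually stored. The class `TruncationCheckSketch` and its parameters are hypothetical and work on plain values rather than on `Content`.

```java
// Hypothetical stand-in: compares a declared length with the stored bytes.
final class TruncationCheckSketch {

    /**
     * Returns true when a declared length is present, parseable, and larger
     * than the number of bytes actually stored.
     */
    static boolean isTruncated(String declaredLength, byte[] storedBytes) {
        if (declaredLength == null || storedBytes == null) {
            return false; // nothing to compare against
        }
        long declared;
        try {
            declared = Long.parseLong(declaredLength.trim());
        } catch (NumberFormatException e) {
            return false; // unusable value; assume not truncated
        }
        return storedBytes.length < declared;
    }

    public static void main(String[] args) {
        System.out.println(isTruncated("10", new byte[4]));
        System.out.println(isTruncated("4", new byte[4]));
    }
}
```
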
Modifier and Type | Method and Description |
---|---|
ParseResult |
ExtParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
FeedParser.getParse(Content content)
Parses the given feed and extracts and parses all linked items within
the feed, using the underlying ROME feed parsing library.
|
Modifier and Type | Method and Description |
---|---|
ParseResult |
HeadingsParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
HtmlParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
JSParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Scan the JavaScript fragments of an HTML page looking for possible
Outlinks. |
ParseResult |
JSParseFilter.getParse(Content c)
Parse a JavaScript file and extract outlinks
|
Modifier and Type | Method and Description |
---|---|
ParseResult |
MetaTagsParser.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
SWFParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
TikaParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
ZipParser.getParse(Content content) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
DebugParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
NaiveBayesParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
ParseResult |
RegexParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc) |
Modifier and Type | Method and Description |
---|---|
Content |
ProtocolOutput.getContent() |
static Content |
Content.read(DataInput in) |
Modifier and Type | Method and Description |
---|---|
void |
ProtocolOutput.setContent(Content content) |
Modifier and Type | Method and Description |
---|---|
crawlercommons.robots.BaseRobotRules |
Protocol.getRobotRules(Text url,
CrawlDatum datum,
List<Content> robotsTxtContent)
Retrieve robot rules applicable for this URL.
|
crawlercommons.robots.BaseRobotRules |
RobotRulesParser.getRobotRulesSet(Protocol protocol,
Text url,
List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to
the given URL, parse it and return the set of robot rules applicable for
the configured agent name(s).
|
abstract crawlercommons.robots.BaseRobotRules |
RobotRulesParser.getRobotRulesSet(Protocol protocol,
URL url,
List<Content> robotsTxtContent)
Fetch robots.txt (or its protocol-specific equivalent) which applies to
the given URL, parse it and return the set of robot rules applicable for
the configured agent name(s).
|
Constructor and Description |
---|
ProtocolOutput(Content content) |
ProtocolOutput(Content content,
ProtocolStatus status) |
Modifier and Type | Method and Description |
---|---|
Content |
FileResponse.toContent() |
Modifier and Type | Method and Description |
---|---|
crawlercommons.robots.BaseRobotRules |
File.getRobotRules(Text url,
CrawlDatum datum,
List<Content> robotsTxtContent)
No robots parsing is done for file protocol.
|
Modifier and Type | Method and Description |
---|---|
Content |
FtpResponse.toContent() |
Modifier and Type | Method and Description |
---|---|
crawlercommons.robots.BaseRobotRules |
Ftp.getRobotRules(Text url,
CrawlDatum datum,
List<Content> robotsTxtContent)
Get the robots rules for a given url
|
crawlercommons.robots.BaseRobotRules |
FtpRobotRulesParser.getRobotRulesSet(Protocol ftp,
URL url,
List<Content> robotsTxtContent)
For hosts whose robots rules are not yet cached, sends
an FTP request to the host corresponding to the
URL passed, gets the
robots file, parses the rules, and caches the rules object to avoid repeating
the work in the future. |
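
The description of `FtpRobotRulesParser.getRobotRulesSet` amounts to a per-host cache: fetch and parse once, then reuse the rules object. The sketch below shows that pattern with a `ConcurrentHashMap`. Everything in it is hypothetical: the class `RobotRulesCacheSketch`, the type parameter `R` standing in for the rules object, and the `fetchAndParse` function supplied by the caller. It makes no claim about how Nutch builds its cache key.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical stand-in: R is whatever rules object the caller parses.
final class RobotRulesCacheSketch<R> {
    private final Map<String, R> cache = new ConcurrentHashMap<>();
    private final Function<String, R> fetchAndParse;

    RobotRulesCacheSketch(Function<String, R> fetchAndParse) {
        this.fetchAndParse = fetchAndParse;
    }

    /** Cache key: scheme and host, lower-cased (port ignored in this sketch). */
    static String cacheKey(URI url) {
        if (url.getScheme() == null || url.getHost() == null) {
            throw new IllegalArgumentException("URL needs a scheme and host: " + url);
        }
        return (url.getScheme() + "://" + url.getHost()).toLowerCase(Locale.ROOT);
    }

    /** Fetches and parses only on the first request for a given host. */
    R rulesFor(URI url) {
        return cache.computeIfAbsent(cacheKey(url), fetchAndParse);
    }

    public static void main(String[] args) {
        RobotRulesCacheSketch<String> rules =
            new RobotRulesCacheSketch<>(key -> "rules fetched for " + key);
        System.out.println(rules.rulesFor(URI.create("ftp://example.org/pub/a.txt")));
        System.out.println(rules.rulesFor(URI.create("ftp://example.org/pub/b.txt")));
    }
}
```
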
Modifier and Type | Method and Description |
---|---|
protected void |
HttpRobotRulesParser.addRobotsContent(List<Content> robotsTxtContent,
URL robotsUrl,
Response robotsResponse)
Append
Content of robots.txt to robotsTxtContent |
crawlercommons.robots.BaseRobotRules |
HttpBase.getRobotRules(Text url,
CrawlDatum datum,
List<Content> robotsTxtContent) |
crawlercommons.robots.BaseRobotRules |
HttpRobotRulesParser.getRobotRulesSet(Protocol http,
URL url,
List<Content> robotsTxtContent)
Get the rules from robots.txt which apply to the given
url. |
Modifier and Type | Method and Description |
---|---|
void |
ScoringFilter.passScoreAfterParsing(Text url,
Content content,
Parse parse)
Currently a part of score distribution is performed using only data coming
from the parsing process.
|
void |
ScoringFilters.passScoreAfterParsing(Text url,
Content content,
Parse parse) |
void |
AbstractScoringFilter.passScoreAfterParsing(Text url,
Content content,
Parse parse) |
void |
ScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
This method takes all relevant score information from the current datum
(coming from a generated fetchlist) and stores it into
Content metadata. |
void |
ScoringFilters.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content) |
void |
AbstractScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content) |
Modifier and Type | Method and Description |
---|---|
void |
DepthScoringFilter.passScoreAfterParsing(Text url,
Content content,
Parse parse) |
void |
DepthScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content) |
Modifier and Type | Method and Description |
---|---|
void |
LinkAnalysisScoringFilter.passScoreAfterParsing(Text url,
Content content,
Parse parse) |
void |
LinkAnalysisScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content) |
Modifier and Type | Method and Description |
---|---|
void |
OPICScoringFilter.passScoreAfterParsing(Text url,
Content content,
Parse parse)
Copy the value from Content metadata under Fetcher.SCORE_KEY to parseData.
|
void |
OPICScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
Store a float value of CrawlDatum.getScore() under Fetcher.SCORE_KEY.
|
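
The two `OPICScoringFilter` descriptions above outline a hand-off: the score is written into content metadata before parsing and copied into the parse data afterwards. A minimal sketch of that hand-off follows, with plain `Map`s standing in for the metadata objects. The class `ScorePassThroughSketch` and the key string `sketch.score` are assumptions for illustration; the actual value of `Fetcher.SCORE_KEY` is not given in this page.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: Maps replace Content metadata and parse data.
final class ScorePassThroughSketch {
    // Placeholder key; stands in for Fetcher.SCORE_KEY.
    static final String SCORE_KEY = "sketch.score";

    /** Before parsing: store the datum's score in the content metadata. */
    static void passScoreBeforeParsing(float datumScore, Map<String, String> contentMeta) {
        contentMeta.put(SCORE_KEY, Float.toString(datumScore));
    }

    /** After parsing: copy the stored score from content metadata to parse data. */
    static void passScoreAfterParsing(Map<String, String> contentMeta,
                                      Map<String, String> parseMeta) {
        String score = contentMeta.get(SCORE_KEY);
        if (score != null) {
            parseMeta.put(SCORE_KEY, score);
        }
    }

    public static void main(String[] args) {
        Map<String, String> contentMeta = new HashMap<>();
        Map<String, String> parseMeta = new HashMap<>();
        passScoreBeforeParsing(0.75f, contentMeta);
        passScoreAfterParsing(contentMeta, parseMeta);
        System.out.println(parseMeta);
    }
}
```
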
Modifier and Type | Method and Description |
---|---|
void |
SimilarityScoringFilter.passScoreAfterParsing(Text url,
Content content,
Parse parse) |
float |
SimilarityModel.setURLScoreAfterParsing(Text url,
Content content,
Parse parse) |
Modifier and Type | Method and Description |
---|---|
float |
CosineSimilarity.setURLScoreAfterParsing(Text url,
Content content,
Parse parse) |
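
The package summary describes `org.apache.nutch.scoring.similarity.cosine` as implementing the cosine similarity metric. For reference, the metric over two term-frequency vectors is the dot product divided by the product of the vector lengths. The sketch below computes it over plain `Map<String, Integer>` term counts; the class `CosineSimilaritySketch` is hypothetical, and how the plugin tokenizes text or builds its vectors is not covered here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: cosine similarity over term-count maps.
final class CosineSimilaritySketch {

    /** Returns dot(a, b) / (|a| * |b|), or 0.0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int x = e.getValue();
            normA += (double) x * x;
            Integer y = b.get(e.getKey());
            if (y != null) {
                dot += (double) x * y; // only shared terms contribute
            }
        }
        for (int y : b.values()) {
            normB += (double) y * y;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> page = new HashMap<>();
        page.put("crawler", 2);
        page.put("parser", 1);
        Map<String, Integer> goldStandard = new HashMap<>();
        goldStandard.put("crawler", 1);
        goldStandard.put("index", 1);
        System.out.println(cosine(page, goldStandard));
    }
}
```
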
Modifier and Type | Method and Description |
---|---|
void |
TLDScoringFilter.passScoreAfterParsing(Text url,
Content content,
Parse parse) |
void |
TLDScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content) |
Modifier and Type | Method and Description |
---|---|
void |
URLMetaScoringFilter.passScoreAfterParsing(Text url,
Content content,
Parse parse)
Takes the metadata, which was lumped inside the content, and replicates it
within your parse data.
|
void |
URLMetaScoringFilter.passScoreBeforeParsing(Text url,
CrawlDatum datum,
Content content)
Takes the metadata, specified in your "urlmeta.tags" property, from the
datum object and injects it into the content.
|
Modifier and Type | Method and Description |
---|---|
boolean |
SegmentMergeFilters.filter(Text key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
Iterates over all
SegmentMergeFilter extensions and if any of them
returns false, it will return false as well. |
boolean |
SegmentMergeFilter.filter(Text key,
CrawlDatum generateData,
CrawlDatum fetchData,
CrawlDatum sigData,
Content content,
ParseData parseData,
ParseText parseText,
Collection<CrawlDatum> linked)
The filtering method which gets all information being merged for a given
key (URL).
|
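
`SegmentMergeFilters.filter` is described above as returning false as soon as any registered `SegmentMergeFilter` returns false. That short-circuit chain can be sketched with `java.util.function.Predicate`. The class `MergeFilterChainSketch` is hypothetical, and a single generic record replaces the eight-argument signature listed in the table.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in: a chain that rejects a record if any filter rejects it.
final class MergeFilterChainSketch {

    /** Returns false as soon as one filter rejects the record; true otherwise. */
    static <T> boolean accept(List<Predicate<T>> filters, T record) {
        for (Predicate<T> filter : filters) {
            if (!filter.test(record)) {
                return false;
            }
        }
        return true; // also the result for an empty filter list
    }

    public static void main(String[] args) {
        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(url -> url.startsWith("http"));
        filters.add(url -> !url.endsWith(".tmp"));
        System.out.println(accept(filters, "http://example.org/page.html"));
        System.out.println(accept(filters, "http://example.org/scratch.tmp"));
    }
}
```
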
Modifier and Type | Field and Description |
---|---|
protected Content |
AbstractCommonCrawlFormat.content |
Modifier and Type | Method and Description |
---|---|
static CommonCrawlFormat |
CommonCrawlFormatFactory.getCommonCrawlFormat(String formatType,
String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config)
Deprecated.
|
String |
CommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata)
Returns a string representation of the JSON structure of the URL content
|
String |
AbstractCommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata) |
String |
CommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData)
Returns a string representation of the JSON structure of the URL content,
taking into account the parsed metadata about the URL.
|
String |
AbstractCommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData) |
String |
CommonCrawlFormatWARC.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData) |
Constructor and Description |
---|
AbstractCommonCrawlFormat(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config) |
CommonCrawlFormatJackson(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config) |
CommonCrawlFormatJettinson(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config) |
CommonCrawlFormatSimple(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config) |
CommonCrawlFormatWARC(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config,
ParseData parseData) |
Modifier and Type | Method and Description |
---|---|
void |
EncodingDetector.autoDetectClues(Content content,
boolean filter) |
String |
EncodingDetector.guessEncoding(Content content,
String defaultValue)
Guess the encoding with the previously specified list of clues.
|
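
`EncodingDetector.guessEncoding` is described as choosing an encoding from a previously collected list of clues, with a caller-supplied default. The sketch below shows one simple way such a choice could work: keep the clue with the highest confidence above a threshold, otherwise fall back to the default. The class `EncodingGuessSketch`, its `Clue` type, the integer confidence scale, and the threshold parameter are all assumptions; the selection rules of the real detector are not documented in this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: picks the most confident clue or a default value.
final class EncodingGuessSketch {

    /** One piece of evidence: an encoding name and how much it is trusted. */
    static final class Clue {
        final String encoding;
        final int confidence;

        Clue(String encoding, int confidence) {
            this.encoding = encoding;
            this.confidence = confidence;
        }
    }

    private final List<Clue> clues = new ArrayList<>();

    void addClue(String encoding, int confidence) {
        clues.add(new Clue(encoding, confidence));
    }

    /** Returns the best clue at or above minConfidence, else defaultValue. */
    String guessEncoding(String defaultValue, int minConfidence) {
        Clue best = null;
        for (Clue clue : clues) {
            if (clue.confidence >= minConfidence
                    && (best == null || clue.confidence > best.confidence)) {
                best = clue;
            }
        }
        return best == null ? defaultValue : best.encoding;
    }

    public static void main(String[] args) {
        EncodingGuessSketch detector = new EncodingGuessSketch();
        detector.addClue("windows-1252", 40); // e.g. a weak statistical guess
        detector.addClue("utf-8", 90);        // e.g. a declared charset
        System.out.println(detector.guessEncoding("iso-8859-1", 50));
    }
}
```
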
Modifier and Type | Method and Description |
---|---|
ParseResult |
CCParseFilter.filter(Content content,
ParseResult parseResult,
HTMLMetaTags metaTags,
DocumentFragment doc)
Adds metadata or otherwise modifies a parse of an HTML document, given the
DOM tree of a page.
|
Copyright © 2021 The Apache Software Foundation