Package | Description |
---|---|
org.apache.nutch.analysis.lang |
Text document language identifier.
|
org.apache.nutch.any23 |
This package uses the Apache Any23 library
for parsing and extracting structured data in RDF format from a
variety of Web documents.
|
org.apache.nutch.crawl |
Crawl control code and tools to run the crawler.
|
org.apache.nutch.microformats.reltag |
A microformats Rel-Tag
Parser/Indexer/Querier plugin.
|
org.apache.nutch.parse |
The Parse interface and related classes. |
org.apache.nutch.parse.ext |
Parse wrapper to run external command to do the parsing.
|
org.apache.nutch.parse.feed |
Parse RSS feeds.
|
org.apache.nutch.parse.headings |
Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
|
org.apache.nutch.parse.html |
An HTML document parsing plugin.
|
org.apache.nutch.parse.js |
Parser and parse filter plugin to extract all (possible) links
from JavaScript files and embedded JavaScript code snippets.
|
org.apache.nutch.parse.metatags |
Parse filter to extract meta tags: keywords, description, etc.
|
org.apache.nutch.parse.swf |
Parse Flash SWF files.
|
org.apache.nutch.parse.tika |
Parse various document formats with the help of Apache Tika.
|
org.apache.nutch.parse.zip |
Parse ZIP files: embedded files are recursively passed to appropriate parsers.
|
org.apache.nutch.parsefilter.debug |
Adds serialized DOM to parse data, useful for debugging, to understand how
the parser implementation interprets a document (not only HTML).
|
org.apache.nutch.parsefilter.naivebayes |
HTML parse filter that classifies the outlinks from the parse result as
relevant or irrelevant based on the relevancy of the parse text. It uses a
training file with positive and negative example texts (see the description
of parsefilter.naivebayes.trainfile). If the text is found irrelevant, a link
still gets a second chance if it contains any of the words from the list
given in parsefilter.naivebayes.wordlist.
|
org.apache.nutch.parsefilter.regex |
RegexParseFilter.
|
org.apache.nutch.protocol |
Classes related to the Protocol interface; see also org.apache.nutch.net.protocols. |
org.apache.nutch.protocol.file |
Protocol plugin which supports retrieving local file resources.
|
org.apache.nutch.protocol.ftp |
Protocol plugin which supports retrieving documents via the ftp protocol.
|
org.apache.nutch.protocol.htmlunit |
Protocol plugin which supports retrieving documents via the http protocol.
|
org.apache.nutch.protocol.http.api |
Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.protocol.okhttp |
Protocol plugin based on OkHttp; supports HTTP, HTTPS, and HTTP/2.
|
org.apache.nutch.scoring |
The ScoringFilter interface. |
org.apache.nutch.scoring.depth |
Scoring filter to stop crawling at a configurable depth
(number of "hops" from seed URLs).
|
org.apache.nutch.scoring.link |
Scoring filter used in conjunction with WebGraph. |
org.apache.nutch.scoring.opic |
Scoring filter implementing a variant of the Online Page Importance Computation
(OPIC) algorithm.
|
org.apache.nutch.scoring.similarity | |
org.apache.nutch.scoring.similarity.cosine |
Implements the cosine similarity metric for scoring relevant documents.
|
org.apache.nutch.scoring.tld |
Top Level Domain Scoring plugin.
|
org.apache.nutch.scoring.urlmeta |
URL Meta Tag Scoring Plugin
|
org.apache.nutch.segment |
A segment stores all data from one generate/fetch/update cycle:
fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
|
org.apache.nutch.tools |
Miscellaneous tools.
|
org.apache.nutch.util |
Miscellaneous utility classes.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons metadata.
|
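
The org.apache.nutch.scoring.similarity.cosine row above mentions the cosine similarity metric. As a rough illustration of the metric itself (not of the plugin's implementation), the sketch below compares two term-frequency vectors. The class name `CosineSketch`, its methods, and the whitespace tokenization are assumptions made for this example and do not come from Nutch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: cosine similarity of two term-frequency vectors. */
final class CosineSketch {

    /** Counts lower-cased, whitespace-separated terms in the given text. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns dot(a, b) / (|a| * |b|), or 0.0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (normA * normB);
    }

    /** Euclidean length of a term-frequency vector. */
    private static double norm(Map<String, Integer> v) {
        double sumSquares = 0.0;
        for (int count : v.values()) {
            sumSquares += (double) count * count;
        }
        return Math.sqrt(sumSquares);
    }

    public static void main(String[] args) {
        // Assumed inputs: a "goal" text and a page text, both invented for this sketch.
        Map<String, Integer> goal = termFrequencies("apache nutch web crawler");
        Map<String, Integer> page = termFrequencies("nutch is a web crawler");
        System.out.println(cosine(goal, page));
    }
}
```

A score near 1 means the two texts use terms in similar proportions, and a score of 0 means they share no terms. A scoring filter could use such a value to favor pages close to a reference text.
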

Package | Classes in org.apache.nutch.protocol used |
---|---|
org.apache.nutch.analysis.lang | Content |
org.apache.nutch.any23 | Content |
org.apache.nutch.crawl | Content |
org.apache.nutch.microformats.reltag | Content |
org.apache.nutch.parse | Content |
org.apache.nutch.parse.ext | Content |
org.apache.nutch.parse.feed | Content |
org.apache.nutch.parse.headings | Content |
org.apache.nutch.parse.html | Content |
org.apache.nutch.parse.js | Content |
org.apache.nutch.parse.metatags | Content |
org.apache.nutch.parse.swf | Content |
org.apache.nutch.parse.tika | Content |
org.apache.nutch.parse.zip | Content |
org.apache.nutch.parsefilter.debug | Content |
org.apache.nutch.parsefilter.naivebayes | Content |
org.apache.nutch.parsefilter.regex | Content |

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol

Class and Description |
---|
Content |
Protocol
A retriever of url content.
|
ProtocolException |
ProtocolNotFound |
ProtocolOutput
Simple aggregate to pass from protocol plugins both content and protocol
status.
|
ProtocolStatus |
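
ProtocolOutput is described above as a simple aggregate that lets a protocol plugin hand back both content and protocol status. The pattern can be sketched without Nutch on the classpath. In the example below, `FetchOutput`, its `fetch` method, the status constants, and the sample URL are all hypothetical stand-ins chosen for illustration. They are not Nutch's classes, signatures, or status codes.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a "content + status" aggregate; not Nutch's ProtocolOutput. */
final class FetchOutput {
    /** Illustrative status codes, invented for this sketch. */
    static final int SUCCESS = 1;
    static final int NOT_FOUND = 2;

    private final byte[] content; // fetched bytes; empty when nothing was retrieved
    private final int status;     // outcome of the fetch attempt

    FetchOutput(byte[] content, int status) {
        // Defensive copy so callers cannot mutate the stored bytes.
        this.content = content == null ? new byte[0] : content.clone();
        this.status = status;
    }

    byte[] getContent() {
        return content.clone();
    }

    int getStatus() {
        return status;
    }

    boolean isSuccess() {
        return status == SUCCESS;
    }

    /** Toy "protocol": serves one hard-coded URL so the sketch runs offline. */
    static FetchOutput fetch(String url) {
        if ("file:/example.txt".equals(url)) {
            return new FetchOutput("hello".getBytes(StandardCharsets.UTF_8), SUCCESS);
        }
        return new FetchOutput(null, NOT_FOUND);
    }

    public static void main(String[] args) {
        FetchOutput out = fetch("file:/example.txt");
        System.out.println(out.isSuccess() + " " + out.getContent().length);
    }
}
```

Returning one object keeps the status next to the bytes it describes, so a caller can check the outcome before it touches the content.
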

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.file

Class and Description |
---|
Content |
Protocol
A retriever of url content.
|
ProtocolException |
ProtocolOutput
Simple aggregate to pass from protocol plugins both content and protocol
status.
|

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.ftp

Class and Description |
---|
Content |
Protocol
A retriever of url content.
|
ProtocolException |
ProtocolOutput
Simple aggregate to pass from protocol plugins both content and protocol
status.
|
RobotRulesParser
This class uses crawler-commons for handling the parsing of
robots.txt files. |

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.htmlunit

Class and Description |
---|
Protocol
A retriever of url content.
|
ProtocolException |

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.http.api

Class and Description |
---|
Content |
Protocol
A retriever of url content.
|
ProtocolException |
ProtocolOutput
Simple aggregate to pass from protocol plugins both content and protocol
status.
|
RobotRulesParser
This class uses crawler-commons for handling the parsing of
robots.txt files. |

Classes in org.apache.nutch.protocol used by org.apache.nutch.protocol.okhttp

Class and Description |
---|
Protocol
A retriever of url content.
|
ProtocolException |

Package | Classes in org.apache.nutch.protocol used |
---|---|
org.apache.nutch.scoring | Content |
org.apache.nutch.scoring.depth | Content |
org.apache.nutch.scoring.link | Content |
org.apache.nutch.scoring.opic | Content |
org.apache.nutch.scoring.similarity | Content |
org.apache.nutch.scoring.similarity.cosine | Content |
org.apache.nutch.scoring.tld | Content |
org.apache.nutch.scoring.urlmeta | Content |
org.apache.nutch.segment | Content |
org.apache.nutch.tools | Content |

Classes in org.apache.nutch.protocol used by org.apache.nutch.util

Class and Description |
---|
Content |
ProtocolOutput
Simple aggregate to pass from protocol plugins both content and protocol
status.
|

Classes in org.apache.nutch.protocol used by org.creativecommons.nutch

Class and Description |
---|
Content |
Copyright © 2021 The Apache Software Foundation