Package | Description |
---|---|
org.apache.nutch.analysis.lang |
Text document language identifier.
|
org.apache.nutch.any23 |
This package uses the Apache Any23 library
for parsing and extracting structured data in RDF format from a
variety of Web documents.
|
org.apache.nutch.collection |
Subcollection is a subset of an index.
|
org.apache.nutch.exchange |
Control code for the exchange component, which acts in the indexing job and decides
to which index writer a document should be routed, based on plugin behavior.
|
org.apache.nutch.exchange.jexl |
Plugin of Exchange component based on JEXL expressions.
|
org.apache.nutch.indexer |
Index content, configure and run indexing and cleaning jobs to
add, update, and delete documents from an index.
|
org.apache.nutch.indexer.anchor |
An indexing plugin for inbound anchor text.
|
org.apache.nutch.indexer.basic |
A basic indexing plugin, adds basic fields: url, host, title, content, etc.
|
org.apache.nutch.indexer.feed |
Indexing filter to index meta data from RSS feeds.
|
org.apache.nutch.indexer.filter | |
org.apache.nutch.indexer.geoip |
This plugin implements an indexing filter which takes
advantage of the
GeoIP2-java API.
|
org.apache.nutch.indexer.jexl |
This plugin implements a dynamic indexing filter which uses JEXL
expressions to allow filtering based on the page's metadata.
|
org.apache.nutch.indexer.links | |
org.apache.nutch.indexer.metadata |
Indexing filter to add document metadata to the index.
|
org.apache.nutch.indexer.more |
The "more" indexing plugin, which adds additional index fields:
last modified date, MIME type, content length.
|
org.apache.nutch.indexer.replace |
Indexing filter to allow pattern replacements on metadata.
|
org.apache.nutch.indexer.staticfield |
A simple plugin called at indexing that adds fields with static data.
|
org.apache.nutch.indexer.subcollection |
Indexing filter to assign documents to subcollections.
|
org.apache.nutch.indexer.tld |
Top Level Domain Indexing plugin.
|
org.apache.nutch.indexer.urlmeta |
URL Meta Tag Indexing Plugin
|
org.apache.nutch.indexwriter.cloudsearch | |
org.apache.nutch.indexwriter.csv |
Index writer plugin to write a plain CSV file.
|
org.apache.nutch.indexwriter.dummy |
Index writer plugin for debugging, writes pairs of <action, url> to a
text file, action is one of "add", "update", or "delete".
|
org.apache.nutch.indexwriter.elastic |
Index writer plugin for Elasticsearch.
|
org.apache.nutch.indexwriter.kafka |
Index writer plugin to produce JSON messages to Kafka.
|
org.apache.nutch.indexwriter.rabbit | |
org.apache.nutch.indexwriter.solr |
Index writer plugin for Apache Solr.
|
org.apache.nutch.microformats.reltag |
A microformats Rel-Tag
Parser/Indexer/Querier plugin.
|
org.apache.nutch.net |
Web-related interfaces: URL filters and normalizers. |
org.apache.nutch.parse |
The Parse interface and related classes. |
org.apache.nutch.parse.ext |
Parse wrapper to run an external command to do the parsing.
|
org.apache.nutch.parse.feed |
Parse RSS feeds.
|
org.apache.nutch.parse.headings |
Parse filter to extract headings (h1, h2, etc.) from DOM parse tree.
|
org.apache.nutch.parse.html |
An HTML document parsing plugin.
|
org.apache.nutch.parse.js |
Parser and parse filter plugin to extract all (possible) links
from JavaScript files and embedded JavaScript code snippets.
|
org.apache.nutch.parse.metatags |
Parse filter to extract meta tags: keywords, description, etc.
|
org.apache.nutch.parse.swf |
Parse Flash SWF files.
|
org.apache.nutch.parse.tika |
Parse various document formats with help of
Apache Tika.
|
org.apache.nutch.parse.zip |
Parse ZIP files: embedded files are recursively passed to appropriate parsers.
|
org.apache.nutch.parsefilter.debug |
Adds serialized DOM to parse data, useful for debugging, to understand how
the parser implementation interprets a document (not only HTML).
|
org.apache.nutch.parsefilter.naivebayes |
HTML parse filter that classifies the outlinks from the parse result as
relevant or irrelevant based on the relevancy of the parse text (using a
training file in which you can give positive and negative example texts; see
the description of parsefilter.naivebayes.trainfile). If the text is found
irrelevant, the filter gives a link a second chance if the link contains any
of the words from the list given in parsefilter.naivebayes.wordlist.
|
org.apache.nutch.parsefilter.regex |
RegexParseFilter.
|
org.apache.nutch.protocol |
Classes related to the Protocol interface,
see also org.apache.nutch.net.protocols. |
org.apache.nutch.protocol.file |
Protocol plugin which supports retrieving local file resources.
|
org.apache.nutch.protocol.ftp |
Protocol plugin which supports retrieving documents via the ftp protocol.
|
org.apache.nutch.protocol.htmlunit |
Protocol plugin which supports retrieving documents via the http protocol.
|
org.apache.nutch.protocol.http.api |
Common API used by HTTP plugins (http, httpclient). |
org.apache.nutch.protocol.okhttp |
Protocol plugin based on okhttp, supports http, https, http/2.
|
org.apache.nutch.publisher | |
org.apache.nutch.publisher.rabbitmq |
Publisher package to implement queues.
|
org.apache.nutch.scoring |
The ScoringFilter interface. |
org.apache.nutch.scoring.depth |
Scoring filter to stop crawling at a configurable depth
(number of "hops" from seed URLs).
|
org.apache.nutch.scoring.link |
Scoring filter used in conjunction with WebGraph. |
org.apache.nutch.scoring.opic |
Scoring filter implementing a variant of the Online Page Importance Computation
(OPIC) algorithm.
|
org.apache.nutch.scoring.orphan |
Scoring filter to modify score or status of orphaned pages (no inlinks found
for a configurable amount of time).
|
org.apache.nutch.scoring.similarity | |
org.apache.nutch.scoring.tld |
Top Level Domain Scoring plugin.
|
org.apache.nutch.scoring.urlmeta |
URL Meta Tag Scoring Plugin
|
org.apache.nutch.urlfilter.api |
Generic URL filter library,
abstracting away from regular expression implementations. |
org.apache.nutch.urlfilter.automaton |
URL filter plugin based on
dk.brics.automaton Finite-State
Automata for Java™.
|
org.apache.nutch.urlfilter.domain |
URL filter plugin to include only URLs which match an element in a given list of
domain suffixes, domain names, and/or host names.
|
org.apache.nutch.urlfilter.domaindenylist |
URL filter plugin to exclude URLs by domain suffixes, domain names, and/or host names.
|
org.apache.nutch.urlfilter.fast |
URL filter plugin that first does fast exact suffix matches on host/domain
names before applying regular expressions to the path component of a URL.
|
org.apache.nutch.urlfilter.ignoreexempt |
URL filter plugin which identifies exemptions to external URLs
when external URLs are set to be ignored.
|
org.apache.nutch.urlfilter.prefix |
URL filter plugin to include only URLs which match one of a given list of URL prefixes.
|
org.apache.nutch.urlfilter.regex |
URL filter plugin to include and/or exclude URLs matching Java regular expressions.
|
org.apache.nutch.urlfilter.suffix |
URL filter plugin to either exclude or include only URLs which match
one of the given (path) suffixes.
|
org.apache.nutch.urlfilter.validator |
URL filter plugin that validates given URLs.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons metadata.
|
Modifier and Type | Class and Description |
---|---|
class |
HTMLLanguageParser |
class |
LanguageIndexingFilter
An IndexingFilter that adds a
lang (language) field to the document. |
Modifier and Type | Class and Description |
---|---|
class |
Any23IndexingFilter
This implementation of IndexingFilter
adds a triple(s) field to the NutchDocument. |
class |
Any23ParseFilter
This implementation of HtmlParseFilter
uses the Apache Any23 library
for parsing and extracting structured data in RDF format from a
variety of Web documents. |
Modifier and Type | Class and Description |
---|---|
class |
Subcollection
SubCollection represents a subset of an index; you can define URL patterns that
indicate that a particular page (URL) is part of the SubCollection.
|
Modifier and Type | Interface and Description |
---|---|
interface |
Exchange |
Modifier and Type | Class and Description |
---|---|
class |
JexlExchange |
Modifier and Type | Interface and Description |
---|---|
interface |
IndexingFilter
Extension point for indexing.
|
interface |
IndexWriter |
Modifier and Type | Class and Description |
---|---|
class |
AnchorIndexingFilter
Indexing filter that offers an option to either index all inbound anchor text
for a document or deduplicate anchors.
|
Modifier and Type | Class and Description |
---|---|
class |
BasicIndexingFilter
Adds basic searchable fields to a document.
|
Modifier and Type | Class and Description |
---|---|
class |
FeedIndexingFilter |
Modifier and Type | Class and Description |
---|---|
class |
MimeTypeIndexingFilter
An IndexingFilter that allows filtering
of documents based on the MIME type detected by Tika. |
Modifier and Type | Class and Description |
---|---|
class |
GeoIPIndexingFilter
This plugin implements an indexing filter which takes advantage of the GeoIP2-java API.
|
Modifier and Type | Class and Description |
---|---|
class |
JexlIndexingFilter
An IndexingFilter that allows filtering of
documents based on a JEXL expression. |
Modifier and Type | Class and Description |
---|---|
class |
LinksIndexingFilter
|
Modifier and Type | Class and Description |
---|---|
class |
MetadataIndexer
Indexer which can be configured to extract metadata from the crawldb, parse
metadata or content metadata.
|
Modifier and Type | Class and Description |
---|---|
class |
MoreIndexingFilter
Add (or reset) a few metaData properties as respective fields (if they are
available), so that they can be accurately used within the search index.
|
Modifier and Type | Class and Description |
---|---|
class |
ReplaceIndexer
Do pattern replacements on selected field contents prior to indexing.
|
Modifier and Type | Class and Description |
---|---|
class |
StaticFieldIndexer
A simple plugin called at indexing that adds fields with static data.
|
Modifier and Type | Class and Description |
---|---|
class |
SubcollectionIndexingFilter |
Modifier and Type | Class and Description |
---|---|
class |
TLDIndexingFilter
Adds the top-level domain extensions to the index.
|
Modifier and Type | Class and Description |
---|---|
class |
URLMetaIndexingFilter
This is part of the URL Meta plugin.
|
Modifier and Type | Class and Description |
---|---|
class |
CloudSearchIndexWriter
Writes documents to CloudSearch.
|
Modifier and Type | Class and Description |
---|---|
class |
CSVIndexWriter
Write Nutch documents to a CSV file (comma separated values), i.e., dump
index as CSV or tab-separated plain text table.
|
Modifier and Type | Class and Description |
---|---|
class |
DummyIndexWriter
DummyIndexWriter.
|
Modifier and Type | Class and Description |
---|---|
class |
ElasticIndexWriter
Sends NutchDocuments to a configured Elasticsearch index.
|
Modifier and Type | Class and Description |
---|---|
class |
KafkaIndexWriter
Sends Nutch documents to a configured Kafka cluster.
|
Modifier and Type | Class and Description |
---|---|
class |
RabbitIndexWriter |
Modifier and Type | Class and Description |
---|---|
class |
SolrIndexWriter |
Modifier and Type | Class and Description |
---|---|
class |
RelTagIndexingFilter
An IndexingFilter that adds tag
field(s) to the document. |
class |
RelTagParser
Adds microformat rel-tags of document if found.
|
Modifier and Type | Interface and Description |
---|---|
interface |
URLExemptionFilter
Interface used to allow exemptions to external domain resources by overriding
db.ignore.external.links. |
interface |
URLFilter
Interface used to limit which URLs enter Nutch.
|
Modifier and Type | Interface and Description |
---|---|
interface |
HtmlParseFilter
Extension point for DOM-based HTML parsers.
|
interface |
Parser
A parser for content generated by a
Protocol implementation. |
Modifier and Type | Class and Description |
---|---|
class |
ExtParser
A wrapper that invokes an external command to do the real parsing job.
|
Modifier and Type | Class and Description |
---|---|
class |
FeedParser |
Modifier and Type | Class and Description |
---|---|
class |
HeadingsParseFilter
HtmlParseFilter to retrieve h1 and h2 values from the DOM.
|
Modifier and Type | Class and Description |
---|---|
class |
HtmlParser |
Modifier and Type | Class and Description |
---|---|
class |
JSParseFilter
This class is a heuristic link extractor for JavaScript files and code
snippets.
|
Modifier and Type | Class and Description |
---|---|
class |
MetaTagsParser
Parse HTML meta tags (keywords, description) and store them in the parse
metadata so that they can be indexed with the index-metadata plugin with the
prefix 'metatag.'.
|
Modifier and Type | Class and Description |
---|---|
class |
SWFParser
Parser for Flash SWF files.
|
Modifier and Type | Class and Description |
---|---|
class |
TikaParser
Wrapper for Tika parsers.
|
Modifier and Type | Class and Description |
---|---|
class |
ZipParser
ZipParser class based on MSPowerPointParser class by Stephan Strittmatter.
|
Modifier and Type | Class and Description |
---|---|
class |
DebugParseFilter
Adds serialized DOM to parse data, useful for debugging, to understand how
the parser implementation interprets a document (not only HTML).
|
Modifier and Type | Class and Description |
---|---|
class |
NaiveBayesParseFilter
HTML parse filter that classifies the outlinks from the parse result as
relevant or irrelevant based on the relevancy of the parse text (using a
training file in which you can give positive and negative example texts; see
the description of parsefilter.naivebayes.trainfile). If the text is found
irrelevant, the filter gives a link a second chance if the link contains any
of the words from the list given in parsefilter.naivebayes.wordlist.
|
Modifier and Type | Class and Description |
---|---|
class |
RegexParseFilter
RegexParseFilter.
|
Modifier and Type | Interface and Description |
---|---|
interface |
Protocol
A retriever of URL content.
|
Modifier and Type | Class and Description |
---|---|
class |
File
This class is a protocol plugin used for file: scheme.
|
Modifier and Type | Class and Description |
---|---|
class |
Ftp
This class is a protocol plugin used for ftp: scheme.
|
Modifier and Type | Class and Description |
---|---|
class |
Http |
Modifier and Type | Class and Description |
---|---|
class |
HttpBase |
Modifier and Type | Class and Description |
---|---|
class |
OkHttp |
Modifier and Type | Interface and Description |
---|---|
interface |
NutchPublisher
All publisher subscriber model implementations should implement this interface.
|
Modifier and Type | Class and Description |
---|---|
class |
NutchPublishers |
Modifier and Type | Class and Description |
---|---|
class |
RabbitMQPublisherImpl |
Modifier and Type | Interface and Description |
---|---|
interface |
ScoringFilter
A contract defining behavior of scoring plugins.
|
Modifier and Type | Class and Description |
---|---|
class |
AbstractScoringFilter |
class |
ScoringFilters
Creates and caches
ScoringFilter implementing plugins. |
Modifier and Type | Class and Description |
---|---|
class |
DepthScoringFilter
This scoring filter limits the number of hops from the initial seed URLs.
|
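
The hop limit described for DepthScoringFilter can be pictured with a small stand-alone sketch. It does not use any Nutch classes: the class name `DepthLimitSketch`, the in-memory link map, and the convention that seeds sit at depth 0 are all assumptions made for illustration, and the real plugin works on crawl data rather than on a map.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a hop limit; not the Nutch implementation. */
class DepthLimitSketch {

    /**
     * Breadth-first walk over a link graph that stops following outlinks
     * once a page is maxDepth hops away from a seed (seeds are depth 0 here).
     * Returns each discovered URL with the number of hops needed to reach it.
     */
    static Map<String, Integer> crawl(Map<String, List<String>> links,
                                      List<String> seeds, int maxDepth) {
        Map<String, Integer> depth = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String seed : seeds) {
            if (depth.putIfAbsent(seed, 0) == null) {
                queue.add(seed);
            }
        }
        while (!queue.isEmpty()) {
            String url = queue.remove();
            int d = depth.get(url);
            if (d >= maxDepth) {
                continue; // limit reached: outlinks of this page are not followed
            }
            for (String out : links.getOrDefault(url, List.of())) {
                if (depth.putIfAbsent(out, d + 1) == null) {
                    queue.add(out);
                }
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
                "a", List.of("b"),
                "b", List.of("c"));
        System.out.println(crawl(links, List.of("a"), 1));
    }
}
```

With a limit of one hop, the page `c` in the example graph is never discovered, because the outlinks of `b` are not followed.
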
Modifier and Type | Class and Description |
---|---|
class |
LinkAnalysisScoringFilter |
Modifier and Type | Class and Description |
---|---|
class |
OPICScoringFilter
This plugin implements a variant of an Online Page Importance Computation
(OPIC) score, described in this paper:
Abiteboul, Serge and Preda, Mihai and Cobena, Gregory (2003), Adaptive
On-Line Page Importance Computation.
|
Modifier and Type | Class and Description |
---|---|
class |
OrphanScoringFilter
Orphan scoring filter that determines whether a page has become orphaned,
e.g.
|
Modifier and Type | Class and Description |
---|---|
class |
SimilarityScoringFilter |
Modifier and Type | Class and Description |
---|---|
class |
TLDScoringFilter
Scoring filter to boost TLDs.
|
Modifier and Type | Class and Description |
---|---|
class |
URLMetaScoringFilter
For documentation:
org.apache.nutch.scoring.urlmeta |
Modifier and Type | Class and Description |
---|---|
class |
RegexURLFilterBase
Generic URL filter based on regular
expressions. |
Modifier and Type | Class and Description |
---|---|
class |
AutomatonURLFilter
RegexURLFilterBase implementation based on the dk.brics.automaton Finite-State
Automata for Java™.
|
Modifier and Type | Class and Description |
---|---|
class |
DomainURLFilter
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
|
Modifier and Type | Class and Description |
---|---|
class |
DomainDenylistURLFilter
Filters URLs based on a file containing domain suffixes, domain names, and
hostnames.
|
Modifier and Type | Class and Description |
---|---|
class |
FastURLFilter
Filters URLs based on a file of regular expressions using host/domains
matching first.
|
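
The "host/domain first, regular expressions second" idea behind FastURLFilter can be sketched as follows. This is a stand-alone illustration under stated assumptions, not the plugin's code: the class name `HostFirstFilterSketch`, the deny-only rule table, the choice to reject URLs without a host, and the convention of returning the URL when accepted and `null` when rejected are all assumptions here, and the plugin's actual rule file format is not reproduced.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical illustration of host-first URL filtering; not the Nutch class. */
class HostFirstFilterSketch {

    // Assumed rule table: host or domain name -> path patterns to reject.
    private final Map<String, List<Pattern>> denyByHost;

    HostFirstFilterSketch(Map<String, List<Pattern>> denyByHost) {
        this.denyByHost = denyByHost;
    }

    /** Returns the URL if accepted, or null if rejected (assumed convention). */
    String filter(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return null; // unparsable URLs are rejected in this sketch
        }
        String host = uri.getHost();
        if (host == null) {
            return null; // sketch-level choice: no host, no match possible
        }
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();

        // Cheap exact lookups on "www.example.com", "example.com", "com", ...
        String name = host.toLowerCase(Locale.ROOT);
        while (true) {
            List<Pattern> rules = denyByHost.get(name);
            if (rules != null) {
                // Regular expressions run only for hosts that have rules.
                for (Pattern p : rules) {
                    if (p.matcher(path).find()) {
                        return null;
                    }
                }
            }
            int dot = name.indexOf('.');
            if (dot < 0) {
                break;
            }
            name = name.substring(dot + 1);
        }
        return url;
    }

    public static void main(String[] args) {
        HostFirstFilterSketch f = new HostFirstFilterSketch(
                Map.of("example.com", List.of(Pattern.compile("^/private/"))));
        System.out.println(f.filter("https://www.example.com/private/x"));
        System.out.println(f.filter("https://www.example.com/public/x"));
    }
}
```

The point of the ordering is that most URLs are resolved by a few hash lookups on the host name, so the comparatively expensive regular expressions are evaluated only for hosts that actually have rules.
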
Modifier and Type | Class and Description |
---|---|
class |
ExemptionUrlFilter
This implementation of
URLExemptionFilter uses regex configuration
to check whether a URL is eligible for exemption from 'db.ignore.external'. |
Modifier and Type | Class and Description |
---|---|
class |
PrefixURLFilter
Filters URLs based on a file of URL prefixes.
|
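
Prefix-based filtering, as described for PrefixURLFilter, amounts to keeping a URL only when it starts with one of the configured prefixes. The sketch below is a minimal stand-alone version under stated assumptions: the class name `PrefixFilterSketch` is hypothetical, the prefixes are passed in as a list rather than read from a file, a linear scan stands in for whatever matching structure the plugin uses, and returning the URL for "accepted" and `null` for "rejected" is an assumed convention.

```java
import java.util.List;

/** Hypothetical illustration of prefix filtering; not the Nutch class. */
class PrefixFilterSketch {

    private final List<String> prefixes;

    PrefixFilterSketch(List<String> prefixes) {
        this.prefixes = List.copyOf(prefixes);
    }

    /** Returns the URL unchanged if it starts with a listed prefix, else null. */
    String filter(String url) {
        for (String prefix : prefixes) {
            if (url.startsWith(prefix)) {
                return url;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        PrefixFilterSketch f =
                new PrefixFilterSketch(List.of("https://example.org/docs/"));
        System.out.println(f.filter("https://example.org/docs/index.html"));
        System.out.println(f.filter("https://example.org/blog/"));
    }
}
```

For long prefix lists, a trie or a sorted structure would avoid scanning every prefix for every URL; the linear scan is kept here only for brevity.
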
Modifier and Type | Class and Description |
---|---|
class |
RegexURLFilter
Filters URLs based on a file of regular expressions using the
Java Regex implementation. |
Modifier and Type | Class and Description |
---|---|
class |
SuffixURLFilter
Filters URLs based on a file of URL suffixes.
|
Modifier and Type | Class and Description |
---|---|
class |
UrlValidator
Validates URLs.
|
Modifier and Type | Class and Description |
---|---|
class |
CCIndexingFilter
Adds basic searchable fields to a document.
|
class |
CCParseFilter
Adds metadata identifying the Creative Commons license used, if any.
|
Copyright © 2021 The Apache Software Foundation