Package | Description |
---|---|
org.apache.nutch.analysis.lang | Text document language identifier. |
org.apache.nutch.any23 | This package uses the Apache Any23 library for parsing and extracting structured data in RDF format from a variety of Web documents. |
org.apache.nutch.exchange | Control code for the exchange component, which acts in the indexing job and decides to which index writer a document should be routed, based on plugin behavior. |
org.apache.nutch.exchange.jexl | Plugin of the exchange component based on JEXL expressions. |
org.apache.nutch.indexer | Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index. |
org.apache.nutch.indexer.anchor | An indexing plugin for inbound anchor text. |
org.apache.nutch.indexer.basic | A basic indexing plugin that adds basic fields: url, host, title, content, etc. |
org.apache.nutch.indexer.feed | Indexing filter to index metadata from RSS feeds. |
org.apache.nutch.indexer.filter | |
org.apache.nutch.indexer.geoip | This plugin implements an indexing filter which takes advantage of the GeoIP2-java API. |
org.apache.nutch.indexer.jexl | This plugin implements a dynamic indexing filter which uses JEXL expressions to allow filtering based on the page's metadata. |
org.apache.nutch.indexer.links | |
org.apache.nutch.indexer.metadata | Indexing filter to add document metadata to the index. |
org.apache.nutch.indexer.more | An indexing plugin that adds "more" index fields: last modified date, MIME type, content length. |
org.apache.nutch.indexer.replace | Indexing filter to allow pattern replacements on metadata. |
org.apache.nutch.indexer.staticfield | A simple plugin called at indexing that adds fields with static data. |
org.apache.nutch.indexer.subcollection | Indexing filter to assign documents to subcollections. |
org.apache.nutch.indexer.tld | Top Level Domain Indexing plugin. |
org.apache.nutch.indexer.urlmeta | URL Meta Tag Indexing Plugin |
org.apache.nutch.indexwriter.cloudsearch | |
org.apache.nutch.indexwriter.csv | Index writer plugin to write a plain CSV file. |
org.apache.nutch.indexwriter.dummy | Index writer plugin for debugging; writes pairs of <action, url> to a text file, where action is one of "add", "update", or "delete". |
org.apache.nutch.indexwriter.elastic | Index writer plugin for Elasticsearch. |
org.apache.nutch.indexwriter.kafka | Index writer plugin to produce JSON messages to Kafka. |
org.apache.nutch.indexwriter.rabbit | |
org.apache.nutch.indexwriter.solr | Index writer plugin for Apache Solr. |
org.apache.nutch.microformats.reltag | A microformats Rel-Tag Parser/Indexer/Querier plugin. |
org.apache.nutch.scoring | The ScoringFilter interface. |
org.apache.nutch.scoring.depth | Scoring filter to stop crawling at a configurable depth (number of "hops" from seed URLs). |
org.apache.nutch.scoring.link | Scoring filter used in conjunction with WebGraph. |
org.apache.nutch.scoring.opic | Scoring filter implementing a variant of the Online Page Importance Computation (OPIC) algorithm. |
org.apache.nutch.scoring.tld | Top Level Domain Scoring plugin. |
org.apache.nutch.scoring.urlmeta | URL Meta Tag Scoring Plugin |
org.apache.nutch.tools | Miscellaneous tools. |
org.creativecommons.nutch | Sample plugins that parse and index Creative Commons metadata. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
LanguageIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
Any23IndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
String[] |
Exchanges.indexWriters(NutchDocument nutchDocument)
Returns all the index writers to which the document must be sent.
|
boolean |
Exchange.match(NutchDocument doc)
Determines if the document must go to the related index writers.
|
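
The two methods above suggest a simple routing pattern: each exchange decides whether a document matches, and the writer IDs of all matching exchanges are collected. The following is a minimal, self-contained sketch of that idea and not the Nutch implementation. It uses a plain `Map<String, String>` in place of `NutchDocument`; the `SimpleExchange` class, the writer IDs, and the choice to return an empty array when nothing matches are all assumptions made for illustration.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

class ExchangeRoutingSketch {

    // Hypothetical stand-in for one exchange: a match rule plus the writer IDs it feeds.
    static final class SimpleExchange {
        private final Predicate<Map<String, String>> rule;
        private final List<String> writerIds;

        SimpleExchange(Predicate<Map<String, String>> rule, List<String> writerIds) {
            this.rule = rule;
            this.writerIds = writerIds;
        }

        // Same idea as Exchange.match(doc): should this document go to these writers?
        boolean match(Map<String, String> doc) {
            return rule.test(doc);
        }
    }

    // Same idea as Exchanges.indexWriters(doc): gather the writer IDs of every
    // exchange whose rule matches, without duplicates, in configuration order.
    static String[] indexWriters(List<SimpleExchange> exchanges, Map<String, String> doc) {
        Set<String> ids = new LinkedHashSet<>();
        for (SimpleExchange exchange : exchanges) {
            if (exchange.match(doc)) {
                ids.addAll(exchange.writerIds);
            }
        }
        return ids.toArray(new String[0]);
    }

    // Example configuration; the writer IDs are made up for illustration.
    static List<SimpleExchange> demoExchanges() {
        return List.of(
            new SimpleExchange(doc -> "text/html".equals(doc.get("type")),
                               List.of("solr_html")),
            new SimpleExchange(doc -> doc.containsKey("lang"),
                               List.of("elastic_all", "solr_html")));
    }

    public static void main(String[] args) {
        Map<String, String> doc = Map.of("type", "text/html", "lang", "en");
        // prints solr_html,elastic_all
        System.out.println(String.join(",", indexWriters(demoExchanges(), doc)));
    }
}
```

A real exchange plugin would evaluate its rule against `NutchDocument` fields; how documents that match no exchange are handled is not stated in the table above.
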
Modifier and Type | Method and Description |
---|---|
boolean |
JexlExchange.match(NutchDocument doc)
Determines if the document must go to the related index writers.
|
Modifier and Type | Field and Description |
---|---|
NutchDocument |
NutchIndexAction.doc |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
NutchDocument.clone() |
NutchDocument |
IndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Adds fields or otherwise modifies the document that will be indexed for a
parse.
|
NutchDocument |
IndexingFilters.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Run all defined filters.
|
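
"Run all defined filters" can be pictured as a chain in which each filter receives the document returned by the previous one. The sketch below is a simplified model under stated assumptions, not the Nutch code: a filter is reduced to a `UnaryOperator` over a `Map<String, String>` (the `parse`, `url`, `datum`, and `inlinks` arguments are omitted), and a `null` result is assumed to mean "discard the document and stop the chain". The filters in `demoFilters()` are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class FilterChainSketch {

    // Runs each filter in order. A filter may modify the document, return a new one,
    // or return null to discard it; a null result stops the chain (assumed behavior).
    static Map<String, String> runAll(List<UnaryOperator<Map<String, String>>> filters,
                                      Map<String, String> doc) {
        for (UnaryOperator<Map<String, String>> filter : filters) {
            doc = filter.apply(doc);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    // Hypothetical filters for illustration only.
    static List<UnaryOperator<Map<String, String>>> demoFilters() {
        // Adds a field to every document.
        UnaryOperator<Map<String, String>> addHost = doc -> {
            doc.put("host", "example.org");
            return doc;
        };
        // Discards documents that have no content.
        UnaryOperator<Map<String, String>> dropEmpty = doc ->
            doc.getOrDefault("content", "").isEmpty() ? null : doc;
        return List.of(addHost, dropEmpty);
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("content", "hello");
        System.out.println(runAll(demoFilters(), doc));
    }
}
```
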
Modifier and Type | Method and Description |
---|---|
NutchDocument |
IndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Adds fields or otherwise modifies the document that will be indexed for a
parse.
|
NutchDocument |
IndexingFilters.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Run all defined filters.
|
void |
IndexWriters.update(NutchDocument doc) |
void |
IndexWriter.update(NutchDocument doc) |
void |
IndexWriters.write(NutchDocument doc) |
void |
IndexWriter.write(NutchDocument doc) |
Constructor and Description |
---|
NutchIndexAction(NutchDocument doc,
byte action) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
AnchorIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
The
AnchorIndexingFilter filter object which supports boolean
configuration settings for the deduplication of anchors. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
BasicIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
The
BasicIndexingFilter filter object which supports a few
configuration settings for adding basic searchable fields. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
FeedIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Extracts the relevant fields (FEED_AUTHOR, FEED_TAGS, FEED_PUBLISHED,
FEED_UPDATED, FEED) and sends them to the
Indexer for indexing within the Nutch index. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MimeTypeIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
static NutchDocument |
GeoIPDocumentCreator.createDocFromCityDb(String serverIp,
NutchDocument doc,
DatabaseReader reader) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromCityService(String serverIp,
NutchDocument doc,
WebServiceClient client) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromConnectionDb(String serverIp,
NutchDocument doc,
DatabaseReader reader) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromCountryService(String serverIp,
NutchDocument doc,
WebServiceClient client) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromDomainDb(String serverIp,
NutchDocument doc,
DatabaseReader reader) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromInsightsService(String serverIp,
NutchDocument doc,
WebServiceClient client) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromIspDb(String serverIp,
NutchDocument doc,
DatabaseReader reader) |
NutchDocument |
GeoIPIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
static void |
GeoIPDocumentCreator.addIfNotNull(NutchDocument doc,
String name,
Integer value)
Adds the field to the document only if the value is not null.
|
static void |
GeoIPDocumentCreator.addIfNotNull(NutchDocument doc,
String name,
String value)
Adds the field to the document only if the value is not null.
|
static NutchDocument |
GeoIPDocumentCreator.createDocFromCityDb(String serverIp,
NutchDocument doc,
DatabaseReader reader) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromCityService(String serverIp,
NutchDocument doc,
WebServiceClient client) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromConnectionDb(String serverIp,
NutchDocument doc,
DatabaseReader reader) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromCountryService(String serverIp,
NutchDocument doc,
WebServiceClient client) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromDomainDb(String serverIp,
NutchDocument doc,
DatabaseReader reader) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromInsightsService(String serverIp,
NutchDocument doc,
WebServiceClient client) |
static NutchDocument |
GeoIPDocumentCreator.createDocFromIspDb(String serverIp,
NutchDocument doc,
DatabaseReader reader) |
NutchDocument |
GeoIPIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
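
The `addIfNotNull` helpers listed above add a field only when a value is present, which keeps empty lookup results out of the document. A minimal sketch of that guard follows. It is an illustration under assumptions: the document is modeled as a `Map<String, List<Object>>` rather than a `NutchDocument`, a single `Object` parameter stands in for the separate `Integer` and `String` overloads, and the field names are invented.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class AddIfNotNullSketch {

    // Adds the value under the given field name, but only when the value is present.
    static void addIfNotNull(Map<String, List<Object>> doc, String name, Object value) {
        if (value != null) {
            doc.computeIfAbsent(name, key -> new ArrayList<>()).add(value);
        }
    }

    public static void main(String[] args) {
        Map<String, List<Object>> doc = new LinkedHashMap<>();
        addIfNotNull(doc, "city", "Oslo");
        addIfNotNull(doc, "asNum", null);      // skipped: no field is created
        addIfNotNull(doc, "postalCode", 150);
        System.out.println(doc);               // prints {city=[Oslo], postalCode=[150]}
    }
}
```
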
Modifier and Type | Method and Description |
---|---|
NutchDocument |
JexlIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
LinksIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MetadataIndexer.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
protected void |
MetadataIndexer.add(NutchDocument doc,
String key,
String value) |
NutchDocument |
MetadataIndexer.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
MoreIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
ReplaceIndexer.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Adds fields or otherwise modifies the document that will be indexed for a
parse.
|
Modifier and Type | Method and Description |
---|---|
NutchDocument |
StaticFieldIndexer.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
The
StaticFieldIndexer filter object which adds fields as per
configuration setting. |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
SubcollectionIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
TLDIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text urlText,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
URLMetaIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks)
Takes the metatags listed in your "urlmeta.tags"
property and looks for them inside the CrawlDatum object.
|
Modifier and Type | Method and Description |
---|---|
void |
CloudSearchIndexWriter.update(NutchDocument doc) |
void |
CloudSearchIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
void |
CSVIndexWriter.update(NutchDocument doc) |
void |
CSVIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
void |
DummyIndexWriter.update(NutchDocument doc) |
void |
DummyIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
void |
ElasticIndexWriter.update(NutchDocument doc) |
void |
ElasticIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
void |
KafkaIndexWriter.update(NutchDocument doc) |
void |
KafkaIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
void |
RabbitIndexWriter.update(NutchDocument doc) |
void |
RabbitIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
void |
SolrIndexWriter.update(NutchDocument doc) |
void |
SolrIndexWriter.write(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
RelTagIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
float |
ScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
This method calculates an indexed document score/boost.
|
float |
ScoringFilters.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore) |
float |
AbstractScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore) |
Modifier and Type | Method and Description |
---|---|
float |
DepthScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore) |
Modifier and Type | Method and Description |
---|---|
float |
LinkAnalysisScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore) |
Modifier and Type | Method and Description |
---|---|
float |
OPICScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
Dampen the boost value by scorePower.
|
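
"Dampen the boost value by scorePower" can be read as raising the stored page score to a power before applying it to the incoming boost. The sketch below shows that arithmetic only. The exact formula is an assumption rather than something the table above states, and the method takes plain `float` values instead of the `CrawlDatum` and `NutchDocument` arguments of the real signature.

```java
class OpicBoostSketch {

    // Assumed formula: raise the stored page score to scorePower, then scale the
    // incoming boost by the result. With scorePower < 1, large scores are compressed.
    static float indexerScore(float pageScore, float scorePower, float initScore) {
        return (float) Math.pow(pageScore, scorePower) * initScore;
    }

    public static void main(String[] args) {
        System.out.println(indexerScore(16.0f, 0.5f, 1.0f)); // prints 4.0
        System.out.println(indexerScore(16.0f, 0.5f, 2.0f)); // prints 8.0
    }
}
```

With a power of 0.5, a page score of 16 contributes a factor of 4 instead of 16, so very high-scoring pages dominate the ranking less.
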
Modifier and Type | Method and Description |
---|---|
float |
TLDScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore) |
Modifier and Type | Method and Description |
---|---|
float |
URLMetaScoringFilter.indexerScore(Text url,
NutchDocument doc,
CrawlDatum dbDatum,
CrawlDatum fetchDatum,
Parse parse,
Inlinks inlinks,
float initScore)
Boilerplate
|
Modifier and Type | Method and Description |
---|---|
static org.archive.io.warc.WARCRecordInfo |
WARCUtils.docToMetadata(NutchDocument doc) |
Modifier and Type | Method and Description |
---|---|
NutchDocument |
CCIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Modifier and Type | Method and Description |
---|---|
void |
CCIndexingFilter.addUrlFeatures(NutchDocument doc,
String urlString)
Add the features represented by a license URL.
|
NutchDocument |
CCIndexingFilter.filter(NutchDocument doc,
Parse parse,
Text url,
CrawlDatum datum,
Inlinks inlinks) |
Copyright © 2021 The Apache Software Foundation