Package | Description |
---|---|
org.apache.nutch.indexer |
Index content, configure and run indexing and cleaning jobs to
add, update, and delete documents from an index.
|
org.apache.nutch.metadata |
A multi-valued metadata container, and a set
of constant fields for Nutch metadata.
|
org.apache.nutch.net.protocols |
Helper classes related to the
Protocol
interface, see also org.apache.nutch.protocol . |
org.apache.nutch.parse |
The
Parse interface and related classes. |
org.apache.nutch.protocol |
Classes related to the
Protocol interface,
see also org.apache.nutch.net.protocols . |
org.apache.nutch.protocol.htmlunit |
Protocol plugin which supports retrieving documents via the HTTP protocol.
|
org.apache.nutch.protocol.http |
Protocol plugin which supports retrieving documents via the HTTP protocol.
|
org.apache.nutch.protocol.httpclient |
Protocol plugin which supports retrieving documents via the HTTP and
HTTPS protocols, optionally with Basic, Digest and NTLM authentication
schemes for web server as well as proxy server.
|
org.apache.nutch.protocol.interactiveselenium |
Protocol plugin which supports retrieving documents via Selenium.
|
org.apache.nutch.protocol.okhttp |
Protocol plugin based on OkHttp; supports HTTP, HTTPS, and HTTP/2.
|
org.apache.nutch.protocol.selenium |
Protocol plugin which supports retrieving documents via Selenium.
|
org.apache.nutch.scoring.webgraph | |
org.apache.nutch.segment |
A segment stores all data from one generate/fetch/update cycle:
fetch list, protocol status, raw content, parsed content, and extracted outgoing links.
|
org.apache.nutch.tools |
Miscellaneous tools.
|
org.creativecommons.nutch |
Sample plugins that parse and index Creative Commons metadata.
|
Modifier and Type | Method and Description |
---|---|
Metadata |
NutchDocument.getDocumentMeta() |
Modifier and Type | Class and Description |
---|---|
class |
SpellCheckedMetadata
A decorator to Metadata that adds spellchecking capabilities to property
names.
|
Modifier and Type | Method and Description |
---|---|
Metadata |
MetaWrapper.getMetadata()
Get all metadata.
|
Modifier and Type | Method and Description |
---|---|
void |
Metadata.addAll(Metadata metadata)
Add all name/value mappings (merge two metadata mappings).
|
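The merge described for `Metadata.addAll` can be illustrated with a small stand-alone sketch. `MultiValuedMeta` below is a hypothetical stand-in written against the JDK only; it is not the Nutch class. It assumes that merging appends the other container's values to any values already stored under the same name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata container.
// It is NOT org.apache.nutch.metadata.Metadata; it only illustrates the idea.
class MetadataMergeSketch {

  static final class MultiValuedMeta {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Append a value under a name, keeping any earlier values.
    void add(String name, String value) {
      values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Return a copy of every value stored under a name (empty if none).
    List<String> getValues(String name) {
      return new ArrayList<>(values.getOrDefault(name, new ArrayList<>()));
    }

    // Merge: assumed here to append all of the other container's values.
    void addAll(MultiValuedMeta other) {
      for (Map.Entry<String, List<String>> e : other.values.entrySet()) {
        // Iterate over a copy so that merging a container into itself is safe.
        for (String v : new ArrayList<>(e.getValue())) {
          add(e.getKey(), v);
        }
      }
    }
  }

  public static void main(String[] args) {
    MultiValuedMeta a = new MultiValuedMeta();
    a.add("Content-Type", "text/html");

    MultiValuedMeta b = new MultiValuedMeta();
    b.add("Content-Type", "application/xhtml+xml");
    b.add("Content-Language", "en");

    a.addAll(b);
    System.out.println(a.getValues("Content-Type"));
    System.out.println(a.getValues("Content-Language"));
  }
}
```

After the merge, `a` holds both `Content-Type` values in insertion order, plus the `Content-Language` value that only `b` had.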
Constructor and Description |
---|
MetaWrapper(Metadata metadata,
Writable instance,
Configuration conf) |
Modifier and Type | Method and Description |
---|---|
Metadata |
Response.getHeaders()
Returns all the headers.
|
Modifier and Type | Method and Description |
---|---|
Metadata |
ParseData.getContentMeta()
The original Metadata retrieved from content
|
Metadata |
HTMLMetaTags.getGeneralTags()
Returns all collected values of the general meta tags.
|
Metadata |
ParseData.getParseMeta()
Other content properties.
|
Modifier and Type | Method and Description |
---|---|
void |
ParseData.setParseMeta(Metadata parseMeta) |
Constructor and Description |
---|
ParseData(ParseStatus status,
String title,
Outlink[] outlinks,
Metadata contentMeta) |
ParseData(ParseStatus status,
String title,
Outlink[] outlinks,
Metadata contentMeta,
Metadata parseMeta) |
Modifier and Type | Method and Description |
---|---|
Metadata |
Content.getMetadata()
Other protocol-specific data.
|
Modifier and Type | Method and Description |
---|---|
void |
Content.setMetadata(Metadata metadata)
Other protocol-specific data.
|
Constructor and Description |
---|
Content(String url,
String base,
byte[] content,
String contentType,
Metadata metadata,
Configuration conf) |
Content(String url,
String base,
byte[] content,
String contentType,
Metadata metadata,
MimeUtil mimeTypes) |
Modifier and Type | Method and Description |
---|---|
Metadata |
HttpResponse.getHeaders() |
Modifier and Type | Method and Description |
---|---|
Metadata |
HttpResponse.getHeaders() |
Modifier and Type | Method and Description |
---|---|
Metadata |
HttpResponse.getHeaders() |
Modifier and Type | Method and Description |
---|---|
HttpAuthentication |
HttpAuthenticationFactory.findAuthentication(Metadata header) |
Modifier and Type | Method and Description |
---|---|
Metadata |
HttpResponse.getHeaders() |
Modifier and Type | Method and Description |
---|---|
Metadata |
OkHttpResponse.getHeaders() |
Modifier and Type | Method and Description |
---|---|
Metadata |
HttpResponse.getHeaders() |
Modifier and Type | Method and Description |
---|---|
Metadata |
Node.getMetadata() |
Modifier and Type | Method and Description |
---|---|
void |
Node.setMetadata(Metadata metadata) |
Modifier and Type | Method and Description |
---|---|
static Charset |
SegmentReader.getCharset(Metadata parseMeta)
Try to get HTML encoding from parse metadata
|
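As a rough illustration of what "trying" to get an encoding from parse metadata can look like, the sketch below resolves a `Charset` from a plain `Map<String, String>`. The key name `encoding`, the helper `resolveCharset`, and the UTF-8 fallback are all assumptions made for this example; they do not describe how `SegmentReader.getCharset` is implemented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a plain Map stands in for the parse metadata.
class CharsetFromMetaSketch {

  // Hypothetical key name, chosen for this example.
  static final String ENCODING_KEY = "encoding";

  // Return the charset named in the metadata, or UTF-8 (assumed fallback)
  // when the entry is missing, blank, malformed, or unsupported.
  static Charset resolveCharset(Map<String, String> parseMeta) {
    String name = parseMeta.get(ENCODING_KEY);
    if (name == null || name.trim().isEmpty()) {
      return StandardCharsets.UTF_8;
    }
    try {
      return Charset.forName(name.trim());
    } catch (IllegalArgumentException e) {
      // Covers IllegalCharsetNameException and UnsupportedCharsetException.
      return StandardCharsets.UTF_8;
    }
  }

  public static void main(String[] args) {
    Map<String, String> meta = new HashMap<>();
    meta.put(ENCODING_KEY, "ISO-8859-1");
    System.out.println(resolveCharset(meta));

    meta.put(ENCODING_KEY, "no-such-charset");
    System.out.println(resolveCharset(meta));
  }
}
```

Falling back to a fixed default keeps callers free of exception handling, at the cost of hiding a bad encoding declaration.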
Modifier and Type | Field and Description |
---|---|
protected Metadata |
AbstractCommonCrawlFormat.metadata |
Modifier and Type | Method and Description |
---|---|
static CommonCrawlFormat |
CommonCrawlFormatFactory.getCommonCrawlFormat(String formatType,
String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config)
Deprecated.
|
String |
CommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata)
Returns a string representation of the JSON structure of the URL content
|
String |
AbstractCommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata) |
String |
CommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData)
Returns a string representation of the JSON structure of the URL content,
taking into account the parsed metadata about the URL.
|
String |
AbstractCommonCrawlFormat.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData) |
String |
CommonCrawlFormatWARC.getJsonData(String url,
Content content,
Metadata metadata,
ParseData parseData) |
Constructor and Description |
---|
AbstractCommonCrawlFormat(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config) |
CommonCrawlFormatJackson(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config) |
CommonCrawlFormatJettinson(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config) |
CommonCrawlFormatSimple(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config) |
CommonCrawlFormatWARC(String url,
Content content,
Metadata metadata,
Configuration nutchConf,
CommonCrawlConfig config,
ParseData parseData) |
Modifier and Type | Method and Description |
---|---|
static void |
CCParseFilter.Walker.walk(Node doc,
URL base,
Metadata metadata,
Configuration conf)
Scan the document adding attributes to metadata.
|
Copyright © 2021 The Apache Software Foundation