Package | Description |
---|---|
org.apache.nutch.crawl | Crawl control code and tools to run the crawler. |
org.apache.nutch.fetcher | The Nutch robot. |
org.apache.nutch.indexer | Index content, configure and run indexing and cleaning jobs to add, update, and delete documents from an index. |
org.apache.nutch.parse | The Parse interface and related classes. |
org.apache.nutch.service.impl | |
org.apache.nutch.tools | Miscellaneous tools. |
Modifier and Type | Class and Description |
---|---|
class | CrawlDb: This class takes the output of the fetcher and updates the crawldb accordingly. |
class | DeduplicationJob: Generic deduplicator which groups fetched URLs with the same digest and marks all of them as duplicate except the one with the highest score (based on the score in the crawldb, which is not necessarily the same as the score indexed). |
class | Generator: Generates a subset of a crawl db to fetch. |
class | Injector: Injector takes a flat text file of URLs (or a folder containing text files) and merges ("injects") these URLs into the CrawlDb. |
class | LinkDb: Maintains an inverted link map, listing incoming links for each url. |
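
The grouping rule described for DeduplicationJob can be illustrated with a minimal, self-contained sketch. It does not use Nutch or Hadoop classes: `PageEntry` and `DedupSketch` are hypothetical names invented for this example, and keeping the first entry seen when scores tie is an arbitrary choice made here, not a statement about how Nutch breaks ties.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical record of one fetched URL; not a Nutch class.
final class PageEntry {
    final String url;
    final String digest;  // content signature shared by identical pages
    final float score;    // score as stored in the crawl database
    boolean duplicate;    // set by DedupSketch.markDuplicates

    PageEntry(String url, String digest, float score) {
        this.url = url;
        this.digest = digest;
        this.score = score;
    }
}

final class DedupSketch {
    /**
     * Groups entries by digest and marks every entry in a group as a
     * duplicate except the one with the highest score. On equal scores
     * the entry seen first is kept (an assumption of this sketch).
     */
    static void markDuplicates(List<PageEntry> entries) {
        Map<String, PageEntry> bestByDigest = new HashMap<>();
        for (PageEntry entry : entries) {
            PageEntry best = bestByDigest.get(entry.digest);
            if (best == null) {
                // First entry with this digest: provisional winner.
                entry.duplicate = false;
                bestByDigest.put(entry.digest, entry);
            } else if (entry.score > best.score) {
                // New winner: demote the previous one.
                best.duplicate = true;
                entry.duplicate = false;
                bestByDigest.put(entry.digest, entry);
            } else {
                entry.duplicate = true;
            }
        }
    }
}
```

Calling `DedupSketch.markDuplicates` on a list in which two entries share a digest leaves only the higher-scoring one with `duplicate == false`; entries with a unique digest are never marked.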
Modifier and Type | Class and Description |
---|---|
class | Fetcher: A queue-based fetcher. |
Modifier and Type | Class and Description |
---|---|
class | IndexingJob: Generic indexer which relies on the plugins implementing IndexWriter. |
Modifier and Type | Class and Description |
---|---|
class | ParseSegment |
Modifier and Type | Method and Description |
---|---|
NutchTool | JobFactory.createToolByClassName(String className, Configuration conf) |
NutchTool | JobFactory.createToolByType(JobManager.JobType type, Configuration conf) |
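
The two factory signatures above suggest a common pattern: look a tool up either by its class name or by a job type, then hand it the configuration. The sketch below shows one way such a factory could be written with plain reflection. It is not Nutch's implementation: `Tool`, `InjectTool`, `FetchTool`, `ToolType`, and `ToolFactorySketch` are hypothetical stand-ins, and a `Map<String, String>` takes the place of Hadoop's `Configuration`.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a tool base class; not Nutch's NutchTool.
abstract class Tool {
    Map<String, String> conf = new HashMap<>();

    void setConf(Map<String, String> conf) {
        this.conf = conf;
    }

    abstract String name();
}

class InjectTool extends Tool {
    String name() { return "inject"; }
}

class FetchTool extends Tool {
    String name() { return "fetch"; }
}

// Hypothetical job types for this sketch only.
enum ToolType { INJECT, FETCH }

final class ToolFactorySketch {
    private static final Map<ToolType, Class<? extends Tool>> BY_TYPE =
            new EnumMap<>(ToolType.class);

    static {
        BY_TYPE.put(ToolType.INJECT, InjectTool.class);
        BY_TYPE.put(ToolType.FETCH, FetchTool.class);
    }

    /** Instantiates the named class via its no-arg constructor and configures it. */
    static Tool createToolByClassName(String className, Map<String, String> conf) {
        try {
            Class<?> raw = Class.forName(className);
            Tool tool = raw.asSubclass(Tool.class)
                    .getDeclaredConstructor()
                    .newInstance();
            tool.setConf(conf);
            return tool;
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalStateException("Cannot create tool " + className, e);
        }
    }

    /** Resolves the type to a registered class, then delegates to the by-name path. */
    static Tool createToolByType(ToolType type, Map<String, String> conf) {
        Class<? extends Tool> toolClass = BY_TYPE.get(type);
        if (toolClass == null) {
            throw new IllegalArgumentException("No tool registered for " + type);
        }
        return createToolByClassName(toolClass.getName(), conf);
    }
}
```

In this sketch, `ToolFactorySketch.createToolByType(ToolType.FETCH, conf)` and `ToolFactorySketch.createToolByClassName(FetchTool.class.getName(), conf)` return equivalent objects; routing both through one reflective path keeps error handling in a single place.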
Constructor and Description |
---|
JobWorker(JobConfig jobConfig, Configuration conf, NutchTool tool): Initializes the JobWorker thread with the job configuration provided by the user. |
ServiceWorker(ServiceConfig serviceConfig, NutchTool tool) |
Modifier and Type | Class and Description |
---|---|
class | CommonCrawlDataDumper: The Common Crawl Data Dumper tool reverse-generates the raw content from Nutch segment data directories into a common crawling data format that is consumed by many applications. |
Copyright © 2021 The Apache Software Foundation