Interface | Description |
---|---|
CommonCrawlFormat | Interface for all CommonCrawl formatters. |

Class | Description |
---|---|
AbstractCommonCrawlFormat | Abstract class that implements the org.apache.nutch.tools.CommonCrawlFormat interface. |
Benchmark | |
Benchmark.BenchmarkResults | |
CommonCrawlConfig | |
CommonCrawlDataDumper | The Common Crawl Data Dumper tool enables one to reverse generate the raw content from Nutch segment data directories into a common crawling data format, consumed by many applications. |
CommonCrawlFormatFactory | Factory class that creates new CommonCrawlFormat objects. |
CommonCrawlFormatJackson | This class provides methods to map crawled data to JSON using the Jackson Streaming APIs. |
CommonCrawlFormatJettinson | This class provides methods to map crawled data to JSON using the Jettison APIs. |
CommonCrawlFormatSimple | This class provides methods to map crawled data to JSON using a StringBuilder object. |
CommonCrawlFormatWARC | |
DmozParser | Utility that converts DMOZ RDF into a flat file of URLs to be injected. |
FileDumper | The file dumper tool enables one to reverse generate the raw content from Nutch segment data directories. |
FreeGenerator | This tool generates fetchlists (segments to be fetched) from plain text files containing one URL per line. |
FreeGenerator.FG | |
FreeGenerator.FG.FGMapper | |
FreeGenerator.FG.FGReducer | |
ResolveUrls | A simple tool that spins up multiple threads to resolve URLs to IP addresses. |
ShowProperties | Tool to list the properties, and their values, set by the current Nutch configuration. |
WARCUtils | |
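
The CommonCrawlFormatSimple entry above describes mapping crawled data to JSON with a StringBuilder. The following is a minimal sketch of that general technique only, not the Nutch implementation: the class name `SimpleJsonSketch`, its methods, and the field names `url` and `contentType` are all hypothetical and chosen for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; this class is not part of org.apache.nutch.tools.
public class SimpleJsonSketch {

    // Escape the characters that may not appear raw inside a JSON string.
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters become a 4-digit hex escape.
                        sb.append('\\').append('u').append(String.format("%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Serialize a flat map of string fields as one JSON object,
    // preserving the iteration order of the map.
    public static String toJson(Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            sb.append('"').append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append('"');
            first = false;
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // Assumed field names; a real crawl record would carry more metadata.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("url", "http://example.com/");
        record.put("contentType", "text/html");
        System.out.println(toJson(record));
    }
}
```

Building the string by hand avoids a library dependency, but the escaping has to be done explicitly, which is the main reason the Jackson and Jettison variants listed above may be preferable for anything beyond flat records.
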
Copyright © 2021 The Apache Software Foundation