public class SitemapProcessor extends Configured implements Tool
Performs sitemap processing by fetching sitemap links, parsing their content, and merging the URLs found in the sitemaps (with their metadata) into the existing crawldb.
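There are two use cases supported in Nutch's Sitemap processing:
1. Sitemaps are treated as "remote seed lists": crawl administrators prepare a list of sitemap links and only those sitemaps are fetched. This suits targeted crawls of specific hosts.
2. For open web crawls, where tracking each host and collecting its sitemap links manually is not feasible, Nutch automatically fetches the sitemaps for all hosts seen in the crawl and injects their URLs into the crawldb.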
For more details see: https://cwiki.apache.org/confluence/display/NUTCH/SitemapFeature
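The parse-and-merge idea described above can be sketched with the JDK alone. This is not the Nutch implementation: the class `SitemapMergeSketch`, its methods, the use of a plain `Map` as a stand-in for the crawldb, and the `overwrite` flag are all hypothetical and chosen only for illustration. The flag is loosely motivated by the name of `SITEMAP_OVERWRITE_EXISTING`, whose actual semantics are not documented on this page.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical stand-in, not part of Nutch: illustrates parse-then-merge only.
final class SitemapMergeSketch {

  /** Parses a sitemap <urlset> and returns loc -> lastmod ("" when lastmod is absent). */
  static Map<String, String> parse(String xml) {
    try {
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
      // Reject DOCTYPE declarations so an untrusted sitemap cannot pull in external entities.
      factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      Document doc = builder.parse(
          new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

      Map<String, String> entries = new LinkedHashMap<>();
      NodeList urls = doc.getElementsByTagName("url");
      for (int i = 0; i < urls.getLength(); i++) {
        Element url = (Element) urls.item(i);
        String loc = childText(url, "loc");
        if (!loc.isEmpty()) {
          entries.put(loc, childText(url, "lastmod"));
        }
      }
      return entries;
    } catch (Exception e) {
      throw new IllegalStateException("Could not parse sitemap", e);
    }
  }

  /** Returns the trimmed text of the first descendant with the given tag, or "". */
  private static String childText(Element parent, String tag) {
    NodeList nodes = parent.getElementsByTagName(tag);
    return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
  }

  /**
   * Merges sitemap entries into a copy of the (toy) crawldb.
   * Assumption: when overwrite is false, entries already present are kept as they are.
   */
  static Map<String, String> merge(Map<String, String> crawlDb,
                                   Map<String, String> sitemap,
                                   boolean overwrite) {
    Map<String, String> merged = new LinkedHashMap<>(crawlDb);
    for (Map.Entry<String, String> entry : sitemap.entrySet()) {
      if (overwrite) {
        merged.put(entry.getKey(), entry.getValue());
      } else {
        merged.putIfAbsent(entry.getKey(), entry.getValue());
      }
    }
    return merged;
  }
}
```

In this sketch, calling `merge(existing, SitemapMergeSketch.parse(xml), false)` adds only URLs that the toy crawldb has not seen, while passing `true` replaces the stored metadata with the values from the sitemap.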
Modifier and Type | Field and Description |
---|---|
static String | CURRENT_NAME |
static String | LOCK_NAME |
static org.slf4j.Logger | LOG |
static SimpleDateFormat | sdf |
static String | SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT |
static String | SITEMAP_OVERWRITE_EXISTING |
static String | SITEMAP_REDIR_MAX |
static String | SITEMAP_SIZE_MAX |
static String | SITEMAP_STRICT_PARSING |
static String | SITEMAP_URL_FILTERING |
static String | SITEMAP_URL_NORMALIZING |
Constructor and Description |
---|
SitemapProcessor() |
Modifier and Type | Method and Description |
---|---|
static void | main(String[] args) |
int | run(String[] args) |
void | sitemap(Path crawldb, Path hostdb, Path sitemapUrlDir, boolean strict, boolean filter, boolean normalize, int threads) |
static void | usage() |
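Methods inherited from class org.apache.hadoop.conf.Configured: getConf, setConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable: getConf, setConf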
public static final org.slf4j.Logger LOG
public static final SimpleDateFormat sdf
public static final String CURRENT_NAME
public static final String LOCK_NAME
public static final String SITEMAP_STRICT_PARSING
public static final String SITEMAP_URL_FILTERING
public static final String SITEMAP_URL_NORMALIZING
public static final String SITEMAP_ALWAYS_TRY_SITEMAPXML_ON_ROOT
public static final String SITEMAP_OVERWRITE_EXISTING
public static final String SITEMAP_REDIR_MAX
public static final String SITEMAP_SIZE_MAX
Copyright © 2021 The Apache Software Foundation