public class CrawlDbMerger extends Configured implements Tool
This tool merges several CrawlDbs into one, optionally filtering URLs through the current URLFilters to skip prohibited pages. It is also possible to use the tool just for filtering; in that case only one CrawlDb should be specified in the arguments. If more than one CrawlDb contains information about the same URL, only the most recent version is retained, as determined by the value of CrawlDatum.getFetchTime(). However, metadata from all versions is accumulated, with newer values taking precedence over older values.
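Because CrawlDbMerger extends Configured and implements Tool, it can be driven through Hadoop's ToolRunner. The sketch below is a minimal driver; the argument layout (output CrawlDb first, then input CrawlDbs, then optional -normalize / -filter switches) is an assumption here, so check the usage message printed by run(String[]) for the exact syntax.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.CrawlDbMerger;

public class MergeDbDriver {
  public static void main(String[] args) throws Exception {
    // In a real Nutch deployment a Nutch-aware Configuration is normally used;
    // a plain Hadoop Configuration is shown here to keep the sketch self-contained.
    Configuration conf = new Configuration();

    // Assumed argument layout: <output_crawldb> <crawldb1> [<crawldb2> ...] [-normalize] [-filter]
    String[] mergerArgs = {
        "crawl/merged_crawldb",   // hypothetical output CrawlDb
        "crawl/crawldb_a",        // hypothetical input CrawlDbs
        "crawl/crawldb_b",
        "-filter"
    };

    int exitCode = ToolRunner.run(conf, new CrawlDbMerger(), mergerArgs);
    System.exit(exitCode);
  }
}
```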
Modifier and Type | Class and Description |
---|---|
static class | CrawlDbMerger.Merger |
Constructor and Description |
---|
CrawlDbMerger() |
CrawlDbMerger(Configuration conf) |
Modifier and Type | Method and Description |
---|---|
static Job | createMergeJob(Configuration conf, Path output, boolean normalize, boolean filter) |
static void | main(String[] args) |
void | merge(Path output, Path[] dbs, boolean normalize, boolean filter) |
int | run(String[] args) |
Methods inherited from class org.apache.hadoop.conf.Configured: getConf, setConf
Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface org.apache.hadoop.conf.Configurable: getConf, setConf
public CrawlDbMerger()
public CrawlDbMerger(Configuration conf)
public void merge(Path output, Path[] dbs, boolean normalize, boolean filter) throws Exception
Throws: Exception
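For embedding the merge in other Java code, the documented merge(Path, Path[], boolean, boolean) method can be called directly. A minimal sketch, with placeholder paths:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.nutch.crawl.CrawlDbMerger;

public class ProgrammaticMerge {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    CrawlDbMerger merger = new CrawlDbMerger(conf);

    Path output = new Path("crawl/merged_crawldb");  // hypothetical output CrawlDb
    Path[] dbs = {
        new Path("crawl/crawldb_a"),                 // hypothetical input CrawlDbs
        new Path("crawl/crawldb_b")
    };

    // normalize = false, filter = true: keep the most recent CrawlDatum per URL
    // and drop URLs rejected by the configured URLFilters.
    merger.merge(output, dbs, false, true);
  }
}
```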
public static Job createMergeJob(Configuration conf, Path output, boolean normalize, boolean filter) throws IOException
Throws: IOException
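createMergeJob(Configuration, Path, boolean, boolean) only builds the merge Job; attaching the input CrawlDbs and submitting the job is left to the caller. The sketch below uses standard Hadoop job APIs; the exact input layout, such as pointing at a CrawlDb's "current" subdirectory, is an assumption.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.nutch.crawl.CrawlDbMerger;

public class LowLevelMerge {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path output = new Path("crawl/merged_crawldb");  // hypothetical output path

    // Build the merge job; normalize = false, filter = false here.
    Job job = CrawlDbMerger.createMergeJob(conf, output, false, false);

    // Assumption: the caller adds the input CrawlDb data directories.
    FileInputFormat.addInputPath(job, new Path("crawl/crawldb_a/current"));
    FileInputFormat.addInputPath(job, new Path("crawl/crawldb_b/current"));

    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}
```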