public class SegmentMerger extends Configured implements Tool
Also, it's possible to slice the resulting segment into chunks of fixed size.
It doesn't make sense to merge data from segments that are at different stages of processing (e.g. one unfetched segment, one fetched but not parsed, and one fetched and parsed). Therefore, prior to merging, the tool determines the lowest common set of input data, and only this data is merged. This may have unintended consequences: e.g. if the majority of input segments are fetched and parsed, but one of them is unfetched, the tool falls back to merging just the fetchlists, and skips all other data from all segments.
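The "lowest common set" rule can be pictured as a set intersection over the kinds of data each segment contains. The following is only a minimal sketch under assumptions, not the tool's actual code: the class `LowestCommonData` and the method `commonParts` are hypothetical, and the part names used in the example (`crawl_generate`, `crawl_fetch`, `content`, and so on) are assumed here to stand for the different kinds of segment data.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration; not part of SegmentMerger.
class LowestCommonData {

    // Returns the kinds of data that are present in every input segment.
    static Set<String> commonParts(List<Set<String>> segments) {
        if (segments.isEmpty()) {
            return new TreeSet<>();
        }
        // Start from the first segment and keep only what all others also have.
        Set<String> common = new TreeSet<>(segments.get(0));
        for (Set<String> parts : segments) {
            common.retainAll(parts);
        }
        return common;
    }

    public static void main(String[] args) {
        // Assumed part names for a fully fetched and parsed segment.
        Set<String> parsed = Set.of("crawl_generate", "crawl_fetch", "content",
                                    "crawl_parse", "parse_data", "parse_text");
        // A segment that has only been generated (a fetchlist).
        Set<String> unfetched = Set.of("crawl_generate");

        // Two parsed segments plus one unfetched one: only the fetchlist is common.
        System.out.println(commonParts(List.of(parsed, parsed, unfetched)));
        // prints [crawl_generate]
    }
}
```

In this sketch a single unfetched segment is enough to reduce the result to the fetchlist alone, which mirrors the fallback described above.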
Merging segments that contain just fetchlists (i.e. prior to fetching) is not recommended, because this tool (unlike the Generator) doesn't ensure that fetchlist parts for each map task are disjoint.
For some types of data (especially ParseText) it's not possible to determine which version is really older. Therefore the tool always uses segment names as timestamps, for all types of input data. Segment names are compared in forward lexicographic order (0-9a-zA-Z), and data from segments with "higher" names will prevail. It is therefore extremely important that segment names increase lexicographically as their creation time increases.
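To illustrate the "higher name wins" rule, here is a minimal sketch, not the reducer's actual code. The class `LatestBySegmentName` and its methods are hypothetical, and the segment names are assumed to be digit-only timestamps. The sketch uses Java's natural `String` ordering, which compares UTF-16 code units and may therefore order mixed-case names differently from the (0-9a-zA-Z) order stated above; digit-only names avoid that ambiguity.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not part of SegmentMerger.
class LatestBySegmentName {

    // Returns the lexicographically highest segment name.
    // Throws NoSuchElementException if the map is empty.
    static String latestSegment(Map<String, String> valueBySegment) {
        return Collections.max(valueBySegment.keySet());
    }

    // Returns the value that came from the segment with the highest name.
    static String winningValue(Map<String, String> valueBySegment) {
        return valueBySegment.get(latestSegment(valueBySegment));
    }

    public static void main(String[] args) {
        // Two versions of the same record, keyed by the (assumed) segment name.
        Map<String, String> versions = new LinkedHashMap<>();
        versions.put("20210301120000", "parse text from the March crawl");
        versions.put("20210101093000", "parse text from the January crawl");

        System.out.println(winningValue(versions));
        // prints: parse text from the March crawl
    }
}
```

Because only the names are compared, a segment created later but given a lexicographically lower name would lose in this sketch, which is why the naming convention matters.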
Modifier and Type | Class and Description |
---|---|
static class | SegmentMerger.ObjectInputFormat - Wraps inputs in a MetaWrapper, to permit merging different types in reduce and use additional metadata. |
static class | SegmentMerger.SegmentMergerMapper |
static class | SegmentMerger.SegmentMergerReducer - NOTE: in selecting the latest version we rely exclusively on the segment name (not all segment data contain time information). |
static class | SegmentMerger.SegmentOutputFormat |
Constructor and Description |
---|
SegmentMerger() |
SegmentMerger(Configuration conf) |
Modifier and Type | Method and Description |
---|---|
static void | main(String[] args) |
void | merge(Path out, Path[] segs, boolean filter, boolean normalize, long slice) |
int | run(String[] args) |
void | setConf(Configuration conf) |
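Methods inherited from class Configured: getConf
Methods inherited from class Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface Configurable: getConf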
public SegmentMerger()
public SegmentMerger(Configuration conf)
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
Overrides: setConf in class Configured
public void merge(Path out, Path[] segs, boolean filter, boolean normalize, long slice) throws IOException, ClassNotFoundException, InterruptedException
Copyright © 2021 The Apache Software Foundation