public class Any23ParseFilter extends Object implements HtmlParseFilter
This implementation of HtmlParseFilter
uses the Apache Any23 library
for parsing and extracting structured data in RDF format from a
variety of Web documents. The supported formats can be found at Apache Any23.
In this implementation triples are written as Notation3 (N3),
and individual triples are identified within output triple streams by the presence of '\n'.
The presence of the '\n' is a characteristic specific to N3 serialization in Any23.
To use other writers implementing the TripleHandler interface,
we will most likely need to identify an alternative data characteristic
with which to split triple streams.
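The newline-based splitting described above can be illustrated with a minimal sketch. It is not the filter's actual code: the class `TripleStreamSplitter`, the method `splitTriples`, and the `example.org` triples are hypothetical names chosen for illustration, and the sketch assumes only what the description states, namely that each serialized triple ends with '\n'.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Nutch or Any23.
class TripleStreamSplitter {

  // Splits serialized triple output into one string per triple,
  // assuming every triple is terminated by a newline character.
  static List<String> splitTriples(String serialized) {
    List<String> triples = new ArrayList<>();
    for (String line : serialized.split("\n")) {
      String trimmed = line.trim();
      // Skip blank lines so that trailing newlines do not yield empty triples.
      if (!trimmed.isEmpty()) {
        triples.add(trimmed);
      }
    }
    return triples;
  }

  public static void main(String[] args) {
    // Made-up sample output; real triples depend on the parsed document.
    String output =
        "<http://example.org/page> <http://example.org/vocab#title> \"Example\" .\n"
            + "<http://example.org/page> <http://example.org/vocab#lang> \"en\" .\n";
    for (String triple : splitTriples(output)) {
      System.out.println(triple);
    }
  }
}
```

Because the split relies solely on '\n', this approach would mis-split any serialization that places a raw line break inside a single triple, which is why a different writer would need a different splitting characteristic.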
Modifier and Type | Field and Description |
---|---|
static String | ANY_23_CONTENT_TYPES_CONF |
static String | ANY_23_EXTRACTORS_CONF |
static String | ANY23_TRIPLES: Constant identifier used as a Key for writing and reading triples to and from the metadata Map field. |
static org.slf4j.Logger | LOG: Logging instance |
Fields inherited from interface HtmlParseFilter: X_POINT_ID
Constructor and Description |
---|
Any23ParseFilter() |
Modifier and Type | Method and Description |
---|---|
ParseResult | filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc): Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page. |
Configuration | getConf() |
void | setConf(Configuration conf) |
public static final org.slf4j.Logger LOG
public static final String ANY23_TRIPLES
public static final String ANY_23_EXTRACTORS_CONF
public static final String ANY_23_CONTENT_TYPES_CONF
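As a rough sketch of how a key such as ANY23_TRIPLES might be used to write and read several triples under one metadata entry, the following uses a plain multi-valued map. The class `TripleMetadataSketch`, its methods, and the key value `"example-triples-key"` are hypothetical stand-ins; this is not the Nutch metadata class, and the real value of the constant is not stated on this page.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a multi-valued metadata container;
// it only illustrates the key-based write/read pattern.
class TripleMetadataSketch {

  // Placeholder key; assumed for illustration, not the real constant value.
  static final String TRIPLES_KEY = "example-triples-key";

  private final Map<String, List<String>> metadata = new HashMap<>();

  // Appends one triple to the list stored under the shared key.
  void addTriple(String triple) {
    metadata.computeIfAbsent(TRIPLES_KEY, k -> new ArrayList<>()).add(triple);
  }

  // Returns a read-only view of all triples stored under the key,
  // or an empty list if none have been written yet.
  List<String> getTriples() {
    return Collections.unmodifiableList(
        metadata.getOrDefault(TRIPLES_KEY, Collections.<String>emptyList()));
  }
}
```

Keeping every triple under a single well-known key lets later processing steps retrieve all extracted triples with one lookup, without needing to know how they were produced.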
public Configuration getConf()
Specified by: getConf in interface Configurable
public void setConf(Configuration conf)
Specified by: setConf in interface Configurable
public ParseResult filter(Content content, ParseResult parseResult, HTMLMetaTags metaTags, DocumentFragment doc)
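Description copied from interface: HtmlParseFilter
Adds metadata or otherwise modifies a parse of HTML content, given the DOM tree of a page.
Specified by: filter in interface HtmlParseFilter
See Also: HtmlParseFilter.filter(Content, ParseResult, HTMLMetaTags, DocumentFragment)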
Copyright © 2021 The Apache Software Foundation