public class WARCUtils extends Object
Modifier and Type | Field and Description |
---|---|
static String | COLONSP |
static String | CONFORMS_TO |
static String | CRLF |
static String | FORMAT |
static org.archive.uid.UUIDGenerator | generator |
static String | HOSTNAME |
static String | HTTP_HEADER_FROM |
static String | HTTP_HEADER_USER_AGENT |
static String | IP |
static String | OPERATOR |
protected static Pattern | PROBLEMATIC_HEADERS |
static String | ROBOTS |
static String | SOFTWARE |
protected static String | X_HIDE_HEADER |
Constructor and Description |
---|
WARCUtils() |
Modifier and Type | Method and Description |
---|---|
static org.archive.io.warc.WARCRecordInfo | docToMetadata(NutchDocument doc) |
static String | fixHttpHeaders(String headers, int contentLength) Modify verbatim HTTP response headers: fix, remove or replace headers Content-Length, Content-Encoding and Transfer-Encoding which may confuse WARC readers. |
static String | getAgentString(String name, String version, String description, String URL, String email) |
static String | getHostname(Configuration conf) |
static String | getIPAddress(Configuration conf) |
static org.archive.util.anvl.ANVLRecord | getWARCInfoContent(Configuration conf) |
static byte[] | toByteArray(org.archive.format.http.HttpHeaders headers) |
public static final String SOFTWARE
public static final String HTTP_HEADER_FROM
public static final String HTTP_HEADER_USER_AGENT
public static final String HOSTNAME
public static final String ROBOTS
public static final String OPERATOR
public static final String FORMAT
public static final String CONFORMS_TO
public static final String IP
public static final org.archive.uid.UUIDGenerator generator
public static final String CRLF
public static final String COLONSP
protected static final Pattern PROBLEMATIC_HEADERS
protected static final String X_HIDE_HEADER
public static final org.archive.util.anvl.ANVLRecord getWARCInfoContent(Configuration conf)
public static final String getHostname(Configuration conf) throws UnknownHostException
Throws: UnknownHostException
public static final String getIPAddress(Configuration conf) throws UnknownHostException
Throws: UnknownHostException
public static final byte[] toByteArray(org.archive.format.http.HttpHeaders headers) throws IOException
Throws: IOException
public static final String getAgentString(String name, String version, String description, String URL, String email)
public static final org.archive.io.warc.WARCRecordInfo docToMetadata(NutchDocument doc) throws UnsupportedEncodingException
Throws: UnsupportedEncodingException
public static final String fixHttpHeaders(String headers, int contentLength)
Modify verbatim HTTP response headers: fix, remove or replace the headers Content-Length, Content-Encoding and Transfer-Encoding, which may confuse WARC readers. Ensure that the returned headers end with a single empty line (\r\n\r\n).
Parameters:
headers - HTTP 1.1 or 1.0 response header string, CR-LF-separated lines, first line is status line
Copyright © 2021 The Apache Software Foundation
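The header cleanup described for fixHttpHeaders can be illustrated with a small standalone sketch. This is not the WARCUtils implementation. The class HeaderFixSketch and the method fixHeaders are hypothetical names. The exact policy is also an assumption made for illustration: Content-Length is rewritten to the supplied value (or appended when absent), while Content-Encoding and Transfer-Encoding are simply dropped. The real method may replace or rename these headers instead.

```java
import java.util.Locale;

/** Illustrative sketch only; not the WARCUtils implementation. */
public class HeaderFixSketch {

  private static final String CRLF = "\r\n";

  /**
   * Rewrites Content-Length to the given value, drops Content-Encoding and
   * Transfer-Encoding (assumed policy), and ends the header block with
   * exactly one empty line.
   */
  public static String fixHeaders(String headers, int contentLength) {
    if (headers == null) {
      return null;
    }
    StringBuilder out = new StringBuilder();
    boolean wroteLength = false;
    for (String line : headers.split(CRLF)) {
      if (line.isEmpty()) {
        continue; // empty lines are skipped; one is re-added at the end
      }
      int colon = line.indexOf(':');
      // The status line has no colon, so its "name" is empty and never matches.
      String name = colon < 0
          ? ""
          : line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
      if (name.equals("content-encoding") || name.equals("transfer-encoding")) {
        continue; // assumed policy: drop headers that no longer describe the payload
      }
      if (name.equals("content-length")) {
        if (wroteLength) {
          continue; // keep a single Content-Length header
        }
        line = "Content-Length: " + contentLength;
        wroteLength = true;
      }
      out.append(line).append(CRLF);
    }
    if (!wroteLength) {
      // Assumption: add the header when the response did not send one.
      out.append("Content-Length: ").append(contentLength).append(CRLF);
    }
    // Terminate with a single empty line, so the block ends in \r\n\r\n.
    return out.append(CRLF).toString();
  }

  public static void main(String[] args) {
    String raw = "HTTP/1.1 200 OK\r\n"
        + "Content-Type: text/html\r\n"
        + "Content-Encoding: gzip\r\n"
        + "Transfer-Encoding: chunked\r\n"
        + "Content-Length: 120\r\n\r\n\r\n";
    System.out.print(fixHeaders(raw, 4096));
  }
}
```

In this sketch the contentLength argument is taken to be the length of the payload as it will be stored, which is why the encoding headers of the original transfer are no longer kept.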