Package org.apache.nutch.net.urlnormalizer.basic

  • Class Summary 
    Class               Description
    BasicURLNormalizer  Converts URLs to a normal form, e.g., by removing dot segments in the path (/./ or /../) and default ports.

Package org.apache.nutch.net.urlnormalizer.basic Description

URL normalizer performing basic normalizations:
  • remove default ports, e.g., port 80 for http:// URLs
  • remove needless slashes and dot segments in the path component
  • remove anchors
  • use percent-encoding (only) where needed
E.g., https://www.example.org/a/../b//./select%2Dlang.php?lang=español#anchor is normalized to https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol

Optional and configurable normalizations are:
  • convert Internationalized Domain Names (IDNs) uniquely either to the ASCII (Punycode) or Unicode representation, see property urlnormalizer.basic.host.idn
  • remove a trailing dot from host names, see property urlnormalizer.basic.host.trim-trailing-dot
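
The normalizations listed above can be illustrated with a minimal, self-contained sketch that uses only the JDK. It is not the BasicURLNormalizer source and may differ from it in edge cases. The class name UrlNormalizeSketch, its helper methods, and the mode strings "toAscii" and "toUnicode" are assumptions made for illustration. The sketch also assumes that normalize() receives an ASCII host name; host handling is shown separately in normalizeHost().

```java
import java.net.IDN;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; this is not the BasicURLNormalizer source.
final class UrlNormalizeSketch {

    // Drops empty and "." segments and resolves ".." against the previous segment.
    static String normalizePath(String rawPath) {
        List<String> kept = new ArrayList<>();
        for (String seg : rawPath.split("/", -1)) {
            if (seg.isEmpty() || seg.equals(".")) continue;
            if (seg.equals("..")) {
                if (!kept.isEmpty()) kept.remove(kept.size() - 1);
            } else {
                kept.add(seg);
            }
        }
        boolean dir = rawPath.endsWith("/") || rawPath.endsWith("/.") || rawPath.endsWith("/..");
        return "/" + String.join("/", kept) + (dir && !kept.isEmpty() ? "/" : "");
    }

    // Decodes %XX only when it stands for an unreserved ASCII character (letters, digits, - . _ ~).
    static String decodeUnreserved(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                int v = (hi >= 0 && lo >= 0) ? hi * 16 + lo : -1;
                if (v >= 0 && v < 128
                        && (Character.isLetterOrDigit((char) v) || "-._~".indexOf(v) >= 0)) {
                    sb.append((char) v);
                    i += 2;
                    continue;
                }
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Percent-encodes non-ASCII characters as UTF-8 bytes; ASCII is left untouched.
    static String encodeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            int v = b & 0xff;
            if (v < 0x80) sb.append((char) v);
            else sb.append(String.format("%%%02X", v));
        }
        return sb.toString();
    }

    // Optional host handling; the idnMode strings are hypothetical, chosen for this sketch.
    static String normalizeHost(String host, boolean trimTrailingDot, String idnMode) {
        if (trimTrailingDot && host.endsWith(".")) {
            host = host.substring(0, host.length() - 1);
        }
        if ("toAscii".equals(idnMode)) return IDN.toASCII(host);
        if ("toUnicode".equals(idnMode)) return IDN.toUnicode(host);
        return host;
    }

    // Applies the basic normalizations; assumes an http(s) URL with an ASCII host.
    static String normalize(String url) {
        URI u = URI.create(url);
        String scheme = u.getScheme();
        int port = u.getPort();
        if ((scheme.equalsIgnoreCase("http") && port == 80)
                || (scheme.equalsIgnoreCase("https") && port == 443)) {
            port = -1; // default port: leave it out
        }
        StringBuilder sb = new StringBuilder(scheme).append("://").append(u.getHost());
        if (port != -1) sb.append(':').append(port);
        sb.append(encodeNonAscii(decodeUnreserved(normalizePath(u.getRawPath()))));
        if (u.getRawQuery() != null) sb.append('?').append(encodeNonAscii(u.getRawQuery()));
        return sb.toString(); // the fragment (anchor) is intentionally not appended
    }

    public static void main(String[] args) {
        // prints https://www.example.org/b/select-lang.php?lang=espa%C3%B1ol
        System.out.println(normalize(
            "https://www.example.org/a/../b//./select%2Dlang.php?lang=espa\u00f1ol#anchor"));
        // prints xn--bcher-kva.example
        System.out.println(normalizeHost("b\u00fccher.example.", true, "toAscii"));
    }
}
```

In an actual Nutch setup the optional host normalizations are switched on through the properties urlnormalizer.basic.host.idn and urlnormalizer.basic.host.trim-trailing-dot rather than through method arguments as in this sketch.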

Copyright © 2021 The Apache Software Foundation