Web Path-Oriented Configuration

This feature would let users specify configuration settings that apply only to specific initial URI subpaths. For instance, a user could specify that links to other hosts should be followed, but only when they come from host foo.com. Or a user could specify that Wget should follow links to PDF files only if the referring page is under http://bar.com/articles/. This could also be used to specify realms for a given username and/or password; cf. ../HttpAuthentication.

1. Wgetrc Example

Ideas on how this might be specified in a .wgetrc file.

span_hosts = off
realm http://sekur.com/members/
      https://admin.sekur.com/  {
    # Applies to all paths under the ones specified
    http_user = user
    http_password = secret
}

realm r:http://balthazar.myblog.com/.*-links {
    # Regex path matching. Should probably assume enclosure within ^...$ ?
    span_hosts = on
}

realm drop.com
      *.drop.com {
    # Scheme not specified, matches any scheme. Use of wildcard.
    span_hosts = on
    domains = drop.com *.drop.com
    user_agent = Mozilla
}
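
To make the proposed syntax concrete, here is a minimal Python sketch of a reader for it. Everything in it is an assumption drawn from the example above rather than from any existing Wget code: the function name parse_realms is hypothetical, comments are assumed to occupy whole lines, a realm header may continue over several lines until its opening brace, and patterns are stored as plain strings without being interpreted.

```python
EXAMPLE = """
span_hosts = off
realm http://sekur.com/members/
      https://admin.sekur.com/  {
    # Applies to all paths under the ones specified
    http_user = user
    http_password = secret
}

realm r:http://balthazar.myblog.com/.*-links {
    span_hosts = on
}

realm drop.com
      *.drop.com {
    span_hosts = on
    domains = drop.com *.drop.com
    user_agent = Mozilla
}
"""


def parse_realms(text):
    """Return (global_options, realms) for the hypothetical realm syntax.

    realms is a list of (patterns, options) tuples in file order.
    """
    global_opts = {}
    realms = []
    patterns = None   # patterns of a realm header still being read
    body = None       # options of a realm body still being read
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        if body is not None:
            if line == "}":
                realms.append((patterns, body))
                patterns, body = None, None
            else:
                key, _, value = line.partition("=")
                body[key.strip()] = value.strip()
        elif patterns is not None or line.split()[0] == "realm":
            tokens = line.split()
            if patterns is None:
                patterns = []
                tokens = tokens[1:]          # drop the "realm" keyword
            opens = bool(tokens) and tokens[-1] == "{"
            if opens:
                tokens = tokens[:-1]
            patterns.extend(tokens)
            if opens:
                body = {}                    # header finished, body starts
        else:
            key, _, value = line.partition("=")
            global_opts[key.strip()] = value.strip()
    return global_opts, realms
```

Calling parse_realms(EXAMPLE) yields one global setting and three realms, the first and last of which carry two patterns each. Keeping the realms in file order matters for the "most recently specified" rule discussed under Conflict Resolution.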

2. Command-Line Usage

Perhaps something like --realm='...', before which all options are global, and after which all options belong to the specified realm, up to the following instance of --realm...

It's possible we don't want to put the full power of this configuration syntax into something that can be used from the command-line; however, it seems likely that we might want to at least provide some sort of shorthand syntax for the most commonly needed uses.
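
The --realm grouping suggested above could be sketched in Python as follows. This is only an illustration of the idea: --realm is the proposed option, not one Wget currently has, the helper name split_by_realm is hypothetical, and the sketch makes no attempt to tell option arguments apart from URL arguments.

```python
def split_by_realm(argv):
    """Split arguments into global ones and per-realm groups.

    Arguments before the first --realm=... are global; each --realm=...
    starts a new group that runs until the next --realm=... or the end.
    """
    global_args = []
    realm_args = []            # list of (realm_spec, [args]) in order
    current = global_args
    prefix = "--realm="
    for arg in argv:
        if arg.startswith(prefix):
            current = []
            realm_args.append((arg[len(prefix):], current))
        else:
            current.append(arg)
    return global_args, realm_args


global_args, realm_args = split_by_realm(
    ["-nv", "--realm=drop.com", "-H", "-U", "Mozilla", "--realm=foo.com", "-r"]
)
```

Here global_args holds only -nv, while -H -U Mozilla belong to the drop.com realm and -r to the foo.com realm.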

Perhaps it should be possible for .wgetrc files to specify shorthands for commonly-referenced realms:

realm a: drop.com *.drop.com { ... }

$ wget [a -U Mozilla -H ] http://foo.drop.com/dir/

3. Conflict Resolution

Sometimes more than one realm will apply to the same URI. In that case, the rule shall be that the most specific match takes precedence over any less specific matches. Given equal specificity, the most recently specified realm shall take precedence.

For purposes of conflict-resolution, all regular expression matches shall be considered to completely specify the URI (and thus, can only be overridden by more recently-specified realm matches that completely specify the URI).

A URI that specifies the scheme is more specific than one that doesn't only when the latter specifies a pathname of equal or lesser specificity (that is, specificity in the pathname portion of a URI always takes precedence over specificity of the scheme).

Any realms specified on the command-line are considered to have been specified after anything specified in a wgetrc file.

This conflict resolution is per-configuration-option, not per-realm; that is, a configuration setting that occurs in a less-specific realm will still take effect if the more specific realm does not explicitly specify a setting for it.

(This conflict resolution algorithm is based on CSS's "cascading" stylesheet application algorithm; see http://www.w3.org/TR/CSS21/cascade.html#cascade.)

Example:

realm foo.com/bar/baz/ {
}
realm http://foo.com {
}
realm foo.com {
}
realm foo.com/bar/.* {
}

In this example, the URI http://foo.com/foo/ would use any settings specified by the realm designated http://foo.com before those in foo.com, because the former, though it occurs earlier in the RC file, specifies the scheme, while the latter doesn't (and they both specify the same, nonexistent path).

On the other hand, the URI http://foo.com/bar/baz/ would have the settings of the regular-expression realm foo.com/bar/.* applied to it before those of foo.com/bar/baz/, as both have the same specificity (complete specificity) and the former occurs later in the RC file. Anything not covered in those would use settings from http://foo.com, followed by foo.com.
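
The rules in this section can be sketched in Python as shown below. This is one possible reading, not a definitive algorithm, and it rests on several assumptions: the regex realm is written with the r: prefix from the Wgetrc Example section; a regex is tried against the URI both with and without its scheme; a plain pattern whose path equals the URI's entire path counts as completely specifying the URI; path specificity is measured in path segments using a simple string-prefix test; and host wildcards are left out. The names specificity and resolve, and the option values in REALMS, are hypothetical and purely illustrative.

```python
import math
import re
from urllib.parse import urlsplit


def specificity(pattern, uri):
    """Return a sortable rank if pattern matches uri, else None."""
    parts = urlsplit(uri)
    bare = parts.netloc + parts.path             # URI without its scheme
    if pattern.startswith("r:"):                 # regex realm (assumed prefix)
        regex = pattern[2:]
        if re.fullmatch(regex, uri) or re.fullmatch(regex, bare):
            return (math.inf, 0)                 # complete specificity
        return None
    scheme, sep, rest = pattern.rpartition("://")
    host, slash, path = rest.partition("/")
    path = slash + path                          # "" when no path is given
    if sep and scheme != parts.scheme:
        return None
    if host != parts.netloc or not parts.path.startswith(path):
        return None
    if path == parts.path:
        return (math.inf, 0)                     # assumed: whole path given
    depth = len([seg for seg in path.split("/") if seg])
    return (depth, 1 if sep else 0)              # path first, then scheme


def resolve(option, uri, realms, default=None):
    """Look up one option; realms are (pattern, options) in given order."""
    ranked = []
    for order, (pattern, options) in enumerate(realms):
        rank = specificity(pattern, uri)
        if rank is not None:
            ranked.append((rank, order, options))
    # Most specific first; among equals, the most recently specified first.
    ranked.sort(key=lambda r: (r[0], r[1]), reverse=True)
    for _, _, options in ranked:
        if option in options:                    # per-option, not per-realm
            return options[option]
    return default


# The four realms from the example, with made-up settings filled in.
REALMS = [
    ("foo.com/bar/baz/", {"user_agent": "A", "wait": "1"}),
    ("http://foo.com",   {"user_agent": "B", "tries": "3"}),
    ("foo.com",          {"user_agent": "C", "tries": "5", "quota": "1m"}),
    ("r:foo.com/bar/.*", {"wait": "2"}),
]
```

With these values, resolve("wait", "http://foo.com/bar/baz/", REALMS) takes the regex realm's setting, user_agent falls through to foo.com/bar/baz/, tries to http://foo.com, and quota to foo.com. For http://foo.com/foo/, user_agent comes from http://foo.com rather than foo.com. Realms given on the command line would simply be appended to the list after those read from the wgetrc file.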


2017-01-04 00:04