Scripting Functionality For Wget
1. Preface
There are many Internet download programs, and each has its own mechanisms for fine-grained control over what is and what is not to be downloaded. HTTrack distinguishes internal and external recursion levels, and allows grabbing all non-HTML files that a page links to. Backstreet Browser has sequences of filters, in which the first matching filter decides. But in the end only code can express exactly what is needed.
Adding a scripting language to Wget would allow this precise expression.
2. Decision points
Scripts would steer the program at various points, based on the data available. These data would be inspectable with regexps or other programmatic means.
Given a URL:
* whether to request the server data (size, date, MIME type).
Given the server data:
* whether to request the file contents.
Given the contents:
* whether to store it locally (and under what name).
* whether to recurse, to consider the embedded URLs.
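The decision points above could be exposed to a script as a handful of callbacks. The following is only a sketch in Python, standing in for whatever scripting language is eventually chosen; the function names, the ServerData record and the size limit are all hypothetical and not part of Wget.

```python
import re
from dataclasses import dataclass
from posixpath import basename
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class ServerData:
    """Hypothetical record of what is known before the body is fetched."""
    size: Optional[int]  # size in bytes, or None if the server did not report it
    mime_type: str


def want_server_data(url: str) -> bool:
    """Given a URL: whether to request the server data."""
    return re.match(r"https?://", url) is not None


def want_contents(url: str, meta: ServerData) -> bool:
    """Given the server data: whether to request the file contents."""
    too_big = meta.size is not None and meta.size > 10_000_000  # arbitrary limit
    return not too_big


def local_name(url: str, meta: ServerData, contents: bytes) -> Optional[str]:
    """Given the contents: the local file name, or None to skip storing."""
    if not contents:
        return None
    return basename(urlsplit(url).path) or "index.html"


def want_recursion(url: str, meta: ServerData, contents: bytes) -> bool:
    """Given the contents: whether to consider the embedded URLs."""
    return meta.mime_type == "text/html"
```

Each callback sees only the data available at its decision point, so a script can decline early (for example on the URL alone) without causing any network traffic.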
3. States
These decisions would be taken based on the current state. A state is a compound data structure. The initial state is given, and script code computes the next state from the current state and the other data available. An example field would be an integer that starts at 0 and is incremented for each recursion. This would effectively implement the --level command line option, but allow for fine-tuning: e.g. the script may decline to increment it in certain cases.
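As an illustration of the recursion counter, here is a minimal sketch in Python (again a stand-in for the eventual scripting language). The State record, the next_state and may_follow functions, the MAX_DEPTH value and the list of suffixes are assumptions made for this example only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    """Hypothetical state; a real script could carry more fields."""
    depth: int = 0  # starts at 0, incremented for each recursion


MAX_DEPTH = 2  # illustrative limit, playing the role of --level


def next_state(state: State, url: str) -> State:
    """Compute the state that applies to a URL found on the current page."""
    # Fine-tuning: images and style sheets do not count as a recursion step,
    # so the script declines to increment the counter for them.
    if url.lower().endswith((".png", ".jpg", ".css")):
        return state
    return replace(state, depth=state.depth + 1)


def may_follow(state: State) -> bool:
    """Whether a URL carrying this state is still within the depth limit."""
    return state.depth <= MAX_DEPTH
```

Because the state is immutable and a new one is returned at each step, every branch of the recursion carries its own counter.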
--accept, --reject, --domains and a host of other command line options translate directly to scripting, and give the user the power to steer Wget precisely.
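To show how such options might translate, the sketch below expresses accept, reject and domain lists as one predicate. It is written in Python as a stand-in and only approximates the options; it does not reproduce Wget's exact matching rules. The pattern lists and the allowed function are hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlsplit

# Illustrative values, corresponding roughly to
# --accept, --reject and --domains on the command line.
ACCEPT = ["*.html", "*.pdf"]
REJECT = ["*draft*"]
DOMAINS = ["example.org"]


def allowed(url: str) -> bool:
    """Whether the URL passes the domain, accept and reject lists."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    name = basename(parts.path)
    # A host matches if it equals a listed domain or is a subdomain of it.
    in_domain = any(host == d or host.endswith("." + d) for d in DOMAINS)
    accepted = any(fnmatchcase(name, pattern) for pattern in ACCEPT)
    rejected = any(fnmatchcase(name, pattern) for pattern in REJECT)
    return in_domain and accepted and not rejected
```

Once the rules are code, exceptions that the options cannot express become simple extra conditions, such as accepting drafts from one particular directory.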
4. Scripting language
An obvious choice for the scripting language would be Lua, where tables could fulfil the roles of states. Given that a JavaScript engine may be added anyway, JavaScript might be another reasonable choice.
