JavaScript Functionality For Wget

1. Preface

A number of users have asked for JavaScript support in Wget. There are a lot of websites that operate in ways that require JavaScript support. Clicking a link on such sites may run a program in JavaScript that decides what link to follow, or submits a form.

For such sites, it would obviously be handy if Wget understood JavaScript, so that it could run those same scripts and continue to find and fetch all the resources that a typical user might want.

Unfortunately, it would be effectively impossible to write Wget so that it knows how to get every resource that a JavaScript program might fetch. In some cases the number of such resources might be infinite, if the JavaScript builds all or part of the URLs it fetches from a page parameter or user-entry field. Even if that's not the case, running through all the possibilities in a large JavaScript program would take time exponential in the number of possible branches in that program (in other words, effectively impossible). So a complete solution that does everything that every user wants is not possible.
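As a rough illustration of that blow-up (an idealised model, not a measurement of any real script): if a program had n independent two-way branches, each of which could lead to a different URL, exploring it exhaustively would mean trying every combination of branch outcomes. The function names below are hypothetical.

```python
from itertools import product


def count_paths(num_branches: int) -> int:
    """Number of distinct paths through `num_branches` independent
    two-way branches (an idealised model of a script)."""
    return 2 ** num_branches


def enumerate_paths(num_branches: int):
    """Yield every combination of branch outcomes as a tuple of booleans."""
    return product((False, True), repeat=num_branches)


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, "branches:", count_paths(n), "paths")
```

Under this model, 40 branches already give 2**40 = 1099511627776 paths, which is why exhaustive exploration is not a realistic goal.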

But all is not lost: just because we can't make Wget do everything doesn't mean we can't let it do anything. We'll just have to find something between nothing and everything that meets a reasonable share of users' expectations.

However, it is the opinion of the current (as of the time of writing) Wget maintainer, MicahCowan, that such a feature should not be an official part of Wget. It could be dangerous (infinite loops may be difficult to handle, and malicious authors could more easily write pages designed to make Wget misbehave), it is pretty much guaranteed to severely impede Wget's performance even in the best of cases, and it will never be more than a hack, as it can never do all that could be expected of it, and may never even get close.

Therefore, whatever JavaScript functionality may be provided for Wget is intended to be available separately, in the form of a plugin module (taking advantage of the planned ../Plugins architecture).

2. Levels of Support

So, let's first talk about the two extremes we could attempt to support: the most, and the least, support that could be offered for JavaScript. Then it may be easier to discuss the options in between. :)

2.1. Everything, but Everything

Well, we've already discussed this one, actually. There's no way to get everything automatically. The closest we might get would be to allow users to script exactly which form entries and JavaScript events Wget should trigger for each individual page that's specified. That doesn't really strike me as within the scope of Wget, but of course, it's always possible that someone will be willing to develop such a thing.

2.2. String-Literal URI Recognition

On the other extreme, we could just write a simple lexer that understands JavaScript tokens, and just looks for the presence of URIs within the string literals it finds.

This would provide decent first-order support, as it would catch a number of simple JavaScript usages, such as window.open("http://foo.com/"), or document.location = "http://foo.com/", or even img.src = "http://foo.com/img.jpeg". It would even allow rewriting the URIs it finds when --convert-links has been specified!

Naturally, it would not cover anything that generates links programmatically, and so would be extremely limited. Still, it would require relatively little work to write, would cover a number of simple cases, and would be one of the few algorithms simple enough not to slow down operation significantly, all of which makes it an attractive option.
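A minimal sketch of such a lexer, written in Python for illustration only (Wget itself is written in C, and nothing here is an existing Wget interface). It makes several assumptions: only absolute http, https and ftp URIs count; only quoted string literals and comments are recognised; regular-expression literals and \x or \u escapes are not handled. The names string_literals and find_uris are hypothetical.

```python
import re

# Assumption: only absolute http/https/ftp URIs are of interest.
URI_RE = re.compile(r"^(?:https?|ftp)://\S+$")

SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "r": "\r", "0": "\0"}


def string_literals(source):
    """Yield the value of each '...' or "..." literal in `source`,
    skipping // and /* */ comments. Regex literals are not handled."""
    i, n = 0, len(source)
    while i < n:
        ch = source[i]
        if source.startswith("//", i):
            end = source.find("\n", i)
            i = n if end == -1 else end + 1
        elif source.startswith("/*", i):
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        elif ch in "\"'":
            quote, i = ch, i + 1
            chars = []
            # Stop at the closing quote, or at a newline if unterminated.
            while i < n and source[i] != quote and source[i] != "\n":
                if source[i] == "\\" and i + 1 < n:
                    i += 1  # keep the escaped character, e.g. \/ becomes /
                    chars.append(SIMPLE_ESCAPES.get(source[i], source[i]))
                else:
                    chars.append(source[i])
                i += 1
            i += 1  # step past the closing quote (or the newline)
            yield "".join(chars)
        else:
            i += 1


def find_uris(source):
    """Return the string literals in `source` that look like whole URIs."""
    return [s for s in string_literals(source) if URI_RE.match(s)]
```

Run over a script containing window.open("http://foo.com/") and img.src = 'http://foo.com/img.jpeg', find_uris returns both URIs. For a generated link such as "http://foo.com/" + name + ".html" it can only report the literal prefix, which is exactly the limitation described above.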

2.3. JavaScript Engine, Just the 'onload' Event

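Many pages execute JavaScript on load to generate the page that is presented to the user, before they click on anything at all. Wget could run just those onload scripts in a JavaScript engine and then parse the resulting computed page for links, as it would any other HTML page.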
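The sketch below, in Python for illustration, shows only the second half of that idea: extracting links from a page once it has been computed. The JavaScript engine itself is out of scope here, so COMPUTED_HTML is a hand-written stand-in for what a hypothetical engine might produce after running the onload handler; LinkCollector and extract_links are also hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href and src attribute values from a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# What the server sends: the link exists only as script code.
STATIC_HTML = """<html><body onload="build()">
<script>
function build() {
  var a = document.createElement('a');
  a.href = '/gallery/' + 'page1' + '.html';
  document.body.appendChild(a);
}
</script>
</body></html>"""

# Assumed output of a JavaScript engine after the onload handler has run.
COMPUTED_HTML = """<html><body onload="build()">
<a href="/gallery/page1.html"></a>
</body></html>"""

if __name__ == "__main__":
    print("static:  ", extract_links(STATIC_HTML))
    print("computed:", extract_links(COMPUTED_HTML))
```

Parsing the static source finds no links at all, while parsing the computed page finds /gallery/page1.html.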

2.4. JavaScript Engine, Generate Events

Another technique could be to include a JavaScript interpreter with Wget, load all of a page's JavaScript program, and send it all the typical events that might cause a JavaScript program to fetch links; typically onclick and onmouseover events. This would be equivalent to moving the mouse over, and clicking, every JavaScript-enabled link.

This would potentially be one of the heavier options in both time and memory, but would provide very reasonable coverage. However, it still wouldn't cover everything, as many sites may produce different results depending on the order in which things are clicked; and of course, pages obtained via form submissions would not be found.

Such a system could work by implementing the core features of a typical JavaScript library (and all or most of the core language), especially string manipulation and fetching values from HTML elements. Then it would just need to understand things like foo.src = ... or window.open(...) or document.location = ..., and add those URLs to the fetch queue.
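One way to picture the host-object side of such a system is sketched below in Python. It is not a JavaScript interpreter: plain Python callables stand in for the page's event handlers, and the classes FetchQueue, Window and Element are hypothetical. The point is only that assignments to src and calls to window.open are intercepted and recorded rather than acted on.

```python
from collections import deque


class FetchQueue:
    """URLs discovered while handlers run, in discovery order, no repeats."""

    def __init__(self):
        self._seen = set()
        self.pending = deque()

    def add(self, url):
        if url not in self._seen:
            self._seen.add(url)
            self.pending.append(url)


class Window:
    """Host object: window.open(url) records the URL instead of opening it."""

    def __init__(self, queue):
        self._queue = queue

    def open(self, url):
        self._queue.add(url)


class Element:
    """Host object: assigning to .src records the URL."""

    def __init__(self, queue):
        self._queue = queue
        self._src = None
        self.handlers = {}  # event name -> callable standing in for a JS handler

    @property
    def src(self):
        return self._src

    @src.setter
    def src(self, url):
        self._src = url
        self._queue.add(url)


EVENTS = ("onmouseover", "onclick")


def fire_all(elements):
    """Fire each typical event on each element, like hovering and clicking it."""
    for element in elements:
        for event in EVENTS:
            handler = element.handlers.get(event)
            if handler is not None:
                handler()
```

For example, give an image an onmouseover handler that sets its src, give a link an onclick handler that calls window.open, then call fire_all on both: the queue ends up holding the two URLs in the order they were discovered.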

2.5. Scriptable Engine

This would be the same as the above, but would allow users to script which events were fired in which order. This would be extra work, and result in bulky custom configurations for each site that took advantage of it. The advantage, of course, would be that users could have control over how Wget processes JavaScript pages, and could potentially take better advantage of the JavaScript support to get more of the resources that are desired; and even do form submissions.
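A toy sketch of the idea, in Python: the user's script is modelled as an ordered list of (element id, event) pairs, which is purely an assumed format, and the page is a hand-built table of handlers rather than real JavaScript. The names run_script and make_page are hypothetical.

```python
def run_script(page, steps):
    """Fire the (element_id, event) pairs in `steps`, in order, and
    return the list of URLs the handlers requested."""
    requested = []
    for element_id, event in steps:
        handler = page.get((element_id, event))
        if handler is not None:
            handler(requested)
    return requested


def make_page():
    """A toy page whose 'download' link depends on the tab selected earlier."""
    state = {"tab": "summary"}

    def select_tab(name):
        def handler(requested):
            state["tab"] = name
        return handler

    def download(requested):
        requested.append("http://foo.com/" + state["tab"] + ".pdf")

    return {
        ("tab-summary", "onclick"): select_tab("summary"),
        ("tab-details", "onclick"): select_tab("details"),
        ("download", "onclick"): download,
    }


if __name__ == "__main__":
    steps = [("tab-details", "onclick"), ("download", "onclick")]
    print(run_script(make_page(), steps))
```

Clicking "download" alone requests summary.pdf, while clicking the details tab first requests details.pdf. A resource like the second one is reachable only if the user can dictate the order of events.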

3. Other Problems

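Of course, with the exception of very simple implementations like the string-literal parser, it's not possible to honor --convert-links: a link that is generated at run time does not appear anywhere in the script as a single piece of text, so there is nothing in the saved file that could simply be rewritten.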
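The difference can be sketched in Python as follows. This is an assumption-laden illustration, not how --convert-links is implemented: convert_literals is a hypothetical helper, local_paths is an assumed mapping from downloaded URL to local file name, and the regular expression ignores comments and regex literals.

```python
import re

# Matches a single- or double-quoted literal on one line, allowing escapes.
STRING_RE = re.compile(r"""(["'])((?:\\.|(?!\1)[^\\\n])*)\1""")


def convert_literals(source, local_paths):
    """Rewrite string literals that are exactly a downloaded URL so that
    they point at the local copy; leave every other literal untouched."""

    def replace(match):
        quote, body = match.group(1), match.group(2)
        return quote + local_paths.get(body, body) + quote

    return STRING_RE.sub(replace, source)


if __name__ == "__main__":
    script = (
        'window.open("http://foo.com/a.html"); '
        'img.src = "http://foo.com/" + name + ".jpeg";'
    )
    local = {
        "http://foo.com/a.html": "a.html",
        "http://foo.com/cat.jpeg": "cat.jpeg",
    }
    print(convert_literals(script, local))
```

The literal passed to window.open is rewritten to a.html, but the img.src expression is left alone even though http://foo.com/cat.jpeg was downloaded, because that URL never appears whole in the source.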

FeatureSpecifications/JavaScript (last edited 2009-09-25 17:46:05 by MicahCowan)