Concurrency

Concurrency is the primary focus for Wget in the 2012 Google Summer of Code project.

Possible projects related to concurrency in Wget include:

Support for downloading multiple resources simultaneously
Support for downloading a single resource from multiple sources simultaneously ("download acceleration")
Improvements to the progress-drawing code to support concurrency
General improvements to Wget's code architecture, to make it more flexible and adaptable

Wget's current maintainer, Giuseppe Scrivano, has written thread-based code to support downloading multiple files simultaneously, which is an important piece of concurrency support in Wget. The patch for this support may be found here.
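To make the idea of thread-based simultaneous downloads concrete, here is a minimal sketch in Python. It is not the patch mentioned above and does not reflect how Wget's C code is organised; the names fetch and download_all are hypothetical, and the fetching function is passed in as a parameter so the scheduling logic can be tried without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url):
    """Download one resource and return its body as bytes."""
    with urlopen(url) as response:
        return response.read()


def download_all(urls, fetch_one=fetch, max_workers=4):
    """Fetch every URL on a pool of worker threads.

    Returns a dict mapping each URL to the bytes that fetch_one returned.
    If any fetch raises, the exception propagates to the caller.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the transfers overlap in time.
        futures = {url: pool.submit(fetch_one, url) for url in urls}
        # Then wait for each result in turn.
        return {url: future.result() for url, future in futures.items()}
```

Each worker thread blocks on its own transfer, which is what lets existing blocking download code be reused largely unchanged.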

However, while support for downloading multiple files simultaneously is essentially done, there isn't yet support for downloading a single file from multiple sources simultaneously. A sensible way to specify multiple sources for a single resource would be Metalink (see FeatureSpecifications/Metalink), a file format and HTTP extension for describing multiple sources for a single resource. Metalink does not, however, specify how the transmission of a resource should be divided among its sources.
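Since the question of how to divide a resource among its sources is left open, one simple policy is to give each source one contiguous byte range of roughly equal size and request it with an HTTP Range header. The sketch below assumes the total size is known in advance and that every source honours range requests; split_ranges and range_header are hypothetical names, and the mirror URLs in the usage note are placeholders.

```python
def split_ranges(total_size, sources):
    """Assign each source one contiguous, inclusive byte range.

    Returns a list of (source, first_byte, last_byte) tuples that together
    cover bytes 0 .. total_size - 1 exactly once.
    """
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    if not sources:
        raise ValueError("at least one source is required")

    # Never create more ranges than there are bytes.
    count = min(len(sources), total_size)
    base, extra = divmod(total_size, count)

    plan = []
    start = 0
    for index in range(count):
        # The first `extra` sources take one additional byte each.
        length = base + (1 if index < extra else 0)
        end = start + length - 1
        plan.append((sources[index], start, end))
        start = end + 1
    return plan


def range_header(first_byte, last_byte):
    """Format the value of an HTTP Range request header."""
    return f"bytes={first_byte}-{last_byte}"
```

For example, split_ranges(10, ["http://mirror1.example.org/f", "http://mirror2.example.org/f", "http://mirror3.example.org/f"]) assigns bytes 0-3, 4-6 and 7-9 to the three mirrors. A real implementation would probably want to adapt to the observed speed of each source rather than split evenly.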

In addition, the code for drawing progress bars still handles only a single resource being downloaded from a single source at a time. The drawing code is very simple: it draws a status line, then issues a carriage return to move the cursor back to the beginning of the line, so that the next status line overwrites it. Progress bar code that can handle multiple simultaneous transmissions will need to be more sophisticated. Work in this area will probably require the curses or termcap/terminfo libraries, and perhaps knowledge of terminal escape sequences.
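As a rough illustration of what multi-line drawing involves, the sketch below redraws several progress bars in place using two standard ANSI escape sequences: "ESC [ n A" (move the cursor up n lines) and "ESC [ 2 K" (erase the current line). It assumes a terminal that understands these sequences; a real implementation would more likely query terminfo or use curses. The names render_bar and redraw_sequence are hypothetical.

```python
import sys
import time


def render_bar(label, done, total, width=20):
    """Return one progress line such as 'file [=====     ]  50%'."""
    if total > 0:
        done = max(0, min(done, total))
        filled = width * done // total
        percent = 100 * done // total
    else:
        filled = percent = 0
    bar = "=" * filled + " " * (width - filled)
    return f"{label} [{bar}] {percent:3d}%"


def redraw_sequence(lines, first_draw):
    """Build the string that draws `lines`, overwriting the previous frame.

    On every draw after the first, the cursor is moved back up over the
    block drawn last time before the lines are rewritten.
    """
    parts = []
    if not first_draw and lines:
        parts.append(f"\x1b[{len(lines)}A")  # cursor up len(lines) rows
    for line in lines:
        # Return to column 0, erase the old contents, write the new line.
        parts.append("\r\x1b[2K" + line + "\n")
    return "".join(parts)


if __name__ == "__main__":
    # Simulate two transfers progressing at different speeds.
    for step in range(11):
        frame = [
            render_bar("first ", step, 10),
            render_bar("second", min(2 * step, 10), 10),
        ]
        sys.stdout.write(redraw_sequence(frame, first_draw=(step == 0)))
        sys.stdout.flush()
        time.sleep(0.1)
```

The single-line technique described above is the special case in which only the carriage return is needed, because the cursor never leaves the status line.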

In addition to progress bar code, Wget will need to produce sensible logs when multiple sources are involved. One option is something similar to the dot-printing method currently in use, but with a different symbol for each source instead of the single dot. Another is to write separate log files for separate connections.
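The multi-symbol idea could look something like the following sketch. Each source gets its own character, and one character is printed for every fixed amount of data received from that source. The symbol set, the function name symbol_log, and the default sizes (1024 bytes per symbol, clusters of ten, fifty symbols per line) are assumptions chosen for illustration.

```python
# Hypothetical symbol set: one character per source, reused cyclically.
SYMBOLS = ".:+*#"


def symbol_log(chunks, bytes_per_symbol=1024, cluster=10, per_line=50):
    """Render a dot-style log in which each source has its own symbol.

    `chunks` is an iterable of (source_index, byte_count) pairs in arrival
    order. One symbol is emitted each time a source has delivered another
    `bytes_per_symbol` bytes.
    """
    pending = {}  # bytes received per source that have not yet earned a symbol
    symbols = []
    for source, byte_count in chunks:
        pending[source] = pending.get(source, 0) + byte_count
        while pending[source] >= bytes_per_symbol:
            pending[source] -= bytes_per_symbol
            symbols.append(SYMBOLS[source % len(SYMBOLS)])

    # Lay the symbols out in space-separated clusters, per_line to a row.
    rows = []
    for row_start in range(0, len(symbols), per_line):
        row = symbols[row_start:row_start + per_line]
        groups = ["".join(row[i:i + cluster]) for i in range(0, len(row), cluster)]
        rows.append(" ".join(groups))
    return "\n".join(rows)
```

With this layout, the reader can see at a glance which source contributed which part of the transfer, at the cost of a log that no longer maps positions to file offsets.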

Many areas of Wget's source code are large and bloated: they handle too many responsibilities, retain too much knowledge about implementation particulars, and can't easily be changed or adapted to new requirements. This was a primary reason for choosing to support multiple simultaneous transmissions via threads rather than asynchronous I/O. But even if that support is never changed from threads to async I/O, it would be of great value to clean up Wget's code: to split the behemoth functions into smaller, more manageable pieces, and to partition knowledge and responsibilities better, so that changing an implementation requires changes only to the code that deals directly with it, rather than in a hundred places.