Contents
- File Existence Checking
- Continuing an Aborted Session
- Re-Attempting Failed Downloads
- Original Content Type
- Session Details Lookup
- Filename conversions on a separate invocation
- Timestamp from session database
- If-None-Match
- Refetching Local File
- Local Copy Integrity Checking
- File Tree Reorganization
- Deltas
- Fields/Items
- Format Discussion
- Other Considerations
Overview
A Session Info Database that provides the following information for each link downloaded:
- The original URL
- Any chain of redirects
- The locally-downloaded filename path (relative to the session's root)
- Content-Type (with character-encoding information)
- Content size
- Entity tag (if applicable)
- Download timestamp (suitable for If-Modified-Since)
- Checksum
- Information on what conversions or other transformations were performed on content
- User-specifiable other HTTP headers
- Perhaps a list of links found within this file? Or possibly just the new ones not yet found in other files (though that approach would suffer if this entry were removed from the database).
Additionally, information about Wget's configuration settings (and version), and list of URLs at invocation time, would be very useful, and could even be used to resume a session that had been aborted, or to perform post-download conversions.
Information should also be saved on failed download attempts, so that they can be retried later (especially, ones that were due to server or lookup or other network failures).
The Session Info Database should be in a human-readable text format, for greatest interoperability, and should be extensible, with thought given to how older Wget versions (with session database support) might use session databases from future versions. The format chosen should be robust enough that, even if Wget is aborted suddenly, enough information is available to pick up the session where it left off. The human-readable text version could conceivably be accompanied by a more efficient binary version (which would already provide things like hashes mapping local file paths to URIs, and vice-versa); Wget would then use whichever one has the most recent timestamp.
One specific requirement is that it should be possible to extend the format to map byte ranges from multiple URLs to a single local file, in anticipation of support for MetaLink. Might as well just support that mapping explicitly now, by indicating the byte range (0-content-size) for files that came from a single URL (or chain of redirected URLs).
Note that the term "database" is not meant to imply integration with RDBMSs such as MySQL, etc, only to refer to a file containing various data about a download session.
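To make the preceding lists concrete, here is one hypothetical entry in a YAML-based rendering (YAML is suggested as a candidate basis for the format below). Every field name and value here is illustrative only, not a settled schema; note the explicit byte-range mapping even for a single-URL file, as discussed above.

```yaml
# Hypothetical SIDB entry -- all field names are illustrative only.
- uri: http://example.com/foo
  redirects:
    - { uri: http://newsite.example.com/news.html, status: 301 }
  local-path: newsite.example.com/news.html
  content-type: text/html; charset=ISO-2022-JP
  content-size: 10240
  etag: '"686897696a7c876b7e"'
  timestamp: Thu, 01 Jan 2009 00:00:00 GMT
  checksum: md5:9e107d9d372bb6826bd81d3542a419d6
  conversions: [ -k ]
  # Byte-range mapping, recorded even for single-URL files, so that
  # MetaLink-style multi-source entries fit the same shape:
  ranges:
    - { uri: http://example.com/foo, bytes: 0-10239 }
```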
Each recursive-mode or multifile wget invocation will create a new session database file in the current directory (unless the session database is disabled via .wgetrc or command-line options). If a file already exists with the name wget wants to use, wget will ensure a unique name is used. The exception of course is when --continue-session or --append-session is specified, in which case an existing file (given as an argument to those options) will be modified. Perhaps a basic pid-based locking mechanism could prevent, or warn about, continuing a session that appears to be already in use by another wget instance.
YAML has been suggested as a possible basis for the format of this session-info database.
This feature is an expansion of one originally suggested by Juhana Sadeharju.
Use Cases
Here are the cases for which a session database could be useful, in rough order of priority. A first-cut implementation of this mechanism might include just the facilities necessary to implement the top few of these cases.
1. File Existence Checking
Currently, Wget uses the filename portion of a URL to determine whether a local copy has already been downloaded; for instance, for the -N or -c options. However, this method fails when multiple files share the same name (some of which would then get .1, .2 extensions), or when URIs redirect or make use of mechanisms such as the Content-Disposition header. With a (reversible) mapping of URIs to local filenames specified in the session info db, it will now be a cinch to find the filename whose existence we should be checking.
2. Continuing an Aborted Session
Wget should be able to use the information from a session database to determine the original invocation and configuration settings, and continue the session from where it left off.
3. Re-Attempting Failed Downloads
The session database could be used to retry downloads that failed for temporary reasons (such as network loss).
4. Original Content Type
Wget could use the Content-Type information for mirroring sessions when it cannot otherwise easily determine previously-downloaded content's Content-Type. See bug 20496 for an example of where this would be useful.
Including the "charset" parameter settings from the Content-Type header also means that, when Wget has better support for non-ASCII encodings, that information can be used to support IRIs properly (by understanding the characters, and not just the bytes), and to deal properly with HTML encoded with exotic shift-based mechanisms such as ISO-2022-JP.
5. Session Details Lookup
A separate tool specifically for querying the database could be handy for finding files that 404'd, or otherwise failed to download. It could also be used to enumerate the files that Wget chose not to download, based on configuration settings or robots.txt (see bug 20398).
6. Filename conversions on a separate invocation
Wget could use the Session Database to perform -k conversions on a separate invocation, or to perform conversions only on newly downloaded files.
7. Timestamp from session database
The Session Database could also be used to track timestamp information, so that user modifications to local files might not change the behavior of -N.
8. If-None-Match
Wget could use a resource's entity tag with HTTP 1.1's "If-None-Match" header, for a better way than timestamping to ensure up-to-date local copies.
9. Refetching Local File
Wget could use the Session Database to determine a local file's URI of origin, and refresh that file.
10. Local Copy Integrity Checking
The content size and checksum (and perhaps the timestamp) could be used together to verify that a file has not been locally altered; it could be useful to disable the use of --continue on such files.
11. File Tree Reorganization
Wget could conceivably convert a downloaded file tree from its current format to whatever format it would have been with a different combination of -x, -nd, -nH, --cut-dirs, etc.
12. Deltas
The entity tags could also be used to allow Wget to support RFC 3229's Delta encoding.
API Specification
The interface for just the first use case might look like:
/** SIDB Writer Facilities **/
sidb_writer *
sidb_write_start(const char *filename);
sidb_writer_entry *
sidb_writer_entry_new(sidb_writer *w, const char *uri);
void
sidb_writer_entry_redirect(sidb_writer_entry *rw, const char *uri,
                           int redirect_http_status_code);
void
sidb_writer_entry_local_path(sidb_writer_entry *rw, const char *fname);
void
sidb_entry_finish(sidb_writer_entry *rw);
void
sidb_write_end(sidb_writer *w);
/** SIDB Reader Facilities **/
sidb_reader *
sidb_read(const char *filename);
sidb_entry *
sidb_lookup_uri(sidb_reader *reader, const char *uri);
const char *
sidb_entry_get_local_name(sidb_entry *e);
"sidb" for "Session Info DataBase". I've left out error code returns; it seems to me that wget will not normally want to terminate just because an error occurred in writing the session info database; we can ask the sidb modules to spew warnings automatically when errors occur by giving it flags, or supply it with an error callback function, etc. For situations where we do want wget to immediately abort for sidb errors (for instance, continuing a session), it could check a sidb_error function or some such.
The intent is that all the writer operations would take virtually no time at all. The sidb_read function should take at most O(N log N) time on the size of the SIDB file, and should take less than a second under normal circumstances on typical machines, for a file with entries for a thousand web resources. Thereafter, the other SIDB reader operations should take virtually no time at all.
The sidb_lookup_uri should be able to find an entry based on either the URI that was specified to the corresponding call to sidb_writer_entry_new, or by any URI that was added to a resource entry via sidb_writer_entry_redirect.
Interleaving writes to different entries should be allowed and explicitly tested (to prepare the way for multiple simultaneous downloads in the future). That is, the following should be valid:
sidb_writer *sw = sidb_write_start(".wget-sidb");
sidb_writer_entry *foo, *bar;
foo = sidb_writer_entry_new(sw, "http://example.com/foo");
bar = sidb_writer_entry_new(sw, "http://example.com/bar");
/* Add info to foo entry. */
sidb_writer_entry_redirect(foo,
    "http://newsite.example.com/news.html", 301);
/* Add to bar entry. */
sidb_writer_entry_local_path(bar, "example.com/bar");
/* Add to foo entry again. */
sidb_writer_entry_local_path(foo, "newsite.example.com/news.html");
sidb_entry_finish(foo);
sidb_entry_finish(bar);
sidb_write_end(sw);
On reading back the information, calling sidb_lookup_uri with either "http://example.com/foo" or "http://newsite.example.com/news.html" should yield an entry for which sidb_entry_get_local_name returns "newsite.example.com/news.html".
(Note: failure to allocate appropriate resources in the call to sidb_write_start should nevertheless return a valid sidb_writer, and calls to other writer operations using that handle remain valid, even if they don't actually do anything.)
The following should also result in a well-formed SIDB file:
sidb_writer *sw = sidb_write_start(".wget-sidb");
sidb_writer_entry *foo
= sidb_writer_entry_new(sw, "http://example.com/foo");
sidb_writer_entry_local_path(foo, "example.com/foo.html");
raise(SIGKILL); /* Die with uncatchable signal. */
The only difference between the above and the same code with appropriate cleanup (especially sidb_write_end(sw)) is that wget will handle the former under the assumption that the session never completed downloading foo.html; when support for continued sessions is added, wget would automatically attempt to resume the download of foo.html.
SIDB File Format
1. Fields/Items
The following are candidate items for the Session Info DB file.
- Wget invocation configuration: How this session was invoked (i.e., the start options). Wget would also use this item when it reads a SIDB file to resume a previous session.
- Timestamp: When Wget was invoked.
- SIDB version number (major/minor): The version number of the SIDB file format, used for forward compatibility (for example, an older version of Wget could check this number and fail out on its own if the file is beyond that version's compatibility). How to use the major and minor numbers is still under discussion.
- Start tag: Indicates that Wget started its session.
- End tag: Indicates that Wget completed its session; also used to check whether the previous session completed.
- HTTP version: Wget might use this item to determine which other items it will use.
- Status code: The HTTP status code. Wget would be able to return the list of entries with a given status code (for example, the entries that could not be downloaded due to 404 Not Found).
- Expiration dates: Used to determine whether Wget should retrieve an entry again.
- Ages: Same purpose as expiration dates.
- If-None-Match header: Same purpose as expiration dates.
- Content-Type (MIME type/character encoding): Used for mirroring sessions when Wget cannot easily determine previously-downloaded content's Content-Type.
- Checksum: The SIDB would include MD5 checksum information (for FTP) and the content size, to verify that a local file has not changed. (It is also possible to generate MD5 checksums for files downloaded over HTTP: TBD.)
- Local filename: The path of the local file.
- URL: The corresponding URL.
- Completion status: Distinguishes files that were in the midst of downloading when execution ended abruptly, files that lacked content-length information and were considered finished when the connection closed, and files for which content-length information was available (either via Content-Length or the use of chunked encoding).
2. Format Discussion
- The file should contain both definition information and data; each field should be clearly associated with its field-definition information.
- HTTP's header mechanism actually could make a good model.
3. Other Considerations
Forward-compatibility: Considering future format revisions, it is desirable that an older version of Wget can still read a newer version of the SIDB, ignoring new features.
Robustness: SIDB should be readable even if Wget is interrupted and aborted abruptly.
Versioning: The SIDB could carry its own version information, indicating which version of Wget the format targets while still allowing older versions of Wget to parse it. (TBD)