Getting Started With the Wget Program Source Code

Be forewarned: the Wget source code can be fairly hard to follow. It contains a fair number of hacks, and functionality that was "tacked on". All the components touch each other's private bits (in particular, the gethttp and getftp functions contain logic to handle a lot of things besides just "getting" something over HTTP or FTP).

This page is just a basic introduction to the Wget source code. Another helpful page for understanding the Wget source better is the OptionsHowto, which describes how to find the command-line option you're looking for, and how to add new ones.

Source Overview

Here's a call graph of some of the most interesting functions in a typical Wget usage scenario. Indentation shows who calls whom.

  main.c:     main
  recur.c:      retrieve_tree
  recur.c:        url_enqueue
  recur.c:        url_dequeue
  retr.c:         retrieve_url
  http.c:           http_loop
  http.c:             gethttp
  retr.c:               fd_read_body
  ftp.c:            ftp_loop
  ftp.c:              ftp_loop_internal
  ftp.c:                getftp
  ftp.c:                  ftp_get_listing
  retr.c:                 fd_read_body
  html-url.c:     get_urls_html

After about three hundred lines of initialization and options-handling code, main finally starts traversing the set of URLs specified on the command line. In recursive mode, it will invoke retrieve_tree for each of these; otherwise, it'll just call retrieve_url on each.

The retrieve_tree function invokes url_enqueue to add its URL argument to a queue. It will immediately dequeue this argument (with a call to url_dequeue), and invoke retrieve_url on it; but this starts off a loop of dequeue URL / process URL / parse for more URLs to queue up, until it finally runs out of URLs in the queue. New URLs are obtained from successfully retrieved HTML files via an invocation of get_urls_html.
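The dequeue / process / parse loop is easier to see in miniature. What follows is only a sketch of the idea, not Wget's actual code: every name in it (sketch_enqueue, sketch_dequeue, sketch_retrieve_tree, and the demo_* callbacks) is made up for illustration, and things the real function would have to worry about, such as duplicate detection and depth limits, are left out entirely. That means the link-extraction callback has to stop producing URLs eventually, or the loop never ends.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; these are NOT Wget's real types or signatures. */
struct url_node
{
  char *url;
  struct url_node *next;
};

struct url_queue
{
  struct url_node *head, *tail;
};

/* Append a private copy of URL to the end of the queue. */
void
sketch_enqueue (struct url_queue *q, const char *url)
{
  struct url_node *n = malloc (sizeof *n);
  assert (n != NULL);
  n->url = malloc (strlen (url) + 1);
  assert (n->url != NULL);
  strcpy (n->url, url);
  n->next = NULL;
  if (q->tail)
    q->tail->next = n;
  else
    q->head = n;
  q->tail = n;
}

/* Remove the oldest URL; returns NULL when the queue is empty.
   The caller frees the returned string. */
char *
sketch_dequeue (struct url_queue *q)
{
  struct url_node *n = q->head;
  char *url;
  if (!n)
    return NULL;
  url = n->url;
  q->head = n->next;
  if (!q->head)
    q->tail = NULL;
  free (n);
  return url;
}

typedef int (*fetch_fn) (const char *url);      /* nonzero on success */
typedef void (*extract_fn) (const char *url, struct url_queue *q);

/* Seed the queue with START, then loop until the queue drains.
   Returns the number of URLs processed. */
int
sketch_retrieve_tree (const char *start, fetch_fn fetch, extract_fn extract)
{
  struct url_queue q = { NULL, NULL };
  int processed = 0;
  char *url;

  sketch_enqueue (&q, start);
  while ((url = sketch_dequeue (&q)) != NULL)
    {
      if (fetch (url))          /* plays the role of retrieve_url */
        extract (url, &q);      /* plays the role of get_urls_html */
      processed++;
      free (url);
    }
  return processed;
}

/* Demo callbacks: every fetch "succeeds", and only the page named
   "root" links to anything else. */
int
demo_fetch (const char *url)
{
  (void) url;
  return 1;
}

void
demo_extract (const char *url, struct url_queue *q)
{
  if (strcmp (url, "root") == 0)
    {
      sketch_enqueue (q, "root/a");
      sketch_enqueue (q, "root/b");
    }
}
```

Calling sketch_retrieve_tree ("root", demo_fetch, demo_extract) processes three URLs: the starting one, plus the two it "links" to.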

retrieve_url will call http_loop or ftp_loop, as appropriate for the current URL. ftp_loop is essentially just a thin wrapper around ftp_loop_internal. http_loop and ftp_loop_internal are responsible for Wget's "robust" retrieval of files: if the connection drops, or a partial copy of the desired file already exists, it is the *_loop functions' job to retry the connection, set up partial file retrieval, etc. The *_loop functions also handle things like deciding whether timestamps need to be checked before actually downloading the file.
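As a rough illustration of what "robust" retrieval means, here is a minimal retry loop that resumes from however many bytes have already arrived. It is a sketch under loud assumptions: sketch_loop, attempt_fn, enum attempt_status, and the flaky_* demo are all invented names, and the real *_loop functions juggle a good deal more than this.

```c
#include <assert.h>

/* Hypothetical result codes; Wget's own error codes are different. */
enum attempt_status { ATTEMPT_OK, ATTEMPT_RETRY, ATTEMPT_FATAL };

/* One download attempt.  On entry *BYTES_SO_FAR is the offset to resume
   from; on return it holds the number of bytes obtained in total. */
typedef enum attempt_status (*attempt_fn) (long *bytes_so_far, void *ctx);

/* Keep trying until success, a fatal error, or MAX_TRIES attempts.
   Returns the number of attempts used on success, 0 if we ran out of
   tries, and -1 on a fatal error. */
int
sketch_loop (attempt_fn attempt, void *ctx, int max_tries, long *bytes_so_far)
{
  int tries;
  assert (max_tries > 0);
  for (tries = 1; tries <= max_tries; tries++)
    {
      enum attempt_status st = attempt (bytes_so_far, ctx);
      if (st == ATTEMPT_OK)
        return tries;
      if (st == ATTEMPT_FATAL)
        return -1;
      /* ATTEMPT_RETRY: the connection dropped.  Go around again; the
         next attempt resumes from *bytes_so_far, not from zero. */
    }
  return 0;
}

/* Demo: a "server" that drops the connection after CHUNK bytes. */
struct flaky_transfer
{
  long total;   /* size of the whole file */
  long chunk;   /* bytes delivered before each drop */
};

enum attempt_status
flaky_attempt (long *bytes_so_far, void *ctx)
{
  struct flaky_transfer *t = ctx;
  long remaining = t->total - *bytes_so_far;
  if (remaining <= t->chunk)
    {
      *bytes_so_far = t->total;
      return ATTEMPT_OK;
    }
  *bytes_so_far += t->chunk;
  return ATTEMPT_RETRY;
}
```

With a 250-byte file that drops every 100 bytes, the loop succeeds on the third attempt; capped at two attempts, it gives up with 200 bytes on hand, which is exactly the partial-file situation a later run would want to resume from.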

The gethttp and getftp functions do the actual protocol-level file retrieval requests. Both of these are really big, honking, several-hundreds-of-lines functions. They take on far more responsibility than they ought (for instance, doing the actual timestamp checking directly in the function, rather than delegating that elsewhere). Even the WgetMaintainer has difficulty following them sometimes, as they can be a bit of a tangled mess. :-\

Once the server has actually sent a successful response to gethttp or getftp, these will in turn call out to fd_read_body to handle the actual file retrieval over a socket.
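For a feel of what "reading the body" boils down to, here is a bare-bones version: read from a descriptor until end-of-file and copy everything to an output stream. This is a sketch only, assuming a POSIX system; sketch_read_body is a made-up name with a made-up signature, and the real fd_read_body does considerably more than this.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Copy everything readable from FD to OUT until end-of-file.
   Returns the number of bytes copied, or -1 on a read or write error. */
long
sketch_read_body (int fd, FILE *out)
{
  char buf[4096];
  long total = 0;

  assert (out != NULL);
  for (;;)
    {
      ssize_t n = read (fd, buf, sizeof buf);
      if (n == 0)
        break;                  /* peer closed the connection: done */
      if (n < 0)
        {
          if (errno == EINTR)
            continue;           /* interrupted by a signal: try again */
          return -1;
        }
      if (fwrite (buf, 1, (size_t) n, out) != (size_t) n)
        return -1;
      total += n;
    }
  return total;
}
```

Any readable descriptor will do for trying it out; a pipe works just as well as a socket.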

The getftp function may also call out to ftp_get_listing, in order to obtain a directory listing from the FTP server, and parse the results. This is especially used for recursive fetches, so it can traverse all the links it finds in a directory.

Hopefully, this is enough information about the Wget source code to roll up your sleeves and get started. Happy browsing!

Tools For Navigating the Source

If you're not already familiar with using your favorite editor in conjunction with ctags, etags or cscope, you really should learn to take advantage of them. These tools build an index of where to find definitions for various C symbols, and many editors have facilities for using these indices to jump quickly to the definition of whatever you happen to be looking at. If you are using Vim or vi, you can type Control-] to jump to the definition of whatever's under the cursor (in Vim you may also control-click on the item whose definition you want); in Emacs, you type M-. .

For more information, see :help tags in Vim; or type info emacs tags at the command line if you're an Emacs user (and have the Emacs manual installed). Note that Vim and vi expect to find a file named tags in the current directory, which you can generate with either make ctags or ctags * on the command line. Emacs expects a file named TAGS, which has a different format, and is generated by either make tags, make TAGS, or etags *. If you're using neither Emacs nor vi, you should check your editor's documentation to see if it supports the tags file. If it doesn't, you should consider using a more powerful editor for browsing source code (you might also want to look into using the cscope command-line program directly, instead of through an editor interface).

Cscope is a very useful program for browsing source files. As with ctags, it indexes the definitions it finds in C source files; but it additionally indexes such things as where a given C symbol occurs, or where a given function is invoked. Very handy!

Cscope ships with a module for Emacs to provide an interface; Vim has built-in support for Cscope. See the Cscope website for more information. In addition to these editor interfaces, there is the cscope command, which you can use directly for your source-code browsing pleasure.

NavigatingTheSource (last edited 2012-01-01 17:50:48 by c-67-190-70-101)