Frequently Asked Questions About GNU Wget
Contents
- About This FAQ
- About Wget
- Installing Wget
- Using Wget
- How do I use wget to download pages or files that require login/password?
- Why isn't Wget downloading all the links? I have recursive mode set
- How do I get Wget to follow links on a different host?
- How can I make Wget ignore the robots.txt file/no-follow attribute?
- Tool ''X'' lets me mirror a site, but Wget gives an HTTP error?
- How Do I Hide Wget From The Task Bar (Windows)?
- Feature Requests
1. About This FAQ
1.1. Referring to FAQ Entries
Please don't refer to any of the FAQs or sections by number: these are liable to change frequently, so "See Faq #2.1" isn't going to be meaningful.
Similarly, while it might seem like a good idea to use the links from the table of contents on this page, those, too, are not persistent.
If you look at the source of this page (by clicking here), you'll see that some FAQ entries include text like, "(Please use this link to refer to this answer.)" This is what you should use when you're referring to an answer on this page.
If the answer you want to reference doesn't have a link like that, you'll need to add one.
To do this, pick a descriptive name for the anchor and put <<Anchor(descriptive-name)>> on the line before the start of the section you're linking to. Then, at the start of the section text, add the entry:
(Please use [[#descriptive-name|this link]] to refer to this answer.)
2. About Wget
2.1. What is Wget?
GNU Wget is a network utility to retrieve files from the World Wide Web using HTTP and FTP, the two most widely used Internet protocols. It works non-interactively, so it can run in the background even after you have logged off. The program supports recursive retrieval of web-authoring pages as well as FTP sites—you can use Wget to make mirrors of archives and home pages or to travel the Web like a WWW robot, checking for broken links.
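For instance, a single command suffices to mirror a directory tree (the URL here is just a placeholder):
$ wget --mirror --no-parent http://example.com/docs/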
2.2. Where is the home page?
You can find the official Wget homepage at this URL:
http://www.gnu.org/software/wget/
There's also the Wget entry in the FSF Free Software Directory:
http://directory.fsf.org/wiki/Wget
2.3. Where can I download Wget?
(Please use this link to refer to this answer.)
Source Tarball:
http://www.gnu.org/order/ftp.html (GNU mirror list)
Windows Binaries
- courtesy of Jernej Simončič:
- from sourceforge:
- courtesy of Bart Puype (stand-alone wget.exe):
- courtesy of Christopher G. Lewis:
http://www.christopherlewis.com/WGet/WGetFiles.htm [Deleted October 2011 - site gives only a 404 Error.]
MS-DOS
- An MS-DOS binary has been made available by Michael Kostylev:
VMS
- VMS port by Antinode.org:
Solaris
- Solaris Packages for SPARC and x86:
OS X
- Apple OS X package by Andrew Merenbach:
The latest development source for Wget is always available in our source code repository (see RepositoryAccess). Source and binary versions of the current development sources, patched for compilation in the MS Windows environment, were also made available by Christopher G. Lewis at the URL given above (now defunct; see the note there).
2.4. Where can I find documentation?
Well, aside from the information found on this Wiki, you can:
- browse the GNU Wget manual online, or
- read the man page or the texinfo documentation included in the GNU Wget distribution.
2.5. Where can I get help?
The main mailing list for end users is bug-wget@gnu.org. You can subscribe by sending an email to bug-wget-join@gnu.org. If you wish to post to the list, please be sure to include the complete output of Wget run with the -d flag; it will drastically improve the likelihood and quality of responses. Look over your Wget invocation and output carefully to make sure you're not including any sensitive information.
You can view the mailing list archives at http://lists.gnu.org/archive/html/bug-wget/
Mailing list archives prior to November 2008 are available at http://www.mail-archive.com/wget%40sunsite.dk/
More info about other mailing lists can be found on the MailingLists page.
2.6. Where can I report a bug or feature request?
Use our BugTracker!
2.7. How can I help develop Wget?
Excellent question! See the HelpingWithWget page.
3. Installing Wget
3.1. How do I compile Wget?
On most UNIX-like operating systems, this will work:
$ gunzip < wget-1.12.tar.gz | tar -xvf -
$ cd wget-1.12
$ ./configure
$ make
# make install
If it doesn't, be sure to look at the README and INSTALL files that came with your distribution. You can also run configure with the --help flag to get more options.
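For example, assuming a standard autoconf setup, you might install into your home directory instead of system-wide (the exact set of flags varies between Wget versions, so do check configure --help):
$ ./configure --prefix=$HOME/local
$ make && make install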
See the RepositoryAccess page for additional requirements and steps to compile the source obtained from the source repository.
4. Using Wget
4.1. How do I use wget to download pages or files that require login/password?
(Please use this link to refer to this answer.)
Well, if "login" means that your browser pops up a window, specifying a "realm" and asking that you enter a username and password, you should be able to simply use Wget's --user and --password options to provide the necessary information to Wget.
However, if "login" means a page with a web form and a "submit" button right in the page, things get a little more complicated.
The easiest way to do what you need may be to log in using your browser, and then tell Wget to use the cookies from your browser, using --load-cookies=path-to-browser's-cookies. Of course, this only works if your browser saves its cookies in the standard text format (Firefox prior to version 3 will do this), or can export to that format (note that someone contributed a patch to allow Wget to work with Firefox 3 cookies; it's linked from the FrontPage, and is unofficial so I can't vouch for its quality). It also won't work if the server relies on "session" cookies, since those aren't saved to the file.
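For example, assuming the browser's cookies have been saved (or exported) to a Netscape-format text file at a hypothetical path:
$ wget --load-cookies=/path/to/cookies.txt http://HOSTNAME/members/page.html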
Otherwise, you can perform the login using Wget, saving the cookies to a file of your choice, using --post-data=..., --save-cookies=cookies.txt, and probably --keep-session-cookies. This will require that you know what data to place in --post-data, which generally requires that you dig around in the HTML to find the right form field names, and where to post them.
For instance, if you find a form like the following within the page containing the log-in form:
<form action="/doLogin.php" method="POST">
  <input type="text" name="s-login" />
  <input type="password" name="s-pass" />
  <input type="hidden" name="token" value="AF9FF24" />
  <input type="submit" name="s-action" value="Login" />
</form>
then you need to do something like:
$ wget --post-data='s-login=USERNAME&s-pass=PASSWORD&token=AF9FF24&s-action=Login' \
       --save-cookies=my-cookies.txt --keep-session-cookies \
       http://HOSTNAME/doLogin.php
Note that you don't necessarily send the information to the page that contained the login form: you send it to the URL named in the "action" attribute of that form. Also note that you should include values for all of the fields that appear in the form, including "hidden"-type fields. If the submit button has a name, include its name/value pair as well.
Note, too, that you might have to percent-encode some characters to produce a valid URL (usually not, but it happens). This is complicated and technical work to do by hand, though there may be tools available to help. Also, space characters should always be replaced with the plus (+) character, and literal plus characters need to be encoded as %2B.
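As a purely hypothetical illustration, a password of p@ss word+1 would need to be sent as:
s-pass=p%40ss+word%2B1
(the @ becomes %40, the space becomes +, and the literal + becomes %2B).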
Once this is done, you should be able to perform further operations with Wget as if you're logged in, by using
$ wget --load-cookies=my-cookies.txt --save-cookies=my-cookies.txt \
       --keep-session-cookies ...
This is all a lot of trouble, obviously, and there are tentative plans to perhaps eventually write a helpful utility to automate some of this drudgery.
4.2. Why isn't Wget downloading all the links? I have recursive mode set
(Please use this link to refer to this answer.)
There could be various reasons why Wget doesn't download links you expect it to. Gather as much information as you can by running Wget with the --debug flag, then look at the next several questions, which cover specific situations that can lead to Wget not downloading a link it finds.
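For example, to save the debug output to a file for later inspection (the URL is a placeholder):
$ wget --debug -r -o wget.log http://your.site.here/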
4.3. How do I get Wget to follow links on a different host?
By default, Wget will not follow links across to a different host than the one the link was found on.
If Wget's --debug output says something like
This is not the same hostname as the parent's (foo.com and bar.com)
it means that Wget decided not to follow a link because it goes to a different host.
To ask Wget to follow links to a different host, you need to specify the --span-hosts option. You may also want to use the --domains and/or --exclude-domains options, to control which hosts Wget will follow links to.
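For example, to recurse from one host while also following links to one other, related host (both domains are placeholders):
$ wget -r --span-hosts --domains=foo.com,bar.com http://foo.com/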
4.4. How can I make Wget ignore the robots.txt file/no-follow attribute?
(Please use this link to refer to this answer.)
By default, Wget plays the role of a well-behaved web spider: it obeys a site's robots.txt file and no-follow attributes.
If Wget's --debug output says something like
Not following foo.bar because robots.txt forbids it
or
no-follow in index.html
then this is the cause of your trouble.
Wget enables you to ignore robots.txt and no-follow attributes; however, you should think about what you're doing first, and about what those robots.txt files may be preventing. While some people use robots.txt to block others from automatically fetching portions of their site, it is also used to prevent automated clients from imposing huge loads on the server by following links to CGI scripts that require significant processing power. Ignoring a robots.txt or no-follow can mean giving migraines to site administrators, so please be sure you know what you're doing before disabling these things.
To ignore robots.txt and no-follow, use something like:
wget -e robots=off --wait 0.25 http://your.site.here
Whenever possible, please do include an appropriate option like --wait 0.25 or --limit-rate=80k, so that you won't hammer sites that have added Wget to their disallowed list to escape users performing mass downloads. If the run includes a lot of small downloads, --wait is a better option than --limit-rate, because --limit-rate has little to no effect on small downloads.
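For example, a rate-limited variant of the command above, which may suit runs dominated by large files (the URL is again a placeholder):
$ wget -e robots=off --limit-rate=80k -m http://your.site.here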
4.5. Tool ''X'' lets me mirror a site, but Wget gives an HTTP error?
(Please use this link to refer to this answer.)
The server admin may be specifically denying the Wget user agent. Try changing the identification string to something else:
wget -m -U "Mozilla/5.0 (compatible; Konqueror/3.2; Linux)" http://some.web.site
Before rushing ahead with such a solution, pause to think for a moment about why they might be trying to prevent people from using Wget on their site. It may be that thoughtless use of Wget taxes their system, by sending too many requests in too short a timespan or by fetching CGI scripts that require significant processing power. If you use this option, please consider whether one of --wait or --limit-rate would be appropriate, and perhaps judiciously apply the --accept or --reject options to avoid fetching things Wget should not follow automatically.
Another possibility is that the server is attempting to defeat direct links to that specific resource. If the download works when you click a link to the resource, but not when you paste the same URL directly into your browser's address bar, this is likely your problem. You can use the --referer option to specify the page on which the link resides.
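For example (both URLs are placeholders):
$ wget --referer='http://some.web.site/page-with-link.html' http://some.web.site/file.zip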
4.6. How Do I Hide Wget From The Task Bar (Windows)?
Christopher G. Lewis writes:
Depends on the scripting language. VBScript has the Shell.Run command: the first parameter is the command ("WGET.EXE http://www.google.com"), the second parameter is the window state. 6, I believe, hides the window.
Dim oShell
Set oShell = WScript.CreateObject("WScript.Shell")
oShell.Run "WGET.EXE http://www.google.com", 6, True
Set oShell = Nothing
5. Feature Requests
5.1. Does Wget understand HTTP/1.1?
Wget is an HTTP/1.0 client. But since the HTTP/1.1 protocol was designed to fully support HTTP/1.0 clients, Wget interoperates with most HTTP/1.1 servers.
In addition, Wget supports several features introduced by HTTP/1.1 and used by many web servers, such as keep-alive connections and the Host header.
5.2. Can Wget download links found in CSS?
As of version 1.12, thanks to code supplied by Ted Mielczarek, Wget can parse embedded CSS stylesheet data and text/css files to find additional links for recursion.
5.3. Does Wget understand JavaScript?
Wget doesn't feature JavaScript support and is not capable of performing recursive retrieval of URLs included in JavaScript code.
In fact, it is impossible to extract URLs from JavaScript by merely parsing it: web clients need to actually execute it, which is hard to do meaningfully in a non-interactive client. Execution is also slow, and memory- and CPU-intensive. However, there is a lot of demand for such a feature.
The problem is that it would be effectively impossible to fetch any and all web URLs that a given JavaScript program might fetch, on every possible user interaction; so Wget can never fetch everything that a user might wish it to, or that a user might be able to retrieve through specific interactions on a web page.
However, there is a wealth of possibilities that lie between the extremes of getting nothing, and getting everything; and an examination of what behavior within that realm might be appropriate could be very illuminating. There are discussions underway as to what level of JavaScript support Wget might offer in the future.
However, given the fact that it adds a huge amount of complexity, and the fact that we could never hope to achieve "perfect" results, always just approximating it a little better and a little better, it is extremely likely that JavaScript support for Wget will be a separate development effort from Wget itself, and will probably be offered in the form of a separately-downloaded plugin (a plugin architecture being another thing planned for Wget).
See FeatureSpecifications/JavaScript for the latest status of discussions related to this feature.
5.4. Will Wget support multiple simultaneous connections?
While this will take a significant redesign of Wget's architecture, this feature is planned. There's no reason why the slowness of one server should halt all processing until that one server responds.
However, note that this does not mean that Wget will take on the functionality of a so-called "download accelerator". Opening more than two simultaneous connections to the same server is rather rude, and can create serious problems for system administrators, especially on systems that have limited resources. It also goes against the recommendations of the HTTP specification, RFC 2616.
There is a much better way to "accelerate" downloads, and that is to use the Metalink specification to supply Wget with a list of alternative locations for retrieving the same resource. Wget can then grab different portions of the same resource from separate hosts, reducing download time while not overtaxing any one server. Support for the Metalink format is planned for a future release of Wget.
5.5. Is there a way to hide my clear-text user/pass combo from the process table?
Wget now offers a hidden-password prompt for more secure entry of authentication information, as of version 1.12 (--ask-password).
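For example (USERNAME and the URL are placeholders):
$ wget --user=USERNAME --ask-password http://HOSTNAME/protected/file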
If you are stuck with an older version of Wget, you could put your URLs with passwords into a file and invoke Wget with wget -i FILE. Or use wget -i - and type the URL followed by a Ctrl-D.
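For example, feeding the URL on standard input via a here-document, so the credentials appear in neither the process table nor an on-disk file (though your shell's history may still record them; everything here is a placeholder):
$ wget -i - <<'EOF'
http://USERNAME:PASSWORD@HOSTNAME/protected/file
EOF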
For HTTP authentication, you could also place your password info in your wgetrc, with the lines:
user = foo
password = bar
(or http_user, http_password).
Note that, if you don't want to place these into your main ~/.wgetrc file, you can put them in a different location, and tell Wget to use that file by setting the WGETRC environment variable to that location or passing the path to this file to the --config option.
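For example (the path is hypothetical):
$ WGETRC=$HOME/.wgetrc-auth wget http://HOSTNAME/protected/file
or, equivalently, with the --config option:
$ wget --config=$HOME/.wgetrc-auth http://HOSTNAME/protected/file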
If you are trying to supply a password as part of an HTML form, you can use --post-file instead of --post-data. An example of how to make this interactive, assuming your shell provides echo as a builtin, follows:
postfile=$(tempfile) \
  && read -s -p "Password: " pass \
  && echo "os_password=${pass//&/%26}" > "${postfile}" \
  && wget --post-file "${postfile}" 'http://localhost/login'
[ "${postfile}" ] && rm -f "${postfile}"
read’s -s flag hides input, and the -p flag displays the "Password: " prompt. The ${pass//&/%26} syntax specifies that every ampersand in the read-in password should be replaced with the urlencoded value so that passwords with ampersands don’t get interpreted as multiple post variables. This may be adequate to cover most cases, but ideally a more proper method of urlencoding the password value should be used.