Wget

wget — The non-interactive network downloader.

Usage

  • Simple download:
$ wget http://www.example.com/index.html
  • Download a file and store it locally using a different file name:
$ wget -O example.html http://www.example.com/index.html
  • Background download:
$ wget -b https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz
$ tail -f wget-log  # <- monitor download

The above command is useful when you initiate a download on a remote machine: the download runs in the background, so you can disconnect from the terminal once the command has been issued.
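
If several background downloads run from the same directory, the log can also be given an explicit name with the -o option. A small variation on the example above (the log file name here is just a placeholder):
$ wget -b -o kernel.log https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz
$ tail -f kernel.log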

  • Mirror an entire web site:
$ wget -m http://www.example.com
  • Mirror an entire subdirectory of a web site (the -np/--no-parent option prevents backlinks from pulling in parent directories):
$ wget -mk -w 20 -np http://example.com/foo/
  • Download all pages from a site and the pages the site links to (one-level deep):
$ wget -H -r --level=1 -k -p http://www.example.com
  • Resume large file download:
$ wget -c --output-document=MIT8.01F99-L01.mp4 "https://www.youtube.com/watch?v=X9c0MRooBzQ"
  • Schedule hourly downloads of a file:
$ wget --output-document=traffic_$(date +\%Y\%m\%d\%H).gif "http://sm3.sitemeter.com/YOUR_CODE"
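
The backslash-escaped % characters are only needed when the command is placed in a crontab entry, since cron treats a bare % as a newline. A hypothetical crontab line (edited with crontab -e) that fetches the image at the top of every hour, with the output path as a placeholder:
0 * * * * wget --output-document=/tmp/traffic_$(date +\%Y\%m\%d\%H).gif "http://sm3.sitemeter.com/YOUR_CODE"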
  • Automatically download music (by Jeff Veen):
$ wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off -i mp3_sites.txt

where mp3_sites.txt lists your favourite (legal) download sites.

#~OR~
$ wget -r --level=1 -H --timeout=1 -nd -N -np --accept=mp3 -e robots=off -i musicblogs.txt
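
In either form, the input file is just plain text with one URL per line; a hypothetical mp3_sites.txt (or musicblogs.txt) might contain:
$ cat mp3_sites.txt
http://music.example.com/new/
http://blog.example.org/mp3/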
  • Download all MP3s listed in an HTML page:
$ wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off [url of website]
#-r: recursive
#-l1: depth = 1
#-H: span hosts
#-t1: try once
#-nd: no hierarchy of directories
#-N: turn on time-stamping
#-np: do not ascend to parents
#-A.mp3: accept only mp3 files
#-erobots=off: ignore robots.txt
  • Crawl a website and generate a log file of any broken links:
$ wget --spider -o wget.log -e robots=off --wait 1 -r -p http://www.example.com/
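
Once the crawl finishes, the failures can be pulled out of the resulting log. Assuming the server reports missing pages with "404 Not Found" (adjust the pattern and the amount of context as needed to see the offending URLs), something like:
$ grep -B 2 '404 Not Found' wget.log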
  • Download using a custom user-agent string (for example, to identify the request as coming from a regular browser):
$ wget --user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0" https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz
  • Limit download speed/rate:
$ wget --limit-rate=300k https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz

Download multiple files

  • Create a variable that holds all of the URLs:
% URLS="http://www.example.com/foo.tar.gz ftp://ftp.example.org/pub/bar.tar.gz"
  • Then use a Bash for loop to download each one:
% for u in $URLS; do wget "$u"; done
  • You can also put a list of the URLs in a file and download using the -i option:
% wget -i download.txt
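
where download.txt is simply a plain-text file with one URL per line; hypothetical contents (reusing the URLs from above):
% cat download.txt
http://www.example.com/foo.tar.gz
ftp://ftp.example.org/pub/bar.tar.gz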

Automating/scripting download process

The following three helper scripts maintain a simple download queue: each directory keeps its pending URLs, one per line, in a file named .wget-list.

#!/bin/sh
# wget-list: download and remove each URL listed in .wget-list
# invoke wget-list without arguments

while [ -s .wget-list ]        # loop while .wget-list is non-empty
 do
  url=$(head -n1 .wget-list)   # take the first URL in the list
  wget -c "$url"               # download it, resuming a partial file if present
  sed -i 1d .wget-list         # remove that URL from the list
 done

#!/bin/sh
# wget-all: process .wget-list in every subdirectory
# invoke wget-all without arguments

find . -name .wget-list -execdir wget-list ';'

#!/bin/bash
# wget-dirs: run wget-all in specified directories
# invoking: wget-dirs <path-to-directory> ...
# (bash, because pushd/popd are bash builtins)

for dir in "$@"
  do
      pushd "$dir"
      wget-all
      popd
  done
wget-all                       # finally, process the current directory tree as well
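
A hypothetical session using the scripts above (the URLs are placeholders):
$ echo "http://www.example.com/foo.tar.gz" >> .wget-list
$ echo "ftp://ftp.example.org/pub/bar.tar.gz" >> .wget-list
$ wget-list    # downloads each URL in turn, removing it from the list as it completes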
