Wget
From Christoph's Personal Wiki
{{lowercase|wget}}
 
 
'''wget''' — The non-interactive network downloader.
 
==Usage==
* Simple download:
 $ wget <nowiki>http://www.example.com/index.html</nowiki>

* Download a file and store it locally using a different file name:
 $ wget -O example.html <nowiki>http://www.example.com/index.html</nowiki>

* Background download:
 $ wget -b <nowiki>https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz</nowiki>
 $ tail -f wget-log  # <- monitor the download

The above command is useful when you initiate a download on a remote machine: the download runs in the background, so you can disconnect from the terminal once the command has been issued.
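A minimal sketch of how one might check on a backgrounded download: <code>wget -b</code> writes its progress to <code>wget-log</code>, so the log can be grepped instead of watched with <code>tail -f</code>. The log text below is simulated sample output (and <code>wget-log.demo</code> a throwaway name), not a real transfer:

```shell
#!/bin/sh
# Sketch: check whether a backgrounded download has finished by inspecting
# its log file. The log text here is simulated sample output, not a real
# transfer; with a real 'wget -b' run, a completed download leaves a line
# containing "saved" in wget-log.
log=wget-log.demo
printf '%s\n' "Saving to: 'linux-4.0.4.tar.gz'" \
              "'linux-4.0.4.tar.gz' saved [1234/1234]" > "$log"

if grep -q "saved" "$log"; then
  status="finished"
else
  status="running"
fi
echo "download $status"
rm -f "$log"
```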
 
* Mirror an entire web site:
 $ wget -m <nowiki>http://www.example.com</nowiki>

* Mirror an entire subdirectory of a web site (with the no-parent option in case of backlinks):
 $ wget -mk -w 20 -np <nowiki>http://example.com/foo/</nowiki>

* Download all pages from a site and the pages the site links to (one level deep):
 $ wget -H -r --level=1 -k -p <nowiki>http://www.example.com</nowiki>

* Resume a large file download:
 $ wget -c --output-document=MIT8.01F99-L01.mp4 "<nowiki>https://www.youtube.com/watch?v=X9c0MRooBzQ</nowiki>"

* Schedule hourly downloads of a file:
 $ wget --output-document=traffic_$(date +\%Y\%m\%d\%H).gif "<nowiki>http://sm3.sitemeter.com/YOUR_CODE</nowiki>"

* Automatically download music (by [http://www.veen.com/jeff/archives/000573.html Jeff Veen]):
 $ wget -r -l1 -H -t1 -nd -N -np -A.mp3 -e robots=off -i mp3_sites.txt

where <code>mp3_sites.txt</code> lists your favourite (legal) download sites.

~OR~

 $ wget -r --level=1 -H --timeout=1 -nd -N -np --accept=mp3 -e robots=off -i musicblogs.txt
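The hourly-download command above builds a time-stamped file name via command substitution; that part can be checked on its own, without any network access (the <code>traffic_…</code> name is just the example's):

```shell
#!/bin/sh
# Build the time-stamped file name used by the hourly-download example.
# Note: the example escapes the % signs (date +\%Y\%m\%d\%H) because % is
# special in crontab entries; run interactively, the backslashes are not needed.
stamp=$(date +%Y%m%d%H)          # e.g. 2021091307 (year, month, day, hour)
outfile="traffic_${stamp}.gif"   # the name wget writes with --output-document
echo "$outfile"
```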
  
* Download all MP3s listed in an HTML page ([http://www.commandlinefu.com/commands/view/12986/download-all-mp3s-listed-in-an-html-page source]):
 $ wget -r -l1 -H -t1 -nd -N -np -A.mp3 -e robots=off [url of website]
 # -r: recursive
 # -l1: depth = 1
 # -H: span hosts
 # -t1: try once
 # -nd: no hierarchy of directories
 # -N: turn on time-stamping
 # -np: do not ascend to parents
 # -A.mp3: accept only mp3 files
 # -e robots=off: ignore robots.txt

* Crawl a website and generate a log file of any broken links:
 $ wget --spider -o wget.log -e robots=off --wait 1 -r -p <nowiki>http://www.example.com/</nowiki>

* Force wget to mimic a browser's user-agent (e.g., http://whatsmyuseragent.com/):
 $ wget --user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0" <nowiki>https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz</nowiki>

* Limit the download speed/rate:
 $ wget --limit-rate=300k <nowiki>https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz</nowiki>

* Get the response headers:
<pre>
$ wget -SO /dev/null xtof.ch
--2021-09-13 00:14:04--  http://xtof.ch/
Resolving xtof.ch (xtof.ch)... 1.2.3.4
Connecting to xtof.ch (xtof.ch)|1.2.3.4|:80... connected.
HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Date: Mon, 13 Sep 2021 07:14:04 GMT
  Server: Apache/2.4.37 (centos)
  Last-Modified: Tue, 28 Jul 2020 23:13:16 GMT
  ETag: "d2-5ab88958ae676"
  Accept-Ranges: bytes
  Content-Length: 210
  Keep-Alive: timeout=5, max=100
  Connection: Keep-Alive
  Content-Type: text/html; charset=UTF-8
Length: 210 [text/html]
Saving to: ‘/dev/null’
</pre>
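Header output like the transcript above can be post-processed with standard tools. A small sketch, using a saved copy of the headers instead of a live <code>wget -S</code> run (the <code>headers.demo</code> file name is just for illustration):

```shell
#!/bin/sh
# Pull a single field out of saved 'wget -S' response headers.
# The sample below mirrors the transcript above; a live capture would be
# something like:  wget -SO /dev/null http://example.com 2>headers.demo
# (wget writes its log, including the headers, to stderr).
cat > headers.demo <<'EOF'
  HTTP/1.1 200 OK
  Content-Length: 210
  Content-Type: text/html; charset=UTF-8
EOF

# sed: print only the Content-Type line, with the label and spaces removed.
ctype=$(sed -n 's/^ *Content-Type: *//p' headers.demo)
echo "$ctype"
rm -f headers.demo
```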
==Download multiple files==

* Create a variable that holds all of the URLs and then use a Bash for loop to download each file:
 $ URLS="<nowiki>http://www.example.com/foo.tar.gz ftp://ftp.example.org/pub/bar.tar.gz</nowiki>"

* Use the for loop as follows:
 $ for u in $URLS; do wget $u; done

* You can also put a list of the URLs in a file and download them using the <code>-i</code> option:
 $ wget -i download.txt
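The loop above relies on the shell word-splitting <code>$URLS</code> on whitespace. That behaviour can be sketched without touching the network, with <code>echo</code> standing in for wget:

```shell
#!/bin/sh
# Demonstrate the word splitting that 'for u in $URLS' relies on,
# with echo standing in for wget so the sketch needs no network access.
URLS="http://www.example.com/foo.tar.gz ftp://ftp.example.org/pub/bar.tar.gz"

count=0
for u in $URLS; do            # unquoted on purpose: split on whitespace
  echo "would fetch: $u"      # the real loop runs: wget "$u"
  count=$((count + 1))
done
echo "$count URLs"
```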
  
 
===Automating/scripting download process===
<pre>
#!/bin/sh
# wget-list: manage the list of downloaded files

# invoke wget-list without arguments

while [ "$(find .wget-list -size +0)" ]; do
  url=$(head -n1 .wget-list)
  wget -c "${url}"
  sed -si 1d .wget-list
done
</pre>

<pre>
#!/bin/sh
# wget-all: process .wget-list in every subdirectory

# invoke wget-all without arguments

find -name .wget-list -execdir wget-list ';'
</pre>

<pre>
#!/bin/sh
# wget-dirs: run wget-all in the specified directories

# invoking: wget-dirs <path-to-directory> ...

for dir in "$@"; do
  pushd "${dir}"
  wget-all
  popd
done
wget-all
</pre>
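The queue handling in <code>wget-list</code> above (<code>head -n1</code> takes the first URL, <code>sed 1d</code> drops it) works on any text file. A sketch with a dummy queue and no actual downloading, which substitutes a plain <code>test -s</code> non-empty check for the script's <code>find -size +0</code> and uses GNU <code>sed -i</code>:

```shell
#!/bin/sh
# Consume a .wget-list-style queue one line at a time, as wget-list does,
# but record each URL instead of fetching it. Uses 'test -s' (non-empty
# file) in place of the script's 'find .wget-list -size +0' check, and
# GNU 'sed -i' for the in-place delete.
printf '%s\n' "http://www.example.com/a" "http://www.example.com/b" > queue.demo

fetched=0
while [ -s queue.demo ]; do        # loop while the queue is non-empty
  url=$(head -n1 queue.demo)       # take the first queued URL
  echo "would fetch: $url"         # the real script runs: wget -c "$url"
  fetched=$((fetched + 1))
  sed -i 1d queue.demo             # drop that URL from the queue
done
echo "$fetched URLs processed"
rm -f queue.demo
```

Because each fetched URL is removed from the list only after its <code>wget -c</code> call, an interrupted run can simply be restarted and resumes from the first remaining URL.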
  
==See also==
* [[curl]]
* [http://wput.sourceforge.net/ wput] &mdash; a tiny wget-like FTP client for uploading files
* [[rsync]]
* [[axel]]
* [http://prozilla.genesys.ro/ prozilla]

==External links==
* [http://www.gnu.org/software/wget/manual/ GNU Wget Manual] &mdash; last update: 15-Jun-2005
* [http://lifehacker.com/161202/geek-to-live--mastering-wget Geek to Live: Mastering Wget] &mdash; via lifehacker.com
* [http://www.cyberciti.biz/nixcraft/vivek/blogger/2005/06/linux-wget-your-ultimate-command-line.php wget: your ultimate command line downloader]

{{stub}}

[[Category:Linux Command Line Tools]]

Latest revision as of 07:18, 13 September 2021
