'''wget''' — The non-interactive network downloader.

==Usage==

* Simple download:
 $ wget <nowiki>http://www.example.com/index.html</nowiki>

* Download a file and store it locally using a different file name:
 $ wget -O example.html <nowiki>http://www.example.com/index.html</nowiki>

* Background download:
 $ wget -b <nowiki>https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz</nowiki>
 $ tail -f wget-log  # <- monitor download

The above is useful when you initiate a download on a remote machine: the transfer runs in the background, so you can disconnect from the terminal once the command has been issued.
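
By default <code>-b</code> writes its log to <code>wget-log</code> in the current directory (or <code>wget-log.1</code>, etc., if that name is already taken). A minimal sketch of naming the log explicitly with <code>-o</code> so it is easier to monitor (the log file name here is just an example):
 $ wget -b -o kernel-download.log <nowiki>https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz</nowiki>
 $ tail -f kernel-download.log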
| + | |||
| + | * Mirror an entire web site: | ||
| + | $ wget -m <nowiki>http://www.example.com</nowiki> | ||
| + | |||
| + | * Mirror an entire subdirectory of a web site (with no parent option in case of backlinks): | ||
| + | $ wget -mk -w 20 -np <nowiki>http://example.com/foo/</nowiki> | ||
| + | |||
| + | * Download all pages from a site and the pages the site links to (one-level deep): | ||
| + | $ wget -H -r --level=1 -k -p <nowiki>http://www.example.com</nowiki> | ||
| + | |||
| + | * Resume large file download: | ||
| + | $ wget -c --output-document=MIT8.01F99-L01.mp4 "<nowiki>https://www.youtube.com/watch?v=X9c0MRooBzQ</nowiki>" | ||
| + | |||
| + | * Schedule hourly downloads of a file | ||
| + | $ wget --output-document=traffic_$(date +\%Y\%m\%d\%H).gif "<nowiki>http://sm3.sitemeter.com/YOUR_CODE</nowiki>" | ||
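
The <code>\%</code> escapes are needed because <code>%</code> is a special character in a crontab. A minimal sketch of the matching hourly crontab entry (the output path and the <code>YOUR_CODE</code> placeholder are assumptions):
 0 * * * * wget --output-document=$HOME/traffic_$(date +\%Y\%m\%d\%H).gif "<nowiki>http://sm3.sitemeter.com/YOUR_CODE</nowiki>" >/dev/null 2>&1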
| + | |||
| + | * Automatically download music (by [http://www.veen.com/jeff/archives/000573.html Jeff Veen]): | ||
| + | $ wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off -i mp3_sites.txt | ||
| + | where <code>mp3_sites.txt</code> lists your favourite (legal) download sites. | ||
| + | #~OR~ | ||
| + | $ wget -r --level=1 -H --timeout=1 -nd -N -np --accept=mp3 -e robots=off -i musicblogs.txt | ||
| + | |||
* Download all mp3s listed in an HTML page ([http://www.commandlinefu.com/commands/view/12986/download-all-mp3s-listed-in-an-html-page source]):
 $ wget -r -l1 -H -t1 -nd -N -np -A.mp3 -erobots=off [url of website]
 # -r: recursive
 # -l1: depth = 1
 # -H: span hosts
 # -t1: try once
 # -nd: no hierarchy of directories
 # -N: turn on time-stamping
 # -np: do not ascend to the parent directory
 # -A.mp3: accept only mp3 files
 # -erobots=off: ignore robots.txt

* Crawl a website and generate a log file of any broken links:
 $ wget --spider -o wget.log -e robots=off --wait 1 -r -p <nowiki>http://www.example.com/</nowiki>
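
Once the crawl finishes, the failures can be pulled out of the log; a rough sketch, assuming the server reports missing pages as <code>404 Not Found</code>:
 $ grep -B 2 '404 Not Found' wget.log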
| + | |||
| + | * Force wget to mimic a browser's user-agent (e.g., http://whatsmyuseragent.com/): | ||
| + | $ wget --user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0" <nowiki>https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz</nowiki> | ||
| + | |||
| + | * Limit download speed/rate: | ||
| + | $ wget --limit-rate=300k <nowiki>https://www.kernel.org/pub/linux/kernel/v4.x/linux-4.0.4.tar.gz</nowiki> | ||
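
The throttling options combine naturally with the mirroring examples above; a sketch of a "polite" mirror that waits between requests, randomizes the wait, and caps the bandwidth (the URL is a placeholder):
 $ wget -m -w 2 --random-wait --limit-rate=200k <nowiki>http://www.example.com</nowiki>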
| + | |||
| + | * Get headers: | ||
| + | <pre> | ||
| + | $ wget -SO /dev/null xtof.ch | ||
| + | --2021-09-13 00:14:04-- http://xtof.ch/ | ||
| + | Resolving xtof.ch (xtof.ch)... 1.2.3.4 | ||
| + | Connecting to xtof.ch (xtof.ch)|1.2.3.4|:80... connected. | ||
| + | HTTP request sent, awaiting response... | ||
| + | HTTP/1.1 200 OK | ||
| + | Date: Mon, 13 Sep 2021 07:14:04 GMT | ||
| + | Server: Apache/2.4.37 (centos) | ||
| + | Last-Modified: Tue, 28 Jul 2020 23:13:16 GMT | ||
| + | ETag: "d2-5ab88958ae676" | ||
| + | Accept-Ranges: bytes | ||
| + | Content-Length: 210 | ||
| + | Keep-Alive: timeout=5, max=100 | ||
| + | Connection: Keep-Alive | ||
| + | Content-Type: text/html; charset=UTF-8 | ||
| + | Length: 210 [text/html] | ||
| + | Saving to: ‘/dev/null’ | ||
| + | </pre> | ||
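
If only the headers are of interest, <code>-S</code> can also be combined with <code>--spider</code>, which checks the URL without saving anything locally:
 $ wget -S --spider xtof.ch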
| + | |||
| + | ==Download multiple files== | ||
| + | * Create variable that holds all URLs and then using "BASH for loop" to download all files: | ||
| + | $ URLS="<nowiki>http://www.example.com/foo.tar.gz ftp://ftp.example.org/pub/bar.tar.gz</nowiki>" | ||
* Use for loop as follows: | * Use for loop as follows: | ||
| − | + | $ for u in $URLS; do wget $u; done | |
| + | |||
| + | * You can also put a list of the URLs in a file and download using the <code>-i</code> option: | ||
| + | $ wget -i download.txt | ||
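
<code>download.txt</code> is simply a list of URLs, one per line. If the list is long, the downloads can also be run a few at a time with <code>xargs</code>; a sketch (the <code>-P 4</code> level of parallelism is arbitrary):
 $ xargs -n 1 -P 4 wget -q < download.txt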
| + | |||
| + | ===Automating/scripting download process=== | ||
| + | <pre> | ||
| + | #!/bin/sh | ||
| + | # wget-list: manage the list of downloaded files | ||
| + | |||
| + | # invoke wget-list without arguments | ||
| + | while [ $(find .wget-list -size +0) ]; do | ||
| + | url=$(head -n1 .wget-list) | ||
| + | wget -c ${url} | ||
| + | sed -si 1d .wget-list | ||
| + | done | ||
| + | </pre> | ||
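
A sketch of how <code>wget-list</code> might be used (the URLs are just examples): put one URL per line in <code>.wget-list</code>, put the script somewhere in your <code>PATH</code>, and run it from the same directory:
 $ printf '%s\n' "<nowiki>http://www.example.com/foo.tar.gz</nowiki>" "<nowiki>ftp://ftp.example.org/pub/bar.tar.gz</nowiki>" > .wget-list
 $ wget-list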
| + | |||
| + | <pre> | ||
| + | #!/bin/sh | ||
| + | # wget-all: process .wget-list in every subdirectory | ||
| + | # invoke wget-all without arguments | ||
| + | |||
| + | find -name .wget-list -execdir wget-list ';' | ||
| + | </pre> | ||
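
Note that <code>-execdir</code> runs <code>wget-list</code> from each directory that contains a <code>.wget-list</code> file, so the <code>wget-list</code> script must be on the <code>PATH</code>. Typical use is from the top of a download tree (the directory is just an example):
 $ cd ~/downloads && wget-all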
| + | |||
| + | <pre> | ||
| + | #!/bin/sh | ||
| + | # wget-dirs: run wget-all in specified directories | ||
| + | # invoking: wget-dirs <path-to-directory> ... | ||
| − | + | for dir in $*; do | |
| − | + | pushd ${dir} | |
| + | wget-all | ||
| + | popd | ||
| + | done | ||
| + | wget-all | ||
| + | </pre> | ||
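
The trailing <code>wget-all</code> call processes the current directory as well. Invocation sketch (the directory names are just examples):
 $ wget-dirs ~/downloads/src ~/downloads/iso
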
==See also==
*[[curl]]
*[http://wput.sourceforge.net/ wput] — a tiny wget-like FTP client for uploading files
*[[rsync]]
*[[axel]]
*[http://prozilla.genesys.ro/ prozilla]

==External links==
*[http://www.gnu.org/software/wget/manual/ GNU Wget Manual] — last update: 15-Jun-2005
*[http://lifehacker.com/161202/geek-to-live--mastering-wget Geek to Live: Mastering Wget] — via lifehacker.com
*[http://www.cyberciti.biz/nixcraft/vivek/blogger/2005/06/linux-wget-your-ultimate-command-line.php wget: your ultimate command line downloader]
| − | |||
[[Category:Linux Command Line Tools]] | [[Category:Linux Command Line Tools]] | ||