I am new to wget, so let's get straight to the question. I want to download all images from a website directory. The directory contains no index file. The image names follow a pattern like ABCXXXX, where XXXX is any four-digit number. How do I download all the images under that directory?
I've tried
wget -p http://www.example.com
but it's downloading an index.html file instead of multiple images.
Using wget:
wget -r -A "*.jpg" http://example.com/images/
Using cURL:
curl "http://example.com/images/ABC[0000-9999].jpg" -o "ABC#1.jpg"
According to man curl:
You can specify multiple URLs or parts of URLs by writing part sets
within braces as in:
http://site.{one,two,three}.com
or you can get sequences of alphanumeric series by using [] as in:
ftp://ftp.numericals.com/file[1-100].txt
ftp://ftp.numericals.com/file[001-100].txt (with leading zeros)
ftp://ftp.letters.com/file[a-z].txt
And explanation for #1:
-o, --output <file>
Write output to <file> instead of stdout. If you are using {} or [] to
fetch multiple documents, you can use '#' followed by a number in the
<file> specifier. That variable will be replaced with the current
string for the URL being fetched. Like in:
curl http://{one,two}.site.com -o "file_#1.txt"
or use several variables like:
curl http://{site,host}.host[1-5].com -o "#1_#2"
You may use this option as many times as the number of URLs you have.
See also the --create-dirs option to create the local directories
dynamically. Specifying the output as '-' (a single dash) will force
the output to be done to stdout.
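wget itself does not understand curl-style [] ranges, but the shell can generate the number sequence for you. Here is a minimal sketch, assuming the images live under http://example.com/images/ with a .jpg extension (swap in the real host and extension) and that your bash is version 4 or newer so the brace range keeps its leading zeros:
# generate ABC0000.jpg .. ABC9999.jpg and feed the URLs to wget on stdin
for i in {0000..9999}; do
  echo "http://example.com/images/ABC${i}.jpg"
done | wget -i -
Numbers that don't exist on the server just come back as 404s, so gaps in the numbering are harmless.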
Related
I have a list of url paths saved as, say, listOfFiles.txt containing the following:
https://domain.path/level1/level2/name-of-file-01/index.html
https://domain.path/level1/level2/name-of-file-02/index.html
https://domain.path/level1/level2/name-of-file-03/index.html
...
where the name-of-file-xx has no pattern. For example,
https://domain.path/level1/level2/cR2xcet/index.html
https://domain.path/level1/level2/fse4scx/index.html
...
Question: Using wget, how do you download each index.html here and save it under its name-of-file-xx name?
EDIT: What other options/arguments do we need to add to the following command to solve this problem?
wget -np -l1 -i listOfFiles.txt
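One possible approach, sketched under the assumption that you want each page saved as <name-of-file-xx>.html in the current directory: skip -i and drive wget from a loop, deriving each output name from the directory component of the URL.
# read each URL and save its index.html under the name of its parent directory
while read -r url; do
  name=$(basename "$(dirname "$url")")   # e.g. cR2xcet
  wget -O "${name}.html" "$url"
done < listOfFiles.txt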
I have a text file C:\folder\filelist.txt containing a list of numbers, for example:
345651
342679
344000
349080
I want to append the URL as shown below, download only the files that are >1000KB, and strip the parameters after "-a1" from the filename, for example:
URL                                                           Size      Output File
https://some.thing.com/gab/abc-345651-def-a1?scl=1&fmt=jpeg  1024kb    C:\folder\abc-345651-def-a1.jpeg
https://some.thing.com/gab/abc-342679-def-a1?scl=1&fmt=jpeg  3201kb    C:\folder\abc-342679-def-a1.jpeg
https://some.thing.com/gab/abc-342679-def-a1?scl=1&fmt=jpeg  644kb     -
https://some.thing.com/gab/abc-349080-def-a1?scl=1&fmt=jpeg  2312kb    C:\folder\abc-349080-def-a1.jpeg
This is the code I currently have, which works for downloading the files and appending the .jpeg extension, given the full URL is in the text file. It does not filter out the smaller images or strip the parameters following "-a1".
cd C:\folder\
wget --adjust-extension --content-disposition -i C:\folder\filelist.txt
I'm running Windows and I'm a beginner at writing batch scripts. The most important thing I'm trying to accomplish is to avoid downloading images smaller than 1000 KB; it would be acceptable if I had to manually append the URL in the text file and rename the files after the fact. Is it possible to do what I'm trying to do? I've tried modifying the script by referencing the posts below, but I can't seem to get it to work. Thanks in advance!
Wget images larger than x kb
Downloading pdf files with wget. (characters after file extension?)
Spider a Website and Return URLs Only
#change working directory
cd /c/folder/
#convert input file list to unix
dos2unix filelist.txt
while read -r image; do
    imageURL="https://some.thing.com/gab/abc-$image-def-a1?scl=1&fmt=jpeg"
    # read the Content-Length from the response headers in wget's debug output (size in bytes)
    size=$(wget -d -qO- "$imageURL" 2>&1 | grep 'Content-Length' | awk '{print $2}')
    # only keep images larger than 1000 KB (1024000 bytes)
    if [[ $size -gt 1024000 ]]; then
        imgname="/c/folder/abc-$image-def-a1.jpeg"
        wget -O "$imgname" "$imageURL"
    fi
done < filelist.txt
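One thing to watch in the script above: wget -d -qO- fetches the whole image just to read its size, so large files end up being transferred twice. A possible alternative for the size check, assuming the server reports a Content-Length header for these URLs, is to request only the headers:
# read just the Content-Length header instead of downloading the body
size=$(wget --spider --server-response "$imageURL" 2>&1 | awk '/Content-Length/ {print $2; exit}')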
I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.
Main page with list of accessions: http://signal.salk.edu/atg1001/download.php
Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]
I am only interested in the quality_variant_[accession] link (not the quality_variant_filtered_[accession] link), which takes you to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)
Running the command below, the files of interest are eventually listed (but not downloaded, because of the --spider argument), demonstrating that wget can follow the page's hyperlinks to the files I want.
wget --spider --recursive "http://signal.salk.edu/atg1001/download.php"
I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.
# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"
However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.
# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"
# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.
I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.
Thank you!
Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.
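Since download.php is the only page that links out to the accession pages, one possible approach (just a sketch, not a tested one-liner) is to scrape the accession IDs out of download.php and build the quality_variant URLs directly. The grep pattern below is an assumption about how the id= parameters appear in that page's HTML, so check it against the actual source:
# pull accession IDs out of download.php and fetch the matching quality_variant files
wget -qO- "http://signal.salk.edu/atg1001/download.php" \
  | grep -oE 'id=[A-Za-z0-9_-]+' \
  | sed 's/^id=//' \
  | sort -u \
  | while read -r acc; do
      wget "http://signal.salk.edu/atg1001/data/Salk/quality_variant_${acc}.txt"
    done
This also stays reasonably dynamic: rerunning it picks up any accession IDs added to download.php later.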
I want to download a particular section of a website. I am following wget - Download a sub directory. The problem is that this section of the site does not have a single URL: the URLs look like http://grephysics.net/ans/0177/*, where * is a number from 1 to 100, and I can't just point wget at http://grephysics.net/ans/0177. How do I download these 100 webpages so that they link to each other (i.e. the Previous and Next buttons should point to the local copies)?
I think this is what you need:
wget -p -k http://grephysics.net/ans/0177/{1..100}
Explanation:
-k : rewrites links to point to local assets
-p : get all images, js, css, etc. needed to display the page
{1..100} : this specifies a range of urls to download, in your case we have pages labelled 1 to 100.
Why didn't recursive downloading work?
The link you posted was a good first resource, and probably what most people would want. But the way wget downloads recursively is by getting the first page specified (i.e. the root) and then following links to child pages. The way grephysics is set up, however, http://grephysics.net/ans/0177 leads to a 404, so there are no links for wget to follow to download child pages.
If your wget doesn't support {}
You can still have the same results by using the following command:
for i in {1..100}; do echo $i; done | wget -p -k -B http://grephysics.net/ans/0177/ -i -
Explanation
for i in {1..100};... : This prints the values 1 to 100.
| : For anyone who hasn't seen this, we are piping the output of the previous command into the input of the following command
-p : get all images, js, css, etc. needed to display the page
-k : rewrite the links to point to the local copies
-B : specifies the base URL to use with the -i option
-i : reads a list of urls to fetch from a file. Since we specified the 'file' - it reads from stdin.
So, we read in the values 1 to 100 and append them to our base url
http://grephysics.net/ans/0177/ and fetch all of those urls and all the assets that go with them, then rewrite links so we can browse offline.
I'm trying to retrieve working webpages with wget and this goes well for most sites with the following command:
wget -p -k http://www.example.com
In these cases I will end up with index.html and the needed CSS/JS etc.
HOWEVER, in certain situations the url will have a query string and in those cases I get an index.html with the query string appended.
Example
www.onlinetechvision.com/?p=566
Combined with the above wget command will result in:
index.html?page=566
I have tried using the --restrict-file-names=windows option, but that only gets me to
index.html#page=566
Can anyone explain why this is needed and how I can end up with a regular index.html file?
UPDATE: I'm sort of on the fence on taking a different approach. I found out I can take the first filename that wget saves by parsing the output. So the name that appears after Saving to: is the one I need.
However, the name is wrapped in a strange character, â. Rather than just stripping that character with a hardcoded fix, where does it come from?
If you try the "--adjust-extension" parameter
wget -p -k --adjust-extension www.onlinetechvision.com/?p=566
you come closer. In the www.onlinetechvision.com folder there will be a file with the corrected extension: index.html#p=566.html (or index.html?p=566.html on *NIX systems). It is now simple to rename that file to index.html, even with a script.
If you are on a Microsoft OS, make sure you have a recent version of wget; it is also available here: https://eternallybored.org/misc/wget/
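For the renaming step mentioned above, something along these lines should work once you have checked which filename wget actually produced on your platform (the name below is just the example from this question):
# rename the query-string variant back to a plain index.html
mv 'index.html?p=566.html' index.html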
To answer your question about why this is needed, remember that the web server is likely to return different results based on the parameters in the query string. If a query for index.html?page=52 returns different results from index.html?page=53, you probably wouldn't want both pages to be saved in the same file.
Each HTTP request that uses a different set of query parameters is quite literally a request for a distinct resource. wget can't predict which of these changes is and isn't going to be significant, so it's doing the conservative thing and preserving the query parameter URLs in the filename of the local document.
My solution is to do recursive crawling outside wget:
get the directory structure with wget (no files downloaded)
loop through each directory to get its main entry file (index.html)
This works well with WordPress sites, though it could miss some pages.
#!/bin/bash
#
# get directory structure
#
wget --spider -r --no-parent http://<site>/
#
# loop through each dir
#
find . -mindepth 1 -maxdepth 10 -type d | cut -c 3- > ./dir_list.txt
while read -r line; do
wget --wait=5 --tries=20 --page-requisites --html-extension --convert-links --execute=robots=off --domains=<domain> --strict-comments "http://${line}/"
done < ./dir_list.txt
The query string is required because of the website's design: the site uses the same standard index.html for all content and then uses the query string to pull the content in from another page, for example with a script on the server side (it may be client side if you look at the JavaScript).
Have you tried using --no-cookies? The site could be storing this information in a cookie and pulling it when you hit the page. It could also be caused by URL rewrite logic, which you will have little control over from the client side.
Use the -O or --output-document option. See http://www.electrictoolbox.com/wget-save-different-filename/
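For this particular page that would look something like the command below, which forces the saved filename regardless of the query string (note that -O names a single download, so it does not combine well with -p/-k link conversion):
wget -O index.html "http://www.onlinetechvision.com/?p=566"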