wget: Download image files from URL list that are >1000KB & strip URL parameters from filename

I have a text file C:\folder\filelist.txt containing a list of numbers, for example:
 
345651
342679
344000
349080
I want to append the URL as shown below, download only the files that are >1000KB, and strip the parameters after "-a1" from the filename, for example:
URL                                                           Size     Output File
https://some.thing.com/gab/abc-345651-def-a1?scl=1&fmt=jpeg   1024kb   C:\folder\abc-345651-def-a1.jpeg
https://some.thing.com/gab/abc-342679-def-a1?scl=1&fmt=jpeg   3201kb   C:\folder\abc-342679-def-a1.jpeg
https://some.thing.com/gab/abc-344000-def-a1?scl=1&fmt=jpeg   644kb    -
https://some.thing.com/gab/abc-349080-def-a1?scl=1&fmt=jpeg   2312kb   C:\folder\abc-349080-def-a1.jpeg
This is the code I currently have, which works for downloading the files and appending the .jpeg extension, provided the full URLs are in the text file. It does not filter out the smaller images or strip the parameters following "-a1".
cd C:\folder\
wget --adjust-extension --content-disposition -i C:\folder\filelist.txt
I'm running Windows and I'm a beginner at writing batch scripts. The most important thing I'm trying to accomplish is to avoid downloading images <1000kb; it would be acceptable if I had to manually append the URL in the text file and rename the files after the fact. Is it possible to do what I'm trying to do? I've tried modifying the script by referencing the posts below, but I can't seem to get it to work. Thanks in advance!
Wget images larger than x kb
Downloading pdf files with wget. (characters after file extension?)
Spider a Website and Return URLs Only

#!/bin/bash
# Change to the working directory (MSYS/Git Bash-style path for C:\folder\)
cd /c/folder/ || exit 1
# Convert the input file list to Unix line endings
dos2unix filelist.txt
# Loop over each image number in the list
while read -r image; do
    imageURL="https://some.thing.com/gab/abc-${image}-def-a1?scl=1&fmt=jpeg"
    # Pull the Content-Length header out of wget's debug output (the body itself is discarded by grep)
    size=$(wget -d -qO- "$imageURL" 2>&1 | grep 'Content-Length' | awk '{print $2}')
    # Only keep images larger than 1000 KB (1000 * 1024 = 1024000 bytes)
    if [[ $size -gt 1024000 ]]; then
        imgname="/c/folder/abc-${image}-def-a1.jpeg"
        wget -O "$imgname" "$imageURL"
    fi
done < filelist.txt
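A note on the size check: the loop above downloads each image in full just to read its Content-Length, so every kept file is fetched twice. If the server reports a Content-Length for header-only requests (not guaranteed), a --spider check could avoid that; a sketch:
# Header-only size check (assumption: the server sends Content-Length for spider/HEAD requests)
size=$(wget --spider --server-response "$imageURL" 2>&1 \
       | grep -i 'Content-Length' | awk '{print $2}' | tr -d '\r' | tail -n 1)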

Related

wget download file to filename using the another level of the url path

I have a list of url paths saved as, say, listOfFiles.txt containing the following:
https://domain.path/level1/level2/name-of-file-01/index.html
https://domain.path/level1/level2/name-of-file-02/index.html
https://domain.path/level1/level2/name-of-file-03/index.html
...
where the name-of-file-xx has no pattern. For example,
https://domain.path/level1/level2/cR2xcet/index.html
https://domain.path/level1/level2/fse4scx/index.html
...
Question: How do you download each index.html here and save it under its name-of-file-xx name, using wget?
EDIT: What other options/arguments do we need to add in the following code to solve this problem?
wget -np -l1 -i listOfFiles.txt
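One way to approach this, assuming the list really is one URL per line ending in /index.html, is to loop over the file and derive each output name from the second-to-last path component rather than relying on wget's default naming; a rough sketch:
#!/bin/bash
# Sketch: save each index.html under its parent directory's name (e.g. cR2xcet.html).
# Assumes listOfFiles.txt holds one URL per line, each ending in /index.html.
while IFS= read -r url; do
    parent="${url%/*}"        # strip the trailing /index.html
    name="${parent##*/}"      # keep the last remaining component, e.g. cR2xcet
    wget -O "${name}.html" "$url"
done < listOfFiles.txt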

Using wget to recursively fetch .txt files in .php file, but filters break the command

I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.
Main page with list of accessions: http://signal.salk.edu/atg1001/download.php
Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]
I am only interested in the quality_variant_[accession] link (not the quality_variant_filtered_[accession] link), which takes you to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)
Running the command below, the files of interest are eventually outputted (but not downloaded because of the --spider argument), demonstrating that wget can move through the page's hyperlinks to the files I want.
wget --spider --recursive "http://signal.salk.edu/atg1001/download.php"
I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.
# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"
However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.
# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"
# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.
I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.
Thank you!
Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.
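One fallback, already hinted at in the question, is to scrape the accession IDs from download.php and loop over them, which still picks up new accessions as they are added; a rough sketch, assuming every accession is linked on that page as accession.php?id=<ID>:
#!/bin/bash
# Sketch: extract accession IDs from download.php, then fetch each quality_variant_<ID>.txt.
# Assumes the download page links every accession as accession.php?id=<ID>.
wget -qO- "http://signal.salk.edu/atg1001/download.php" \
  | grep -o 'accession\.php?id=[A-Za-z0-9_-]*' \
  | sed 's/.*id=//' \
  | sort -u \
  | while read -r id; do
        wget "http://signal.salk.edu/atg1001/data/Salk/quality_variant_${id}.txt"
    done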

Rename downloaded files with Wget -i

I am trying to bulk-download images from URLs listed in a text file.
The command I am using is
wget -i linksfile.txt
The url structure of images in linksfile.txt is like below
www.domainname.com/197507/1-foto-000.jpg?20180711125016
www.domainname.com/197507/2-foto-000.jpg?20180711125030
www.domainname.com/197507/3-foto-000.jpg?20180711125044
www.domainname.com/197507/4-foto-000.jpg?20180711125059
Downloaded images are being saved with filenames like
1-foto-000.jpg?20180711125016
2-foto-000.jpg?20180711125030
3-foto-000.jpg?20180711125044
4-foto-000.jpg?20180711125059
How can I omit all the text after .jpg? I want the file names to be saved as
1-foto-000.jpg
2-foto-000.jpg
3-foto-000.jpg
4-foto-000.jpg
and, if possible, can the filenames be saved as
197507-1-foto-000.jpg
197507-2-foto-000.jpg
197507-3-foto-000.jpg
197507-4-foto-000.jpg
197507 is the folder name where the images are hosted on the server.
I have read tutorials on renaming files, but most of them focus on downloading a single file and using wget -O to change its name. Is there a way to do this in the scenario above?
Maybe --content-disposition would do the trick.
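If the server does not send a Content-Disposition header, a small loop over the link file can both drop the query string and prefix the folder name; a rough sketch, assuming linksfile.txt contains lines exactly like the sample above:
#!/bin/bash
# Sketch: download each URL, strip the ?timestamp, and prefix the folder name.
# Assumes every line looks like www.domainname.com/<folder>/<name>.jpg?<timestamp>
while IFS= read -r url; do
    clean="${url%%\?*}"       # drop the query string, e.g. .../1-foto-000.jpg
    file="${clean##*/}"       # e.g. 1-foto-000.jpg
    path="${clean%/*}"        # e.g. www.domainname.com/197507
    folder="${path##*/}"      # e.g. 197507
    wget -O "${folder}-${file}" "$url"
done < linksfile.txt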

How to download all images from a directory using wget?

I am new to wget. Let's get straight to the question: I want to download all images from a website directory. The directory contains no index file. The image names follow a pattern like ABCXXXX, where XXXX is any four-digit number. So how do I download all images under the directory?
I've tried
wget -p http://www.example.com
but it's downloading an index.html file instead of multiple images.
Using wget:
wget -r -A "*.jpg" http://example.com/images/
Using cURL:
curl "http://example.com/images/ABC[0000-9999].jpg" -o "ABC#1.jpg"
According to man curl:
You can specify multiple URLs or parts of URLs by writing part sets
within braces as in:
http://site.{one,two,three}.com
or you can get sequences of alphanumeric series by using [] as in:
ftp://ftp.numericals.com/file[1-100].txt
ftp://ftp.numericals.com/file[001-100].txt (with leading zeros)
ftp://ftp.letters.com/file[a-z].txt
And explanation for #1:
-o, --output <file>
Write output to <file> instead of stdout. If you are using {} or [] to
fetch multiple documents, you can use '#' followed by a number in the
<file> specifier. That variable will be replaced with the current
string for the URL being fetched. Like in:
curl http://{one,two}.site.com -o "file_#1.txt"
or use several variables like:
curl http://{site,host}.host[1-5].com -o "#1_#2"
You may use this option as many times as the number of URLs you have.
See also the --create-dirs option to create the local directories
dynamically. Specifying the output as '-' (a single dash) will force
the output to be done to stdout.

using wget to overwrite file but use temporary filename until full file is received, then rename

I'm using wget in a cron job to fetch a .jpg file into a web server folder once per minute (with the same filename each time, overwriting). This folder is "live", in that the web server also serves that image from there. However, if someone browses to that page while the image is being fetched, the browser treats it as a corrupt JPEG and says so. So what I need is similar to how Firefox downloads a file: wget should write to a temporary file, either in /var or in the destination folder but with a temporary name, until it has the whole thing, then rename it in an atomic (or at least negligible-duration) step.
I've read the wget man page and there doesn't seem to be a command line option for this. Have I missed it? Or do I need to do two commands in my cron job, a wget and a move?
There is no way to do this purely with GNU Wget.
wget's job is to download files, and it does that. A simple one-line script can achieve what you're looking for:
$ wget -O myfile.jpg.tmp example.com/myfile.jpg && mv myfile.jpg{.tmp,}
Since mv is atomic, at least on Linux, you get an atomic update to the fully downloaded file.
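In a crontab the same idea might look like the line below (the paths and source URL are placeholders):
# Fetch once per minute to a temp name, then swap it in atomically
* * * * * wget -q -O /var/www/html/live.jpg.tmp https://example.com/camera.jpg && mv /var/www/html/live.jpg.tmp /var/www/html/live.jpg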
Just wanted to share my solution:
alias wget='func(){ (wget --tries=0 --retry-connrefused --timeout=30 -O download_pkg.tmp "$1" && mv download_pkg.tmp "${1##*/}") || rm download_pkg.tmp; unset -f func; }; func'
It creates a function that receives a parameter "url" and downloads the file under a temporary name. If the download succeeds, the file is renamed to the correct filename, extracted from parameter $1 with ${1##*/}; if it fails, the temp file is deleted. If the operation is aborted, the temp file will be replaced on the next run. Finally, unset -f removes the function definition once the alias has executed.