Wget: Filenames without the query string

I want to download a list of webpages from a file. How can I stop Wget appending the query string to the saved files?
wget http://www.example.com/index.html?querystring
I need this to be downloaded as index.html, not index.html?querystring

There is the -O option:
wget -O file.html http://www.example.com/index.html?querystring
so you can alter your script a little to pass the right file name to the -O option.

I've finally resigned to using the -O and just wrapped it in a bash function to make it easier. I put this in my ~/.bashrc file:
wget-rmq ()
{
    # require a URL argument
    [ -z "$1" ] && { echo 'error: wget-rmq requires a URL to retrieve as the first arg'; return 1; }
    # strip the query string, then everything up to the last slash
    local output_filename="$(echo "$1" | sed 's/?.*//' | sed 's|.*/||')"
    wget -O "${output_filename}" "${1}"
}
Then when I want to download a file:
wget-rmq http://www.example.com/index.html?querystring
The replacement regex is fairly simple. If any ?s appeared in the URL before the query string began, it would break. In practice that hasn't happened, since a literal ? elsewhere in a URL has to be percent-encoded as %3F, but I wanted to note the possibility.
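If you'd rather not shell out to sed, the same trimming can be done with bash parameter expansion alone. A minimal sketch of an equivalent function:
wget-rmq ()
{
    [ -z "$1" ] && { echo 'error: wget-rmq requires a URL to retrieve as the first arg'; return 1; }
    local trimmed="${1%%\?*}"        # drop the query string (everything from the first ?)
    wget -O "${trimmed##*/}" "$1"    # keep only what follows the last slash
}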

Related

Downloading multiple files using wget, with renaming of files

I am aware that you can download from multiple URLs using:
wget "url1" "url2" "url3"
Renaming the output file can be done via:
wget "url1" -O "new_name1"
But when I tried
wget "url1" "url2" "url3" -O "name1" "name2" "name3"
all the files are using name1.
What is the proper way to do this in a single command?
Yes, something like this: you can add a file name next to each URL in the file, then do:
while read -r url fileName; do
    wget -O "$fileName" "$url"
done < list
where it is assumed you have added a (unique) file name after each URL in the file (separated by a space).
The -O option allows you to specify the destination file name, but if you're downloading multiple files at once, wget will concatenate all of them into the single file you specify via -O. Note that in either case the file will be truncated if it already exists. See the man page for more info.
You can exploit this option by telling wget to download the links one-by-one:
while IFS= read -r url; do
    fileName="blah" # Add a rule to define a new name for each file here
    wget -O "$fileName" "$url"
done < list
Hope it's useful.
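As a concrete example of such a rule, here is a sketch that names each file after the last path component of its URL, with any query string stripped (the list file name follows the loop above):
while IFS= read -r url; do
    # derive the local name: drop any query string, keep what follows the last slash
    fileName="$(basename "${url%%\?*}")"
    wget -O "$fileName" "$url"
done < list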

wget - work with arguments

I have a list of URIs in uri.txt:
category1/image1.jpeg
category1/image32.jpeg
category2/image1.jpeg
and so on, and I need to download them from the domain example.com with wget, renaming each file on save to categoryX-imageY.jpeg.
I understand that I should read uri.txt line by line, add "http://example.com/" in front of each line, and change "/" to "-" in each line.
What I have now:
Reading from uri.txt [work]
Adding domain name in front of each URI [work]
Change filename to save [fail]
I'm trying to do this with:
wget 'http://www.example.com/{}' -O '`sed "s/\//-/" {}`' < uri.txt
but wget fails (depending on which kind of quote I use: ` or ') with:
wget: option requires an argument -- 'O'
or
sed `s/\//-/` category1/image1.jpeg: No such file or directory
sed `s/\//-/` category1/image32.jpeg: No such file or directory
Could you tell me what I'm doing wrong?
Here is how I would do that:
while read -r LINE; do
    wget "http://example.com/$LINE" -O "$(echo "$LINE" | sed 's=/=-=')"
done < uri.txt
In other words, read uri.txt line by line (the text being placed in the $LINE bash variable), then perform the wget, saving under the modified name (I use a different sed delimiter to avoid escaping the / and to make it more readable).
When I want to construct a list of args to be executed, I like to use xargs:
cat uri.txt | sed "s#\(.*\)/\(.*\)#http://example.com/\1/\2 -O \1-\2#" | xargs -L1 wget
The -L1 flag makes xargs run one wget per input line, splitting the line into separate arguments so that the URL and the -O option are passed individually.

How to force wget to overwrite an existing file ignoring timestamp?

I tried '-N' and '--no-clobber', but the only result I get is a new copy of the existing example.exe with a number appended, using the syntax 'example.exe.1'. This is not what I'd like to get. I just need to download and overwrite example.exe in the same folder where I already saved a copy, without wget checking whether mine is older or newer than the example.exe file already present in my download folder. Do you think this is possible, or do I need to create a script that deletes the example.exe file, or maybe something that changes its modification date, etc.?
If you specify the output file using the -O option it will overwrite any existing file.
For example:
wget -O index.html bbc.co.uk
Run it multiple times and it will keep overwriting index.html.
wget doesn't let you overwrite an existing file unless you explicitly name the output file on the command line with option -O.
I'm a bit lazy and I don't want to type the output file name on the command line when it is already known from the downloaded file. Therefore, I use curl like this:
curl -O http://ftp.vim.org/vim/runtime/spell/fr.utf-8.spl
Be careful when downloading files like this from unsafe sites. The above command will write a file named as the connected web site wishes to name it (inside the current directory though). The final name may be hidden through redirections and php scripts or be obfuscated in the URL. You might end up overwriting a file you don't want to overwrite.
And if you ever find a file named ls or any other enticing name in the current directory after using curl that way, refrain from executing the downloaded file. It may be a trojan downloaded from a rogue or corrupted web site!
wget --backups=1 google.com
It renames the original file with a .1 suffix and writes the new file to the intended filename.
Not exactly what was requested, but it could be handy in some cases.
-c or --continue
From the manual:
If you use ‘-c’ on a non-empty file, and the server does not support continued downloading, Wget will restart the download from scratch and overwrite the existing file entirely.
I like the -c option. I started with the man page, then the web, but I've searched for this several times. For example, if you're relaying a webcam, the image always needs to be named image.jpg. It seems like this should be clearer in the man page.
I've been using this for a couple of years to download things in the background, sometimes combined with "limit-rate = " in my wgetrc file:
while true
do
    wget -c -i url.txt && break
    echo "Restarting wget"
    sleep 2
done
Make a little file called url.txt and paste the file's URL into it. Set this script up in your path or maybe as an alias and run it. It keeps retrying the download until there's no error. Sometimes at the end it gets into a loop displaying
416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
but that's harmless, just ctrl-c it. I think it's always gotten the file I wanted even if wget runs out of retries or the connection temporarily goes away. I've downloaded things for days at a time with it. A CD image on dialup, yes, always with wget.
My use case involves two different URLs; sometimes the second one doesn't exist, but if it DOES exist, I want it to overwrite the first file.
The problem with using wget -O is that, when the second file DOESN'T exist, it will overwrite the first file with a BLANK file.
So the only way I could find is with an if statement:
--spider checks if a file exists, and returns 0 if it does
--quiet fails quietly, with no output
-nv is quiet, but still reports errors
wget -nv https://example.com/files/file01.png -O file01.png
# quietly check if a different version exists
wget --quiet --spider https://example.com/custom-files/file01.png
if [ $? -eq 0 ] ; then
# A different version exists, so download and overwrite the first
wget -nv https://example.com/custom-files/file01.png -O file01.png
fi
It's verbose, but I found it necessary. I hope this is helpful for someone.
Here is an easy way to get it done with shell parameter expansion:
url=https://example.com/example.exe ; wget -nv "$url" -O "${url##*/}"
Or you can use basename
url=https://example.com/example.exe ; wget -nv "$url" -O "$(basename "$url")"
For those who do not want to use -O and want to specify the output directory only, the following command can be used.
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"
The first command will download from the source with wget; the second will remove the older file:
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"; \
rm -f "$file.1";
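For context, $dest, $link, and $file are left undefined in the answer. A minimal sketch with placeholder values of mine, where $file is the path wget saves to:
dest="$HOME/downloads"                  # placeholder destination directory
link="https://example.com/example.exe"  # placeholder URL
file="$dest/${link##*/}"                # the path wget will save to
wget --directory-prefix "$dest" --backups 0 -- "$link"
rm -f "$file.1"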

How to use the -O flag in wget with -i?

I understand that the -i flag takes a file (which may contain a list of URLs), and I know that -O followed by a name can be specified to rename an item being downloaded using wget.
example:
wget -i list_of_urls.txt
wget -O my_custom_name.mp3 http://example.com/some_file.mp3
I have a file that looks like this:
file name: list_of_urls.txt
http://example.com/some_file.mp3
http://example.com/another_file.mp3
http://example.com/yet_another_file.mp3
I want to use wget to download these files with the -i flag but also save each file as 1.mp3, 2.mp3 and so on.
Can this be done?
You can use any scripting language (PHP or Python) to generate a batch file. In this batch file, each line would run wget with the URL and the -O option.
Or you can try writing a loop in a bash script, as sketched below.
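A minimal sketch of such a loop, numbering the downloads 1.mp3, 2.mp3, and so on (assuming list_of_urls.txt holds one URL per line, as in the question):
n=1
while IFS= read -r url; do
    wget -O "${n}.mp3" "$url"
    n=$((n + 1))
done < list_of_urls.txt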
I ran a web search again and found https://superuser.com/questions/336669/downloading-multiple-files-and-specifying-output-filenames-with-wget
Wget can't seem to do it, but curl can with the -K flag; the file supplied can contain the URL and output name. See http://curl.haxx.se/docs/manpage.html#-K
If you are willing to use some shell scripting then https://unix.stackexchange.com/questions/61132/how-do-i-use-wget-with-a-list-of-urls-and-their-corresponding-output-files has the answer.
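For the curl -K route, the config file might look like this (a sketch; the URLs and output names follow the question, and the config file name urls.cfg is my placeholder):
# urls.cfg -- one url/output pair per download
url = "http://example.com/some_file.mp3"
output = "1.mp3"
url = "http://example.com/another_file.mp3"
output = "2.mp3"
url = "http://example.com/yet_another_file.mp3"
output = "3.mp3"
Then run: curl -K urls.cfg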

Appending URL to output file with wget

I'm using wget to read a batch of URLs from an input file and download everything to a single output file, and I'd like to append each URL before its downloaded content. Does anyone know how to do that?
Thanks!
AFAIK wget does not directly support the use case you are envisioning. However, using standard tools, you can emulate this feature.
We will proceed as follows:
call wget with logging enabled
let sed process the log executing the script detailed below
execute the transformation result as a shell/batch script
Conventions: use the following filenames:
wgetin.txt: the file with the urls to fetch using wget
wgetout.sed: sed script
wgetout.final: the final result
wgetass.sh/.cmd: shell/batch script to assemble the downloaded files weaving in the url data
wget.log: the log file of the wget call
Linux
The sed script (Linux):
# delete lines _not_ matching the regex
/^\(Saving to: .\|--[0-9: \-]\+-- \)/! { d; }
# turn remaining content into something else
s/^--[0-9: \-]\+-- \(.*\)$/echo '\1\n' >>wgetout.final/
s/^Saving to: .\(.*\).$/cat '\1' >>wgetout.final/
The command line (Linux):
rm -f wgetout.final wgetass.sh; wget -i wgetin.txt -o wget.log; sed -f wgetout.sed wget.log >wgetass.sh; chmod 755 wgetass.sh; ./wgetass.sh
Windows
The syntax for Windows batch scripts is slightly different. Of course, the Windows ports of wget and sed have to be installed first.
The sed script (Windows):
# delete lines _not_ matching the regex
/^\(Saving to: .\|--[0-9: \-]\+-- \)/! { d; }
# turn remaining content into something else
s/^--[0-9: \-]\+-- \(.*\)$/echo "\1" >>wgetout.final/
s/^Saving to: .\(.*\).$/type "\1" >>wgetout.final/
The command line (Windows):
del wgetout.final && del wgetass.cmd && wget -i wgetin.txt -o wget.log && sed -f wgetout.sed wget.log >wgetass.cmd && wgetass.cmd
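On Linux, a plain shell loop gets much the same result with less machinery. A minimal sketch, assuming the same wgetin.txt and wgetout.final names as above:
while IFS= read -r url; do
    printf '%s\n' "$url" >> wgetout.final    # write the URL first
    wget -q -O - "$url" >> wgetout.final     # then append its downloaded content
done < wgetin.txt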