Can we wget with file list and renaming destination files? - wget

I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=X will avoid the generation of the host's subdirectories.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directorty.
This works nice, no problem.
Now, to the issue:
Let's say my files-to-download.txt has these kind of files:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with --N, and I need that. I've tried to use -nd, but doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of url's the way I do now, keeping my parameters.
b.- get all files at the same directory and being able to rename each file.
Does anybody have any solution to this?
Thanks in advance.

I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, >Wget's behavior depends on a few options, including -nc. In certain >cases, the local file will be
clobbered, or overwritten, upon repeated download. In other >cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the >same file in the same directory will result in the original copy of file >being preserved and the second copy
being named file.1. If that file is downloaded yet again, the >third copy will be named file.2, and so on. (This is also the behavior >with -nd, even if -r or -p are in
effect.) When -nc is specified, this behavior is suppressed, >and Wget will refuse to download newer copies of file. Therefore, ""no->clobber"" is actually a misnomer in
this mode---it's not clobbering that's prevented (as the >numeric suffixes were already preventing clobbering), but rather the >multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, >re-downloading a file will result in the new copy simply overwriting the >old. Adding -nc will prevent this
behavior, instead causing the original version to be preserved >and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the >decision as to whether or not to download a newer copy of a file depends >on the local and remote timestamp and
size of the file. -nc may not be specified at the same time as >-N.
A combination with -O/--output-document is only accepted if the >given output file does not exist.
Note that when -nc is specified, files with the suffixes .html >or .htm will be loaded from the local disk and parsed as if they had been >retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However the logic is sound, and with a bit of modifications, you can get it to work on other platforms/scripts as well.
Sample pseudocode bash script -
for i in `cat list-of-files-to-download.txt`;
do
wget <all your flags except the -i flag> $i -O /path/to/custom/directory/filename ;
done ;
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check if the file exists on the disk, and then take a decision to rename the temp file to the name that you want.
This offers much more control over your downloads.

Related

What am I screwing up trying to download particular file types with wget?

I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.
I am able to download all of the files I need using wget -r -np -nd -e robots=off -l 0 URL but this leaves me with about 60,000 extra files to waste time both downloading and deleting.
I am really only looking for files with the extensions "tbt" and "zip". When I add in -A tbt,zip to the input, wget then only downloads a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the file type specified, and then the process stops entirely, with wget announcing that it is finished. It does not attempt to download any of the other files that it grabs when the -A flag is not included.
What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?
Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a links refers to a file or directory based on whether or not it ends with a /.
For example, say I have a directory named files, and a web page that has:
Lots o' files!
If I were to request this with wget -r, then I wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.
However, if I add -A zip to my command line, and run wget with --debug, I see:
appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.
In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.
If I modify the remote file so that it looks like...
Lots o' files!
...then wget will follow the link and download files as desired.
I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.
It's also possible you're experiencing a different issue; the output of adding --debug to your command line clarify things in that case.
I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match for both the linked file, and the downloaded file. So my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This catches all the .ashx that are linked in the webpage, then when wget does the second check, all the mp3 files that were downloaded will stay.
As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the file that gets downloaded from filedownload.ashx?name=file.mp3 to just file.mp3.

Compare file sizes and download if they're different via wget

I'm downloading some .mp3 files (all legal) via wget :
wget -r -nc files.myserver.com
I have to stop the download sometimes and at that times the file is partially downloaded. For example a 10 minutes record.mp3 file become 4 minutes record.mp3 file. It's playing correctly but incomplete.
If I use the same command above, because the record.mp3 file is already exist in my local computer wget skips that file although it isn't complete.
I wonder if there is a way to check the file sizes and if the file size in the remote server and local computer isn't same re-download it. (I've learned the --spider command gives the file size but is there any other command that automatically check the file sizes and download or not).
I would go with wget's -N option for timestamping, but note that wget will only compare the file sizes if you also specify the --no-if-modified-since option. Without it, incomplete files are indeed skipped on the next run because they receive a timestamp of the current time, which is newer than that on the server.
The reason is probably that with only -N, a GET request is sent for the file with the If-Modified-Since field set. The server responds with either 200 or 304, but the 304 doesn't contain the file size so wget can't check it.
With --no-if-modified-since wget sends a HEAD request instead to get the timestamp and file size, and checks both.
What I use for recursive download of a folder:
wget -T 300 -nv -t 1 -r -nd -np -l 1 -N --no-if-modified-since -P $my_folder $my_url
With:
-T 300: Set the network timeout to 300 seconds
-nv: Turn off verbose without being completely quiet
-t 1: Set number of tries to 1
-r: Turn on recursive retrieving
-nd: Do not create a hierarchy of directories when retrieving recursively
-np: Do not ever ascend to the parent directory when retrieving recursively
-l 1: Specify recursion maximum depth 1
-N: Turn on time-stamping
--no-if-modified-since: Do not send If-Modified-Since header in ‘-N’ mode, send preliminary HEAD request instead
You may try the -c option to continue the download of partially downloaded files, however the manual gives an explicit warning:
You need to be especially careful of this when using -c in conjunction
with -r, since every file will be considered as an "incomplete
download" candidate.
While there is no perfect solution to this problem you could try to use -N option to turn on timestamping. This might prevent errors when the file has changed on the server but only if the server supports timestamping and partial downloads. Try it and see how it goes.
wget -r -N -c files.myserver.com
If you need check if file was partially downloaded (has different size) or updated on remote server by timestamp and must be in this case updated locally you need use -N option.
Here some additional info about -N (--timestamping) option from Wget docs:
If the local file does not exist, or the sizes of the files do not match, Wget will download the remote file no matter what the
time-stamps say.
Added From: https://www.gnu.org/software/wget/manual/wget.html (Chapter: 5 Time-Stamping)

GNU make: Can I delete obsolete files?

In my workflow, I have lots of xxx.smr files in a folder and I need to convert them into other file format xxx_step3.mat by importing some data from xxx_info.xlsx. I learned that GNU make is powerful in keep all the files up-to-date.
In a very simple "explicit" format (without sophisticated wild card usage), Makefile for this process would look like this. To handle multiple xxx.smr files and their descendants, I should be able to do that by modifying this file.
.PHONY: all clean
all: xxx_step3.mat
xxx_step3.mat: xxx_step2.mat xxx_info.xlsx
matlab -r "merge2files('xxx_step2.mat', 'xxx_info.xlsx')"
xxx_step2.mat: xxx_step1.mat
matlab -r "convertmat('xxx_step1.mat')"
xxx_info.xlsx: master.xslx
matlab -r "extractfromMasterxlsx('master.xlsx', 'xxx_info.xlsx')"
xxx_step1.mat: xxx_step0.smr
#echo "\nCreate " $#
# I can't do this step from the command line so I leave message
clean:
rm -f xxx_step1.mat xxx_step2.mat xxx_step3.mat xxx_info.xlsx
However, I realized that, when some of xxx.smr files were found to be surplus and deleted at some point, running GNU make with this Makefile does not delete the obsolete descendant files, including all the intermediate files and the final xxx_step3.mat files, that are dependent on those deleted xxx.smr files.
For example, I start with the three xxx.smr files and run Make.
A.smr, B.smr, C.smr
It will create all the descendants, including the final target files:
A_step3.mat, B_step3.mat, C_step3.mat
Later, say, I find the B.smr contained a fatal error and decided to delete from the folder.
A.smr, C.smr
Running Make at this stage will result in ... no change, because both A_step3.mat and C_step3.mat are newer than its direct prerequisites (and than A.smr and C.smr). However, actually I need to remove all the descendants of B.smr, such as B_step1.mat, B_step2.mat, B_step3.mat, and B_info.xlsx. If those obsolete files are kept, the final target B_step3.mat will be included in the subsequent analyses and affect the results.
I wonder if there is a "smart" way of removing xxx_step1.mat, xxx_step2.mat, xxx_step3.mat, xxx_info.xlsx files, when their corresponding xxx.smr files have been deleted.
Or should I just implement this with MATLAB or Python etc?
Since a Makefile is a collection of shell commands, on your clean: target, you can collect and remove all the files that correspond to your xxx.smr files using a for loop and parameter expansion/substring matching. To find all files that correspond to each xxx.smr file, find all xxx.smr files. Then for each xxx.smr, extract xxx and remove all xxx_step?.* and xxx_info.* files. After each of the step? and info files are removed, then remove xxx.smr. In multi-line form it would look like:
for i in *.smr; do
for j in ${i%.*}; do
rm -f "${j}_step?.*" "${j}_info.*"
done
rm -f "$i"
done
Or, in a single line:
for i in *.smr; do for j in ${i%.*}; do rm -f "${j}_step?.*" "${j}_info.*"; done; rm -f "$i"; done
Note this will remove all xxx_step... and xxx_info... files for each xxx.smr file. Make sure this is what you intend and run on a test directory first. You can tighten the extensions above to just remove xxx_info.xlsx by replacing xxx_info.* with xxx_info.xlsx, etc...

How to make a file executable using Makefile

I want to copy a particular file using Makefile and then make this file executable. How can this be done?
The file I want to copy is a .pl file.
For copying I am using the general cp -rp command. This is done successfully. But now I want to make this file executable using Makefile
Its a bad practice to use cp and chmod, instead use install command.
all:
install -m 0777 hello ../hello
You can use -m option with install to set the permission mode, and even note that by using the install you will preserve not only the permission but also the owner of the file.
You can still use chmod accordingly but it would be a bad practice
all:
cp hello ../hello
chmod +x ../hello
Update: install vs cp
cp would simply copy files with current permissions, install not only copies, but also can change perms/ownership as arg flags. (This is what your requirement was)
One significant difference is that cp truncates the destination file and starts copying data from the source into the destination file. install, on the other hand, removes the destination file first.
This is significant because if the destination file is already in use, bad things could happen to whomever is using that file in case you cp a new file on top of it. e.g. overwriting an executable that is running might fail. Truncating a data file that an existing process is busy reading/writing to could cause pretty weird behavior. If you just remove the destination file first, as install does, things continue much like normal - the removed file isn't actually removed until all processes close that file.[source]
For more details check these,
install vs. cp; and mmap
How is install -c different from cp

using wget to overwrite file but use temporary filename until full file is received, then rename

I'm using wget in a cron job to fetch a .jpg file into a web server folder once per minute (with same filename each time, overwriting). This folder is "live" in that the web server also serves that image from there. However if someone web-browses to that page during the time the image is being fetched, it is considered a jpg with errors and says so in the browser. So what I need to do is, similar to when Firefox is downloading a file, wget should write to a temporary file, either in /var or in the destination folder but with a temporary name, until it has the whole thing, then rename in an atomic (or at least negligible-duration) step.
I've read the wget man page and there doesn't seem to be a command line option for this. Have I missed it? Or do I need to do two commands in my cron job, a wget and a move?
There is no way to do this purely with GNU Wget.
wget's job is to download files and it does that. A simple one line script can achieve what you're looking for:
$ wget -O myfile.jpg.tmp example.com/myfile.jpg && mv myfile.jpg{.tmp,}
Since mv is atomic, atleast on Linux, you get the atomic update of a ready file.
Just wanted to share my solution:
alias wget='func(){ (wget --tries=0 --retry-connrefused --timeout=30 -O download_pkg.tmp "$1" && mv download_pkg.tmp "${1##*/}") || rm download_pkg.tmp; unset -f func; }; func
it creates a function that receives a parameter "url" to download the file to a temporary name. If it is successful, it is renamed to the correct filename extracted from parameter $1 with ${1##*/}. and if it fails, deletes the temp file. If the operation is aborted, the temp file will be replace on the next run. after all, unset -f removes the function definition as the alias is executed.