wget to exclude certain naming structures - wget

My company has a local production server from which I want to download files that follow a certain naming convention. However, I would like to exclude certain items based on a portion of the name. Example:
folder client_1234
file 1234.jpg
file 1234.ai
file 1234.xml
folder client_1234569
When wget is run I want it to bypass all folders and files containing "1234". I have researched this and come across ‘--exclude list’, but that appears to apply only to directories, and ‘reject = rejlist’, which appears to be for file extensions. Am I missing something in the manual here?
EDIT:
this should work.

wget has options -A <accept_list> and -R <reject_list>, which, from the manual page, appear to allow either suffixes or patterns. These are separate from the -I <include_dirs> and -X <exclude_dirs> options, which, as you note, only deal with directories. Given the example you list, and since you want to skip these items rather than keep them, something along the lines of -R "*1234*" (perhaps combined with -X "*1234*" for the client_1234 directories) might be what you need, although I'm not entirely sure that's exactly the naming convention you're after...
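For instance, an untested sketch (the server URL and path are placeholders) that fetches the tree recursively while skipping anything whose name contains 1234:

wget -r -np -R "*1234*" -X "*1234*" http://production-server/files/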

Related

How do I diff only certain files?

I have a list of files (a subset of the files in a directory) and I want to generate a patch that includes only the differences in those files.
From the diff manual, it looks like I can exclude (-x), but would need to specify that for every file that I don't want to include, which seems cumbersome and difficult to script cleanly.
Is there a way to just give diff a list of files? I've already isolated the files with the changes into a separate directory, and I also have a file with the list of filenames, so I can present it to diff in whichever way works best.
What I've tried:
cd filestodiff/
cd filestodiff/
for i in *; do diff /fileswithchanges/$i /fileswithoutchanges/$i >> mypatch.diff; done
However patch doesn't see this as valid input because there's no filename header included.
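One way to at least get the filename headers that patch expects (a sketch, untested against your tree; it assumes both directories contain identically named files and puts the unchanged copy first so the patch applies to it) is to have diff emit unified output:

cd filestodiff/
for i in *; do diff -u /fileswithoutchanges/"$i" /fileswithchanges/"$i"; done > mypatch.diff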
patchutils provides filterdiff, which can do this:
diff -ur old/ new/ | filterdiff -I filelist > patchfile
It is packaged for several Linux distributions.
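For example (illustrative paths; filelist is one filename per line, relative to the two trees), the include patterns can be made to ignore the old/ and new/ prefixes with -p1 (--strip-match=1), and the result applied with patch:

diff -ur old/ new/ | filterdiff -p1 -I filelist > patchfile
patch -p1 < patchfile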

How to get wget to go down and up host hierarchy

wget recurses to the second-to-bottom level and goes no further. If I specify the bottom-level HTML file as the source, it parses it and goes further. I think this may be caused by the PDF files linked from the HTML document being in a different root file path on the server. I need it to retrieve all the PDF files off the leaves of this hierarchy, since I am going to promote them together as part of a campaign for depression awareness.
I am using GNU Wget 1.19.4 built on linux-gnu.
I have tried --exclude, --exclude-directory, -l2, -l10, --continue, and many other switches. I need to use the --include options or wget grabs the entire site. If I use -np it won't go "up" into /docs.
This command gets me the HTML files but does not follow links in the bottommost HTML files.
wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
This command, when I manually specify the HTML file, gets the PDF files I want from it.
wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
I want it to visit all the HTML files in this branch, extract all the PDF links from them, and retrieve all the PDF files from /docs.
https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
Here is one of the PDFs. The /docs directory does not have a listing.
https://www.beyondblue.org.au/docs/default-source/research-project-files/online-forums-2015-report.pdf?sfvrsn=3d00adea_2
The best I can get wget to do is walk the site and get HTML files down to this level:
https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
https://www.beyondblue.org.au/about-us/research-projects/research-projects/networks-of-advocacy-and-influence-peer-mentors-in-beyond-blue-s-mental-health-forums
...
150 of them
It seems like a depth-limiting setting or a path traversal limitation or something. I suspect it's an easy one to spot.
Thanks again!
Alright, it looks like wget might be breadth-first. This means it gets everything in the directory before recursing into pages. I'm not sure of this, but I let the command below run and it seemed to fetch all the leaf HTML files, and only then recurse into them.
wget -r --verbose --include /docs/default-source/research-project-files/,/about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
Running it earlier and stopping it when it seemed to halt at the bottom HTML layer, without getting the PDFs, meant I was simply stopping it too early.
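If only the PDFs are wanted in the end, a rough two-stage sketch (untested; it assumes the PDF links appear in the saved pages as root-relative /docs/default-source/... paths) is to mirror the HTML branch first, then pull the PDF URLs out of the saved pages and fetch just those:

wget -r --include-directories=/about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
grep -rhoE '/docs/default-source/research-project-files/[^"]+' www.beyondblue.org.au/ | sort -u | sed 's|^|https://www.beyondblue.org.au|' > pdf-urls.txt
wget -P pdfs -i pdf-urls.txt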

Using wget to recursively fetch .txt files in .php file, but filters break the command

I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.
Main page with list of accessions: http://signal.salk.edu/atg1001/download.php
Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]
I am only interested in the quality_variant_[accession] link (not the quality_variant_filtered_[accession] link), which takes you to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)
Running the command below, the files of interest are eventually listed (but not downloaded, because of the --spider argument), demonstrating that wget can move through the page's hyperlinks to the files I want.
wget --spider --recursive "http://signal.salk.edu/atg1001/download.php"
I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.
# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"
However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.
# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"
# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.
I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.
Thank you!
Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.
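For what it's worth, a sketch of the kind of one-liner described above (untested; it assumes --accept-regex is matched against the complete URL, so the intermediate accession.php?id=... pages have to be accepted too, and the 'filtered' variants rejected explicitly):

wget -r -l 2 -nd -e robots=off --accept-regex 'download\.php|accession\.php|quality_variant_[A-Za-z0-9_-]+\.txt' --reject-regex 'quality_variant_filtered_' "http://signal.salk.edu/atg1001/download.php"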

What am I screwing up trying to download particular file types with wget?

I am attempting to regularly archive a few file types hosted on a community website where our admin has been MIA for years, in case he dies or just stops paying for the hosting.
I am able to download all of the files I need using wget -r -np -nd -e robots=off -l 0 URL but this leaves me with about 60,000 extra files to waste time both downloading and deleting.
I am really only looking for files with the extensions "tbt" and "zip". When I add in -A tbt,zip to the input, wget then only downloads a single file, "index.html.tmp". It immediately deletes this file because it doesn't match the file type specified, and then the process stops entirely, with wget announcing that it is finished. It does not attempt to download any of the other files that it grabs when the -A flag is not included.
What am I doing wrong? Why does specifying file types in the way that I did cause it to finish after only looking at one file?
Possibly you're hitting the same problem I've hit when trying to do something similar. When using --accept, wget determines whether a link refers to a file or directory based on whether or not it ends with a /.
For example, say I have a directory named files, and a web page that has:
<a href="/files">Lots o' files!</a>
If I were to request this with wget -r, then wget would happily GET /files, see that it was an HTML document containing a bunch of links, and continue to download those links.
However, if I add -A zip to my command line, and run wget with --debug, I see:
appending ‘http://localhost:8080/files’ to urlpos.
[...]
Deciding whether to enqueue "http://localhost:8080/files".
http://localhost:8080/files (files) does not match acc/rej rules.
Decided NOT to load it.
In other words, wget thinks this is a file (no trailing /) and it doesn't match our acceptance criteria, so it gets rejected.
If I modify the remote file so that it looks like...
<a href="/files/">Lots o' files!</a>
...then wget will follow the link and download files as desired.
I don't think there's a great solution to this problem if you need to use wget. As I mentioned in my comment, there are other tools available that may handle this situation more gracefully.
It's also possible you're experiencing a different issue; the output of adding --debug to your command line would clarify things in that case.
I also experienced this issue, on a page where all the download links looked something like this: filedownload.ashx?name=file.mp3. The solution was to match both the linked file and the downloaded file, so my wget accept flag looked like this: -A 'ashx,mp3'. I also used the --trust-server-names flag. This catches all the .ashx links in the web page; then, when wget does its second check, the .mp3 files that were downloaded will stay.
As an alternative to --trust-server-names, you may also find the --content-disposition flag helpful. Both flags help rename the file that gets downloaded from filedownload.ashx?name=file.mp3 to just file.mp3.
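Put together, a command along those lines might look like this (an untested sketch; the URL is a placeholder and the extensions follow the example above):

wget -r -np -nd -e robots=off -A 'ashx,mp3' --trust-server-names http://example.com/downloads/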

Can we wget with file list and renaming destination files?

I have this wget command:
sudo wget --user-agent='some-agent' --referer=http://some-referrer.html -N -r -nH --cut-dirs=x --timeout=xxx --directory-prefix=/directory/for/downloaded/files -i list-of-files-to-download.txt
-N will check if there is actually a newer file to download.
-r will turn the recursive retrieving on.
-nH will disable the generation of host-prefixed directories.
--cut-dirs=X will avoid the generation of the host's subdirectories.
--timeout=xxx will, well, timeout :)
--directory-prefix will store files in the desired directory.
This works nice, no problem.
Now, to the issue:
Let's say my files-to-download.txt has these kinds of files:
http://website/directory1/picture-same-name.jpg
http://website/directory2/picture-same-name.jpg
http://website/directory3/picture-same-name.jpg
etc...
You can see the problem: on the second download, wget will see we already have a picture-same-name.jpg, so it won't download the second one or any of the following ones with the same name. I cannot mirror the directory structure because I need all the downloaded files to be in the same directory. I can't use the -O option because it clashes with -N, and I need that. I've tried to use -nd, but it doesn't seem to work for me.
So, ideally, I need to be able to:
a.- wget from a list of url's the way I do now, keeping my parameters.
b.- get all files at the same directory and being able to rename each file.
Does anybody have any solution to this?
Thanks in advance.
I would suggest 2 approaches -
Use the "-nc" or the "--no-clobber" option. From the man page -
-nc
--no-clobber
If a file is downloaded more than once in the same directory, Wget's behavior depends on a few options, including -nc. In certain cases, the local file will be clobbered, or overwritten, upon repeated download. In other cases it will be preserved.
When running Wget without -N, -nc, -r, or -p, downloading the same file in the same directory will result in the original copy of file being preserved and the second copy being named file.1. If that file is downloaded yet again, the third copy will be named file.2, and so on. (This is also the behavior with -nd, even if -r or -p are in effect.) When -nc is specified, this behavior is suppressed, and Wget will refuse to download newer copies of file. Therefore, "no-clobber" is actually a misnomer in this mode---it's not clobbering that's prevented (as the numeric suffixes were already preventing clobbering), but rather the multiple version saving that's prevented.
When running Wget with -r or -p, but without -N, -nd, or -nc, re-downloading a file will result in the new copy simply overwriting the old. Adding -nc will prevent this behavior, instead causing the original version to be preserved and any newer copies on the server to be ignored.
When running Wget with -N, with or without -r or -p, the decision as to whether or not to download a newer copy of a file depends on the local and remote timestamp and size of the file. -nc may not be specified at the same time as -N.
A combination with -O/--output-document is only accepted if the given output file does not exist.
Note that when -nc is specified, files with the suffixes .html or .htm will be loaded from the local disk and parsed as if they had been retrieved from the Web.
As you can see from this man page entry, the behavior might be unpredictable/unexpected. You will need to see if it works for you.
Another approach would be to use a bash script. I am most comfortable using bash on *nix, so forgive the platform dependency. However, the logic is sound, and with a bit of modification you can get it to work on other platforms as well.
Sample pseudocode bash script -
for i in `cat list-of-files-to-download.txt`; do
    wget <all your flags except the -i flag> "$i" -O /path/to/custom/directory/filename
done
You can modify the script to download each file to a temporary file, parse $i to get the filename from the URL, check if the file exists on the disk, and then take a decision to rename the temp file to the name that you want.
This offers much more control over your downloads.
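A rough sketch of that idea (untested; the renaming rule, paths, and flags below are illustrative, and it drops -N in favor of a simple exists-on-disk check, since -N and -O don't combine):

dest=/directory/for/downloaded/files
while read -r url; do
    # Turn http://website/directory1/picture-same-name.jpg into directory1_picture-same-name.jpg
    name=$(echo "$url" | sed -e 's|^[a-z]*://[^/]*/||' -e 's|/|_|g')
    # Skip anything we already have (a crude stand-in for -N's timestamp check)
    [ -e "$dest/$name" ] && continue
    tmp=$(mktemp)
    if wget --user-agent='some-agent' --referer=http://some-referrer.html --timeout=30 -O "$tmp" "$url"; then
        mv "$tmp" "$dest/$name"
    else
        rm -f "$tmp"
    fi
done < list-of-files-to-download.txt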