I'm trying to get some files in a specific website directory using wget:
wget -r http://www.example.com/Item/<folder>/<files>
Given that the page http://www.example.com/Item/ itself returns a 404, how can I get the files and folders inside it while ignoring other folders and extra files such as the following:
http://www.example.com/b/something/index.html
http://www.example.com/Category/file.html
http://www.example.com/Collection/demo.html
http://www.example.com/Javascript/data.js
http://www.example.com/Stylesheet/type.css
etc.
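One way this might be approached, as a sketch under assumptions: since /Item/ itself returns 404, recursion has to start from some other page that links into /Item/, with wget's -I (--include-directories) option limiting which directories are followed. The helper name fetch_item_dir, the WGET override variable and the choice of start page are hypothetical and not part of the question.

```shell
# Hypothetical helper: recurse from a page that exists, but only follow
# links whose directory is under /Item.
fetch_item_dir() {
    start_url=$1    # assumption: this page exists and links into /Item/
    # -r       recursive retrieval
    # -l inf   no depth limit
    # -I /Item only follow links into /Item and its subdirectories
    # WGET can be set to another command (for example echo) for a dry run.
    ${WGET:-wget} -r -l inf -I /Item "$start_url"
}
```

Calling fetch_item_dir http://www.example.com/ would still download the start page itself, and files under /Item/ that are only linked from excluded directories such as /Category/ would not be discovered.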
wget recurses to the second-from-bottom level and goes no further. If I specify a bottom-level HTML file as the source, it parses it and goes further. I think this may be because the PDF files linked from the HTML documents are under a different root path on the server. I need it to retrieve all the PDF files off the leaves of this hierarchy since I am going to promote them together as part of a campaign for depression awareness.
I am using GNU Wget 1.19.4 built on linux-gnu.
I have tried --exclude, --exclude-directory, -l2, -l10, --continue and many other switches. I need to use the --include options or wget grabs the entire site. If I use -np it won't go "up" into /docs.
This code gets me the HTML files but does not follow links in the "bottom-most" HTML files.
wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
This code, when I manually specify the HTML file, gets the PDF files I want in it.
wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
I want it to visit all the HTML files in this branch, extract all the PDF links in them, and retrieve all the PDF files from /docs. For example, this is one of the HTML pages:
https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
Here is one of the PDFs. The /docs directory does not have a listing.
https://www.beyondblue.org.au/docs/default-source/research-project-files/online-forums-2015-report.pdf?sfvrsn=3d00adea_2
The best I can get wget to do is walk the site and get HTML files down to this level:
https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
https://www.beyondblue.org.au/about-us/research-projects/research-projects/networks-of-advocacy-and-influence-peer-mentors-in-beyond-blue-s-mental-health-forums
...
150 of them
It seems like a depth-limiting setting or a path traversal limitation or something. I suspect it's an easy one to spot.
Thanks again!
Alright, it looks like wget might be breadth-first. This means it gets everything in the directory before recursing into pages. I'm not sure of this, but I let the command below run and it seemed to get all the leaf HTML files first, and only then recurse into them.
wget -r --verbose --include /docs/default-source/research-project-files/,/about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/
Stopping the earlier runs when they seemed to halt at the bottom HTML layer without getting the PDFs was certainly stopping them too early.
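If the single recursive run is too slow, a two-pass alternative might be to save the leaf HTML pages first and then pull the PDF links out of them. The sketch below only covers the link extraction. It assumes the links are double-quoted href attributes that are either absolute or root-relative, and the function name pdf_links is made up for illustration.

```shell
# Hypothetical helper: read HTML on stdin, print the PDF links it contains.
# $1 is the site base (scheme and host) used to complete root-relative links.
pdf_links() {
    base=$1
    # Keep only href="..." attributes whose value contains ".pdf"
    grep -o 'href="[^"]*\.pdf[^"]*"' |
        # Strip the href=" prefix and the closing quote
        sed -e 's/^href="//' -e 's/"$//' |
        # Turn /docs/... into https://host/docs/...
        sed -e "s|^/|$base/|" |
        sort -u
}
```

The output can then be fed back to wget, for example:
cat *.html | pdf_links https://www.beyondblue.org.au | wget -i -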
I cannot download some files from my company's server because their names contain reserved characters (this is outside the company's control; the clients who upload the attachments name them incorrectly). For some reason wget gets a 404 error even though the files exist on the server.
This is the command line that starts the download (list.txt contains one URL per line, each pointing to a file on the server, for example: https://example.com/files/122301/8+.pdf)
wget.exe -x -i "C:\clon\list.txt" -P "C:\clon\destino" -nv -o "C:\clon\log.txt"
I do not know what the wget parameters do, apart from the source/destination paths and the log, but some files contain '}' or '+' in their names and (I think) that is why the missing files are not downloaded (93% of all the files have downloaded).
Examples of files including these characters:
/FC04-6198}+.pdf
/8+.pdf
/PT05+2236.pdf
I tried adding the "--content-disposition" and "--restrict-file-names" parameters, but nothing changed.
I am looking for a way to handle the reserved characters so that these files can be downloaded.
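One thing that might be worth trying, purely as an assumption about the cause: the server may only recognise these names when the reserved characters are percent-encoded in the URL ('+' as %2B, '}' as %7D). A minimal sketch of rewriting a URL that way follows; the function name encode_reserved is hypothetical, and it assumes '+', '{' and '}' never appear in the URLs with another meaning.

```shell
# Hypothetical helper: percent-encode '{', '}' and '+' in a URL.
encode_reserved() {
    # In sed's basic regular expressions these three characters are literal.
    printf '%s\n' "$1" | sed -e 's/{/%7B/g' -e 's/}/%7D/g' -e 's/+/%2B/g'
}
```

For example, encode_reserved 'https://example.com/files/122301/8+.pdf' prints https://example.com/files/122301/8%2B.pdf. The same sed expressions could be run once over list.txt to produce an encoded list for the -i option.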
I am looking to download all quality_variant_[accession_name].txt files from the Salk Arabidopsis 1001 Genomes site using wget in Bash shell.
Main page with list of accessions: http://signal.salk.edu/atg1001/download.php
Each accession links to a page (e.g., http://signal.salk.edu/atg1001/data/Salk/accession.php?id=Aa_0 where Aa_0 is the accession ID) containing three more links: unsequenced_[accession], quality_variant_[accession], and quality_variant_filtered_[accession]
I am only interested in the quality_variant_[accession] link (not the quality_variant_filtered_[accession] link), which takes you to a .txt file with sequence data (e.g., http://signal.salk.edu/atg1001/data/Salk/quality_variant_Aa_0.txt)
Running the command below, the files of interest are eventually outputted (but not downloaded because of the --spider argument), demonstrating that wget can move through the page's hyperlinks to the files I want.
wget --spider --recursive "http://signal.salk.edu/atg1001/download.php"
I have not let the command run long enough to determine whether the files of interest are downloaded, but the command below does begin to download the site recursively.
# Arguments in brackets do not impact the performance of the command
wget -r [-e robots=off] [-m] [-np] [-nd] "http://signal.salk.edu/atg1001/download.php"
However, whenever I try to apply filters to pull out the .txt files of interest, whether with --accept-regex, --accept, or many other variants, I cannot get past the initial .php file.
# This and variants thereof do not work
wget -r -A "quality_variant_*.txt" "http://signal.salk.edu/atg1001/download.php"
# Returns:
# Saving to: ‘signal.salk.edu/atg1001/download.php.tmp’
# Removing signal.salk.edu/atg1001/download.php.tmp since it should be rejected.
I could make a list of the accession names and loop through those names modifying the URL in the wget command, but I was hoping for a dynamic one-liner that could extract all files of interest even if accession IDs are added over time.
Thank you!
Note: the data files of interest are contained in the directory http://signal.salk.edu/atg1001/data/Salk/, which is also home to a .php or static HTML page that is displayed when that URL is visited. This URL cannot be used in the wget command because, although the data files of interest are contained here server side, the HTML page contains no reference to these files but rather links to a different set of .txt files that I don't want.
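One likely explanation for the filter problem is that -A is also applied to the links wget would have to follow, so the intermediate accession.php?id=... pages are rejected before they are ever fetched. A way around it, sketched here under assumptions, is to skip recursion and build the file URLs from the accession IDs found in download.php. This assumes every accession appears in that page as a link containing accession.php?id=[ID], that IDs consist only of letters, digits, '_', '.' and '-', and that the file names follow the pattern shown in the example URL. The function name accession_urls is made up.

```shell
# Hypothetical helper: read the HTML of download.php on stdin and print
# one quality_variant URL per accession ID found in it.
accession_urls() {
    # Pull out every "accession.php?id=<ID>" occurrence
    grep -o 'accession\.php?id=[A-Za-z0-9_.-]*' |
        # Replace the link prefix with the data URL prefix and append .txt
        sed -e 's|^accession\.php?id=|http://signal.salk.edu/atg1001/data/Salk/quality_variant_|' \
            -e 's|$|.txt|' |
        sort -u
}
```

Because the list is rebuilt from the live page on each run, newly added accessions would be picked up as well:
wget -qO- "http://signal.salk.edu/atg1001/download.php" | accession_urls | wget -i -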
I have a wget command running, but because of how the website behaves it regularly downloads "fake data". So I am currently running the download dozens of times to get the correct content.
I now want the following:
use the wget command
check whether that directory contains files <100kB
if yes ->use robocopy to save the >100kB files to another folder; repeat with wget
if no -> stop
I have the wget command and the robocopy command.
All I need now is a way to check whether a directory contains a file <100kB, and how to put this together into a cmd/PowerShell/bat script or whatever works.
What would the command for that look like? In particular, I have no idea how to check the files.
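The size check itself could look something like the sketch below. Note the assumptions: it is written for a POSIX shell (on Windows that would mean something like Git Bash or WSL rather than cmd), it takes 100kB to mean 102400 bytes, and the function name has_small_files is made up. The wget and robocopy commands are not shown in the question, so they only appear as placeholders in comments.

```shell
# Hypothetical helper: succeed (exit status 0) if the directory tree
# contains at least one regular file smaller than the limit in bytes.
has_small_files() {
    dir=$1
    limit=${2:-102400}    # assumption: 100kB = 100 * 1024 bytes
    # -size -Nc matches files strictly smaller than N bytes
    [ -n "$(find "$dir" -type f -size "-${limit}c" | head -n 1)" ]
}

# Possible outer loop, with the two existing commands filled in:
#   while :; do
#       <the wget command>
#       has_small_files "$download_dir" || break
#       <the robocopy command that saves the >100kB files>
#   done
```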
I'd like to write a function that, given a URL, returns the name of the file downloaded by wget URL.
I don't understand the behavior of wget very well. If I do wget on python.org, www.python.org, http://www.python.org, or http://www.python.org/, the name of the file downloaded is index.html.
However, if I do www.python.org/about, the name of the file downloaded is about, instead of index.html.
The reason your wget fetches index.html in the first cases is that index.html is the name wget falls back to when the URL has no file name part. python.org, www.python.org, http://www.python.org, and http://www.python.org/ don't name a file, so the server returns its default "home page" and wget saves it as index.html. The server sends your browser the same page, though you don't usually see a file name. www.python.org/about is a different page whose URL ends in "about", so it makes sense that the file it downloads has a different name.
Might I recommend the man page for wget if you want to know how it works? If it's the name of the downloaded file that concerns you, you have the option to change it via the -O option.
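For the cases listed in the question, the naming rule can be sketched as "last path component, or index.html if there is none". The function below is only an approximation with a made-up name (wget_default_name): it ignores redirects, --content-disposition, -O, and the numeric suffixes wget adds when a file already exists.

```shell
# Hypothetical helper: guess the file name wget would save a URL under.
wget_default_name() {
    url=$1
    rest=${url#*://}              # drop the scheme, if there is one
    case $rest in
        */*) path=${rest#*/} ;;   # everything after the host name
        *)   path= ;;             # host only, no path at all
    esac
    name=${path##*/}              # last path component
    [ -n "$name" ] || name=index.html
    printf '%s\n' "$name"
}
```

With this, wget_default_name http://www.python.org/ gives index.html and wget_default_name www.python.org/about gives about, matching what the question observed.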