Download all pdf files using wget - wget

I have the following site http://www.asd.com.tr. I want to download all PDF files into one directory. I've tried a couple of commands but am not having much luck.
$ wget --random-wait -r -l inf -nd -A pdf http://www.asd.com.tr/
With this code only four PDF files were downloaded. Check this link, there are over several thousand PDFs available:
http://www.asd.com.tr/Default.aspx
For instance, hundreds of files are in the following folder:
http://www.asd.com.tr/Folders/asd/…
But I can't figure out how to access them correctly to see and download them all, there are some of folders in this subdirectory, http://www.asd.com.tr/Folders/, and thousands of PDFs in these folders.
I've tried to mirror site using -m command but it failed too.
Any more suggestions?

First, verify that the TOS of the web site permit to crawl it. Then, one solution is :
mech-dump --links 'http://domain.com' |
grep pdf$ |
sed 's/\s+/%20/g' |
xargs -I% wget http://domain.com/%
The mech-dump command comes with Perl's module WWW::Mechanize (libwww-mechanize-perl package on debian & debian likes distros)

Related

Recursive file download with wget not working

From what I can tell by studying the wget manual, the following should work:
wget -r -l1 -np -nd -A.* -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/
However, instead of getting all of the files in the blat folder without the apparently auto-generated index.html file, I get a 404 not found error on this and several dozen variations that I've tried.
I can easily download any of the 4 files but trying to do it recursively fails.
Any pointers would be greatly appreciated.
Try replacing -r -l1 with -r -l 1. You need a space between the l and the 1. Also, try adding -k with your options. This will convert the links to point to the corresponding files on your computer.

How to force wget to overwrite an existing file ignoring timestamp?

I tried '-N' and '--no-clobber' but the only result that I get is to retrieve a new copy of the existing example.exe with number a number added using this synax 'example.exe.1'. This is not what I'd like to get. I just need to download and overwrite the file example.exe in the same folder where I already saved a copy of example.com without that wget verifies if the mine is older or newer respect the on example.exe file already present in my download folder. Do you think is i possible or I need to create a script that delete the example.exe file or maybe something that change his modification date etc?
If you specify the output file using the -O option it will overwrite any existing file.
For example:
wget -O index.html bbc.co.uk
Run multiple times will keep over-writting index.html.
wget doesn't let you overwrite an existing file unless you explicitly name the output file on the command line with option -O.
I'm a bit lazy and I don't want to type the output file name on the command line when it is already known from the downloaded file. Therefore, I use curl like this:
curl -O http://ftp.vim.org/vim/runtime/spell/fr.utf-8.spl
Be careful when downloading files like this from unsafe sites. The above command will write a file named as the connected web site wishes to name it (inside the current directory though). The final name may be hidden through redirections and php scripts or be obfuscated in the URL. You might end up overwriting a file you don't want to overwrite.
And if you ever find a file named ls or any other enticing name in the current directory after using curl that way, refrain from executing the downloaded file. It may be a trojan downloaded from a rogue or corrupted web site!
wget --backups=1 google.com
renames original file with .1 suffix and writes new file to the intended filename.
Not exactly what was requested, but could be handy in some cases.
-c or --continue
From the manual:
If you use ‘-c’ on a non-empty file, and the server does not support
continued downloading, Wget will restart the download from scratch and
overwrite the existing file entirely.
I like the -c option. I started with the man page then the web but I've searched for this several times. Like if you're relaying a webcam so the image needs to always be named image.jpg. Seems like it should be more clear in the man page.
I've been using this for a couple years to download things in the background, sometimes combined with "limit-rate = " in my wgetrc file
while true
do
wget -c -i url.txt && break
echo "Restarting wget"
sleep 2
done
Make a little file called url.txt and paste the file's URL into it. Set this script up in your path or maybe as an alias and run it. It keeps retrying the download until there's no error. Sometimes at the end it gets into a loop displaying
416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
but that's harmless, just ctrl-c it. I think it's always gotten the file I wanted even if wget runs out of retries or the connection temporarily goes away. I've downloaded things for days at a time with it. A CD image on dialup, yes, always with wget.
My use case involves two different URLs, sometimes the second one doesn't exist, but if it DOES exist, I want it to overwrite the first file.
The problem of using wget -O is that, when the second file DOESN'T exist, it will overwrite the first file with a BLANK file.
So the only way I could find is with an if statement:
--spider checks if a file exists, and returns 0 if it does
--quiet fail quietly, with no output
-nv is quiet, but still reports errors
wget -nv https://example.com/files/file01.png -O file01.png
# quietly check if a different version exists
wget --quiet --spider https://example.com/custom-files/file01.png
if [ $? -eq 0 ] ; then
# A different version exists, so download and overwrite the first
wget -nv https://example.com/custom-files/file01.png -O file01.png
fi
It's verbose, but I found it necessary. I hope this is helpful for someone.
Here is an easy way to get it done with parameter trimming
url=https://example.com/example.exe ; wget -nv $url -O ${url##*/}
Or you can use basename
url=https://example.com/example.exe ; wget -nv $url -O $( basename $url )
For those who do not want to use -O and want to specify the output directory only, the following command can be used.
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"
the first command will download from the source with the wget command
the second command will remove the older file
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"; \
rm '$file.1' -f;

Using wget (for windows) to download all MIDI files

I've been trying to use wget to download all midi files from a website (http://cyberhymnal.org/) using:
wget64 -r -l1 H -t1 -nd -N -np -A.mid -erobots=off http://cyberhymnal.org/
I got the syntax from various sites which all suggest the same thing, but it doesn't download anything. I've tried various variations on the theme, such as different values for '-l' etc.
Does anybody have any suggestions as to what I am doing wrong? Is it the fact that I am using Windows?
Thanks in advance.
I don't know much about all the parameters you are using like H, -t1, -N etc though we can find it online. But I also had to download files from a url matching a wildcard. So command that worked for me:
wget -r -l1 -nH --cut-dirs=100 -np "$url" -P "${newLocalLib/$tokenFind}" -A "com.iontrading.arcreporting.*.jar"
after -P you specify the path where you wanna save the files to and after -A you provide the wild card token. Like in your case that would be "*.mid".
-A means Accept. So here we provide the files to accept from the provided URL. Similarly -R for reject list.
You may have better luck (at least, you'll get more MIDI files), if you try the actual Cyber Hymnal™, which moved over 10 years ago. The current URL is now http://www.hymntime.com/tch/.

recursive wget with hotlinked requisites

I often use wget to mirror very large websites. Sites that contain hotlinked content (be it images, video, css, js) pose a problem, as I seem unable to specify that I would like wget to grab page requisites that are on other hosts, without having the crawl also follow hyperlinks to other hosts.
For example, let's look at this page
https://dl.dropbox.com/u/11471672/wget-all-the-things.html
Let's pretend that this is a large site that I would like to completely mirror, including all page requisites – including those that are hotlinked.
wget -e robots=off -r -l inf -pk
^^ gets everything but the hotlinked image
wget -e robots=off -r -l inf -pk -H
^^ gets everything, including hotlinked image, but goes wildly out of control, proceeding to download the entire web
wget -e robots=off -r -l inf -pk -H --ignore-tags=a
^^ gets the first page, including both hotlinked and local image, does not follow the hyperlink to the site outside of scope, but obviously also does not follow the hyperlink to the next page of the site.
I know that there are various other tools and methods of accomplishing this (HTTrack and Heritrix allow for the user to make a distinction between hotlinked content on other hosts vs hyperlinks to other hosts) but I'd like to see if this is possible with wget. Ideally this would not be done in post-processing, as I would like the external content, requests, and headers to be included in the WARC file I'm outputting.
You can't specify to span hosts for page-reqs only; -H is all or nothing. Since -r and -H will pull down the entire Internet, you'll want to split the crawls that use them. To grab hotlinked page-reqs, you'll have to run wget twice: once to recurse through the site's structure, and once to grab hotlinked reqs. I've had luck with this method:
1) wget -r -l inf [other non-H non-p switches] http://www.example.com
2) build a list of all HTML files in the site structure (find . | grep html) and pipe to file
3) wget -pH [other non-r switches] -i [infile]
Step 1 builds the site's structure on your local machine, and gives you any HTML pages in it. Step 2 gives you a list of the pages, and step 3 wgets all assets used on those pages. This will build a complete mirror on your local machine, so long as the hotlinked assets are still live.
I've managed to do this by using regular expressions. Something like this to mirror http://www.example.com/docs
wget --mirror --convert-links --adjust-extension \
--page-requisites --span-hosts \
--accept-regex '^http://www\.example\.com/docs|\.(js|css|png|jpeg|jpg|svg)$' \
http://www.example.com/docs
You'll probably have to tune the regexs for each specific site. For example some sites like to use parameters on css files (e.g. style.css?key=value), which this example will exclude.
The files you want to include from other hosts will probably include at least
Images: png jpg jpeg gif
Fonts: ttf otf woff woff2 eot
Others: js css svg
Anybody know any others?
So the actual regex you want will probably look more like this (as one string with no linebreaks):
^http://www\.example\.org/docs|\.([Jj][Ss]|[Cc][Ss][Ss]|[Pp][Nn][Gg]|[Jj]
[Pp][Ee]?[Gg]|[Ss][Vv][Gg]|[Gg][Ii][Ff]|[Tt][Tt][Ff]|[Oo][Tt][Ff]|[Ww]
[Oo][Ff][Ff]2?|[Ee][Oo][Tt])(\?.*)?$

How to download all files from a specific Sourceforge project?

After spending about an hour downloading almost every Msys package from sourceforge I'm wondering whether there is a more clever way to do this. Is it possible to use wget for this purpose?
I've used this script successfully:
https://github.com/SpiritQuaddicted/sourceforge-file-download
For your use run:
sourceforge-file-downloader.sh msys
It should download all the pages first then find the actual links in the pages and download the final files.
From the project description:
Allows you to download all of a sourceforge project's files. Downloads to the current directory into a directory named like the project. Pass the project's name as first argument, eg ./sourceforge-file-download.sh inkscape to download all of http://sourceforge.net/projects/inkscape/files/
Just in case the repo ever gets removed I'll post it here since it's short enough:
#!/bin/sh
project=$1
echo "Downloading $project's files"
# download all the pages on which direct download links are
# be nice, sleep a second
wget -w 1 -np -m -A download http://sourceforge.net/projects/$project/files/
# extract those links
grep -Rh direct-download sourceforge.net/ | grep -Eo '".*" ' | sed 's/"//g' > urllist
# remove temporary files, unless you want to keep them for some reason
rm -r sourceforge.net/
# download each of the extracted URLs, put into $projectname/
while read url; do wget --content-disposition -x -nH --cut-dirs=1 "${url}"; done < urllist
rm urllist
In case of no wget or shell install do it with FileZilla: sftp://yourname#web.sourceforge.net you open the connection with sftp and your password then you browse to the /home/pfs/
after that path (could be a ? mark sign) you fill in with your folder path you want to download in remote site, in my case
/home/pfs/project/maxbox/Examples/
this is the access path of the frs: File Release System: /home/frs/project/PROJECTNAME/