How to make a correct wget request?

I need to copy an XML file from a server to my folder and name it daily.xml. The problem is that every new file gets the name daily.xml.1, daily.xml.2, etc.
How do I name the new file daily.xml and the previous file previous-daily.xml? As far as I know I need to use -O, but I don't understand how to use it. Here is my command:
wget -P /home/name/name2/docs/xml/ http://www.domain.com/XML/daily.xml
How do I make the correct request?

What about
wget -P /home/name/name2/docs/xml/ http://www.domain.com/XML/daily.xml -O daily$(date +'%Y%m%d%H%M%S').xml
If one-second resolution is not fine enough, you may need a counter variable instead.
This does not, however, rename your previous files.
If your only original problem was that your system does not recognize *.xml.7 as an XML file, the command above should fix that.
Edit: as for your comment, you could do
mv daily.xml previous-daily.xml;wget -P /home/name/name2/docs/xml/ http://www.domain.com/XML/daily.xml -O daily.xml
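If this runs from cron, a small wrapper with absolute paths is more robust; note that, as far as I can tell, -O takes the full output path and the -P prefix is not applied to it. A minimal sketch, using the directory and URL from the question:
#!/bin/sh
# rotate yesterday's file, then fetch the new one
DIR=/home/name/name2/docs/xml
[ -f "$DIR/daily.xml" ] && mv "$DIR/daily.xml" "$DIR/previous-daily.xml"
wget http://www.domain.com/XML/daily.xml -O "$DIR/daily.xml"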

Related

How do I wget a page from archive.org without the directory?

I'm trying to download a web page from archive.org (i.e. http://wayback.archive.org/web/20110410223952id_/http://www.goldalert.com/gold-price-hovers-at-1460-as-ecb-hikes-rates-2/ ) with wget. I want to download it to 00001/index.html. How would I go about doing this?
I tried wget -p -k http://wayback.archive.org/web/20110410223952id_/http://www.goldalert.com/gold-price-hovers-at-1460-as-ecb-hikes-rates-2/ -O 00001/index.html but that didn't work. I then cd'd into the directory and removed the 00001 from the -O flag. That didn't work either. I then just removed the -O flag. That worked, but I got the whole archive.org directory tree (i.e. a wayback.archive.org directory, then a web directory, and so on) and the filename wasn't changed :(
What do I do?
Sorry for the obviously noob question.
wget http://wayback.archive.org/web/20110410223952id_/http://www.goldalert.com/gold-price-hovers-at-1460-as-ecb-hikes-rates-2/ -O 00001/index.html
Solved my own question. So simple.
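One caveat: in my experience -O does not create missing directories, so make sure 00001 exists first, e.g.:
mkdir -p 00001
wget http://wayback.archive.org/web/20110410223952id_/http://www.goldalert.com/gold-price-hovers-at-1460-as-ecb-hikes-rates-2/ -O 00001/index.html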

using wget to get selected subdirectories

I want to set up a cron job to download data from a server over HTTP. Each directory is date/time-stamped in the format YYYYMMDDHH, and there are two versions daily, so HH is either 00 or 12. I only want a few of the subdirectories in each of those, e.g. the directory structure is website/2013121800/subdir/moresubdirs/file.
I tried using wget -A "*/subdir/*" but it started getting everything else. Is there a way to use wget to get only the desired subdirectories without explicitly setting the date/time?
Thanks.
You need to use directory-based limits instead of file-type constraints: -A only filters by file name, while -I / --include-directories restricts wget to the listed directories. Something like
wget -I '/*/subdir'
should do the trick for you ;)
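For the layout in the question, a fuller recursive call might look something like this (an untested sketch; website is a placeholder host, and depending on how the index pages link things you may also need to allow the intermediate YYYYMMDDHH directories in the include list):
wget -r -np -nH -I '/*/subdir' http://website/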

Can't resume "wget --mirror" with --no-clobber (-c -F -B unhelpful)

I started a wget mirror with "wget --mirror [sitename]", and it was
working fine, but I accidentally interrupted the process.
I now want to resume the mirror with the following caveats:
If wget has already downloaded a file, I don't want it downloaded
again. I don't even want wget to check the timestamp: I know the
version I have is "recent enough".
I do want wget to read the files it's already downloaded and
follow links inside those files.
I can use "-nc" for the first point above, but I can't seem to coerce
wget to read through files it's already downloaded.
Things I've tried:
The obvious "wget -c -m" doesn't work, because it wants
to compare timestamps, which requires making at least a HEAD request
to the remote server.
"wget -nc -m" doesn't work, since -m implies -N, and -nc is
incompatible with -N.
"wget -F -nc -r -l inf" is the best I could come up with, but it
still fails. I was hoping "-F" would coerce wget into reading local,
already-downloaded files as HTML, and thus follow links, but this
doesn't appear to happen.
I tried a few other options (like "-c" and "-B [sitename]"), but
nothing works.
How do I get wget to resume this mirror?
Apparently this works:
Solved: Wget error "Can't timestamp and not clobber old files at the
same time." While trying to resume a site-mirror operation I was
running through Wget, I ran into the error "Can't timestamp and not
clobber old files at the same time." It turns out that running Wget
with the -N and -nc flags set at the same time can't happen, so if you
want to resume a recursive download with noclobber you have to disable
-N. The -m attribute (for mirroring) intrinsically sets the -N
attribute, so you'll have to switch from -m to -r in order to use
noclobber as well.
From: http://www.marathon-studios.com/blog/solved-wget-error-cant-timestamp-and-not-clobber-old-files-at-the-same-time/
-m, according to the wget manual, is equivalent to this longer series of settings: -r -N -l inf --no-remove-listing. Just use those settings instead of -m, and drop -N (timestamping).
I'm not sure whether there is a built-in way to get wget to download the URLs found in HTML files that already exist on disk. There is probably a solution: wget can take HTML files as input and scrape all the links in them, so you could loop over (or concatenate) the downloaded pages and feed them back to it; see the sketch after this answer.
I solved this problem by just deleting all the HTML files, because I didn't mind redownloading those. But this might not work for everyone's use case.
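A rough sketch of that approach, assuming the mirror of http://example.com/ lives in ./example.com/ (host and paths are placeholders). If I read the manual correctly, -nc also makes wget load existing .html files from disk and re-parse them, so the plain recursive call may already follow links inside the files you have:
# resume the crawl: -m expanded, minus -N, plus -nc so existing files are kept
wget -r -l inf --no-remove-listing -nc http://example.com/
# optionally, feed links from already-downloaded pages back to wget:
# -F treats each input file as HTML, -B resolves its relative links,
# -x rebuilds the directory tree, -nc skips anything already on disk
find example.com -name '*.html' | while read -r f; do
    wget -nc -x -F -B http://example.com/ -i "$f"
done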

updating data from different URL using wget

What's the best way of updating data files from a website that has moved to a new domain, with changes in its folder structure?
The old URL for example is http://folder.old-domain.com while the new URL is http://new-domain.com/directory1/directory2. My data is stored locally in ~/Data_Backup/folder.old-domain.com folder.
Data was originally downloaded using:
$ wget -S -t 0 -c --mirror -w 2 -k http://folder.old-domain.com
I was thinking of using mv to rename the old folder to follow the new URL pattern, but is there a better way of doing this?
Will this work? I'm not particular about the directory structure; what's important is to update the contents of the target folder (and its sub-folders).
$ wget -S -t 0 -c -m -w 2 -k -N -np -P ~/Data_Backup/folder.old-domain.com http://new-domain.com/directory/directory
Thanks in advance.
Got it!
I needed to add the following options:
-nH --cut-dirs=2
and now it works.
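For reference, the full command with those options folded in would look something like this (using the new-domain URL from the question; untested):
$ wget -S -t 0 -c -m -w 2 -k -N -np -nH --cut-dirs=2 -P ~/Data_Backup/folder.old-domain.com http://new-domain.com/directory1/directory2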

How to specify the download location with wget?

I need files to be downloaded to /tmp/cron_test/. My wget code is
wget --random-wait -r -p -nd -e robots=off -A".pdf" -U mozilla http://math.stanford.edu/undergrad/
So is there some parameter to specify the directory?
From the manual page:
-P prefix
--directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the
directory where all other files and sub-directories will be
saved to, i.e. the top of the retrieval tree. The default
is . (the current directory).
So you need to add -P /tmp/cron_test/ (short form) or --directory-prefix=/tmp/cron_test/ (long form) to your command. Also note that if the directory does not exist it will get created.
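Applied to the command in the question, that would be something like:
wget --random-wait -r -p -nd -e robots=off -A".pdf" -U mozilla -P /tmp/cron_test/ http://math.stanford.edu/undergrad/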
-O is the option to specify the path of the file you want to download to:
wget <uri> -O /path/to/file.ext
-P sets the directory prefix into which the file will be downloaded:
wget <uri> -P /path/to/folder
Make sure you have the URL correct for whatever you are downloading, and bear in mind that URLs containing characters like ? need care: if that part of the URL isn't resolved to a real file name, wget will use the leftover characters as the name of the file you are downloading into.
For example:
wget "sourceforge.net/projects/ebosse/files/latest/download?source=typ_redirect"
will download into a file named ?source=typ_redirect.
As you can see, knowing a thing or two about URLs helps when working with wget.
I was booting from a Hiren's disk and only had Linux 2.6.1 as a resource (import os is unavailable). The syntax that solved my problem of downloading an ISO onto the physical hard drive was:
wget "(source url)" -O (directory where HD was mounted)/isofile.iso"
You can figure out the correct URL by finding the point at which wget downloads into a file named index.html (the default file) with the correct size and other attributes of the file you need, as shown by the following command:
wget "(source url)"
Once the URL and source file are correct and it is downloading into index.html, you can stop the download (Ctrl+C) and change the output file by using:
-O "<specified download directory>/filename.extension"
after the source url.
In my case this results in downloading an ISO and storing it as a binary file under isofile.iso, which hopefully mounts.
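A sketch of that workflow with placeholder values (the URL and mount point are illustrative, not from the original post):
# first run: confirm the URL fetches the right thing; it lands in index.html
# or in a name derived from the URL
wget "http://example.com/path/download?id=123"
# interrupt it, then re-run with an explicit output path on the mounted drive
wget "http://example.com/path/download?id=123" -O "/mnt/hd/isofile.iso"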
"-P" is the right option, please read on for more related information:
wget -nd -np -P /dest/dir --recursive http://url/dir1/dir2
Relevant snippets from man pages for convenience:
-P prefix
--directory-prefix=prefix
Set directory prefix to prefix. The directory prefix is the directory where all other files and subdirectories will be saved to, i.e. the top of the retrieval tree. The default is . (the current directory).
-nd
--no-directories
Do not create a hierarchy of directories when retrieving recursively. With this option turned on, all files will get saved to the current directory, without clobbering (if a name shows up more than once, the filenames will get extensions .n).
-np
--no-parent
Do not ever ascend to the parent directory when retrieving recursively. This is a useful option, since it guarantees that only the files below a certain hierarchy will be downloaded.
man wget:
-O file
--output-document=file
wget "url" -O /tmp/cron_test/<file>