Wget missing URL and & - wget

I'm new to Wget. Following online examples, I am trying to log in to a simple page using the following command:
wget --post-data='entry=85482564&submit3=LOGIN' \
     --save-cookies=my-cookies.txt --keep-session-cookies \
     https://www.abczyx.com
I get the following error:
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
wget: missing URL
Usage: wget [OPTION]... [URL]...
Try `wget --help' for more options.
'submit3' is not recognized as an internal or external command, operable program or batch file.
I'm guessing that it doesn't quite recognize the &, but I am not sure how to fix it. I'm running the Windows 7 command line. A side question: why use "\"? I see some examples with it and some without it, and I get issues with it.

After doing some reading, I found that because this is the MS-DOS-style command prompt, it does not interpret the special characters correctly. Adding quotes around it ("&") did the trick.

In Windows the escape character is the caret, ^, not the backslash, \. So in a batch file it should look like 'entry=85482564^&submit3=LOGIN'.
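Applied to the command from the question, a one-line cmd.exe/batch version might look something like this (a sketch, not tested; URL and field names taken from the question):
wget --post-data=entry=85482564^&submit3=LOGIN --save-cookies=my-cookies.txt --keep-session-cookies https://www.abczyx.com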

For me, what worked was changing & to %26,
as in
--post-data 'login=foo%26pass=bar'
Also, if you are posting an email address, be sure to change the @ to %40.
Other codes:
https://en.wikipedia.org/wiki/Percent-encoding
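For instance, following this answer's convention of percent-encoding both characters, an email login might be posted like this (field names and URL are hypothetical):
wget --post-data 'email=foo%40example.com%26pass=bar' https://www.example.com/login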

Yes, there is a mistake (I'd say a very serious mistake) in wget's manual. In the manual it says:
Log in to the server. This can be done only once.
wget --save-cookies cookies.txt \
     --post-data 'user=foo&password=bar' \
     http://example.com/auth.php
So you do something like
wget --save-cookies cookies.txt \
--post-data 'user=yourUser12%23125&password=yourPassword12%241' \
http://www.websitelink.com/
Which obviously doesn't work, for multiple reasons. First, you have to remove the \ symbols because they get in the way. Second, you have to remove the line breaks themselves, because when you paste the command into your command-line tool it will execute them just as if you pressed Enter after each line, which results in it being run as 3 separate commands:
First:
wget --save-cookies cookies.txt \
Second:
--post-data 'user=yourUser12%23125&password=yourPassword12%241' \
Third:
http://www.websitelink.com/
Ok, so you remove the backslashes and then realize that you also have to remove the line breaks yourself, but it still doesn't work. At this point it's pepehands in the air. So what do you do now? Somehow you have to automagically realize that the & symbol should also be percent-encoded. So you turn this:
Log in to the server.
This can be done only once. wget --save-cookies cookies.txt
--post-data 'user=foo&password=bar'
http://example.com/auth.php
To this:
wget --save-cookies cookies.txt --post-data 'user=yourUser12%23125%26password=yourPassword12%241' http://www.websitelink.com/
And it starts working!

Related

Recursive file download with wget not working

From what I can tell by studying the wget manual, the following should work:
wget -r -l1 -np -nd -A.* -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/
However, instead of getting all of the files in the blat folder without the apparently auto-generated index.html file, I get a 404 not found error on this and several dozen variations that I've tried.
I can easily download any of the 4 files but trying to do it recursively fails.
Any pointers would be greatly appreciated.
Try replacing -r -l1 with -r -l 1. You need a space between the l and the 1. Also, try adding -k with your options. This will convert the links to point to the corresponding files on your computer.
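If you adjust the command from the question along those lines, it might look something like this (an untested sketch; the -A pattern is dropped for simplicity):
wget -r -l 1 -np -nd -k -R "index.html*" http://s3.amazonaws.com/gp.gms/blat/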

wget - only output redirect url but no download

I have a download link to a large file.
You need to be logged in to the site, so a cookie is used.
The download link redirects to another URL.
I'm able to download the file with wget but I only want the output of the "real" direct download link.
wget prints exactly this before starting the download:
Location: https://foo.com/bar.zip [following]
Is there a way to make wget stop and not actually downloading the file?
The solutions I found recommend redirecting to /dev/null, but this would still download the file. What I want is wget following the redirects but not actually starting the download.
I couldn't find a way to do it with wget, but I found a way to do it with curl:
curl https://openlibrary.org/data/ol_dump_latest.txt.gz -s -L -I -o /dev/null -w '%{url_effective}'
This only downloads the HEAD of the page (and sends it to /dev/null), so the file itself is never downloaded.
(src: https://stackoverflow.com/a/5300429/2317712 )
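If you need the resolved URL in a script, the same curl command can be captured into a shell variable (a sketch):
real_url=$(curl -s -L -I -o /dev/null -w '%{url_effective}' https://openlibrary.org/data/ol_dump_latest.txt.gz)
echo "$real_url"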
Going off of @qqilihq's comment to the curl answer, this will first strip out the line starting with "Location:" then remove the "Location: " from the beginning and the " [following]" from the end using awk. Not sure if I would use this as it looks like a small change in the wget output could make it blow up. I would use the curl answer myself.
wget --max-redirect=0 http://example.com/link-to-get-redirec-url-from 2>&1 | awk '/Location: /,// { print }' | awk '{print $2}'

How to force wget to overwrite an existing file ignoring timestamp?

I tried '-N' and '--no-clobber', but the only result I get is a new copy of the existing example.exe with a number added, using the syntax 'example.exe.1'. This is not what I'd like to get. I just need to download and overwrite the file example.exe in the same folder where I already saved a copy, without wget checking whether mine is older or newer than the example.exe file already present in my download folder. Do you think this is possible, or do I need to create a script that deletes the example.exe file, or maybe something that changes its modification date, etc.?
If you specify the output file using the -O option it will overwrite any existing file.
For example:
wget -O index.html bbc.co.uk
Running it multiple times will keep overwriting index.html.
wget doesn't let you overwrite an existing file unless you explicitly name the output file on the command line with option -O.
I'm a bit lazy and I don't want to type the output file name on the command line when it is already known from the downloaded file. Therefore, I use curl like this:
curl -O http://ftp.vim.org/vim/runtime/spell/fr.utf-8.spl
Be careful when downloading files like this from unsafe sites. The above command will write a file named as the connected web site wishes to name it (inside the current directory though). The final name may be hidden through redirections and php scripts or be obfuscated in the URL. You might end up overwriting a file you don't want to overwrite.
And if you ever find a file named ls or any other enticing name in the current directory after using curl that way, refrain from executing the downloaded file. It may be a trojan downloaded from a rogue or corrupted web site!
wget --backups=1 google.com
renames original file with .1 suffix and writes new file to the intended filename.
Not exactly what was requested, but could be handy in some cases.
-c or --continue
From the manual:
If you use ‘-c’ on a non-empty file, and the server does not support
continued downloading, Wget will restart the download from scratch and
overwrite the existing file entirely.
I like the -c option. I started with the man page, then the web, but I've searched for this several times. For example, if you're relaying a webcam, the image always needs to be named image.jpg. It seems like this should be clearer in the man page.
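A minimal sketch of that webcam case, assuming the server does not support range requests (the condition in the manual passage quoted above) and using a hypothetical URL that ends in image.jpg:
wget -c http://webcam.example.com/image.jpg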
I've been using this for a couple years to download things in the background, sometimes combined with "limit-rate = " in my wgetrc file
while true
do
wget -c -i url.txt && break
echo "Restarting wget"
sleep 2
done
Make a little file called url.txt and paste the file's URL into it. Set this script up in your path or maybe as an alias and run it. It keeps retrying the download until there's no error. Sometimes at the end it gets into a loop displaying
416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
but that's harmless, just ctrl-c it. I think it's always gotten the file I wanted even if wget runs out of retries or the connection temporarily goes away. I've downloaded things for days at a time with it. A CD image on dialup, yes, always with wget.
My use case involves two different URLs, sometimes the second one doesn't exist, but if it DOES exist, I want it to overwrite the first file.
The problem with using wget -O is that, when the second file DOESN'T exist, it will overwrite the first file with a BLANK file.
So the only way I could find is with an if statement:
--spider checks if a file exists, and returns 0 if it does
--quiet fail quietly, with no output
-nv is quiet, but still reports errors
wget -nv https://example.com/files/file01.png -O file01.png
# quietly check if a different version exists
wget --quiet --spider https://example.com/custom-files/file01.png
if [ $? -eq 0 ] ; then
    # A different version exists, so download and overwrite the first
    wget -nv https://example.com/custom-files/file01.png -O file01.png
fi
It's verbose, but I found it necessary. I hope this is helpful for someone.
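If the full script is more than you need, the same exit-status check can be chained on one line with && (a sketch using the same example URLs):
wget --quiet --spider https://example.com/custom-files/file01.png && wget -nv https://example.com/custom-files/file01.png -O file01.png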
Here is an easy way to get it done with parameter expansion, trimming the URL down to its file name:
url=https://example.com/example.exe ; wget -nv $url -O ${url##*/}
Or you can use basename
url=https://example.com/example.exe ; wget -nv $url -O $( basename $url )
For those who do not want to use -O and want to specify the output directory only, the following command can be used.
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"
The first command downloads from the source with wget; the second removes the older file:
wget \
--directory-prefix "$dest" \
--backups 0 \
-- "$link"; \
rm -f -- "$file.1"

how to use -o flag in wget with -i?

I understand that the -i flag takes a file (which may contain a list of URLs), and I know that -o followed by a name can be specified to rename an item being downloaded using wget.
example:
wget -i list_of_urls.txt
wget -o my_custom_name.mp3 http://example.com/some_file.mp3
I have a file that looks like this:
file name: list_of_urls.txt
http://example.com/some_file.mp3
http://example.com/another_file.mp3
http://example.com/yet_another_file.mp3
I want to use wget to download these files with the -i flag but also save each file as 1.mp3, 2.mp3 and so on.
Can this be done?
You can use any scripting language (PHP or Python) to generate a batch file. In that batch file, each line would run wget with a URL and the -O option.
Or you can write a loop in a bash script, as sketched below.
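A minimal sketch of that loop in bash, assuming list_of_urls.txt contains one URL per line (file and variable names are illustrative):
n=1
while read -r url; do
    wget -O "$n.mp3" "$url"
    n=$((n+1))
done < list_of_urls.txt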
I ran a web search again and found https://superuser.com/questions/336669/downloading-multiple-files-and-specifying-output-filenames-with-wget
Wget can't seem to do it, but curl can with the -K flag; the file supplied can contain a URL and an output name. See http://curl.haxx.se/docs/manpage.html#-K
If you are willing to use some shell scripting then https://unix.stackexchange.com/questions/61132/how-do-i-use-wget-with-a-list-of-urls-and-their-corresponding-output-files has the answer.

How to use Rsync to copy only specific subdirectories (same names in several directories)

I have such directories structure on server 1:
data
  company1
    unique_folder1
    other_folder
    ...
  company2
    unique_folder1
    ...
  ...
And I want to duplicate this folder structure on server 2, but copy only the directories/subdirectories of unique_folder1. I.e. the result must be:
data
  company1
    unique_folder1
  company2
    unique_folder1
  ...
I know that rsync is very good for this.
I've tried 'include/exclude' options without success.
E.g. I've tried:
rsync -avzn --list-only --include '*/unique_folder1/**' --exclude '*' -e ssh user@server.com:/path/to/old/data/ /path/to/new/data/
But, as result, I don't see any files/directories:
receiving file list ... done
sent 43 bytes received 21 bytes 42.67 bytes/sec
total size is 0 speedup is 0.00 (DRY RUN)
What's wrong? Ideas?
Additional information:
I have sudo access to both servers. One idea I have is to use the find command and cpio together to copy the content I need into a new directory, and after that use rsync. But this is very slow; there are a lot of files, etc.
I've found the reason. It wasn't clear to me that rsync works this way.
So the correct command (for the company1 directory only) must be:
rsync -avzn --list-only --include 'company1/' --include 'company1/unique_folder1/***' --exclude '*' -e ssh user@server.com:/path/to/old/data/ /path/to/new/data
I.e. we need to include each parent company directory. And of course we cannot write all these company directories manually on the command line, so we save the list into a file and use it.
Final things we need to do:
1. Generate an include file on server 1, so that its content looks as follows (I've used ls and awk; see the sketch after the listing):
+ company1/
+ company1/unique_folder1/***
...
+ companyN/
+ companyN/unique_folder1/***
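A minimal sketch of that generation step, assuming the company directories sit directly under /path/to/old/data (the exact ls/awk invocation the author used is not shown):
ls /path/to/old/data | awk '{ print "+ " $0 "/"; print "+ " $0 "/unique_folder1/***" }' > include.txt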
2. Copy include.txt to server 2 and use a command like this:
rsync -avzn \
--list-only \
--include-from '/path/to/new/include.txt' \
--exclude '*' \
-e ssh user@server.com:/path/to/old/data/ \
/path/to/new/data
If the first matching pattern excludes a directory, then all its descendants will never be traversed. When you want to include a deep directory e.g. company*/unique_folder1/** but exclude everything else *, you need to tell rsync to include all its ancestors too:
rsync -r -v --dry-run \
--include='/' \
--include='/company*/' \
--include='/company*/unique_folder1/' \
--include='/company*/unique_folder1/**' \
--exclude='*'
You can use bash’s brace expansion to save some typing. After brace expansion, the following command is exactly the same as the previous one:
rsync -r -v --dry-run --include=/{,'company*/'{,unique_folder1/{,'**'}}} --exclude='*'
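For completeness, a hypothetical invocation with source and destination added (the paths are placeholders, not from the original answer):
rsync -r -v --dry-run --include=/{,'company*/'{,unique_folder1/{,'**'}}} --exclude='*' /path/to/old/data/ /path/to/new/data/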
An alternative to Andron's answer, which is simpler to both understand and implement in many cases, is to use the --files-from=FILE option. For the current problem:
rsync -arv --files-from='list.txt' old_path/data new_path/data
Where list.txt is simply
company1/unique_folder1/
company2/unique_folder1/
...
Note the -r flag must be included explicitly since --files-from turns off this behaviour of the -a flag. It also seems to me that the path construction is different from other rsync commands, in that company1/unique_folder1/ matches but /data/company1/unique_folder1/ does not.
For example, if you only want to sync target/classes/ and target/lib/ to a remote system, do
rsync -vaH --delete --delete-excluded --include='classes/***' --include='lib/***' \
--exclude='*' target/ user@host:/deploy/path/
The important things to watch:
Don't forget the "/" at the end of the paths, or you will get a copy into a subdirectory.
The order of the --include and --exclude options counts.
Contrary to the other answers, starting an include/exclude parameter with "/" is not needed; the patterns are automatically anchored to the source directory (target/ in the example).
To test what exactly will happen, we can use the --dry-run flag, as the other answers say.
--delete-excluded will delete all content in the target directory except the subdirectories we specifically included. It should be used wisely! For this reason, --delete alone is not enough: by default it does not delete the excluded files on the remote side (everything else, yes), so --delete-excluded should be given alongside the ordinary --delete.