Default user agent in Wget

I want to know what default user agent is passed if I use wget from the command line without specifying an explicit user agent.
I have some code which changes its output based on the user agent.
wget http://www.google.com -O test.html

"wget -d" will show the request made to the server.
$ wget -d http://www.google.com -O /dev/null 2>&1 | grep ^User-Agent
User-Agent: Wget/1.13.4 (linux-gnu)
User-Agent: Wget/1.13.4 (linux-gnu)
User-Agent: Wget/1.13.4 (linux-gnu)

At your shell prompt, do:
> man wget
scroll down to -U agent-string, which states:
"Wget normally identifies as Wget/version, version being the current version number of Wget".
So do:
> wget --version
which will give you the version, and thus your user-agent.
Incidentally, you may find that some sites block wget, so depending on what you're doing you may need to change this.
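For example, you can check your exact version string and override the agent with -U (the Mozilla string and example.com below are just placeholders):
wget --version | head -n 1   # prints something like "GNU Wget 1.13.4 built on linux-gnu."
wget -U "Mozilla/5.0 (X11; Linux x86_64)" http://www.example.com/ -O test.html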

On my Fedora 13 system, it shows Wget/1.12 (linux-gnu)

Run wget and sniff the communication.
You can also check the web server's log; it usually contains the user agent of the connecting clients.
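For instance, assuming a default Apache setup (the log path is an assumption, adjust for your distribution):
grep -i 'wget' /var/log/apache2/access.log | tail -n 5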
This is what I got from the latest wget for Windows: Wget/1.11.4

You could verify this with a network protocol analyzer such as Wireshark.
With Wireshark you can inspect the headers and every other detail of the whole protocol stack involved.
Wireshark is both free and open source: http://www.wireshark.org/
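If you prefer the terminal, Wireshark's companion tool tshark can print just the User-Agent of plain-HTTP requests (the interface name is an assumption, and this won't see inside HTTPS):
tshark -i eth0 -Y 'http.request' -T fields -e http.user_agent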

Related

How to retrieve external files if wget erobots=off is not working?

I would like to download all pdf files linked on a website using wget on Mac OS (zsh).
I have tried:
wget -r -p -k --random-wait --limit-rate=50k -A .pdf -erobots=off https://unfccc.int/process-and-meetings/bodies/constituted-bodies/executive-committee-of-the-warsaw-international-mechanism-for-loss-and-damage-wim-excom/task-force-on-displacement/implementation-updates-task-force-on-displacement\#eq-1
and I have also added the following options to no avail:
--span-hosts
--no-check-certificate
--no-cookies
-H
The error is always the same:
no-follow attribute found in unfccc.int/process-and-meetings/bodies/constituted-bodies/executive-committee-of-the-warsaw-international-mechanism-for-loss-and-damage-wim-excom/task-force-on-displacement/implementation-updates-task-force-on-displacement. Will not follow any links on this page
First, make sure you have permission to crawl the pages; I'm not going to be responsible for anything bad that happens to you or anyone else after bypassing the robots no-follow attribute!
The trick to ignore all restrictions and be a bad crawler bot is simply to include -e robots=off in your command.
Yes, you included it in your command, but with a typo!
Just like x00 said, it needs to be -e robots=off instead of -erobots=off. I don't know why x00 didn't post that comment as an answer :/
Note:
x00, if you want me to delete my answer because it's nearly identical to your comment, just comment under my answer and I will delete it for you anytime!
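For reference, a sketch of your command with only that flag corrected (everything else unchanged):
wget -r -p -k --random-wait --limit-rate=50k -A .pdf -e robots=off "https://unfccc.int/process-and-meetings/bodies/constituted-bodies/executive-committee-of-the-warsaw-international-mechanism-for-loss-and-damage-wim-excom/task-force-on-displacement/implementation-updates-task-force-on-displacement#eq-1"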

How to save Charles session to a file from command line?

I wonder how to save a Charles session to a file.
Consider the following script:
open -ga Charles --args -headless -config charles.xml results.chls
#...some interactions here
pgrep -f Charles | xargs kill
I'm expecting to see something in results.chls, but the file is empty...
I figured it out myself. It seems the only way is to enable web control access for Charles and use HTTP like this:
curl --silent -x localhost:8888 http://control.charles/session/export-har -o "${EXPORT_FILE}" > /dev/null
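For completeness, a minimal sketch of the export step with the placeholder filled in (the file name is an assumption; web control access must be enabled in Charles and the proxy must be listening on the default port 8888):
EXPORT_FILE="results.har"
curl --silent -x localhost:8888 http://control.charles/session/export-har -o "${EXPORT_FILE}"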

How to set up cron using curl command?

After an Apache rebuild, my cron jobs stopped working.
I used the following command:
wget -O - -q -t 1 http://example.com/cgi-bin/loki/autobonus.pl
Now my DC support suggests I change from wget to curl. What would be the correct command in this case?
-O - is equivalent to curl's default behavior, so that's easy.
-q is curl's -s (or --silent)
--retry N will substitute for wget's -t N
All in all:
curl -s --retry 1 http://example.com/cgi-bin/loki/autobonus.pl
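In the crontab itself the entry might look like this (the schedule is only an example):
*/15 * * * * curl -s --retry 1 http://example.com/cgi-bin/loki/autobonus.pl > /dev/null 2>&1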
Try running it with the full path of wget:
/usr/bin/wget -O - -q -t 1 http://example.com/cgi-bin/loki/autobonus.pl
you can find the full path with:
which wget
Also, check whether you can reach the destination domain with ping or other methods:
ping example.com
Update:
Based on the comments, this seems to be caused by this line in /etc/hosts:
127.0.0.1 example.com #change example.com to the real domain
It seems your options are limited: on the server where the cron job should run, the domain is pinned to 127.0.0.1, but the virtual host configuration does not work with that.
What you can do is to let wget connect by IP but send the Host header so that the virtual host matching would work:
wget -O - -q -t 1 --header 'Host: example.com' http://xx.xx.35.162/cgi-bin/loki/autobonus.pl
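If you do switch to curl as your DC support suggested, the equivalent (with the same placeholder IP) would be:
curl -s --retry 1 -H 'Host: example.com' http://xx.xx.35.162/cgi-bin/loki/autobonus.pl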
Update
Also, you probably don't need to run this through the web server at all, so why not just run:
perl /path/to/your/script/autobonus.pl

Why does wget only download the index.html for some websites?

I'm trying to use the wget command:
wget -p http://www.example.com
to fetch all the files on the main page. For some websites it works, but in most cases it only downloads index.html. I've tried wget -r but it doesn't work. Does anyone know how to fetch all the files on a page, or just get a list of the files and their corresponding URLs on the page?
Wget is also able to download an entire website. But because this can put a heavy load upon the server, wget will obey the robots.txt file.
wget -r -p http://www.example.com
The -p parameter tells wget to include all files, including images. This will mean that all of the HTML files will look how they should.
So what if you don't want wget to obey the robots.txt file? You can simply add -e robots=off to the command like this:
wget -r -p -e robots=off http://www.example.com
Many sites will not let you download the entire site; they check your browser's identity. To get around this, use -U mozilla to identify as a browser.
wget -r -p -e robots=off -U mozilla http://www.example.com
A lot of website owners will not like the fact that you are downloading their entire site. If the server sees that you are downloading a large number of files, it may automatically add you to its blacklist. The way around this is to wait a few seconds after every download. The way to do this with wget is by including --wait=X (where X is the number of seconds).
You can also use the parameter --random-wait to let wget choose a random number of seconds to wait. To include this in the command:
wget --random-wait -r -p -e robots=off -U mozilla http://www.example.com
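For example, to wait roughly two seconds (randomized) between requests, you could combine both options:
wget --wait=2 --random-wait -r -p -e robots=off -U mozilla http://www.example.com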
Firstly, to clarify the question, the aim is to download index.html plus all the requisite parts of that page (images, etc). The -p option is equivalent to --page-requisites.
The reason the page requisites are not always downloaded is that they are often hosted on a different domain from the original page (a CDN, for example). By default, wget refuses to visit other hosts, so you need to enable host spanning with the --span-hosts option.
wget --page-requisites --span-hosts 'http://www.amazon.com/'
If you need to be able to load index.html and have all the page requisites load from the local version, you'll need to add the --convert-links option, so that URLs in img src attributes (for example) are rewritten to relative URLs pointing to the local versions.
Optionally, you might also want to save all the files under a single "host" directory by adding the --no-host-directories option, or save all the files in a single, flat directory by adding the --no-directories option.
Using --no-directories will result in lots of files being downloaded to the current directory, so you probably want to specify a folder name for the output files, using --directory-prefix.
wget --page-requisites --span-hosts --convert-links --no-directories --directory-prefix=output 'http://www.amazon.com/'
The link you have provided is the homepage (/index.html), so it's clear why you are only getting an index.html page. For an actual file download, for example a "test.zip" file, you need to add the exact file name at the end of the URL. For example, use the following to download test.zip:
wget -p domainname.com/test.zip
Download a Full Website Using wget --mirror
The following is the command to execute when you want to download a full website and make it available for local viewing.
wget --mirror -p --convert-links -P ./LOCAL-DIR http://www.example.com
--mirror: turn on options suitable for mirroring.
-p: download all files that are necessary to properly display a given HTML page.
--convert-links: after the download, convert the links in the document for local viewing.
-P ./LOCAL-DIR: save all the files and directories to the specified directory
Download Only Certain File Types Using wget -r -A
You can use this in the following situations:
Download all images from a website,
Download all videos from a website,
Download all PDF files from a website
wget -r -A .pdf http://example.com/
Another problem might be that the site you're mirroring uses links without www. So if you specify
wget -p -r http://www.example.com
it won't download any linked (internal) pages because they are from a "different" domain. If this is the case, then use
wget -p -r http://example.com
instead (without www).
I had the same problem downloading files of the CFSv2 model. I solved it by mixing the above answers and adding the parameter --no-check-certificate:
wget -nH --cut-dirs=2 -p -e robots=off --random-wait -c -r -l 1 -A "flxf*.grb2" -U Mozilla --no-check-certificate https://nomads.ncdc.noaa.gov/modeldata/cfsv2_forecast_6-hourly_9mon_flxf/2018/201801/20180101/2018010100/
Here is a brief explanation of every parameter used; for further details, see the GNU Wget manual.
-nH equivalent to --no-host-directories: Disable generation of host-prefixed directories. In this case, avoid the generation of the directory ./https://nomads.ncdc.noaa.gov/
--cut-dirs=<number>: Ignore directory components. In this case, avoid the generation of the directories ./modeldata/cfsv2_forecast_6-hourly_9mon_flxf/
-p equivalent to --page-requisites: This option causes Wget to download all the files that are necessary to properly display a given HTML page. This includes such things as inlined images, sounds, and referenced stylesheets.
-e robots=off: ignore the restrictions in the robots.txt file
--random-wait: causes the time between requests to vary between 0.5 and 1.5 times wait seconds, where wait is specified using the --wait option.
-c equivalent to --continue: continue getting a partially-downloaded file.
-r equivalent to --recursive: Turn on recursive retrieving. The default maximum depth is 5
-l <depth> equivalent to --level <depth>: Specify recursion maximum depth level
-A <acclist> equivalent to --accept <acclist>: specify a comma-separated list of the name suffixes or patterns to accept.
-U <agent-string> equivalent to --user-agent=<agent-string>: The HTTP protocol allows the clients to identify themselves using a User-Agent header field. This enables distinguishing the WWW software, usually for statistical purposes or for tracing of protocol violations. Wget normally identifies as ‘Wget/version’, the version being the current version number of Wget.
--no-check-certificate: Don't check the server certificate against the available certificate authorities.
I know this thread is old, but try what Ritesh mentioned, together with:
--no-cookies
It worked for me!
If you look for index.html in the wget manual, you will find the option --default-page=name, where name is index.html by default. You can change it to index.php, for example:
--default-page=index.php
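For example (assuming the site's front page really is index.php):
wget -r -p --default-page=index.php http://www.example.com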
If you only get the index.html and that file looks like it only contains binary data (i.e. no readable text, only control characters), then the site is probably sending the data using gzip compression.
You can confirm this by running cat index.html | gunzip to see if it outputs readable HTML.
If this is the case, then wget's recursive feature (-r) won't work. There is a patch for wget to work with gzip compressed data, but it doesn't seem to be in the standard release yet.
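As a quick sketch of checking and decoding (file names are just the ones from this answer):
file index.html                               # reports "gzip compressed data" if the page was served compressed
cat index.html | gunzip > index.decoded.html  # write the readable HTML to a new file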

Can I use wget to check, but not download?

Can I use wget to check for a 404 and not actually download the resource?
If so how?
Thanks
There is the command line parameter --spider exactly for this. In this mode, wget does not download the files and its return value is zero if the resource was found and non-zero if it was not found. Try this (in your favorite shell):
wget -q --spider address
echo $?
Or if you want full output, leave the -q off, so just wget --spider address. -nv shows some output, but not as much as the default.
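As a small sketch of using that exit status in a script (the URL is a placeholder):
if wget -q --spider "http://www.example.com/page"; then
    echo "found"
else
    echo "missing (e.g. 404)"
fi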
If you want to check quietly via $? without the hassle of grep'ing wget's output you can use:
wget -q "http://blah.meh.com/my/path" -O /dev/null
This works even on URLs with just a path, but has the disadvantage that the resource is actually downloaded, so it is not recommended when checking big files for existence.
You can use the following option to check for the files:
wget --delete-after URL
Yes, easy.
wget --spider www.bluespark.co.nz
That will give you
Resolving www.bluespark.co.nz... 210.48.79.121
Connecting to www.bluespark.co.nz[210.48.79.121]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
200 OK
Yes, to use wget to check, but not download, the target URL/file, just run:
wget --spider -S www.example.com
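To pull just the status line out of that output, one option (wget prints the headers on stderr) is:
wget --spider -S "http://www.example.com/" 2>&1 | grep "HTTP/"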
If you are in a directory where only root has write access, you can directly run wget www.example.com/wget-test as a standard user. It will hit the URL, but because there is no write permission, the file won't be saved.
This method works fine for me, as I use it in a cron job.
Thanks.