wget recursive returns BAD REQUEST - wget

I've tried downloading files from different subdirectories using wget:
wget -r -nd --no-parent -e robots=off --auth-no-challenge --user=myusr --ask-password https://myexdomain.com/dir/subdir/
and it works without any problems.
When, however, I make the wget call one directory level up
wget -r -nd --no-parent -e robots=off --auth-no-challenge --user=myusr --ask-password https://myexdomain.com/dir/
wget returns 400: BAD REQUEST
Any ideas why?

HTTP error 400 is fairly generic and usually related to the formatting of the client request. Without you providing the actual websites in question for us to test, it's a bit difficult to guess. However, here are a few guesses:
Does it work fine on plain http but not on https? If so, it may be related to a security policy.
Are you using a firewall, router, or resident antivirus that inspects packets? If so, it may be changing the packet structure, rendering it invalid. If you're on a bigger network, is there such a device between you and the host in question? (A tracert would help you figure that out.)
Do you have access to this host server? If so, can you create new directories and test the exact conditions under which it fails? Is it always at the one-directory-below-root level?
You're excluding robots, which can be fine sometimes, but can also trigger different responses from servers depending on context. Have you tested without the robots exclusion?
Have you tried with a different user-agent? Some servers respond differently to wget user-agents. Can you access these files in a web-browser? Using curl? Or with a different user-agent specified with --user-agent in your wget command?
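For example, a quick comparison along these lines (just a sketch, reusing the URL and credentials from the question; the browser-style user-agent string is arbitrary) would show whether the user agent makes the difference, and -S prints the server's response headers so you can see exactly what the 400 says:
# fetch the directory with wget's default user agent, printing the server's response headers
wget -S --auth-no-challenge --user=myusr --ask-password https://myexdomain.com/dir/
# the same request with a browser-style user agent
wget -S --auth-no-challenge --user=myusr --ask-password \
  --user-agent="Mozilla/5.0 (X11; Linux x86_64)" https://myexdomain.com/dir/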
If none of this works, let us know the results of additional testing.

Related

how to read and automatically analyse a pcap from STDIN

I'm about to build an automatic intrusion detection system (IDS) behind my FritzBox router in my home LAN.
I'm using a Raspberry Pi with Raspbian Jessie, but any dist would be ok.
After some searching and trying things out I found ntop (ntopng to be honest, but I guess my question applies to either version).
ntop can capture network traffic on its own, but that's not what I want, because I want to get all the traffic without putting the Pi between the devices or letting it act as a gateway (for performance reasons). Fortunately my FritzBox OS has a function to simulate a mirror port: you can download a .pcap that is continuously written in real time. I do it with a script from this link.
The problem is that I can't pipe the wget download into ntop the way I can with, e.g., tshark.
I'm looking for:
wget -O - http://fritz.box/never_ending.pcap | ntopng -f -
While this works fine:
wget -O - http://fritz.box/never_ending.pcap | tshark -i -
Suggestions for other analysis software are OK (if pretty enough ;) ), but I want to keep using the FritzBox pcap approach...
Thanks for saving another day of mine :)
Edit:
So I've come up with these approaches:
Make chunks of pcaps and run a script to analyse each pcap one after another (a sketch of this follows the list). Problem: ntop does not merge the results, and I could run into a storage problem if traffic runs hot.
Pipe wget into tshark and overwrite one pcap each time, then analyse it with ntop. Problem: again, the storage.
Pipe wget into tshark, cut some information out, and store it in a database. Problem: which info should I store, and what program likes databases more than pcaps?
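A minimal sketch of the chunked idea, assuming your tshark build accepts a stdin pipe together with its ring-buffer options (the path and sizes are placeholders):
# write rotating chunks of at most ~100 MB each, keeping only the 10 newest, so disk usage stays bounded
wget -qO - http://fritz.box/never_ending.pcap | \
  tshark -i - -w /var/tmp/fritz/chunk.pcap -b filesize:102400 -b files:10
A separate script could then hand each completed chunk to ntop, though the "results don't get merged" problem above still applies.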
The -i option in tshark is to specify an interface, whereas the -f option in ntop is to specify a name for the dump-file.
In ntopng I didn't even know there was a -f option!?
Does this solve your problem?

what's the purpose of the '--delete-after' option of wget?

I came across the --delete-after option when I was reading the manpage of wget.
What's the purpose of providing such an option? Is it just for testing that a page downloads OK? Or maybe there are other situations where this option is useful; I hope you can give me some hints.
With reference to your comments above, I'm providing some examples of how we use it. We have a few websites running on Rackspace Cloud Sites, which is a managed cloud hosting solution. We don't have access to regular cron.
We had an issue with runaway usage on a site using WordPress because WP kept calling wp-cron.php. To give you a sense of the runaway usage: it used up in one day the allotted CPU cycles for a month. Anyway, what I did was disable wp-cron.php being called within the WordPress system and call it manually through wget. I'm not interested in the output from the process, so if I didn't use --delete-after with wget (wget ... > /dev/null 2>&1 works well too), the folder where wget runs would fill up with hundreds of useless copies of the output from each time the script was called.
We also have SugarCRM installed, and that system requires its cron script to be called to handle system maintenance. We use wget silently for that as well. Basically, a lot of these kinds of web-based systems have cron scripts. If you can't call your scripts directly, say by using php on the machine, then the other option is to call them silently with wget.
The command to call these cron scripts is quite basic:
wget --delete-after http://example.com/cron.php?parameters=if+needed
I'm using wget (with cron) to automate commands to a web application, so I have no interest in the contents of the pages. --delete-after is ideal for this.
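As a sketch of that kind of setup (the schedule is arbitrary and the URL reuses the example above), the crontab entry might look like:
# hit the app's cron endpoint every 15 minutes, quietly, leaving no saved files behind
*/15 * * * * wget -q --delete-after "http://example.com/cron.php?parameters=if+needed"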
You can use it for testing whether a page downloads okay, but usually it's used to force proxy servers to cache content.
If you're sitting on a connection where there's a network appliance caching content between the site and your endpoint, and you have a site that's popular among users on that network, then what you may want to do as a sysadmin is use a machine just downstream of the proxy to script a recursive (-r) or mirror (-m) wget operation.
The proxy appliance will see this and pre-cache the site and its assets, making site access a bit faster for users behind said proxy.
You'd then want to specify --delete-after to free up the disk space used, unless you want to keep a local copy of every site you force into the cache.
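Putting those pieces together, a cache-warming run might look something like this (a sketch; example.com stands in for the site you want pre-cached):
# recursively fetch the site from behind the caching proxy, skip creating directories, and discard the local copies
wget -r -nd --delete-after http://example.com/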
Sometimes you only need to visit a website to set an IP address, say if you are rolling your own dynamic DNS service.

Problem with the -N option of wget

I am trying to scrape a website using wget. Here is my command:
wget -t 3 -N -k -r -x
The -N option means "don't download the file unless the server's version is newer than the local version". But this isn't working: the same files get downloaded over and over again when I restart the above scraping operation, even though the files haven't changed.
Many of the downloaded pages report:
Last-modified header missing -- time-stamps turned off.
I've tried scraping several web sites but all tried so far give this problem.
Is this a situation controlled by the remote server? Are they choosing not to send those timestamp headers? If so, there may not be much I can do about it.
I am aware of the -nc (no clobber) option, but that will prevent an existing file from being overwritten even if the server's file is newer, resulting in stale local data accumulating.
Thanks
Drew
The wget -N switch does work, but a lot of web servers don't send the Last-Modified header for various reasons. For example, dynamic pages (PHP or any CMS, etc.) have to actively implement the functionality (figure out when the content was last modified, and send the header). Some do, while some don't.
There really isn't another reliable way to check if a file has been changed, either.
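You can at least check whether a given server sends the header at all with something like this (a sketch; the URL is a placeholder): -S prints the server's response headers and --spider skips downloading the body.
# look for a Last-Modified line in the response headers
wget -S --spider http://example.com/somepage.html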

How to replay traffic to web server from logs to profile / benchmark web app under real load?

Is there a way to get recorded real network traffic to a web server, e.g. from web server logs (Apache), and replay this traffic to either profile the web application (in Perl) under real load, or benchmark and compare the speed of different implementations before choosing one or the other?
If it matters, the webapp is written in Perl and runs under plain CGI, FastCGI, mod_perl (via ModPerl::Registry), and PSGI (via Plack::App::WrapCGI).
Crossposted to Pro Webmasters
Similar questions on Server Fault:
How can I replay Apache access logs back at my servers to do real world load testing?
A quick Google search for this yielded an interesting blog entry, with subsequent useful comments, at http://www.igvita.com/2008/09/30/load-testing-with-log-replay/. A commenter also mentioned Tsung by Process-One, which allows recording sessions in real time, with the obvious note that you should be able to play them back. That doesn't help so much with existing Apache access logs, though.
Been here lately. I figured that if I dumped TCP traffic with tcpdump, I could rewrite the destination of the packets and then replay them to the new app servers. So I started out with something like this:
tcpdump -i eth1 dst -s 0 -w - port 80 | \
tcprewrite --mtu-trunc --infile=- --outfile=- \
--dstipmap=<source_ip>:<destination_ip> | \
tcpslice -w - - | tcpreplay --intf1=eth1 -
It did not work for various reasons, so I started digging some more and found Gor: a small Go project by Leonid Bugaev from Granify, written for exactly what we wanted to accomplish here.
This is how we ended up using Gor: http://devblog.springest.com/testing-big-infrastructure-changes-at-springest/
We have a Chef cookbook for it as well: https://github.com/Springest/gor-chef
Hope this helps.
The short answer was given on the other side.
The longer answer is that you can't: you will be missing request headers and POST bodies.
Here's a simple perl way to record real http traffic and play it back:
http://patrick.net/sprocket/rwt.html
If only GET requests are needed and there is no session-tracking implemented via query parameters, then this is possible.
One question: do you want to do it this way because (1) you want to emulate real-world distribution of traffic among your pages or (2) there are too many pages to even consider building any sort of test scripts?
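If only GETs matter, a crude sketch along these lines (assuming the common/combined Apache log format, an access.log file, and localhost as the test target) can already replay the recorded paths:
# extract the request path from each GET line and re-issue it against the test server
awk '$6 == "\"GET" {print $7}' access.log | \
  while read -r path; do
    curl -s -o /dev/null "http://localhost$path"
  done
It won't reproduce timing, headers, or POST bodies, which is exactly the limitation mentioned above.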

Slow wget speeds when connecting to https pages

I'm using wget to connect to a secure site like this:
wget -nc -i inputFile
where inputFile consists of URLs like this:
https://clientWebsite.com/TheirPageName.asp?orderValue=1.00&merchantID=36&programmeID=92&ref=foo&Ofaz=0
This page returns a small gif file. For some reason, this is taking around 2.5 minutes. When I paste the same URL into a browser, I get back a response within seconds.
Does anyone have any idea what could be causing this?
The version of wget, by the way, is "GNU Wget 1.9+cvs-stable (Red Hat modified)"
I know this is a year old but this exact problem plagued us for days.
It turned out to be our DNS server, but I got around it by disabling IPv6 on my box.
You can test it out prior to making the system change by adding "--inet4-only" to the end of the command (w/o quotes).
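Applied to the command from the question, that test is simply (a sketch):
wget -nc -i inputFile --inet4-only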
Try forging your user agent:
-U "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1"
Disable certificate checking (slow):
--no-check-certificate
Debug what's happening by enabling verbosity:
-v
Eliminate the need for DNS lookups:
Hardcode their IP address in your hosts file:
/etc/hosts
123.122.121.120 foo.bar.com
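Combining these with the command from the question gives something like the following (a sketch; whether each flag actually helps depends on where the slowdown really is):
# verbose output, browser-style user agent, certificate validation disabled
wget -v --no-check-certificate \
  -U "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1" \
  -nc -i inputFile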
Have you tried profiling the requests using strace/dtrace/truss (depending on your platform)?
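On Linux, for example, something like this (a sketch; the trace file name is arbitrary) timestamps every system call wget makes, which usually shows whether the time goes to DNS, the TLS handshake, or the transfer itself:
# -f follows child processes, -tt adds microsecond timestamps, -o writes the trace to a file
strace -f -tt -o wget.trace wget -nc -i inputFile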
There are a wide variety of issues that could be causing this. What version of OpenSSL is being used by wget? There could be an issue there. What OS is this running on? (Full information would be useful there.)
The site could also be enforcing some form of download slowdown based on the user-agent string wget sends, to reduce the effects of spiders.
Is wget performing full certificate validation? Have you tried using --no-check-certificate?
Is the certificate on the client site valid? You may want to specify --no-check-certificate if it is a self-signed certificate.
HTTPS (SSL/TLS) Options for wget
One effective solution is to delete the https:// prefix (wget then falls back to plain http).
This accelerated my download by around 100 times.
For instance, say you want to download via:
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
You can alternatively use the following command to speed it up:
wget data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2