Slow wget speeds when connecting to https pages - wget

I'm using wget to connect to a secure site like this:
wget -nc -i inputFile
where inputFile consists of URLs like this:
https://clientWebsite.com/TheirPageName.asp?orderValue=1.00&merchantID=36&programmeID=92&ref=foo&Ofaz=0
This page returns a small gif file. For some reason, this is taking around 2.5 minutes. When I paste the same URL into a browser, I get back a response within seconds.
Does anyone have any idea what could be causing this?
The version of wget, by the way, is "GNU Wget 1.9+cvs-stable (Red Hat modified)"

I know this is a year old, but this exact problem plagued us for days.
It turned out to be our DNS server, but I got around it by disabling IPv6 on my box.
You can test it out prior to making the system change by adding "--inet4-only" to the end of the command (without quotes).
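Combined with the command from the original question, that would look something like:
wget -nc --inet4-only -i inputFile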

Try spoofing your User-Agent:
-U "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1"
Disable certificate checking (it is slow):
--no-check-certificate
Debug what's happening by enabling verbosity:
-v
Eliminate the need for DNS lookups:
Hardcode their IP address in your hosts file (/etc/hosts):
123.122.121.120 foo.bar.com
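Putting several of these together (the user-agent string is only an example, and inputFile is the URL list from the original question), a test run might look like:
wget -v --no-check-certificate -U "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1" -nc -i inputFile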

Have you tried profiling the requests using strace/dtrace/truss (depending on your platform)?
There are a wide variety of issues that could be causing this. What version of openssl is being used by wget - there could be an issue there. What OS is this running on (full information would be useful there).
The site could be enforcing some form of download slowdown based on the user-agent string wget sends, to reduce the effects of spiders.
Is wget performing full certificate validation? Have you tried using --no-check-certificate?
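If you go the strace route, a minimal sketch (the trace file name is just a placeholder) that timestamps every system call so you can see where the time goes:
strace -f -tt -o wget.trace wget -nc -i inputFile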

Is the certificate on the client site valid? You may want to specify --no-check-certificate if it is a self-signed certificate.
HTTPS (SSL/TLS) Options for wget

One effective solution is to drop the https:// prefix, which makes wget fall back to plain HTTP and skip the TLS handshake (at the cost of an unencrypted transfer).
This sped up my download by around 100 times.
For instance, if you want to download via:
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
you can use the following command instead to speed it up:
wget data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2

Related

Unable to install LMD on CentOS 7.9.2009 (core)

Can someone please help me with this? I'm attempting to follow the below guide on installing LMD (Linux Malware Detect) on CentOS.
https://www.tecmint.com/install-linux-malware-detect-lmd-in-rhel-centos-and-fedora/
The issue that I am having is that whenever I attempt to use "wget" on the specified link to LMD, it always pulls an HTML file instead of a .gz file.
Troubleshooting: I've attempted HTTPS instead of HTTP, but that results in an "unable to establish SSL connection" error message (see below). I've already looked around the internet for other guides on installing LMD on CentOS, and every one of them advises using wget on the .gz at the link below. I'm hoping that someone can help me work through this.
http://www.rfxn.com/downloads/maldetect-current.tar.gz
SSL error below
If you need further information from me, please let me know. Thank you.
Best,
B
wget --spider output: (screenshots attached to the original post)
This is interesting: you requested an asset from http://www.rfxn.com but were ultimately redirected to https://block.charter-prod.hosted.cujo.io, which appears to be a page with text like:
Let's stop for a moment
This website has been blocked as it may contain inappropriate content
I am unable to fathom why exactly this happened, but it probably has something to do with your network, as I ran wget --spider and it did detect (1.5M) [application/x-gzip].
Also, you replaced http with https in your command. Try wget exactly as it is mentioned in the guide:
wget http://www.rfxn.com/downloads/maldetect-current.tar.gz
Here is what I get with --spider option:
# wget --spider http://www.rfxn.com/downloads/maldetect-current.tar.gz
Spider mode enabled. Check if remote file exists.
--2022-07-06 22:04:57-- http://www.rfxn.com/downloads/maldetect-current.tar.gz
Resolving www.rfxn.com... 172.67.144.156, 104.21.28.71
Connecting to www.rfxn.com|172.67.144.156|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1549126 (1.5M) [application/x-gzip]
Remote file exists.
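If you want to see for yourself where your network redirects the request, adding -S makes wget print each response's headers, including the Location header of every redirect hop:
wget -S --spider http://www.rfxn.com/downloads/maldetect-current.tar.gz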
It was my ISP. They had router-based software preventing Linux extra-network commands from getting past the gateway.

wget recursive returns BAD REQUEST

I've tried downloading files from different subdirs using wget
wget -r -nd --no-parent -e robots=off --auth-no-challenge --user=myusr --ask-password https://myexdomain.com/dir/subdir/
and it works without any problems.
However, when I make the same wget call one directory level up
wget -r -nd --no-parent -e robots=off --auth-no-challenge --user=myusr --ask-password https://myexdomain.com/dir/
wget returns 400: BAD REQUEST
Any ideas why?
HTTP error 400 is fairly generic and usually related to the formatting of the client request. Without you providing the actual websites in question for us to test, it's a bit difficult to guess. However, here are a few guesses:
Does it work fine on plain http but not on https? If so, it may be related to a security policy.
Are you using a firewall, router, or resident antivirus that inspects packets? If so, it may be changing the packet structure rendering it invalid. If you're on a bigger network, is there such a device between you and the host in question (a tracert would help you figure that out).
Do you have access to this host server? If so, can you create new directories and test the exact conditions under which it fails? Is it always at the one-directory-below-root level?
You're excluding robots, which can be fine sometimes, but can also trigger different responses from servers depending on context. Have you tested without the robots exclusion?
Have you tried with a different user-agent? Some servers respond differently to wget user-agents. Can you access these files in a web-browser? Using curl? Or with a different user-agent specified with --user-agent in your wget command?
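For example (the user-agent string here is just a stand-in for a browser-like value):
wget -r -nd --no-parent -e robots=off --auth-no-challenge --user=myusr --ask-password --user-agent="Mozilla/5.0" https://myexdomain.com/dir/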
If none of this works, let us know the results of additional testing.

how to read and automatically analyse a pcap from STDIN

I'm about to build an automatic intrusion detection system (IDS) behind my FritzBox router in my home LAN.
I'm using a Raspberry Pi with Raspbian Jessie, but any dist would be ok.
After some searching and trying things out I found ntop (ntopng to be honest, but I guess my question applies to any version).
ntop can capture network traffic on its own, but that's not what I want, because I want to get all the traffic without putting the Pi between the devices or letting it act as a gateway (for performance reasons). Fortunately my FritzBox OS has a function to simulate a mirror port: you can download a .pcap which is continuously written in real time. I do it with a script from this link.
The problem is that I can't pipe the wget download to ntop like I could do it with e.g. tshark.
I'm looking for:
wget -O - http://fritz.box/never_ending.pcap | ntopng -f -
While this works fine:
wget -O - http://fritz.box/never_ending.pcap | tshark -i -
Suggestions for other analysis software are OK (if pretty enough ;) ), but I want to use the FritzBox pcap approach...
Thanks for saving another day of mine :)
Edit:
So I'm coming around to these approaches:
Make chunks of pcaps and run a script to analyse each pcap one after the other. Problem: ntop does not merge the results, and I could run into a storage problem if traffic runs hot.
Pipe wget to tshark and overwrite a single pcap every time, then analyse it with ntop. Problem: again, the storage.
Pipe wget to tshark, cut some information out and store it in a database. Problem: which info should I store, and what program likes databases more than pcaps?
The -i option in tshark is to specify an interface, whereas the -f option in ntop is to specify a name for the dump-file.
In ntopng I didn't even know there was a -f option!?
Does this solve your problem?
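For what it's worth, ntopng can also read a finished capture file if you pass a pcap path to -i instead of an interface name. A rough sketch of the "chunk" approach (file path and size are placeholders; tshark's -a filesize autostop cuts the endless stream into a finite file first):
wget -O - http://fritz.box/never_ending.pcap | tshark -i - -a filesize:100000 -w /tmp/fritz-chunk.pcap
ntopng -i /tmp/fritz-chunk.pcap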

Proxy setting in gsutil tool

I use the gsutil tool to download archives from Google Storage.
I use the following CMD command:
python c:\gsutil\gsutil cp gs://pubsite_prod_rev_XXXXXXXXXXXXX/YYYYY/*.zip C:\Tmp\gs
Everything works fine, but if I try to run that command from behind a corporate proxy, I receive this error:
Caught socket error, retrying: [Errno 10051] A socket operation was attempted to an unreachable network
I have tried several times to set the proxy settings in the .boto file, but all to no avail.
Has anyone faced such a problem?
Thanks!
Please see the section "I'm connecting through a proxy server, what do I need to do?" at https://developers.google.com/storage/docs/faq#troubleshooting
Basically, you need to configure the proxy settings in your .boto file, and you need to ensure that your proxy allows traffic to accounts.google.com as well as to *.storage.googleapis.com.
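A hedged example of what the relevant section of the .boto file looks like (hostname, port, and credentials are placeholders; omit proxy_user/proxy_pass if your proxy does not require authentication):
[Boto]
proxy = proxy.example.com
proxy_port = 8080
proxy_user = myuser
proxy_pass = mypassword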
A change was just merged into github yesterday that fixes some of the proxy support. Please try it out, or specifically, overwrite this file with your current copy:
https://github.com/GoogleCloudPlatform/gsutil/blob/master/gslib/util.py
I believe I am having the same problem with the proxy settings being ignored under Linux (Ubuntu 12.04.4 LTS) and gsutils 4.2 (downloaded today).
I've been watching tcpdump on the host to confirm that gsutils is attempting to directly route to Google IPs instead of to my proxy server.
It seems that on the first execution of a simple command like "gsutil -d ls" it will use the proxy settings specified in .boto for the first POST, and then switch back to attempting to route directly to Google instead of through my proxy server.
Then if I Ctrl-C and re-run the exact same command, the proxy setting is no longer used at all. This difference in behaviour baffles me. If I wait long enough, I think it will work for the initial request again, which suggests some form of caching taking place. I'm not 100% sure of this behaviour yet because I haven't been able to predict when it occurs.
I also noticed that it always first tries to connect to 169.254.169.254 (the Google Compute Engine metadata service) on port 80, regardless of proxy settings. A grep shows that it's hardcoded into oauth2_client.py, test_utils.py, layer1.py, and utils.py (under different subdirectories of the gsutil root).
I've tried setting the http_proxy environment variable but it appears that there is code that unsets this.

Problem with the -N option of wget

I am trying to scrape a website using wget. Here is my command:
wget -t 3 -N -k -r -x
The -N switch means "don't re-retrieve the file unless the server version is newer than the local version". But this isn't working: the same files get downloaded over and over again when I restart the scraping operation, even though the files haven't changed.
Many of the downloaded pages report:
Last-modified header missing -- time-stamps turned off.
I've tried scraping several web sites but all tried so far give this problem.
Is this a situation controlled by the remote server? Are they choosing not to send those timestamp headers? If so, there may not be much I can do about it.
I am aware of the -nc (no-clobber) option, but that prevents an existing file from being overwritten even when the server copy is newer, resulting in stale local data accumulating.
Thanks
Drew
The wget -N switch does work, but a lot of web servers don't send the Last-Modified header for various reasons. For example, dynamic pages (PHP or any CMS, etc.) have to actively implement the functionality (figure out when the content was last modified, and send the header). Some do, while some don't.
There really isn't another reliable way to check if a file has been changed, either.
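A quick way to check whether a particular server sends the header at all (the URL is just a placeholder) is to ask wget to print the server response without downloading anything:
wget -S --spider http://example.com/somepage.html 2>&1 | grep -i 'Last-Modified'
If nothing comes back, -N has no timestamp to compare against and will re-download the file.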