I am trying to scrape a website using wget. Here is my command:
wget -t 3 -N -k -r -x
The -N option means "don't re-download the file unless the server's version is newer than the local one". But this isn't working: the same files get downloaded over and over again each time I restart the scraping operation, even though the files haven't changed.
Many of the downloaded pages report:
Last-modified header missing -- time-stamps turned off.
I've tried scraping several web sites but all tried so far give this problem.
Is this controlled by the remote server? Are they choosing not to send those timestamp headers? If so, is there anything I can do about it?
I am aware of the -nc (no-clobber) option, but that prevents an existing file from being overwritten even if the server's file is newer, resulting in stale local data accumulating.
Thanks
Drew
The wget -N switch does work, but a lot of web servers don't send the Last-Modified header for various reasons. For example, dynamic pages (PHP or any CMS, etc.) have to actively implement the functionality (figure out when the content was last modified, and send the header). Some do, while some don't.
There really isn't another reliable way to check if a file has been changed, either.
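If you want to check whether a particular server sends the header at all, you can inspect the response headers yourself; for example (using example.com as a placeholder):
wget -S --spider http://example.com/page.html 2>&1 | grep -i 'Last-Modified'
If nothing comes back, the server isn't sending Last-Modified and -N has nothing to compare against.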
I've tried downloading files from different subdirs using wget
wget -r -nd --no-parent -e robots=off --auth-no-challenge --user=myusr --ask-password https://myexdomain.com/dir/subdir/
and it works without any problems.
However, when I make the same wget call one directory level up
wget -r -nd --no-parent -e robots=off --auth-no-challenge --user=myusr --ask-password https://myexdomain.com/dir/
wget returns 400: BAD REQUEST
Any ideas why?
HTTP error 400 is fairly generic and usually relates to the formatting of the client request. Without the actual websites in question to test against, it's difficult to do more than guess. That said, here are a few possibilities:
Does it work fine on plain http but not on https? If so, it may be related to a security policy.
Are you using a firewall, router, or resident antivirus that inspects packets? If so, it may be changing the packet structure rendering it invalid. If you're on a bigger network, is there such a device between you and the host in question (a tracert would help you figure that out).
Do you have access to this host server? If so, can you create new directories and test the exact conditions under which it fails? Is it always at the one-directory-below-root level?
You're excluding robots, which can be fine sometimes, but can also trigger different responses from servers depending on context. Have you tested without the robots exclusion?
Have you tried with a different user-agent? Some servers respond differently to wget user-agents. Can you access these files in a web-browser? Using curl? Or with a different user-agent specified with --user-agent in your wget command?
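For example, a quick test with a browser-like user-agent (the UA string below is just an illustration) would be:
wget -r -nd --no-parent -e robots=off --auth-no-challenge --user=myusr --ask-password --user-agent="Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0" https://myexdomain.com/dir/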
If none of this works, let us know the results of additional testing.
I'm trying to upload data to Google Cloud Storage from a disk with ~3000 files totalling 1TB. I'm using gsutil cp -R <disk-top-directory> <bucket>. My understanding is that, if gsutil is resumed/restarted, it uses checksums to determine when a file has already been uploaded and skips over it.
It doesn't appear to be doing this: it appears to be resuming the upload from the top and replacing the files all over again. When I run successive gsutil ls -Rl <bucket/disk-top-directory> ten minutes apart and compare them with diff, I see what appears to be the same files with the same sizes but a changed (newer) date. (i.e. consistent with the same file being re-uploaded.)
For example:
< 404104811 2014-04-08T14:13:44Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
---
> 404104811 2014-04-08T14:43:48Z gs://my-bucket/disk-top-directory/dir1/dir2/dir3/dir4/dir5/file-20.tsv.bz2
The machine I'm using to read the disk and transfer files is running Ubuntu 13.10. I installed gsutil using the pip instructions for Debian and Ubuntu.
Am I misunderstanding how gsutil's resumable transfers are supposed to work? If not, is there any diagnosis and fix to get the correct resume behavior? Thanks in advance!
You need to use the -n (No-clobber) switch to prevent the re-uploading of objects that already exist at the destination.
gsutil cp -Rn <disk-top-directory> <bucket>
From the help (gsutil help cp)
-n No-clobber. When specified, existing files or objects at the
destination will not be overwritten. Any items that are skipped
by this option will be reported as being skipped. This option
will perform an additional HEAD request to check if an item
exists before attempting to upload the data. This will save
retransmitting data, but the additional HTTP requests may make
small object transfers slower and more expensive.
Also, according to this, when transferring files over 2 MB, gsutil automatically uses a resumable transfer mode.
If you're open to working with the (still beta) gsutil v4, that version of gsutil has an rsync command. You can get this by running:
gsutil update gs://prerelease/gsutil_4.0beta2pre_minus_m_sugg.tar.gz
Please be sure to read the release notes before switching to this major new release, especially if you're using gsutil v3 in scripts.
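As a rough sketch (assuming gsutil v4 and using the same source directory and bucket placeholders as above), a synchronization run would look something like:
gsutil rsync -r <disk-top-directory> <bucket>
rsync compares what is already at the destination and only copies files that are missing or have changed, which is closer to the resume behavior you were expecting.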
I have a website to which I have FTP access only (otherwise I'd use rsync for this) and I'd like to keep a local copy of it. At the moment I run the following wget command every once in a while
wget -m --ftp-user=me --ftp-password=secret ftp://my.server.com
When there are many updates it does get tedious, since wget only uses one connection at a time. I read about aria2 but couldn't find any hints as to whether it could be used as a replacement for this purpose.
No: according to the aria2 docs, the option for downloading only newer files works only with HTTP(S).
--conditional-get[=true|false]
Download file only when the local file is older than remote file. This function only works with HTTP(S) downloads only. It does not work if file size is specified in Metalink.
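So it only helps if the content is also reachable over HTTP(S); in that case the call would look roughly like this (the URL is a placeholder, and --allow-overwrite is what lets the existing local copy be replaced):
aria2c --conditional-get=true --allow-overwrite=true https://my.server.com/some/file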
I came across the "--delete-after" option when I was reading the manpage of wget.
What's the purpose of providing such an option? Is it just for testing that a page downloads OK? Or are there other situations where this option is useful? I hope you can give me some hints.
With reference to your comments above, I'm providing some examples of how we use it. We have a few websites running on Rackspace Cloud Sites, which is a managed cloud hosting solution. We don't have access to regular cron.
We had an issue with runaway usage on a site using WordPress, because WP kept calling wp-cron.php. To give you a sense of how runaway it was, it used up a month's allotment of CPU cycles in one day. What I did was disable wp-cron.php being called from within WordPress and call it through wget instead. I'm not interested in the output from the process, so without --delete-after (wget ... > /dev/null 2>&1 works well too), the folder where wget runs would fill up with hundreds of useless copies of the output from each time the script was called.
We also have SugarCRM installed, and that system requires its cron script to be called to handle system maintenance. We use wget silently for that as well. Basically, a lot of these web-based systems have cron scripts. If you can't call your scripts directly, say by using php on the machine, then the other option is calling them silently with wget.
The command to call these cron scripts is quite basic - wget --delete-after http://example.com/cron.php?parameters=if+needed
I'm using wget (with cron) to automate commands to a web application, so I have no interest in the contents of the pages. --delete-after is ideal for this.
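A sketch of what such a cron entry might look like (the schedule and URL are placeholders):
*/15 * * * * wget -q --delete-after "http://example.com/cron.php?parameters=if+needed"
Quoting the URL keeps the shell from interpreting any & in the query string.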
You can use it for testing whether a page downloads OK, but usually it's used to force proxy servers to cache content.
If you're sitting on a connection where there's a network appliance caching content between the site and your endpoint, and you have a site that's popular among users on that network, then what you may want to do as a sysadmin is use a downstream machine just behind the proxy to script a recursive ("-r") or mirror ("-m") wget operation.
The proxy appliance will see this and pre-cache the site and its assets, making site accesses for users behind that proxy a bit faster.
You'd then want to specify "--delete-after" to free up the disk space used, unless you want to keep a local copy of every site you force into the cache.
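A minimal sketch of such a cache-warming run (the URL is a placeholder) would be:
wget -m --delete-after http://popular-site.example.com/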
Sometimes you only need to visit a website to set an IP address, say if you are rolling your own dynamic DNS service.
I'm using wget to connect to a secure site like this:
wget -nc -i inputFile
where inputFile consists of URLs like this:
https://clientWebsite.com/TheirPageName.asp?orderValue=1.00&merchantID=36&programmeID=92&ref=foo&Ofaz=0
This page returns a small gif file. For some reason, this is taking around 2.5 minutes. When I paste the same URL into a browser, I get back a response within seconds.
Does anyone have any idea what could be causing this?
The version of wget, by the way, is "GNU Wget 1.9+cvs-stable (Red Hat modified)"
I know this is a year old but this exact problem plagued us for days.
It turns out it was our DNS server, but I got around it by disabling IPv6 on my box.
You can test it out prior to making the system change by adding "--inet4-only" to the end of the command (w/o quotes).
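With the command from the question, that would be something like:
wget -nc -i inputFile --inet4-only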
Try forging your user agent:
-U "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1"
Disable certificate checking (slow):
--no-check-certificate
Debug what's happening by enabling verbosity:
-v
Eliminate the need for DNS lookups:
Hardcode their IP address in your hosts file
/etc/hosts
123.122.121.120 foo.bar.com
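Putting a few of those suggestions together against the original command might look like this (the user-agent string is just an example):
wget -v --no-check-certificate -U "Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-GB; rv:1.9.0.1) Gecko/2008070206 Firefox/3.0.1" -nc -i inputFile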
Have you tried profiling the requests using strace/dtrace/truss (depending on your platform)?
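On Linux, for example, a rough profiling run with strace might look like this (the output file name is just an example):
strace -f -tt -e trace=network -o wget-trace.log wget -nc -i inputFile
The timestamps in wget-trace.log show roughly where the delay occurs, e.g. during name resolution, connection setup, or the transfer itself.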
There is a wide variety of issues that could be causing this. What version of OpenSSL is wget using? There could be an issue there. What OS is this running on (full information would be useful)?
The site could also be enforcing some form of download slowdown based on the user agent wget sends, to reduce the effects of spiders.
Is wget performing full certificate validation, and is the certificate on the site valid? If it's a self-signed certificate, you may want to try --no-check-certificate.
HTTPS (SSL/TLS) Options for wget
One workaround is to delete the https:// prefix, so that wget falls back to plain HTTP.
This sped up my download by around 100 times.
For instance, say you want to download:
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
You can use the following command instead to speed it up:
wget data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2