Spider a website: what's the fastest method other than wget?

I use wget with the --spider -l4 -r --delete-after options, but I find the crawling rate relatively slow.
My category pages have pages 2, 3, 4, 5, etc. linked from page 1.
If I just want to hit each page of the website once, without retrieving the HTML, what's the fastest tool for the job?
The general goal is to touch all of the internal links so that each one loads once (to warm my cache). I don't need to crawl or save the HTML; I just need every page requested once. The site is fully interlinked:
Main site
> Category
> Different Posts of Category
> Pages of Post
I want to be able to crawl all of the site's links in the fastest way possible (I don't need to download anything), just like a Googlebot spidering around.
Thanks

I suggest trying mget http://rockdaboot.github.io/mget/
mget is a wget workalike, but multithreaded, so it can use parallel connections to speed things up. It also has more sophisticated compression support. If you start using it much, I think you'll find that it generally just works faster overall.
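If you would rather script it yourself than switch tools, the same idea can be done by hand: discover the internal links once, then request each URL in parallel and discard the response. Below is a rough Perl sketch of that approach (not mget itself); it assumes the WWW::Mechanize and Parallel::ForkManager CPAN modules, and the start URL, depth limit, and worker count are placeholder values to adjust.

```perl
#!/usr/bin/perl
# Sketch of a cache-warming crawl: walk internal links breadth-first,
# then hit every discovered URL in parallel without saving anything.
# The start URL, depth limit, and worker count below are placeholders.
use strict;
use warnings;
use WWW::Mechanize;
use Parallel::ForkManager;
use URI;

my $start     = 'http://example.com/';   # hypothetical site root
my $max_depth = 4;                       # roughly equivalent to wget -l4
my $workers   = 8;

my $host = lc URI->new($start)->host;
my $mech = WWW::Mechanize->new( autocheck => 0 );

# Phase 1: discover internal URLs (single process keeps deduplication simple).
my %seen  = ( $start => 1 );
my @queue = ( [ $start, 0 ] );
while ( my $item = shift @queue ) {
    my ( $url, $depth ) = @$item;
    next if $depth >= $max_depth;
    my $res = $mech->get($url);
    next unless $res->is_success && $res->content_type =~ m{text/html};
    for my $link ( $mech->links ) {
        my $abs = URI->new_abs( $link->url, $url );
        $abs->fragment(undef);
        next unless ( $abs->scheme // '' ) =~ /^https?$/;
        next unless lc( $abs->host ) eq $host;   # stay on the same site
        next if $seen{ $abs->as_string }++;
        push @queue, [ $abs->as_string, $depth + 1 ];
    }
}

# Phase 2: request every URL once in parallel, discarding the body.
my $pm = Parallel::ForkManager->new($workers);
for my $url ( keys %seen ) {
    $pm->start and next;
    WWW::Mechanize->new( autocheck => 0 )->get($url);
    $pm->finish;
}
$pm->wait_all_children;
```

Each worker in phase 2 only cares that the request is made; whether a plain GET is enough to prime your cache depends on your caching layer, but for most page caches one request per URL is all that's needed.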

Related

Sitewide Google Optimize redirect test

I have a website, www.website.com, and we built the same website on another system, shop.website.com. We want to test whether the new system converts, so we thought we'd set up a Google Optimize experiment. Is there a way to redirect all of the sublinks, something like www.website.com/* to shop.website.com/*, since the structure is exactly the same? So if people go to www.website.com/page3 they would land on shop.website.com/page3. We have more than 700 pages, so setting up 700 different redirect experiments is basically impossible. I've read every guide, article, and related forum thread I could find, but I couldn't spot an easy way. Any help?
I'll be extremely grateful.

Controlling a random webpage through a Perl script

There is a random website, say abc.com, and this website has a search engine. Is it possible to write a Perl script that automatically reads search values from a text file, feeds them into this search engine, and downloads the files returned by each search? Once a download is complete, the loop should continue until all the search values have been exhausted. I don't have any server details about the website itself.
Any help is much appreciated. Thanks!
This is HTTP client programming. You're basically writing a program that is pretending to be a browser.
The standard module for doing this is probably WWW::Mechanize (see the cookbook and the examples).
If you want something lower level, then the LWP bundle of modules will do everything you want.
There's a free online book, but it's a little old and probably doesn't reflect current best practices.
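To make that concrete, here is a minimal WWW::Mechanize sketch of the idea. The input file name, the target site, the form number, the field name 'q', and the .pdf link pattern are all placeholder assumptions that would have to be adapted to the actual search engine.

```perl
#!/usr/bin/perl
# Reads search terms from a text file, submits each one to the site's search
# form, and saves any result links that look like downloadable files.
# The site URL, form number, field name, and file pattern are guesses.
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

open my $fh, '<', 'searches.txt' or die "Cannot open searches.txt: $!";
while ( my $term = <$fh> ) {
    chomp $term;
    next unless length $term;

    $mech->get('http://abc.com/');       # placeholder for the real site
    $mech->submit_form(
        form_number => 1,                # assume the first form is the search box
        fields      => { q => $term },   # 'q' is a guessed field name
    );

    # Download every result link that looks like a file (here: PDFs).
    for my $link ( $mech->find_all_links( url_regex => qr/\.pdf$/i ) ) {
        my $url = $link->url_abs;
        ( my $file = $url->path ) =~ s{.*/}{};
        $mech->get( $url, ':content_file' => $file || 'result.pdf' );
    }
}
close $fh;
```

WWW::Mechanize handles cookies and redirects for you, which is usually all you need when you have no server-side details about the site.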

How to keep HTTrack Crawlers away from my website through robots.txt?

I maintain the website http://www.totalworkflow.co.uk and I'm not sure whether HTTrack follows the instructions in the robots.txt file. If there is a way to keep HTTrack away from the website, please suggest how to implement it, or just tell me the robot's name so I can block it from crawling my site. If this isn't possible with robots.txt, please recommend another way to keep this robot away from the website.
You're right that there's nothing forcing spam crawlers to follow the guidelines in a robots.txt file; I know robots.txt is really only respected by genuine search engines. However, HTTrack could behave properly if its developers hard-coded it to honour robots.txt guidelines whenever they are provided; if that option exists, the application would be genuinely useful for its intended purpose. Coming back to my issue: what I'd actually like to find is a way to keep the HTTrack crawlers away without hard-coding anything on the web server. I'm trying to solve this at the webmaster level first, but your idea is well worth considering for the future. Thank you.
HTTrack should obey robots.txt, but robots.txt is something a crawler doesn't have to obey (and for spam bots it's actually a handy map of what you don't want other people to see), so what's the guarantee that, even if it obeys robots.txt now, there won't be an option to ignore all robots.txt rules and meta tags at some point in the future? I think a better approach is to configure your server-side application to detect and block user agents. There's a chance the user agent string is hardcoded somewhere in the crawler's source code, so users won't be able to change it to stop you from blocking that crawler. All you have to do is write a server script that records user agent information (or check your server logs) and then create blocking rules from that information. Alternatively, you can just google a list of known "bad agents". To block user agents on a server that supports .htaccess, have a look at this thread for one way of doing it:
Block by useragent or empty referer
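If you do go the server-side route mentioned above (instead of, or alongside, the .htaccess rules in that thread), a user-agent guard can be as small as the following Perl CGI sketch. The blocked-agent list is only an example to be built from your own logs; HTTrack's default user agent string normally contains "HTTrack", but a user can change it.

```perl
#!/usr/bin/perl
# CGI sketch: refuse requests whose User-Agent matches a blocklist.
# The patterns below are examples only; derive the real list from your logs.
use strict;
use warnings;

my @blocked = ( 'HTTrack', 'WebCopier' );   # example patterns
my $agent   = $ENV{HTTP_USER_AGENT} // '';

if ( grep { $agent =~ /\Q$_\E/i } @blocked ) {
    print "Status: 403 Forbidden\r\n";
    print "Content-Type: text/plain\r\n\r\n";
    print "Access denied.\n";
    exit;
}

# ...normal page handling continues here...
print "Content-Type: text/html\r\n\r\n";
print "<html><body>Normal page output goes here.</body></html>\n";
```

Logging $ENV{HTTP_USER_AGENT} for a while before blocking anything is the "write a server script to spit out user agent information" step described above.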

On Google Sites, is it possible to directly provide my own pages, stylesheets, etc. without going through the creation process?

Right now I have to pick a template and jump through a bunch of hoops to create a page I don't even like. So we're wondering whether it's possible to just drop in a few web pages and the CSS that we created on our own as our website on Google Apps. Is that doable?
AFAIK, that is not possible. I made a Google Site for a client last year, and customization takes jumping through a lot of hoops. You can do quite a bit, even if it isn't readily apparent. I found that googling the specific things I wanted to do tended to yield the quickest results.

How to integrate vBulletin features into an external site

I have a website I'm building and the client wants features from vBulletin (blog, forums) integrated into the site. It's not enough to simply apply the site's skin to vBulletin. Is there a way to do this?
I would expect there to be documentation on how to do such a thing, if it's possible, but I haven't been able to find anything.
I'd rather not connect to and query the vBulletin database directly.
There is no proper API for this yet, so you'd either have to rely on things like RSS or query the database directly. RSS won't get you old data or any of the forum structure, etc.; just the basics of the new data.
After much research (see: cursing), I've found that external.php and blog_external.php do what I want, though not quite as elegantly as I would like.
So if you want to incorporate forum threads into your web page, external.php is what you need. It appears to be the more customizable of the two, in that you can have it output JavaScript, XML, RSS, or RSS Enclosure (podcasting).
If you want to incorporate blog posts, you appear to be limited to RSS only. Like I said, less than ideal, but at least it's something.
There is more information here: http://www.vbulletin.com/docs/html/vboptions_group_external
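As a rough illustration of the RSS route, the Perl sketch below pulls the board's feed from external.php and prints an HTML fragment the external site could include. The feed URL and its type parameter are assumptions based on a typical vBulletin install, so check your own external.php settings, and real code should HTML-escape the titles before output.

```perl
#!/usr/bin/perl
# Fetches the forum's RSS output from external.php and emits a simple
# HTML list of recent threads. The feed URL and "type" value are assumed.
use strict;
use warnings;
use LWP::UserAgent;
use XML::RSS;

my $feed_url = 'http://forum.example.com/external.php?type=RSS2';   # hypothetical

my $ua  = LWP::UserAgent->new( timeout => 10 );
my $res = $ua->get($feed_url);
die 'Could not fetch feed: ' . $res->status_line unless $res->is_success;

my $rss = XML::RSS->new;
$rss->parse( $res->decoded_content );

print qq{<ul class="forum-latest">\n};
for my $item ( @{ $rss->{items} } ) {
    # In production, HTML-escape the title before printing it.
    printf qq{  <li><a href="%s">%s</a></li>\n}, $item->{link}, $item->{title};
}
print "</ul>\n";
```

The same approach works for blog_external.php, which, as noted above, appears to offer RSS output only.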