making LWP Useragent faster - perl

I need to perform a large number of HTTP post requests, and ignore the response. I am currently doing this using LWP::UserAgent. It seems to run somewhat slow though I am not sure if it is waiting for a response or what, is there anyway to speed it up and possibly just ignore the responses?

bigian's answer is probably the best for this, but another way to speed things up is to use LWP::ConnCache to allow LWP to re-use existing connections rather than build a new connection for every request.
Enabling it is this simple if you're pounding on just one site --
my $conn_cache = LWP::ConnCache->new;
$conn_cache->total_capacity([1]) ;
$ua->conn_cache($conn_cache) ;
I've found this to double the speed of some operations on http site, and more than double it for https sites.

LWP::Parallel
http://metacpan.org/pod/LWP::Parallel
"Introduction
ParallelUserAgent is an extension to the existing libwww module. It allows you to take a list of URLs (it currently supports HTTP, FTP, and FILE URLs. HTTPS might work, too) and connect to all of them in parallel, then wait for the results to come in."
It's great, it's worked wonders for me...

Related

Using GET verb to update in rest api?

I know the use of http verbs is based on standard specification. But my question if I use "GET" for update operations and write a code logic to update, does it create issues in any scenario? Apart from the standard, what else could be the reason to use these verbs for a specific purpose only?
my question if I use "GET" for update operations and write a code logic to update, does it create issues in any scenario?
Yes.
A simple example - suppose the network between the client and the server is unreliable; specifically, for a time, HTTP responses are being lost. A general purpose component (like a web proxy) might time out, and then, noticing that the method token of the request is GET, resend the request a second/third/fourth time, with your server performing its update on every GET request.
Let us further assume that these multiple update operations lead to an undesirable outcome; where do we properly affix blame?
Second example: you send someone a copy of the link to the update operation, so that they can send you a request at the appropriate time. But suppose you send that link to them in an email, and the email client recognizes the uri and (as a performance optimization) pre-fetches the link, triggering your update operation too early. Where do we properly affix the blame?
HTTP does not attempt to require the results of a GET to be safe. What it does is require that the semantics of the operation be safe, and therefore it is a fault of the implementation, not the interface or the user of that interface, if anything happens as a result that causes loss of property -- Fielding, 2002
In these, and other examples, blame is correctly affixed to your server, because GET has a standardized meaning which include the constraint that the semantics of the request are safe.
That's not to say that you can't have side effects when handling a GET request; "hit counters" are almost as old as the web itself. You have a lot of freedom in your implementation; so long as you respect the uniform interface, there won't be too much trouble.
Experience report: one of our internal tools uses GET requests to trigger scheduling; in our carefully controlled context (which is not web scale), we get away with it, and have for a very long time.
To borrow your language, there are certainly scenarios that would give us problems; but given our controls we manage to avoid them.
I wouldn't like our chances, though, if requests started coming in from outside of our carefully controlled context.
I think it's a decent question. You're asking a hypothetical: is there any value to doing the right other than that's we agree to use GET for fetching? e.g.: is there value beyond the fact that it's 'semantically nice'. A similar question in HTML might be: "Is it ok to use a <div> with an onclick instead of a <button>? (the answer is no).
There certainly is. Clients, servers and intermediates all change their behavior depending on what method is used. Even if your server can process GET for updates, and you build a client that uses this, your browser might still get confused.
If you are interested in this subject, don't ask on a forum; read the spec. The HTTP specification tells you what clients, servers and proxies should do when they encounter certain methods, statuses and headers.
Start at RFC7231

Recreate a site from a tcpdump?

It's a long story, but I am trying to save an internal website from the pointy hair bosses who see no value from it anymore and will be flicking the switch at some point in the future. I feel the information contained is important and future generations will want to use it. No, it's not some adult site, but since it's some big corp, I can't say any more.
The problem is, the site is a mess of ASP and Flash that only works under IE7 and is buggy under IE8 and 32bit only even. All the urls are session style and are gibberish. The flash objects itself pull extra information with GET request to ASP objects. It's really poorly designed for scraping. :)
So my idea is to do a tcpdump as I navigate the entire site. Then somehow dump the result of every GET into a sql database. Then with a little messing with the host file, redirect every request to some cgi script that will look for a matching get request in the database and return the data. So the entire site will be located in an SQL database in URL/Data keypairs. Flat file may also work.
In theory, I think this is the only way to go about this. The only problem I see is if they do some client side ActiveX/Flash stuff that generates session URLs that will be different each time.
Anyway, I know Perl, and the idea seems simple with the right modules, so I think I can do most of the work in that, but I am open to any other ideas before I get started. Maybe this exist already?
Thanks for any input.
To capture I wouldn't use tcpdump, but either the crawler itself or a webproxy that can be tweaked to save everything, e.g. Fiddler, Squid, or mod_proxy.

Single request to multiple asynchronous responses

So, here's the problem. iPhones are awesome, but bandwidth and latency are serious issues with apps that have serverside requirements. My initial plan to solve this was to make multiple requests for bits of data (pun unintended) and have that be how the issue of lots of incoming//outgoing data was handled. This is a bad idea for a lot of reasons, most obvious to me is that my poor database (MySQL) can't handle this very well. From what I understand it's better to request large chunks all at once, especially if I'm going to ask for all of it anyways.
The problem is now I'm waiting again for a large amount of data to get through. I was wondering if there's a way to basically send the server a bunch of IDs to get from the database, and then that SINGLE request then sends a lot of little responses, each one containing all the information about a single db entry. Order is irrelevant, and ideally I'd be able to send another request to the server telling it to stop sending me things because I have what I need.
I realize this is probably NOT a simple thing to do so if you (awesome) guys could point me in the right direction that would also be incredible.
Current system is iPhone (Cocoa//Objective-C) -> PHP -> MySQL
Thanks a ton in advance.
AFAIK, a single request cannot get multiple responses. From what you are asking, it seems that you need to do this in two parts.
Part 1: Send a single call with the IDs.
Your server responds with a single message that contains the URLs or the information needed to call the unique "smaller" answers.
Part 2: Working from that list of responses, fire off multiple requests that run on their own threads.
I am thinking of this similar to how a web page works. You call the HTML URL in a web browser. The HTML tells the browser all the places/URLS it needs to get additional pieces (images, css, js, etc) to build the full page.
Hope this helps.

SOP issue behind reverse proxy

I've spent the last 5 months developing a gwt app, and it's now become time for third party people to start using it. In preparation for this one of them has set up my app behind a reverse proxy, and this immediately resulted in problems with the browser's same origin policy. I guess there's a problem in the response headers, but I can't seem to rewrite them in any way to make the problem go away. I've tried this
response.setHeader("Server", request.getRemoteAddress());
in some sort of naive attempt to mimic the behaviour I want. Didn't work (to the surprise of no-one).
Anyone knowing anything about this will most likely snicker and shake their heads when reading this, and I do not blame them. I would snicker too, if it was me... I know nothing at all about this, and that naturally makes this problem awfully hard to solve. Any help at all will be greatly appreciated.
How can I get the header rewrite to work and get away from the SOP issues I'm dealing with?
Edit: The exact problem I'm getting is a pop-up saying:
"SmartClient can't directly contact
URL
'https://localhost/app/resource?action='doStuffs'"
due to browser same-origin policy.
Remove the host and port number (even
if localhost) to avoid this problem,
or use XJSONDataSource protocol (which
allows cross-site calls), or use the
server-side HttpProxy included with
SmartClient Server."
But I shouldn't need the smartclient HttpProxy, since I have a proxy on top of the server, should I? I've gotten no indications that this could be a serialisation problem, but maybe this message is hiding the real issue...
Solution
chris_l and saret both helped to find the solution, but since I can only mark one I marked the answer from chris_l. Readers are encouraged to bump them both up, they really came through for me here. The solution was quite simple, just remove any absolute paths to your server and use only relative ones, that did the trick for me. Thanks guys!
The SOP (for AJAX requests) applies, when the URL of the HTML page, and the URL of the AJAX requests differ in their "origin". The origin includes host, port and protocol.
So if the page is http://www.example.com/index.html, your AJAX request must also point to something under http://www.example.com. For the SOP, it doesn't matter, if there is a reverse proxy - just make sure, that the URL - as it appears to the browser (including port and protocol) - isn't different. The URL you use internally is irrelevant - but don't use that internal URL in your GWT app!
Note: The solution in the special case of SmartClient turned out to be using relative URLs (instead of absolute URLs to the same origin). Since relative URLs aren't an SOP requirement in browsers, I'd say that's a bug in SmartClient.
What issue are you having exactly?
Having previously had to write a reverseproxy for a GWT app I can't remember hitting any SOP issues, one thing you need to do though is make sure response headers and uri's are rewritten to the reverseproxies url - this includes ajax callback urls.
One issue I hit (which you might also experience) when running behind a reverseproxy was with the serialization policy of GWT server.
Fixing this required writing an implementation of RemoteServiceServlet. While this was in early/mid 2009, it seems the issue still exists.
Seems like others have hit this as well - see this for further details (the answer by Michele Renda in particular)

Is there any way to allow failed uploads to resume with a Perl CGI script?

The application is simple, an HTML form that posts to a Perl script. The problem is we sometimes have our customers upload very large files (gt 500mb) and their internet connections can be unreliable at times.
Is there any way to resume a failed transfer like in WinSCP or is this something that can't be done without support for it in the client?
AFAIK, it must be supported by the client. Basically, the client and the server need to negotiate which parts of the file (likely defined as parts in "multipart/form-data" POST) have already been uploaded, and then the server code needs to be able to merge newly uploaded data with existing one.
The best solution is to have custom uploader code, usually implemented in Java though I think this may be possible in Flash as well. You might be even able to do this via JavaScript - see 2 sections with examples below
Here's an example of how Google did it with YouTube: http://code.google.com/apis/youtube/2.0/developers_guide_protocol_resumable_uploads.html
It uses "308 Resume Incomplete" HTTP response which sends range: bytes=0-408 header from the server to indicate what was already uploaded.
For additional ideas on the topic:
http://code.google.com/p/gears/wiki/ResumableHttpRequestsProposal
Someone implemented this using Google Gears on calient side and PHP on server side (the latter you can easily port to Perl)
http://michaelshadle.com/2008/11/26/updates-on-the-http-file-upload-front/
http://michaelshadle.com/2008/12/03/updates-on-the-http-file-upload-front-part-2/
It's a shame that your clients can't use ftp uploading, since this already includes abilities like that. There is also "chunked transfer encoding" in HTTP. I don't know what Perl modules might support it already.