I have written a Perl Script which uses WWW::Mechanize to connect to a site, login and then visit a few pages inside the site. It all works good, however, when I try to visit a large number of pages, the script gets killed. I am sure this has got nothing to with the HTTP Server's Configuration and the connection limits configured. This is because, the script is running on my own site.
Here's a high level overview of my script:
$url="http://example.com";
$mech=WWW::Mechanize->new();
$mech->cookie_jar(HTTP::Cookies->new());
$mech->get($url);
login to the site using the form fields.
Now, once I am logged in, I connect to URLs within the site as follows:
$i is the iteration counter in a for loop
$internal_url="http://example.com/index.php?page=$i";
$mech->get($internal_url);
perform some operations on the page returned ($mech->content using HTML::TreeBuilder::XPath)
now, I iterate over the for loop connecting to a different internal_url, since the value of $i is incremented in every iteration.
As I said, it all works good. However, after about 180 pages, the script gets killed.
What could be the reason? I have tried multiple times.
I even added a $mech->delete; right before the end of the FOR loop to prevent any memory leak.
However, the only issue is that the login session which was maintained by $mech would be destroyed as a result of this.
I have tried multiple times and this script always gets killed after visiting the same number of pages.
Thanks.
Try this code:
$mech=WWW::Mechanize->new();
$mech->stack_depth(0);
OR
$mech=WWW::Mechanize->new(stack_depth=>0);
According to the docs: Get or set the page stack depth. Use this if
you're doing a lot of page scraping and running out of memory.
A value of 0 means "no history at all." By default, the max stack
depth is humongously large, effectively keeping all history.
Related
I have a strange problem with my script in powershell, I want to examine the average time of downloading page. I write script which fires frequently. But sometimes my script returns result 0, which means it downloads site in 0 ms. If i modified my script to save whole site to the file when the download time is about 0ms it doesn't saves anything. And I'm interesting if I do something wrong, or powershell function isn't too accurate to count such "small" times.
ps. other "good" results are about 4-9 ms.
Here is a part of my script which responds to count the download time:
$StartTime = Get-Date
$PageDownload = $Request.DownloadString("mypage.com")
$TimeTaken = ((Get-Date) - $StartTime).TotalMilliseconds
Get-Date should be as precise as the system clock is.
There could be web caching going on. Unfortunately, disabling caching for WebClient is not possible, from what I see elsewhere. The "do it right" method is to construct your own Http request with the TcpClient class, but that's also pretty complex.
One easy way to make sure you're not being cached is to put an arbitrary value as a GET request. It's a hack, but it is often enough to fool a cache. So, instead of:
"http://mypage.com"
You use:
"http://mypage.com?someUnusedValueName=$([System.Environment]::TickCount)"
I've a standalone Raspberry Pi which shows a webpage from another server.
It reloads after 30 minutes via JavaScript on the webpage.
In some cases, the server isn't reachable for a very short time and Chromium shows the usual This webpage is not available message, and stops reloading
(because no JavaScript from the page triggers an reload).
In this case, how can I still reload the webpage after a few seconds?
Now i had the Idea to fetch the website results via AJAX and replace it in the current page if they were available.
Rather than refreshing the webpage every few minutes, what you can do is ping the server using javascript (pingjs is a nice library that can do that)
Now, if the ping is successful, reload the page. If it is not successful, wait for 30 more seconds and ping it again. Doing this continuously, will basically make you wait until the server is open again (i.e. you can ping it)
I think this is a much simpler method compared to making your own java browser and making a browser plugin.
Extra info: You should use a exponential function or timeout checking to avoid unnecessary processing overhead. i.e. the first time out find the ping fails, wait for 30 seconds, second time wait for 30*(2^1) sec, 3rd time wait for 30*(2^2) and so on until you reach a maximum value.
Note - this assumes your server is really unreachable ... and not just that the html page in unavailable (there's a small but appreciable difference)
My favored approach would be to copy the web page locally using a script every 30 mins and point chromium to the local copy.
The advantage is that script can run every 30 seconds, and it checks if the successful page pull happened in the last 30 mins. If YES it then does nothing. If NO then you can keep attempting to pull it. In the mean time the browser will be set to refresh the page every 5 seconds, but because it is pulling a local page it does little to no work for each refresh. You then can detect if what it has pulled back has the required content in it.
This approach assumes that your goal is to avoid refreshing the page every few seconds and therefore reducing load on the remote page.
Use these options to grab the whole page....
# exit if age of last reload is less than 1800 seconds (30 minutes)
AGE_IN_SECS=$(( $( perl -e 'print time();' ) - $(stat -c "%Y" /success/directory/index.html) ))
[[ $AGE_IN_SECS -lt 1800 ]] && exit
# copy whole page to the current directory
cd /temporary/directory
wget -p -k http://www.example.com/
and then you need to test the page in some way to ensure you have what you need, for example (using bash script)....
RESULT=$(grep -ci "REQUIRED_PATTERN_MATCH" expected_file_name )
[[ $result -gt 0 ]] && cp -r /temporary/directory/* /success/directory
rm -rf /temporary/directory/*
NOTE:
This is only the bare bones of what you need as I don't know the specifics of what you need. But you should also look at trying to ...
ensure you have a timeout on the wget, such that you do not have multiple wgets running.
create some form of back off so that you do not hammer the remote server when it is trouble
ideally show some message on the page if it is over 40 minutes old so that viewer knows a problem is being experienced.
you could use a chromium refresh plugin to pull the page from locally
you can use your script to alter the page once you have downloaded it if you want to add in additional/altered formatting (e.g. replace the css file?)
I see three solutions:
Load page in iframe (if not blocked), and check for content/response).
Create simple browser in java (not so hard, even if you dont know this language, using webview)
Create plugin for your browser.
reloading a page via javascript is pretty easy:
function refresh() {
var xhr = new XMLHttpRequest();
xhr.onreadystatechange = function() {
if (xhr.readyState == 4 && xhr.status === 200)
document.body.innerHTML = this.responseXML.body;
else
setTimeout('refresh', 1500);
};
xhr.open('GET', window.location.href);
xhr.responseType = "document"
xhr.send();
}
setInterval('refresh', 30*60*1000);
this should work as you requested
I used this link to fetch the pound to euro exchange rate on a daily (nightly) basis:
http://www.google.com/ig/calculator?hl=en&q=1pound=?euro
This returned an array which I then stripped and used the data I needed.
Since the first of November they retired iGoogle resulting in the URL to forward to: https://support.google.com/websearch/answer/2664197
Anyone knows an alternative URL that won't require me to rewrite the whole function? I'm sure google didn't stop providing this service entirely.
I started getting cronjob errors today on this very issue. So I fell back to a prior URL I was using before I switched to the faster/reliable iGoogle.
Url to programmatically hit (USD to EUR):
http://www.webservicex.net/CurrencyConvertor.asmx/ConversionRate?FromCurrency=USD&ToCurrency=EUR
Details about it:
http://www.webservicex.net/ws/WSDetails.aspx?CATID=2&WSID=10
It works for now, but its prone to be slow at times, and used to respond with an "Out of space" error randomly. Just be sure to code in a way to handle that, and maybe run the cron four times a day instead of once. I run ours every hour.
Example code to get the rate out of the return (there is probably a more elegant way):
$ci = curl_init($accessurl);
curl_setopt($ci, CURLOPT_HTTPGET, 1);
curl_setopt($ci, CURLOPT_RETURNTRANSFER, 1);
$rawreturn = curl_exec($ci);
curl_close($ci);
$rate = trim(preg_replace("/.*<double[^>]*>([^<]*)<\/double[^>]*>.*/i","$1",$rawreturn));
This is a weird one. :)
I have a script running under Apache 1.3, with Apache::PerlRun option of mod_perl. It uses the standard CGI.pm module. It's a regularly accessed script on a busy server, accessed over https.
The URL is typically something like...
/script.pl?action=edit&id=47049
Which is then brought into Perl the usual way...
my $action = $cgi->param("action");
my $id = $cgi->param("id");
This has been working successfully for a couple of years. However we started getting support requests this week from our customers who were accessing this script and getting blank pages. We already had a line like the following that put the current URL into a form we use for customers to report an issue about a page...
$cgi->url(-query => 1);
And when we view source of the page, the result of that command is the same URL, but with an entirely different query string.
/script.pl?action=login&user=foo&password=bar
A query string that we recognise as being from a totally different script elsewhere on our system.
However crazy it sounds, it seems that when users are accessing a URL with a query string, the query string that the script is seeing is one from a previous request on another script. Of course the script can't handle that action and outputs nothing.
We have some automated test scripts running to see how often this happens, and it's not every time. To throw some extra confusion into the mix, after an Apache restart, the problem seems to initially disappear completely only to come back later. So whatever is causing it is somehow relieved by a restart, but we can't see how Apache can possibly take the request from one user and mix it up with another.
This, it appears, is an interesting combination of Apache 1.3, mod_perl 1.31, CGI.pm and Apache::GTopLimit.
A bug was logged against CGI.pm in May last year: RT #57184
Which also references CGI.pm params not being cleared?
CGI.pm registers a cleanup handler in order to cleanup all of it's cache.... (line 360)
$r->register_cleanup(\&CGI::_reset_globals);
Apache::GTopLimit (like Apache::SizeLimit mentioned in the bug report) also has a handler like this:
$r->post_connection(\&exit_if_too_big) if $r->is_main;
In pre mod_perl 1.31, post_connection and register_cleanup appears to push onto the stack, while in 1.31 it appears as if the GTopLimit one clobbers the CGI.pm entry. So if your GTopLimit function fires because the Apache process has got to large, then CGI.pm won't be cleaned up, leaving it open to returning the same parameters the next time you use it.
The solution seems to be to change line 360 of CGI.pm to;
$r->push_handlers( 'PerlCleanupHandler', \&CGI::_reset_globals);
Which explicitly pushes the handler onto the list.
Our restart of Apache temporarily resolved the problem because it reduced the size of all the processes and gave GTopLimit no reason to fire.
And we assume it has appeared over the past few weeks because we have increased the size of the Apache process either through new developments which included something that wasn't before.
All tests so far point to this being the issue, so fingers crossed it is!
I want to get file size I'm doing this:
my $filename=$query->param("upload_file");
my $filesize = (-s $filename);
print "Size: $filesize ";`
Yet it is not working. Note that I did not upload the file. I want to check its size before uploading it. So to limit it to max of 1 MB.
You can't know the size of something before uploading. But you can check the Content-Length request header sent by the browser, if there is one. Then, you can decide whether or not you want to believe it. Note that the Content-Length will be the length of the entire request stream, including other form fields, and not just the file upload itself. But it's sufficient to get you a ballpark figure for conformant clients.
Since you seem to be running under plain CGI, you should be able to get the request body length in $ENV{CONTENT_LENGTH}.
Also want to sanity check against possibly already having post max set (from perldoc CGI):
$CGI::POST_MAX
If set to a non-negative integer, this variable puts a ceiling on the size of
POSTings, in bytes. If CGI.pm detects a POST that is greater than the ceiling,
it will immediately exit with an error message. This value will affect both
ordinary POSTs and multipart POSTs, meaning that it limits the maximum size of
file uploads as well. You should set this to a reasonably high value, such as
1 megabyte.
The uploaded file is stashed in a tmp location on the server when the form is submitted, check the file size there.
Supply the value for $field.
my $upload_filehandle = $query->upload($field);
my $tmpfilename = $query->tmpFileName($upload_filehandle);
my $file_size = (-s $tmpfilename);
This has nothing to do with Perl.
You are trying to read the filesize of a file on the user's computer using commands that read files on your server, what you want can't be done using Perl.
This is something that has to be done in the browser, and looking briefly at these questions it's either very hard or impossible.
Your best bet is to allow the user to start the upload and abort if the file is too big.
If you want to check before you process the request, you might be better off checking on the web page that triggers the request. I don't think the web browser can do it on it's own, but if you don't mind Flash, there are many Flash upload tools that can check things like size (as well as file types) and prevent uploading.
A good one to start with is the YUI Uploader. Lots more here: What is the best multiple file JavaScript / Flash file uploader?
Obviously you would want to check on the server side too, but by the time the user has started sending the request to the server, you are already using up your CPU cycles and bandwidth.
Thanks everyone for your replies; I just found out why $filesize = (-s $filename); was not working before, it is due that I was checking file size while sending Ajax request and not while re submitting the page.That's why I was having size to be zero. I fixed that to submit the page and it worked. Thanks.
Just read this post but while checking the content-length is a good approximate pre-check you could also save the file to temporary folder and then perform any kind of check on it. If it doesn't meet your criteria just delete and don't send it to it's final destination.
Look at the perl documentation for file stats -X - perldoc.perl.org and stat-perldoc.perl.org. Also, you can look at this upload script which is doing the similar thing what you are trying to do.