I am pulling URLs from my database with a Perl script, using fetchrow_array to fetch each URL. This worked fine until I encountered a very long URL: georgelog24.blog.iskreni.net/?bid=6744d9dcf85991ed2e4b8a258153a1ab&lid=ff9963b9a798ea335b75b5f7c0c295d1
At that point it started giving me this error:
DBD::ODBC::st fetchrow_array failed: st_fetch/SQLFetch (long truncated DBI attribute LongTruncOk not set and/or LongReadLen too small) (SQL-HY000) [state was HY000 now 01004]
[Microsoft][ODBC SQL Server Driver]String data, right truncation (SQL-01004) at C:\test\multihashtest2.pl line 44.
I believe this is on the database side, since the code I have been using to pull URLs has worked before. The database I am using is MS SQL Server 2005.
The URL column in the database currently uses the text type; I have tried changing it to varchar(max) and nvarchar(max), but the error still stands.
After a bit of trial and error I found that the maximum URL length I could query successfully with fetchrow_array was 81 characters. Since URLs can be ridiculously long, I cannot put a restriction on URL length.
Can anybody help me understand this and suggest a fix?
FYI: line 44 is the first line in my code below.
while (($myid, $url) = $statement_handle->fetchrow_array()) { # executes as many threads as there are jobs to do
    my $thread = threads->create(\&webcrawl);  # initiate thread
    my $tid = $thread->tid;
    print " - Thread $tid started\n";          # obtain thread no. and print
    push(@Threads, $thread);                   # push thread into array for "housekeeping" later on
}
Try with:
# no more errors if the content is truncated - you don't necessarily want this
$statement_handle->{'LongTruncOk'} = 1;
# hard-coded constant for the maximum length of data to be read from long columns
$statement_handle->{'LongReadLen'} = 20000;
while (($myid, $url) = $statement_handle->fetchrow_array()) { # executes as many threads as there are jobs to do
    my $thread = threads->create(\&webcrawl);  # initiate thread
    my $tid = $thread->tid;
    print " - Thread $tid started\n";          # obtain thread no. and print
    push(@Threads, $thread);                   # push thread into array for "housekeeping" later on
}
Also, I'd recommend trying Parallel::ForkManager for parallelizing jobs - I find it much more intuitive and easier to use than threads.
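A minimal sketch of how the same loop might look with Parallel::ForkManager (the process cap and the idea of passing $myid and $url to webcrawl are assumptions, not taken from the original code):

use Parallel::ForkManager;

# Fetch the rows up front so the database handle isn't shared across forked children.
my $rows = $statement_handle->fetchall_arrayref;

my $pm = Parallel::ForkManager->new(10);  # assumed cap of 10 concurrent children

for my $row (@$rows) {
    my ($myid, $url) = @$row;
    $pm->start and next;      # parent: spawn a child and move on to the next row
    webcrawl($myid, $url);    # child: do the work (the argument list is an assumption)
    $pm->finish;              # child exits here
}
$pm->wait_all_children;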
Please look at the DBI attributes LongTruncOk and LongReadLen.
You will NEED to either accept truncation or set a maximum size: text and varchar(max) columns can be massive, so if it were left to the DBD it would have no choice but to allocate huge amounts of memory just in case a value is as large as the column allows.
Important point: you need to set the LongReadLen and/or LongTruncOk attributes on the database handle prior to preparing the statement, as noted here.
Attempting to set it on the prepared statement handle prior to fetching data will have no effect on truncation of the returned data.
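In practice that looks something like this (a minimal sketch; the DSN, credentials, query, and the 512 KB limit are placeholders):

my $dbh = DBI->connect('dbi:ODBC:DSN=mydsn', $user, $pass, { RaiseError => 1 });

# Set these on the database handle *before* prepare(); statement handles inherit them.
$dbh->{LongReadLen} = 512 * 1024;  # allow long values of up to 512 KB
$dbh->{LongTruncOk} = 0;           # raise an error rather than silently truncating

my $statement_handle = $dbh->prepare('SELECT myid, url FROM mytable');
$statement_handle->execute;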
I've added advisory locking to Sqitch, using Postgres advisory locks and MySQL GET_LOCK(). This feature prevents more than one instance of Sqitch from deploying to a database at one time. This works great, but I wanted to add a lock timeout, too, so that one never finds a CI/CD process hung for hours or days because something went amiss.
MySQL's GET_LOCK() supports a timeout argument, but Postgres advisory locks do not. Since I thought it likely that other database engines would also not have timeouts, I thought it best to implement the timeout in Perl. Following the DBI manual, I used Sys::SigAction to set and handle the timeout:
# Try waiting for the lock.
require App::Sqitch::SigAction;
return $self->_locked(1) unless App::Sqitch::SigAction::timeout_call($wait, sub {
    $self->wait_lock
});
I also added tests to confirm it works with both MySQL and Postgres. So far so good.
Alas, Sys::SigAction does not work on Windows. I took a stab at testing it on Windows, but since Windows Perl is not compiled with d_sigaction, which Sys::SigAction also requires, I didn't get far. I tried implementing the Perl-standard alarm/$SIG{ALRM} pattern, but it failed to send the signal while waiting on the Postgres lock.
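For reference, the standard alarm/$SIG{ALRM} idiom from perldoc -f alarm looks roughly like this (a sketch of the general pattern, not the exact code I tried):

my $timed_out = 0;
eval {
    local $SIG{ALRM} = sub { die "alarm\n" };  # NB: \n required
    alarm $wait;
    $self->wait_lock;   # blocks on the database until the lock is granted
    alarm 0;
};
if ($@) {
    die $@ unless $@ eq "alarm\n";  # propagate unexpected errors
    $timed_out = 1;                 # the lock attempt timed out
}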
Which has led me here and to my question: What is the best cross-platform pattern for timing out some execution in Perl? Ideally it has a straightforward interface, works on *nix and Windows, and effectively handles breaking out of a database query.
I ended up ditching Sys::SigAction following discussion here and elsewhere, and instead switched to:
1. Letting the database handle the timeout, as MySQL's get_lock() does
2. Adding a simple interface for polling with exponential backoff and timeout that engines can use to poll for a lock instead of waiting (similar to Retry::Backoff)
3. Switching the Postgres implementation to use the async query support in DBD::Pg to send off the lock request, then using the backoff/timeout interface to check whether it has returned, and cancelling the query if it times out
I was especially pleased to realize I could do #3, as I originally used the timeout/backoff interface to poll with pg_try_advisory_lock(key), which just feels heavy. Better to asynchronously call pg_advisory_lock(key) and poll for its response. It looks like this:
sub wait_lock {
    my $self = shift;
    # Asynchronously request a lock with an indefinite wait.
    my $dbh = $self->dbh;
    $dbh->do(
        'SELECT pg_advisory_lock(75474063)',
        { pg_async => DBD::Pg::PG_ASYNC() },
    );
    # Use _timeout to periodically check for the result.
    return 1 if $self->_timeout(sub { $dbh->pg_ready && $dbh->pg_result });
    # Timed out, cancel the query and return false.
    $dbh->pg_cancel;
    return 0;
}
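The _timeout helper isn't shown here; a minimal sketch of such a poll-with-backoff method might look like the following (the lock_timeout attribute is the one used in the MySQL example below; the intervals and structure are otherwise assumptions, not the real Sqitch code):

use Time::HiRes qw(sleep time);

# Poll $code with exponentially increasing pauses until it returns true
# or lock_timeout seconds have elapsed. Returns true on success, false on timeout.
sub _timeout {
    my ($self, $code) = @_;
    my $deadline = time + $self->lock_timeout;
    my $interval = 0.1;                   # start with a 100 ms pause
    while (time < $deadline) {
        return 1 if $code->();
        sleep $interval;
        $interval *= 2 if $interval < 2;  # back off up to a 2-second ceiling
    }
    return 0;
}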
Of course the MySQL implementation is simpler, since get_lock() does all the work:
sub wait_lock {
    my $self = shift;
    $self->dbh->selectcol_arrayref(
        q{SELECT get_lock('sqitch working', ?)},
        undef, $self->lock_timeout
    )->[0]
}
I have a file in which I keep a serialized Perl hash. In my current script, I load the values like this:
my $arrayref = retrieve("mySerializedFile");
my $a = $arrayref->[0];
my $b = $arrayref->[1];
my $c = $arrayref->[2];
My problem is that the file is about 1GB, so it takes about ten seconds to load and then another second to perform some operations. I would like to reduce the retrieve time.
Is there any way of having this data loaded before the script executes? I mean, mySerializedFile is not supposed to change for a long time, so if it could always be loaded on the system, that would be nice and would improve my execution time from 11 seconds to 1.
Following the suggestions in the comments, I switched to a DB engine, which improved the execution time a lot; it is about 5 seconds now.
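The answer doesn't say which engine was used; as one possible illustration, the data could be kept in an on-disk SQLite database and fetched by key, so later runs read only the rows they need instead of deserializing the whole 1GB structure (the table name and the assumption that the elements are plain scalars are mine; nested elements would need to be serialized individually, e.g. with Storable):

use DBI;
use Storable qw(retrieve);

my $dbh = DBI->connect('dbi:SQLite:dbname=mydata.sqlite', '', '', { RaiseError => 1 });

# One-time conversion from the Storable file (hypothetical schema).
$dbh->do('CREATE TABLE IF NOT EXISTS items (idx INTEGER PRIMARY KEY, value TEXT)');
my $arrayref = retrieve('mySerializedFile');
my $ins = $dbh->prepare('INSERT OR REPLACE INTO items (idx, value) VALUES (?, ?)');
$ins->execute($_, $arrayref->[$_]) for 0 .. $#$arrayref;

# Later runs open the database and fetch only what they need - no 1GB deserialize.
my $get = $dbh->prepare('SELECT value FROM items WHERE idx = ?');
my ($a, $b, $c) = map { $get->execute($_); ($get->fetchrow_array)[0] } 0 .. 2;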
I've inherited some code at work that I'm trying to improve on. My Perl skills are somewhat lacking, so I would love some assistance!
Essentially this script SNMP-polls a network of thousands of nodes to update its local interface index cache. I've found it's hitting a problem where it exhausts its memory and fails. Code as follows (heavily reduced, but I think you'll get the gist):
use strict;
use warnings;
use Parallel::Loops;
my %snmp_results;
my $maxProcs = 50;
my @exceptions;
my @devices;
my $pl = Parallel::Loops->new($maxProcs);
$pl->share(\%snmp_results, \@exceptions);
load_devices();
get_snmp_interfaces();
sub get_snmp_interfaces {
    $pl->foreach( \@devices, sub {
        my ($name, $community, $snmp_ver) = @$_;
        # Create the new ifindex cache, and return an array reference to the new entries
        my $result = getSNMPIFFull($name, $community, $snmp_ver);
        if (defined $result && $result ne "") {
            my %cache = %{$result};
            print "Got cache for $name\n";
            # Build hash of all the links polled through SNMP
            # [ifindex, ifdesc, ifalias, ifspeed, ip]
            for my $link (keys %cache) {
                $snmp_results{$name}{$cache{$link}[0]} = [$cache{$link}[0], $cache{$link}[1], $cache{$link}[2], $cache{$link}[3], $cache{$link}[4]];
            }
        }
        else {
            push(@exceptions, "Unable to poll $name - $community - $snmp_ver");
        }
    });
}
This particular VM has 3.1GB of RAM allocatable and idles at about 83MB of usage when this script is not running. If I drop maxProcs down to 25, it finishes fine, but this script can already take a long time given the sheer number of devices plus latency, so I would rather keep the parallelism high!
I have a feeling that $pl->share() is sharing the ever-expanding %snmp_results with each forked process, which is definitely not necessary since each child isn't reading or modifying other entries: it's just adding new ones. Is there a better way I can be doing this?
I'm also slightly unsure about my %cache = %{$result};. If this just creates an alias for the hash then cool, but if it's doing a full copy, that's also a bit wasteful!
Any help will be greatly appreciated!
Documentation of the module can be found on CPAN.
There's one part talking about performance:
Also, if each loop sub returns a massive amount of data, this needs to be communicated back to the parent process, and again that could outweigh parallel performance gains unless the loop body does some heavy work too.
You are probably moving complete copies of the variables around in memory, pushing the machine to its limit if the MIB to poll and the number of machines are big enough.
Since what you are doing is an I/O-intensive task, not a CPU-bound task that could benefit from parallel CPU processing, I would reconsider the approach of launching so many (50!) processes for polling.
Run the program with $maxProcs set down to 1-5 processes and see how it behaves. Do some profiling of your code, attaching Devel::NYTProf to check where you are spending time and whether increasing the number of processes actually leads to better performance.
Reconsider using Parallel::Loops for this task. You may get better performance with use threads and a hash shared between the different threads (threads::shared); a rough sketch follows below.
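That threaded alternative might look roughly like this, reusing the names from the question (the worker count is an assumption, and it presumes getSNMPIFFull is thread-safe):

use threads;
use threads::shared;
use Thread::Queue;

my %snmp_results :shared;
my @exceptions   :shared;

my $queue = Thread::Queue->new(@devices);  # all of the work is queued up front

my @workers = map {
    threads->create(sub {
        # dequeue_nb returns undef once the queue is empty, ending the loop
        while (defined(my $device = $queue->dequeue_nb)) {
            my ($name, $community, $snmp_ver) = @$device;
            my $result = getSNMPIFFull($name, $community, $snmp_ver);
            if (defined $result && $result ne "") {
                # shared_clone stores a deep, shared copy of the nested structure
                $snmp_results{$name} = shared_clone($result);
            }
            else {
                push @exceptions, "Unable to poll $name - $community - $snmp_ver";
            }
        }
    });
} 1 .. 10;  # assumed number of worker threads
$_->join for @workers;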
Apologies if this could have been a comment. Starting in SO is difficult due to all the limitations that are in place :(
If you have already found a solution, it would be great if you could share your findings with us. I didn't know Parallel::Loops before and I think I can give it some use.
I have a loop that reads from a socket in Lua:
socket = nmap.new_socket()
socket:connect(host, port)
socket:set_timeout(15000)
socket:send(command)
repeat
    response, data = socket:receive_buf("\n", true)
    output = output .. data
until data == nil
Basically, the last line of the data does not contain a "\n" character, so it is never read from the socket. But this loop just hangs and never completes. I basically need it to return whenever the "\n" delimiter is not recognised. Does anyone know a way to do this?
Cheers
Updated to include socket code.
Update2
OK I have got around the initial problem of waiting for a "\n" character by using the "receive_bytes" method.
New code:
-- socket set up as above
repeat
    data = nil
    response, data = socket:receive_bytes(5000)
    output = output .. data
until data == nil
return output
This works and I get the large, complete block of data back. But I need to reduce the buffer size from 5000 bytes, as this is used in a recursive function and memory usage could get very high. I'm still having problems with my "until" condition, however, and if I reduce the buffer to a size that requires the method to loop, it just hangs after one iteration.
Update3
I have gotten around this problem using string.match and receive_bytes. I take in at least 80 bytes at a time. Then string.match checks whether the data variable contains a certain pattern, and if so it exits. It's not the cleanest solution, but it works for what I need it to do. Here is the code:
repeat
    response, data = socket:receive_bytes(80)
    output = output .. data
until string.match(data, "pattern")
return output
I believe the only way to deal with this situation in a socket is to set a timeout.
The following link has a little bit of info, but it's about HTTP sockets: lua http socket timeout
There is also this one (9.4 - Non-Preemptive Multithreading): http://www.lua.org/pil/9.4.html
And this question: http://lua-list.2524044.n2.nabble.com/luasocket-howto-read-write-Non-blocking-TPC-socket-td5792021.html
A good discussion of sockets can be found at this link:
http://nitoprograms.blogspot.com/2009/04/tcpip-net-sockets-faq.html
It's .NET but the concepts are general.
See update 3. Because the last part of the data always matches the same pattern, I can read in a block of bytes and check each time whether that block contains the pattern. If it does, that means it is the end of the data, so I append it to the output variable and exit.
(Update to answer Jonathan Leffler's question below):
We're running Perl 5.8.7 and Oracle 11.1.0.7.0.
Due to the company's policy, developers have no real control over software upgrades. A proposal to upper management takes months to be followed up (if it is approved at all) - I guess this situation is not unusual at several other companies either.
I inherited the program from someone who left the company, and found the warning about "issuing rollback() ..." in the application log file. The actual problem, "maximum open cursors exceeded", was found after I ran DBI_TRACE=2=/tmp/trace.log program_name.pl.
Looking at the numbers in $dbh->{ActiveKids}, $dbh->{Kids}, and $dbh->{CachedKids}, I assume the maximum number of open cursors is 50, as the error happens after it reaches 50.
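For reference, this is roughly how I inspect those counts (a sketch; the db_Main() call mirrors the workaround shown further down):

# Log the handle counts to watch cursor usage grow (sketch only).
my $dbh = __PACKAGE__->db_Main();
printf STDERR "Kids=%d ActiveKids=%d CachedKids=%d\n",
    $dbh->{Kids}, $dbh->{ActiveKids}, scalar keys %{ $dbh->{CachedKids} || {} };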
Our legacy production codes are using these modules:
DBI - 1.48
Ima::DBI - 0.33
Class::DBI - 0.96
Class::DBI::Oracle - 0.51
DBD::Oracle - 1.16
For some odd policy reason, upgrading the modules to newer versions is not possible :(
The application relies on CDBI to handle relationships across a large number of tables. A simplified snippet of the code is below:
JOB:
foreach my $job (@jobs) {
    my @records = $job->record;
    RECORD:
    foreach my $record (@records) {
        my @datas = $record->data;
        DATA:
        foreach my $data (@datas) {
            ....
        }
    }
}
where each of @jobs, $record, and $data is an object mapped to a table, and the innermost loop calls several other triggers.
Somewhere after several loops I get an Oracle error, maximum open cursors exceeded, and then the error from CDBI: issuing rollback() for database handle being DESTROY'd without explicit disconnect.
I can work around it by undef-ing the DBI CachedKids in the outermost loop, with:
# somewhere during initialization
$self->{_this_dbh} = __PACKAGE__->db_Main();
....
JOB:
foreach my $job (@jobs) {
    RECORD: ....
    DATA: ....
    $self->{_this_dbh}->{CachedKids} = undef;
}
Is that the proper way to do it?
Or does CDBI support a way to clear statement handles, the same way as DBI's $sth->finish()?
Thanks.
At some point, you will have to explain why you cannot upgrade to more nearly current versions of the software. You didn't mention which version of Perl you are using, or which version of Oracle; somehow, I suspect that it is neither 5.10.1 nor 11gR2.
Current versions:
Class::DBI 3.0.17
Class::DBI::Oracle 0.51
DBI 1.609 (version 1.48 is from 2005)
DBD::Oracle 1.23 (version 1.16 is from 2004)
Ima::DBI 0.35
What changed recently? Why are you suddenly finding problems in a piece of software that was, presumably, very stable? Is this new code?
With plain DBI, when you undef a statement handle (by having it go out of scope, for example), then the resources associated with it are released - more or less noisily. However, there is enough infrastructure between Class::DBI and DBI that it is hard to tell how this might map.
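For illustration, the plain-DBI behaviour being described looks roughly like this (table and column names are placeholders):

{
    my $sth = $dbh->prepare('SELECT id FROM some_table WHERE flag = ?');
    $sth->execute(1);
    while (my ($id) = $sth->fetchrow_array) {
        # ... work with $id ...
    }
}   # $sth goes out of scope here; DBI finishes it and the cursor is released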
Have you worked out what the limit on open cursors actually is?
Have you worked out whether you've opened enough cursors to actually exceed that limit?
Have you tried running with DBI_TRACE set in the environment? A value such as 3 will tell you a fair amount about what is going on - maybe too much. It would show whether cursors are being released properly or not.
Have you tried reducing the number of tables manipulated in a single session?
Have you considered disconnecting and reconnecting between manipulating tables?
Is there a way to get to the statement handle corresponding to the Class::DBI abstractions, so that you can in fact execute $sth->finish()?
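If db_Main() really does return the underlying DBI database handle, as the workaround in the question suggests, one possibility would be to walk that handle's cached statement handles and finish them explicitly (a sketch, not a tested recipe for this particular stack):

# Finish every statement handle that DBI has cached for this handle.
my $dbh = __PACKAGE__->db_Main();
if (my $cached = $dbh->{CachedKids}) {
    $_->finish for values %$cached;
}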