Loading a serialized variable in Perl

I have a file in which I keep a serialized Perl hash stored. In my current script, I load the values like this:
use Storable qw(retrieve);    # retrieve() comes from the Storable module

my $arrayref = retrieve("mySerializedFile");
my $a = $arrayref->[0];
my $b = $arrayref->[1];
my $c = $arrayref->[2];
My problem is that the file is about 1 GB, so it takes about ten seconds to load and then one more second to perform some operations. I would like to reduce the retrieve time.
Is there any way of having this data loaded before the script executes? I mean, mySerializedFile is not supposed to change for a long time, so if I could have it always loaded on the system, that would be nice and would improve my execution time from 11 seconds to 1.

Following the suggestions in the comments, I used a DB engine, which improved the execution time a lot; it is about 5 seconds now.
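As one illustration of that approach (the update above does not say which engine was used; DB_File is just one option, and the file name and keys here are made up, assuming the array elements were re-saved once into a Berkeley DB file keyed by index), a tied DB hash only reads the keys you actually touch, so nothing close to 1 GB is loaded at start-up:

use strict;
use warnings;
use DB_File;
use Fcntl qw(O_RDONLY);

# Tie a read-only view of a pre-built Berkeley DB file; nothing is loaded eagerly.
tie my %data, 'DB_File', 'mySerializedFile.db', O_RDONLY, 0644, $DB_HASH
    or die "Cannot open mySerializedFile.db: $!";

# Only the requested keys are fetched from disk.
my $a = $data{0};
my $b = $data{1};
my $c = $data{2};

untie %data;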

Why does a program with Parallel::Loops exhaust my memory?

I've inherited some code at work that I'm trying to improve on. My Perl skills are somewhat lacking, so I would love some assistance!
Essentially this script is SNMP-polling a network of thousands of nodes to update its local interface index cache. I've found it's hitting a problem where it exhausts its memory and fails. Code as follows (heavily reduced, but I think you'll get the gist):
use strict;
use warnings;
use Parallel::Loops;

my $maxProcs = 50;
my @exceptions;
my @devices;
my %snmp_results;

my $pl = Parallel::Loops->new($maxProcs);
$pl->share(\%snmp_results, \@exceptions);

load_devices();
get_snmp_interfaces();

sub get_snmp_interfaces {
    $pl->foreach( \@devices, sub {
        my ($name, $community, $snmp_ver) = @$_;
        # Create the new ifindex cache, and return an array reference to the new entries
        my $result = getSNMPIFFull($name, $community, $snmp_ver);
        if (defined $result && $result ne "") {
            my %cache = %{$result};
            print "Got cache for $name\n";
            # Build hash of all the links polled through SNMP
            # [ifindex, ifdesc, ifalias, ifspeed, ip]
            for my $link (keys %cache) {
                $snmp_results{$name}{$cache{$link}[0]} =
                    [ $cache{$link}[0], $cache{$link}[1], $cache{$link}[2], $cache{$link}[3], $cache{$link}[4] ];
            }
        }
        else {
            push(@exceptions, "Unable to poll $name - $community - $snmp_ver");
        }
    });
}
This particular VM has 3.1 GB of RAM allocatable and idles at about 83 MB usage when this script is not running. If I drop maxProcs down to 25, it finishes fine, but this script can already take a long time given the sheer number of devices plus latency, so I would rather keep the parallelism high!
I have a feeling that $pl->share() is sharing the ever-expanding %snmp_results with each forked process, which is definitely not necessary since each child isn't reading or modifying other entries: it's just adding new ones. Is there a better way I could be doing this?
I'm also slightly unsure about my %cache = %{$result};. If this just creates an alias to the hash then cool, but if it's doing a copy, that's also a bit wasteful!
Any help will be greatly appreciated!
Documentation of the module can be found on CPAN.
There's one part talking about the performance:
Also, if each loop sub returns a massive amount of data, this needs to be communicated back to the parent process, and again that could outweigh parallel performance gains unless the loop body does some heavy work too.
You are probably moving around complete copies of the variables in memory, pushing the machine to its limit if the MIB to poll and the number of machines are big enough.
Since what you are doing is an I/O-intensive task, not a CPU-bound task that could benefit from parallel CPU processing, I would reconsider the approach of launching so many (50!) processes for polling.
Run the program with $maxProcs set down to 1-5 processes and see how it behaves. Do some profiling of your code, attaching Devel::NYTProf to check where you are spending time and whether increasing the number of processes actually leads to better performance.
Reconsider using Parallel::Loops for this task. You may get better performance with use threads and a hash shared between the different threads (use threads::shared).
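A rough sketch of that threads-based alternative, under the assumption that getSNMPIFFull() behaves as in the question and that load_devices() returns the device tuples (both names come from the question; the striping scheme here is just for illustration):

use strict;
use warnings;
use threads;
use threads::shared;

my %snmp_results :shared;                 # visible to every thread
my @devices   = load_devices();           # assumed to return [name, community, version] tuples
my $n_workers = 5;

my @workers;
for my $w (0 .. $n_workers - 1) {
    push @workers, threads->create(sub {
        # Each worker handles every $n_workers-th device (simple striping).
        for (my $i = $w; $i < @devices; $i += $n_workers) {
            my ($name, $community, $snmp_ver) = @{ $devices[$i] };
            my $result = getSNMPIFFull($name, $community, $snmp_ver) or next;
            # shared_clone() deep-copies the nested structure into shared memory,
            # so the parent and the other threads can see it after join().
            $snmp_results{$name} = shared_clone($result);
        }
    });
}
$_->join for @workers;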
Apologies if this should have been a comment; starting out on SO is difficult due to all the limitations that are in place :(
If you have already found a solution, it would be great if you could share your findings with us. I didn't know Parallel::Loops before, and I think I can put it to some use.

PowerShell downloads a page in 0 milliseconds

I have a strange problem with my script in PowerShell: I want to measure the average time it takes to download a page, so I wrote a script that runs frequently. But sometimes my script returns a result of 0, which would mean it downloaded the site in 0 ms. If I modify the script to save the whole page to a file whenever the download time is about 0 ms, it doesn't save anything. I'm wondering whether I'm doing something wrong, or whether the PowerShell functions just aren't accurate enough to measure such "small" times.
P.S. Other "good" results are about 4-9 ms.
Here is the part of my script responsible for measuring the download time:
$Request = New-Object System.Net.WebClient    # the WebClient doing the download
$StartTime = Get-Date
$PageDownload = $Request.DownloadString("http://mypage.com")
$TimeTaken = ((Get-Date) - $StartTime).TotalMilliseconds
Get-Date should be as precise as the system clock is.
There could be web caching going on. Unfortunately, disabling caching for WebClient is not possible, from what I see elsewhere. The "do it right" method is to construct your own HTTP request with the TcpClient class, but that's also pretty complex.
One easy way to make sure you're not being served from a cache is to append an arbitrary value as a GET parameter. It's a hack, but it is often enough to fool a cache. So, instead of:
"http://mypage.com"
You use:
"http://mypage.com?someUnusedValueName=$([System.Environment]::TickCount)"

Parallel execution of a Perl program

I have a Perl program that reads the packets of a flow from a pcap file, but it takes a lot of time. I want to make it parallel, but I don't know whether that is possible. If it is, can I do it with MPI? And another question: what is the best way to parallelize this code? Here is the relevant piece of it (I think this is the part I should work on for parallelizing, but I don't know the best way!):
while (!eof($inFileH))
{
    # $inFileH is the handle of the pcap file;
    # each iteration of the while loop reads one packet
    $ts_sec   = readBytes($inFileH, 4);
    $ts_usec  = readBytes($inFileH, 4);
    $incl_len = readBytes($inFileH, 4);
    $orig_len = readBytes($inFileH, 4);

    if ($totalLen == 0)    # it is the 1st packet
    {
        $startTime = $ts_sec + $ts_usec/1000000;
    }
    $timeStamp = $ts_sec + $ts_usec/1000000 - $startTime;
    $totalLen += $orig_len;

    $#packet = -1;    # empty the @packet array
    for (my $i = 0; $i < $incl_len; $i++)    # read all included octets of the current packet
    {
        read $inFileH, $packet[$i], 1;
        $packet[$i] = ord($packet[$i]);
    }
    # ...and after that I work on @packet and analyze it
}
So, how should I send the file contents to the other processes so they can work on it in parallel?
First you need to determine the bottleneck. If it is really CPU usage (i.e. CPU usage is at 100% while you are running the script), you need to figure out where the processing spends its time.
This may well be in the way that you are parsing the input. There may be obvious ways to speed this up. For instance, if you use complex regular expressions, and focus exclusively on matching input correctly, there may be ways to make the matching a lot faster by rewriting the expressions or doing simpler matches before trying more complex ones.
If you can't reduce CPU usage far enough in this way, and you really want to parallelize, see if you can employ the mechanism with which Perl was born: Unix pipes. You can write Perl scripts that pass data through to each other in a pipeline, or you can do the creation of the processes and pipes within Perl itself (see perlopentut, and if that isn't enough, perlipc).
As a general rule, I would consider these options first before trying other mechanisms, but it really depends on the details of what you're trying to do and the context in which you need to do it.
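If the pipe route sounds attractive, here is a minimal sketch of the consuming side, assuming a separate reader script (read_packets.pl is a made-up name) that wraps the reading loop from the question and prints one tab-separated record per packet:

use strict;
use warnings;

# Start the reader as a child process and read its output through a pipe,
# so decoding and analysis run in parallel.
open my $pipe, '-|', $^X, 'read_packets.pl', 'capture.pcap'
    or die "Cannot start reader: $!";

while (my $record = <$pipe>) {
    chomp $record;
    my ($timestamp, $orig_len, $hex_payload) = split /\t/, $record;
    # ...analyze one packet here while the reader decodes the next one...
}

close $pipe or warn "Reader exited with status $?";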

Long data load time in Matlab

I have four variables, each saved in 365 mat-files (size: 8 x 92 x 240). I try to load these into my function in a for-loop day=1:365, one variable file per day. However, the first two variables always take abnormally long to load. My code for loading looks like this:
load([eraFolder sprintf('Y%dD%d-tempSD.mat',year,day)], 'tempSD'); % took 5420 s to load
load([eraFolder sprintf('Y%dD%d-tempDewSD.mat',year,day)], 'tempDewSD')
load([eraFolder sprintf('Y%dD%d-eEraSD.mat',year,day)], 'eEraSD'); % took 6 seconds to load
load([eraFolder sprintf('Y%dD%d-pEraSD.mat',year,day)], 'pEraSD');
Using the Profiler, I could see that the first two variables took 5420 seconds to load over 365 calls, whereas the last two variables took 6 and 4 seconds respectively over 365 calls. When I swap the order in which the variables are loaded, e.g. eEraSD before tempSD, it is still the first two loads that take more time.
When using tic and toc to track the time spent on loading, it appears that the time to load the first or second variable increases exponentially with the number of calls (with the last calls taking 50 seconds each). For the third and fourth variables, the loading time stays around 0.02-0.04 seconds per file, more or less independent of how far into the for-loop I have gone. See figures below.
When using importdata instead of load, the first line takes about 8000 seconds over 365 calls (with the loading time increasing exponentially, as shown for T in the second figure). The other lines then take about 10 seconds over 365 calls.
I can't understand why it behaves this way or what I can do to decrease the loading time. I would greatly appreciate any ideas for a possible solution.
I suppose your data sets are in the same directory (networked or local) and have the same attributes, e.g. access properties and so on.
Then the only option left is the characteristics of the variables stored in those mat-files. Can you check how big those variables are, e.g. by loading a sample one? This should narrow the problem down.
Hope that helps.
FS
Thanks for your help. I finally found out what caused the problem. In a 'for' loop later in the script, I saved other data to a folder I called temp. After renaming that folder to something else (e.g. temporary), the data loading problem disappeared.
(Doesn't matter so much now that the practical problem is solved, but I can't really say I understand why there was this peculiar relationship between the later 'save' call and this 'importdata' or 'load' call.)
Please see my new question about the temp folder.

How can I validate an image file in Perl?

How would I validate that a JPG file is a valid image file? We have files written to a directory via FTP, but we seem to be picking up a file before it has finished being written, creating invalid images. I need to be able to identify when a file is no longer being written to. Any ideas?
Easiest way might just be to write the file to a temporary directory and then move it to the real directory after the write is finished.
Or you could check the module's documentation for its Error method:
JPEG::Error
[arguments: none] If the file reference remains undefined after a call to new, the file is to be considered not parseable by this module, and one should issue some error message and go to another file. An error message explaining the reason for the failure can be retrieved with the Error method.
EDIT:
Image::TestJPG might be even better.
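If you'd rather not depend on a module at all, here is a crude heuristic in the same spirit (my own sketch, not what the modules above do): a complete JPEG starts with the SOI marker FF D8 and ends with the EOI marker FF D9, so a still-uploading file usually fails the second check. It can misjudge files with trailing data after EOI, so treat it as a hint only.

use strict;
use warnings;
use Fcntl qw(SEEK_END);

sub looks_like_complete_jpeg {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return 0;
    read($fh, my $head, 2) == 2 or return 0;   # SOI marker at the start
    seek($fh, -2, SEEK_END)     or return 0;
    read($fh, my $tail, 2) == 2 or return 0;   # EOI marker at the end
    return $head eq "\xFF\xD8" && $tail eq "\xFF\xD9";
}

print "looks complete\n" if looks_like_complete_jpeg('upload.jpg');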
You're solving the wrong problem, I think.
What you should be doing is figuring out how to tell when whatever FTPd you're using is done writing the file - that way when you come to have the same problem for (say) GIFs, DOCs or MPEGs, you don't have to fix it again.
Precisely how you do that depends rather crucially on which FTPd you're running and on which OS. Some do, I believe, have hooks you can set to trigger when an upload's done.
If you can run your own FTPd, Net::FTPServer or POE::Component::Server::FTP are customizable to do the right thing.
In the absence of that:
1) try tailing the logs with a Perl script that looks for 'upload complete' messages
2) use something like lsof or fuser to check whether anything is locking a file before you try to copy it (a rough sketch follows below).
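A sketch of option 2, assuming a Linux-style fuser(1) is available (the incoming path here is made up):

use strict;
use warnings;

my $path = '/var/ftp/incoming/upload.jpg';   # made-up path

# fuser -s exits 0 when some process still has the file open (still uploading)
# and non-zero when nobody does.
my $still_open = system('fuser', '-s', $path) == 0;

unless ($still_open) {
    # safe to move / process the file here
}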
Again looking at the FTP issue rather than the JPG issue.
I check the timestamp on the file to make sure it hasn't been modified in the last X (5) minutes - that way I can be reasonably sure it has finished uploading.
# time in seconds that the file was last modified
my $last_modified = (stat("$path/$file"))[9];

# current time in seconds since the epoch (i.e. 1970)
my $epoch_time = time();

# ensure the file has not been modified during the last 5 mins, i.e. is not still uploading
unless ( $last_modified >= ($epoch_time - 300) ) {
    # move / edit or whatever
}
I had something similar come up once, more or less what I did was:
my $old_size = 0;
my $current_size;
# keep polling until the file size stops changing between checks
while (($current_size = -s $image_file) != $old_size) {
    $old_size = $current_size;
    sleep 10;
}
processImage($image_file);
Have the FTP process set the readonly flag, then only work with files that have the readonly flag set.
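A minimal sketch of that idea, assuming the FTP process really does clear the write bit once an upload finishes (the directory name and process_image() handler are made up):

use strict;
use warnings;

my $incoming = '/var/ftp/incoming';          # made-up incoming directory

opendir my $dh, $incoming or die "Cannot open $incoming: $!";
for my $file (readdir $dh) {
    my $path = "$incoming/$file";
    next unless -f $path;
    next if -w $path;        # still writable => assume it is still being uploaded
    process_image($path);    # hypothetical handler for finished files
}
closedir $dh;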