Is it normal that two different versions of perl produce different results? - eclipse

I am trying to perform some stack analysis on an MCU following the steps described here. The site links then to a Perl script that I launch as a post-build operation by meanings of a simple batch file.
The IDEA based on Eclipse uses the Perl executable at the path:
C:\..\S32DS_ARM_v2018.R1\utils\msys32\usr\bin\perl.exe
perl.exe -v gives:
This is perl 5, version 22, subversion 1 (v5.22.1) built for i686-msys-thread-multi-64int
The OS (windows) has a perl installation at
C:\Perl64\bin\perl.exe
perl.exe -v gives:
This is perl 5, version 24, subversion 3 (v5.24.3) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail)
I can confirm that avstak.pl (the perl script I am referring some lines above) produces different results with the former or the latter.
WHY this happens, is out of my area of expertise at the moment.
What I would like to understand is
Understand why this is happening;
Understand which perl provides the right outputs (pretty sure I suppose the 5.24.3 is the correct one);
Learning how to prevent this issue if I am going to use perl in future.
Thanks and best regards,
L.
Edit: the outcome of the script with the two different perl versions (reduced output for readability):
This one is result_5.22.1
Func Cost Frame Height
------------------------------------------------------------------------
> I2C_MasterGetTransferStatus 292 292 1
> FLEXIO_I2C_DRV_MasterStartTransfer 236 236 1
> CLOCK_DRV_Init 172 172 1
> CLOCK_SYS_SetConfiguration 172 172 1
> EDMA_DRV_ConfigScatterGatherTransfer 132 132 1
> CLOCK_SYS_SetSystemClockConfig 76 76 1
> FLEXIO_I2C_DRV_MasterInit 60 60 1
> EDMA_DRV_ConfigSingleBlockTransfer 60 60 1
> main 52 52 1
> LPI2C_DRV_MasterSetBaudRate 52 52 1
> LPI2C_DRV_MasterStartDmaTransfer 52 52 1
> FLEXIO_DRV_InitDriver 52 52 1
> I2C_MasterInit 44 44 1
> LPI2C_DRV_SlaveStartDmaTransfer 44 44 1
> CLOCK_SYS_UpdateConfiguration 44 44 1
> CLOCK_DRV_SetClockSource 44 44 1
> LPI2C_DRV_SlaveInit 44 44 1
> EDMA_DRV_Init 44 44 1
> EDMA_DRV_Deinit 36 36 1
> CLOCK_SYS_ConfigureSOSC 36 36 1
> CLOCK_SYS_ConfigureFIRC 36 36 1
vs
result_5.24.3
Func Cost Frame Height
------------------------------------------------------------------------
> main 536 52 9
I2C_MasterSendDataBlocking 484 28 8
> I2C_MasterReceiveDataBlocking 484 28 8
> I2C_MasterReceiveData 468 20 7
> I2C_MasterSendData 468 20 7
FLEXIO_I2C_DRV_MasterReceiveDataBlocking 456 28 7
FLEXIO_I2C_DRV_MasterSendDataBlocking 456 28 7
FLEXIO_I2C_DRV_MasterSendData 448 20 5
FLEXIO_I2C_DRV_MasterReceiveData 448 20 5
FLEXIO_I2C_DRV_MasterStartTransfer 428 236 4
> I2C_MasterGetTransferStatus 408 292 6
CLOCK_SYS_UpdateConfiguration 336 44 6
CLOCK_SYS_SetConfiguration 292 172 5
> CLOCK_DRV_Init 292 172 5
LPI2C_DRV_MasterReceiveDataBlocking 256 20 7
> I2C_SlaveReceiveDataBlocking 252 12 8
> I2C_SlaveSendDataBlocking 252 12 8
As you can see the hight number in the first version doesn't increase (and it should).
Cost and frame suffer the same issue I suppose.
the script is here:
#!/usr/bin/perl -w
# avstack.pl: AVR stack checker
# Copyright (C) 2013 Daniel Beer <dlbeer#gmail.com>
#
# Permission to use, copy, modify, and/or distribute this software for
# any purpose with or without fee is hereby granted, provided that the
# above copyright notice and this permission notice appear in all
# copies.
#
# THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL
# WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED
# WARRANTIES OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL THE
# AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL
# DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR
# PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
# TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
# PERFORMANCE OF THIS SOFTWARE.
#
# Usage
# -----
#
# This script requires that you compile your code with -fstack-usage.
# This results in GCC generating a .su file for each .o file. Once you
# have these, do:
#
# ./avstack.pl <object files>
#
# This will disassemble .o files to construct a call graph, and read
# frame size information from .su. The call graph is traced to find, for
# each function:
#
# - Call height: the maximum call height of any callee, plus 1
# (defined to be 1 for any function which has no callees).
#
# - Inherited frame: the maximum *inherited* frame of any callee, plus
# the GCC-calculated frame size of the function in question.
#
# Using these two pieces of information, we calculate a cost (estimated
# peak stack usage) for calling the function. Functions are then listed
# on stdout in decreasing order of cost.
#
# Functions which are recursive are marked with an 'R' to the left of
# them. Their cost is calculated for a single level of recursion.
#
# The peak stack usage of your entire program can usually be estimated
# as the stack cost of "main", plus the maximum stack cost of any
# interrupt handler which might execute.
use strict;
# Configuration: set these as appropriate for your architecture/project.
my $objdump = "arm-none-eabi-objdump";
my $call_cost = 4;
# First, we need to read all object and corresponding .su files. We're
# gathering a mapping of functions to callees and functions to frame
# sizes. We're just parsing at this stage -- callee name resolution
# comes later.
my %frame_size; # "func#file" -> size
my %call_graph; # "func#file" -> {callees}
my %addresses; # "addr#file" -> "func#file"
my %global_name; # "func" -> "func#file"
my %ambiguous; # "func" -> 1
foreach (#ARGV) {
# Disassemble this object file to obtain a callees. Sources in the
# call graph are named "func#file". Targets in the call graph are
# named either "offset#file" or "funcname". We also keep a list of
# the addresses and names of each function we encounter.
my $objfile = $_;
my $source;
open(DISASSEMBLY, "$objdump -dr $objfile|") ||
die "Can't disassemble $objfile";
while (<DISASSEMBLY>) {
chomp;
if (/^([0-9a-fA-F]+) <(.*)>:/) {
my $a = $1;
my $name = $2;
$source = "$name\#$objfile";
$call_graph{$source} = {};
$ambiguous{$name} = 1 if defined($global_name{$name});
$global_name{$name} = "$name\#$objfile";
$a =~ s/^0*//;
$addresses{"$a\#$objfile"} = "$name\#$objfile";
}
if (/: R_[A-Za-z0-9_]+_CALL[ \t]+(.*)/) {
my $t = $1;
if ($t eq ".text") {
$t = "\#$objfile";
} elsif ($t =~ /^\.text\+0x(.*)$/) {
$t = "$1\#$objfile";
}
$call_graph{$source}->{$t} = 1;
}
}
close(DISASSEMBLY);
# Extract frame sizes from the corresponding .su file.
if ($objfile =~ /^(.*).o$/) {
my $sufile = "$1.su";
open(SUFILE, "<$sufile") || die "Can't open $sufile";
while (<SUFILE>) {
$frame_size{"$1\#$objfile"} = $2 + $call_cost
if /^.*:([^\t ]+)[ \t]+([0-9]+)/;
}
close(SUFILE);
}
}
# In this step, we enumerate each list of callees in the call graph and
# try to resolve the symbols. We omit ones we can't resolve, but keep a
# set of them anyway.
my %unresolved;
foreach (keys %call_graph) {
my $from = $_;
my $callees = $call_graph{$from};
my %resolved;
foreach (keys %$callees) {
my $t = $_;
if (defined($addresses{$t})) {
$resolved{$addresses{$t}} = 1;
} elsif (defined($global_name{$t})) {
$resolved{$global_name{$t}} = 1;
warn "Ambiguous resolution: $t" if defined ($ambiguous{$t});
} elsif (defined($call_graph{$t})) {
$resolved{$t} = 1;
} else {
$unresolved{$t} = 1;
}
}
$call_graph{$from} = \%resolved;
}
# Create fake edges and nodes to account for dynamic behaviour.
$call_graph{"INTERRUPT"} = {};
foreach (keys %call_graph) {
$call_graph{"INTERRUPT"}->{$_} = 1 if /^__vector_/;
}
# Trace the call graph and calculate, for each function:
#
# - inherited frames: maximum inherited frame of callees, plus own
# frame size.
# - height: maximum height of callees, plus one.
# - recursion: is the function called recursively (including indirect
# recursion)?
my %has_caller;
my %visited;
my %total_cost;
my %call_depth;
sub trace {
my $f = shift;
if ($visited{$f}) {
$visited{$f} = "R" if $visited{$f} eq "?";
return;
}
$visited{$f} = "?";
my $max_depth = 0;
my $max_frame = 0;
my $targets = $call_graph{$f} || die "Unknown function: $f";
if (defined($targets)) {
foreach (keys %$targets) {
my $t = $_;
$has_caller{$t} = 1;
trace($t);
my $is = $total_cost{$t};
my $d = $call_depth{$t};
$max_frame = $is if $is > $max_frame;
$max_depth = $d if $d > $max_depth;
}
}
$call_depth{$f} = $max_depth + 1;
$total_cost{$f} = $max_frame + ($frame_size{$f} || 0);
$visited{$f} = " " if $visited{$f} eq "?";
}
foreach (keys %call_graph) { trace $_; }
# Now, print results in a nice table.
printf " %-30s %8s %8s %8s\n",
"Func", "Cost", "Frame", "Height";
print "------------------------------------";
print "------------------------------------\n";
my $max_iv = 0;
my $main = 0;
foreach (sort { $total_cost{$b} <=> $total_cost{$a} } keys %visited) {
my $name = $_;
if (/^(.*)#(.*)$/) {
$name = $1 unless $ambiguous{$name};
}
my $tag = $visited{$_};
my $cost = $total_cost{$_};
$name = $_ if $ambiguous{$name};
$tag = ">" unless $has_caller{$_};
if (/^__vector_/) {
$max_iv = $cost if $cost > $max_iv;
} elsif (/^main#/) {
$main = $cost;
}
if ($ambiguous{$name}) { $name = $_; }
printf "%s %-30s %8d %8d %8d\n", $tag, $name, $cost,
$frame_size{$_} || 0, $call_depth{$_};
}
print "\n";
print "Peak execution estimate (main + worst-case IV):\n";
printf " main = %d, worst IV = %d, total = %d\n",
$total_cost{$global_name{"main"}},
$total_cost{"INTERRUPT"},
$total_cost{$global_name{"main"}} + $total_cost{"INTERRUPT"};
print "\n";
print "The following functions were not resolved:\n";
foreach (keys %unresolved) { print " $_\n"; }
Edit2:
As Amon suggested to check, subsequent iterations of the script on the same dataset doesn't produce the same output. Values (cost/frame/height) are always the same but the order in which the functions are reported is different.

Related

horizontal absolute values of every line

I am trying to calculate the absolute values of line 2 - values of line 1
and then the horizontal absolute values of every line in my input file. Here's a part of that input.
43 402 51 360
63 60 69 63
65 53 89 55
103 138 135 135
109 36 123 38
To be more precise about what im trying to do I made the following example
initial data
0 2 0 0
0 1 1 1
next stage (absolute value after subscription the second line minus the first line)
2 2 0
1 0 0
final stage (horizontal application of abs values until one column remained)
0
1
The below code was a failed attempt to obtain the final stage of the single column. My problem here is that I don't know how to obtain the final (desired) stage by using subroutine, as I believe that it is a better way to solving my problem. Of course, every idea or better approach is welcome.
#!/usr/bin/perl
use feature qw(say);
use strict;
use warnings;
use Data::Dumper;
my #rows = 'table_only_numbers';
open(my $fh, '<:encoding(UTF-8)', $rows)
sub ori {
for ($num_cols=#{ $rows[$r-1]}; $num_cols=1; $num_cols-- ){
my #diff_diffs = map { abs($diffs[$_-1] - $diffs[$_]) } 1..$num_cols-1;
#final=#diff_diffs;
say join ' ',#final;
return (final) }
my $num_cols = #{ $rows[0] };
for my $r (1..$#rows) {
die "Bad format!" if #{ $rows[$r] } != $num_cols;
my #diffs = map { abs($rows[$r-1][$_] - $rows[$r][$_]) } 0..$num_cols-1;
while ($num_cols>1)
{
$final_output = ori(#{ $rows[0] })
say "final_output";
}
}
close $fh;
Finally, I figure it by myself without subroutines!!! Im posting it in case someone face the same issue in the future.I know that it is an easiest way to do it but as I am newbie in Perl it is the easiest way for me.
So I used:
for the first abs of the line 2 minus the line 1
my #data = map { abs($current[$_]-$previous[$_]) } 0..$#current;
push #final, \#data;
To obtain the absolute value of row 2 minus the row 1
And after I used 3 times as I had 3 columns left (in my case) the following coding line and each time I Substituted the #xxx with a new variable. and I have desired output of I column.
foreach my $row (#XXX) {
my #data = map { abs(#{$row}[$_]-#{$row}[$_+1]) } 0..$#{$row}-1;
say join ' ', #data;
push #XXX, \#data;}

Perl variable not assigned in foreach: scope issues

I am trying to normalize some scores from a .txt file by dividing each score for each possible sense (eg. take#v#2; referred to as $tokpossense in my code) by the sum of all scores for a wordtype (e.g. take#v; referred to as $tokpos). The difficulty is in grouping the wordtypes together when processing each line of the so that the normalized scores are printed upon finding a new wordtype/$tokpos. I used two hashes and an if block to achieve this.
Currently, the problem seems to be that $tokpos is undefined as a key in SumHash{$tokpos} at line 20 resulting in a division by zero. However, I believe $tokpos is properly defined within the scope of this block. What is the problem exactly and how would I best solve it? I would also gladly hear alternative approaches to this problem.
Here's an example inputfile:
i#CL take#v#17 my#CL checks#n#1 to#CL the#CL bank#n#2 .#IT
Context: i#CL <target>take#v</target> my#CL checks#n to#CL the#CL bank#n
Scores for take#v
take#v#1: 17
take#v#10: 158
take#v#17: 174
Winning score: 174
Context: i#CL take#v my#CL <target>checks#n</target> to#CL the#CL bank#n .#IT
Scores for checks#n
check#n#1: 198
check#n#2: 117
check#n#3: 42
Winning score: 198
Context: take#v my#CL checks#n to#CL the#CL <target>bank#n</target> .#IT
Scores for bank#n
bank#n#1: 81
bank#n#2: 202
bank#n#3: 68
bank#n#4: 37
Winning score: 202
My erroneous Code:
#files = #ARGV;
foreach $file(#files){
open(IN, $file);
#lines=<IN>;
foreach (#lines){
chomp;
#store tokpossense (eg. "take#v#1") and rawscore (eg. 4)
if (($tokpossense,$rawscore)= /^\s{4}(.+): (\d+)/) {
#split tokpossense for recombination
($tok,$pos,$sensenr)=split(/#/,$tokpossense);
#tokpos (eg. take#v) will be a unique identifier when calculating normalized score
$tokpos="$tok\#$pos";
#block for when new tokpos(word) is found in inputfile
if (defined($prevtokpos) and
($tokpos ne $prevtokpos)) {
# normalize hash: THE PROBLEM LIES IN $SumHash{$tokpos} which is returned as zero > WHY?
foreach (keys %ScoreHash) {
$normscore=$ScoreHash{$_}/$SumHash{$tokpos};
#print the results to a file
print "$_\t$ScoreHash{$_}\t$normscore\n";
}
#empty hashes
undef %ScoreHash;
undef %SumHash;
}
#prevtokpos is assigned to tokpos for condition above
$prevtokpos = $tokpos;
#store the sum of scores for a tokpos identifier for normalization
$SumHash{$tokpos}+=$rawscore;
#store the scores for a tokpossense identifier for normalization
$ScoreHash{$tokpossense}=$rawscore;
}
#skip the irrelevant lines of inputfile
else {next;}
}
}
Extra info: I am doing Word Sense Disambiguation using Pedersen's Wordnet WSD tool which uses Wordnet::Similarity::AllWords. The output file is generated by this package and the found scores have to be normalized for implementation in our toolset.
You don't assign anything to $tokpos. The assignment is part of a comment - syntax highlighting in your editor should've told you. strict would've told you, too.
Also, you should probably use $prevtokpos in the division: $tokpos is the new value that you haven't met before. To get the output for the last token, you have to process it outside the loop, as there's no $tokpos to replace it. To avoid code repetition, use a subroutine to do that:
#!/usr/bin/perl
use warnings;
use strict;
my %SumHash;
my %ScoreHash;
sub output {
my $token = shift;
for (keys %ScoreHash) {
my $normscore = $ScoreHash{$_} / $SumHash{$token};
print "$_\t$ScoreHash{$_}\t$normscore\n";
}
undef %ScoreHash;
undef %SumHash;
}
my $prevtokpos;
while (<DATA>){
chomp;
if (my ($tokpossense,$rawscore) = /^\s{4}(.+): (\d+)/) {
my ($tok, $pos, $sensenr) = split /#/, $tokpossense;
my $tokpos = "$tok\#$pos";
if (defined $prevtokpos && $tokpos ne $prevtokpos) {
output($prevtokpos);
}
$prevtokpos = $tokpos;
$SumHash{$tokpos} += $rawscore;
$ScoreHash{$tokpossense} = $rawscore;
}
}
output($prevtokpos);
__DATA__
i#CL take#v#17 my#CL checks#n#1 to#CL the#CL bank#n#2 .#IT
Context: i#CL <target>take#v</target> my#CL checks#n to#CL the#CL bank#n
Scores for take#v
take#v#1: 17
take#v#10: 158
take#v#17: 174
Winning score: 174
Context: i#CL take#v my#CL <target>checks#n</target> to#CL the#CL bank#n .#IT
Scores for checks#n
check#n#1: 198
check#n#2: 117
check#n#3: 42
Winning score: 198
Context: take#v my#CL checks#n to#CL the#CL <target>bank#n</target> .#IT
Scores for bank#n
bank#n#1: 81
bank#n#2: 202
bank#n#3: 68
bank#n#4: 37
Winning score: 202
You're confusing yourself by trying to print the results as soon as $tokpos changes. For one thing it's the values for $prevtokpos that are complete, but your trying to output the data for $tokpos; and also you're never going to display the last block of data because you require a change in $tokpos to trigger the output.
It's far easier to accumulate all the data for a given file and then print it when the end of file is reached. This program works by keeping the three values
$tokpos, $sense, and $rawscore for each line of the output in array #results, together with the total score for each value of $tokpos in %totals. Then it's simply a matter of dumping the contents of #results with an extra column that divides each value by the corresponding total.
use strict;
use warnings;
use 5.014; # For non-destructive substitution
for my $file ( #ARGV ) {
open my $fh, '<', $file or die $!;
my (#results, %totals);
while ( <$fh> ) {
chomp;
next unless my ($tokpos, $sense, $rawscore) = / ^ \s{4} ( [^#]+ \# [^#]+ ) \# (\d+) : \s+ (\d+) /x;
push #results, [ $tokpos, $sense, $rawscore ];
$totals{$tokpos} += $rawscore;
}
print "** $file **\n";
for my $item ( #results ) {
my ($tokpos, $sense, $rawscore) = #$item;
printf "%s\t%s\t%6.4f\n", $tokpos.$sense, $rawscore, $rawscore / $totals{$tokpos};
}
print "\n";
}
output
** tokpos.txt **
take#v#1 17 0.0487
take#v#10 158 0.4527
take#v#17 174 0.4986
check#n#1 198 0.5546
check#n#2 117 0.3277
check#n#3 42 0.1176
bank#n#1 81 0.2088
bank#n#2 202 0.5206
bank#n#3 68 0.1753
bank#n#4 37 0.0954

some help on the following perl script

Need help in merging/concatenating /combining /binding etc
I have several ascii files each defining one variable which I have converted to a single column array
I have such columnised data for many variables ,so I need to perform a column bind like R does and make it one single file.
I can do the same in R but there are too many files. Being able to do it with one single code will help save a lot of time.
Using the following code ,new to perl and need help with this.
#filenames = ("file1.txt","file2.txt");
open F2, ">file_combined.txt" or die;
for($j = 0; $j< scalar #filenames;$j++){
open F1, $filenames[$j] or die;
for($i=1;$i<=6;$i++){$line=<F1>;}
while($line=<F1>){
chomp $line;
#spl = split '\s+', $line;
for($i=0;$i<scalar #spl;$i++){
print F2 "$spl[$i]\n";
paste "file_bio1.txt","file_bio2.txt"> file_combined.txt;
}
}
close F1;
}
Input files here are Ascii text files of a raster.They look like this
32 12 34 21 32 21 22 23
12 21 32 43 21 32 21 12
The above mentioned code without the paste syntax converts these files into a single column
32
12
34
21
32
21
22
23
12
21
32
43
21
32
21
12
The output should look like this
12 21 32
32 23 23
32 21 32
12 34 12
43 32 32
32 23 23
32 34 21
21 32 23
Each column represents a different ascii file.
I need around 15 such ascii files into one dataframe.I can do the same in R but it consumes a lot of time as the number of files and regions of interest are too many and the files are a bit large too.
Let's step through what you have...
# files you want to open for reading..
#filenames = ("file1.txt","file2.txt");
# I would use the 3 arg lexical scoped open
# I think you want to open this for 'append' as well
# open($fh, ">>", "file_combined.txt") or die "cannot open";
open F2, ">file_combined.txt" or die;
# #filenames is best thought as a 'list'
# for my $file (#filenames) {
for($j = 0; $j< scalar #filenames;$j++){
# see above example of 'open'
# - $filenames[$j] + $file
open F1, $filenames[$j] or die;
# what are you trying to do here? You're overriding
# $line in the next 'while loop'
for($i=1;$i<=6;$i++){$line=<F1>;}
# while(<$fh1>) {
while($line=<F1>){
chomp $line;
# #spl is short for split?
# give '#spl' list a meaningful name
#spl = split '\s+', $line;
# again, #spl is a list...
# for my $word (#spl) {
for($i=0;$i<scalar #spl;$i++){
# this whole block is a bit confusing.
# 'F2' is 'file_combined.txt'. Then you try and merge
# ( and overwrite the file) with the paste afterwards...
print F2 "$spl[$i]\n";
# is this a 'system call'?
# Missing 'backticks' or 'system'
paste "file_bio1.txt","file_bio2.txt"> file_combined.txt;
}
}
# close $fh1
close F1;
}
# I'm assuming there's a 'close F2' somewhere here..
It looks like you're trying to do this:
#filenames = ("file1.txt","file2.txt");
$oufile = "combined_text.txt";
`paste $filenames[0] $filenames[1] > $outfile`;

Understanding Segfault with perl-DBI

I had a script that was doing segfault and I'm not really comfortable the way I resolved it so I wanted to post the question here to understand a little bit more about the cause of it, and maybe a better solution.
Here's what my script does (removed some detailed code to leave the "core" of it):
# Here's a query I need to do every X seconds to monitor progress of other tasks
# This is apparently the key to my segfault problem
my $stmt = $dbh->prepare($query);
my $all_done = 0;
while(!$all_done) {
$self->debug("Waiting for $n blocker tasks to be finished");
# Execute query to pull the status of the tasks from DB
$stmt->execute();
my $pending = [];
while(my $hr = $stmt->fetchrow_hashref()) {
push #{$pending}, $hr->{'TASK_NAME'} if ($hr->{'STATUS'} ne 'COMPLETE');
}
if(scalar(#{$pending}) > 0) {
$all_done = 0;
sleep($sleep_gap);
}
else { $all_done = 1; }
}
Now, the script works well, most of the time. However, it goes to segfault when 3 or more instances of the script are running in parallel (same script, separate processes, not threads).
How did I solve it?
I solved it by doing the DBH::prepare call every time during the while(!$all_done) loop.
So, this code does NOT segfault, even with several processes running in parallel. I reproduced the error several times, consistently, and then I did the same with the new code. I'm positive that moving the statement inside the loop FIXED the problem.
Any ideas why this may be happening?
I'm using perl 5.8 and perl-DBI version 1.609.
Here's also the output of strace when the script segfaults:
read(5, "\1\7\0\0\6\0\0\0\0\0\20\27\234\312\272\221eG2;\33S\364\230\313\220\221Bxp\4\7"..., 2064) = 263
write(4, "\1\31\0\0\6\0\0\0\0\0\21i \1\1\0\0\0\2\0\0\0\3^!)\4\4\0\0\0\0"..., 281) = 281
read(4, "\0\177\0\0\6\0\0\0\0\0\v\5\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0 \10\6"..., 2064) = 127
write(2, "debug:Waiting for 1 blocker task"..., 49debug:Waiting for 1 blocker tasks to be finished
) = 49
write(5, "\0\252\0\0\6\0\0\0\0\0\3^\20p\200\0\0\2\0\0\0\0\0\0\0\0\1\r\0\0\0\0"..., 170) = 170
read(5, "\0\301\0\0\6\0\0\0\0\0\6\"\2\0\0\0#\0\0\0\0\0\0\0\0\0\0\0\7 ru"..., 2064) = 193
write(5, "\1]\0\0\6\0\0\0\0\0\3^\21)\200\0\0\0\0\0\0\1\234\0\0\0\1\r\0\0\0\0"..., 349) = 349
read(5, "\0y\0\0\6\0\0\0\0\0\10\6\0S\254b\f\0\t\0\0\1\0\0\0\1\0\0\0\0\0\0"..., 2064) = 121
write(5, "\0000\0\0\6\0\0\0\0\0\3h\22\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 48) = 48
read(5, "\0\26\0\0\6\0\0\0\0\0\10\2\0\0\0\t\5\0\0\0\21\0", 2064) = 22
time(NULL) = 1333827285
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({10, 0}, {10, 0}) = 0
time(NULL) = 1333827295
write(4, "\1\31\0\0\6\0\0\0\0\0\21i\"\1\1\0\0\0\3\0\0\0\3^#)\4\4\0\0\0\0"..., 281) = 281
read(4, "\0\177\0\0\6\0\0\0\0\0\v\5\2\0\0\0\1\0\0\0\0\0\0\0\0\0\0\0 \10\6"..., 2064) = 127
write(2, "debug:Waiting for 1 blocker task"..., 49debug:Waiting for 1 blocker tasks to be finished
) = 49
write(5, "\0)\0\0\6\0\0\0\0\0\21i\23\1\1\0\0\0\1\0\0\0\3N\24\2\0\0\0#\0\0"..., 41) = 41
read(5, "\1>\0\0\6\0\0\0\0\0\20\27\234\312\272\221eG2;\33S\364\230\313\220\221Bxp\4\7"..., 2064) = 318
--- SIGSEGV (Segmentation fault) # 0 (0) ---
+++ killed by SIGSEGV +++
[ Process PID=22767 runs in 32 bit mode. ]
I'd say you don't have much to worry about since SQLite statement prepare tends to be fast. I mesured that a simple SQL prepare via Perl DBI took only 20 microseconds (see results and code below). Considering your application you shouldn't notice any difference in performance.
As SQLite's classic locking allows only one writer and multiple readers at a time the issue you are facing can be related to locking or transaction handling. However SIGSEGV is never the expected behvior.
Results
Perl 5.014002
DBI 1.618
SQLite 3.7.9
time in s for 100000 prepares: 2.01810789108276
ys per prepare: 20.1810789108276
Code
use DBI;
use Time::HiRes qw ( time );
use strict;
use warnings;
my $dbc = DBI->connect (
'dbi:SQLite:dbname=/tmp/test.db',
undef, undef, { AutoCommit => 0, RaiseError => 1, PrintError => 1 }
) || die $DBI::errstr;
print "Perl $]\n";
print "DBI $DBI::VERSION\n";
print "SQLite $dbc->{sqlite_version}\n";
my $start = time();
my $n = 100_000;
foreach ( 1 .. $n ) {
my $stmt = $dbc->prepare( qq{
select count(*) from sec where sid is not null
} );
}
my $end = time();
print
"time in s for $n prepares: " .
( $end - $start ) .
"\n";
print
"ys per prepare: " .
( ( ( $end - $start ) * 1_000_000 ) / $n ) .
"\n";

Loading Big files into Hashes in Perl (BLAST tables)

I'm a perl beginner, please help me out with my query... I'm trying to extract information from a blast table (a snippet of what it looks like is below):
It's a standard blast table input... I basically want to extract any information on a list of reads (Look at my second script below , to get an idea of what I want to do).... Anyhow this is precisely what I've done in the second script:
INPUTS:
1) the blast table:
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
34.3 7.5 123828 GH8NFLV01A03QJ GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1 XP_002639994 Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
37.7 0.98 33114 GH8NFLV01A0H5C GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306
103 1e-020 65742 GH8NFLV01A0MXI GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475 1 ABZ08973 hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68
41.6 0.053 36083 GH8NFLV01A0QKX GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1 XP_766153 hypothetical protein [Theileria parva strain Muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60
45.4 0.003 78246 GH8NFLV01A0Z29 GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1 ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150
71.6 4e-011 97250 GH8NFLV01A14MR GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2) The reads list:
GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc
3) the output I want:
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
I want a subset of the information in the first list, given a list of read names I want to extract (that is found in the 4th column)
Instead of hashing the reads list (only?) I want to hash the blast table itself, and use the information in Column 4 (of the blast table)as the keys to extract the values of each key, even when that key may have more than one value(i.e: each read name might actually have more than one hit , or associated blast result in the table), keeping in mind, that the value includes the WHOLE row with that key(readname) in it.
My greplist.pl script does this, but is very very slow, I think , ( and correct me if i'm wrong) that by loading the whole table in a hash, that this should speed things up tremendously ...
Thank you for your help.
My scripts:
The Broken one (mambo5.pl)
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename=$ARGV[0];
open(OUT, "> $filename.bololom");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "bingo!";
my $output =$hash{$readName};
print OUT "$output\n";
}
else
{
print "it aint workin\n";
#print %hash;
}
}
}
close (LIST);
The Slow and quick cheat (that works) and is very slow (my blast tables can be about 400MB to 2GB large, I'm sure you can see why it's so slow)
#!/usr/bin/perl -w
##
# This script finds a list of names in a blast table and outputs the result in a new file
# name must exist and list must be correctly formatted
# will not output anything using a "normal" blast file, must be a table blast
# if you have the standard blast output use blast2table script
use strict;
my $filein=$ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db=$ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
#open the reads you want to grep
my $read;
my $line;
open(READSLIST,$filein);
while($line=<READSLIST>)
{
if ($line=~/^(.*)$/)
{
$read = $1;
print "$read\n";
system("grep \"$read\" $db >$read\_.out\n");
}
#system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");
I don't know how to define that 4th column as the key : maybe I could use the split function, but I've tried to find a way that does this for a table of more than 2 columns to no avail... Please help!
If there is an easy way out of this please let me know
Thanks !
I'd do the opposite i.e read the readslist file into a hash then walk thru the big blast file and print the desired lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# Read the readslist file into a hash
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading:$!";
my %readslist = map { chomp; $_ => 1 }<$fh>;
close $fh;
open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading:$!";
# loop on all the blastfile lines
while (<$fh_blast>) {
chomp;
# retrieve the key (4th column)
my ($key) = (split/\s+/)[3];
# print the line if the key exists in the hash
say $_ if exists $readslist{$key};
}
close $fh_blast;
I suggest you build an index to turn your blasts file temporarily into an indexed-sequential file. Read through it and build a hash of addresses within the file where every record for each key starts.
After that it is just a matter of seeking to the correct places in the file to pick up the records required. This will certainly be faster than most simple solutions, as it entails read the big file only once. This example code demonstrates.
use strict;
use warnings;
use Fcntl qw/SEEK_SET/;
my %index;
open my $blast, '<', 'blast.txt' or die $!;
until (eof $blast) {
my $place = tell $blast;
my $line = <$blast>;
my $key = (split ' ', $line, 5)[3];
push #{$index{$key}}, $place;
}
open my $reads, '<', 'reads.txt' or die $!;
while (<$reads>) {
next unless my ($key) = /(\S+)/;
next unless my $places = $index{$key};
foreach my $place (#$places) {
seek $blast, $place, SEEK_SET;
my $line = <$blast>;
print $line;
}
}
Voila, 2 ways of doing this, one with nothing to do with perl :
awk 'BEGIN {while ( i = getline < "reads_list") ar[$i] = $1;} {if ($4 in ar) print $0;}' blast_table > new_blast_table
Mambo6.pl
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames. HINT: Make sure your list file only has unique names , that way you save time.
use strict;
open (DATA,$ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST,$ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash;
my $val;
my $key;
while (<DATA>)
{
#chomp;
if(/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
{
#print "$1\n";
$key= $5;#read
$val= $1;#whole row; notice the brackets around the whole match.
$hash{$key} .= exists $hash{$key} ? "$val\n" : $val;
}
else {
print "something wrong with format";
}
}
close (DATA);
open(OUT, "> $ARGV[1]\_out\.txt");
my $readName;
while ( <LIST> )
{
#########;
if(/^(.*?)$/)#
{
$readName=$1;#
chomp $readName;
if (exists $hash{$readName})
{
print "$readName\n";
my $output =$hash{$readName};
print OUT "$output";
}
else
{
#print "it aint workin\n";
}
}
}
close (LIST);
close (OUT);
The oneliner is faster, and probably better than my script, I'm sure some people can find easier ways to do it... I just thought I'd put this up since it does what I want.