How to get the max value from an array in Perl? - perl

I have an input source as follows. What I want to do is capture the numeric value on each of the Layer lines into an array and then print out the maximum value.
input
MACRO cell_1
size 0.1 by 0.1 ;
pin a
(....same topology as pin vcc)
END a
pin b
(.....same topology as pin vcc)
END b
Pin vcc
aaaa
bbbb
Port
Layer m2 ;
END
CCC
DDD
Port
Layer m1 ;
END
EEE
FFF
Port
Layer m0 ;
END
END vcc
pin d
(....same topology as pin vcc)
END d
END cell_1
MACRO cell_2
(repeated)
END cell_2
my code:
foreach my $file ( #files ) { # #files = multiple path of abc/def/fgh/cell_lef
open( INFILE, "<$file" ) || die "Can not open stdcell_file\n";
my #lines = <INFILE>;
close INFILE;
$init = 1;
$delimiter =~ /^$/; # between each MACRO. haven't utilize this yet
foreach (#lines) {
if ( $init ) {
$path = 1;
$init = 0;
#num = ();
}
if ( $path ) {
if ( /MACRO\s+(\S+) /) {
$cellname = $1; print "$cellname\n";
}
if ( /SIZE\s+(\S+)\s+(\S+)\s+(\S+)/ ) {
$footprint_x = $1;
$footprint_y = $3;
print "$footprint_x $footprint_y\n";
}
if ( /PIN vcc/ .. /END vcc/ ) {
#grab the highest value from layer (m*)
#print "max layer = m*"
}
$init = 1;
}
}
}
intended output
cell_1
0.1 0.1
m2
cell_2
0.2 0.2
m3
The code I am attempting to use:
if ( /PIN vcc/../END vcc/ ) {
if ( /LAYER\s+m(\S+) / ) {
push(#num, $1);
print "#num";
}
}
The problem with my code so far is that when I print the value of #num, all the values are joined together as a string (210) instead of individual elements: 2 1 0 — so I am not able to do the sorting to get the max value.
Update: I am not sure how to integrate the while into my code as I'm using foreach as my main loop

Your code is capturing the numbers ok, it's just that you're printing the whole array. Perl's default separator for array elements when you print an array is "" ie, nothing - so, it looks like a single string but it's three (or however many) elements printed next to each other without a separator.
You can iterate over input line by line in Unix filter style with while (<>) {. You can switch on a "scanning mode" flag when "PIN vcc" is found and switch it off when "END vcc" is found. After that, use a regex to search for the layer line but always use 'extended mode' /x so you can use whitespace in your regex.
As the regex's for switching modes and capturing the layer number are mutually exclusive, you can let the other checks happen after a check succeeds - just be aware that if future changes result in overlapping cases, you need to do a next when one of the regex succeeds.
Finally, List::Util is a core module so, you might as well grab the max function from it;
use v5.12;
use warnings;
use List::Util qw( max );
my #num ;
my $scanning = 0;
while (<>) {
$scanning = 1 if /PIN vcc/ ;
$scanning = 0 if /END vcc/ ;
next unless $scanning ;
push #num, $1 if /Layer \s+ m (\d+) /x ;
}
say "Max layer number found: ", max(#num) ;

Or simply use sort function for to do it.
my #values ;
while (<DATA>)
{
push (#values , $1) if (/Layer\s+m(\d+)\s;/);
}
my ($max) = sort{$b <=> $a} #values ;
print "$max\n";
__DATA__
Pin vcc
aaaa
bbbb
Port
Layer m3 ;
END
CCC
DDD
Port
Layer m1 ;
END
EEE
FFF
Port
Layer m0 ;
END
use sort function and store the first result into list contain the variable $max.

my #data = <DATA> ;
my #num ;
foreach (#data) {
push #num, $1 if /Layer\s+m(\S+)/ ;
}
# home made max() - sorts an anonymous array and prints the last element
print [ sort { $a <=> $b } #num ]->[-1] , "\n" ;
__DATA__
Pin vcc
aaaa
bbbb
Port
Layer m2 ;
END
CCC
DDD
Port
Layer m1 ;
END
EEE
FFF
Port
Layer m0 ;
END
END vcc

Using map to fetch numbers from file. sort the number and get maximum number.
#!/usr/bin/perl
use strict;
use warnings;
my $max = (sort{$b <=> $a }map{/Layer\s+m(\d+)/} <DATA>)[0];
print $max,"\n";
__DATA__
Pin vcc
aaaa
bbbb
Port
Layer m3 ;
END
CCC
DDD
Port
Layer m1 ;
END
EEE
FFF
Port
Layer m0 ;
END

First, you don't need to store all values if you're looking only for max value; and second, print the max value after the loop is over,
my $max = "-inf";
if (/PIN vcc/ .. /END vcc/) {
$max = $1 if /LAYER\s+m(\S+)/ and $max < $1;
}
# ..
print $max, "\n";

Related

Split data separated by pipe from a file perl

I have a file which i have data pipe seprated having some values 0 ,1 , or some text. Now what i have to do is to read the file line by line and split data seprated by pipe and then if the values of any index is 1 then i have create increment it's value by 1 into a hash so at last i have a hash which will give details how many true values are there in the files.
Below is the example of data and the name of each value , if the value is 1 then in hash key value will get incremented by 1
A B C D E F G H I
===========================
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|0|0|0|ABC|15:44:07|0
0|1|0|0|0|1|ABC|15:44:07|0
So final out put will be like below :
$finalHash->{D} = 2;
$finalHash->{F} = 1;
$finalHash->{B} = 1;
I am able to read the file line by line :
my $final_hash;
for my $line ( <$ifh> ) {
my #el = split( /\|/, $line );
#Now here have to set the value in the hashkey where we will get the 1 value
}
Don't use for to iterate file lines, use while. The former might exhaust memory if the file was very large, as it needs to create a list of all the lines; the latter reads one line at a time.
#! /usr/bin/perl
use warnings;
use strict;
my #column_names = qw( A B C D E F );
my %counts;
while (<DATA>) {
my #columns = split /\|/;
for my $i (0 .. $#column_names) {
++$counts{ $column_names[$i] } if $columns[$i];
}
}
use Data::Dumper; print Dumper \%counts;
__DATA__
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|0|0|0|ABC|15:44:07|0
0|1|0|0|0|1|ABC|15:44:07|0

Multi-column file comparison and range extraction

Pardon me for asking a question without any coding effort. But it seems too much difficult to me.
I have a data file with tab separated three data columns (and some repetitive header lines) as:
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr
From the second data column, for those value(s) which are more than 0.5, I want to extract the corresponding first column number (or range).
For the above Input, the output would be:
NP_416485.4: 1, 3-5
YP_986467.7: 2-4
Here, "NP_416485.4" and "YP_986467.7" are from header descriptor (after \Programs). (Note that, the actual value for "NP_416485.4" for example, should be, "NP_416485.4: 0, 2-4", but I increases all of them with +1 as I don't want to start with 0).
Thanks for your consideration. I would appreciate any help. Thank you
Here is one approach. In case you would have a DOS data file on a Unix machine, I used \r?\n to match a new line, so it will work for all cases:
use feature qw(say);
use strict;
use warnings;
my $file_name = 'input.txt';
open ( my $fh, '<', $file_name ) or die "Could not open file '$file_name': $!";
my $str = do { local $/; <$fh> };
close $fh;
my #chunks = $str =~ /(Sequence(?:.(?!Sequence))*)/sg;
my %ids;
for my $cstr ( #chunks ) {
my ( $id, $data ) = $cstr
=~/Split_Seq\/(\S+)\.fasta.*?\r?\n\r?\n(.*)$/s;
my #lines = split /\n/, $data;
my #vals;
for my $line ( #lines ) {
my #fields = split " ", $line;
push ( #vals, $fields[0] + 1 ) if $fields[1] > 0.5;
}
$ids{$id} = \#vals;
}
for my $id ( keys %ids ) {
my #tmp = sort { $a <=> $b } #{ $ids{$id} };
my ( $first, $last );
my #rr;
for my $i (0..$#tmp) {
if ( $i == 0 ) {
$first = $tmp[0];
$last = undef;
}
if ( $i < $#tmp && ($tmp[$i] == ($tmp[$i+1] - 1 )) ) {
$last = $tmp[$i+1];
next;
}
if ( defined $last ) {
push #rr, "$first-$last";
$last = undef;
}
else {
push #rr, $tmp[$i];
}
$first = ( $i < $#tmp ) ? $tmp[$i+1] : undef;
}
say "$id: ", join ",", #rr;
}
Output:
NP_416485.4: 1,3-5
YP_986467.7: 2-4
You don't really give a good description of your problem, and you haven't made any effort to solve it yourself, but here's a solution to the first part of your problem (parsing the file into a data structure). You'll need to walk the %results hash and produce the output that you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my %results;
my $section;
while (<DATA>) {
# Look for a new section
if (/\\Programs\\(\S+)\s/) {
$section = $1;
}
# Look for data lines
if (/^\d\b/) {
my #data = split;
if ($data[1] > 0.5) {
push #{$results{$section}}, $data[0] + 1;
}
}
}
say Dumper \%results;
__DATA__
Sequence ../Output/yy\Programs\NP_416485.4 alignment. Using default output format...
# ../Output/Split_Seq/NP_415931.4.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.89 u-p
1 -5.79 ---
2 0.85 yui
3 0.51 uio
4 0.66 -08
Sequence ../Output/yy\Programs\YP_986467.7 alignment. Using default output format...
# ../Output/Split_Seq/YP_986467.7.fasta -- js_divergence - window_size: 3
# jjhgjg cstr score
0 0.001 -s-
1 0.984 ---
2 0.564 -fg
3 0.897 -sr

Splitting and tallying substrings within mixed integer-string data

Input Data (example):
40A3B35A3C
30A5B28A2C2B
Desired output (per-line) is a single number determined by the composition of the code 40A3B35A3C and the following rules:
if A - add the proceeding number to the running total
if B - add the proceeding number to the running total
if C - subtract the proceeding number from the running total
40A 3B 35A 3C would thus produce 40 + 3 + 35 - 3 = 75.
Output from both lines:
75
63
Is there an efficient way to achieve this for a particular column (such as $F[2]) in a tab-delimited .txt file using a one-liner? I have considered splitting the entire code into individual characters, then performing if statement checks to detect A/B/C, but my Perl knowledge is limited and I am unsure how to go about this.
When you use split with a capture, the captured group is returned from split, too.
perl -lane '
#ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * ("C" eq $op ? -1 : 1) while ($n, $op) = splice #ar, 0, 2;
print $s
' < input
Or maybe more declarative:
BEGIN { %one = ( A => 1,
B => 1,
C => -1 ) }
#ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * $one{$op} while ($n, $op) = splice #ar, 0, 2;
print $s
When working through a string like this, it's useful to know that regular expressions can return a list of results.
E.g.
my #matches = $str =~ m/(\d+[A-C])/g; #will catch repeated instances
So you can do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
my $total;
#break the string into digit+letter groups.
for (m/(\d+[A-C])/g) {
#separate out this group into num and code.
my ( $num, $code ) = m/(\d+)([A-C])/;
print "\t",$num, " => ", $code, "\n";
if ( $code eq "C" ) {
$total -= $num;
}
else {
$total += $num;
}
}
print $total, " => ", $_;
}
__DATA__
40A3B35A3C
30A5B28A2C2B
perl -lne 'push #a,/([\d]+)[AB]/g;
push #b,/([\d]+)[C]/g;
$sum+=$_ for(#a);$sum-=$_ for(#b);
print $sum;#a=#b=();undef $sum' Your_file
how it works
use the command line arg as the input
set the hash "%op" to the
operations per letter
substitute the letters for operators in the
input evaluate the substituted input as an expression
use strict;
use warnings;
my %op=qw(A + B + C -);
$ARGV[0] =~ s/(\d+)(A|B|C)/$op{$2} $1/g;
print eval($ARGV[0]);

count occurrences of every character per every line position in perl

Similar to question
unix - count occurrences of character per line/field
but for every character in every position on the line.
Given a file of ~500-characters per 1e7 lines,
I want a two dimensional summary structure like
$summary{'a','b','c','0','1','2'}[pos 0..499] = count_integer
that shows the number of times each character was used in each position of the line. Either order of dimensions is fine.
My first method did ++summary{char}[pos] while reading,
but since many lines are the same,
it was much faster to count identical lines first,
then summarize summary{char}[pos] += n at a time
Are there more idiomatic or faster ways than the following C-like 2d loop?
#!perl
my ( %summary, %counthash ); # perl 5.8.9
sub method1 {
print "method1\n";
while (<DATA>) {
my #c = split( // , $_ );
++$summary{ $c[$_] }[$_] foreach ( 0 .. $#c );
} # wend
} ## end sub method1
sub method2 {
print "method2\n";
++$counthash{$_} while (<DATA>); # slurpsum the whole file
foreach my $str ( keys %counthash ) {
my $n = $counthash{$str};
my #c = split(//, $str);
$summary{ $c[$_] }[$_] += $n foreach ( 0 .. $#c );
} #rof my $str
} ## end sub method2
# MAINLINE
if (rand() > 0.5) { &method1 } else { &method2 }
print "char $_ : #{$summary{$_}} \n" foreach ( 'a', 'b' );
# both methods have this output summary
# char a : 3 3 2 2 3
# char b : 2 2 3 3 2
__DATA__
aaaaa
bbbbb
aabba
bbbbb
aaaaa
Depending on how your data is formed method2 might be a bit faster or slower than method 1.
But a big difference would be to use unpack instead of split.
use strict;
use warnings;
my ( %summary, %counthash ); # perl 5.8.9
sub method1 {
print "method1\n";
my #l= <DATA>;
for my $t(1..1000000) {
foreach (#l) {
my #c = split( // , $_ );
++$summary{ $c[$_] }[$_] foreach ( 0 .. $#c );
}
} # wend
} ## end sub method1
sub method2 {
print "method2\n";
++$counthash{$_} while (<DATA>); # slurpsum the whole file
for my $t(1..1000000) {
foreach my $str ( keys %counthash ) {
my $n = $counthash{$str};
my $i = 0;
$summary{ $_ }[$i++] += $n foreach ( unpack("c*",$str) );
}
}
} ## end sub method2
# MAINLINE
#method1();
method2();
print "char $_ : ". join (" ", #{$summary{ord($_)}}). " \n"
foreach ( 'a', 'b' );
# both methods have this output summary
# char a : 3 3 2 2 3
# char b : 2 2 3 3 2
__DATA__
aaaaa
bbbbb
aabba
bbbbb
aaaaa
runs much faster. (6 instead of 7.x seconds on my pc)

How can I generate a set of ranges from the first letters of a list of words in Perl?

I'm not sure exactly how to explain this, so I'll just start with an example.
Given the following data:
Apple
Apricot
Blackberry
Blueberry
Cherry
Crabapple
Cranberry
Elderberry
Grapefruit
Grapes
Kiwi
Mulberry
Nectarine
Pawpaw
Peach
Pear
Plum
Raspberry
Rhubarb
Strawberry
I want to generate an index based on the first letter of my data, but I want the letters grouped together.
Here is the frequency of the first letters in the above dataset:
2 A
2 B
3 C
1 E
2 G
1 K
1 M
1 N
4 P
2 R
1 S
Since my example data set is small, let's just say that the maximum number to combine the letters together is 3. Using the data above, this is what my index would come out to be:
A B C D-G H-O P Q-Z
Clicking the "D-G" link would show:
Elderberry
Grapefruit
Grapes
In my range listing above, I am covering the full alphabet - I guess that is not completely neccessary - I would be fine with this output as well:
A B C E-G K-N P R-S
Obviously my dataset is not fruit, I will have more data (around 1000-2000 items), and my "maximum per range" will be more than 3.
I am not too worried about lopsided data either - so if I 40% of my data starts with an "S", then S will just have its own link - I don't need to break it down by the second letter in the data.
Since my dataset won't change too often, I would be fine with a static "maximum per range", but it would be nice to have that calculated dynamically too. Also, the dataset will not start with numbers - it is guaranteed to start with a letter from A-Z.
I've started building the algorithm for this, but it keeps getting so messy I start over. I don't know how to search google for this - I'm not sure what this method is called.
Here is what I started with:
#!/usr/bin/perl
use strict;
use warnings;
my $index_frequency = { map { ( $_, 0 ) } ( 'A' .. 'Z' ) };
my $ranges = {};
open( $DATASET, '<', 'mydata' ) || die "Cannot open data file: $!\n";
while ( my $item = <$DATASET> ) {
chomp($item);
my $first_letter = uc( substr( $item, 0, 1 ) );
$index_frequency->{$first_letter}++;
}
foreach my $letter ( sort keys %{$index_frequency} ) {
if ( $index_frequency->{$letter} ) {
# build $ranges here
}
}
My problem is that I keep using a bunch of global variables to keep track of counts and previous letters examined - my code gets very messy very fast.
Can someone give me a step in the right direction? I guess this is more of an algorithm question, so if you don't have a way to do this in Perl, pseudo code would work too, I guess - I can convert it to Perl.
Thanks in advance!
Basic approach:
#!/usr/bin/perl -w
use strict;
use autodie;
my $PAGE_SIZE = 3;
my %frequencies;
open my $fh, '<', 'data';
while ( my $l = <$fh> ) {
next unless $l =~ m{\A([a-z])}i;
$frequencies{ uc $1 }++;
}
close $fh;
my $current_sum = 0;
my #letters = ();
my #pages = ();
for my $letter ( "A" .. "Z" ) {
my $letter_weigth = ( $frequencies{ $letter } || 0 );
if ( $letter_weigth + $current_sum > $PAGE_SIZE ) {
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
$current_sum = $letter_weigth;
#letters = ( $letter );
next;
}
push #letters, $letter;
$current_sum += $letter_weigth;
}
if ( $current_sum ) {
my $title = $letters[ 0 ];
$title .= '-' . $letters[ -1 ] if 1 < scalar #letters;
push #pages, $title;
}
print "Pages : " . join( " , ", #pages ) . "\n";
Problem with it is that it outputs (from your data):
Pages : A , B , C-D , E-J , K-O , P , Q-Z
But I would argue this is actually good approach :) And you can always change the for loop into:
for my $letter ( sort keys %frequencies ) {
if you need.
Here's my suggestion:
# get the number of instances of each letter
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# transform the list of counts into a map of count => letters
my %freq = ();
while (my ($letter, $count) = each %count)
{
push #{ $freq{ $count } }, $letter;
}
# now print out the list of letters for each count (or do other appropriate
# output)
foreach (sort keys %freq)
{
my #sorted_letters = sort #{ $freq{$_} };
print "$_: #sorted_letters\n";
}
Update: I think that I misunderstood your requirements. The following code block does something more like what you want.
my %count = ();
while (<FILE>)
{
$count{ uc( substr( $_, 0, 1 ) ) }++;
}
# get the maximum frequency
my $max_freq = (sort values %count)[-1];
my $curr_set_count = 0;
my #curr_set = ();
foreach ('A' .. 'Z') {
push #curr_set, $_;
$curr_set_count += $count{$_};
if ($curr_set_count >= $max_freq) {
# print out the range of the current set, then clear the set
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
#curr_set = ();
$curr_set_count = 0;
}
}
# print any trailing letters from the end of the alphabet
if (#curr_set > 1)
print "$curr_set[0] - $curr_set[-1]\n";
else
print "$_\n";
Try something like that, where frequency is the frequency array you computed at the previous step and threshold_low is the minimal number of entries in a range, and threshold_high is the max. number. This should give harmonious results.
count=0
threshold_low=3
threshold_high=6
inrange=false
frequency['Z'+1]=threshold_high+1
for letter in range('A' to 'Z'):
count += frequency[letter];
if (count>=threshold_low or count+frequency[letter+1]>threshold_high):
if (inrange): print rangeStart+'-'
print letter+' '
inrange=false
count=0
else:
if (not inrange) rangeStart=letter
inrange=true
use strict;
use warnings;
use List::Util qw(sum);
my #letters = ('A' .. 'Z');
my #raw_data = qw(
Apple Apricot Blackberry Blueberry Cherry Crabapple Cranberry
Elderberry Grapefruit Grapes Kiwi Mulberry Nectarine
Pawpaw Peach Pear Plum Raspberry Rhubarb Strawberry
);
# Store the data by starting letter.
my %data;
push #{$data{ substr $_, 0, 1 }}, $_ for #raw_data;
# Set max page size dynamically, based on the average
# letter-group size (in this case, a multiple of it).
my $MAX_SIZE = sum(map { scalar #$_ } values %data) / keys %data;
$MAX_SIZE = int(1.5 * $MAX_SIZE + .5);
# Organize the data into pages. Each page is an array reference,
# with the first element being the letter range.
my #pages = (['']);
for my $letter (#letters){
my #d = exists $data{$letter} ? #{$data{$letter}} : ();
if (#{$pages[-1]} - 1 < $MAX_SIZE or #d == 0){
push #{$pages[-1]}, #d;
$pages[-1][0] .= $letter;
}
else {
push #pages, [ $letter, #d ];
}
}
$_->[0] =~ s/^(.).*(.)$/$1-$2/ for #pages; # Convert letters to range.
This is an example of how I would write this program.
#! /opt/perl/bin/perl
use strict;
use warnings;
my %frequency;
{
use autodie;
open my $data_file, '<', 'datafile';
while( my $line = <$data_file> ){
my $first_letter = uc( substr( $line, 0, 1 ) );
$frequency{$first_letter} ++
}
# $data_file is automatically closed here
}
#use Util::Any qw'sum';
use List::Util qw'sum';
# This is just an example of how to calculate a threshold
my $mean = sum( values %frequency ) / scalar values %frequency;
my $threshold = $mean * 2;
my #index;
my #group;
for my $letter ( sort keys %frequency ){
my $frequency = $frequency{$letter};
if( $frequency >= $threshold ){
if( #group ){
if( #group == 1 ){
push #index, #group;
}else{
# push #index, [#group]; # copy #group
push #index, "$group[0]-$group[-1]";
}
#group = ();
}
push #index, $letter;
}elsif( sum( #frequency{#group,$letter} ) >= $threshold ){
if( #group == 1 ){
push #index, #group;
}else{
#push #index, [#group];
push #index, "$group[0]-$group[-1]"
}
#group = ($letter);
}else{
push #group, $letter;
}
}
#push #index, [#group] if #group;
push #index, "$group[0]-$group[-1]" if #group;
print join( ', ', #index ), "\n";