How do I get the lowest matched captureno? - perl

How do I get the lowest matched group no in a regular expression?
Suppose there is a regular expression
/(a(b))|(b(1))|(c(4))/...
e.g input string is "b1" , the lowest matched group no 2. ($2)
e.g input string is "c4" , the lowest matched group no 5. ($5)
e.g input string is "ab" , the lowest matched group no 1. ($1)
I have one solution but it is not very efficient.
Thakns all for trying.
The real problem is efficiency. Many have provided similar solutions I found.
The problem is linear time searching for the lowest group. O(N) where n is number of capture groups.
I wondered if there is a faster way. O(1) That was the aim of this question.
I expected the Perl has a hidden feature to get that value. I guess there is not.
Meanwhile I found the solution myself, here it is..
/(a(b)(??{ $first=1;"" }))|(b(1)(??{ $first=2;"" }))|(c(4)(??{ $first=5;"" }))/
The time to find out $first is O(1).
if (#matches = $conv::content =~/$conv::trtree{convertsourceregqr}[$conversionno]/)
{
my $firstno;
my $c = 0;
for my $m (#matches)
{
if (defined $m)
{
$firstno=$c;
last;
}
$c++;
}**strong text****strong text**

This doesn't specifically match your question, but it might address your actual problem (or else a future reader's).
Edit (12/10/12):
One more option, the special construct (?|) will reorganize the numbering in alternations, so that the numbers will be consistent. This won't help identify which group matched, but will assure you that the matches are in $1 and $2. If you need to know which matched, named captures (below) are the way to go.
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $v ('ab', 'b1', 'c4') {
print "Input: $v\n";
next unless $v =~ /(?|(a(b))|(b(1))|(c(4)))/;
print "$1 => $2\n";
}
Original
Perhaps you want to use named captures to ease the burden of understanding what matched. The named capture results are placed in the %+ hash and are thus much easier to introspect.
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $v ('ab', 'b1', 'c4') {
print "Input: $v\n";
next unless $v =~ /(?<a>a(?<ab>b))|(?<b>b(?<b1>1))|(?<c>c(?<c4>4))/;
foreach my $key (sort keys %+) {
next unless defined $+{$key};
print "\t$key => $+{$key}\n";
}
}
prints
Input: ab
a => ab
ab => b
Input: b1
b => b1
b1 => 1
Input: c4
c => c4
c4 => 4
EDIT
In fact, for alternations like this, perhaps you want to simply use recurring names!
#!/usr/bin/env perl
use strict;
use warnings;
foreach my $v ('ab', 'b1', 'c4') {
print "Input: $v\n";
next unless $v =~ /(?<outer>a(?<inner>b))|(?<outer>b(?<inner>1))|(?<outer>c(?<inner>4))/;
print "\touter => $+{outer}\n";
print "\tinner => $+{inner}\n";
}
prints
Input: ab
outer => ab
inner => b
Input: b1
outer => b1
inner => 1
Input: c4
outer => c4
inner => 4

Store the matches in an array, and find the index of the first defined value:
my $str = 'c4';
my #matches = ( $str =~ m/(a(b))|(b(1))|(c(4))/ );
for my $i ( 0..$#matches ) {
if ( defined $matches[$i] ) {
printf "First matching group: %d\n", $i+1;
last;
}
}
# output: 5
Note that this will never output 2, 4 or 6 since groups 1, 3 or 5 must match for one of them to match.
If you only want the content of the first matching group:
use List::Util 'first';
my $str = 'c4';
print first { defined } $str =~ m/(a(b))|(b(1))|(c(4))/;

The special variables #- and #+ hold the starting and ending positions of successful matches. The practical application to your question is that if $<n> holds some value (for $<n> in $1, $2, etc.), then $+[<n>] will be larger than $-[<n>].
for ('b1', 'c4', 'ab') {
/(a(b))|(b(1))|(c(4))/;
my #i = grep { $+[$_] > $-[$_] } 1..$#+;
# #i contains list of successful matches,
# i.e., if #i == (3,4), then $3 and $4 contain values
if (#i > 0) {
print "Earliest match for '$_' is: \$$i[0]\n";
} else {
print "No match for '$_'\n";
}
}

First off, using parentheses that way is confusing. The simplest solution to this particular problem is to just use one:
/(ab|b1|c4)/
Since the other parentheses do not serve a purpose in this particular case, this will work.
However, there may be times when grouping is needed, in which case you can use non-capturing parentheses and just use one to capture, (?: ... ). In your case it would look like this:
/((?:a(?:b))|(?:b(?:1))|(?:c(?:4)))/

The group numbering in a regular expression is the number of parens.
1 2 3 4 5 6
/(a(b))|(b(1))|(c(4))/
A quick script to demonstrate this:
#!/usr/bin/perl
foreach my $v ('ab', 'b1', 'c4') {
$v =~ /(a(b))|(b(1))|(c(4))/;
if(defined $1) { print "One!\n"; }
if(defined $3) { print "Three!\n"; }
if(defined $5) { print "Five!\n"; }
print << "--EOB--";
$v
1 $1
2 $2
3 $3
4 $4
5 $5
6 $6
--EOB--
}
Which produces the output:
One!
ab
1 ab
2 b
3
4
5
6
Three!
b1
1
2
3 b1
4 1
5
6
Five!
c4
1
2
3
4
5 c4
6 4
At this point, one should be able to easily modify the code to do whatever for whichever group is matched.

Related

Sort comma-delimited file by three columns with custom criteria in Perl

I have a comma-delimited, text file. I want to sort the file by the 3rd column first, then the 2nd column, then the 1st column.
However, I want the 3rd column to be sorted alphabetically, with the longest value first.
For example, AAA, then AA, then A, then BBB, then BB, then B, then CCC, then CC, and so on.
Input (alpha-sort-test2.txt):
JOHN,1,A
MARY,3,AA
FRED,5,BBB
SAM,7,A
JOHN,3,AAA
JOHN,2,AAA
BETTY,2,AAA
JARROD,7,AAA
JOANNE,2,BB
AMANDA,2,DD
AMY,5,B
PETE,7,CC
MATT,4,B
SARAH,3,CCC
GEORGE,3,CC
AMANDA,3,AAA
The Perl code that I have so far is as follows:
$infile = "alpha-sort-test2.txt";
$outfile = "alpha-sort-test-sorted2.txt";
open (INFILE, "<$infile") or die "Could not open file $infile $!";
open (OUTFILE, ">$outfile");
my #array = sort howtosort <INFILE>;
foreach (#array)
{
chomp;
print "$_\n";
print OUTFILE "$_\n";
}
sub howtosort
{
my #flds_a = split(/,/, $a);
my #flds_b = split(/,/, $b);
$flds_a[2] cmp $flds_b[2];
}
close INFILE;
close OUTFILE;
Current output (alpha-sort-test-sorted2.txt):
JOHN,1,A
SAM,7,A
MARY,3,AA
AMANDA,3,AAA
JOHN,3,AAA
JOHN,2,AAA
BETTY,2,AAA
JARROD,7,AAA
AMY,5,B
MATT,4,B
JOANNE,2,BB
FRED,5,BBB
PETE,7,CC
GEORGE,3,CC
SARAH,3,CCC
AMANDA,2,DD
Desired output:
BETTY,2,AAA
JOHN,2,AAA
AMANDA,3,AAA
JOHN,3,AAA
JARROD,7,AAA
MARY,3,AA
JOHN,1,A
SAM,7,A
FRED,5,BBB
JOANNE,2,BB
MATT,4,B
AMY,5,B
SARAH,3,CCC
GEORGE,3,CC
PETE,7,CC
AMANDA,2,DD
Thanks in advance.
There's a little complication with that criterion for the third field.
Lexicographical comparison goes char by char, so abc is lesser-than ax but longer strings are greater, with all else equal. So ab is lesser-than b but ab is greater-than a.
Thus that requirement for the third field mixes these two things and breaks cmp right down the middle. If we were to use cmp then ab comes before b (correct) but aa comes after a (not wanted). I don't see how to make use of cmp at all for that requirement.
So here's a very basic implementation of it, for these criteria
use warnings;
use strict;
use feature 'say';
use Path::Tiny qw(path); # convenience
my $file = shift // die "Usage: $0 file\n";
my #lines = path($file)->lines({ chomp => 1 });
my #sorted =
map { $_->[0] }
sort { custom_sort($a, $b) }
map { [$_, split /,/] }
#lines;
say for #sorted;
sub custom_sort {
my ($aa, $bb) = #_;
# Last field for both terms, their lengths
my ($af, $bf) = map { $_->[-1] } $aa, $bb;
my ($len_a, $len_b) = map { length } $af, $bf;
# Strip and return first characters and compare them lexicographically
# Then compare lengths of original strings if needed
# Keep going until difference is found or one string is depleted
while (
(my $ca = substr $af, 0, 1, "") and
(my $cb = substr $bf, 0, 1, "") )
{
if ($ca gt $cb) {
return 1
}
elsif ($ca lt $cb) {
return -1;
}
elsif ($len_a < $len_b) {
return 1
}
elsif ($len_a > $len_b) {
return -1
}
}
# Still here, so third field was the same; use other two criteria
return
$aa->[2] <=> $bb->[2]
||
$aa->[1] cmp $bb->[1];
}
This prints out the desired list.
Some comments
Before invoking sort we first form an arrayref, with the whole string and its individual fields, so that the string need not be split later on every single comparison; this is Schwartzian transform
Criterion for the third-field: compare character by character alphabetically until a difference is found; if one string is contained in the other then the longer one wins. So the char-by-char comparison of abc and ab stops at b and abc 'wins'
The (optional) fourth argument in substr is the replacement for the returned substring, found per the second and third argument. So here an empty string replaces one-long substring that starts at 0 -- it removes and returns the first character. This is quite like using shift on an array
If the third fields are exactly the same then the second fields are compared numerically and if they are the same then the first fields are compared alphabetically
After the comparison we retrieve the original string from the sorted arrayrefs

Compare 2 sets of numbers contained in strings

I have 2 scalars as below:
$a = '100 105 010';
$b = '010 105 100';
How do I compare if both has same set of values? order doesn't matter.
one way is to arrange it in ascending order and compare, is there a better way?
You can split each string into an array and sort and compare arrays. By hand:
use warnings;
use strict;
my $x = '100 105 1 010';
my $y = '010 105 100 2';
my #xs = sort { $a <=> $b } split ' ', $x;
my #ys = sort { $a <=> $b } split ' ', $y;
if (#xs != #ys) {
print "Differ in number of elements.\n";
}
else {
for (0..$#xs) {
print "Differ: $xs[$_] vs $ys[$_]\n" if $xs[$_] != $ys[$_];
}
}
# For boolean comparison put it in a sub
print arr_num_eq(\#xs, \#ys), "\n";
sub arr_num_eq {
my ($ra, $rb) = #_;
return 0 if #$ra != #$rb;
$ra->[$_] != $rb->[$_] && return 0 for 0..$#$ra;
return 1;
}
The sorting can be moved to the sub as well, which would then take strings. The way it stands it can be used for comparison of existing arrays as well. Please add argument checking.
There is a number of modules that have this capability. The perm from Array::Compare hides the sorting above, but internally joins sorted arrays into strings thus duplicating the work here since we started with strings. The List::AllUtils certainly offers this as well with its long list of utilities.
See this post, for example, for a few methods (just not the smart match ~~), and for benchmarks if efficiency is a concern.
Using the mentioned implementation idea from Array::Compare, per comment by ysth
sub str_num_eq {
return join(' ', sort { $a <=> $b } split / /, $_[0])
eq join(' ', sort { $a <=> $b } split / /, $_[1])
}
What the most suitable method is depends on what this is for and how it is used. Is it only a boolean comparison, or will more be done if they are found to differ? How does it come about in your program flow? What are typical sizes of strings, how often is it run? Are the strings most often the same or different, do they typically differ a lot or a little? Etc.
Without modules, you can use hashes:
#!/usr/bin/perl
use warnings;
use strict;
my $x = '100 105 010 2';
my $y = '010 105 100 100 1';
my (%hx, %hy);
$hx{$_}++ for split ' ', $x;
$hy{$_}++ for split ' ', $y;
for my $k (keys %hx) {
if (! exists $hy{$k}) {
print "$k missing in y\n";
} elsif ($hy{$k} != $hx{$k}) {
print "$k has different number of occurences\n";
}
delete $hy{$k};
}
print "$_ missing in x\n" for keys %hy;
$a and $b are special variables used in sort, so I renamed them to $x and $y.
split turns the strings into lists. Each hash counts how many times a member occurs in the list.
See also Perl FAQ 4.
Something else try with pattern matching,
This is not a straight forward but it will work.
Construct the pattern by anyone of your scalar value. Then check the another string by the constructed pattern.
my $a = '100 100 105';
my $b = '100 105 100';
my #b_ary = split(" ",$b);
my $regex = join'\b|\b', #b_ary;
my $word_length = #b_ary * 2 - 1; #Count the number of words and space.
my $rgx = qr"^(?:\b$regex\b|\s){$word_length}$"; #`{n}` match word exact n times
if($a=~m/$rgx/)
{
print "same values\n";
}
else
{
print "Not a same values\n";
}
The answer is already posted above. This is just in case you want to remove the white spaces and compare each number.
$x = '100 105 010';
$y = '010 105 100';
join("",sort split "",join("",split " ",$x)) eq join("",sort split "",join("",split " ",$y));

Perl sort on numbers

I have some perl script to process a file (contains lots of numbers) line by line.
File content (sample data, first 3 numbers are seperated by space, then the seperate is tab between 3rd and 4th numbers):
1 2 3 15
2 9 8 30
100 106 321 92
9 8 2 59
300 302 69 88
....
Script content:
# snippet of script
open(INF, "$infile") || die "Unable to open file $infile: $!\n";
#content = <INF>;
close(INF);
foreach $line (#content) {
# blah blah, script to handle math here
# Now the numbers are stored in separate variables
# $n1 stores the 1st number, i.e.: 1
# $n2 stores the 2nd number, i.e.: 2
# $n3 stores the 3rd number, i.e.: 3
# $n4 stores the 4th number, i.e.: 15
# Solution code to be inserted here
}
I would like to:
Sort the variables $n1, $n2, $n3 and output them in ascending order.
At the end of foreach, get rid of duplicates
My approach:
# Insert below code to foreach
$numbers{$n1} = 1;
$numbers{$n2} = 1;
$numbers{$n3} = 1;
#keys = sort { $numbers{$b} <=> $numbers{$a} } keys %numbers;
#push #numbers, "$keys[0] $keys[1] $keys[2]";
$numbers2{"$keys[0] $keys[1] $keys[2]"} = 1;
This defines two hashes: 1st hash is for sorting, 2nd hash is for removing duplicates after sorting.
Is there any better approach?
Thanks,
Updated with another solution -- it is the lines that may be duplicate, not numbers on a line.
In order to remove duplicate lines it is easiest if we have all sorted lines of three numbers in an array. Then we post-process that by running them through uniq. There are (at least) two ways to do this.
Store lines in an array, each being a reference to a sorted array with the three numbers. Then for comparison construct a string of each on the fly. This is better if there is yet other processing of numbers somewhere along, as they are in arrays.
Build a string out of each sorted line and store those in an array. Then it's easier to compare.
Below I use the first approach, assuming that there is other processing of numbers.
use warnings;
use strict;
use feature wq(say);
use List::MoreUtils qw(uniq);
my $file = 'sort_nums.txt';
my #content = do {
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>;
};
my #linerefs_all;
foreach my $line (#content) {
# Calculations ... numbers stored in ($n1, $n2, $n3, $n4)
my ($n1, $n2, $n3) = split '\s+' $line; # FOR TESTING
# Add to #rlines a reference to the sorted array with first three
push #linerefs, [ sort { $a <=> $b } ($n1, $n2, $n3) ];
}
# Remove dupes by comparing line-arrays as strings, then remake arrayrefs
my #linerefs = map { [ split ] } uniq map { join ' ', #$_ } #linerefs_all;
say "#$_" for #linerefs;
Using the posted lines in the file sort_nums.txt, the code above prints
1 2 3
2 8 9
100 106 321
69 300 302
Explanation of the post-processing line, read from the right.
The map on the right processes a list of arrayrefs. It dereferences each and joins its elements with a space, forming a string for the line. It returns a list of such strings, one per line.
That list is pruned of duplicates by uniq, which itself returns a list, fed into the map on the left.
In that map's block each string is split by (the default) white space into a list (of numbers on the line), and then a reference of that is taken by [ ]. This map thus returns a list of references to arrays, one for each line, what is assigned to #linerefs.
This is then printed.
If this is too much to stomach for one statement break the process into steps, generating intermediate arrays. Or switch to the second method above.
Initial post, assuming that numbers on each line may be duplicates
I take the objective to be: sort three variables and keep only unique ones, for each line.
use List::MoreUtils qw(uniq);
foreach my $line (#content) {
# Calculations, numbers stored in ($n1, $n2, $n3, $n4)
my #nums = uniq sort { $a <=> $b } ($n1, $n2, $n3);
say "#nums";
}
Remember that after this you don't know which one(s) of $n1, $n2, $n3 may have been dropped.
If, for some reason, a non-core module is not suitable, see this in perlfaq4 For example,
my %seen = ();
my #nums = sort { $a <=> $b } grep { ! $seen{$_}++ } ($n1, $n2, $n3);
or, if you need it without an extra hash around
my #nums = do { my %seen; sort { $a <=> $b } grep { !$seen{$_}++ } ($n1, $n2, $n3) };

In Perl, how do I sort by frequency of a value?

I am trying to create a program to count the different values that occur in a column of a data file. So, it would be something like, if the possible values of a column are A, B, C. The output is something like
A 456
B 234
C 344
I have been able to get the running counts of A, B and C easily by doing something like this
my %count;
for my $f (#ffile) {
open F, $f || die "Cannot open $f: $!";
while (<F>) {
chomp;
my #U = split / /;
$count{$U[2]}++;
}
}
foreach my $w (sort keys %count) {
printf $w\t$count{$w};
}
For instance here I am counting the second column of the file in the path given.
How do I sort the output of the printf by the counts rather than the keys (or values A, B, C) to get -
A 456
C 344
B 234
This is a FAQ:
perldoc -q sort
use warnings;
use strict;
my %count = (
A => 456,
B => 234,
C => 344
);
for my $w (sort { $count{$b} <=> $count{$a} } keys %count) {
print "$w\t$count{$w}\n";
}
__END__
A 456
C 344
B 234
for my $w (sort {$count{$b} <=> $count{$a}} keys %count) {
print "$w\t$count{$w}\n";
}
Some additional comments:
The output is something like...by doing something like this
You help us help you if you paste your actual code, abbreviated where possible.
When people recreate their actual code, they often obscure or omit the very source of their problem.
chomp;
my #U = split / /;
This splits on space characters and looks for the count after the second space; it's often easier to do:
my #U = split ' ';
split used with a constant space instead of a regex splits on any sequence of whitespace, like split /\s+/ except that it ignores trailing whitespace...this is a common enough thing to do that there is this special syntax for it. Note that the chomp becomes unnecessary.

Perl count the sum of one column aggregating by another

I have a dataset will a lot of columns. What I need to do is to sum a aggregate a certain column in terms of another. As an example,
ID Volume
A 20
D 60
B 10
A 50
K 30
B 100
D 80
So I want an aggregated sum of all the different IDs (A, B, C...) in terms of volumes and sorted by that sum
The result would be like
D 140
B 110
A 70
K 30
how would I accomplish this in perl?
#!/usr/bin/perl
use strict;
use warnings;
my %ids_and_sums;
while (<>) {
# The regex will only consider one single uppercase letter as
# an ID; in case your IDs may look different, you could prepend
# your 'ID Volume' line with a character which will never be part
# of an ID, and modify below regex to meet your needs
my ($id, $volume) = m/^([A-Z])\s+(\d+)/;
if ($id and $volume) {
$ids_and_sums{$id} += $volume;
}
}
foreach my $key (sort {$ids_and_sums{$b} <=> $ids_and_sums{$a}} keys %ids_and_sums) {
print "$key: $ids_and_sums{$key}\n";
}
This prints:
D: 140
B: 110
A: 70
K: 30
EDIT: I have modified the code so that the sorting will be in descending order of the sums.
You can do it as:
perl -lnae '$H{$F[0]} += $F[1];END { print $_." ".$H{$_} for(keys %H) }'
passing it all but the first line of your input file as standard input.
Ideone Link
You can make Perl discard the heading line as:
perl -lnae 'BEGIN{$i=1;}if($i){$i=0;next;}$H{$F[0]} += $F[1];END { print $_." ".$H{$_ } for(keys %H) }' file
Ideone Link
$, = ' '; # set output field separator
$\ = "\n"; # set output record separator
while (<>) {
($Fld1,$Fld2) = split(' ', $_, -1);
$map{$Fld1} += $Fld2;
}
foreach $i (keys %map) {
print $i, $map{$i};
}
something like this