perl regex too greedy - perl

I went through similar questions asked by other members and applied (or tried to apply) solutions from their inquiry but they did not work on my issue. My pattern match and grouping is too greedy and does not stop at first pipe(|). If I get more specific, I think it can but I'm trying to figure out how I can stop the pattern match at the first instance of the pipe?
Here are couple of lines
09:30:00.063|IN:|8=FIX.4.2|9=206|35=D|34=5159|49=CLIENT|52=20191024-13:30:00.050|56=SERV|57=DEST|1=05033|11=ABZ5702|15=USD|21=1|38=2000|40=2|44=92.48|47=A|54=5|55=RC|60=20191024-13:30:00.050|111=0|114=N|336=X|5700=AP|9281=SOV|10=202
09:37:21.208|IN:|8=FIX.4.2|9=170|35=D|34=5184|49=CLIENT|52=20191024-13:37:21.206|56=SERV|57=ATXB|1=J5129|11=136404|15=USD|21=1|38=100|40=2|44=1.39|47=A|54=2|55=DIW|59=2|60=20191024-13:30:00.206|10=029
I'm expecting my perl script to return the following output from the above data:
09:30:00.063|13:30:00.050|ABZ5702
09:37:21.208|13:37:21.206|136404
I tried all this and few other veriations but could not get it to produce the above output:
#$msg =~ s/([^|]*).*|52=([^|]*).*|11=([^|]*).*/$1|$2|$3/;
$msg =~ s/(.+)\|??.*|52=([^|]*).*|11=([^|]*).*/$1|$2|$3/;
#$msg =~ s/^([^|]*).??|52=([^|]*).??|11=([^|]*).*/$1|$2|$3/;
#$msg =~ s/^([^\|??]*).*|52=([^\|??]*).*|11=([^\|??]*).*/$1|$2|$3/;
#$msg =~ s/(.*\|??).*|52=(.+\|??).*|11=(.+\|??).*/one $1|two $2|three $3/;
#$msg =~ s/(.*?|).*|52=(.*?|).*|11=(.*|?).*/$1|$2|$3/;
#$msg =~ /(.*)|??.*|52=(.*)|??.*|11=(.*)|??.*/$1|$2|$3/;
#$msg =~ s/|.*-[0-3][0-9]:/|/;
print "$msg\n";```
I realize there are other more than one way to skin the cat but there are cases where I need to use the pattern match approach. How can I get it to produce the expected output using the pattern matching where it stops each group at first pipe(|)? Can someone tell me what am I doing wrong?

Try this:
s/(.*?)\|.*\|52=([^|]*).*\|11=([^|]*).*/$1 $2 $3/;
There were a couple of pipe delimiters that needed escaping.
You need to look at non-greedy matching https://docstore.mik.ua/orelly/perl/cookbook/ch06_16.htm
The first matching group is (.*?) instead of (.*). The ? means we match as little as possible.
In general, for parsing FIX in perl, as long as there are no repeating groups, I would recommend splitting on | first and then creating a hash of tag-value pairs.

I would do it a little bit different - split line into array and work on individual element of array.
The regex may be an acceptable solution for one particular case if format of line predetermined and will never change.
use strict;
use warnings;
use Data::Dumper;
my $debug = 0;
while( my $line = <DATA> ) {
my #array = split /\|/, $line;
print Dumper(\#array) if $debug;
$array[7] =~ s/.+?-//;
$array[11] =~ s/\d+=//;
printf "%s\n", join '|', #array[0,7,11];
}
__DATA__
09:30:00.063|IN:|8=FIX.4.2|9=206|35=D|34=5159|49=CLIENT|52=20191024-13:30:00.050|56=SERV|57=DEST|1=05033|11=ABZ5702|15=USD|21=1|38=2000|40=2|44=92.48|47=A|54=5|55=RC|60=20191024-13:30:00.050|111=0|114=N|336=X|5700=AP|9281=SOV|10=202
09:37:21.208|IN:|8=FIX.4.2|9=170|35=D|34=5184|49=CLIENT|52=20191024-13:37:21.206|56=SERV|57=ATXB|1=J5129|11=136404|15=USD|21=1|38=100|40=2|44=1.39|47=A|54=2|55=DIW|59=2|60=20191024-13:30:00.206|10=029

Related

Why is my last line is always output twice?

I have a uniprot document with a protein sequence as well as some metadata. I need to use perl to match the sequence and print it out but for some reason the last line always comes out two times. The code I wrote is here
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if($_=~m /^\s+(\D+)/) { #this is the pattern I used to match the sequence in the document
$seq=$1;
$seq=~s/\s//g;} #removing the spaces from the sequence
print $seq;
}
I instead tried $seq.=$1; but it printed out the sequence 4.5 times. Im sure i have made a mistake here but not sure what. Here is the input file https://www.uniprot.org/uniprot/P30988.txt
Here is your code reformatted and extra whitespace added between operators to make it clearer what scope the statements are running in.
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
}
print $seq;
}
The placement of the print command means that $seq will be printed for every line from the input file -- even those that don't match the regex.
I suspect you want this
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
# only print $seq for lines that match with /^\s+(\D+)/
# Also - added a newline to make it easier to debug
print $seq . "\n";
}
}
When I run that I get this
MRFTFTSRCLALFLLLNHPTPILPAFSNQTYPTIEPKPFLYVVGRKKMMDAQYKCYDRMQ
QLPAYQGEGPYCNRTWDGWLCWDDTPAGVLSYQFCPDYFPDFDPSEKVTKYCDEKGVWFK
HPENNRTWSNYTMCNAFTPEKLKNAYVLYYLAIVGHSLSIFTLVISLGIFVFFRSLGCQR
VTLHKNMFLTYILNSMIIIIHLVEVVPNGELVRRDPVSCKILHFFHQYMMACNYFWMLCE
GIYLHTLIVVAVFTEKQRLRWYYLLGWGFPLVPTTIHAITRAVYFNDNCWLSVETHLLYI
IHGPVMAALVVNFFFLLNIVRVLVTKMRETHEAESHMYLKAVKATMILVPLLGIQFVVFP
WRPSNKMLGKIYDYVMHSLIHFQGFFVATIYCFCNNEVQTTVKRQWAQFKIQWNQRWGRR
PSNRSARAAAAAAEAGDIPIYICHQELRNEPANNQGEESAEIIPLNIIEQESSA
You can simplify this a bit:
while (<IN>) {
next unless m/^\s/;
s/\s+//g;
print;
}
You want the lines that begin with whitespace, so immediately skip those that don't. Said another way, quickly reject things you don't want, which is different than accepting things you do want. This means that everything after the next knows it's dealing with a good line. Now the if disappears.
You don't need to get a capture ($1) to get the interesting text because the only other text in the line is the leading whitespace. That leading whitespace disappears when you remove all the whitespace. This gets rid of the if and the extra variable.
Finally, print what's left. Without an argument, print uses the value in the topic variable $_.
Now that's much more manageable. You escape that scoping issue with if causing the extra output because there's no scope to worry about.

Perl with FASTA sequence extraction has problems (only) with first sequence

I am using a function/subroutine extract_seq available on internet to extract sequences in FASTA files. Briefly:
A sequence begins with first line identified by '>', followed by ID and other information separated by spaces
Subsequent lines (not beginning with '>' have multiple strings
A FASTA file can have 1 or more sequences
Bug is that the output has additional '>' character for first sequence (only) causing consistency problems.
Program works fine in extracting sequences based on ID except for additional '>' in case of first sequence. Could you please suggest a solution as well as reason for the bug? A simple regex would fix the problem but I do not feel good about fixing bugs that I cannot understand.
The Perl script is:
#!/usr/bin/perl -w
use strict;
my $seq_all = "seq_all.fa"; # all proteins in fasta format
foreach my $q_seq ("A0A1D8PC43","A0A1D8PC38") {
print "Querying $q_seq\n";
&extract_seq($seq_all, $q_seq);
}
exit 0;
sub extract_seq
{
open(my $fh, ">query.seq");
my $seq_all = $_[0];
my $lookup = $_[1];
local $/ = "\n>";
#ARGV = ($seq_all);
while (my $seq = <>) {
chomp $seq;
my ($id) = $seq =~ /^>*(\S+)/;
if ($id eq $lookup) {
print "$seq\n";
last;
}
}
}
The FASTA file is:
>A0A1D8PC43 A0A1D8PC43_CANAL Diphosphomevalonate decarboxylase
MYSASVTAPVNIATLKYWGKRDKSLNLPTNSSISVTLSQDDLRTLTTASASESFEKDQLW
LNGKLESLDTPRTQACLADLRKLRASIEQSPDTPKLSQMKLHIVSENNFPTAAGLASSAA
GFAALVSAIAKLYELPQDMSELSKIARKGSGSACRSLFGGFVAWEMGTLPDGQDSKAVEI
APLEHWPSLRAVILVVSDDKKDTPSTTGMQSTVATSDLFAHRIAEVVPQRFEAMKKAILD
KDFPKFAELTMKDSNSFHAVCLDSYPPIFYLNDTSKKIIKMVETINQQEVVAAYTFDAGP
NAVIYYDEANQDKVLSLLYKHFGHVPGWKTHYTAETPVAGVSRIIQTSIGPGPQETSESL
TK
>A0A1D8PC56 A0A1D8PC56_CANAL Uncharacterized protein OS=Candida
MSDTKKTTETDSEVGYLDIYLRFNDDMEKDYCFQVKTTTVFKDLYKVFRTLPISLRPSVF
YHAQPIGFKKSVSPGYLTQDGNFIFDEDSQKQAVPVNDNDLINETVWPGQLILPVWQFND
FGFYSFLAFLACWLYTDLPDFISPTPGICLTNQMTKLMAWVLVQFGKDRFAETLLADLYD
TVGVGAQCVFFGFHIIKCLFIFGFLYTGVFNPMRVFRLTPRSVKLDVTKEELVKLGWTGT
RKATIDEYKEYYREFKINQHGGMIQAHRAGLFNTLRNLGVQLESGEGYNTPLTEENKLRT
MRQIVEDAKKPDFKLKLSYEYFAELGYVFATNAENKEGSELAQLIKQYRRYGLLVSDQRI
KTVVRARKGETDEEKPKVEEVVEE
>A0A1D8PC67 A0A1D8PC67_CANAL Bfa1p OS=Candida albicans (strain
MVSDKLTLLRQFSEEDELFGDIEGIDYHDGETLKINKFSFPSSASSPSFAITGQSPNMRS
INGKRITRETLSEYSEENETDLTSEFSDQEFEWDGFNKNQSIYQQMNQRLIATKVAKQRE
AEREQRELMQKRHKDYDPNQTLRLKDFNKLTNENLTLLDQLDDEKTVNYEYVRDDVEDFA
QGFDKDFETKLRIQPSMPTLRSNAPTLKKYKSYGEFKCDNRVKQKLDRIPSFYNKNQLLS
KFKETKSYHPHHKKMGTVRCLNNNSEVPVTYPSISNMKLNKEKNRWEGNDIDLIRFEKPS
LITHKENKTKKRQGNMVYDEQNLRWINIESEHDVFDDIPDLAVKQLQSPVRGLSQFTQRT
TSTTATATAPSKNNETQHSDFEISRKLVDKFQKEQAKIEKKINHWFIDTTSEFNTDHYWE
IRKMIIEE
>A0A1D8PC38 A0A1D8PC38_CANAL Cta2p OS=Candida albicans (strain
MPENLQTRLHNSLDEILKSSGYIFEVIDQNRKQSNVITSPNNELIQKSITQSLNGEIQNF
HAILDQTVSKLNDAEWCLGVMVEKKKKHDELKVKEEAARKKREEEAKKKEEEAKKKAEEA
KKKEEEAKKAEEAKKAEEAKKVEEAAKKAEEAKKAEEEARKKAETAPQKFDNFDDFIGFD
INDNTNDEDMLSNMDYEDLKLDDKVPATTDNNLDMNNILENDESILDGLNMTLLDNGDHV
NEEFDVDSFLNQFGN
Edit:
The problem, as explained above, I face is that the output has additional '>' character for first sequence (only). I do not see the reason for the same and this is causing a lot of trouble. Output is:
Querying A0A1D8PC43
>A0A1D8PC43 A0A1D8PC43_CANAL Diphosphomevalonate decarboxylase
MYSASVTAPVNIATLKYWGKRDKSLNLPTNSSISVTLSQDDLRTLTTASASESFEKDQLW
LNGKLESLDTPRTQACLADLRKLRASIEQSPDTPKLSQMKLHIVSENNFPTAAGLASSAA
GFAALVSAIAKLYELPQDMSELSKIARKGSGSACRSLFGGFVAWEMGTLPDGQDSKAVEI
APLEHWPSLRAVILVVSDDKKDTPSTTGMQSTVATSDLFAHRIAEVVPQRFEAMKKAILD
KDFPKFAELTMKDSNSFHAVCLDSYPPIFYLNDTSKKIIKMVETINQQEVVAAYTFDAGP
NAVIYYDEANQDKVLSLLYKHFGHVPGWKTHYTAETPVAGVSRIIQTSIGPGPQETSESL
TK
Querying A0A1D8PC38
A0A1D8PC38 A0A1D8PC38_CANAL Cta2p OS=Candida albicans (strain
MPENLQTRLHNSLDEILKSSGYIFEVIDQNRKQSNVITSPNNELIQKSITQSLNGEIQNF
HAILDQTVSKLNDAEWCLGVMVEKKKKHDELKVKEEAARKKREEEAKKKEEEAKKKAEEA
KKKEEEAKKAEEAKKAEEAKKVEEAAKKAEEAKKAEEEARKKAETAPQKFDNFDDFIGFD
INDNTNDEDMLSNMDYEDLKLDDKVPATTDNNLDMNNILENDESILDGLNMTLLDNGDHV
NEEFDVDSFLNQFGN
$/ is the input record separator, setting local $/="\n>"; effect is that input is split into record ending with \n>, after chomp, the ending is removed however />*(\S+)/ may not match because > is consumed from previous record.
from FASTA wikipedia a line beginning by > is a comment and may not always be an id. However in case it is always the case, following may fix.
my ($id,$seq) = $seq =~ /^>*(.*)\n(\S+)/;
You set the record separator to \n>. This does not apply to the first sequence.
Fixed code sequence:
...
chomp $seq;
# for first sequence
$seq =~ s/^>//;
my ($id) = $seq =~ /^(\S+)/;
if ($id eq $lookup) {
...
Please note that your implementation is extremely inefficient, because it reads & parses the file contents for each query. How about splitting loading/parsing and querying into separate functions?
Alternative solution: give the full list of lookup values to the loader. It would then fill an answer array as it encounters the matches during reading the file.

Matching of data from output table

We need to match certain data element by element that is an output in tabular form obtained on the command prompt.The following is the approach being currently followed wherein the $Var contains the output. Is there an optimal way of doing this without directing the command output to file.
Please share your thoughts.
$Var = "iSCSI Storage LHN StgMgmt Name IP Name
==============================================================
0 Storage_1 15.178.209.194 admin
1 acct-mgmt 15.178.209.194 storage1
2 acct-mgmt2 15.178.209.194 storage2";
#tab = split("\n",$Var);
foreach (#tab) {
next if ($_ !~ /^\d/);
$_ =~ s/\s+//g;
$first=0 if($_ =~ /Storage/i && /15.178.209.194/);
push(#Array, $_); }
$_ =~ /Storage/i && /15.178.209.194/ is silly. That gets broken up like this: ($_ =~ /Storage/i) && (/15.178.209.194/). Either use $_ consistently or don't - the // and s/// operators automatically operate on $_.
Also you should know that in the regex /15.178.209.194/, the .s are being interpreted as any character. Either escape them or use the index() function.
Additionally, I would recommend that you separate each line using split(). This allows you to compare each individual column. You can use split() with a regex like so: #array = split(/\s+/, $string);.
Finally, I'm not really sure what $first is for, but I notice that all three sample lines in that input trigger $first=0 as they all contain that IP and the string "storage".
If I understand you correctly you want to invoke your script like this:
./some_shell_command | perl perl_script.pl
What you want to use is the Perl diamond operator <>:
#!/usr/bin/perl
use strict;
use warnings;
my $first;
my #Array;
for (<>) {
next unless /^\d/;
s/\s+/ /g;
$first = 0 if /Storage/i && /15.178.209.194/;
push(#Array, $_);
}
I've removed the redundant uses of $_ and fixed your substitution, since you probably don't want to remove all spaces.

Extracting specific lines with Perl

I am writing a perl program to extract lines that are in between the two patterns i am matching. for example the below text file has 6 lines. I am matching load balancer and end. I want to get the 4 lines that are in between.
**load balancer**
new
old
good
bad
**end**
My question is how do you extract lines in between load balancer and end into an array. Any help is greatly appreciated.
You can use the flip-flop operator to tell you when you are between the markers. It will also include the actual markers, so you'll need to except them from the data collection.
Note that this will mash together all the records if you have several, so if you do you need to store and reset #array somehow.
use strict;
use warnings;
my #array;
while (<DATA>) {
if (/^load balancer$/ .. /^end$/) {
push #array, $_ unless /^(load balancer|end)$/;
}
}
print #array;
__DATA__
load balancer
new
old
good
bad
end
You can use the flip-flop operator.
Additionally, you can also use the return value of the flipflop to filter out the boundary lines. The return value is a sequence number (starting with 1) and the last number has the string E0 appended to it.
# Define the marker regexes separately, cuz they're ugly and it's easier
# to read them outside the logic of the loop.
my $start_marker = qr{^ \s* \*\*load \s balancer\*\* \s* $}x;
my $end_marker = qr{^ \s* \*\*end\*\* \s* $}x;
while( <DATA> ) {
# False until the first regex is true.
# Then it's true until the second regex is true.
next unless my $range = /$start_marker/ .. /$end_marker/;
# Flip-flop likes to work with $_, but it's bad form to
# continue to use $_
my $line = $_;
print $line if $range !~ /^1$|E/;
}
__END__
foo
bar
**load balancer**
new
old
good
bad
**end**
baz
biff
Outputs:
new
old
good
bad
If you prefer a command line variation:
perl -ne 'print if m{\*load balancer\*}..m{\*end\*} and !m{\*load|\*end}' file
For files like this, I often use a change in the Record Separator ( $/ or $RS from English )
use English qw<$RS>;
local $RS = "\nend\n";
my $record = <$open_handle>;
When you chomp it, you get rid of that line.
chomp( $record );

Parsing a log file using perl

I have a log file where some of the entries look like this:
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC
and I'm trying to get it into a CSV format:
Date,Time,v1,v2,v3,v4,v5
YY/MM/DD,HH:MM:SS:MMM,XXX,YYY,ZZZ,AAA AND BBB,CCC
I'd like to do this in Perl - speaking personally, I could probably do it far quicker in other languages but I'd really like to expand my horizons a bit.
So far I can get as far as reading the file in and picking out only lines which meet my criteria but I can't seem to get the next stage done. I'll need to splice up the input line but so far I just can't work out how to do this. I've looked at s//and m// but they don't really give me what I want. If anyone can advise me how this can be done or give me pointers I'd much appreciate it.
Important points:
The values in the second part of the line are always in the same order so mapping / re-organising is not necesarily a problem.
Some of the fields have free text which is not quoted :( but as the labels all start v<number>= I'm hoping parsing this should still be a possibility.
Since there is no one delimiter, you'll need to try this a few different ways:
First, split on ' ', then take the first three values:
my #array = split / /, $line;
my ($date, $time, $constant) = splice #array, 0, 3;
Join the rest of the fields together again, and re-split on v\d+= to get the values:
my $rest = join ' ', #array;
# $rest should now be "v1=XXX v2=YYY ..."
my #values = split /\s*v\d+=/, $rest;
shift #values; # since the first element in #values will be empty
print join ',', $date, $time, #values;
Edit: Here's another approach that may be easier to follow, and is slightly more efficient. This takes advantage of the fact that your constant text occurs between the date/time and the value list.
# assume that CONSTANT is your constant text
my ($datetime, $valuelist) = split /\s*CONSTANT\s*/, $line;
my ($date, $time) = split / /, $datetime;
my #values = split /\s*v\d+=/, $valuelist;
shift #values;
print join ',', $date, $time, #values, "\n";
What have you tried with regular expressions and how has it failed? A regex with m// works fine for me:
#!/usr/bin/env perl
use strict;
use warnings;
print "Date,Time,v1,v2,v3,v4,v5\n";
while (my $line = <DATA>) {
my #matched = $line =~ m{^([^ ]+) ([^ ]+).*v1=(.*) v2=(.*) v3=(.*) v4=(.*) v5=(.*)};
print join(',', #matched), "\n";
}
__DATA__
YY/MM/DD HH:MM:SS:MMM <Some constant text> v1=XXX v2=YYY v3=ZZZ v4=AAA AND BBB v5=CCC
Two caveats:
1) v1 cannot contain the substring " v2=", v2 cannot contain " v3=", etc., but, with such a loose format, that's something that would likely cause problems for a human attempting to parse it, too.
2) This code assumes that there will always be v1 through v5. If there are fewer than five v*n* fields, the line will fail to match. If there are more, all additional fields will be appended to v5 (including their v*n* tags).
In case the log is fixed-width, you better off using unpack, you will see its benefits if the log grows very large (performance wise).