Find multiple substrings in strings and record location - perl

The following is the script for finding consecutive substrings in strings.
use strict;
use warnings;
my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");
#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";
my $sequence;
while (my $line = <DAT>) {
if ($line=~ /(HDWFLSFKD)/g){
{
print "its found index location: ",
pos($line), "-", pos($line)+length($1), "\n";
}
if ($line=~ /(HD)/g){
print "motif found and its locations is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
if ($line=~ /(K)/g){
print "motif found and its location is: \n";
pos($line), "-",pos($line)+length($1), "\n\n";
}
if ($line=~ /(DD)/g){
print "motif found and its location is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
}else {
$sequence .= $line;
print "came in else\n";
}
}
It matches substring1 against the string and prints the position where it matched. The problem lies in finding the rest of the substrings: for substring2 the search starts again from the beginning of the string instead of from the position where substring1 was found. Since the substrings are consecutive (substring1, substring2, substring3, substring4), each one must be found after the position of the previous one.

Try this perl program
use strict;
use warnings;
use feature qw'say';
my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");
my @regex = qw(
HDWFLSFKD
HD
K
DD
);
my $sequence;
while( my $line = <$dat> ){
chomp $line;
say 'Line: ', $.;
# reset the position of variable $line
# pos is an lvalue subroutine
pos $line = 0;
for my $regex ( @regex ){
$regex = quotemeta $regex;
if( scalar $line =~ / \G (.*?) ($regex) /xg ){
say $regex, ' found at location (', $-[2], '-', $+[2], ')';
if( $1 ){
say " but skipped: \"$1\" at location ($-[1]-$+[1])";
}
}else{
say 'Unable to find ', $regex;
# end loop
last;
}
}
}

I'm not a Perl expert, but you can use the @- and @+ arrays (read as $-[0] and $+[0]) to track the index location of the last regex match found.
Below is code built on top of your code that explains this.
use strict;
use warnings;
my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");
open (OUTPUTFILE, '>data.txt');
my $sequence;
my $someVar = 0;
my $sequenceNums = 1;
my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";
while (my $line = <DAT>)
{
$someVar = 0;
print "\nSequence $sequenceNums: $line\n";
print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
if ($line=~ /$motif1/g)
{
&printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
$someVar = 1;
}
if ($line=~ /$motif2/g and $someVar == 1)
{
&printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
$someVar = 2;
}
if ($line=~ /$motif3/g and $someVar == 2)
{
&printStuff($sequenceNums, "motif3", $motif3, "$-[0]-$+[0]");
$someVar = 3;
}
if ($line=~ /$motif4/g and $someVar == 3)
{
&printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
}
else
{
$sequence .= $line;
if ($someVar == 0)
{
&printWrongStuff($sequenceNums, "motif1", $motif1);
}
elsif ($someVar == 1)
{
&printWrongStuff($sequenceNums, "motif2", $motif2);
}
elsif ($someVar == 2)
{
&printWrongStuff($sequenceNums, "motif3", $motif3);
}
elsif ($someVar == 3)
{
&printWrongStuff($sequenceNums, "motif4", $motif4);
}
}
$sequenceNums++;
}
sub printStuff
{
print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] index location: $_[3]\n";
}
sub printWrongStuff
{
print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";
}
close (OUTPUTFILE);
close (DAT);
Sample input:
MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKHDWFLSFKDQNHSIKDIFNRFNHYIYNDL

You really should read
perldoc perlre
perldoc perlreref
perldoc perlretut
You need the special variables @- and @+ if you need the positions. No need to try to compute them yourself.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw( each_array );
my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );
if ( $source =~ /$pattern/ ) {
my $it = each_array @-, @+;
$it->(); # discard overall match information;
while ( my ($start, $end) = $it->() ) {
printf "Start: %d - Length: %d\n", $start, $end - $start;
}
}
Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5

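The result of a construct like
$line=~ /(HD)/g
in list context is a list of all matches; in scalar context (as in an if condition) it finds only the next match. Use while to step through the hits.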
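A minimal sketch of that loop, assuming the first line of the sample input above as the sequence; the helper name find_all is made up for illustration. In scalar context each evaluation of m//g resumes after the previous hit, and @- and @+ hold the start and end offsets of the last successful match.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end] offsets for every hit of $motif.
sub find_all {
    my ($sequence, $motif) = @_;
    my @hits;
    # In scalar context, m//g continues from the end of the previous match.
    while ($sequence =~ /\Q$motif\E/g) {
        push @hits, [ $-[0], $+[0] ];   # start offset, offset just past the end
    }
    return @hits;
}

my $sequence = 'MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA';
for my $hit (find_all($sequence, 'HD')) {
    print "HD found at $hit->[0]-$hit->[1]\n";
}
```

The offsets are zero-based, and the end offset points one past the last matched character.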

To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):
The "\G" assertion can be used to
chain global matches (using "m//g"),
as described in "Regexp Quote-Like
Operators" in perlop. It is also
useful when writing "lex"-like
scanners, when you have several
patterns that you want to match
against consequent substrings of your
string, see the previous reference.
The actual location where "\G" will
match can also be influenced by using
"pos()" as an lvalue: see "pos" in
perlfunc. Note that the rule for
zero-length matches is modified
somewhat, in that contents to the left
of "\G" is not counted when
determining the length of the match.
Thus the following will not match
forever:
$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
print $&;
}
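As a rough illustration of how pos() drives this, here is a sketch that relies only on the stored position: a scalar-context m//g starts searching at pos($sequence), so different patterns applied to the same string pick up where the previous one stopped. Putting \G in front of a pattern would additionally require the match to begin exactly at that position. The helper name find_consecutive is hypothetical, and the motifs are the ones from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: find each motif in order, each search starting
# where the previous hit ended.
sub find_consecutive {
    my ($sequence, @motifs) = @_;
    my @found;
    pos($sequence) = 0;                  # pos is an lvalue: start at the beginning
    for my $motif (@motifs) {
        # A failed match ends the chain, since later motifs must follow this one.
        last unless $sequence =~ /\Q$motif\E/g;
        push @found, [ $motif, $-[0], $+[0] ];
    }
    return @found;
}

my $sequence = 'MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA';
for my $hit (find_consecutive($sequence, qw(HDWFLSFKD HD K DD))) {
    print "$hit->[0] found at $hit->[1]-$hit->[2]\n";
}
```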

Related

How can I calculate the geometric center of a protein in Perl?

I have a PDB file which contains information about a specific protein. One of the information it holds is the positions of the different atoms (xyz coordinates).
The file is the following https://files.rcsb.org/view/6U9D.pdb . With this file I want to calculate the geometric center of the atoms. In theory I know what I need to do, but the script I wrote does not seem to work.
The first part of the program, before the for loop, is another part of the assignment which requires me to read the sequence and convert it from the 3 letter nomenclature to the 1 letter one. The part that interests me is the for loop until the end. I tried to pattern match in order to isolate the XYZ coordinates. Then I used a counter that I had set up in the beginning which is the $k variable. When I check the output on cygwin the only output I get is the sequence 0 0 0 instead of the sum of each dimension divided by $k. Any clue what has gone wrong?
$k=0;
open (IN, '6U9D.pdb.txt');
%amino_acid_conversion = (
ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C',
GLY=>'G',ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',
GLU=>'E',HIS=>'H',TRP=>'W',LYS=>'K',PHE=>'F',
PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'
);
while (<IN>) {
if ($_=~m/HEADER\s+(.*)/){
print ">$1\n"; }
if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){
$seq.=$1;
$seq=~s/ //g;
}
}
for ($i=0;$i<=length $seq; $i+=3) {
print "$amino_acid_conversion{substr($seq,$i,3)}";
if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) {
$x+=$1; $y+=$2; $z+=$3; $k++;
}
}
print "\n";
#print $k;
$xgk=($x/$k); $ygk=($y/$k); $zgk=($z/$k);
print "$xgk $ygk $zgk \n";
I do not know bioinformatics but it seems like you should do something like this:
use feature qw(say);
use strict;
use warnings;
my $fn = '6U9D.pdb';
open ( my $IN, '<', $fn ) or die "Could not open file '$fn': $!";
my $seq = '';
my $x = 0;
my $y = 0;
my $z = 0;
my $k = 0;
while (<$IN>) {
if ($_ =~ m/HEADER\s+(.*)/) {
say ">$1";
}
if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){
$seq .= $1;
}
if ($_ =~ m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) {
$x+=$1; $y+=$2; $z+=$3; $k++;
}
}
close $IN;
$seq =~ s/ //g;
my %amino_acid_conversion = (
ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C',
GLY=>'G',ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',
GLU=>'E',HIS=>'H',TRP=>'W',LYS=>'K',PHE=>'F',
PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'
);
my %unknown_keys;
my $conversion = '';
say "Sequence length: ", length $seq;
for (my $i=0; $i < length $seq; $i += 3) {
my $key = substr $seq, $i, 3;
if (exists $amino_acid_conversion{$key}) {
$conversion.= $amino_acid_conversion{$key};
}
else {
$unknown_keys{$key}++;
}
}
say "Conversion: $conversion";
say "Unknown keys: ", join ",", keys %unknown_keys;
say "Number of atoms: ", $k;
my $xgk=($x/$k);
my $ygk=($y/$k);
my $zgk=($z/$k);
say "Geometric center: $xgk $ygk $zgk";
This gives me the following output:
[...]
Number of atoms: 76015
Geometric center: 290.744642162734 69.196842162731 136.395842938893
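One possible variation, not part of the answer above: ATOM records in the PDB format are fixed-width, with x, y and z in columns 31-38, 39-46 and 47-54, so the coordinates can be cut out by position instead of by whitespace. The sketch below assumes that layout; the helper name geometric_center and the two atoms are made up for illustration and are not taken from 6U9D.

```perl
use strict;
use warnings;

# Hypothetical helper: average x, y, z over ATOM records by column position.
sub geometric_center {
    my (@records) = @_;
    my ($x, $y, $z, $count) = (0, 0, 0, 0);
    for my $record (@records) {
        next unless $record =~ /^ATOM/;
        $x += substr $record, 30, 8;    # columns 31-38
        $y += substr $record, 38, 8;    # columns 39-46
        $z += substr $record, 46, 8;    # columns 47-54
        $count++;
    }
    return unless $count;               # no atoms: avoid dividing by zero
    return ($x / $count, $y / $count, $z / $count);
}

# Build two made-up ATOM lines in the fixed-width layout.
my $layout = "%-6s%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f";
my @records = (
    sprintf($layout, 'ATOM', 1, ' N',  '', 'MET', 'A', 1, '', 1, 2, 3),
    sprintf($layout, 'ATOM', 2, ' CA', '', 'MET', 'A', 1, '', 3, 4, 5),
);
my @center = geometric_center(@records);
print "Geometric center: @center\n";
```

Reading by column keeps working when two fields touch without a space between them, which a \s+ based pattern cannot handle.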

Finding longest match between 2 files from pattern

I am having trouble implementing two files within this program. I am trying to access the contents of file $Q and $s.
print "Input the K value \n";
$k = <>;
chomp $k;
print "Input T\n";
$t = <>;
chomp $t;
%Qkmer = ();
$i = 1;
$query=' ';
while ($line=<IN>) {
chomp($line);
if ($line=~ m/^>/ ) {
next;
}
$query=$query.$line;
$line=~ s/(^|\n)[\n\s]*/$1/g;
while (length($line) >= $k) {
$line =~ m/(.{$k})/;
if (! defined $Qkmer{$1}) {#every key not defined as the first match
$Qkmer{$1} = $i;
}
$i++;
$line = substr($line, 1, length($line) -1);
}
}
open(MYDATA, '<', "data.txt");
while ($line=<MYDATA>) { \
chomp($line);
%Skmer = (); # This initializes the hash called Skmer.
$j = 1;
if ($line=~ m/^>/ ) { #if the line starts with >
next; #start on next line #separated characters
}
$line=~ s/^\s+|\s+$//g ; #remove all spaces from file
while (length($line) >= $k) {
$line =~ m/(.{$k})/;#match any k characters and only k characters in dna
$Skmer{$1} = $j; #set the key position to $j and increase for each new key
$j++;
$line = substr($line, 1, length($line) -1); #this removes the first character in the current string
}
###(56)###for($Skmerkey(keys %Skmer)){
$i=$Skmer{$Skmerkey};
if(defined $Qkmer($Skmerkey)){
$j=$Qkmer($Skmerkey);
}
$S1=$line;
$S2=$query;
@arrayS1= split(//, $S1);
@array2= split(//, $S2);
$l=0;
while($arrayS1[$i-$l] eq $arrayS2[$j-$l]){
$l++;
}
$start=$i-$l;
$m=0;
while ($arrayS1[$i+$k+$m] eq $arrayS2[$j+$k+$m]) {
$m++;
}
$length=$l+$k+$m;
$match= substr($S1, $start, $length);
if($length>$t){
$longest=length($match);
print "Longest: $match of length $longest \n";
}
}
}###(83)###
The input files contain only strings of letters. For example:
File 1:
ahhtsagnchjgstffhjyfcsghnvzfhg
File2:
ggujvfbgfgkjfcijjjffcvvafcsghnvzfhgvugxckugcbhfcgh
ghnvzfhgvugxckHhfgjgcfujvftjbvdtkhvddgjcdgjxdjkfrh
ajdbvciyqdanvkjghnvzfhgvugxc
From a match of a word of length $k from file 1 in file 2, I check to the left and to the right of that word in file 2 for further matches. The final output is the longest match between File 1 and File 2 based on $k. Now I ge
With this code, I get a syntax error and I am not sure why because it looks correct to me:
syntax error at testk.pl line 56, near "$Skmerkey("
syntax error at testk.pl line 83, near "}"
Thank you.
use strict; # <--- Always use this
use warnings; # <--- and this
use Data::Dumper;
my $k=3;
open(my $IN, '<', "File2"); # use $IN instead of deprecated IN
my $line=0; # line number
my %kmer; # hash of arrays of all $k-letter "words" line/position
my @Q; # rows of Q-file
while(<$IN>) {
chomp;
next if /^>/;
s/^\s+|\s+$//g;
next if !$_;
my $pos=0;
push @Q, $_; # store source row
for(/(?=(.{$k}))/g) { # Capture $k letters. floating window with step 1 symbol
push @{$kmer{$_}}, [$line,$pos]; # store row number and position of "word"
$pos++;
}
$line++;
}
open($IN, '<', "File1");
$line=0;
while(<$IN>) { # Read S-file
chomp;
next if /^>/;
s/^\s+|\s+$//g;
next if !$_;
my $pos=0;
my $len=length($_); # length of row of S-file
my $s=$_; # Current row of S-file
my @ignore=(); # array to store information about match tails
for(/(?=(.{$k}))/g) {
next if ! $kmer{$_}; # "word" not found, try the next one
for(@{$kmer{$_}}) { # $kmer{word} contains array of lines/positions in Q
my($qline, $qpos)=@{$_};
# print "test $qline:$qpos ";
if( grep {$_->[0]==$qline && $_->[1]==$qpos } @ignore ) {
# this line/position already tested and included in found matching
# print "Ignore match tail $qline:$qpos\n";
next;
}
my $j=$k; # $k letters same, test after this point
my $qlen=length($Q[$qline]);
$j++ while( $pos+$j<$len && $qpos+$j<$qlen &&
substr($s,$pos+$j,1) eq substr($Q[$qline],$qpos+$j,1) );
print "MATCH FOUND: S-file line $line pos $pos, Q-file line $qline pos $qpos: ",
substr($s,$pos,$j),"\n";
push @ignore, [$qline, $qpos, $j]; # store positions and length of match
}
} continue { # Continue block works on all loops, include after "next"
$pos++;
@ignore=grep { # recalculate/filter position and length of all match tails
++$_->[1]; # increment position
(--$_->[2]) # decrement length
>= $k # and filter out lengths < $k
} @ignore;
# print Dumper(\@ignore);
}
$line++;
}
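The sliding window in the code above comes from the zero-width lookahead: the pattern captures $k characters without consuming them, so the next match starts one character further on. A minimal sketch of just that step, with a hypothetical helper name kmer_positions and File 1 from the question as input:

```perl
use strict;
use warnings;

# Hypothetical helper: map each k-letter word to the offsets where it starts.
sub kmer_positions {
    my ($text, $k) = @_;
    my %positions;
    # The lookahead consumes nothing, so successive matches overlap.
    while ($text =~ /(?=(.{$k}))/g) {
        push @{ $positions{$1} }, $-[0];   # $-[0] is where this window starts
    }
    return \%positions;
}

my $index = kmer_positions('ahhtsagnchjgstffhjyfcsghnvzfhg', 3);
print "fhg starts at: @{ $index->{fhg} }\n";
```

Storing every offset per word, rather than only the first or last, is what lets the answer test each candidate position in the other file.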

How to identify nth lines of n files in while<>

I have a code which adds all vectors in all files.
There can be any number of input files. For example first input file is:
0.55 0 0.3335 1.2
0.212 0 2.2025 1
and the second one is:
0.25 0 0.3333 1.0
0.1235 0 0.2454 1
What I get is the sum of all vectors, thus in the result I get one vector
which is:
1.13550 0 3.1147 4.2
But I'm trying to sum the first vector of the first file with the first vector of the second file and so on. In result according to this example I should get 2 vectors.
For now I have this:
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my @sum = 0;
my $dim = 0;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my @vectors = split(" ", $_);
my $vector_length = @vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum[$i] += $vectors[$i];
}
}
$" = "\t\t";
print "\n --- \n @sum \n";
What I need is just to find out how to identify each file's nth line
and to sum the column values of those lines while keeping in mind, that there can be n number of files.
I saw filehandling question over here with similar issue, however
I didn't find my answer there.
Just looking for some suggestions and guidance. Got stuck on this.
Open each file yourself and use the $. variable to know which line you are on (or count the files yourself). Here's the basic structure:
foreach my $file ( @files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $. ] = ...; # $. is the line number
}
}
If you don't like $., you can use its longer name. You have to turn on English (which comes with Perl):
use English;
## use English qw( -no_match_vars ); # for v5.16 and earlier
foreach my $file ( @files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $INPUT_LINE_NUMBER ] = ...;
}
}
Or, you can count yourself, which might be handy if the vectors in the files don't line up by strict line number (perhaps because of comments or some other formatting oddity):
foreach my $file ( @files ) {
open my $fh, '<', $file or die ...;
my $line = -1;
while( <$fh> ) {
$line++;
chomp;
$sum[ $line ] = ...;
}
}
The harder way is the answer bart gives which inspects eof at the end of every line to see if the magical ARGV handle is looking at a new file, and resetting $. if it is. It's an interesting trick but hardly anyone is going to understand what it's doing (or even notice it).
For the other part of the problem, I think you're doing the vector sum wrong, or using confusing variable names. A line is a vector, and the numbers in the line are its components. A two-dimensional array will work. The first index is the line number and the second is the component index:
while( <$fh> ) {
chomp;
... skip unwanted lines
my @components = split;
... various dimension checks
foreach my $i ( 0 .. $#components ) {
$sum[ $. ][ $i ] += $components[ $i ];
}
}
The Data::Dumper module is handy for complex data structures. You can also see the perldsc (Perl Data Structures Cookbook) documentation. The $. variable is found in perlvar.
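Putting those pieces together, a self-contained sketch might look like the following. It counts vectors itself rather than using $., and it reads from in-memory strings (open accepts a reference to a scalar) so that it runs without real files. The helper name sum_by_line is made up, and the data is the example from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: add up the vectors that share a position across inputs.
# Each source is anything open() accepts; here, references to strings.
sub sum_by_line {
    my (@sources) = @_;
    my @sum;                                # $sum[$line][$component]
    for my $source (@sources) {
        open my $fh, '<', $source or die "Cannot open input: $!";
        my $line = -1;                      # count the vectors ourselves
        while (<$fh>) {
            next if /^\s*$/;                # skip blank lines
            next if /^\s*#/;                # skip comment lines
            $line++;
            my @components = split;
            $sum[$line][$_] += $components[$_] for 0 .. $#components;
        }
        close $fh;
    }
    return @sum;
}

my $first  = "0.55 0 0.3335 1.2\n0.212 0 2.2025 1\n";
my $second = "0.25 0 0.3333 1.0\n0.1235 0 0.2454 1\n";
print "@$_\n" for sum_by_line(\$first, \$second);
```

With real files, passing the file names instead of the string references should work the same way.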
$. is the line number of the most recently read file handle. close(ARGV) if eof; can be used to reset the line number between files (as documented in eof). (Note: eof() is different from eof.) So you now have line numbers.
The second problem you have is that you are adding vector components ($vectors[$i]) to vectors ($sum[$i]). You need to add vector components to vector components. Start by using more appropriate variable names.
This is what we get:
my @sum_vectors;
while (<>) {
s/#.*//; # Remove comments.
next if /^\s*$/; # Ignore blank lines.
my @vector = split;
if ($sum_vectors[$.] && @{ $sum_vectors[$.] } != @vector) {
die("$ARGV:$.: Vector dimensions do not match\n");
}
for my $i (0..$#vector) {
$sum_vectors[$.][$i] += $vector[$i];
}
} continue {
close(ARGV) if eof; # Reset line numbers for each file.
}
Two other errors fixed:
$! did not contain anything meaningful when you used it.
You ignored lines that contain comments, even if they contained valid data too.
Try this:
#!/usr/bin/perl
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my %sum;
my $dim = 0;
my $vector_length;
my $line_number;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my @vectors = split(" ", $_);
$vector_length = @vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum{$.}{$i} += $vectors[$i];
}
$line_number = $.;
$. = 0 if eof;
}
$" = "\t\t";
for (my $line=1; $line<=$line_number; $line++)
{
print $line;
for (my $vector=0; $vector<$vector_length; $vector++)
{
print " " . $sum{$line}{$vector};
}
print "\n";
}

Perl compare individual elements of two arrays

I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both files, and only if they are equal, column B of FILE 2 must be printed.
So far I have the following code but all I get is an infinite loop with # from column B
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my @firstcol;
my @secondcol;
my @thirdcol;
while ( <FILE> )
{
next if $. <2;
chomp;
my @cols = split;
push @firstcol, $cols[0];
push @secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}
my @firstcol2;
my @secondcol2;
my @thirdcol2;
while ( <FILE2> )
{
next if $. <2;
my @cols2 = split(/\t/, $_);
push @firstcol2, $cols2[0];
push @secondcol2, $cols2[4]."\n";
}
my $size = @firstcol;
my $size2 = @firstcol2;
for (my $i = 0; $i <= @firstcol ; $i++) {
for (my $j = 0; $j <= @firstcol2; $j++) {
if ( $firstcol[$i] eq $firstcol2[$j] )
{
print $secondcol2[$i];
}
}
}
my (@first, @second);
while(<first>){
chomp;
# split in list context; in scalar context it returns only the field count
push @first, (split ' ', $_)[0, 1];
}
while(<second>){
chomp;
push @second, (split ' ', $_)[0, 1];
}
my %first = @first;
my %second = @second;
Build a hash of the first file as %first and second file as %second with first column as key and second column as value.
for(keys %first)
{
print $second{$_} if exists $second{$_}
}
I couldn't check it as I am on mobile. hope that gives you an idea.
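A runnable sketch of the same idea, under the assumption that each file is a header row followed by whitespace-separated "key value" rows. The helper name column_map and the sample rows are invented for illustration; letters stand in for the symbols in column B.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "key value" rows into a hash, skipping the header.
sub column_map {
    my ($text) = @_;
    my %map;
    my @rows = split /\n/, $text;
    shift @rows;                                  # drop the "A B" header row
    for my $row (@rows) {
        my ($key, $value) = split ' ', $row;
        $map{$key} = $value if defined $value;    # ignore incomplete rows
    }
    return %map;
}

my %first  = column_map("A B\n1 a\n2 b\n3 c\n");
my %second = column_map("A B\n3 x\n4 y\n2 z\n");

# Print column B of the second file for every key also present in the first.
for my $key (sort keys %first) {
    print "$key $second{$key}\n" if exists $second{$key};
}
```

Because the lookup is by key, the rows do not have to appear in the same order in both files, and no nested loop is needed.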
I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have a nested loop that you don't need. Simplify your last for loop as such:
for my $i (0..$#firstcol) {
if ( $firstcol[$i] eq $firstcol2[$i] )
{
print $secondcol2[$i];
}
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional Note: You aren't chomping your data in the second file loop while ( <FILE2> ). That might introduce a bug later.
If your files are called file1.txt and file2.txt, the following:
use Modern::Perl;
use Path::Class;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);
for my $line1 (@{$files->{1}}) {
my $line2 = shift @{$files->{2}};
say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
Only the lines A and 5 are equal in column 1.
without the CPAN modules - produces the same result
use strict;
use warnings;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(@ARGV)="file$_.txt";<> } for (1..2);
for my $line1 (@{$files->{1}}) {
my $line2 = shift @{$files->{2}};
print $line2->[1],"\n" if ($line1->[0] eq $line2->[0]);
}

How can I skip some block content while reading in Perl

I plan to skip the block whose start line includes "MaterializeU4()" with the subroutine read_block below, but it fails.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing the lines in the block.
sub read_block {
my $fh = shift;
my @lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push @lines, $line;
}
}
return \@lines if @lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I printed the value of index($line, "MaterializeU4"); it outputs 25.
Then I updated the code as below:
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25) );
Now it works.
Any comments are welcome about my practice.
Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push @lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
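To see the operator at work without a file, here is a small sketch on an in-memory list. The helper name keep_lines is hypothetical, the rows are shortened versions of the data in the question, and the left pattern is simplified to /MaterializeU4/.

```perl
use strict;
use warnings;

# Hypothetical helper: drop every block whose first line mentions MaterializeU4.
sub keep_lines {
    my (@input) = @_;
    my @kept;
    for (@input) {
        # True from a line matching the left pattern through the next blank line.
        next if /MaterializeU4/ .. /^\s*$/;
        push @kept, $_;
    }
    return @kept;
}

my @input = (
    'status DynTest = <dynamic 100>', 'vid = 10002', '',
    'status VIDNAME9000 = <U4 MaterializeU4()>', 'vid = 9000', '',
    'status DynTest = <U1 100>', 'vid = 100',
);
print "$_\n" for keep_lines(@input);
```

Note that the blank line closing the skipped block is dropped along with it, since the range is still true on the line that ends it.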
First, using a regex instead of index is probably better, since you can tune it to the exact format of the status string if you decide to be stricter than just "substring exists".
I would suggest as one solution adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (MaterializeU4) block!!!
# Otherwise return an array ref containing the lines in the block.
sub read_block {
my $fh = shift;
my @lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
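if ($line =~ /^status/) {
$block_started = 1;
# A lazy .*? followed by an optional group always matches the empty string
# here, so test for MaterializeU4 separately.
$block_ignore = 1 if $line =~ /MaterializeU4/;
}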
last if $line =~ /^\s*$/ && $block_started;
push @lines, $line unless $block_ignore;
}
return \@lines if @lines || $block_started;
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my @all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (@all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my @lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift @all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push @lines, $line unless $block_ignore;
}
return \@lines if @lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];
On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
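A short sketch of those return values, using the MaterializeU4 line from the question; the helper name contains is made up for illustration.

```perl
use strict;
use warnings;

my $line = 'status VIDNAME9000 = <U4 MaterializeU4()>';

print index($line, 'MaterializeU4'), "\n";   # 25 for this line
print index($line, 'status'), "\n";          # 0: found at the very start
print index($line, 'missing'), "\n";         # -1: not found

# So "contains the substring" is a test against -1, not against 0.
sub contains {
    my ($text, $needle) = @_;
    return index($text, $needle) != -1;
}

print "block to skip\n" if contains($line, 'MaterializeU4');
```

Comparing against a fixed offset such as 25 only works while every status line has exactly the same layout, which is why a test against -1 (or a regex) is safer.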
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line !~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.