Finding longest match between 2 files from pattern - perl

I am having trouble implementing two files within this program. I am trying to to access the contents of file $Q and $s.
print "Input the K value \n";
$k = <>;
chomp $k;
print "Input T\n";
$t = <>;
chomp $t;
%Qkmer = ();
$i = 1;
$query=' ';
while ($line=<IN>) {
chomp($line);
if ($line=~ m/^>/ ) {
next;
}
$query=$query.$line;
$line=~ s/(^|\n)[\n\s]*/$1/g;
while (length($line) >= $k) {
$line =~ m/(.{$k})/;
if (! defined $Qkmer{$1}) {#every key not deined as the first match
$Qkmer{$1} = $i;
}
$i++;
$line = substr($line, 1, length($line) -1);
}
}
open(MYDATA, '<', "data.txt");
while ($line=<MYDATA>) { \
chomp($line);
%Skmer = (); # This initializes the hash called Skmer.
$j = 1;
if ($line=~ m/^>/ ) { #if the line starts with >
next; #start on next line #separated characters
}
$line=~ s/^\s+|\s+$//g ; #remove all spaces from file
while (length($line) >= $k) {
$line =~ m/(.{$k})/;#match any k characters and only k characters in dna
$Skmer{$1} = $j; #set the key position to $j and increase for each new key
$j++;
$line = substr($line, 1, length($line) -1); #this removes the first character in the current string
}
###(56)###for($Skmerkey(keys %Skmer)){
$i=$Skmer{$Skmerkey};
if(defined $Qkmer($Skmerkey)){
$j=$Qkmer($Skmerkey);
}
$S1=$line;
$S2=$query;
#arrayS1= split(//, $S1);
#array2= split(//, $S2);
$l=0;
while($arrayS1[$i-$l] eq $arrayS2[$j-$l]){
$l++;
}
$start=$i-$l;
$m=0;
while ($arrayS1[$i+$k+$m] eq $arrayS2[$j+$k+$m]) {
$m++;
}
$length=$l+$k+$m;
$match= substr($S1, $start, $length);
if($length>$t){
$longest=length($match);
print "Longest: $match of length $longest \n";
}
}
}###(83)###
The input files contain only strings of letters. For example:
File 1:
ahhtsagnchjgstffhjyfcsghnvzfhg
File2:
ggujvfbgfgkjfcijjjffcvvafcsghnvzfhgvugxckugcbhfcgh
ghnvzfhgvugxckHhfgjgcfujvftjbvdtkhvddgjcdgjxdjkfrh
ajdbvciyqdanvkjghnvzfhgvugxc
From a match of a word of length$k in file 1 in file 2, I check from that match in file 2 to left and to right of word for further matches. The final output is the longest match between File 1 and File 2 based on $k. Now I ge
With this code, I get a syntax error and I am not suer why because it looks correct to me:
syntax error at testk.pl line 56, near "$Skmerkey("
syntax error at testk.pl line 83, near "}"
Thank you.

use strict; # <--- Allways use this
use warnings; # <--- and this
use Data::Dumper;
my $k=3;
open(my $IN, '<', "File2"); # use $IN instead of depricated IN
my $line=0; # line number
my %kmer; # hash of arrays of all $k-letter "words" line/position
my #Q; # rows of Q-file
while(<$IN>) {
chomp;
next if /^>/;
s/^\s+|\s+$//g;
next if !$_;
my $pos=0;
push #Q, $_; # store source row
for(/(?=(.{$k}))/g) { # Capture $k letters. floating window with step 1 symbol
push #{$kmer{$_}}, [$line,$pos]; # store row number and position of "word"
$pos++;
}
$line++;
}
open($IN, '<', "File1");
$line=0;
while(<$IN>) { # Read S-file
chomp;
next if /^>/;
s/^\s+|\s+$//g;
next if !$_;
my $pos=0;
my $len=length($_); # length of row of S-file
my $s=$_; # Current row of S-file
my #ignore=(); # array for store information about match tails
for(/(?=(.{$k}))/g) {
next if ! $kmer{$_}; # "word" not found try to next
for(#{$kmer{$_}}) { # $kmer{word} contains array of lines/positions in Q
my($qline, $qpos)=#{$_};
# print "test $qline:$qpos ";
if( grep {$_->[0]==$qline && $_->[1]==$qpos } #ignore ) {
# this line/position already tested and included in found matching
# print "Ignore match tail $qline:$qpos\n";
next;
}
my $j=$k; # $k letters same, test after this point
my $qlen=length($Q[$qline]);
$j++ while( $pos+$j<$len && $qpos+$j<$qlen &&
substr($s,$pos+$j,1) eq substr($Q[$qline],$qpos+$j,1) );
print "MATCH FOUND: S-file line $line pos $pos, Q-file line $qline pos $qpos: ",
substr($s,$pos,$j),"\n";
push #ignore, [$qline, $qpos, $j]; # store positions and length of match
}
} continue { # Continue block works on all loops, include after "next"
$pos++;
#ignore=grep { # recalculate/filter position and length of all match tails
++$_->[1]; # increment position
(--$_->[2]) # decrement length
>= $k # and filter out lengths < $k
} #ignore;
# print Dumper(\#ignore);
}
$line++;
}

Related

Perl program to look for k-mer with specific sequence

I am trying to enhance a perl program I have previously written so that it recognizes top 1000 length 23 k-mers that ends with GG and print out the k-mers that only appears once in the sequence. However, no matter where I add the reg exp, I am unable to get the expected result.
The code I have:
#!/usr/bin/perl
use strict;
use warnings;
my $k = 23;
my $input = 'Fasta.fasta';
my $output = 'Fasta2.fasta';
my $match_count = 0;
#Open File
unless ( open( FASTA, "<", $input ) ) {
die "Unable to open fasta file", $!;
}
#Unwraps the FASTA format file
$/ = ">";
#Separate header and sequence
#Remove spaces
unless ( open( OUTPUT, ">", $output ) ) {
die "Unable to open file", $!;
}
<FASTA>; # discard 'first' 'empty' record
my %seen;
while ( my $line = <FASTA> ) {
chomp $line;
my ( $header, #seq ) = split( /\n/, $line );
my $sequence = join '', #seq;
for ( length($sequence) >= $k ) {
$sequence =~ m/([ACTG]{21}[G]{2})/g;
for my $i ( 0 .. length($sequence) - $k ) {
my $kmer = substr( $sequence, $i, $k );
##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
$match_count = $match_count + 1;
print OUTPUT ">crispr_$match_count", "\n", "$kmer", "\n" unless $seen{$kmer}++;
}
}
}
The input fasta file looks like this:
> >2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ:NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=23011544; release=r5.54; species=Dmel;
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT
TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT
GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA
TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA
TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG
CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT
TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT
TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA
ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG
CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG
AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG
GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA
TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA
GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA
TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG
GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC
CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT
TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT
and so on...
The expected outcome (print out the 23k-mers with GG ending that only appear once in the sequence) I am hoping to get:
>crispr_1
GGGTGGAGCTCCCGAAATGCAGG
>crispr_2
TTAATAAATATTGACACAGCGGG
>crispr_3
ATCGTGGGGCGTTTTGTGAAAGG
>crispr_4
AGTTTTTCACATAATCAGACAGG
>crispr_5
GTGTTGGATGAGTGTCCTCTGGG
>crispr_6
ATAGGTTGGTTGTTTTAAAAGGG
>crispr_7
AAATTTTTGTTGCCACTGAATGG
>crispr_8
AAGTTTCGAACTACGATGGTTGG
>crispr_9
CATGCTTTGTGGAAATAAGTCGG
>crispr_10
CACAGTGGGTGTTTGCACCTCGG
.... and so on
The current code I did create a fasta file with following:
>crispr_1
CGACAATGCACGACAGAGGAAGC
>crispr_2
GACAATGCACGACAGAGGAAGCA
>crispr_3
ACAATGCACGACAGAGGAAGCAG
>crispr_4
CAATGCACGACAGAGGAAGCAGA
>crispr_5
AATGCACGACAGAGGAAGCAGAA
>crispr_6
ATGCACGACAGAGGAAGCAGAAC
>crispr_7
TGCACGACAGAGGAAGCAGAACA
>crispr_8
GCACGACAGAGGAAGCAGAACAG
>crispr_9
CACGACAGAGGAAGCAGAACAGA
>crispr_10
ACGACAGAGGAAGCAGAACAGAT
.... and so on
while if I remove the
for (length($sequence) >=$k){
$sequence =~m/([ACTG]{21}[G]{2})/g;
and add the ##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
I am getting fasta file (with results which is not numbered correctly and unable to distinguish between duplicated and unique sequences):
>crispr_1
CATTTTCTCTCCCATATTATAGG
>crispr_2
ATTTTCTCTCCCATATTATAGGG
>crispr_3
TATTGTGCTCTTTGATTTTTTGG
>crispr_4
GATTTTTTGGCAACCCAAAATGG
>crispr_5
TTTTTGGCAACCCAAAATGGTGG
>crispr_6
TTGGCAACCCAAAATGGTGGCGG
>crispr_7
ACGACAGAGAGAGAGAGCAGCGG
>crispr_8
AAATCGACAATGCACGACAGAGG
>crispr_91
TATTGTGATCTTCGATTTTTTGG
>crispr_93
TTTTTGGCAACCCAAAATGGAGG
.... and so on
I have attempted to move the regex around my code, but none of them generated the expected result. I do not know what I did wrong over here. I have not add the exit the program when count reaches 1000 into the code yet.
Thanks in advance!
I'm not sure I understand your question completely, but could the following be what you need.
<FASTA>; # discard 'first' 'empty' record
my %data;
while (my $line = <FASTA>){
chomp $line;
my($header, #seq) = split(/\n/, $line);
my $sequence = join '', #seq;
for my $i (0 .. length($sequence) - $k) {
my $kmer = substr($sequence, $i, $k);
$data{$kmer}++ if $kmer =~ /GG$/;
}
}
my $i = 0;
for my $kmer (sort {$data{$b} <=> $data{$a}} keys %data) {
printf "crispr_%d\n%s appears %d times\n", ++$i, $kmer, $data{$kmer};
last if $i == 1000;
}
Some output on a file I have is:
crispr_1
ggttttccggcacccgggcctgg appears 4 times
crispr_2
ccgagctgggcgagaagtagggg appears 4 times
crispr_3
gccgagctgggcgagaagtaggg appears 4 times
crispr_4
gcacccgggcctgggtggcaggg appears 4 times
crispr_5
agcagcgggatcgggttttccgg appears 4 times
crispr_6
gctgggcgagaagtaggggaggg appears 4 times
crispr_7
cccttctgcttcagtgtgaaagg appears 4 times
crispr_8
gtggcagggaagaatgtgccggg appears 4 times
crispr_9
gatcgggttttccggcacccggg appears 4 times
crispr_10
tgagggaaagtgctgctgctggg appears 4 times
crispr_11
agctgggcgagaagtaggggagg appears 4 times
. . . .
ggcacccgggcctgggtggcagg appears 4 times
crispr_50
gaatctctttactgcctggctgg appears 4 times
crispr_51
accacaacattgacagttggtgg appears 2 times
crispr_52
caacattgacagttggtggaggg appears 2 times
crispr_53
catgctcatcgtatctgtgttgg appears 2 times
crispr_54
gattaatgaagtggttattttgg appears 2 times
crispr_55
gaaaccacaacattgacagttgg appears 2 times
crispr_56
aacattgacagttggtggagggg appears 2 times
crispr_57
gacttgatcgattaatgaagtgg appears 2 times
crispr_58
acaacattgacagttggtggagg appears 2 times
crispr_59
gaaccatatattgttatcactgg appears 2 times
crispr_60
ccacagcgcccacttcaaggtgg appears 1 times
crispr_61
ctgctcctgggtgtgagcagagg appears 1 times
crispr_62
ccatatattatctgtggtttcgg appears 1 times
. . . .
Update
To get the results you mentioned in your comment (below), replace the output code with:
my $i = 1;
while (my ($kmer, $count) = each %data) {
next unless $count == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}
To answer my own comment to get first 1000.
<FASTA>; # discard 'first' 'empty' record
my %seen;
my #kmers;
while (my $line = <FASTA>){
chomp $line;
my($header, #seq) = split(/\n/, $line);
my $sequence = join '', #seq;
for my $i (0 .. length($sequence) - $k) {
my $kmer = substr($sequence, $i, $k);
if ($kmer =~ /GG$/) {
push #kmers, $kmer unless $seen{$kmer}++;
}
}
}
my $i = 1;
for my $kmer (#kmers) {
next unless $seen{$kmer} == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}
Answer To check for uniqueness of final 12 chars ending in GG, the code below achieves that.
if ($kmer =~ /(.{10}GG)$/) {
my $substr = $1;
push #kmers, $kmer unless $seen{$substr}++;
}
my $i = 1;
for my $kmer (#kmers) {
my $substr = substr $kmer, -12;
next unless $seen{$substr} == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}
Actually, this code line
$sequence =~m/([ACTG]{21}[G]{2})/g;
this line is just for the regex match, if you try to print this $sequence, it will surely print out the original result.
please add the code segement like this
if($sequence =~/([ACTG]{21}[G]{2}$)/g)
{
}#please remember to match the end with $.
BTW,It looks like the multiple for loop to parse this data is not very reasonable, the parse speed is without the best-efficiency.

How to identify nth lines of n files in while<>

I have a code which adds all vectors in all files.
There can be any number of input files. For example first input file is:
0.55 0 0.3335 1.2
0.212 0 2.2025 1
and the second one is:
0.25 0 0.3333 1.0
0.1235 0 0.2454 1
What I get is the sum of all vectors, thus in result i get one vector
which is:
1.13550 0 3.1147 4.2
But I'm trying to sum the first vector of the first file with the first vector of the second file and so on. In result according to this example I should get 2 vectors.
For now I have this:
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my #sum = 0;
my $dim = 0;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my #vectors = split(" ", $_);
my $vector_length = #vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum[$i] += $vectors[$i];
}
}
$" = "\t\t";
print "\n --- \n #sum \n";
What I need is just to find out how to identify each file's nth line
and to sum the column values of those lines while keeping in mind, that there can be n number of files.
I saw filehandling question over here with similar issue, however
I didn't find my answer there.
Just looking for some suggestions and guidance. Got stuck on this.
Open each file yourself and use the $. variable to know which line you are on (or count the files yourself). Here's the basic structure:
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $. ] = ...; # $. is the line number
}
}
If you don't like $., you can use its longer name. You have to turn on English (which comes with Perl):
use English;
## use English qw( -no_match_vars ); # for v5.16 and earlier
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $INPUT_LINE_NUMBER ] = ...;
}
}
Or, you can count yourself, which might be handy if the vectors in the files don't line up by strict line number (perhaps because of comments or some other formatting oddity):
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
my $line = -1;
while( <$fh> ) {
$line++;
chomp;
$sum[ $line ] = ...;
}
}
The harder way is the answer bart gives which inspects eof at the end of every line to see if the magical ARGV handle is looking at a new file, and resetting $. if it is. It's an interesting trick but hardly anyone is going to understand what it's doing (or even notice it).
For the other part of the problem, I think you're doing the vector sum wrong, or using confusing variable names. A line is a vector, and the numbers in the lines are a component. A two dimensional array will work. The first index is the line number and the second in the component index:
while( <$fh> ) {
chomp;
... skip unwanted lines
my #components = split;
... various dimension checks
foreach my $i ( 0 .. $#components ) {
$sum[ $. ][ $i ] += $components[ $i ];
}
}
The Data::Dumper module is handy for complex data structures. You can also see the perldsc (Perl Data Structures Cookbook) documentation. The $. variable is found in perlvar.
$. is the line number of the most recently read file handle. close(ARGV) if eof; can be used to reset the file number between files (as documented in eof). (Note: eof() is different than eof.) So you now have line numbers.
The second problem you have is that you are adding vector components ($vectors[$i]) to a vectors ($sum[$i]). You need to add vector components to vectors components. Start by using more appropriate variable names.
This is what we get:
my #sum_vectors;
while (<>) {
s/#.*//; # Remove comments.
next if /^\s*$/; # Ignore blank lines.
my #vector = split;
if ($sum_vectors[$.] && #{ $sum_vectors[$.] } != #vector) {
die("$ARGV:$.: Vector dimensions do not match\n");
}
for my $i (0..$#vector) {
$sum_vectors[$.][$i] += $vector[$i];
}
} continue {
close(ARGV) if eof; # Reset line numbers for each file.
}
Two other errors fixed:
$! did not contain anything meaningful when you used it.
You ignored lines that contain comments, even if they contained valid data too.
Try this:
#!/usr/bin/perl
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my %sum;
my $dim = 0;
my $vector_length;
my $line_number;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my #vectors = split(" ", $_);
$vector_length = #vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum{$.}{$i} += $vectors[$i];
}
$line_number = $.;
$. = 0 if eof;
}
$" = "\t\t";
for (my $line=1; $line<=$line_number; $line++)
{
print $line;
for (my $vector=0; $vector<$vector_length; $vector++)
{
print " " . $sum{$line}{$vector};
}
print "\n";
}

Perl compare individual elements of two arrays

I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both both files, and only if they are equal, column B of FIlE 2 must be printed
So far I have the following code but all I get is an infinite loop with # from column B
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my #firstcol;
my #secondcol;
my #thirdcol;
while ( <FILE> )
{
next if $. <2;
chomp;
my #cols = split;
push #firstcol, $cols[0];
push #secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}
my #firstcol2;
my #secondcol2;
my #thirdcol2;
while ( <FILE2> )
{
next if $. <2;
my #cols2 = split(/\t/, $_);
push #firstcol2, $cols2[0];
push #secondcol2, $cols2[4]."\n";
}
my $size = #firstcol;
my $size2 = #firstcol2;
for (my $i = 0; $i <= #firstcol ; $i++) {
for (my $j = 0; $j <= #firstcol2; $j++) {
if ( $firstcol[$i] eq $firstcol2[$j] )
{
print $secondcol2[$i];
}
}
}
my (#first, #second);
while(<first>){
chomp;
my $foo = split / /, $_;
push #first , $foo;
}
while(<second>){
chomp;
my $bar = split / / , $_;
push #second, $bar;
}
my %first = #first;
my %second = #second;
Build a hash of the first file as %first and second file as %second with first column as key and second column as value.
for(keys %first)
{
print $second{$_} if exists $second{$_}
}
I couldn't check it as I am on mobile. hope that gives you an idea.
I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have nested loop that you don't need. Simplify your last while as such:
for my $i (0..$#firstcol) {
if ( $firstcol[$i] eq $firstcol2[$i] )
{
print $secondcol2[$i];
}
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional Note: You aren't chomping your data in the second file loop while ( <FILE2> ). That might introduce a bug later.
If your files are called file1.txt and file2.txt the next:
use Modern::Perl;
use Path::Class;
my $files;
#{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);
for my $line1 (#{$files->{1}}) {
my $line2 = shift #{$files->{2}};
say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
equals in column1 only the lines A and 5
without the CPAN modules - produces the same result
use strict;
use warnings;
my $files;
#{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(#ARGV)="file$_.txt";<> } for (1..2);
for my $line1 (#{$files->{1}}) {
my $line2 = shift #{$files->{2}};
print $line2->[1],"\n" if ($line1->[0] eq $line2->[0]);
}

How can I skip some block content while reading in Perl

I plan to skip the block content which include the start line of "MaterializeU4()" with the subroutin() read_block below. But failed.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push #lines, $line;
}
}
return \#lines if #lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I print the value of index($line, "MaterializeU4"), it output 25.
Then I updated the code as below
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25)
Now it works.
Any comments are welcome about my practice.
Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push #lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
First, using a regex instead of index is probably better since you can tune it to the exact format of status string if you may decide to be stricter than just "substring exists"
I would suggest as one solution adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (Materialize4U) block!!!
# Otherwise return an array ref containing lines to in the block.
sub read_block {
my $fh = shift;
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
if ($line =~ /^status.*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my #all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (#all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my #lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift #all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push #lines, $line unless $block_ignore;
}
return \#lines if #lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];
On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line =~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.

Find multiple substrings in strings and record location

The following is the script for finding consecutive substrings in strings.
use strict;
use warnings;
my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");
#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";
my $sequence;
while (my $line = <DAT>) {
if ($line=~ /(HDWFLSFKD)/g){
{
print "its found index location: ",
pos($line), "-", pos($line)+length($1), "\n";
}
if ($line=~ /(HD)/g){
print "motif found and its locations is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
if ($line=~ /(K)/g){
print "motif found and its location is: \n";
pos($line), "-",pos($line)+length($1), "\n\n";
}
if ($line=~ /(DD)/g){
print "motif found and its location is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
}else {
$sequence .= $line;
print "came in else\n";
}
}
It matches substring1 with string and prints out position where substring1 matched. The problem lies in finding the rest of the substrings. For substrings2 it starts again from the beginning of the string (instead of starting from the position where substring1 was found). The problem is that every time it calculates position it starts from the beginning of string instead of starting from the position of the previously found substring. Since substrings are consecutive substring1, substring2, substring3, substring4, their positions have to occur after the previous respectively.
Try this perl program
use strict;
use warnings;
use feature qw'say';
my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");
my #regex = qw(
HDWFLSFKD
HD
K
DD
);
my $sequence;
while( my $line = <$dat> ){
chomp $line;
say 'Line: ', $.;
# reset the position of variable $line
# pos is an lvalue subroutine
pos $line = 0;
for my $regex ( #regex ){
$regex = quotemeta $regex;
if( scalar $line =~ / \G (.*?) ($regex) /xg ){
say $regex, ' found at location (', $-[2], '-', $+[2], ')';
if( $1 ){
say " but skipped: \"$1\" at location ($-[1]-$+[1])";
}
}else{
say 'Unable to find ', $regex;
# end loop
last;
}
}
}
I'm not a perl expert but you can use $- and $+ to track index location for last regex match found.
Below is code built on top of your code that explains this.
use strict;
use warnings;
my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");
open (OUTPUTFILE, '>data.txt');
my $sequence;
my $someVar = 0;
my $sequenceNums = 1;
my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";
while (my $line = <DAT>)
{
$someVar = 0;
print "\nSequence $sequenceNums: $line\n";
print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
if ($line=~ /$motif1/g)
{
&printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
$someVar = 1;
}
if ($line=~ /$motif2/g and $someVar == 1)
{
&printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
$someVar = 2;
}
if ($line=~ /$motif3/g and $someVar == 2)
{
&printStuff($sequenceNums, "motif3", $motif4, "$-[0]-$+[0]");
$someVar = 3;
}
if ($line=~ /$motif4/g and $someVar == 3)
{
&printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
}
else
{
$sequence .= $line;
if ($someVar == 0)
{
&printWrongStuff($sequenceNums, "motif1", $motif1);
}
elsif ($someVar == 1)
{
&printWrongStuff($sequenceNums, "motif2", $motif2);
}
elsif ($someVar == 2)
{
&printWrongStuff($sequenceNums, "motif3", $motif3);
}
elsif ($someVar == 3)
{
&printWrongStuff($sequenceNums, "motif4", $motif4);
}
}
$sequenceNums++;
}
sub printStuff
{
print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] index location: $_[3]\n";
}
sub printWrongStuff
{
print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";
}
close (OUTPUTFILE);
close (DAT);
Sample input:
MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKHDWFLSFKDQNHSIKDIFNRFNHYIYNDL
You really should read
perldoc perlre
perldoc perlreref
perldoc perlretut
You need the special variables #- and #+ if you need the positions. No need to try to compute them yourself.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw( each_array );
my $source = 'AAAA BBCCC DD E FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );
if ( $source =~ /$pattern/ ) {
my $it = each_array #-, #+;
$it->(); # discard overall match information;
while ( my ($start, $end) = $it->() ) {
printf "Start: %d - Length: %d\n", $start, $end - $start;
}
}
Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5
The result of a construct like
$line=~ /(HD)/g
is a list. Use while to step through the hits.
To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):
The "\G" assertion can be used to
chain global matches (using "m//g"),
as described in "Regexp Quote-Like
Operators" in perlop. It is also
useful when writing "lex"-like
scanners, when you have several
patterns that you want to match
against consequent substrings of your
string, see the previous reference.
The actual location where "\G" will
match can also be influenced by using
"pos()" as an lvalue: see "pos" in
perlfunc. Note that the rule for
zero-length matches is modified
somewhat, in that contents to the left
of "\G" is not counted when
determining the length of the match.
Thus the following will not match
forever:
$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
print $&;
}