How can I skip some block content while reading in Perl

I want to skip any block whose first line contains "MaterializeU4()", using the subroutine read_block below, but my attempt fails.
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Otherwise return an array ref containing the lines in the block.
sub read_block {
my $fh = shift;
my @lines;
my $block_started = 0;
while( my $line = <$fh> ) {
# how to correct my code below? I don't need the 2nd block content.
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 0) ) ;
if( $block_started ) {
last if $line =~ /^\s*$/;
push @lines, $line;
}
}
return \@lines if @lines;
return;
}
Data as below:
__DATA__
status DynTest = <dynamic 100>
vid = 10002
name = "DynTest"
units = ""
status VIDNAME9000 = <U4 MaterializeU4()>
vid = 9000
name = "VIDNAME9000"
units = "degC"
status DynTest = <U1 100>
vid = 100
name = "Hello"
units = ""
Output:
<StatusVariables>
<SVID logicalName="DynTest" type="L" value="100" vid="10002" name="DynTest" units=""></SVID>
<SVID logicalName="DynTest" type="L" value="100" vid="100" name="Hello" units=""></SVID>
</StatusVariables>
[Updated]
I printed the value of index($line, "MaterializeU4"); it outputs 25.
Then I updated the code as below:
$block_started++ if ( ($line =~ /^(status)/) && (index($line, "MaterializeU4") != 25) );
Now it works.
Any comments on my approach are welcome.

Perl already has an operator to keep track of blocks. It's called the "flip-flop" operator:
Try this out:
while ( <DATA> ) {
next if /\Q<U4 MaterializeU4()>\E/../^\s*$/;
push @lines, $_;
}
The value of /\Q<U4 MaterializeU4()>\E/../^\s*$/ will be true when it sees a line that matches the starting regex and it will stop being true after it sees a line matching the second expression.
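A minimal sketch of the flip-flop approach, wrapped in a hypothetical helper (drop_materialize_blocks is not from the original post). It assumes blocks are separated by blank lines, as read_block in the question expects:

```perl
use strict;
use warnings;

# Return the input lines minus any block that starts with a MaterializeU4
# status line. A block is assumed to end at the next blank line.
sub drop_materialize_blocks {
    my @kept;
    for my $line (@_) {
        # True from the marker line through the following blank line.
        next if $line =~ /\Q<U4 MaterializeU4()>\E/ .. $line =~ /^\s*$/;
        push @kept, $line;
    }
    return @kept;
}

my @input = (
    "status DynTest = <dynamic 100>\n", "vid = 10002\n", "\n",
    "status VIDNAME9000 = <U4 MaterializeU4()>\n", "vid = 9000\n", "\n",
    "status DynTest = <U1 100>\n", "vid = 100\n",
);
print drop_materialize_blocks(@input);
```

The range also swallows the blank line that ends the skipped block. The flip-flop keeps its state between evaluations, so input that ends in the middle of a MaterializeU4 block would leave it switched on for the next call.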

First, using a regex instead of index is probably better, since you can tune it to the exact format of the status string if you decide to be stricter than just "substring exists".
I would suggest, as one solution, adding a second flag to skip the block contents if it's a MaterializeU4 block, as follows:
# Read a constant definition block from a file handle.
# void return when there is no data left in the file.
# Empty return for skippable (MaterializeU4) block!!!
# Otherwise return an array ref containing the lines in the block.
sub read_block {
my $fh = shift;
my @lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = <$fh> ) {
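if ($line =~ /^status/) {
$block_started = 1;
# Test for the marker separately: with /^status.*?((MaterializeU4)?)/ the
# optional group matches the empty string unless the marker directly follows "status".
$block_ignore = 1 if $line =~ /MaterializeU4/;
}
last if $line =~ /^\s*$/ && $block_started;
push @lines, $line unless $block_ignore;
}
return \@lines if @lines || $block_started;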
return;
}
Here's a slightly modified sample I tested using codepad.org:
Code:
use Data::Dumper;
my @all_lines = (
"s 1" ,"b 1" ,""
, "s MaterializeU4" ,"b 2" ,""
, "s 3" ,"b 3" ,""
);
while (@all_lines) {
my $block = read_block();
print Data::Dumper->Dump([$block]);
}
exit 0;
sub read_block {
my @lines = ();
my $block_started = 0;
my $block_ignore = 0;
while (my $line = shift @all_lines) {
if ($line =~ /^s .*?((MaterializeU4)?)/) {
$block_started = 1;
$block_ignore = 1 if $1;
}
last if $line =~ /^\s*$/ && $block_started;
push @lines, $line unless $block_ignore;
}
return \@lines if @lines || $block_started;
return;
}
Output:
$VAR1 = [
's 1',
'b 1'
];
$VAR1 = [];
$VAR1 = [
's 3',
'b 3'
];

On successful match of a substring, index returns the position of the substring, which could be any value >= 0. On "failure", index returns -1.
The way you are using index
index($line, "MaterializeU4") != 0
will be true for all lines except for a line that begins with the string "MaterializeU4".
It looks like you already know a little bit about Perl regular expressions. Why not use one in this case, too?
++$block_started if $line =~ /status/ && $line !~ /MaterializeU4/;
Another issue I see is that you set $block_started to begin capturing lines, but you never set it to zero at the end of the "block", say, when $line is empty. I'm not sure if that's what you wanted to do.
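To make the index return values concrete, here is a small sketch using two status lines from the question. The helper name is_materialize_line is hypothetical; the point is that "not found" must be tested against -1, not 0:

```perl
use strict;
use warnings;

# index returns the 0-based offset of the substring, or -1 when it is absent.
sub is_materialize_line {
    my ($line) = @_;
    return index($line, 'MaterializeU4') != -1 ? 1 : 0;
}

my $skip = 'status VIDNAME9000 = <U4 MaterializeU4()>';
my $keep = 'status DynTest = <dynamic 100>';

# prints "25 -1"
printf "%d %d\n", index($skip, 'MaterializeU4'), index($keep, 'MaterializeU4');
# prints "1 0"
printf "%d %d\n", is_materialize_line($skip), is_materialize_line($keep);
```

Comparing against 25, as in the question's update, only works while the marker happens to sit at that exact column.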

Related

Hash incorrectly tracking counts, runtime long

I am working on a program in Perl; its output is wrong and it takes forever to run. The code is meant to take in a large DNA sequence file and read through it in 15-letter increments (k-mers), stepping forward one position at a time. I'm supposed to enter the k-mer sequences into a hash, with each value being the number of occurrences of that k-mer, meaning each key should be unique and, when a duplicate is found, the count for that particular k-mer should increase. I know from my professor's expected output file that I have too many lines, so my code is allowing duplicates and not counting correctly. It also runs for 5+ minutes, so I have to press Ctrl+C to escape. When I look at kmers.txt, the file is at least written and formatted correctly.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# countKmers.pl
# Open file /scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta
# Identify all k-mers of length 15, load them into a hash
# and count the number of occurrences of each k-mer. Each
# unique k-mer and its count will be written to file
# kmers.txt
#Create an empty hash
my %kMersHash = ();
#Open a filehandle for the output file kmers.txt
unless ( open ( KMERS, ">", "kmers.txt" ) ) {
die $!;
}
#Call subroutine to load Fly Chromosome 2L
my $sequenceRef = loadSequence("/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta");
my $kMer = 15; #Set the size of the sliding window
my $stepSize = 1; #Set the step size
for (
#The sliding window's start position is 0
my $windowStart = 0;
#Prevent going past end of the file
$windowStart <= ( length($$sequenceRef) - $kMer );
#Advance the window by the step size
$windowStart += $stepSize
)
{
#Get the substring from $windowStart for length $kMer
my $kMerSeq = substr( $$sequenceRef, $windowStart, $kMer );
#Call the subroutine to iterate through the kMers
processKMers($kMerSeq);
}
sub processKMers {
my ($kMerSeq) = @_;
#Initialize $kCount with at least 1 occurrence
my $kCount = 1;
#If the key already exists, the count is
#increased and changed in the hash
if ( not exists $kMersHash{$kMerSeq} ) {
#The hash key=>value is loaded: kMer=>count
$kMersHash{$kMerSeq} = $kCount;
}
else {
#Increment the count
$kCount ++;
#The hash is updated
$kMersHash{$kMerSeq} = $kCount;
}
#Print out the hash to filehandle KMERS
for (keys %kMersHash) {
print KMERS $_, "\t", $kMersHash{$_}, "\n";
}
}
sub loadSequence {
#Get my sequence file name from the parameter array
my ($sequenceFile) = @_;
#Initialize my sequence to the empty string
my $sequence = "";
#Open the sequence file
unless ( open( FASTA, "<", $sequenceFile ) ) {
die $!;
}
#Loop through the file line-by-line
while (<FASTA>) {
#Assign the line, which is in the default
#variable to a named variable for readability.
my $line = $_;
#Chomp to get rid of end-of-line characters
chomp($line);
#Check to see if this is a FASTA header line
if ( $line !~ /^>/ ) {
#If it's not a header line append it
#to my sequence
$sequence .= $line;
}
}
#Return a reference to the sequence
return \$sequence;
}
Here's how I would write your application. The processKMers subroutine boils down to just incrementing a hash element, so I've removed it. I've also altered the identifiers to match the snake_case that is more usual in Perl code, and I didn't see any point in load_sequence returning a reference to the sequence, so I've changed it to return the string itself.
use strict;
use warnings 'all';
use constant FASTA_FILE => '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
use constant KMER_SIZE => 15;
use constant STEP_SIZE => 1;
my $sequence = load_sequence( FASTA_FILE );
my %kmers;
for (my $offset = 0;
$offset + KMER_SIZE <= length $sequence;
$offset += STEP_SIZE ) {
my $kmer_seq = substr $sequence, $offset, KMER_SIZE;
++$kmers{$kmer_seq};
}
open my $out_fh, '>', 'kmers.txt' or die $!;
for ( keys %kmers ) {
printf $out_fh "%s\t%d\n", $_, $kmers{$_};
}
sub load_sequence {
my ( $sequence_file ) = @_;
my $sequence = "";
open my $fh, '<', $sequence_file or die $!;
while ( <$fh> ) {
next if /^>/;
chomp;
$sequence .= $_;
}
return $sequence;
}
Here's a neater way to increment a hash element without using ++ on the hash directly
my $n;
if ( exists $kMersHash{$kMerSeq} ) {
$n = $kMersHash{$kMerSeq};
}
else {
$n = 0;
}
++$n;
$kMersHash{$kMerSeq} = $n;
Everything looks fine in your code besides processKMers. The main issues:
$kCount is not persistent between calls to processKMers, so in your else branch $kCount will always be 2.
You print the entire hash every time you call processKMers. That is what slows you down, and it is also why kmers.txt has too many lines. Wait until the end of your program and print once.
Keeping your code mostly the same:
sub processKMers {
my ($kMerSeq) = @_;
if ( not exists $kMersHash{$kMerSeq} ) {
$kMersHash{$kMerSeq} = 1;
}
else {
$kMersHash{$kMerSeq}++;
}
}
Then you want to move your print logic to immediately after your for-loop.
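Putting both fixes together, a minimal self-contained sketch might look like this. The helper name count_kmers and the short test sequence are illustrative assumptions, not part of the original program:

```perl
use strict;
use warnings;

# Count every overlapping k-mer of length $k in $sequence.
sub count_kmers {
    my ($sequence, $k) = @_;
    my %count;
    # The range is empty when the sequence is shorter than $k.
    for my $offset (0 .. length($sequence) - $k) {
        $count{ substr $sequence, $offset, $k }++;
    }
    return \%count;
}

my $counts = count_kmers('ACGTACGT', 4);

# Print once, after all counting is finished.
print "$_\t$counts->{$_}\n" for sort keys %$counts;
```

Incrementing the hash element directly relies on Perl treating a missing value as 0, so no exists check is needed.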

Finding longest match between 2 files from pattern

I am having trouble using two files within this program. I am trying to access the contents of files $Q and $s.
print "Input the K value \n";
$k = <>;
chomp $k;
print "Input T\n";
$t = <>;
chomp $t;
%Qkmer = ();
$i = 1;
$query=' ';
while ($line=<IN>) {
chomp($line);
if ($line=~ m/^>/ ) {
next;
}
$query=$query.$line;
$line=~ s/(^|\n)[\n\s]*/$1/g;
while (length($line) >= $k) {
$line =~ m/(.{$k})/;
if (! defined $Qkmer{$1}) {#every key not defined as the first match
$Qkmer{$1} = $i;
}
$i++;
$line = substr($line, 1, length($line) -1);
}
}
open(MYDATA, '<', "data.txt");
while ($line=<MYDATA>) {
chomp($line);
%Skmer = (); # This initializes the hash called Skmer.
$j = 1;
if ($line=~ m/^>/ ) { #if the line starts with >
next; #start on next line #separated characters
}
$line=~ s/^\s+|\s+$//g ; #remove all spaces from file
while (length($line) >= $k) {
$line =~ m/(.{$k})/;#match any k characters and only k characters in dna
$Skmer{$1} = $j; #set the key position to $j and increase for each new key
$j++;
$line = substr($line, 1, length($line) -1); #this removes the first character in the current string
}
###(56)###for($Skmerkey(keys %Skmer)){
$i=$Skmer{$Skmerkey};
if(defined $Qkmer($Skmerkey)){
$j=$Qkmer($Skmerkey);
}
$S1=$line;
$S2=$query;
@arrayS1= split(//, $S1);
@array2= split(//, $S2);
$l=0;
while($arrayS1[$i-$l] eq $arrayS2[$j-$l]){
$l++;
}
$start=$i-$l;
$m=0;
while ($arrayS1[$i+$k+$m] eq $arrayS2[$j+$k+$m]) {
$m++;
}
$length=$l+$k+$m;
$match= substr($S1, $start, $length);
if($length>$t){
$longest=length($match);
print "Longest: $match of length $longest \n";
}
}
}###(83)###
The input files contain only strings of letters. For example:
File 1:
ahhtsagnchjgstffhjyfcsghnvzfhg
File2:
ggujvfbgfgkjfcijjjffcvvafcsghnvzfhgvugxckugcbhfcgh
ghnvzfhgvugxckHhfgjgcfujvftjbvdtkhvddgjcdgjxdjkfrh
ajdbvciyqdanvkjghnvzfhgvugxc
From a match of a word of length$k in file 1 in file 2, I check from that match in file 2 to left and to right of word for further matches. The final output is the longest match between File 1 and File 2 based on $k. Now I ge
With this code, I get a syntax error and I am not sure why, because it looks correct to me:
syntax error at testk.pl line 56, near "$Skmerkey("
syntax error at testk.pl line 83, near "}"
Thank you.
use strict; # <--- Always use this
use warnings; # <--- and this
use Data::Dumper;
my $k=3;
open(my $IN, '<', "File2"); # use a lexical $IN instead of the bareword IN
my $line=0; # line number
my %kmer; # hash of arrays of all $k-letter "words" line/position
my @Q; # rows of Q-file
while(<$IN>) {
chomp;
next if /^>/;
s/^\s+|\s+$//g;
next if !$_;
my $pos=0;
push @Q, $_; # store source row
for(/(?=(.{$k}))/g) { # Capture $k letters. floating window with step 1 symbol
push @{$kmer{$_}}, [$line,$pos]; # store row number and position of "word"
$pos++;
}
$line++;
}
open($IN, '<', "File1");
$line=0;
while(<$IN>) { # Read S-file
chomp;
next if /^>/;
s/^\s+|\s+$//g;
next if !$_;
my $pos=0;
my $len=length($_); # length of row of S-file
my $s=$_; # Current row of S-file
my @ignore=(); # array to store information about match tails
for(/(?=(.{$k}))/g) {
next if ! $kmer{$_}; # "word" not found, try the next one
for(@{$kmer{$_}}) { # $kmer{word} contains array of lines/positions in Q
my($qline, $qpos)=@{$_};
# print "test $qline:$qpos ";
if( grep {$_->[0]==$qline && $_->[1]==$qpos } @ignore ) {
# this line/position already tested and included in found matching
# print "Ignore match tail $qline:$qpos\n";
next;
}
my $j=$k; # $k letters same, test after this point
my $qlen=length($Q[$qline]);
$j++ while( $pos+$j<$len && $qpos+$j<$qlen &&
substr($s,$pos+$j,1) eq substr($Q[$qline],$qpos+$j,1) );
print "MATCH FOUND: S-file line $line pos $pos, Q-file line $qline pos $qpos: ",
substr($s,$pos,$j),"\n";
push @ignore, [$qline, $qpos, $j]; # store positions and length of match
}
} continue { # Continue block works on all loops, include after "next"
$pos++;
@ignore=grep { # recalculate/filter position and length of all match tails
++$_->[1]; # increment position
(--$_->[2]) # decrement length
>= $k # and filter out lengths < $k
} @ignore;
# print Dumper(\@ignore);
}
$line++;
}
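The answer above enumerates overlapping words with the zero-width lookahead /(?=(.{$k}))/g. A small sketch of just that step, using a hypothetical helper name (kmer_positions) and a fragment of File 1 from the question:

```perl
use strict;
use warnings;

# Return [word, position] pairs for every overlapping word of length $k.
sub kmer_positions {
    my ($string, $k) = @_;
    my @found;
    # The lookahead consumes nothing, so each match starts one character
    # after the previous one; $-[0] is the offset where the match begins.
    while ( $string =~ /(?=(.{$k}))/g ) {
        push @found, [ $1, $-[0] ];
    }
    return @found;
}

printf "%s at %d\n", @$_ for kmer_positions('ahhtsag', 3);
```

Because the match itself is zero-length, the captured word comes from the group inside the lookahead rather than from the overall match.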

How to identify nth lines of n files in while<>

I have a code which adds all vectors in all files.
There can be any number of input files. For example first input file is:
0.55 0 0.3335 1.2
0.212 0 2.2025 1
and the second one is:
0.25 0 0.3333 1.0
0.1235 0 0.2454 1
What I get is the sum of all vectors, thus in result i get one vector
which is:
1.13550 0 3.1147 4.2
But I'm trying to sum the first vector of the first file with the first vector of the second file and so on. In result according to this example I should get 2 vectors.
For now I have this:
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my @sum = 0;
my $dim = 0;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my @vectors = split(" ", $_);
my $vector_length = @vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum[$i] += $vectors[$i];
}
}
$" = "\t\t";
print "\n --- \n @sum \n";
What I need is just to find out how to identify each file's nth line
and to sum the column values of those lines while keeping in mind, that there can be n number of files.
I saw filehandling question over here with similar issue, however
I didn't find my answer there.
Just looking for some suggestions and guidance. Got stuck on this.
Open each file yourself and use the $. variable to know which line you are on (or count the files yourself). Here's the basic structure:
foreach my $file ( @files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $. ] = ...; # $. is the line number
}
}
If you don't like $., you can use its longer name. You have to turn on English (which comes with Perl):
use English;
## use English qw( -no_match_vars ); # for v5.16 and earlier
foreach my $file ( @files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $INPUT_LINE_NUMBER ] = ...;
}
}
Or, you can count yourself, which might be handy if the vectors in the files don't line up by strict line number (perhaps because of comments or some other formatting oddity):
foreach my $file ( @files ) {
open my $fh, '<', $file or die ...;
my $line = -1;
while( <$fh> ) {
$line++;
chomp;
$sum[ $line ] = ...;
}
}
The harder way is the answer bart gives which inspects eof at the end of every line to see if the magical ARGV handle is looking at a new file, and resetting $. if it is. It's an interesting trick but hardly anyone is going to understand what it's doing (or even notice it).
For the other part of the problem, I think you're doing the vector sum wrong, or using confusing variable names. A line is a vector, and the numbers in the line are its components. A two-dimensional array will work. The first index is the line number and the second is the component index:
while( <$fh> ) {
chomp;
... skip unwanted lines
my @components = split;
... various dimension checks
foreach my $i ( 0 .. $#components ) {
$sum[ $. ][ $i ] += $components[ $i ];
}
}
The Data::Dumper module is handy for complex data structures. You can also see the perldsc (Perl Data Structures Cookbook) documentation. The $. variable is found in perlvar.
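The pieces above can be combined into a runnable sketch. The helper name sum_by_line is hypothetical, and it counts vector lines itself rather than relying on $., so blank or comment lines do not shift the rows. Each source may be a file name or a reference to a string, since Perl's three-argument open reads from either:

```perl
use strict;
use warnings;

# Sum the nth vector of every source; returns a list of array references.
sub sum_by_line {
    my @sources = @_;
    my @sum;                               # $sum[$row][$component]
    for my $source (@sources) {
        open my $fh, '<', $source or die "Cannot open input: $!";
        my $row = -1;                      # counts vector lines only
        while ( my $line = <$fh> ) {
            $line =~ s/#.*//;              # strip comments
            next if $line =~ /^\s*$/;      # skip blank lines
            $row++;
            my @components = split ' ', $line;
            die "Vector dimensions do not match\n"
                if $sum[$row] && @{ $sum[$row] } != @components;
            $sum[$row][$_] += $components[$_] for 0 .. $#components;
        }
        close $fh;
    }
    return @sum;
}

# In-memory stand-ins for the two example files from the question.
my $file1 = "0.55 0 0.3335 1.2\n0.212 0 2.2025 1\n";
my $file2 = "0.25 0 0.3333 1.0\n0.1235 0 0.2454 1\n";
print join("\t", @$_), "\n" for sum_by_line(\$file1, \$file2);
```

With real files, pass the file names (for example the contents of @ARGV) instead of the string references.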
$. is the line number of the most recently read file handle. close(ARGV) if eof; can be used to reset the line number between files (as documented in eof). (Note: eof() is different from eof.) So you now have line numbers.
The second problem you have is that you are adding vector components ($vectors[$i]) to vectors ($sum[$i]). You need to add vector components to vector components. Start by using more appropriate variable names.
This is what we get:
my @sum_vectors;
while (<>) {
s/#.*//; # Remove comments.
next if /^\s*$/; # Ignore blank lines.
my @vector = split;
if ($sum_vectors[$.] && @{ $sum_vectors[$.] } != @vector) {
die("$ARGV:$.: Vector dimensions do not match\n");
}
for my $i (0..$#vector) {
$sum_vectors[$.][$i] += $vector[$i];
}
} continue {
close(ARGV) if eof; # Reset line numbers for each file.
}
Two other errors fixed:
$! did not contain anything meaningful when you used it.
You ignored lines that contain comments, even if they contained valid data too.
Try this:
#!/usr/bin/perl
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my %sum;
my $dim = 0;
my $vector_length;
my $line_number;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my @vectors = split(" ", $_);
$vector_length = @vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum{$.}{$i} += $vectors[$i];
}
$line_number = $.;
$. = 0 if eof;
}
$" = "\t\t";
for (my $line=1; $line<=$line_number; $line++)
{
print $line;
for (my $vector=0; $vector<$vector_length; $vector++)
{
print " " . $sum{$line}{$vector};
}
print "\n";
}

How do I compare two lines (if they are equal or not equal) of a file read inside a while loop?

I have a file like this one below, where the line starting with a number is an ID for my sample and the following lines are the data.
10001;02/07/98;;PI;M^12/12/59^F^^SP^09/12/55
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D16S539
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D7S820
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D13S317
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D5S818
10002;02/07/98;;RJ;F^20/04/86^SP^
;;;;;F1|SP1;;;12;10;12;11;;D10S212
;;;;;F1|SP1;;;8;8;10;8;;D7S820
;;;;;F1|SP1;;;12;11;14;11;;D13S317
;;;;;F1|SP1;;;13;12;13;8;;D5S818
For the lines containing the data, I would like to test whether or not the fields 6-11 are the same because I want the data only if they are not equal to each other (in the first case they are all '9').
So I thought about splitting the lines and store them as an array, and then compare the arrays with the ~~ operator. But how do I do that if I'm reading the file inside a while loop and the array is redefined each line?
Or maybe there are better ways to do that.
Thanks in advance!
This is pseudocode to illustrate what I want to do:
open FILE, $ARGV[0] or die $!;
while (<FILE>) {
chomp;
my @field = split /;/;
if ($field[0] eq '') {
if @field[6 .. 11] is not equal to @field[6 .. 11] in all the next lines {
do my calculation;
}
}
}
Am I correct in saying that the data really represents two records? If so, you want to accumulate the lines for the full record.
my @super_rec;
while (<>) {
chomp;
my @fields = split /;/;
if ($fields[0] ne '') {
process_rec(\@super_rec) if @super_rec;
@super_rec = \@fields;
} else {
push @super_rec, \@fields;
}
}
process_rec(\@super_rec) if @super_rec;
Then, your question can be answered.
sub process_rec {
my ($super_rec) = @_;
my ($rec, @subrecs) = @$super_rec;
my $do_calc = 0;
for my $i (1..$#subrecs) {
if ( $subrecs[0][ 6] ne $subrecs[$i][ 6]
|| $subrecs[0][ 7] ne $subrecs[$i][ 7]
|| $subrecs[0][ 8] ne $subrecs[$i][ 8]
|| $subrecs[0][ 9] ne $subrecs[$i][ 9]
|| $subrecs[0][10] ne $subrecs[$i][10]
|| $subrecs[0][11] ne $subrecs[$i][11]
) {
$do_calc = 1;
last;
}
}
if ($do_calc) {
...
}
}
I assume you're looking to compare data across lines, not within a single line. If I've got that wrong, ignore the rest of my answer.
The way I would do it is to re-join fields 6 through 11 as a string. Keep the data from the first line as $firstdata, and compare data from each successive line as $nextdata. Each time the data don't match, you up the $differences counter. When you get an ID line, check to see if the previous $differences was greater than zero and if so do your calculation (you may need to save the ID line and other fields in some other variables). Then re-initialize the $differences and $firstdata variable.
my $firstdata = "";
my $nextdata = "";
my $differences = 0;
open FILE, $ARGV[0] or die $!;
while (<FILE>) {
chomp;
my @field = split /;/;
if ($field[0] eq '') {
$nextdata = join(';', @field[6..11]);
if ($firstdata && ($nextdata ne $firstdata)) {
$differences++;
} else {
$firstdata = $nextdata;
}
} else {
if ($differences) {
# do your calculation for previous ID
}
$firstdata = "";
$differences = 0;
}
}
if ($differences) {
# do your calculation one last time for the last ID
}
Here's a way to do it with a regex. This might be less efficient than the other methods if the indices are fixed from 6 to 11 and known in advance, because it traverses the entire string:
open FILE, $ARGV[0] or die $!;
while (<FILE>) {
chomp;
my $num = 0;
my $same = 1;
while (/;(\d+);/) {
if ($num == 0) { $num = $1; }
elsif ($1 != $num) { $same = 0; last; }
# Substitute current digit matched with x (or any char)
# to avoid infinite loop
s/$1/x/;
}
if ($same) {
print "All digits same";
}
}
Using the Text::CSV_XS module you can do something like this:
use strict;
use warnings;
use Text::CSV_XS;
use feature 'say';
my $csv = Text::CSV_XS->new({
sep_char => ";",
binary => 1,
});
my %data;
my @hdrs; # store initial order of headers
my $hdr;
while (my $row = $csv->getline(*DATA)) {
if ($row->[0] =~ /^\d+$/) {
$csv->combine(@$row) or die "Cannot combine: " .
$csv->error_diag();
$hdr = $csv->string(); # recreate the header
push @hdrs, $hdr; # save list of headers
} else {
push @{ $data{$hdr} }, [ @{$row}[6..11] ];
}
}
for (@hdrs) {
say "$_\n arrays are: " . (is_identical($data{$_}) ? "same":"diff");
}
sub is_identical {
my $last;
for (@{$_[0]}) { # argument is two-dimensional array
$last //= $_;
return 0 unless ( @$_ ~~ @$last );
}
return 1; # default = all arrays were identical
}
__DATA__
10001;02/07/98;;PI;M^12/12/59^F^^SP^09/12/55
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D16S539
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D7S820
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D13S317
;;;;;M1|F1|SP1;9;9;9;9;9;9;;D5S818
10002;02/07/98;;RJ;F^20/04/86^SP^
;;;;;F1|SP1;;;12;10;12;11;;D10S212
;;;;;F1|SP1;;;8;8;10;8;;D7S820
;;;;;F1|SP1;;;12;11;14;11;;D13S317
;;;;;F1|SP1;;;13;12;13;8;;D5S818
Output:
10001;02/07/98;;PI;M^12/12/59^F^^SP^09/12/55
arrays are: same
10002;02/07/98;;RJ;F^20/04/86^SP^
arrays are: diff
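The smartmatch operator ~~ has been experimental since Perl 5.18 and was later deprecated, so a comparison that avoids it may be preferable. A minimal sketch, assuming each row has already been split on ";" (the helper name rows_identical is hypothetical): it joins fields 6..11 into a string, as one of the answers above suggests, and checks that only one distinct string occurs.

```perl
use strict;
use warnings;

# True (1) when fields 6..11 are the same in every row, otherwise 0.
sub rows_identical {
    my @rows = @_;                      # each row is an array reference
    my %seen;
    for my $row (@rows) {
        # ';' cannot occur inside a field that came from split /;/.
        my $key = join ';', map { $_ // '' } @{$row}[6 .. 11];
        $seen{$key} = 1;
    }
    return keys(%seen) <= 1 ? 1 : 0;
}

my @same = map { [ split /;/ ] } (
    ';;;;;M1|F1|SP1;9;9;9;9;9;9;;D16S539',
    ';;;;;M1|F1|SP1;9;9;9;9;9;9;;D7S820',
);
my @diff = map { [ split /;/ ] } (
    ';;;;;F1|SP1;;;12;10;12;11;;D10S212',
    ';;;;;F1|SP1;;;8;8;10;8;;D7S820',
);
print rows_identical(@same) ? "same\n" : "diff\n";
print rows_identical(@diff) ? "same\n" : "diff\n";
```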

Find multiple substrings in strings and record location

The following is the script for finding consecutive substrings in strings.
use strict;
use warnings;
my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");
#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";
my $sequence;
while (my $line = <DAT>) {
if ($line=~ /(HDWFLSFKD)/g){
{
print "its found index location: ",
pos($line), "-", pos($line)+length($1), "\n";
}
if ($line=~ /(HD)/g){
print "motif found and its locations is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
if ($line=~ /(K)/g){
print "motif found and its location is: \n";
pos($line), "-",pos($line)+length($1), "\n\n";
}
if ($line=~ /(DD)/g){
print "motif found and its location is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
}else {
$sequence .= $line;
print "came in else\n";
}
}
It matches substring1 with string and prints out position where substring1 matched. The problem lies in finding the rest of the substrings. For substrings2 it starts again from the beginning of the string (instead of starting from the position where substring1 was found). The problem is that every time it calculates position it starts from the beginning of string instead of starting from the position of the previously found substring. Since substrings are consecutive substring1, substring2, substring3, substring4, their positions have to occur after the previous respectively.
Try this perl program
use strict;
use warnings;
use feature qw'say';
my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");
my @regex = qw(
HDWFLSFKD
HD
K
DD
);
my $sequence;
while( my $line = <$dat> ){
chomp $line;
say 'Line: ', $.;
# reset the position of variable $line
# pos is an lvalue subroutine
pos $line = 0;
for my $regex ( @regex ){
$regex = quotemeta $regex;
if( scalar $line =~ / \G (.*?) ($regex) /xg ){
say $regex, ' found at location (', $-[2], '-', $+[2], ')';
if( $1 ){
say " but skipped: \"$1\" at location ($-[1]-$+[1])";
}
}else{
say 'Unable to find ', $regex;
# end loop
last;
}
}
}
I'm not a Perl expert, but you can use the @- and @+ arrays ($-[0] and $+[0]) to track the index location of the last regex match.
Below is code built on top of your code that explains this.
use strict;
use warnings;
my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");
open (OUTPUTFILE, '>data.txt');
my $sequence;
my $someVar = 0;
my $sequenceNums = 1;
my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";
while (my $line = <DAT>)
{
$someVar = 0;
print "\nSequence $sequenceNums: $line\n";
print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
if ($line=~ /$motif1/g)
{
&printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
$someVar = 1;
}
if ($line=~ /$motif2/g and $someVar == 1)
{
&printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
$someVar = 2;
}
if ($line=~ /$motif3/g and $someVar == 2)
{
&printStuff($sequenceNums, "motif3", $motif3, "$-[0]-$+[0]");
$someVar = 3;
}
if ($line=~ /$motif4/g and $someVar == 3)
{
&printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
}
else
{
$sequence .= $line;
if ($someVar == 0)
{
&printWrongStuff($sequenceNums, "motif1", $motif1);
}
elsif ($someVar == 1)
{
&printWrongStuff($sequenceNums, "motif2", $motif2);
}
elsif ($someVar == 2)
{
&printWrongStuff($sequenceNums, "motif3", $motif3);
}
elsif ($someVar == 3)
{
&printWrongStuff($sequenceNums, "motif4", $motif4);
}
}
$sequenceNums++;
}
sub printStuff
{
print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] index location: $_[3]\n";
}
sub printWrongStuff
{
print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";
}
close (OUTPUTFILE);
close (DAT);
Sample input:
MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKHDWFLSFKDQNHSIKDIFNRFNHYIYNDL
You really should read
perldoc perlre
perldoc perlreref
perldoc perlretut
You need the special variables @- and @+ if you need the positions. No need to try to compute them yourself.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw( each_array );
my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );
if ( $source =~ /$pattern/ ) {
my $it = each_array @-, @+;
$it->(); # discard overall match information;
while ( my ($start, $end) = $it->() ) {
printf "Start: %d - Length: %d\n", $start, $end - $start;
}
}
Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5
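In list context, a construct like
$line=~ /(HD)/g
returns a list of all matches. In scalar context, such as a while condition, each evaluation finds the next match, so use while to step through the hits.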
To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):
The "\G" assertion can be used to
chain global matches (using "m//g"),
as described in "Regexp Quote-Like
Operators" in perlop. It is also
useful when writing "lex"-like
scanners, when you have several
patterns that you want to match
against consequent substrings of your
string, see the previous reference.
The actual location where "\G" will
match can also be influenced by using
"pos()" as an lvalue: see "pos" in
perlfunc. Note that the rule for
zero-length matches is modified
somewhat, in that contents to the left
of "\G" is not counted when
determining the length of the match.
Thus the following will not match
forever:
$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
print $&;
}
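For the original problem of finding the motifs one after another, a sketch along these lines may help. It does not use \G itself; it relies on the related fact that a scalar-context m//g match resumes from pos(), the same position \G anchors to. The helper name find_in_order is hypothetical:

```perl
use strict;
use warnings;

# Find each motif in order; every search starts where the previous match ended.
# Returns [motif, start, end] for each motif found before the first failure.
sub find_in_order {
    my ($line, @motifs) = @_;
    my @hits;
    pos($line) = 0;
    for my $motif (@motifs) {
        # In scalar context, /g continues searching from pos($line).
        last unless $line =~ /\Q$motif\E/g;
        push @hits, [ $motif, $-[0], $+[0] ];
    }
    return @hits;
}

my $sequence = 'MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA';
for my $hit ( find_in_order($sequence, qw( HDWFLSFKD HD K DD )) ) {
    printf "%s found at %d-%d\n", @$hit;
}
```

The loop stops at the first motif that cannot be found after the previous one, which mirrors the "consecutive" requirement in the question.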