Matching array elements pairwise using a Perl script - perl

My Input file:
my $inp = "sample.txt";
#Sample.txt
As the HF exchange `\mathcal{\mathsf{}}` operator adopted
in, the same HF exchange operator is adopted in without further
optimization. However, the remaining `\mathbb{\mathbbm{}}`
`\mathbm{\mathbf{}}`, `\mathbf{\mathit{}}`. When compared with those
adopted in the MR hybrid functionals developed by Henderson {\it et al.}
for different `\mathrm{\mathscr{}}`, `\mathsf{\mathfrak{}}`
My Array Elements:
my @arr = qw(boldsymbol mathbb mathbbm mathbf mathcal mathbf mathit mathbf mathcal mathfrak mathit mathrm mathscr mathsf);
I need to check for the pattern below:
\\$arr[0]{$arr[1] ... \\$arr[0]{$arr[2] .... \\$arr[0]{$arr[3] ... \\$arr[0]{$arr[13]
...
...
\\$arr[13]{$arr[0] ... \\$arr[13]{$arr[1] ... \\$arr[13]{$arr[2] ... \\$arr[13]{$arr[13]
For Example:
\boldsymbol{\mathbb} and \\boldsymbol{\mathbbm} ...
\mathbb{\boldsymbol} and \\mathbb{\mathbbm} ...
#My Ist attempt
readFileinString($inp,\$inpcnt);
my $i = 1; my $j = 1; my $cls = $#arr;
while($inpcnt=~m/\\$arr[$i]\{\\$arr[$j]/g)
{
print "LL: $&\n";
$j += 1;
if($j == $cls) { $i++; }
}
#IInd Attempt
my (@arr1,@arr2) = ();
while(<>)
{
chomp;
push(@arr1, $_);
}
my $join1 = join "|", @arr1;
my $join2 = join "|", @arr1;
#print "($join1)\{($join2)";
while($str=~m/($join1)\{($join2)/g)
{
print "Matched: $&\n";
}
#------------------>Reading a file
sub readFileinString
#------------------>
{
my $File = shift;
my $string = shift;
use File::Basename;
my $filenames = basename($File);
open(FILE1, "<$File") or die "\nFailed Reading File: [$File]\n\tReason: $!";
read(FILE1, $$string, -s $File, 0);
close(FILE1);
}
__DATA__
\boldsymbol
\mathbb
\mathbbm
\mathbf
\mathcal
\mathbf
\mathit
\mathbf
\mathcal
\mathfrak
\mathit
\mathrm
\mathscr
\mathsf
Could anyone please point out where I am going wrong in this code?

Loop over the list of terms twice, nested. This will result in the cartesian product.
use 5.026;
use strictures;
use Data::Munge qw(list2re);
my @markup = qw(boldsymbol mathbb mathbbm mathbf mathcal mathbf mathit
mathbf mathcal mathfrak mathit mathrm mathscr mathsf);
my $BS = '\\'; # a single backslash
my @expressions;
for my $first_term (@markup) {
for my $second_term (@markup) {
push @expressions, "$BS${first_term}{$BS$second_term"
}
}
my $regex = list2re @expressions;
my $input = <<'';
As the HF exchange \mathcal{\mathsf{}} operator adopted
in, the same HF exchange operator is adopted in without further
optimization. However, the remaining \mathbb{\mathbbm{}}
\mathbm{\mathbf{}}, \mathbf{\mathit{}}. When compared with those
adopted in the MR hybrid functionals developed by Henderson {\it et al.}
for different \mathrm{\mathscr{}} \mathsf{\mathfrak{}}

my @results = $input =~ m/$regex/gms;
# (
# '\\mathcal{\\mathsf',
# '\\mathbb{\\mathbbm',
# '\\mathbf{\\mathit',
# '\\mathrm{\\mathscr',
# '\\mathsf{\\mathfrak'
# )
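If installing Data::Munge is not an option, roughly the same thing can be sketched with core Perl only. This is a minimal sketch under my own assumptions: the sub name pair_regex is made up, and the trailing lookahead (so that a shorter name such as mathbb is not reported when the text actually says mathbbm) is an addition that the answer above does not have.

```perl
use strict;
use warnings;

# Build one regex that matches "\first{\second" for every ordered pair of terms.
# pair_regex is a hypothetical helper name, not part of any module.
sub pair_regex {
    my @terms = @_;
    # Drop duplicates, then sort longest first so "mathbbm" is tried before "mathbb".
    my %seen;
    my @unique = sort { length($b) <=> length($a) || $a cmp $b }
                 grep { !$seen{$_}++ } @terms;
    my $alt = join '|', map { quotemeta($_) } @unique;
    # Assumption: a command name ends where the letters end.
    return qr/\\(?:$alt)\{\\(?:$alt)(?![A-Za-z])/;
}

my $regex = pair_regex(qw(boldsymbol mathbb mathbbm mathbf mathcal mathit
                          mathfrak mathrm mathscr mathsf));
my $text  = 'As \mathcal{\mathsf{}} and \mathbb{\mathbbm{}} but not \mathbm{\mathbf{}}';
my @hits  = $text =~ /$regex/g;
print "$_\n" for @hits;
```

Because both halves of the pattern use the same alternation, the cartesian product is implied by the regex itself and never has to be spelled out as a list of 14 x 14 strings.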

Related

Perl: Not a HASH reference while creating nested hash from multidimensional arrays

I want to create a nested hash by reading values from multidimensional arrays whose elements are separated by ->, e.g.
Array 1: key1->key2->key3->value
Array 2: key1->key2->value
Array 3: key1->value
When a key has both a value and sub-keys (e.g. key2 has a value and also a sub-key key3), I get the error "Not a HASH reference". It seems the previous hash is being overwritten and treated as an array.
Help is appreciated. I have tried to debug by printing the variables with the Dumper module, and I can see that it is an ARRAY reference and not a hash.
To reproduce, create files 1.txt to 3.txt in any folder with the following content:
1.txt: /*TEST-TAG = ABC->DEF->fma->GHI*/
2.txt: /*TEST-TAG = ABC->DEF->fma*/
3.txt: /*TEST-TAG = ABC->DEF*/
Then use this Perl script:
#!/usr/bin/perl
use strict;
use warnings;
my @lines=`grep -R 'TEST-TAG =' <FOLDER where .txt files present>`;
my $hash;
#parse the lines which has pattern /\*TEST-TAG = ABC->DEF->fma->GHI\*/
foreach (@lines)
{
print "line is $_\n";
my($all_cat) = $_ =~ /\=(.*)\*\//;
print "all cat is $all_cat\n";
my($testname) = $_ =~ /\/.*\/(.*)\./;
print "testname is $testname\n";
if (!$all_cat eq "") {
$all_cat =~ s/ //g;
my @ts = split(',', $all_cat);
print "ts is @ts\n";
my $i;
foreach (@ts) {
my @allfeat = split('->',$_);
my $count = scalar @allfeat;
for ($i = 0; $i<$count; $i++) {
my @temparr = @allfeat[$i..$count-1];
print "temparr is @temparr\n";
push @temparr, $testname;
ToNestedHash($hash, @temparr);
}
}
}
}
sub ToNestedHash {
my $ref = \shift;
print "sandeep in ref $ref\n";
print "sandeep in ref", ref($ref), "\n";
my $h = $$ref;
print "sandeep h $h\n";
my $value = pop;
print "sandeep value is $value\n";
print "sandeep array is @_\n";
print "refrence", ref($h), "\n";
foreach my $i (@_) {
print " before INDEX $i\n";
print Dumper($ref);
$ref =\$$ref->{ $i };
print "after INDEX $i\n";
print Dumper($ref);
}
if (!isinlist(\@{$$ref},$value)) {
push @{$$ref}, $value;
}
return $h;
}
# If element exists in the list
sub isinlist {
my ($aref, $key) = ($_[0], $_[1]);
foreach my $elem (@$aref){
if ($elem eq $key) {
return 1;
}
}
return 0;
}
I get this output with debug prints
line is File.txt:/*TEST-TAG = ABC->DEF->fma->GHI*/
all cat is ABC->DEF->fma->GHI
testname is hmma_884_row_row_f16_f16
ts is ABC->DEF->fma->GHI
temparr is ABC DEF fma GHI
sandeep in ref REF(0x12a1048)
sandeep in refREF
sandeep h HASH(0x12a09a0)
sandeep value is hmma_884_row_row_f16_f16
sandeep array is ABC DEF fma GHI
refrenceHASH
REF
temparr is DEF fma GHI
sandeep in ref REF(0x12a1048)
sandeep in refREF
sandeep h HASH(0x12a09a0)
sandeep value is hmma_884_row_row_f16_f16
sandeep array is DEF fma GHI
refrenceHASH
REF
temparr is fma GHI
sandeep in ref REF(0x12a1048)
sandeep in refREF
sandeep h HASH(0x12a09a0)
sandeep value is hmma_884_row_row_f16_f16
sandeep array is fma GHI
refrenceHASH
Not a HASH reference at createjson.pl line 80.
problematic line is $ref =\$$ref->{$_} foreach (@_);
After sleeping on it, I understood better where you were trying to go with this. Your issue is that you're trying to use some hash values as both arrays and hashes. There are two approaches to dealing with this: detect and handle it, or avoid it. The avoid code is much cleaner, so I'll show that: each node keeps its list of values under a reserved _ARRAY key, so the node itself always stays a hash.
As I mentioned in my original answer, I'm not sure what you had in mind for the 'Dumper' lines, but Data::Dump is probably a useful replacement for what you were using, with less complication than the Data::Dumper module that I was thinking you were somehow managing to use. I also chose to still provide a replacement for your file name regex, as I still don't want to bother with a full path name.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dump;
my @lines = `grep -R 'TEST-TAG =' foo`;
my $hash;
$| = 1; # keep STDOUT and STDERR together
#parse the lines which has pattern /\*TEST-TAG = ABC->DEF->fma->GHI\*/
foreach (@lines) {
print "line is $_\n";
my($all_cat) = $_ =~ /\=(.*)\*\//;
print "all cat is $all_cat\n";
my($testname) = $_ =~ /(?:.*\/)?(.*?)\./;
print "testname is $testname\n";
if ($all_cat ne "") {
$all_cat =~ s/ //g;
my @ts = split(',', $all_cat);
print "ts is @ts\n";
my $i;
foreach (@ts) {
my @allfeat = split('->',$_);
my $count = scalar @allfeat;
for ($i = 0; $i<$count; $i++) {
my @temparr = @allfeat[$i..$count-1];
print "temparr is @temparr\n";
push @temparr, $testname;
ToNestedHash($hash, @temparr);
}
}
}
}
sub ToNestedHash {
my $ref = \shift;
print "sandeep in ref ";
dd $ref;
my $h = $$ref;
print "sandeep h ";
dd $h;
my $value = pop;
print "sandeep value is $value\n";
print "sandeep array is @_\n";
print "refrence", ref($h), "\n";
foreach my $i (@_) {
print " before INDEX $i\n";
dd $ref;
$ref =\$$ref->{ $i };
print "after INDEX $i\n";
dd $ref;
}
$ref =\$$ref->{ _ARRAY };
if (!isinlist(\@{$$ref},$value)) {
push @{$$ref}, $value;
}
return $h;
}
# If element exists in the list
sub isinlist {
my ($aref, $key) = ($_[0], $_[1]);
foreach my $elem (@$aref){
if ($elem eq $key) {
return 1;
}
}
return 0;
}
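Stripped of the debug prints, the "avoid" idea can be sketched in a few lines. This is only an illustration under my own assumptions: the sub name add_to_nested and its argument order are made up, and only the reserved _ARRAY key comes from the code above.

```perl
use strict;
use warnings;

# Walk (and create as needed) a path of keys, then record $value in a list
# stored under the reserved key "_ARRAY". A node therefore always stays a
# hash and can hold sub-keys and values at the same time.
# add_to_nested is a hypothetical name used for this sketch only.
sub add_to_nested {
    my ($tree, $value, @path) = @_;
    my $node = $tree;
    for my $key (@path) {
        $node->{$key} //= {};        # create the level if it is missing
        $node = $node->{$key};
    }
    my $list = $node->{_ARRAY} //= [];
    push @$list, $value unless grep { $_ eq $value } @$list;
    return $tree;
}

my %tree;
add_to_nested(\%tree, 'test1', qw(ABC DEF fma GHI));
add_to_nested(\%tree, 'test2', qw(ABC DEF fma));
add_to_nested(\%tree, 'test2', qw(ABC DEF fma));   # duplicate, ignored

print "fma holds:     @{ $tree{ABC}{DEF}{fma}{_ARRAY} }\n";
print "fma->GHI holds: @{ $tree{ABC}{DEF}{fma}{GHI}{_ARRAY} }\n";
```

Here fma carries its own value list and also the sub-key GHI, which is exactly the situation that triggered "Not a HASH reference" in the question.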

Perl: Compare Two CSV Files and Print out matches (modifying this code)

I am very new at perl and had discovered the solution at:
Perl: Compare Two CSV Files and Print out differences
I have gone through dozens of other solutions and this comes closest, except that instead of finding the differences between 2 CSV files, I want to find where the second CSV file matches the first one in column and row. How could I modify the following script to find the matches in column/row instead of the differences. I am hoping to dissect this code and learn arrays from there, but wanted to find out the solution to this application. Much thanks.
use strict;
my @arr1;
my @arr2;
my $a;
open(FIL,"a.txt") or die("$!");
while (<FIL>)
{chomp; $a=$_; $a =~ s/[\t;, ]*//g; push @arr1, $a if ($a ne '');};
close(FIL);
open(FIL,"b.txt") or die("$!");
while (<FIL>)
{chomp; $a=$_; $a =~ s/[\t;, ]*//g; push @arr2, $a if ($a ne '');};
close(FIL);
my %arr1hash;
my %arr2hash;
my @diffarr;
foreach(@arr1) {$arr1hash{$_} = 1; }
foreach(@arr2) {$arr2hash{$_} = 1; }
foreach $a(@arr1)
{
if (not defined($arr2hash{$a}))
{
push @diffarr, $a;
}
}
foreach $a(@arr2)
{
if (not defined($arr1hash{$a}))
{
push @diffarr, $a;
}
}
print "Diff:\n";
foreach $a(@diffarr)
{
print "$a\n";
}
# You can print to a file instead, by: print FIL "$a\n";
ok, I realize that this was more what I was looking for:
use strict;
use warnings;
use feature qw(say);
use autodie;
use constant {
FILE_1 => "file1.txt",
FILE_2 => "file2.txt",
};
#
# Load Hash #1 with value from File #1
#
my %hash1;
open my $file1_fh, "<", FILE_1;
while ( my $value = <$file1_fh> ) {
chomp $value;
$hash1{$value} = 1;
}
close $file1_fh;
#
# Load Hash #2 with value from File #2
#
my %hash2;
open my $file2_fh, "<", FILE_2;
while ( my $value = <$file2_fh> ) {
chomp $value;
$hash2{$value} = 1;
}
close $file2_fh;
Now I want to search file2's hash to check if there are ANY matches from file1's hash. That is where I am stuck
With new code suggestion, code now looks like this
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use constant {
FILE_1 => "masterlist.csv",
FILE_2 => "pastebin.csv",
};
#
# Load Hash #1 with value from File #1
#
my %hash1;
open my $file1_fh, "<", FILE_1;
while ( my $value = <$file1_fh> ) {
chomp $value;
$hash1{$value} = 1;
}
close $file1_fh;
my %hash2;
open my $file2_fh, "<", FILE_2;
while ( my $value = <$file2_fh> ) {
chomp $value;
if ( $hash1{$value} ) {
print "Match found $value\n";
$hash2{$value}++;
}
}
close $file2_fh;
print "Matches found:\n";
foreach my $key ( keys %hash2 ) {
print "$key found $hash2{$key} times\n";
}
I updated one part with split() and it seems to work, but I have to test more to confirm whether it fits the solution I'm looking for or whether I have more work to do on it.
#
# Load Hash #1 with value from File #1
#
my %hash1;
open my $file1_fh, "<", FILE_1;
while ( my $value = <$file1_fh> ) {
chomp $value;
$hash1{$value} = ( %hash1, (split(/,/, $_))[1,2] );
}
close $file1_fh;
So, with your code there - you've read in 'file1' to a hash.
Why not instead of reading file 2 into a hash, do instead:
my %hash2;
open my $file2_fh, "<", FILE_2;
while ( my $value = <$file2_fh> ) {
chomp $value;
if ( $hash1{$value} ) {
print "Match found $value\n";
$hash2{$value}++;
}
}
close $file2_fh;
print "Matches found:\n";
foreach my $key ( keys %hash2 ) {
print "$key found $hash2{$key} times\n";
}
I think this code identifies every place that a data field in file A matches a data field in file B (at least it does on my limited test data):
use strict;
use warnings;
my @arr1;
my @arr2;
# a.txt -> @arr1
my $file_a_name = "poster_a.txt";
open(FIL,$file_a_name) or die("$!");
my $a_line_counter = 0;
while (my $a_line = <FIL>)
{
$a_line_counter = $a_line_counter + 1;
chomp($a_line);
my @fields = (split /,/,$a_line);
my $num_fields = scalar(@fields);
s{^\s+|\s+$}{}g foreach @fields;
push @arr1, \@fields if ( $num_fields != 0);
};
close(FIL);
my $file_b_name = "poster_b.txt";
open(FIL,$file_b_name) or die("$!");
while (my $b_line = <FIL>)
{
chomp($b_line);
my @fields = (split /,/,$b_line);
my $num_fields = scalar(@fields);
s{^\s+|\s+$}{}g foreach @fields;
push @arr2, \@fields if ( $num_fields != 0)
};
close(FIL);
# b.txt -> @arr2
#print "\n",@arr2, "\n";
my @match_array;
my $file_a_line_ctr = 1;
foreach my $file_a_line_fields (@arr1)
{
my $file_a_column_ctr = 1;
foreach my $file_a_line_field (@{$file_a_line_fields})
{
my $file_b_line_ctr = 1;
foreach my $file_b_line_fields(@arr2)
{
my $file_b_column_ctr = 1;
foreach my $file_b_field (@{$file_b_line_fields})
{
if ( $file_b_field eq $file_a_line_field )
{
my $match_info =
"$file_a_name line $file_a_line_ctr column $file_a_column_ctr" .
" (${file_a_line_field}) matches: " .
"$file_b_name line $file_b_line_ctr column $file_b_column_ctr ";
push(@match_array, $match_info);
print "$match_info \n";
}
$file_b_column_ctr = $file_b_column_ctr + 1;
}
$file_b_line_ctr = $file_b_line_ctr + 1;
}
$file_a_column_ctr = $file_a_column_ctr + 1;
}
$file_a_line_ctr = $file_a_line_ctr + 1;
}
print "there were ", scalar(@match_array)," matches\n";
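The four nested loops compare every field of one file with every field of the other. For larger files, a possible alternative is to index each file by field value first and then look values up. The following is a rough sketch, not a drop-in replacement: index_fields and find_matches are names I made up, the input is passed as lists of lines instead of being read from files, and a plain split on commas is assumed to be good enough for the data (no quoted fields).

```perl
use strict;
use warnings;

# Map each field value to the [line, column] positions where it occurs.
# (Hypothetical helper; assumes simple comma-separated lines.)
sub index_fields {
    my @lines = @_;
    my %where;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        my $col_no = 0;
        for my $field (split /,/, $line) {
            $col_no++;
            (my $value = $field) =~ s/^\s+|\s+$//g;   # trim a copy
            push @{ $where{$value} }, [ $line_no, $col_no ] if length $value;
        }
    }
    return \%where;
}

# Report every position in A whose value also occurs somewhere in B.
sub find_matches {
    my ($a_lines, $b_lines) = @_;
    my $a_index = index_fields(@$a_lines);
    my $b_index = index_fields(@$b_lines);
    my @matches;
    for my $value (sort keys %$a_index) {
        next unless $b_index->{$value};
        for my $pa (@{ $a_index->{$value} }) {
            for my $pb (@{ $b_index->{$value} }) {
                push @matches,
                    "A $pa->[0]:$pa->[1] ($value) matches B $pb->[0]:$pb->[1]";
            }
        }
    }
    return @matches;
}

my @matches = find_matches([ 'x, y', 'z, w' ], [ 'q, z', 'y, y' ]);
print "$_\n" for @matches;
print "there were ", scalar(@matches), " matches\n";
```

The positions are reported as line:column, both counted from 1 as in the code above.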

Degeneracy of characters when searching for a specific sub-string

I have the following script which searches for specified substrings within an input string (a DNA sequence). I was wondering if anybody could help out with being able to specify degeneracy of specific characters. For example, instead of searching for GATC (or anything consisting solely of G's, T's, A's and C's), I could instead search for GRTNA where R = A or G and where N = A, G, C or T. I would need to be able to specify quite a few of these in a long list within the script. Many thanks for any help or tips!
use warnings;
use strict;
#User Input
my $usage = "Usage (OSX Terminal): perl <$0> <FASTA File> <Results Directory + Filename>\n";
#Reading formatted FASTA/FA files
sub read_fasta {
my ($in) = @_;
my $sequence = "";
while(<$in>) {
my $line = $_;
chomp($line);
if($line =~ /^>/){ next }
else { $sequence .= $line }
}
return(\$sequence);
}
#Scanning for restriction sites and length-output
open(my $in, "<", shift);
open(my $out, ">", shift);
my $DNA = read_fasta($in);
print "DNA is: \n $$DNA \n";
my $len = length($$DNA);
print "\n DNA Length is: $len \n";
my @pats=qw( GTTAAC );
for (@pats) {
my $m = () = $$DNA =~ /$_/gi;
print "\n Total DNA matches to $_ are: $m \n";
}
my $pat=join("|",@pats);
my @cutarr = split(/$pat/, $$DNA);
for (@cutarr) {
my $len = length($_);
print $out "$len \n";
}
close($out);
close($in);
GRTNA would correspond to the pattern G[AG]T[AGCT]A.
It looks like you could do this by writing
for (@pats) {
s/R/[AG]/g;
s/N/[AGCT]/g;
}
before
my $pat = join '|', @pats;
my @cutarr = split /$pat/, $$DNA;
but I'm not sure I can help you with the requirement that "I would need to be able to specify quite a few of these in a long list within the script". I think it would be best to put your sequences in a separate text file rather than embed the list directly into the program.
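If more codes than R and N are needed, one option is to keep the whole mapping in a hash and translate each site in a single pass. The sketch below assumes the standard IUPAC nucleotide ambiguity codes; the names %IUPAC and site_to_regex are my own, and the test sequence is made up.

```perl
use strict;
use warnings;

# IUPAC nucleotide ambiguity codes mapped to regex character classes.
my %IUPAC = (
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Turn a site such as "GRTNA" into a compiled, case-insensitive regex.
# site_to_regex is a hypothetical helper name.
sub site_to_regex {
    my ($site) = @_;
    (my $pattern = uc $site) =~ s/([RYSWKMBDHVN])/$IUPAC{$1}/g;
    return qr/$pattern/i;
}

my $re    = site_to_regex('GRTNA');            # G[AG]T[ACGT]A
my $count = () = 'GATCAGGTTAccgGTGA' =~ /$re/g;
print "GRTNA matches $count time(s)\n";
```

The compiled regexes can be joined or used one at a time in place of the plain strings in the pattern list.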
By the way, wouldn't it be simpler just to
return $sequence
from your read_fasta subroutine? Returning a reference just means you have to dereference it everywhere with $$DNA. I suggest that it should look like this
sub read_fasta {
my ($fh) = @_;
my $sequence;
while (<$fh>) {
unless (/^>/) {
chomp;
$sequence .= $_;
}
}
return $sequence;
}

Perl compare individual elements of two arrays

I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both files, and only if they are equal must column B of FILE 2 be printed.
So far I have the following code but all I get is an infinite loop with # from column B
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my @firstcol;
my @secondcol;
my @thirdcol;
while ( <FILE> )
{
next if $. <2;
chomp;
my @cols = split;
push @firstcol, $cols[0];
push @secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}
my @firstcol2;
my @secondcol2;
my @thirdcol2;
while ( <FILE2> )
{
next if $. <2;
my @cols2 = split(/\t/, $_);
push @firstcol2, $cols2[0];
push @secondcol2, $cols2[4]."\n";
}
my $size = @firstcol;
my $size2 = @firstcol2;
for (my $i = 0; $i <= @firstcol ; $i++) {
for (my $j = 0; $j <= @firstcol2; $j++) {
if ( $firstcol[$i] eq $firstcol2[$j] )
{
print $secondcol2[$i];
}
}
}
my (@first, @second);
# "first" and "second" are filehandles opened on the two files
while(<first>){
chomp;
# split in list context yields the key (column A) and the value (column B)
push @first, split ' ', $_;
}
while(<second>){
chomp;
push @second, split ' ', $_;
}
my %first = @first;
my %second = @second;
This builds a hash of the first file as %first and of the second file as %second, with the first column as key and the second column as value.
for(keys %first)
{
print $second{$_} if exists $second{$_}
}
I couldn't check it as I am on mobile. hope that gives you an idea.
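A slightly fuller version of the hash idea, as a sketch only. The sub name read_pairs is made up, the two files are replaced by in-memory strings with invented data, and the assumptions are that each file has one header line and that column A values are unique within a file.

```perl
use strict;
use warnings;

# Read "A B" rows (header skipped) into a hash: column A => column B.
# read_pairs is a hypothetical helper name.
sub read_pairs {
    my ($fh) = @_;
    my %pairs;
    my $header = <$fh>;                  # discard the header line
    while (my $line = <$fh>) {
        my ($key, $value) = split ' ', $line;
        $pairs{$key} = $value if defined $value;
    }
    return \%pairs;
}

# In-memory stand-ins for the two files (made-up data).
my $file1 = "A B\n1 x\n2 y\n3 z\n";
my $file2 = "A B\n3 p\n4 q\n2 r\n";
open my $fh1, '<', \$file1 or die "Cannot open in-memory file 1: $!";
open my $fh2, '<', \$file2 or die "Cannot open in-memory file 2: $!";
my ($first, $second) = (read_pairs($fh1), read_pairs($fh2));

# Column B of file 2 for every column-A key that file 1 also has.
my @out = map  { $second->{$_} }
          grep { exists $second->{$_} }
          sort keys %$first;
print "$_\n" for @out;
```

Unlike the index-by-index comparison, this matches keys regardless of the row they appear on, so it only fits if that is what the comparison is meant to do.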
I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have a nested loop that you don't need. Simplify your last loop like this:
for my $i (0..$#firstcol) {
if ( $firstcol[$i] eq $firstcol2[$i] )
{
print $secondcol2[$i];
}
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional Note: You aren't chomping your data in the second file loop while ( <FILE2> ). That might introduce a bug later.
If your files are called file1.txt and file2.txt, the following:
use Modern::Perl;
use Path::Class;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);
for my $line1 (@{$files->{1}}) {
my $line2 = shift @{$files->{2}};
say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
Column 1 is equal only on the lines A and 5.
Without the CPAN modules, this produces the same result:
use strict;
use warnings;
my $files;
@{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(@ARGV)="file$_.txt";<> } for (1..2);
for my $line1 (@{$files->{1}}) {
my $line2 = shift @{$files->{2}};
print $line2->[1],"\n" if ($line1->[0] eq $line2->[0]);
}

Find multiple substrings in strings and record location

The following is the script for finding consecutive substrings in strings.
use strict;
use warnings;
my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");
#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";
my $sequence;
while (my $line = <DAT>) {
if ($line=~ /(HDWFLSFKD)/g){
{
print "its found index location: ",
pos($line), "-", pos($line)+length($1), "\n";
}
if ($line=~ /(HD)/g){
print "motif found and its locations is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
if ($line=~ /(K)/g){
print "motif found and its location is: \n";
pos($line), "-",pos($line)+length($1), "\n\n";
}
if ($line=~ /(DD)/g){
print "motif found and its location is: \n";
pos($line), "-", pos($line)+length($1), "\n\n";
}
}else {
$sequence .= $line;
print "came in else\n";
}
}
It matches substring1 with string and prints out position where substring1 matched. The problem lies in finding the rest of the substrings. For substrings2 it starts again from the beginning of the string (instead of starting from the position where substring1 was found). The problem is that every time it calculates position it starts from the beginning of string instead of starting from the position of the previously found substring. Since substrings are consecutive substring1, substring2, substring3, substring4, their positions have to occur after the previous respectively.
Try this perl program
use strict;
use warnings;
use feature qw'say';
my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");
my @regex = qw(
HDWFLSFKD
HD
K
DD
);
my $sequence;
while( my $line = <$dat> ){
chomp $line;
say 'Line: ', $.;
# reset the position of variable $line
# pos is an lvalue subroutine
pos $line = 0;
for my $regex ( @regex ){
$regex = quotemeta $regex;
if( scalar $line =~ / \G (.*?) ($regex) /xg ){
say $regex, ' found at location (', $-[2], '-', $+[2], ')';
if( $1 ){
say " but skipped: \"$1\" at location ($-[1]-$+[1])";
}
}else{
say 'Unable to find ', $regex;
# end loop
last;
}
}
}
I'm not a Perl expert, but you can use the @- and @+ arrays (as $-[0] and $+[0]) to track the index location of the last regex match found.
Below is code built on top of your code that explains this.
use strict;
use warnings;
my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");
open (OUTPUTFILE, '>data.txt');
my $sequence;
my $someVar = 0;
my $sequenceNums = 1;
my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";
while (my $line = <DAT>)
{
$someVar = 0;
print "\nSequence $sequenceNums: $line\n";
print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
if ($line=~ /$motif1/g)
{
&printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
$someVar = 1;
}
if ($line=~ /$motif2/g and $someVar == 1)
{
&printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
$someVar = 2;
}
if ($line=~ /$motif3/g and $someVar == 2)
{
&printStuff($sequenceNums, "motif3", $motif3, "$-[0]-$+[0]");
$someVar = 3;
}
if ($line=~ /$motif4/g and $someVar == 3)
{
&printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
}
else
{
$sequence .= $line;
if ($someVar == 0)
{
&printWrongStuff($sequenceNums, "motif1", $motif1);
}
elsif ($someVar == 1)
{
&printWrongStuff($sequenceNums, "motif2", $motif2);
}
elsif ($someVar == 2)
{
&printWrongStuff($sequenceNums, "motif3", $motif3);
}
elsif ($someVar == 3)
{
&printWrongStuff($sequenceNums, "motif4", $motif4);
}
}
$sequenceNums++;
}
sub printStuff
{
print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] index location: $_[3]\n";
}
sub printWrongStuff
{
print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";
}
close (OUTPUTFILE);
close (DAT);
Sample input:
MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKHDWFLSFKDQNHSIKDIFNRFNHYIYNDL
You really should read
perldoc perlre
perldoc perlreref
perldoc perlretut
You need the special variables @- and @+ if you need the positions. No need to try to compute them yourself.
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw( each_array );
my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );
if ( $source =~ /$pattern/ ) {
my $it = each_array @-, @+;
$it->(); # discard overall match information;
while ( my ($start, $end) = $it->() ) {
printf "Start: %d - Length: %d\n", $start, $end - $start;
}
}
Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5
In list context, the result of a construct like
$line=~ /(HD)/g
is a list of all the matches; in scalar context, each evaluation finds the next match. Use while to step through the hits.
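Stepping through the hits with while could look roughly like this. It is a small sketch: the sub name motif_spans is made up, and the sequence is a shortened piece of the sample input. The offsets come from $-[0] and $+[0], so nothing has to be computed from pos() and length().

```perl
use strict;
use warnings;

# Collect the start and end offsets of every occurrence of $motif.
# motif_spans is a hypothetical helper name.
sub motif_spans {
    my ($sequence, $motif) = @_;
    my @spans;
    while ($sequence =~ /\Q$motif\E/g) {
        push @spans, [ $-[0], $+[0] ];   # start and end offset of this hit
    }
    return @spans;
}

my @spans = motif_spans('MLTSHQKKFHDWFLSFKDSNNYNHDSKQ', 'HD');
print "HD found at $_->[0]-$_->[1]\n" for @spans;
```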
To match where the last match left off, use \G. perldoc perlre says (but consult your own installation's version's manual first):
The "\G" assertion can be used to
chain global matches (using "m//g"),
as described in "Regexp Quote-Like
Operators" in perlop. It is also
useful when writing "lex"-like
scanners, when you have several
patterns that you want to match
against consequent substrings of your
string, see the previous reference.
The actual location where "\G" will
match can also be influenced by using
"pos()" as an lvalue: see "pos" in
perlfunc. Note that the rule for
zero-length matches is modified
somewhat, in that contents to the left
of "\G" is not counted when
determining the length of the match.
Thus the following will not match
forever:
$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
print $&;
}
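Applied to the question, the "continue where the last match left off" part can be sketched without \G at all, because m//g in scalar context already resumes from pos(). Adding \G would additionally anchor the next match exactly at that position, which is not wanted here since other residues may sit between the motifs. The sub name chain_motifs is my own, and the sequence is a shortened version of the first sample line.

```perl
use strict;
use warnings;

# Find each motif in order, each search starting where the previous hit ended.
# Returns a list of [motif, start, end]; stops at the first motif not found.
# chain_motifs is a hypothetical helper name.
sub chain_motifs {
    my ($sequence, @motifs) = @_;
    my @found;
    pos($sequence) = 0;
    for my $motif (@motifs) {
        # /g in scalar context resumes from pos(); /c keeps pos() on failure.
        last unless $sequence =~ /\Q$motif\E/gc;
        push @found, [ $motif, $-[0], $+[0] ];
    }
    return @found;
}

my @found = chain_motifs('MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDD',
                         qw(HDWFLSFKD HD K DD));
printf "%s at %d-%d\n", @$_ for @found;
```

Note that the second motif HD is reported at its occurrence after the first motif, not at the HD that starts HDWFLSFKD.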