match and print each selection from array - perl

I have a script which collects configuration data from a database and stores it in an array. I need to print find each interface and print it's config detail for a device. I will only post the parts that I need help with.
So firstly, here is an extract from the array:
track 2 interface GigabitEthernet1/6 line-protocol
!
!
!
!
!
interface Port-channel10.1
description Enclosure 1 - 3040-1b
switchport
switchport trunk encapsulation dot
switchport trunk allowed vlan 200, 202
switchport mode trunk
logging event link-status
logging event trunk-status
logging event bundle-status
logging event spanning-tree status
shutdown
interface Ethernet1/20
description tx to something
switchport mode trunk
switchport trunk allowed vlan 200-300
spanning-tree guard root
speed 1000
duplex full
interface Bundle-Ether2.1
description Bundle link to something
service-policy input qos-pol1
vrf V17:vodanet
ipv4 address 10.1.1.1 255.0.0.0
encapsulation 250
interface Ethernet1/1
description some interface desc
switchport mode trunk
switchport trunk allowed vlan 200-299
spanning-tree guard root
duplex full
no negotiate auto
clock timezone GMT 2 0
line console
terminal length 48
...
So each interface does not necessarily have the same amount of config detail following it. I basically need to take each of these interface configurations and print them individually. I will be storing these in smaller config-let flat files. So I need to print from interface ... till the last item before the next interface, but I am lost in searching. This is my latest attempt, but only works if I search for 2 different start/end strings, like so:
foreach (#array) {
if (/^interface/) {
$counter = 1;
} elsif (/duplex/) {
$counter =0;
} elsif ($counter) {
print;
}
}
The problem here is I am missing the first and last string, and not all interfaces end with duplex. I was thinking of using the beginning of line interface and then each double white space delimited items after but unsure how. Can someone please help me in finding a solution.
Edit
So to clarify array it is simply lines grabbed from a database and pushed to #array

The problem: An array (#lines) has "interface sections" which start with an /^interface/ line and run until the next interface line. The last section stops at the first unindented line. Parse interface sections from the array.
One way: Find indices of each interface line in the array #lines. Then elements of #lines between successive indices are the interface sections. The last one is found separately.
my #idx = grep { $lines[$_] =~ /^interface/ } 0..$#lines;
for my $i (0..$#idx-1) {
say "Interface:";
say "\t$_" for #lines[$idx[$i]..$idx[$i+1]-1];
}
# The last interface section stops at the first unindented line
say "Interface:";
for my $i ($idx[-1] .. $#lines) {
last if $lines[$i] =~ /^\S/ and $lines[$i] !~ /^interface/;
say "\t$lines[$i]";
}
The last interface segment goes until the first unindented line, as clarified in comments.
Prints "Interface:" are there only to visually distinguish sections; the lines for each interface can be added to an arrayref instead of printing, for example. Tested on the posted sample.
Another way: Iterate over #lines and for each /^interface/ line add a new arrayref to the array with all sections. Then the lines are added to that, last, arrayref. Exit on the first unindented line which isn't /^interface/
The program gets lines from the submitted file (or lines.txt), with the question's sample.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);
use Path::Tiny;
my $file = shift #ARGV or 'lines.txt';
my #lines = path($file)->lines({chomp=>1});
my #if_sections; # store arrayrefs with lines for each interface
for (#lines) {
if (/^interface/) {
push #if_sections, [ $_ ]; # add arrayref for new interface
next;
}
elsif (/^\S/) { last }
push #{$if_sections[-1]}, $_; # add lines to the last arrayref
}
dd \#if_sections;
Uses the handy Path::Tiny to read the file. The module runs checks and croaks on failures.
The dd from Data::Dump is used to easily show the data structure, for convenience. This is an array with elements that are array references, so to work with it
foreach my $iface (#if_sections) {
say "Interface section:";
foreach my $line (#$iface) {
say $line;
}
}
what can be written more compactly in a number of ways.
See the tutorial perlreftut and
the complex data structure cookbook perldsc.

Related

Program argument is 100 but returns the value as 0100

Right now I am trying to do an assignment where I have to
- Extract information from an HTML file
- Save it to a scalar
- Run a regular expression to find the number of seats available in the designated course (the program argument is the course number for example 100 for ICS 100)
- If the course has multiple sessions, I have to find the sum of the seats available and print
- The output is just the number of seats available
The problem here is that when I was debugging and checking to make sure that my variable I have the program arg saved to was storing the correct value, it was storing the values with an extra 0 behind it.
ex.) perl filename.pl 100
ARGV[0] returns as 0100
I've tried storing the True regular expression values to an array, saving using multiple scalar variables, and changing my regular expression but none worked.
die "Usage: perl NameHere_seats.pl course_number" if (#ARGV < 1);
# This variable will store the .html file contents
my $fileContents;
# This variable will store the sum of seats available in the array #seatAvailable
my $sum = 0;
# This variable will store the program argument
my $courseNum = $ARGV[0];
# Open the file to read contents all at once
open (my $fh, "<", "fa19_ics_class_availability.html") or die ("Couldn't open 'fa19_ics_class_availability.html'\n");
# use naked brakets to limit the $/
{
#use local $/ to get <$fh> to read the whole file, and not one line
local $/;
$fileContents = <$fh>;
}
# Close the file handle
close $fh;
# Uncomment the line below to check if you've successfully extracted the text
# print $fileContents;
# Check if the course exists
die "No courses matched...\n" if ($ARGV[0] !~ m/\b(1[0-9]{2}[A-Z]?|2[0-8][0-9][A-Z]?|29[0-3])[A-Z]?\b/);
while ($fileContents =~ m/$courseNum(.+?)align="center">(\d)</) {
my $num = $2;
$sum = $sum + $num;
}
print $sum;
# Use this line as error checking to make sure #ARGV[0] is storing proper number
print $courseNum;
The current output I am receiving when program argument is 100 is just 0, and I assume it's because the regular expression is not catching any values as true therefore the sum remains at a value of 0. The output should be 15...
This is a link to the .html page > https://laulima.hawaii.edu/access/content/user/emeyer/ics/215/FA19/01/perl/fa19_ics_class_availability.html
You're getting "0100" because you have two print() statements.
print $sum;
...
print $courseNum;
And because there are no newlines or other output between them, you get the two values printed out next to each other. $sum is '0' and $courseNum is '100'.
So why is $sum zero? Well, that's because your regex isn't picking up the data you want it to match. Your regex looks like this:
m/$courseNum(.+?)align="center">(\d)</
You're looking for $courseNum followed by a number of other characters, followed by 'align="center">' and then your digit. This doesn't work for a number of reasons.
The string "100" appears many times in your text. Many times it doesn't even mean a course number (e.g. "100%"). Perhaps you should look for something more precise (ICS $coursenum).
The .+? doesn't do what you think it does. The dot doesn't match newline characters unless you use the /s option on the match operator.
But even if you fix those first two problems, it still won't work as there are a number of numeric table cells for each course and you're doing nothing to ensure that you're grabbing the last one. Your current code will get the "Curr. Enrolled" column, not the "Seats Avail" one.
This is a non-trivial HTML parsing problem. It shouldn't be addressed using regexes (HTML should never be parsed using regexes). You should look at one of the HTML parsing modules from CPAN - I think I'd use Web::Query.
Update: An example solution using Web::Query:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use File::Basename;
use Web::Query;
my $course_num = shift
or die 'Usage: perl ' . basename $0 . " course_number\n";
my $source = 'fa19_ics_class_availability.html';
open my $fh, '<', $source
or die "Cannot open '$source': $!\n";
my $html = do { local $/; <$fh> };
my $count_free;
wq($html)
# Get each table row in the table
->find('table.listOfClasses tr')
->each(sub {
my ($i, $elem) = #_;
my #tds;
# Get each <td> in the <tr>
$elem->find('td')->each(sub { push #tds, $_[1] });
# Ignore rows that don't have 13 columns
return if #tds != 13;
# Ignore rows that aren't about the right course
return if $tds[2]->text ne "ICS $course_num";
# Add the number of available places
$count_free += $tds[8]->text;
});
say $count_free;

delete previous and next lines in perl

I have the following file:
#TWEETY:150:000000000-ACFKE:1:2104:27858:17965
AAATTAGCAAAAAACAATAACAAAACTGGGAAAATGCAATTTAACAACGAAAATTTTCCGAGAACTTGAAAGCGTACGAAAACGATACGCTCC
+
D1FFFB11FDG00EE0FFFA1110FAA1F/ABA0FGHEGDFEEFGDBGGGGFEHBFDDG/FE/EGH1#GF#F0AEEEEFHGGFEFFCEC/>EE
#TWEETY:150:000000000-ACFKE:1:1105:22044:20029
AAAAAATATTAAAACTACGAATGCATAAATTATTTCGTTCGAAATAAACTCACACTCGTAACATTGAACTACGCGCTCC
+
CCFDDDFGGGGGGGGGGHGGHHHHGHHHHHHHHHHHHHHHGHHGHHHHHHHHHHHHHGHGHGGHHHHHHGHHEGGGGGG
#TWEETY:150:000000000-ACFKE:1:2113:14793:7182
TATATAAAGCGAGAGTAGAAACTTTTTAATTGACGCGGCGAGAAAGTATATAGCAACAAGCGAGCACCCGCTCC
+
BBFFFFFGGGGFFGGFGHHHHHHHHHHHHHHHHHGGAEEEAFGGGHHFEGHHGHHHHHGHHGGGGFHHGG?EEG
#TWEETY:150:000000000-ACFKE:1:2109:5013:22093
AAAAAAATAATTCATATCGCCATATCGACTGACAGATAATCTATCTATAATCATAACTTTTCCCTCGCTCC
+
DAFAADDGF1EAGG3EG3A00ECGDFFAEGFCHHCAGHBGEAGBFDEDGGHBGHGFGHHFHHHBDG?/FA/
#TWEETY:150:000000000-ACFKE:1:2106:25318:19875
+
CCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
The lines are in groups of four (each time there is a name, starting with #TWEETY, a string of letters, a + character, and another string of letters).
The second and fourth lines should have the same number of characters.
But there are cases where the second line is empty, as in the last four lines.
In these cases, I would like to get rid of the whole block (the previous line before the empty line and the next two lines).
I have just started perl and have been trying to write a script for my problem, but am having a hard time. Does anyone have some feedback?
Thanks!
Keep an array buffer of the last four lines. When it's full, check the second line, print the lines or not, empty the buffer, repeat.
#!/usr/bin/perl
use warnings;
use strict;
my #buffer;
sub output {
print #buffer unless 1 == length $buffer[1];
#buffer = ();
}
while (<>) {
if (4 == #buffer) {
output();
}
push #buffer, $_;
}
output(); # Don't forget to process the last four lines.
Yes. Start with looking at $/ and set it so you can work on a chunk at a time. I would suggest you can treat # as a record separator in your example.
Then iterate your records using a while loop. E.g. while ( <> ) {
Use split on \n to turn the current chunk into an array of lines.
Perform your test on the appropriate lines, and either print - or not - depending on whether it passed.
If you get stuck with that, then I'm sure a specific question including your code and where you're having problems will be well received here.
If you chunk the data correctly, this becomes almost trivial.
#!/usr/bin/perl
use strict;
use warnings;
# Use '#TWEETY' as the record separator to make it
# easy to chunk the data.
local $/ = '#TWEETY';
while (<DATA>) {
# The first entry will be empty (as the separator
# is the first thing in the file). Skip that record.
next unless /\S/;
# Skip any records with two consecutive newlines
# (as they will be the ones with the empty line 2)
next if /\n\n/;
# Print the remaining records
# (with $/ stuck back on the front)
print "$/$_";
}
__DATA__
#TWEETY:150:000000000-ACFKE:1:2104:27858:17965
AAATTAGCAAAAAACAATAACAAAACTGGGAAAATGCAATTTAACAACGAAAATTTTCCGAGAACTTGAAAGCGTACGAAAACGATACGCTCC
+
D1FFFB11FDG00EE0FFFA1110FAA1F/ABA0FGHEGDFEEFGDBGGGGFEHBFDDG/FE/EGH1#GF#F0AEEEEFHGGFEFFCEC/>EE
#TWEETY:150:000000000-ACFKE:1:1105:22044:20029
AAAAAATATTAAAACTACGAATGCATAAATTATTTCGTTCGAAATAAACTCACACTCGTAACATTGAACTACGCGCTCC
+
CCFDDDFGGGGGGGGGGHGGHHHHGHHHHHHHHHHHHHHHGHHGHHHHHHHHHHHHHGHGHGGHHHHHHGHHEGGGGGG
#TWEETY:150:000000000-ACFKE:1:2113:14793:7182
TATATAAAGCGAGAGTAGAAACTTTTTAATTGACGCGGCGAGAAAGTATATAGCAACAAGCGAGCACCCGCTCC
+
BBFFFFFGGGGFFGGFGHHHHHHHHHHHHHHHHHGGAEEEAFGGGHHFEGHHGHHHHHGHHGGGGFHHGG?EEG
#TWEETY:150:000000000-ACFKE:1:2109:5013:22093
AAAAAAATAATTCATATCGCCATATCGACTGACAGATAATCTATCTATAATCATAACTTTTCCCTCGCTCC
+
DAFAADDGF1EAGG3EG3A00ECGDFFAEGFCHHCAGHBGEAGBFDEDGGHBGHGFGHHFHHHBDG?/FA/
#TWEETY:150:000000000-ACFKE:1:2106:25318:19875
+
CCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGGGGGGFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
Thanks everyone for the feedback!
It was all really useful. Thanks to your suggestions, I explored all the options and learned the unless statement.
The easiest solution given my existing code, was just to add an unless statement at the end.
### Write to output, but remove non-desired Gs
open OUT, ">$outfile";
my #accorder = #{$store0{"accorder"}};
foreach my $acc (#accorder){
# retrieve seq(2nd line) and qual(4th line)
my $seq = $store0{$acc}{"seq"};
my $qual = $store0{$acc}{"qual"};
# clean out polyG at end
$seq =~ s/G{3,}.{0,1}$//;
my $lenseq = length($seq);
my $lenqual = length($qual);
my $startqual = $lenqual - $lenseq;
$qual = substr($qual, 0, $lenseq);
#the above was in order to remove multiple G characters at the end of the
#second line, which is what led to empty lines (lines that were made up of
#only Gs got cut out)
# print to output, unless sequence has become empty
unless($lenseq == 0){ #this is the unless statement I added
print OUT "\#$acc\n$seq\n+\n$qual\n";
}
}
close(OUT);

Using Perl to find and fix errors in CSV files

I am dealing with very large amounts of data. Every now and then there is a slip up. I want to identify each row with an error, under a condition of my choice. With that I want the row number along with the line number of each erroneous row. I will be running this script on a handful of files and I will want to output the report to one.
So here is my example data:
File_source,ID,Name,Number,Date,Last_name
1.csv,1,Jim,9876,2014-08-14,Johnson
1.csv,2,Jim,9876,2014-08-14,smith
1.csv,3,Jim,9876,2014-08-14,williams
1.csv,4,Jim,9876,not_a_date,jones
1.csv,5,Jim,9876,2014-08-14,dean
1.csv,6,Jim,9876,2014-08-14,Ruzyck
Desired output:
Row#5,4.csv,4,Jim,9876,not_a_date,jones (this is an erroneous row)
The condition I have chosen is print to output if anything in the date field is not a date.
As you can see, my desired output contains the line number where the error occurred, along with the data itself.
After I have my output that shows the lines within each file that are in error, I want to grab that line from the untouched original CSV file to redo (both modified and original files contain the same amount of rows). After I have a file of these redone rows, I can omit and clean up where needed to prevent interruption of an import.
Folder structure will contain:
Modified: 4.txt
Original: 4.csv
I have something started here, written in Perl, which by the logic will at least return the rows I need. However I believe my syntax is a little off and I do not know how to plug in the other subroutines.
Code:
$count = 1;
while (<>) {
unless ($F[4] =~ /\d+[-]\d+[-]\d+/)
print "Row#" . $count++ . "," . "$_";
}
The code above is supposed to give me my erroneous rows, but to be able to extract them from the originals is beyond me. The above code also contains some syntax errors.
This will do as you ask.
Please be certain that none of the fields in the data can ever contain a comma , otherwise you will need to use Text::CSV to process it instead of just a simple split.
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'example.csv';
<$fh>; # Skip header
while (<$fh>) {
my #fields = split /,/;
if( $fields[4] !~ /^\d{4}-\d{2}-\d{2}$/ ) {
print "Row#$.,$_";
}
}
output
Row#5,4.csv,4,Jim,9876,not_a_date,jones
Update
If you want to process a number of files then you need this instead.
The close ARGV at the end of the loop is there so that the line counter $. is reset to
1 at the start of each file. Without it it just continues from 1 upwards across all the files.
You would run this like
rob#Samurai-U:~$ perl findbad.pl *.csv
or you could list the files individually, separated by spaces.
For the test I have created files 1.csv and 2.csv which are identical to your example data except that the first field of each line is the name of the file containing the data.
You may not want the line in the output that announces each file name, in which case you should replace the entire first if block with just next if $. == 1.
use strict;
use warnings;
#ARGV = map { glob qq{"$_"} } #ARGV; # For Windows
while (<>) {
if ($. == 1) {
print "\n\nFile: $ARGV\n\n";
next;
}
my #fields = split /,/;
unless ( $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
printf "Row#%d,%s", $., $_;
}
close ARGV if eof ARGV;
}
output
File: 1.csv
Row#5,1.csv,4,Jim,9876,not_a_date,jones
File: 2.csv
Row#5,2.csv,4,Jim,9876,not_a_date,jones

Value not stored in file <Perl>

I am new to perl. My motive is to read some value from a txt file and use it in my perl script.
The text file is something like this ::
Servers ::
(local)
Tomas-Server1
Tomas-Server2
.........**
What i need to do is to get the 3rd line value (Tomas-Server2) and use it in my perl script. basically its calling the 3rd value to the perl script.
I have written a basic code for it ::
my($ServName1,$ServName,$ServName3) = getservername($servername);
my ($filenam) = 'data.txt';
my #Param = ();
open(INFILE,"<$filenam") or die "Couldn't open $filenam for reading\n";
while(<INFILE>) {
chop($_);
push(#Param,$_);
}
close(INFILE);
return #Param;
}
But when i try to use the "$ServName2" value , it does not return anything. I guess the value contained should be "(local)" for it.
There a few problems with you example data, but I think I understand what you are trying to do. Assuming that you want the third server value (and not just line #3) you would probably want to have code that looks something like this...
#!perl
#some includes
use strict;
use warnings;
use Data::Dumper;
#set some vars
my $TargetFileName="data_file.txt";
my $TargetServerNumber=3;
#call the sub
my #ServerArray=GetServerName($TargetFileName);
#did we get anything back?
if(#ServerArray) {
#yep, we got some content
print "Server number ".$TargetServerNumber."/".($#ServerArray+1)." is \"".$ServerArray[$TargetServerNumber-1]."\"\n"; #array numbers are 0-based
} else {
#nope, we read the file but didn't any matching content
print "No servers found in file \"".$TargetFileName."\".\n";
} #end if
print "\n";
print "Here's what was loaded into \#ServerArray...\n";
print Data::Dumper::Dumper(#ServerArray); #so you can see the full content of #ServerArray
print "All Done\n";
exit;
#----- subs go here -----
sub GetServerName {
my $DataFileName=shift; #pull in the first arg - this alters #_ for the rest of the sub, so take care when doing this
my #ReturnArray;
#do some QA
if(!$DataFileName) {die "You need to provide a file name to use this sub.\n"} #do we have a file name?
if(!stat($DataFileName)) {die "The requested file name of \"".$DataFileName."\" does not exist.\n";} #does the file exist?
open(INFILE, "<".$DataFileName) or die "Unable to open \"".$DataFileName."\" - ".$!;
#ok, read the file content
while(my $Line=<INFILE>) {
chop($Line);
$Line=~s/^\s+//g; #remove leading white spaces
$Line=~s/\s+$//g; #remove trailing white spaces
if(!$Line) {next;} #blank line, skip it
#check for the headers
if($Line=~m/^Servers/) {next;} #skip lines beginning with "Servers"
if($Line=~m/^\(local\)/) {next;} #skip lines beginning with "(local)"
#if we get here, we must want the line content, so add it to the array
push(#ReturnArray, $Line);
} #end while
close(INFILE);
#send the array data back, if any
return #ReturnArray
} #end GetServerName sub
__END__
stuff below here does not need to be commented
I admit this is not the best way to approach the problem, but like most Perl hacks, it works. You'll notice the code is a bit overkill and does validation checks before passing data to some operations - get into the habit of doing this in any language you work in. This will help you return more meaningful errors when things go wrong and will also catch data related problems. The sample code also does a bit of data cleanup because leading and trailing will likely cause you problems later on. Blank lines (after cleaning) are also removed.
BTW, the data file I used as an example look like this...
Servers ::
(local)
Tomas-Server1
Tomas-Server2
Tomas-Server3
Tomas-Server4
Tomas-Server5
Tomas-ServerLast
You never define a $ServName2 variable. The variable between $ServName1 and $ServName3 is just named $ServName, with no number.

Why does my Perl for loop exit early?

I am trying to get a perl loop to work that is working from an array that contains 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements from the array until the array runs out of elements. Problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly apperaciated.
my open(infile, 'dnadata.txt');
my #data = < infile>;
chomp #data;
#print #data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my #matrix =();
for(my $i=0; $i<2; $i++){
for( my $j=0; $j<$aalen; $j++){
$matrix[$i][$j] = 0;
}
}
The guidelines for this program states that the program should ignore the presence of gaps in the program. which means that DNA code that is matched up with a gap should be ignored. So the code that is pushed through needs to have alignments linked with gaps removed.
I need to modify the length of the array by two since I am comparing two sequence in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these saclar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causeing issues. I belive that the first loop should move back to the array and pull out the next element each time it loops but it doesn't.
for($i=0; $i<$lenarray; $i++){
#This code should remove the the last value of the array once and
#then a second time. The sequences should be the same length at this point.
my $last1 =pop(#data1);
my $last2 =pop(#data1);
for($i=0; $i<length($last1); $i++){
my $letter1 = substr($last1, $i, 1);
my $letter2 = substr($last2, $i, 1);
if(($letter1 eq '-')|| ($letter2 eq '-')){
#I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
$junk1 = $letter1 . $junk1;
$junk2 = $letter1 . $junk2;
}
else{
$seq1 = $letter1 . $seq1;
$seq2 = $letter2 . $seq2;
}
}
}
print "$seq1\n";
print "$seq2\n";
print "#data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason why the code looks weird, is because it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF
First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a datastructure that reflects that. As elsewhere suggested, an array like #data[[line1, line2],[line3,line4]) ensures that the correct pairs of lines are always together.
What I'm not clear on what you're trying to do is:
a) are you generating a consensus
sequence where the 2 sequences are
difference only by gaps
b) are your 2 sequences significantly
different and you're trying to
exclude the non-aligning parts and
then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.
Don't store your values as an array, store as a two-dimensional array:
my #dataset = ([$val1, $val2], [$val3, $val4]);
or
my #dataset;
push (#dataset, [$val_n1, $val_n2]);
Then:
for my $value (#dataset) {
### Do stuff with $value->[0] and $value->[1]
}
There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = #_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;
Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is use write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my #data = read_data_file( $DATA_FILE );
my #matrix = initialize_matrix( 2, $amino_count, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open dnadata.txt - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my #data = <$infile>;
chomp #data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return #data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list contsruction function that is very very powerful.
# it calls the code in brackets for each member of the the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my #matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return #matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!
Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# stuff...
}
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt');
my #data = <$infile>;
close $infile;
chomp #data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 21 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}