Perl - Data comparison taking huge time

Perl - Data comparison taking huge time - perl

open(INFILE1,"INPUT.txt");
my $modfile = 'Data.txt';
open MODIFIED,'>',$modfile or die "Could not open $modfile : $!";
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
my ($tablename1, $colname1,$sql1) = split(/\t/, $line1);
my ($tablename2, $colname2,$sql2) = split(/\t/, $line2);
if ($tablename1 eq $tablename2)
{
my $sth1 = $dbh->prepare($sql1);
$sth1->execute;
my $hash_ref1 = $sth1->fetchall_hashref('KEY');
my $sth2 = $dbh->prepare($sql2);
$sth2->execute;
my $hash_ref2 = $sth2->fetchall_hashref('KEY');
my #fieldname = split(/,/, $colname1);
my $colcnt=0;
my $rowcnt=0;
foreach $key1 ( keys(%{$hash_ref1}) )
{
foreach (#fieldname)
{
$colname =$_;
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
my $strvalue2='';
#val2 = $hash_ref2->{$key1}->{$colname};
if (defined #val2)
{
my #filtered = grep /#val2/, #metadata2;
my $strvalue2 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
if ($strvalue1 ne $strvalue2 )
{
$colcnt = $colcnt + 1;
print MODIFIED "$tablename1\t$colname\t$strvalue1\t$strvalue2\n";
}
}
}
if ($colcnt>0)
{
print "modified count is $colcnt\n";
}
%$hash_ref1 = ();
%$hash_ref2 = ();
}
The program is Read input file in which every line contrain three strings seperated by tab. First is TableName, Second is ALL Column Name with commas in between and third contain the sql to be run. As this utlity is doing comparison of data, so there are two rows for every tablename. One for each DB. So data needs to be picked from each respective db's and then compared column by column.
SQL returns as ID in the result set and if the value is coming from db then it needs be translated to a string by reading from a array (that array contains 100K records with Key and value seperated by ||)
Now I ran this for one set of tables which contains 18K records in each db. There are 8 columns picked from db in each sql. So for every record out of 18K, and then for every field in that record i.e. 8, this script is taking a lot of time.
My question is if someone can look and see if it can be imporoved so that it takes less time.
File contents sample
INPUT.TXT
TABLENAME COL1,COL2 select COL1,COL2 from TABLENAME where ......
TABLENAMEB COL1,COL2 select COL1,COL2 from TABLENAMEB where ......
Metadata array contains something like this(there are two i.e. for each db)
111||Code 1
222||Code 2
Please suggest

Your code does look a bit unusual, and could gain clarity from using subroutines vs. just using loops and conditionals. Here are a few other suggestions.
The excerpt
for (;;) {
my $line1 = <INFILE1>;
last if not defined $line1;
my $line2 = <INFILE1>;
last if not defined $line2;
...;
}
is overly complicated: Not everyone knows the C-ish for(;;) idiom. You have lots of code duplication. And aren't you actually saying loop while I can read two lines?
while (defined(my $line1 = <INFILE1>) and defined(my $line2 = <INFILE1>)) {
...;
}
Yes, that line is longer, but I think it's a bit more self-documenting.
Instead of doing
if ($tablename1 eq $tablename2) { the rest of the loop }
you could say
next if $tablename1 eq $tablename2;
the rest of the loop;
and save a level of intendation. And better intendation equals better readability makes it easier to write good code. And better code might perform better.
What are you doing at foreach $key1 (keys ...) — something tells me you didn't use strict! (Just a hint: lexical variables with my can perform slightly better than global variables)
Also, doing $colname = $_ inside a for-loop is a dumb thing, for the same reason.
for my $key1 (keys ...) {
...;
for my $colname (#fieldname) { ... }
}
my $strvalue1='';
#val1 = $hash_ref1->{$key1}->{$colname};
if (defined #val1)
{
my #filtered = grep /#val1/, #metadata;
my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
}
I don't think this does what you think it does.
From the $hash_ref1 you retrive a single element, then assign that element to an array (a collection of multiple values).
Then you called defined on this array. An array cannot be undefined, and what you are doing is quite deprecated. Calling defined function on a collection returns info about the memory management, but does not indicate ① whether the array is empty or ② whether the first element in that array is defined.
Interpolating an array into a regex isn't likely to be useful: The elements of the array are joined with the value of $", usually a whitespace, and the resulting string treated as a regex. This will wreak havoc if there are metacharacters present.
When you only need the first value of a list, you can force list context, but assign to a single scalar like
my ($filtered) = produce_a_list;
This frees you from weird subscripts you don't need and that only slow you down.
Then you assign to a $strvalue1 variable you just declared. This shadows the outer $strvalue1. They are not the same variable. So after the if branch, you still have the empty string in $strvalue1.
I would write this code like
my $val1 = $hash_ref1->{$key1}{$colname};
my $strvalue1 = defined $val1
? do {
my ($filtered) = grep /\Q$val1/, #metadata;
substr $filtered, 2 + index $filtered, '||'
} : '';
But this would be even cheaper if you pre-split #metadata into pairs and test for equality with the correct field. This would remove some of the bugs that are still lurking in that code.
$x = $x + 1 is commonly written $x++.
Emptying the hashrefs at the end of the iteration is unneccessary: The hashrefs are assigned to a new value at the next iteration of the loop. Also, it is unneccessary to assist Perls garbage collection for such simple tasks.
About the metadata: 100K records is a lot, so either put it in a database itself, or at the very least a hash. Especially for so many records, using a hash is a lot faster than looping through all entries and using slow regexes … aargh!
Create the hash from the file, once at the beginning of the program
my %metadata;
while (<METADATA>) {
chomp;
my ($key, $value) = split /\|\|/;
$metadata{$key} = $value; # assumes each key only has one value
}
Simply look up the key inside the loop
my $strvalue1 = defined $val1 ? $metadata{$val1} // '' : ''
That should be so much faster.
(Oh, and please consider using better names for variables. $strvalue1 doesn't tell me anything, except that it is a stringy value (d'oh). $val1 is even worse.)

This is not really an answer but it won't really fit well in a comment either so, until you provide some more information, here are some observations.
Inside you inner for loop, there is:
#val1 = $hash_ref1->{$key1}->{$colname};
Did you mean #val1 = #{ $hash_ref1->{$key1}->{$colname} };?
Later, you check if (defined #val1)? What did you really want to check? As perldoc -f defined points out:
Use of "defined" on aggregates (hashes and arrays) is
deprecated. It used to report whether memory for that aggregate
had ever been allocated. This behavior may disappear in future
versions of Perl. You should instead use a simple test for size:
In your case, if (defined #val1) will always be true.
Then, you have my #filtered = grep /#val1/, #metadata; Where did #metadata come from? What did you actually intend to check?
Then you have my $strvalue1 = substr(#filtered[0],index(#filtered[0],'||') + 2);
There is some interesting stuff going on in there.
You will need to verbalize what you are actually trying to do.
I strongly suspect there is a single SQL query you can run that will give you what you want but we first need to know what you want.

Related

How to use hash in perl to reduce the use of loops?

For example, first, I have identified several key words. Then, in another file, I want to find lines that matches these key words exactly.
Using loop, I would loop through the file for each key word. So if there are 10 key words, I would loop the file 10 times. But if the file is huge, this method would be very time inefficient.
I heard proper use of hash could ease such a situation but never saw an example. So exactly how to use hash to achieve the same goal while using loop only once? Could you provide an example?

You have a solution whose time taken is proportional to the size of the file and to the number of keywords (O(N*K)). You would like a solution where the time taken is not affected by the number of keywords (O(N)).
You could indeed achieve this using a hash.
my #keywords = ...;
my %keywords = map { $_ => 1 } #keywords;
while (<$fh>) {
my #words = split ' '; # To be improved.
if (grep { $keywords{$_} } #words) { # Replace with foreach loop to exit ASAP.
...
}
}
I'd use a regex match.
my #keywords = ...;
my $pat = join '|', map quotemeta, #keywords;
my $re = qr/$pat/;
while (<$fh>) {
if (/(?:^|\s)$re(?:\s|\z)/) { # To be improved.
...
}
}

Retain quotes on CSV fields that were quoted in the input

I have a CSV file such that a few of the fields are quoted regardless of whether they need to be. What I wish to do is load this file, modify a few of the values, and produce the modified CSV with the quoted fields intact.
I'm currently using Perl's Text::CSV package to attempt to solve this problem, but have ran into a bit of a roadblock. The following is a small test script to demonstrate the problem:
use Text::CSV;
my $csv = Text::CSV->new ({'binary' => 1, 'allow_loose_quotes' => 1, 'keep_meta_info' => 1});
my $line = q^hello,"world"^;
print qq^input: $line\n^;
$csv->parse($line);
my #flds = $csv->fields();
$csv->combine(#flds);
print 'output: ', $csv->string(), "\n";
produces:
input: hello,"world"
output: hello,world
According to Text::CSV's documentation, an is_quoted() function exists to test if a field had been quoted in the input, but if I use this to add surrounding quotes to a field, I get unexpected results:
my $csv = Text::CSV->new ({'binary' => 1, 'allow_loose_quotes' => 1, 'keep_meta_info' => 1});
my $line = q^hello,"world"^;
print qq^input: $line\n^;
$csv->parse($line);
my #flds = $csv->fields();
for my $idx (0..$#flds) {
if ($csv->is_quoted($idx)) {
$flds[$idx] = qq^"$flds[$idx]"^;
}
}
$csv->combine(#flds);
print 'output: ', $csv->string(), "\n";
Producing:
input: hello,"world"
output: hello,"""world"""
where I believe the quotes I've added before the combine() are being seen as part of the field, and so are being escaped with a second double quote as combine() is processing.
What would be the best way to ensure quoted fields are left intact from input to output? I'm not certain the application will accept always_quote'ed fields... Is there some combination of Text::CSV object attributes that will allow for keeping quotes intact? Or perhaps am I left with adjusting the record post-combine?

It's a shame but it appears that while keep_meta_info gives you access to the metadata there's no option to tell Text::CSV to reapply the is_quoted state on output.
Depending on how complex your record is you could just reassemble it yourself. But then you'd have to cope with changes to string fields that were previously safely unquoted but after your processing now require quotes. That will depend on the types of changes you introduce, i.e. whether or not you ever expect that a previously "safe" string value will become unsafe. If the answer is "never" (i.e. 0.00000% chance), then you should just do the reassembly yourself and document what you've done.
Post-processing would require that you CSV-parse the string to handle the possibility of commas and other unsafe characters inside strings, so that may not be an option.
Or, you could dive into the code for Text::CSV and implement the desired functionality. I.e. allow the user to force quoting of a specific field on output. I played around with it, and it looks like part of the required mechanism might be in place but unfortunately all I have access to is the XS version, which delegates to native code, so I can't delve deeper at this time. This is as far as I got:
Original combine method. Note the setting of _FFLAGS to undef.
sub combine
{
my $self = shift;
my $str = "";
$self->{_FIELDS} = \#_;
$self->{_FFLAGS} = undef;
$self->{_STATUS} = (#_ > 0) && $self->Combine (\$str, \#_, 0);
$self->{_STRING} = \$str;
$self->{_STATUS};
} # combine
My attempt. I guessed that the second argument to Combine might be the flags, but since the (lowercase) combine API is based on receiving an array and not an arrayref, there's no way to pass two arrays in. I changed it to expect two arrayrefs and tried passing the second to Combine but that failed with "Can't call method "print" on unblessed reference".
sub combine2
{
my $self = shift;
my $str = "";
my $f = shift;
my $g = shift;
$self->{_FIELDS} = $f;
$self->{_FFLAGS} = $g;
$self->{_STATUS} = (#$f > 0) && $self->Combine (\$str, $f, $g);
$self->{_STRING} = \$str;
$self->{_STATUS};
} # combine

Parsing CSV files, finding columns and remembering them

I am trying to figure out a way to do this, I know it should be possible. A little background first.
I want to automate the process of creating the NCBI Sequin block for submitting DNA sequences to GenBank. I always end up creating a table that lists the species name, the specimen ID value, the type of sequences, and finally the location of the the collection. It is easy enough for me to export this into a tab-delimited file. Right now I do something like this:
while ($csv) {
foreach ($_) {
if ($_ =! m/table|species|accession/i) {
#csv = split('\t', $csv);
print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
}
else {
next;
}
}
}
I know that is messy, and I just typed up something similar to what I have by memory (don't have script on any of my computers at home, only at work).
Now that works for me fine right now because I know which columns the information I need (species, location, and ID number) are in.
But is there a way (there must be) for me to find the columns that are for the needed info dynamically? That is, no matter the order of the columns the correct info from the correct column goes to the right place?
The first row will usually as Table X (where X is the number of the table in the publication), the next row will usually have the column headings of interest and are nearly universal in title. Nearly all tables will have standard headings to search for and I can just use | in my pattern matching.

First off, I would be remiss if I didn’t recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and can even handle the column-mapping scheme that Barmar referred to above.
That said, Barmar has the right approach, though it ignores the "Table X" row being a separate row entirely. I recommend taking an explicit approach, perhaps something like this (and this is going to have a bit more detail just to make things clear; I would probably write it more tightly in production code):
# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Get header information first.
my $hdr_data = {};
while( <$csv_fh> ) {
if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
$hdr_data->{'table'} = $1;
next;
}
if( ! $hdr_data->{'species'} && /species/ ) {
my $n = 0;
# Takes the column headers as they come, creating
# a map between the column name and column number.
# Assumes that column names are case-insensitively
# unique.
my %columns = map { lc($_) => $n++ } split( /\t/ );
# Now pick out exactly the columns we want.
foreach my $thingy ( qw{ species accession country } ) {
$hdr_data->{$thingy} = $columns{$thingy};
}
last;
}
}
# Now process the rest of the lines.
while( <$csv_fh> ) {
my $col = split( /\t/ );
printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
$col[$hdr_data->{'species'}],
$col[$hdr_data->{'country'}],
$col[$hdr_data->{'accession'}];
}
Some variation of that will get you close to what you need.

Create a hash that maps column headings to column numbers:
my %columns;
...
if (/table|species|accession/i) {
my #headings = split('\t');
my $col = 0;
foreach my $col (#headings) {
$columns{"\L$col"} = $col++;
}
}
Then you can use $csv[$columns{'species'}].

Simplest way to match array of strings to search in perl?

What I want to do is check an array of strings against my search string and get the corresponding key so I can store it. Is there a magical way of doing this with Perl, or am I doomed to using a loop? If so, what is the most efficient way to do this?
I'm relatively new to Perl (I've only written 2 other scripts), so I don't know a lot of the magic yet, just that Perl is magic =D
Reference Array: (1 = 'Canon', 2 = 'HP', 3 = 'Sony')
Search String: Sony's Cyber-shot DSC-S600
End Result: 3

UPDATE:
Based on the results of discussion in this question, depending on your intent/criteria of what constitutes "not using a loop", the map based solution below (see "Option #1) may be the most concise solution, provided that you don't consider map a loop (the short version of the answers is: it's a loop as far as implementation/performance, it's not a loop from language theoretical point of view).
Assuming you don't care whether you get "3" or "Sony" as the answer, you can do it without a loop in a simple case, by building a regular expression with "or" logic (|) from the array, like this:
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
print "$which_found[0]\n";
Result from my test run: Sony
The regular expression will (once the variable $combined_search is interpolated by Perl) take the form /(Canon|HP|Sony)/ which is what you want.
This will NOT work as-is if any of the strings contain regex special characters (such as | or ) ) - in that case you need to escape them
NOTE: I personally consider this somewhat cheating, because in order to implement join(), Perl itself must do a loop somewhere inside the interpeter. So this answer may not satisfy your desire to remain loop-less, depending on whether you wanted to avoid a loop for performance considerations, of to have cleaner or shorter code.
P.S. To get "3" instead of "Sony", you will have to use a loop - either in an obvious way, by doing 1 match in a loop underneath it all; or by using a library that saves you from writing the loop yourself but will have a loop underneath the call.
I will provide 3 alternative solutions.
#1 option: - my favorite. Uses "map", which I personally still consider a loop:
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
print "$which_found[0]\n";
die "Not found" unless #which_found;
my $strings_index = 0;
my %strings_indexes = map {$_ => $strings_index++} #strings;
my $index = 1 + $strings_indexes{ $which_found[0] };
# Need to add 1 since arrays in Perl are zero-index-started and you want "3"
#2 option: Uses a loop hidden behind a nice CPAN library method:
use List::MoreUtils qw(firstidx);
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
my $combined_search = join("|",#strings);
my #which_found = ($search_in =~ /($combined_search)/);
die "Not Found!"; unless #which_found;
print "$which_found[0]\n";
my $index_of_found = 1 + firstidx { $_ eq $which_found[0] } #strings;
# Need to add 1 since arrays in Perl are zero-index-started and you want "3"
#3 option: Here's the obvious loop way:
my $found_index = -1;
my #strings = ("Canon", "HP", "Sony");
my $search_in = "Sony's Cyber-shot DSC-S600";
foreach my $index (0..$#strings) {
next if $search_in !~ /$strings[$index]/;
$found_index = $index;
last; # quit the loop early, which is why I didn't use "map" here
}
# Check $found_index against -1; and if you want "3" instead of "2" add 1.

Here is a solution that builds a regular expression with embedded code to increment the index as perl moves through the regex:
my #brands = qw( Canon HP Sony );
my $string = "Sony's Cyber-shot DSC-S600";
use re 'eval'; # needed to use the (?{ code }) construct
my $index = -1;
my $regex = join '|' => map "(?{ \$index++ })\Q$_" => #brands;
print "index: $index\n" if $string =~ $regex;
# prints 2 (since Perl's array indexing starts with 0)
The string that is prepended to each brand first increments the index, and then tries to match the brand (escaped with quotemeta (as \Q) to allow for regex special characters in the brand names).
When the match fails, the regex engine moves past the alternation | and then the pattern repeats.
If you have multiple strings to match against, be sure to reset $index before each. Or you can prepend (?{$index = -1}) to the regex string.

An easy way is just to use a hash and regex:
my $search = "your search string";
my %translation = (
'canon' => 1,
'hp' => 2,
'sony' => 3
);
for my $key ( keys %translation ) {
if ( $search =~ /$key/i ) {
return $translation{$key};
)
}
Naturally the return can just as easily be a print. You can also surround the entire thing in a while loop with:
while(my $search = <>) {
#your $search is declared = to <> and now gets its values from STDIN or strings piped to this script
}
Please also take a look at perl's regex features at perlre
and take a look at perl's data structures at perlref
EDIT
as was just pointed out to me you were trying to steer away from using a loop. Another method would be to use perl's map function. Take a look here.

You can also take a look at Regexp::Assemble, which will take a collection of sub-regexes and build a single super-regex from them that can then be used to test for all of them at once (and gives you the text which matched the regex, of course). I'm not sure that it's the best solution if you're only looking at three strings/regexes that you want to match, but it's definitely the way to go if you have a substantially larger target set - the project I initially used it on has a library of some 1500 terms that it's matching against and it performs very well.

Why does my Perl for loop exit early?

I am trying to get a perl loop to work that is working from an array that contains 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements from the array until the array runs out of elements. Problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly apperaciated.
my open(infile, 'dnadata.txt');
my #data = < infile>;
chomp #data;
#print #data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my #matrix =();
for(my $i=0; $i<2; $i++){
for( my $j=0; $j<$aalen; $j++){
$matrix[$i][$j] = 0;
}
}
The guidelines for this program states that the program should ignore the presence of gaps in the program. which means that DNA code that is matched up with a gap should be ignored. So the code that is pushed through needs to have alignments linked with gaps removed.
I need to modify the length of the array by two since I am comparing two sequence in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these saclar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causeing issues. I belive that the first loop should move back to the array and pull out the next element each time it loops but it doesn't.
for($i=0; $i<$lenarray; $i++){
#This code should remove the the last value of the array once and
#then a second time. The sequences should be the same length at this point.
my $last1 =pop(#data1);
my $last2 =pop(#data1);
for($i=0; $i<length($last1); $i++){
my $letter1 = substr($last1, $i, 1);
my $letter2 = substr($last2, $i, 1);
if(($letter1 eq '-')|| ($letter2 eq '-')){
#I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
$junk1 = $letter1 . $junk1;
$junk2 = $letter1 . $junk2;
}
else{
$seq1 = $letter1 . $seq1;
$seq2 = $letter2 . $seq2;
}
}
}
print "$seq1\n";
print "$seq2\n";
print "#data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason why the code looks weird, is because it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF

First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a datastructure that reflects that. As elsewhere suggested, an array like #data[[line1, line2],[line3,line4]) ensures that the correct pairs of lines are always together.
What I'm not clear on what you're trying to do is:
a) are you generating a consensus
sequence where the 2 sequences are
difference only by gaps
b) are your 2 sequences significantly
different and you're trying to
exclude the non-aligning parts and
then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.

Don't store your values as an array, store as a two-dimensional array:
my #dataset = ([$val1, $val2], [$val3, $val4]);
or
my #dataset;
push (#dataset, [$val_n1, $val_n2]);
Then:
for my $value (#dataset) {
### Do stuff with $value->[0] and $value->[1]
}

There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = #_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;

Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is use write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my #data = read_data_file( $DATA_FILE );
my #matrix = initialize_matrix( 2, $amino_count, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open dnadata.txt - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my #data = <$infile>;
chomp #data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return #data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list contsruction function that is very very powerful.
# it calls the code in brackets for each member of the the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my #matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return #matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!

Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# stuff...
}
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt');
my #data = <$infile>;
close $infile;
chomp #data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 21 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Perl - Data comparison taking huge time - perl

Related

How to use hash in perl to reduce the use of loops?

Retain quotes on CSV fields that were quoted in the input

Parsing CSV files, finding columns and remembering them

Simplest way to match array of strings to search in perl?

Why does my Perl for loop exit early?

Categories

Resources