How to find common parts in a paths with perl? - perl

Having several paths, like:
1: /abc/def/some/common/part/xyz/file1.ext
2: /other/path/to/7433/qwe/some/common/part/anotherfile.ext
3: /misc/path/7433/qwe/some/common/part/filexx.ext
4: /2443/totally/different/path/file9988.ext
5: /abc/another/same/path/to/ppp/thisfile.ext
6: /deep1/deep2/another/same/path/to/diffone/filename.ext
I need find the common parts - each possible ones, eg. in the above if possible to found common parts:
/some/common/part/ - in the paths 1,2,3
/another/same/path/to/ - in the 5,6
/path/to/ - in the 2,5,6
/path/ - 2,3,4,5,6
etc..
I simply absoulutely haven't any idea how to solve this - what approach is good one
string based - somewhat find common parts of a string
list based - splitting all path into lists and somewhat compare arrays for common elements
tree-graph - somewhat find a common parts of a graph
other?
When i get some direction how to solve this problem, I'm (probably) able code it myself - so don't want free programmming service - but need some guiding how to start.
I'm sure than here is already some CPAN module what could help me, but I'm really have'nt idea how to find the right useful module from the list of 30k modules for the above problem. :(
EDIT - For what i need this:
Having approx. 200k files, in 10k directories and many of them "belong together", like:
/u/some/path/project1/subprojct/file1
/u/backup/of/work/date/project1/subproject/file2
/u/backup_of_backup/of/work/date/project1/subproject/file2
/u/new/addtions/to/projec1/subproject/file3
The files are dirrerent kind (pdf, images, doc, txt and so on), several are identical (like above file2 - easy to filter with Digest::MD5), but the only way "group them together" is based on "common parts" of a path - e.g. "project1/subproject" and so on..
Another files HAS the same MD5, so can filter out duplicates, but they are in different trees, like
/u/path/some/file
/u/path/lastest_project/menu/file
/u/path/jquery/menu/file
/u/path/example/solution/jquery/menu/file
so, the files are the same, (identical md5) but need somewhat move one copy to the right place (and delete others) and need somewhat determine the "most used" common paths, and collect tags... (old path elements are tags)
The idea behind is:
if the same md5 files are mostly stored under some common path - I can make a decision where to move one copy...
And it is more complicated, but for explanation is enough the above ;)
Simply need lowering the entropy on my HDD ;)

There is some discussion about finding the longest common consecutive substrings in this thread: http://www.nntp.perl.org/group/perl.fwp/2002/02/msg1662.html
The "winner" appears to be the following code, but there are a few other things in there you could try:
#!/usr/bin/perl
use strict;
use warnings;
sub lcs {
my $this = shift;
my $that = shift;
my $str = join "\0", $this, $that;
my $len = 1;
my $lcs;
while ($str =~ m{ ([^\0]{$len,}) (?= [^\0]* \0 [^\0]*? \1 ) }xg) {
$lcs = $1;
$len = 1 + length($1);
}
if ($len == 1) { print("No common substring\n"); }
else {
print("Longest common substring of length $len: \"");
print("$lcs");
print("\"\n");
}
}
Keep in mind you would have to adjust it a little bit to account for the fact that you only want entire subdirectories that match... ie, change if ($len == 1) to something like if ($len == 1 or $lcs !~ /^\// or $lcs !~ /\/$/)
You would also have to add some bookkeeping to keep track of which ones match. When I ran this code on your examples above it also found the /abc/ match in lines 1 & 5.
One thing that may or may not be a problem is that the following two lines:
/abc/another/same/path/to/ppp/thisfile.ext
/abc/another/different/path/to/ppp/otherfile.ext
Would match on:
/abc/another/
But not on:
/path/to/ppp/
But -- here's the bad news -- you will have to do O(n^2) comparisons with n=200,000 files. That could take an obscene amount of time.
Another solution would be to go through each path in your list, add all of its possible directory paths as keys to a hash and push the file itself to the hash (so that the value is an array of files that have this path in it). Something like this:
use strict;
use warnings;
my %links;
open my $fh, "<", 'filename' or die "Can't open $!";
while (my $line = <$fh>) {
chomp($line);
my #dirs = split /\//, $line;
for my $i (0..$#dirs) {
if ($i == $#dirs) {
push(#{ $links{$dirs[$i]} }, $line);
}
for my $j ($i+1..$#dirs) {
push(#{ $links{join("/",#dirs[$i..$j])} }, $line);
#PROCESS THIS if length of array is > 1
}
}
}
Of course, this would take an obscene amount of memory. With 200,000 files to process, you might have a hard time no matter what you try, but maybe you can break it up into more manageable chunks. Hopefully, this will give you a starting point.

To solve this problem, you need the correct data structure. A hash that counts the partial paths works well:
use File::Spec;
my %Count_of = ();
while( <DATA> ){
my #names = File::Spec->splitdir( $_ );
# remove file
pop #names;
# if absolute path, remove empty names at start
shift #names while length( $names[0] ) == 0;
# don't count blank lines
next unless #names;
# move two cursor thru the names,
# and count the partial parts
# created from one to the other
for my $i ( 0 .. $#names ){
for my $j ( $i .. $#names ){
my $partial_path = File::Spec->catdir( #names[ $i .. $j ] );
$Count_of{ $partial_path } ++;
}
}
}
# now display the results
for my $path ( sort { $Count_of{$b} <=> $Count_of{$a} || $a cmp $b } keys %Count_of ){
# skip if singleton.
next if $Count_of{ $path } <= 1;
printf "%3d : %s\n", $Count_of{ $path }, $path;
}
__DATA__
/abc/def/some/common/part/xyz/file1.ext
/other/path/to/7433/qwe/some/common/part/anotherfile.ext
/misc/path/7433/qwe/some/common/part/filexx.ext
/2443/totally/different/path/file9988.ext
/abc/another/same/path/to/ppp/thisfile.ext
/deep1/deep2/another/same/path/to/diffone/filename.ext

Related

Why is my Tie::IxHash program taking a long time?

Basically, I have a script to create a hash for COGs with corresponding gene IDs:
# Open directory and get all the files in it
opendir(DIR, "/my/path/to/COG/");
my #infiles = grep(/OG-.*\.fasta/, readdir(DIR));
closedir(DIR);
# Create hash for COGs and their corresponding gene IDs
tie my %ids_for, 'Tie::IxHash';
if (! -e '/my/path/to/COG/COG_hash.ref') {
for my $infile (#infiles) {
## $infile
%ids_for = (%ids_for, read_COG_fasta($infile));
}
## %ids_for
store \%ids_for, '/my/path/to/COG/COG_hash.ref';
}
my $id_ref = retrieve('/my/path/to/COG/COG_hash.ref');
%ids_for = %$id_ref;
## %ids_for
The problem isn't that it doesn't work (at least I think), but that it is extremely slow for some reason. When I tried to test run it, it would take weeks for me to have an actual result. Somehow the hash creation is really really slow and I'm sure there is some way to optimize it better for it to work way faster.
Ideally, the paths should be the input of the script that way there would be no need to constantly change the script in case the path changes.
It would also be great if there could be a way to see the progress of the hash creation, like maybe have it show that it is 25% done, 50% done, 75% done and ultimately 100% done. Regarding this last point I have seen things like use Term::ProgressBar but I am not sure if it would be appropriate in this case.
Do you really need Tie::IxHash?
That aside, I suspect your culprit is this set of lines:
for my $infile (#infiles) {
## $infile
%ids_for = (%ids_for, read_COG_fasta($infile));
}
To add a key to the hash, you are creating a list of the current key-value pairs, adding the new pair, then assigning it all back to the hash.
What happens if you take the results from read_COG_fasta and add the keys one at a time?
for my $infile (#infiles) {
my %new_hash = read_COG_fasta($infile);
foreach my $key ( keys %new_hash ) {
$ids_for{$key} = $new_hash{$key};
}
}
As for progress, I usually have something like this when I'm trying to figure out something:
use v5.26;
my $file_count = #files;
foreach my $n ( 0 .. $#files ) {
say "[$n/$file_count] Processing $file[$n]";
my %result = ...;
printf "\tGot %d results", scalar %hash; # v5.26 feature!
}
You could do the same sort of thing with the keys that you get back so you can track the size.

Perl need the right grep operator to match value of variable

I want to see if I have repeated items in my array, there are over 16.000 so will automate it
There may be other ways but I started with this and, well, would like to finish it unless there is a straightforward command. What I am doing is shifting and pushing from one array into another and this way, check the destination array to see if it is "in array" (like there is such a command in PHP).
So, I got this sub routine and it works with literals, but it doesn't with variables. It is because of the 'eq' or whatever I should need. The 'sourcefile' will contain one or more of the words of the destination array.
// Here I just fetch my file
$listamails = <STDIN>;
# Remove the newlines filename
chomp $listamails;
# open the file, or exit
unless ( open(MAILS, $listamails) ) {
print "Cannot open file \"$listamails\"\n\n";
exit;
}
# Read the list of mails from the file, and store it
# into the array variable #sourcefile
#sourcefile = <MAILS>;
# Close the handle - we've read all the data into #sourcefile now.
close MAILS;
my #destination = ('hi', 'bye');
sub in_array
{
my ($destination,$search_for) = #_;
return grep {$search_for eq $_} #$destination;
}
for($i = 0; $i <=100; $i ++)
{
$elemento = shift #sourcefile;
if(in_array(\#destination, $elemento))
{
print "it is";
}
else
{
print "it aint there";
}
}
Well, if instead of including the $elemento in there I put a 'hi' it does work and also I have printed the value of $elemento which is also 'hi', but when I put the variable, it does not work, and that is because of the 'eq', but I don't know what else to put. If I put == it complains that 'hi' is not a numeric value.
When you want distinct values think hash.
my %seen;
#seen{ #array } = ();
if (keys %seen == #array) {
print "\#array has no duplicate values\n";
}
It's not clear what you want. If your first sentence is the only one that matters ("I want to see if I have repeated items in my array"), then you could use:
my %seen;
if (grep ++$seen{$_} >= 2, #array) {
say "Has duplicates";
}
You said you have a large array, so it might be faster to stop as soon as you find a duplicate.
my %seen;
for (#array) {
if (++$seen{$_} == 2) {
say "Has duplicates";
last;
}
}
By the way, when looking for duplicates in a large number of items, it's much faster to use a strategy based on sorting. After sorting the items, all duplicates will be right next to each other, so to tell if something is a duplicate, all you have to do is compare it with the previous one:
#sorted = sort #sourcefile;
for (my $i = 1; $i < #sorted; ++$i) { # Start at 1 because we'll check the previous one
print "$sorted[$i] is a duplicate!\n" if $sorted[$i] eq $sorted[$i - 1];
}
This will print multiple dupe messages if there are multiple dupes, but you can clean it up.
As eugene y said, hashes are definitely the way to go here. Here's a direct translation of the code you posted to a hash-based method (with a little more Perlishness added along the way):
my #destination = ('hi', 'bye');
my %in_array = map { $_ => 1 } #destination;
for my $i (0 .. 100) {
$elemento = shift #sourcefile;
if(exists $in_array{$elemento})
{
print "it is";
}
else
{
print "it aint there";
}
}
Also, if you mean to check all elements of #sourcefile (as opposed to testing the first 101 elements) against #destination, you should replace the for line with
while (#sourcefile) {
Also also, don't forget to chomp any values read from a file! Lines read from a file have a linebreak at the end of them (the \r\n or \n mentioned in comments on the initial question), which will cause both eq and hash lookups to report that otherwise-matching values are different. This is, most likely, the reason why your code is failing to work correctly in the first place and changing to use sort or hashes won't fix that. First chomp your input to make it work, then use sort or hashes to make it efficient.

Comparing two directories using Perl

i am new to Perl so excuse my noobness,
Here's what i intend to do.
$ perl dirComp.pl dir1 dir2
dir1 & dir2 are directory names.
The script dirComp.pl should identify whether contents in dir1 & dir2 are identical or not.
I have come up with an algorithm
Store all the contents of dir1(recursively) in a list
Store all the contents of dir2 in another list
Compare the two list, if they are same - dir1 & dir2 are same else not.
my #files1 = readdir(DIR1h);
my #files2 = readdir(DIR2h);
# Remove filename extensions for each list.
foreach my $item (#files1) {
my ( $fileName, $filePath, $fileExt ) = fileparse($item, qr/\.[^.]*/);
$item = $fileName;
}
foreach my $item (#files2) {
my ( $fileName, $filePath, $fileExt ) = fileparse($item, qr/\.[^.]*/);
$item = $fileName;
}
I am not able to recursively traverse subdirectories in a given directory with the help of above code. Any help would be appreciated.
EDIT: Using File:DirCompare
#!/usr/bin/perl -w
use File::DirCompare;
use File::Basename;
if ($#ARGV < 1 )
{
&usage;
}
my $dir1 = $ARGV[0];
my $dir2 = $ARGV[1];
File::DirCompare->compare($dir1,$dir2,sub {
my ($a,$b) = #_;
if ( !$b )
{
printf "Test result:PASSED.\n";
printf "Only in %s : %s\n", dirname($a), basename($a);
}elsif ( !$a ) {
printf "Test result:PASSED.\n";
printf "Only in %s : %s\n", dirname($b), basename($b);
}else {
printf "Test result:FAILED.\n";
printf "Files $a and $b are different.\n";
}
});
I have a directory structure as below,
dir1/ dir2/
--file1.txt --file1.txt
--file2.txt --file2.txt
--file3.cpp --file3.cpp
I am facing Test result:FAILED. As the result must have been passed. Can anyone please correct me?
Thanks
The example you supplied using File::DirCompare works as intended.
Keep in mind that the callback subroutine is called for every unique file in each directory and for every pair of files which differ in their content. Having the same filename is not enough, the contents of each file in each directory must be exactly the same as well.
Furthermore, the cases in which you report "PASSED" aren't a success at all (by your definition) since they detail the cases in which a file exists in one of the directories, but not the other: meaning the directories' contents are not identical.
This should be closer to what you want:
#!/usr/bin/perl
use strict;
use warnings;
use File::DirCompare;
use File::Basename;
sub compare_dirs
{
my ($dir1, $dir2) = #_;
my $equal = 1;
File::DirCompare->compare($dir1, $dir2, sub {
my ($a,$b) = #_;
$equal = 0; # if the callback was called even once, the dirs are not equal
if ( !$b )
{
printf "File '%s' only exists in dir '%s'.\n", basename($a), dirname($a);
}
elsif ( !$a ) {
printf "File '%s' only exists in dir '%s'.\n", basename($b), dirname($b);
}
else
{
printf "File contents for $a and $b are different.\n";
}
});
return $equal;
}
print "Please specify two directory names\n" and exit if (#ARGV < 2);
printf "%s\n", &compare_dirs($ARGV[0], $ARGV[1]) ? 'Test: PASSED' : 'Test: FAILED';
I'd recommend using File::DirCompare module instead. ) It takes all the hard work of traversing the directory structure - you just need to define how your directories should be checked (should the sub compare the file contents, etc.)
You might want to try the ol' File::Find. It's not my favorite module. (It is just funky in the way it works), but for your purposes, it allows you to easily find all files in two directories, and compare them. Here's a brief example:
use strict;
use warnings;
use feature qw(say);
use Digest::MD5::File qw(file_md5_hex);
use File::Find;
use constant {
DIR_1 => "/usr/foo",
DIR_2 => "/usr/bar",
};
my %dir_1;
my %dir_2;
find ( sub {
if ( -f $File::Find::name ) {
$dir_1{$File::Find::name} = file_md5_hex($File::Find::name);
}
else {
$dir_1($file::Find::name} = "DIRECTORY!";
}
}, DIR_1);
find ( sub {
if ( -f $File::Find::name ) {
$dir_2{$File::Find::name} = file_md5_hex($File::Find::name);
}
else {
$dir_2($file::Find::name} = "DIRECTORY!";
}
}, DIR_2);
This will create two hashes keyed by the file names in each directory. I used the Digest::MD5::File to create a MD5 checksum. If the checksum between the two files differ, I know the files differ (although I don't know where).
Now you have to do three things:
Go through %dir_1 and see if there's an equivalent key in %dir_2. If there is not an equivalent key, you know that a file exists in %dir_1 and not %dir_2.
If there an equivalent key in each hash, check to see if the md5 checksums agree. If they do, then, the files match. If they don't they differ. You can't say where they differ, but they differ.
Finally, go through %dir_2 and check to see if there's an equivalent key in %dir_1. If there is, do nothing. If there isn't, that means there's a file in %dir_1 that's not in %dir_2.
Just a word of warning: The keys int these two hashes won't match. You'll have to transform one to the other when doing your compare. For example, you'll have two files as:
/usr/bar/my/file/is/here.txt
/usr/foo/my/file/is/here.txt
As you can see, my/file/is/here.txt exist in both directories, but in my code, the two hashes will have two different keys. You could either fix the two subroutines to strip the directory name off the front of the files paths, or when you do your comparison, transform one to the other. I didn't want to run through a full test. (The bit of code I wrote works in my testing), so I'm not 100% sure what you'll have to do to make sure you find the matching keys.
Oh, another warning: I pick up all entries and not just files. For directories, I can check to see if the hash key is equal to DIRECTORY! or not. I could simply ignore everything that's not a file.
And, you might want to check for special cases. Is this a link? Is it a hard link or a soft link? What about some sort of special file. That makes things a bit more complex. However, the basics are here.

How can I parse just part of a file with Perl?

I'm a total newbie to Perl, but I've heard that it's great for parsing files, so I've thought of giving it a spin.
I have a text file that has the following sample info:
High school is used in some
parts of the world, particularly in
Scotland, North America and Oceania to
describe an institution that provides
all or part of secondary education.
The term "high school" originated in
Scotland with the world's oldest being
the Royal High School (Edinburgh) in
1505.
The Royal High School was used as a
model for the first public high school
in the United States, the English High
School founded in Boston,
Massachusetts, in 1821. The precise
stage of schooling provided by a high
school differs from country to
country, and may vary within the same
jurisdiction. In all of New Zealand
and Malaysia along with parts of
Australia and Canada, high school is
synonymous with secondary school, and
encompasses the entire secondary stage
of education.
======================================
Grade1 87.43%
Grade2 84.30%
Grade3 83.00%
=====================================
I want to parse the file and only get the numerical information. I
looked into regex, and I think I'd use something like
if (m/^%/) {
do something
}
else {
skip the line
}
But, what I really want to do is keep track of the variable on the
left and store the numerical value in that variable. So, after
parsing the file, I'd really like to have the following variables
to have the % value stored in them. The reason being, I want to
create a pie-chart/bar graph of the different grades.
Grade1 = 87.43
Grade2 = 84.30
...
Could you'll suggest methods I should be looking at?
You'll need a regular expression. Something like the following should work
while (<>) {
/(Grade[0-9]+)\s*([0-9]+\.[0-9]+)/;
$op{$1} = $2;
}
as a filter. The op hash will store the grade names and scores. This is preferable to automatically instantiating variables.
If you can guarantee that your points of interest are nested between two =s (and there isn't an odd number of these demarcations in a given file), the flip-flop operator is a handy thing here:
use strict; # These two pragmas go a long, ...
use warnings; # ... long way in helping you code better
my %scores; # Create a hash of scores
while (<>) { # The diamond operator processes all files ...
# ... supplied at command-line, line-by-line
next unless /^=+$/ .. /^=+$/; # The flip-flop operator used ...
# ... to filter out only 'grades'
my ( $name, $grade ) = split; # This usage of split will break ...
# ... the current line into an array
$scores{$name} = $grade; # Associate grade with name
}
You want to use a hash. Something like this should do the trick:
my %grades = (); # this is a hash
open(my $fh, "grade_file.txt" ) or die $!;
while( my $line = <$fh> ) {
if( my( $name, $grade ) = $line =~ /^(Grade\d+)\s(\d+\.\d+\%) ) {
$grades{$name} = $grade;
}
}
close($fh);
Your %grades hash would then contain the name and grade pairs. (Access it like my $value = $grades{'Grade1'}
Also just a note. The language is called "Perl", not "PERL". Many people in the Perl community get upset about it :-)
See Zaid's answer for an example of using the flip-flop operator (which is what I would recommend). However, if you run into difficulties with that (sometimes the DWIMmery might get in the way), you can also explicitly maintain state while reading the file line-by-line:
#!/usr/bin/perl
use strict; use warnings;
my %grades;
my $interesting;
while ( my $line = <DATA> ) {
if ( not $interesting and $line =~ /^=+\s*\z/ ) {
$interesting = 1;
next;
}
if ( $interesting ) {
if ( $line =~ /^=+\s*$/ ) {
$interesting = 0;
next;
}
elsif ( my ($name, $grade) = $line =~ /^(\w+)\s+(\d+\.\d+%)/ ) {
# Keep an array in case the same name occurs
# multiple times
push #{ $grades{$name} }, $grade;
}
}
}
use YAML;
print Dump \%grades;
Creating dynamic variable names is probably not going to help you much in producing a graph; using an array is almost certainly a better idea.
However, if you really think you want to do this:
while (my $line = <$your_infile_handler>){
if ($line =~ m/(.*) = ([0-9.]*)){
$$1 = $2;
}
}
should accomplish this.

Why does my Perl for loop exit early?

I am trying to get a perl loop to work that is working from an array that contains 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements from the array until the array runs out of elements. Problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly apperaciated.
my open(infile, 'dnadata.txt');
my #data = < infile>;
chomp #data;
#print #data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my #matrix =();
for(my $i=0; $i<2; $i++){
for( my $j=0; $j<$aalen; $j++){
$matrix[$i][$j] = 0;
}
}
The guidelines for this program states that the program should ignore the presence of gaps in the program. which means that DNA code that is matched up with a gap should be ignored. So the code that is pushed through needs to have alignments linked with gaps removed.
I need to modify the length of the array by two since I am comparing two sequence in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these saclar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causeing issues. I belive that the first loop should move back to the array and pull out the next element each time it loops but it doesn't.
for($i=0; $i<$lenarray; $i++){
#This code should remove the the last value of the array once and
#then a second time. The sequences should be the same length at this point.
my $last1 =pop(#data1);
my $last2 =pop(#data1);
for($i=0; $i<length($last1); $i++){
my $letter1 = substr($last1, $i, 1);
my $letter2 = substr($last2, $i, 1);
if(($letter1 eq '-')|| ($letter2 eq '-')){
#I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
$junk1 = $letter1 . $junk1;
$junk2 = $letter1 . $junk2;
}
else{
$seq1 = $letter1 . $seq1;
$seq2 = $letter2 . $seq2;
}
}
}
print "$seq1\n";
print "$seq2\n";
print "#data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason why the code looks weird, is because it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF
First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a datastructure that reflects that. As elsewhere suggested, an array like #data[[line1, line2],[line3,line4]) ensures that the correct pairs of lines are always together.
What I'm not clear on what you're trying to do is:
a) are you generating a consensus
sequence where the 2 sequences are
difference only by gaps
b) are your 2 sequences significantly
different and you're trying to
exclude the non-aligning parts and
then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.
Don't store your values as an array, store as a two-dimensional array:
my #dataset = ([$val1, $val2], [$val3, $val4]);
or
my #dataset;
push (#dataset, [$val_n1, $val_n2]);
Then:
for my $value (#dataset) {
### Do stuff with $value->[0] and $value->[1]
}
There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = #_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;
Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is use write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my #data = read_data_file( $DATA_FILE );
my #matrix = initialize_matrix( 2, $amino_count, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open dnadata.txt - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my #data = <$infile>;
chomp #data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return #data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list contsruction function that is very very powerful.
# it calls the code in brackets for each member of the the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my #matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return #matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!
Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# stuff...
}
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt');
my #data = <$infile>;
close $infile;
chomp #data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 21 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}