How can I parse just part of a file with Perl? - perl

I'm a total newbie to Perl, but I've heard that it's great for parsing files, so I've thought of giving it a spin.
I have a text file that has the following sample info:
High school is used in some
parts of the world, particularly in
Scotland, North America and Oceania to
describe an institution that provides
all or part of secondary education.
The term "high school" originated in
Scotland with the world's oldest being
the Royal High School (Edinburgh) in
1505.
The Royal High School was used as a
model for the first public high school
in the United States, the English High
School founded in Boston,
Massachusetts, in 1821. The precise
stage of schooling provided by a high
school differs from country to
country, and may vary within the same
jurisdiction. In all of New Zealand
and Malaysia along with parts of
Australia and Canada, high school is
synonymous with secondary school, and
encompasses the entire secondary stage
of education.
======================================
Grade1 87.43%
Grade2 84.30%
Grade3 83.00%
=====================================
I want to parse the file and only get the numerical information. I
looked into regex, and I think I'd use something like
if (m/^%/) {
do something
}
else {
skip the line
}
But, what I really want to do is keep track of the variable on the
left and store the numerical value in that variable. So, after
parsing the file, I'd really like to have the following variables
to have the % value stored in them. The reason being, I want to
create a pie-chart/bar graph of the different grades.
Grade1 = 87.43
Grade2 = 84.30
...
Could you'll suggest methods I should be looking at?

You'll need a regular expression. Something like the following should work
while (<>) {
/(Grade[0-9]+)\s*([0-9]+\.[0-9]+)/;
$op{$1} = $2;
}
as a filter. The op hash will store the grade names and scores. This is preferable to automatically instantiating variables.

If you can guarantee that your points of interest are nested between two =s (and there isn't an odd number of these demarcations in a given file), the flip-flop operator is a handy thing here:
use strict; # These two pragmas go a long, ...
use warnings; # ... long way in helping you code better
my %scores; # Create a hash of scores
while (<>) { # The diamond operator processes all files ...
# ... supplied at command-line, line-by-line
next unless /^=+$/ .. /^=+$/; # The flip-flop operator used ...
# ... to filter out only 'grades'
my ( $name, $grade ) = split; # This usage of split will break ...
# ... the current line into an array
$scores{$name} = $grade; # Associate grade with name
}

You want to use a hash. Something like this should do the trick:
my %grades = (); # this is a hash
open(my $fh, "grade_file.txt" ) or die $!;
while( my $line = <$fh> ) {
if( my( $name, $grade ) = $line =~ /^(Grade\d+)\s(\d+\.\d+\%) ) {
$grades{$name} = $grade;
}
}
close($fh);
Your %grades hash would then contain the name and grade pairs. (Access it like my $value = $grades{'Grade1'}
Also just a note. The language is called "Perl", not "PERL". Many people in the Perl community get upset about it :-)

See Zaid's answer for an example of using the flip-flop operator (which is what I would recommend). However, if you run into difficulties with that (sometimes the DWIMmery might get in the way), you can also explicitly maintain state while reading the file line-by-line:
#!/usr/bin/perl
use strict; use warnings;
my %grades;
my $interesting;
while ( my $line = <DATA> ) {
if ( not $interesting and $line =~ /^=+\s*\z/ ) {
$interesting = 1;
next;
}
if ( $interesting ) {
if ( $line =~ /^=+\s*$/ ) {
$interesting = 0;
next;
}
elsif ( my ($name, $grade) = $line =~ /^(\w+)\s+(\d+\.\d+%)/ ) {
# Keep an array in case the same name occurs
# multiple times
push #{ $grades{$name} }, $grade;
}
}
}
use YAML;
print Dump \%grades;

Creating dynamic variable names is probably not going to help you much in producing a graph; using an array is almost certainly a better idea.
However, if you really think you want to do this:
while (my $line = <$your_infile_handler>){
if ($line =~ m/(.*) = ([0-9.]*)){
$$1 = $2;
}
}
should accomplish this.

Related

Nested Loops in Perl - Failing logic

I'm trying to work out a bigger problem, but have simplified the issue for readability, ultimately the logic below is the reason the extended program is failing.
I am using Perl to search for a short sequence of letters within a larger sequence (protein sequences), and if it is not found, then I'd like to do some calculations. I don't know whether I'm going crazy, but I can't work out why this logic is failing.
sub calculateEpitopeMutations {
my #mutationArray;
my #epitopeArray;
my $count;
my $localEpitope;
open( EPITOPESIN2, $ARGV[5] ) or die "Unable to open file $ARGV[5]\n";
while ( my $line = <EPITOPESIN2> ) {
chomp $line;
push #epitopeArray, $line;
}
while ( my ( $key, $value ) = each our %sequencesForCalculation ) {
foreach ( #epitopeArray ) {
$localEpitope = $_;
if ( $value =~ /($localEpitope)/g ) {
print "$key\n$localEpitope\nexactly the same\n\n";
next;
}
else {
#This is where I'd like to do the further calculations
print "$key\n$localEpitope\nthere is a difference\n\n";
next;
}
}
}
}
$ARGV[5] is the name of a text file containing a list of 9-character sequences, exactly like the following
RVSENIQRF
SFQVDCFLW
The idea is to put these into array #epitopeArray and iterate through these, and compare them with all (currently just one) $value sequences in the hash %sequencesForCalculation.
%sequencesForCalculation is a hash, where $value is a long sequence of characters, like this
MDSNTMSSFQVDCFLWHIRKRFADNGLGDAPFLDRLRRDQKSLKGRGNTLGLDIETATLVGKQIVEWILKEESSETLRMTIASVPTSRYLSDMTLEEMSRDWFMLMPRQKKIGPLCVRLDQAVMEKNIVLKANFSVIFNRLETLILLRAFTEEEAIVGEISPLPSLPGHTYEDVKNAVGVLIGGLEWNGNTVRVSENIQRFAWRNCDENGRPSLPPEQK
Currently, the small 9-character long sequence $localEpitope is contained in the longer sequence $value so when I iterate through the program, I should get this printed every time.
($key contains a header of information about the protein sequences, but is irrelevant so I have shortened it to just the variable name.)
$key
RVSENIQRF
Exactly the same
$key
SFQVDCFLW
Exactly the same
$key
But instead I'm getting this
$key
RVSENIQRF
exactly the same
$key
SFQVDCFLW
there is a difference
$key
Any ideas? Please let me know if anything further is required.
Update
TL;DR: You should change $value =~ /($localEpitope)/g to $value =~ /$localEpitope/
Okay now that we know the real circumstances, the problem (as melpomene points out in his comment) is that you have the /g modifier on your pattern match. There's no reason for that; you don't want check how many times the substring appears, you just want to know whether it's there at all
The problem is that variables subjected to a /g pattern search keep a state that says where the last search ended. So you're searching for $epitopeArray[0] in the longer string and finding it, and then searching for $epitopeArray[1] from where the previous search terminated. The first substring appears after the second one, so only the first is found
For more information on this behaviour, take a look at the pos function which returns the current value of this state. For instance pos($value) will return the character offset where the next m//g will start its search
This short program demonstrates the problem. With the /g modifier only BBB is found. Remove it and both are found
use strict;
use warnings;
use 5.010;
my $long_s = 'xxxAAAxxxBBBxxx';
for my $substr ( qw/ BBB AAA / ) {
if ( $long_s =~ /$substr/g ) {
say "$substr okay";
}
else {
say "$substr nope";
}
}
output
BBB okay
AAA nope
Original
You say
Currently, the small 9-character long sequence ($localEpitope) IS contained in the longer sequence ($value), and so when I iterate through the program, I should get the following printed everytime
So $localEpitope is a substring of $value and you're saying that
$value =~ /($localEpitope)/g
evaluates to true
That is correct behaviour. $value =~ /$localEpitope/ will check whether $localEpitope can be found anywhere in $value
Unfortunately it's not clear enough from what you've written to suggest a solution

How to find common parts in a paths with perl?

Having several paths, like:
1: /abc/def/some/common/part/xyz/file1.ext
2: /other/path/to/7433/qwe/some/common/part/anotherfile.ext
3: /misc/path/7433/qwe/some/common/part/filexx.ext
4: /2443/totally/different/path/file9988.ext
5: /abc/another/same/path/to/ppp/thisfile.ext
6: /deep1/deep2/another/same/path/to/diffone/filename.ext
I need find the common parts - each possible ones, eg. in the above if possible to found common parts:
/some/common/part/ - in the paths 1,2,3
/another/same/path/to/ - in the 5,6
/path/to/ - in the 2,5,6
/path/ - 2,3,4,5,6
etc..
I simply absoulutely haven't any idea how to solve this - what approach is good one
string based - somewhat find common parts of a string
list based - splitting all path into lists and somewhat compare arrays for common elements
tree-graph - somewhat find a common parts of a graph
other?
When i get some direction how to solve this problem, I'm (probably) able code it myself - so don't want free programmming service - but need some guiding how to start.
I'm sure than here is already some CPAN module what could help me, but I'm really have'nt idea how to find the right useful module from the list of 30k modules for the above problem. :(
EDIT - For what i need this:
Having approx. 200k files, in 10k directories and many of them "belong together", like:
/u/some/path/project1/subprojct/file1
/u/backup/of/work/date/project1/subproject/file2
/u/backup_of_backup/of/work/date/project1/subproject/file2
/u/new/addtions/to/projec1/subproject/file3
The files are dirrerent kind (pdf, images, doc, txt and so on), several are identical (like above file2 - easy to filter with Digest::MD5), but the only way "group them together" is based on "common parts" of a path - e.g. "project1/subproject" and so on..
Another files HAS the same MD5, so can filter out duplicates, but they are in different trees, like
/u/path/some/file
/u/path/lastest_project/menu/file
/u/path/jquery/menu/file
/u/path/example/solution/jquery/menu/file
so, the files are the same, (identical md5) but need somewhat move one copy to the right place (and delete others) and need somewhat determine the "most used" common paths, and collect tags... (old path elements are tags)
The idea behind is:
if the same md5 files are mostly stored under some common path - I can make a decision where to move one copy...
And it is more complicated, but for explanation is enough the above ;)
Simply need lowering the entropy on my HDD ;)
There is some discussion about finding the longest common consecutive substrings in this thread: http://www.nntp.perl.org/group/perl.fwp/2002/02/msg1662.html
The "winner" appears to be the following code, but there are a few other things in there you could try:
#!/usr/bin/perl
use strict;
use warnings;
sub lcs {
my $this = shift;
my $that = shift;
my $str = join "\0", $this, $that;
my $len = 1;
my $lcs;
while ($str =~ m{ ([^\0]{$len,}) (?= [^\0]* \0 [^\0]*? \1 ) }xg) {
$lcs = $1;
$len = 1 + length($1);
}
if ($len == 1) { print("No common substring\n"); }
else {
print("Longest common substring of length $len: \"");
print("$lcs");
print("\"\n");
}
}
Keep in mind you would have to adjust it a little bit to account for the fact that you only want entire subdirectories that match... ie, change if ($len == 1) to something like if ($len == 1 or $lcs !~ /^\// or $lcs !~ /\/$/)
You would also have to add some bookkeeping to keep track of which ones match. When I ran this code on your examples above it also found the /abc/ match in lines 1 & 5.
One thing that may or may not be a problem is that the following two lines:
/abc/another/same/path/to/ppp/thisfile.ext
/abc/another/different/path/to/ppp/otherfile.ext
Would match on:
/abc/another/
But not on:
/path/to/ppp/
But -- here's the bad news -- you will have to do O(n^2) comparisons with n=200,000 files. That could take an obscene amount of time.
Another solution would be to go through each path in your list, add all of its possible directory paths as keys to a hash and push the file itself to the hash (so that the value is an array of files that have this path in it). Something like this:
use strict;
use warnings;
my %links;
open my $fh, "<", 'filename' or die "Can't open $!";
while (my $line = <$fh>) {
chomp($line);
my #dirs = split /\//, $line;
for my $i (0..$#dirs) {
if ($i == $#dirs) {
push(#{ $links{$dirs[$i]} }, $line);
}
for my $j ($i+1..$#dirs) {
push(#{ $links{join("/",#dirs[$i..$j])} }, $line);
#PROCESS THIS if length of array is > 1
}
}
}
Of course, this would take an obscene amount of memory. With 200,000 files to process, you might have a hard time no matter what you try, but maybe you can break it up into more manageable chunks. Hopefully, this will give you a starting point.
To solve this problem, you need the correct data structure. A hash that counts the partial paths works well:
use File::Spec;
my %Count_of = ();
while( <DATA> ){
my #names = File::Spec->splitdir( $_ );
# remove file
pop #names;
# if absolute path, remove empty names at start
shift #names while length( $names[0] ) == 0;
# don't count blank lines
next unless #names;
# move two cursor thru the names,
# and count the partial parts
# created from one to the other
for my $i ( 0 .. $#names ){
for my $j ( $i .. $#names ){
my $partial_path = File::Spec->catdir( #names[ $i .. $j ] );
$Count_of{ $partial_path } ++;
}
}
}
# now display the results
for my $path ( sort { $Count_of{$b} <=> $Count_of{$a} || $a cmp $b } keys %Count_of ){
# skip if singleton.
next if $Count_of{ $path } <= 1;
printf "%3d : %s\n", $Count_of{ $path }, $path;
}
__DATA__
/abc/def/some/common/part/xyz/file1.ext
/other/path/to/7433/qwe/some/common/part/anotherfile.ext
/misc/path/7433/qwe/some/common/part/filexx.ext
/2443/totally/different/path/file9988.ext
/abc/another/same/path/to/ppp/thisfile.ext
/deep1/deep2/another/same/path/to/diffone/filename.ext

how to put a file into an array and save it in perl

Hello everyone I'm a beginner in perl and I'm facing some problems as I want to put my strings starting from AA to \ in to an array and want to save it. There are about 2000-3000 strings in a txt file starting from same initials i.e., AA to / I'm doing it by this way plz correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
if ( $line = m//\/g)
{
$flag = 1;
}
while ( $flag != 0)
{
for ($i = 0; $i <= 10000; $i++)
{ # Missing brace added by editor
$array[$i] = $line;
} # Missing brace added by editor
}
} # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;
my $recnum = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
print "$_\n";
$recnum++;
}
else
{
print "$recnum $_\n";
}
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my #data;
my $hashref = { };
my $nrecs = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
# End of group - save to data array and start new hash
$data[$nrecs++] = $hashref;
$hashref = { };
}
else
{
m/^([A-Z]+)\s+(.*)$/;
$hashref->{$1} = $2;
}
}
foreach my $i (0..$nrecs-1)
{
print "Record $i:\n";
foreach my $key (sort keys $data[$i])
{
print " $key = $data[$i]->{$key}\n";
}
}
print Data::Dumper->Dump([ \#data ], [ '#data' ]);
Sample output for example input:
Record 0:
AA = c0001
BB = afsfjgfjgjgjflffbg
CC = table
DD = hhhfsegsksgk
EB = jksgksjs
Record 1:
AA = e0002
BB = rejwkghewhgsejkhrj
CC = chair
DD = egrhjrhojohkhkhrkfs
VB = rkgjehkrkhkh;r
$#data = [
{
'EB' => 'jksgksjs',
'CC' => 'table',
'AA' => 'c0001',
'BB' => 'afsfjgfjgjgjflffbg',
'DD' => 'hhhfsegsksgk'
},
{
'CC' => 'chair',
'AA' => 'e0002',
'VB' => 'rkgjehkrkhkh;r',
'BB' => 'rejwkghewhgsejkhrj',
'DD' => 'egrhjrhojohkhkhrkfs'
}
];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out because you do not reset the flag whenever you find an break-line. You'll have to parse you input and exit the loop if necessary.
And second you never read any input within this loop and thus process the same $line over and over again.
You should not put the loop inside your code but instead you can use the following pattern (pseudo-code)
if flag != 0
append item to array
else
save array to file
start with new array
end
I believe what you want is to split the files content at \ though it's not too clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandlers read perlvar
#!perl
use strict;
use warnings;
my $content;
{
open my $fh, '<', 'test.txt';
local $/; # slurp mode
$content = <$fh>;
close $fh;
}
my #blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;
my #blocks;
{
open my $fh, '<', 'test.txt';
local $/ = '\\'; # seperate at \
#blocks = <$fh>;
close $fh;
}
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless, unless you change it. Because if I were to print the #data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):
Find a line that begins with "AA",
and save that into $line
Concatenate every new line from the
file into $line
When you find a line that begins with
a backslash \, stop concatenating
lines and save $line into #data.
Then, find the next line that begins
with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA$/, but then you need to watch out for whitespace at the beginning and end of line. So perhaps /^\s*\\\s*$/ and /^\s*AA\s*$/ instead.
The code:
use warnings;
use strict;
my $line="";
my #data;
while (<DATA>) {
if (/^AA/) {
$line = $_;
while (<DATA>) {
$line .= $_;
last if /^\\/;
}
}
push #data, $line;
}
use Data::Dumper;
print Dumper \#data;
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\

Perl Array Question

Never done much programming -- been charged at work with manipulating the data from comment cards. Using perl so far I've got the database to correctly put its daily comments into an array. Comments are each one LINE of text within the database, so I just split the array on the line-break.
my #comments = split("\n", $c_data);
And yes, this being my first time programming, that took me wayyy too long to figure out.
At this point I now need to organize these array elements (is that what I should call them?) into their own separate scalars based on capitalized words (this is a behavior of the database, which was at one point corrupt).
Example of what two elements of the array look like:
print "$comments[0]\n";
This dining experience was GOOD blah blah blah.
or
print "$comments[1]\n";
Overall this was a BAD time and me and my blah blah.
These "good" or "bad" or "best" are already capitalized by the database the data came from.
What's the easiest way in Perl to get these lines into scalars from an array based on these capitalized words?
If I understand you correctly, you want to merge array elements that match a certain word. You can do it like this:
my #bad_comments = grep { /\bBAD\b/ } #comments;
my #good_comments = grep { /\bGOOD\b/ } #comments;
That way all 'good' and 'bad' comments go to each own array.
Now if you need to merge them into a scalar you'd want to join them (opposite of split):
my $bad_comments = join "\n", grep { /\bBAD\b/ } #comments;
my $good_comments = join "\n", grep { /\bGOOD\b/ } #comments;
Think hash table when you want to group data by arbitrary string keys. In this case, you have an array of GOOD comments and an array of BAD comments. What if you had an array of SO-SO comments? A strategy based on having array variables #good, #bad, #soso breaks down fast.
You have some ways to go before you can fully understand the code below:
#!/usr/bin/perl
use strict; use warnings;
use Regex::PreSuf;
my %comments;
my #types = qw( GOOD BAD ); # DRY
my $types_re = presuf #types;
while ( my $comment = <DATA> ) {
chomp $comment;
last unless $comment =~ /\S/;
# capturing match in list context returns captured strings
my ($type) = ( $comment =~ /($types_re)/ );
push #{ $comments{$type} }, $comment;
}
for my $type ( #types ) {
print "$type comments:\n";
for my $comment ( #{ $comments{$type} } ) {
print $comment, "\n";
}
}
__DATA__
This dining experience was GOOD blah blah blah.
Overall this was a BAD time and me and my blah blah.
You could use regular expressons, eg:
if ($comments[$i] =~ /GOOD/) {
# good comment
}
or more generally
if ($comments[$i] =~ /\b([A-Z]{2,})\b/) {
print "Comment: $1\n";
}
Here, \b means word boundary, () are used to extract captured text, [A-Z] represent a group of capital characters - capital letters, {2,} means that there have to be 2 or more characters defined by previous class.
I would store all your comments into a hash-of-arrays data structure, with the key being your capitalized word.
Here is a general solution to grab any capitalized word (assuming only one per comment), not just GOOD and BAD:
use strict;
use warnings;
my #comments = <DATA>;
chomp #comments;
my %data;
for (#comments) {
my $cap;
for (split) {
$cap = $_ if /^[A-Z]+$/;
}
if ($cap) { push #{ $data{$cap} }, $_ }
}
use Data::Dumper; print Dumper(\%data);
__DATA__
This is GOOD stuff
Here's some BAD stuff.
More of the GOOD junk.
Nothing here.
Here is the output:
$VAR1 = {
'BAD' => [
'Here\'s some BAD stuff.'
],
'GOOD' => [
'This is GOOD stuff',
'More of the GOOD junk.'
]
};
In my opinion, your best bet would be to create a disk-based database of some sort (SQLite?) that stores the comments and type as separate data.
Then use one of the other posted solutions to import your existing data into it.
The only problem here is that you need to learn Perl's DBI layer and a bit of SQL to use SQLite with Perl.
Not sure what you mean by "organize" and "based on".
If you mean produce a list of any capitalized words, each with a list of the lines containing that word (similar to toolic's solution, you could do this:
my %CAPS = ();
map {
my ($word) = /(\b[A-Z]+\b)/;
push( #{ $CAPS{$word} }, $_)
} #comments;
This will build a mapping of WORDS to things, and the things in this case are going to be lists of lines.
And you can refer to these lists as $CAPS{'GOOD'} or $CAPS{'BAD'}, or $CAPS{whatever}.

Why does my Perl for loop exit early?

I am trying to get a perl loop to work that is working from an array that contains 6 elements. I want the loop to pull out two elements from the array, perform certain functions, and then loop back and pull out the next two elements from the array until the array runs out of elements. Problem is that the loop only pulls out the first two elements and then stops. Some help here would be greatly apperaciated.
my open(infile, 'dnadata.txt');
my #data = < infile>;
chomp #data;
#print #data; #Debug
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
my $i=0;
my $j=0;
my #matrix =();
for(my $i=0; $i<2; $i++){
for( my $j=0; $j<$aalen; $j++){
$matrix[$i][$j] = 0;
}
}
The guidelines for this program states that the program should ignore the presence of gaps in the program. which means that DNA code that is matched up with a gap should be ignored. So the code that is pushed through needs to have alignments linked with gaps removed.
I need to modify the length of the array by two since I am comparing two sequence in this part of the loop.
#$lemseqcomp = $lenarray / 2;
#print $lenseqcomp;
#I need to initialize these saclar values.
$junk1 = " ";
$junk2 = " ";
$seq1 = " ";
$seq2 = " ";
This is the loop that is causeing issues. I belive that the first loop should move back to the array and pull out the next element each time it loops but it doesn't.
for($i=0; $i<$lenarray; $i++){
#This code should remove the the last value of the array once and
#then a second time. The sequences should be the same length at this point.
my $last1 =pop(#data1);
my $last2 =pop(#data1);
for($i=0; $i<length($last1); $i++){
my $letter1 = substr($last1, $i, 1);
my $letter2 = substr($last2, $i, 1);
if(($letter1 eq '-')|| ($letter2 eq '-')){
#I need to put the sequences I am getting rid of somewhere. Here is a good place as any.
$junk1 = $letter1 . $junk1;
$junk2 = $letter1 . $junk2;
}
else{
$seq1 = $letter1 . $seq1;
$seq2 = $letter2 . $seq2;
}
}
}
print "$seq1\n";
print "$seq2\n";
print "#data1\n";
I am actually trying to create a substitution matrix from scratch and return the data. The reason why the code looks weird, is because it isn't actually finished yet and I got stuck.
This is the test sequence if anyone is curious.
YFRFR
YF-FR
FRFRFR
ARFRFR
YFYFR-F
YFRFRYF
First off, if you're going to work with sequence data, use BioPerl. Life will be so much easier. However...
Since you know you'll be comparing the lines from your input file as pairs, it makes sense to read them into a datastructure that reflects that. As elsewhere suggested, an array like #data[[line1, line2],[line3,line4]) ensures that the correct pairs of lines are always together.
What I'm not clear on what you're trying to do is:
a) are you generating a consensus
sequence where the 2 sequences are
difference only by gaps
b) are your 2 sequences significantly
different and you're trying to
exclude the non-aligning parts and
then generate a consensus?
So, does the first pair represent your data, or is it more like the second?
ATCG---AAActctgGGGGG--taGC
ATCGcccAAActctgGGGGGTTtaGC
ATCG---AAActctgGGGGG--taGCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
ATCGcccAAActctgGGGGGTTtaGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
The problem is that you're using $i as the counter variable for both your loops, so the inner loop modifies the counter out from under the outer loop. Try changing the inner loop's counter to $j, or using my to localize them properly.
Don't store your values as an array, store as a two-dimensional array:
my #dataset = ([$val1, $val2], [$val3, $val4]);
or
my #dataset;
push (#dataset, [$val_n1, $val_n2]);
Then:
for my $value (#dataset) {
### Do stuff with $value->[0] and $value->[1]
}
There are lots of strange things in your code: you are initializing a matrix then not using it; reading a whole file into an array; scanning a string C style but then not doing anything with the unmatched values; and finally, just printing the two last processed values (which, in your case, are the two first elements of your array, since you are using pop.)
Here's a guess.
use strict;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
# Preparing a regular expression. This is kind of useful if processing large
# amounts of data. This will match anything that is not in the string above.
my $regex = qr([^$aminoacids]);
# Our work function.
sub do_something {
my ($a, $b) = #_;
$a =~ s/$regex//g; # removing unwanted characters
$b =~ s/$regex//g; # ditto
# Printing, saving, whatever...
print "Something: $a - $b\n";
return ($a, $b);
}
my $prev;
while (<>) {
chomp;
if ($prev) {
do_something($prev, $_);
$prev = undef;
} else {
$prev = $_;
}
}
print STDERR "Warning: trailing data: $prev\n"
if $prev;
Since you are a total Perl/programming newbie, I am going to show a rewrite of your first code block, then I'll offer you some general advice and links.
Let's look at your first block of sample code. There is a lot of stuff all strung together, and it's hard to follow. I, personally, am too dumb to remember more than a few things at a time, so I chop problems into small pieces that I can understand. This is (was) known as 'chunking'.
One easy way to chunk your program is use write subroutines. Take any particular action or idea that is likely to be repeated or would make the current section of code long and hard to understand, and wrap it up into a nice neat package and get it out of the way.
It also helps if you add space to your code to make it easier to read. Your mind is already struggling to grok the code soup, why make things harder than necessary? Grouping like things, using _ in names, blank lines and indentation all help. There are also conventions that can help, like making constant values (values that cannot or should not change) all capital letters.
use strict; # Using strict will help catch errors.
use warnings; # ditto for warnings.
use diagnostics; # diagnostics will help you understand the error messages
# Put constants at the top of your program.
# It makes them easy to find, and change as needed.
my $AMINO_ACIDS = 'ARNDCQEGHILKMFPSTWYV';
my $AMINO_COUNT = length($AMINO_ACIDS);
my $DATA_FILE = 'dnadata.txt';
# Here I am using subroutines to encapsulate complexity:
my #data = read_data_file( $DATA_FILE );
my #matrix = initialize_matrix( 2, $amino_count, 0 );
# now we are done with the first block of code and can do more stuff
...
# This section down here looks kind of big, but it is mostly comments.
# Remove the didactic comments and suddenly the code is much more compact.
# Here are the actual subs that I abstracted out above.
# It helps to document your subs:
# - what they do
# - what arguments they take
# - what they return
# Read a data file and returns an array of dna strings read from the file.
#
# Arguments
# data_file => path to the data file to read
sub read_data_file {
my $data_file = shift;
# Here I am using a 3 argument open, and a lexical filehandle.
open( my $infile, '<', $data_file )
or die "Unable to open dnadata.txt - $!\n";
# I've left slurping the whole file intact, even though it can be very inefficient.
# Other times it is just what the doctor ordered.
my #data = <$infile>;
chomp #data;
# I return the data array rather than a reference
# to keep things simple since you are just learning.
#
# In my code, I'd pass a reference.
return #data;
}
# Initialize a matrix (or 2-d array) with a specified value.
#
# Arguments
# $i => width of matrix
# $j => height of matrix
# $value => initial value
sub initialize_matrix {
my $i = shift;
my $j = shift;
my $value = shift;
# I use two powerful perlisms here: map and the range operator.
#
# map is a list contsruction function that is very very powerful.
# it calls the code in brackets for each member of the the list it operates against.
# Think of it as a for loop that keeps the result of each iteration,
# and then builds an array out of the results.
#
# The range operator `..` creates a list of intervening values. For example:
# (1..5) is the same as (1, 2, 3, 4, 5)
my #matrix = map {
[ ($value) x $i ]
} 1..$j;
# So here we make a list of numbers from 1 to $j.
# For each member of the list we
# create an anonymous array containing a list of $i copies of $value.
# Then we add the anonymous array to the matrix.
return #matrix;
}
Now that the code rewrite is done, here are some links:
Here's a response I wrote titled "How to write a program". It offers some basic guidelines on how to approach writing software projects from specification. It is aimed at beginners. I hope you find it helpful. If nothing else, the links in it should be handy.
For a beginning programmer, beginning with Perl, there is no better book than Learning Perl.
I also recommend heading over to Perlmonks for Perl help and mentoring. It is an active Perl specific community site with very smart, friendly people who are happy to help you. Kind of like Stack Overflow, but more focused.
Good luck!
Instead of using a C-style for loop, you can read data from an array two elements at a time using splice inside a while loop:
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# stuff...
}
I've cleaned up some of your other code below:
use strict;
use warnings;
open(my $infile, '<', 'dnadata.txt');
my #data = <$infile>;
close $infile;
chomp #data;
my $aminoacids = 'ARNDCQEGHILKMFPSTWYV';
my $aalen = length($aminoacids);
# initialize a 2 x 21 array for holding the amino acid data
my $matrix;
foreach my $i (0 .. 1)
{
foreach my $j (0 .. $aalen-1)
{
$matrix->[$i][$j] = 0;
}
}
# Process all letters in the DNA data
while (my ($letter1, $letter2) = splice(#data, 0, 2))
{
# do something... not sure what?
# you appear to want to look up the letters in a reference table, perhaps $aminoacids?
}