Nested while loops when reading files in Perl

I want to compare two files so I wrote the following code:
while ($line1 = <FH1>) {
    while ($line2 = <FH2>) {
        next if $line1 > $line2;
        last if $line1 < $line2;
    }
    next;
}
My question here is: when the outer loop advances to the next line of file1 and control re-enters the inner loop, will the inner while statement read from the first line of file2 again, or continue where it left off on the previous iteration of the outer loop?
Thanks

You should always use strict and use warnings at the start of all your programs and declare all variables at their point of first use. This applies especially when you are asking for help with your code.
Is all the data in your files numeric? If not, then enabling warnings would have told you that the < and > operators are for comparing numeric values rather than general strings.
Once a file has been read through completely - i.e. the second loop's while condition terminates - you can read no more data from the file unless you open it again or use seek to rewind to the beginning.
In general it is better in these circumstances to read the smaller of the two files into an array and use the data from there. If both files are very large then something special must be done.
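For instance, here is a minimal sketch of that approach (the file names and the line-by-line equality check are stand-ins for your real data and comparison):
use strict;
use warnings;

# Read the smaller file into memory once
open my $small, '<', 'file2.txt' or die "file2.txt: $!";
my @lines2 = <$small>;
close $small;

# Stream the larger file and compare each of its lines
open my $big, '<', 'file1.txt' or die "file1.txt: $!";
while (my $line1 = <$big>) {
    for my $line2 (@lines2) {
        print "match: $line1" if $line1 eq $line2;
    }
}
close $big;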
What sort of file comparison are you trying to do? Are you making sure that the two files are identical, or that all data in the second file appears in the first, or something else? Please give an example of your two data files so that we can help you better.

The inner while loop will consume all the content of the FH2 filehandle when you have read the first line from the FH1 handle. If I can intuit what you want to accomplish, one way to go about it would be to read from both handles in the same statement:
while ( defined($line1 = <FH1>) && defined($line2 = <FH2>) ) {
    # 'lt' is for string comparison, '<' is for numbers
    if ($line1 lt $line2) {
        # print a warning?
        last;
    }
}

The inner loop will continue from its last known position in FH2 - if you want it to restart from the beginning of the file you need to put:
seek(FH2, 0, 0);
before the inner while. Note that seek takes its arguments as (FILEHANDLE, POSITION, WHENCE); a whence of 0 means SEEK_SET, i.e. seek relative to the start of the file.
Documentation for seek is here in perldoc
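Putting it together, a minimal sketch of the rewind (assuming FH1 and FH2 are already open on your two files):
use Fcntl qw(SEEK_SET);

while (my $line1 = <FH1>) {
    seek(FH2, 0, SEEK_SET);   # rewind file2 before each pass
    while (my $line2 = <FH2>) {
        # ... compare $line1 and $line2 here ...
    }
}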

Perl subroutines

Here I fixed most of my mistakes, and thank you all. Any other advice please with my hash at this point: how can I clean up each word and put the word and its frequency in a hash, excluding the empty words? I think my code makes sense now.
So you can focus on the key part of the algorithm, how about accepting input on STDIN and writing output to STDOUT? That way there's no argument checking, etc. Just a simple:
$ prog < words.txt
All you really need is a very simple algorithm:
Read a line
Split it into words
Record a count of the word
When done, display the counts
Here's a sample program:
#! /usr/bin/perl -w
use strict;
my (%data);
while (<STDIN>) {
    chomp;
    my (@words) = split(/\s+/);
    foreach my $word (@words) {
        if (!defined($data{$word})) {
            $data{$word} = 0;
        }
        $data{$word}++;
    }
}
foreach (sort(keys(%data))) {
    print "$_: $data{$_}\n";
}
Once you understand this and have it working in your environment, you can extend it to meet your other requirements:
remove non-alphabetic characters from each word (see the sketch after this list)
print three results per line
use input and output files
put the algorithm into a subroutine
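For example, the first extension might drop into the inner loop like this (a sketch; the pattern [^a-zA-Z] is an assumption about what your assignment counts as non-alphabetic):
foreach my $word (@words) {
    $word =~ s/[^a-zA-Z]//g;    # strip anything that is not a letter
    next unless length $word;   # skip words that became empty
    $data{$word}++;
}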
I agree that starting with dave's answer would be more productive, but if you are interested in your mistakes, here is what I see:
You assign the return value of checkArgs to a scalar variable $checkArgs, but return an array. This means $checkArgs will always contain 2 (the size of the array) after the call, because the program dies if the number of arguments is not 2. It is not very harmful, since you do not use the value later, but why do you need it at all in this case?
You open files and close them immediately without reading from them. That does not make sense.
Statement
while (<>)
reads either from standard input or from all the files given as command line arguments. The latter variant is close to what you want, but your second argument is the output file, not an input; the diamond operator will try to read from it too. You have two options: a) use only one file name in the command line arguments, read the file with <>, write to standard output, and redirect the output to a file in the shell; b) use
while(<$file1>)
instead - before closing the files, of course. Option a) is the traditional Unix and Perl style, but b) makes for clearer code for beginners.
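With option a), the program would be invoked along these lines (the program and file names here are placeholders):
$ perl wordfreq.pl input.txt > output.txt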
Statements
return $word;
and
return $str, $hash{$str};
return the corresponding values on the first iteration of their loops; all other data remain unprocessed. In the first case, you should create a local array, store each $word in it, and return the array as a whole. In the second case, you already have such a local %hash; it is enough to return that hash. In both cases, you should assign the return values of the functions not to scalars, but to an array and a hash, respectively. As it stands, you actually lose all your data.
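A minimal sketch of that fix (the sub and variable names are made up for illustration):
sub get_words {
    my ($line) = @_;
    my @words;
    for my $word (split /\s+/, $line) {
        push @words, $word if length $word;   # skip empty words
    }
    return @words;              # return the whole list, not one element
}

my @words = get_words($line);   # assign to an array, not a scalar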

Find doublet data in csv file

I'm trying to write a Perl script that can check if a csv file has doublet data in the last two columns. If doublet data is found, an additional column with the word "doublet" should be added.
For example, the original file looks like this:
cat,111,dog,555
cat,444,dog,222
mouse,333,dog,555
mouse,555,cat,555
The final output file should look like this:
cat,111,dog,555,doublet
cat,444,dog,222
mouse,333,dog,555,doublet
mouse,555,cat,555
I'm very much a newbie to Perl scripting, so I won't expose myself with what I've written so far. I tried to read through the file, splitting the data into two arrays: one with the first two columns, and the other with the last two columns.
The idea was then to check for doublets in the second array, add (push?) the additional "doublet" column to that array, and then merge the two arrays back together again(?)
Unfortunately my brain has now collapsed, and I need help from someone smarter than me, to guide me in the right direction.
Any help would be very much appreciated, thanks.
This is not the most efficient way, but here is something to get you started. The script assumes that your input data is comma-separated and can contain any number of columns.
#!/usr/bin/env perl
use strict;
use warnings;

my %h;
my @lines;
while (<>) {
    chomp;
    push @lines, $_;                      # save each line
    my @fields = split(/,/, $_);
    if (@fields > 1) {
        $h{join("", @fields[-2,-1])}++;   # keep track of how many times a doublet appears
    }
}
# go back through the lines. If doublet appears 2 or more times, append ',doublet' to the output.
foreach (@lines) {
    my $d = "";
    my @fields = split(/,/, $_);
    if (@fields > 1 && $h{join("", @fields[-2,-1])} >= 2) {
        $d = ",doublet";
    }
    print $_, $d, $/;
}
The syntax @fields[-2,-1] is an array slice that returns a list with the last two column values. Then join("", ...) concatenates them together, and this becomes the key to the hash. $/ is the input record separator, which is newline by default and is quicker to write than "\n". Running the script on your sample input gives:
cat,111,dog,555,doublet
cat,444,dog,222
mouse,333,dog,555,doublet
mouse,555,cat,555

Perl Program to Mimic RNA Synthesis

Looking for suggestions on how to approach my Perl programming homework assignment to write an RNA synthesis program. I've summarized and outlined the program below. Specifically, I'm looking for feedback on the blocks below (I'll number them for easy reference). I've read up to chapter 6 in Elements of Programming with Perl by Andrew Johnson (great book). I've also read the perlfunc and perlop pod pages with nothing jumping out on where to start.
Program Description: The program should read an input file from the command line, translate it into RNA, and then transcribe the RNA into a sequence of uppercase one-letter amino acid names.
Accept a file named on the command line
here I will use the <> operator
Check to make sure the file only contains acgt or die
if ( <> ne [acgt] ) { die "usage: file must only contain nucleotides \n"; }
Transcribe the DNA to RNA (Every A replaced by U, T replaced by A, C replaced by G, G replaced by C)
not sure how to do this
Take this transcription & break it into 3 character 'codons', starting at the first occurrence of "AUG"
not sure, but I'm thinking this is where I will start a %hash variable?
Take the 3 character codons and give them a single letter symbol (an uppercase one-letter amino acid name)
Assign each key a value (there are 70 possibilities here, so I'm not sure where to store them or how to access them)
If a gap is encountered a new line is started and process is repeated
not sure but we can assume that gaps are multiples of threes.
Am I approaching this the right way? Is there a Perl function that I'm overlooking that can simplify the main program?
Note
Must be a self-contained program (with stored values for codon names & symbols).
Whenever the program reads a codon that has no symbol, this is a gap in the RNA; it should start a new line of output and begin at the next occurrence of "AUG". For simplicity we can assume that gaps are always multiples of three.
Before I spend any additional hours on research I am hoping to get confirmation that I'm taking the right approach. Thanks for taking time to read and for sharing your expertise!
1. here I will use the <> operator
OK, your plan is to read the file line by line. Don't forget to chomp each line as you go, or you'll end up with newline characters in your sequence.
2. Check to make sure the file only contains acgt or die
if ( <> ne [acgt] ) { die "usage: file must only contain nucleotides \n"; }
In a while loop, the <> operator puts the line read into the special variable $_, unless you assign it explicitly (my $line = <>).
In the code above, you're reading one line from the file and discarding it. You'll need to save that line.
Also, the ne operator compares two strings, not a string and a regular expression. You'll need the !~ operator here (or the =~ one with a negated character class, [^acgt]). If you need the test to be case-insensitive, look into the i flag for regular expression matching.
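A minimal sketch of that check (the loop structure is assumed):
while (my $line = <>) {
    chomp $line;
    die "usage: file must only contain nucleotides\n"
        if $line =~ /[^acgt]/i;   # anything outside acgt/ACGT is fatal
    # ... process $line ...
}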
3. Transcribe the DNA to RNA (Every A replaced by U, T replaced by A, C replaced by G, G replaced by C).
As GWW said, check your biology. T->U is the only step in transcription. You'll find the tr (transliterate) operator helpful here.
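For example, assuming the sequence is in $dna:
(my $rna = $dna) =~ tr/Tt/Uu/;   # copy $dna, then replace every T with U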
4. Take this transcription & break it into 3 character 'codons' starting at the first occurrence of "AUG"
not sure but I'm thinking this is where I will start a %hash variable?
I would use a buffer here. Define a scalar outside the while(<>) loop. Use index to find "AUG". If you don't find it, put the last two bases in that scalar (you can use substr $line, -2, 2 for that). On the next iteration of the loop, append (with .=) the line to those two bases, and then test for "AUG" again. If you get a hit, you'll know where, so you can mark the spot and start translation.
5. Take the 3 character "codons" and give them a single letter Symbol (an uppercase one-letter amino acid name)
Assign each key a value (there are 70 possibilities here, so I'm not sure where to store them or how to access them)
Again, as GWW said, build a hash table:
%codons = ( AUG => 'M', ...).
Then you can use (e.g.) split to build an array from the current line you're examining, build codons three elements at a time, and grab the correct amino acid code from the hash table.
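A hedged sketch of that step, using substr rather than split (the %codons table is truncated, and $rna is assumed to start at an AUG):
my %codons = ( AUG => 'M' );   # ... fill in the rest of the table
my $rna = 'AUGGCUUAA';         # example input
my $protein = '';
for (my $i = 0; $i + 3 <= length $rna; $i += 3) {
    my $codon = substr $rna, $i, 3;
    last unless exists $codons{$codon};   # no symbol: treat as a gap
    $protein .= $codons{$codon};
}
print "$protein\n";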
6. If a gap is encountered, a new line is started and the process is repeated
not sure, but we can assume that gaps are multiples of three.
See above. You can test for the existence of a gap with exists $codons{$current_codon}.
7. Am I approaching this the right way? Is there a Perl function that I'm overlooking that can simplify the main program?
You know, looking at the above, it seems way too complex. I built a few building blocks, the subroutines read_codon and translate; I think they help the logic of the program immensely.
I know this is a homework assignment, but I figure it might help you get a feel for other possible approaches:
use warnings; use strict;
use feature 'state';
# read_codon works by using the new 'state' feature in Perl 5.10.
# Both @buffer and $handle represent 'state' of this function:
# both permit abstracting reading codons from processing the file
# line-by-line.
# Once read_codon is called for the first time, both are initialized.
# Since $handle is a state variable and is opened only once, the
# current file handle position is never reset. Similarly, @buffer
# always holds whatever was left from the previous call.
# The base case is that @buffer contains less than 3bp, in which case
# we need to read a new line, remove the "\n" character,
# split it and push the resulting list onto the end of @buffer.
# If we encounter EOF on the $handle, then we have exhausted the file,
# and the @buffer as well, so we 'return' undef.
# Otherwise we pick the first 3bp of @buffer, join them into a string,
# transcribe it and return it.
sub read_codon {
    my ($file) = @_;
    state @buffer;
    state $handle;
    unless ($handle) {
        # open only on the first call, so the read position persists
        open $handle, '<', $file or die $!;
    }
    if (@buffer < 3) {
        my $new_line = scalar <$handle> or return;
        chomp $new_line;
        push @buffer, split //, $new_line;
    }
    return transcribe(
        join '',
        shift @buffer,
        shift @buffer,
        shift @buffer
    );
}
sub transcribe {
    my ($codon) = @_;
    $codon =~ tr/T/U/;
    return $codon;
}
# translate works by using the new 'state' feature in Perl 5.10.
# The $TRANSLATE state is initialized to 0.
# As codons are passed to it,
# the sub updates the state according to start and stop codons.
# Since $TRANSLATE is a state variable, it is only initialized once
# (the first time the sub is called).
# If the current state is 'translating',
# then the sub returns the appropriate amino-acid from the %codes table, if any.
# Thus this provides a logical way for the caller of this sub to determine whether
# it should print an amino-acid or not: if not, the sub will return undef.
# %codes could also be a state variable, but since it is not actually a 'state',
# it is initialized once, in a code block visible from the sub
# but separate from the rest of the program, since it is 'private' to the sub.
{
    our %codes = (
        AUG => 'M',
        ...
    );
    sub translate {
        my ($codon) = @_ or return;
        state $TRANSLATE = 0;
        $TRANSLATE = 1 if $codon =~ m/AUG/i;
        $TRANSLATE = 0 if $codon =~ m/U(AA|GA|AG)/i;
        return $codes{$codon} if $TRANSLATE;
    }
}
I can give you a few hints on a few of your points.
I think your first goal should be to parse the file character by character, ensuring each one is valid, group them into sets of three nucleotides and then work on your other goals.
I think your biology is a bit off as well: when you transcribe DNA to RNA you need to think about which strands are involved. You may not need to "complement" your bases during your transcription step.
2. You should check this as your parse the file character by character.
3. You could do this with a loop and some if statements, or a hash.
4. This could probably be done with a counter as you read the file character by character, since you need to insert a space after every third character.
5. This would be a good place to use a hash that's based on the amino acid codon table.
6. You'll have to look for the gap character as you parse the file. This seems to contradict your #2 requirement since the program says your text can only contain ATGC.
There are a lot of perl functions that could make this easier. There are also perl modules such as bioperl. But I think using some of these could defeat the purpose of your assignment.
Look at BioPerl and browse the source-modules for indicators on how to go about it.

Can someone suggest how this Perl script works?

I have to maintain the following Perl script:
#!/usr/bin/perl -w
die "Usage: $0 <file1> <file2>\n" unless scalar(@ARGV)>1;
undef $/;
my @f1 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
my @f2 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
die "Error: file1 has $#f1 serials, file2 has $#f2\n" if ($#f1 != $#f2);
foreach my $g (0 .. $#f1) {
    print (($f2[$g] =~ m/RESULT:\s+PASS/) ? $f2[$g] : $f1[$g]);
}
print STDERR "$#f1 serials found\n";
I know pretty much what it does, but how it's done is difficult to follow. The calls to split() are particularly puzzling.
It's fairly idiomatic Perl and I would be grateful if a Perl expert could make a few clarifying suggestions about how it does it, so that if I need to use it on input files it can't deal with, I can attempt to modify it.
It combines the best results from two datalog files containing test results. The datalog files contain results for various serial numbers and the data for each serial number begins and ends with SERIAL NUMBER: n (I know this because my equipment creates the input files)
I could describe the format of the datalog files, but I think the only important aspect is the SERIAL NUMBER: n because that's all the Perl script checks for
The ternary operator is used to print a value from one input file or the other, so the output can be redirected to a third file.
This may not be what I would call "idiomatic" (that would be use Module::To::Do::Task) but they've certainly (ab)used some language features here. I'll see if I can't demystify some of this for you.
die "Usage: $0 <file1> <file2>\n" unless scalar(#ARGV)>1;
This exits with a usage message if they didn't give us enough arguments. Command-line arguments are stored in @ARGV, which is like C's char **argv except that the first element is the first argument, not the program name. scalar(@ARGV) converts @ARGV to "scalar context", which means that, while @ARGV is normally a list, we want to know about its scalar (i.e. non-list) properties. When a list is converted to scalar context, we get the list's length. Therefore, the die is triggered only if we passed one argument or none.
This is rather misleading, because it will turn out your program needs two arguments. If I wrote this, I would write:
die "Usage: $0 <file1> <file2>\n" unless #ARGV == 2;
Notice I left off the scalar(@ARGV) and just wrote @ARGV. The scalar() function forces scalar context, but if we're comparing equality with a number, Perl can implicitly assume scalar context.
undef $/;
Oof. The $/ variable is a special Perl built-in variable that Perl uses to tell what a "line" of data from a file is. Normally, $/ is set to the string "\n", meaning when Perl tries to read a line it will read up until the next linefeed (or carriage return/linefeed on Windows). Your writer has undef-ed the variable, though, which means when you try to read a "line", Perl will just slurp up the whole file.
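In modern code you would usually scope that change with local instead of clobbering $/ globally; a minimal sketch (the file name is a placeholder):
my $content = do {
    local $/;    # $/ is undef only inside this block
    open my $fh, '<', 'file.txt' or die $!;
    <$fh>;       # one "read" now returns the whole file
};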
my #f1 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
This is a fun one. <> is a special filehandle that reads line-by-line from each file given on the command line. However, since we've told Perl that a "line" is an entire file, calling <> once will read in the entire file given as the first argument, storing it temporarily as a string.
Then we take that string and split() it up into pieces, using the regex /(?=(?:SERIAL NUMBER:\s+\d+))/. This uses a lookahead, which tells our regex engine "only match if this stuff comes after our match, but this stuff isn't part of our match," essentially allowing us to look ahead of our match to check on more info. It basically splits the file into pieces, where each piece (except possibly the first) begins with "SERIAL NUMBER:", some arbitrary whitespace (the \s+ part), and then some digits (the \d+ part). I can't teach you regexes, so for more info I recommend reading perldoc perlretut - they explain all of that better than I ever will.
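A tiny self-contained demonstration of a lookahead split (the sample data is made up):
my $text   = "SERIAL NUMBER: 1 PASS SERIAL NUMBER: 2 FAIL";
my @chunks = split /(?=SERIAL NUMBER:)/, $text;
# @chunks is ("SERIAL NUMBER: 1 PASS ", "SERIAL NUMBER: 2 FAIL");
# the lookahead matches without consuming, so each chunk keeps its header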
Once we've split the string into a list, we store that list in an array called @f1.
my #f2 = split(/(?=(?:SERIAL NUMBER:\s+\d+))/, <>);
This does the same thing as the last line, only for the second file - <> has already consumed the entire first file - and stores the list in another array called @f2.
die "Error: file1 has $#f1 serials, file2 has $#f2\n" if ($#f1 != $#f2);
This line prints an error message if @f1 and @f2 are different sizes. $#f1 is a special syntax for arrays - it returns the index of the last element, which will usually be the size of the list minus one (lists are 0-indexed, like in most languages). He uses this same value in his error message, which may be deceptive, as it will print 1 fewer than might be expected. I would write it as:
die "Error: file $ARGV[0] has ", $#f1 + 1, " serials, file $ARGV[1] has ", $#f2 + 1, "\n"
if $#f1 != $#f2;
Notice I changed "file1" to "file $ARGV[0]" - that way, it will print the name of the file you specified, rather than just the ambiguous "file1". Notice also that I split up the die() function and the if() condition on two lines. I think it's more readable that way. I also might write unless $#f1 == $#f2 instead of if $#f1 != $#f2, but that's just because I happen to think != is an ugly operator. There's more than one way to do it.
foreach my $g (0 .. $#f1) {
This is a common idiom in Perl. We normally use for() or foreach() (same thing, really) to iterate over each element of a list. However, sometimes we need the indices of that list (some languages might use the term "enumerate"), so we've used the range operator (..) to make a list that goes from 0 to $#f1, i.e., through all the indices of our list, since $#f1 is the value of the highest index in our list. Perl will loop through each index, and in each loop, will assign the value of that index to the lexically-scoped variable $g (though why they didn't use $i like any sane programmer, I don't know - come on, people, this tradition has been around since Fortran!). So the first time through the loop, $g will be 0, and the second time it will be 1, and so on until the last time it is $#f1.
print (($f2[$g] =~ m/RESULT:\s+PASS/) ? $f2[$g] : $f1[$g]);
This is the body of our loop, which uses the ternary conditional operator ?:. There's nothing wrong with the ternary operator, but if the code gives you trouble we can just change it to an if(). Let's just go ahead and rewrite it to use if():
if ($f2[$g] =~ m/RESULT:\s+PASS/) {
    print $f2[$g];
} else {
    print $f1[$g];
}
Pretty simple - we do a regex check on $f2[$g] (the entry in our second file corresponding to the current entry in our first file) that basically checks whether or not that test passed. If it did, we print $f2[$g] (which will tell us that test passed), otherwise we print $f1[$g] (which will tell us the test that failed).
print STDERR "$#f1 serials found\n";
This just prints an ending diagnostic message telling us how many serials were found (minus one, again).
I personally would rewrite that whole hairy bit where he hacks with $/ and then does two reads from <> to be a loop, because I think that would be more readable, but this code should work fine, so if you don't have to change it too much you should be in good shape.
The undef $/ line deactivates the input record separator. Instead of reading records line by line, the interpreter will read whole files at once after that.
The <>, or 'diamond operator', reads from the files on the command line or from standard input, whichever makes sense. In your case, the command line is explicitly checked, so it will be files. Input record separation has been deactivated, so each time you see a <>, you can think of it as a function call returning a whole file as a string.
The split operators take this string and cut it into chunks each time the regular expression given as an argument matches. The (?= ... ) construct means "the delimiter is this, but please keep it in the chunked result anyway."
That's all there is to it. There would always be a few optimizations, simplifications, or "other ways to do it," but this should get you running.
You can get a quick glimpse of how the script works by translating it into Java or Scala. The inccode.com translator delivers the following Java code:
public class script extends CRoutineProcess implements IInProcess
{
    VarArray arrF1 = new VarArray();
    VarArray arrF2 = new VarArray();

    VarBox call()
    {
        // !/usr/bin/perl -w
        if (!(BoxSystem.ProgramArguments.scalar().isGT(1)))
        {
            BoxSystem.die(BoxString.is(VarString.is("Usage: ")
                .join(BoxSystem.foundArgument.get(0).toString())
                .join(" <file1> <file2>\n")));
        }
        BoxSystem.InputRecordSeparator.empty();
        arrF1.setValue(BoxConsole.readLine().split(BoxPattern.is("(?=(?:SERIAL NUMBER:\\s+\\d+))")));
        arrF2.setValue(BoxConsole.readLine().split(BoxPattern.is("(?=(?:SERIAL NUMBER:\\s+\\d+))")));
        if (arrF1.length().isNE(arrF2.length()))
        {
            BoxSystem.die("Error: file1 has $#f1 serials, file2 has $#f2\n");
        }
        for (VarBox varG : VarRange.is(0, arrF1.length()))
        {
            BoxSystem.print((arrF2.get(varG).like(BoxPattern.is("RESULT:\\s+PASS"))) ? arrF2.get(varG) : arrF1.get(varG));
        }
        return STDERR.print("$#f1 serials found\n");
    }
}

How can I merge lines in a large, unsorted file without running out of memory in Perl?

I have a very large comma-delimited file coming out of a database report, in a format something like this:
field1,field2,field3,metricA,value1
field1,field2,field3,metricB,value2
I want the new file to combine the two lines into one, so it would look something like this:
field1,field2,field3,value1,value2
I'm able to do this using a hash. In this example, the first three fields are the key, and I combine value1 and value2 in a certain order to be the value. After I've read in the file, I just print out the hash table's keys and values into another file. It works fine.
However, I have some concerns since my file is going to be very large. About 8 GB per file.
Would there be a more efficient way of doing this? I'm not thinking in terms of speed, but in terms of memory footprint. I'm concerned that this process could die due to memory issues. I'm just drawing a blank in terms of a solution that would work but wouldn't shove everything into, ultimately, a very large hash.
For full-disclosure, I'm using ActiveState Perl on Windows.
If your rows are sorted on the key, or if for some other reason equal values of field1,field2,field3 are adjacent, then a state machine will be much faster. Just read over the lines and, if the fields are the same as on the previous line, emit both values.
Otherwise, at least, you can take advantage of the fact that you have exactly two values and delete the key from your hash when you find the second value -- this should substantially limit your memory usage.
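A sketch of that second idea, with the column layout assumed from your sample (note this emits the two values in the order they are encountered; keying on the metric column as well would let you fix the order):
use strict;
use warnings;

my %pending;
while (my $line = <>) {
    chomp $line;
    my @f   = split /,/, $line;
    my $key = join ',', @f[0..2];
    if (exists $pending{$key}) {
        # second value for this key: emit and free the memory immediately
        print "$key,$pending{$key},$f[4]\n";
        delete $pending{$key};
    }
    else {
        $pending{$key} = $f[4];
    }
}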
If you have other Unix-like tools available (for example via Cygwin), then you could sort the file beforehand using the sort command (which can cope with huge files). Or possibly you could get the database to output the sorted format.
Once the file is sorted, doing this sort of merge is easy: iterate down a line at a time, keeping the last line and the next line in memory, and output whenever the keys change.
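A minimal sketch of that merge pass over the sorted file (it assumes exactly two rows per key, as in your sample):
use strict;
use warnings;

my ($prev_key, $prev_value);
while (my $line = <>) {
    chomp $line;
    my @f   = split /,/, $line;
    my $key = join ',', @f[0..2];
    if (defined $prev_key && $prev_key eq $key) {
        print "$key,$prev_value,$f[4]\n";   # the pair is complete
        undef $prev_key;
    }
    else {
        ($prev_key, $prev_value) = ($key, $f[4]);
    }
}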
If you don't think the data will fit in memory, you can always tie your hash to an on-disk database:
use BerkeleyDB;
tie my %data, 'BerkeleyDB::Hash', -Filename => 'data';

while (my $line = <>) {
    chomp $line;
    my @columns = split /,/, $line; # or use Text::CSV_XS to parse this correctly
    my $key   = join ',', @columns[0..2];
    my $a_key = "$key:metric_a";
    my $b_key = "$key:metric_b";
    if ($columns[3] eq 'metricA') {      # match the metric names in the sample data
        $data{$a_key} = $columns[4];
    }
    elsif ($columns[3] eq 'metricB') {
        $data{$b_key} = $columns[4];
    }
    if (exists $data{$a_key} && exists $data{$b_key}) {
        my ($a, $b) = map { $data{$_} } ($a_key, $b_key);
        print "$key,$a,$b\n";
        # optionally delete the data here, if you don't plan to reuse the database
    }
}
Would it not be better to make another export directly from the database into your new file, instead of reworking the file you have already output? If this is an option then I would go that route.
You could try something with Sort::External. It reminds me of a mainframe sort that you can use right in the program logic. It's worked pretty well for what I've used it for.