Reading .fasta sequences to extract nucleotide data, and then writing to a TabDelimited file - perl

Before I continue, I thought I'd refer readers to my previous problems with Perl, being a beginner to all of this.
These were my posts over the past few days, in chronological order:
How do I average column values from a tab-separated data... (Solved)
Why do I see no computed results in my output file? (Solved)
Using a .fasta file to compute relative content of sequences
Now as I've stated above, thanks to help from a few of you, I've managed to figure out the first two queries and I've really learnt from it. I'm truly grateful. For a person who knows nothing about this, and still feels like he doesn't, the help was practically a Godsend.
The last query remains unsolved and this is a continuation. I did have a look at some of the recommended texts, but as I'm trying to get this finished before Monday, I'm unsure if I've overlooked anything completely. Either way, I have had a go at attempting the task.
Just so you know, the task is to open and read a .fasta file (I think I've finally nailed something pretty well, hallelujah!), read each sequence, compute the relative G+C nucleotide content, and then write the names of the genes and their respective G+C content to a tab-delimited file.
Even though I've had a go at attempting this, I know that I am no where near ready to execute the program to provide the results that I'm after, which is why I'm reaching out to you guys again for some guidance, or examples of how to go about this. As with my previous, solved queries, I'd like it to be in a similar style to what I've already done them in - even though it might not be the most convenient/efficient way. It just allows me to know what I'm doing each step of the way, even though it seems like I'm spamming it up!
Anyway, the .fasta file reads something like:
>label
sequence
>label
sequence
>label
sequence
I'm unsure how to open the .fasta file, so I'm not sure what labels apply to which, but I know that the genes should be labelled either gag, pol, or env. Do I need to open the .fasta file to know what I'm doing, or can I do it 'blindly' by going with the above format?
It may be perfectly obvious, but I'm still struggling with all of this. I'm feeling like I should have caught on by now!
Anyway, the current code I have is as follows:
#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict;
my $infile = "Lab1_seq.fasta"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!"; # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt"; # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open
my $sequence = (); # This sequence variable stores the sequences from the .fasta file
my $GC = 0; # This variable checks for G + C content
my $line; # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
chomp $line; # This removes "\n" at the end of each line (this is invisible)
foreach my $line ($infile) {
if($line = ~/^\s*$/) { # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
next;
} elsif($line = ~/^\s*#/) { # This finds lines with spaces before the hash character. Removes .fasta comment
next;
} elsif($line = ~/^>/) { # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
next;
} else {
$sequence = $line;
}
}
{
$sequence =~ s/\s//g; # Whitespace characters are removed
return $sequence;
}
I'm not sure if anything's correct here, but executing it left me with a syntax error at line 35 (beyond the last line, and hence there isn't anything there!). It said at 'EOF'. That's about all I can point out. Otherwise I'm trying to figure out how to compute the quantities of the nucleotides G + C in each of the sequences, and then tabulating this properly in an output .txt file. I believe that's what is meant by a TABDelimited file?
In any case, I apologise if this query seems to be too lengthy, 'dumb' or a repeat, but in saying that, I couldn't find any information directly pertaining to this, so your help would be much appreciated, and the explanations for each step too if possible!!
Kindest.

You have an extra brace right near the end. This should work:
#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict;
my $infile = "Lab1_seq.fasta"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!"; # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt"; # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open
my $sequence = (); # This sequence variable stores the sequences from the .fasta file
my $GC = 0; # This variable checks for G + C content
my $line; # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
chomp $line; # This removes "\n" at the end of each line (this is invisible)
if($line =~ /^\s*$/) { # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
next;
} elsif($line =~ /^\s*#/) { # This finds lines with spaces before the hash character. Removes .fasta comment
next;
} elsif($line =~ /^>/) { # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
next;
} else {
$sequence = $line;
}
$sequence =~ s/\s//g; # Whitespace characters are removed
print OUTFILE $sequence;
}
Also I edited your return line. Return will exit your loop. I suspect what you want is to print it to a file, so I have done that. You may need to do some further transformation first to get it into a tab separated format.
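To get from there to the tab-delimited G+C table the question asks for, something along these lines may help. It is only a sketch: the helper names gc_fraction and read_fasta are made up for illustration, the demo reads from an in-memory string rather than Lab1_seq.fasta, and it assumes a sequence may span several lines under one '>' label.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compute the G+C fraction of a sequence string (0 for an empty sequence).
sub gc_fraction {
    my ($seq) = @_;
    $seq =~ s/\s//g;                  # drop any stray whitespace
    return 0 unless length $seq;
    my $gc = ($seq =~ tr/GCgc//);     # tr/// with an empty replacement only counts
    return $gc / length $seq;
}

# Read FASTA records from a filehandle; return a list of [label, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;             # remove LF or CRLF line endings
        next if $line =~ /^\s*$/;         # skip blank lines
        if ($line =~ /^>\s*(.*)/) {       # a '>' line starts a new record
            push @records, [$1, ''];
        } elsif (@records) {
            $records[-1][1] .= $line;     # append sequence text to the current record
        }
    }
    return @records;
}

# Demo on in-memory data (an assumption for illustration).
# To use a real file, open it with: open my $in, '<', 'Lab1_seq.fasta' or die ...
my $demo = ">gag\nATGC\nGGCC\n>pol\nATAT\n";
open my $in, '<', \$demo or die "Cannot open demo data: $!";
for my $record (read_fasta($in)) {
    my ($label, $seq) = @$record;
    printf "%s\t%.3f\n", $label, gc_fraction($seq);   # label, TAB, G+C fraction
}
close $in;
# prints:
# gag	0.750
# pol	0.000
```

To write to Lab1_SeqOutput.txt instead of the screen, open an output handle and pass it to printf as the first argument (printf $out ...).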

Related

Perl File Read: multiple lines combined while reading in list context

I am reading a file using the following code:
open ($myfile, "<file.txt") or die "Could not open the file";
@lines = <$myfile>;
foreach $line (@lines){
print $line;
}
close myfile;
The contents of the file are:
Crossroads Blues
Terraplane Blues
Come on in My Kitchen
Walking Blues
Mister Jelly Roll Maker
Last Fair Deal Gone Down
32-20 Blues
Kindhearted Woman Blues
If I Had Possession Over Judgement Day
Preaching Blues
Blind Willie's Blues
When You Got a Good Friend
Rambling on My Mind
Stones in My Passway
Wild Jelly Roll Blues
Traveling Riverside Blues
Roll My Jellyroll
Milkcow's Calf Blues
Me and the Devil Blues
Hellhound on My Trail
But the output of the program is:
Hellhound on My Trailsuesdudgement Day
It looks like the code reads only one line, and replaces the first characters with the new line that is read. I have tried different files. Only one line is printed, which is basically aggregated over all the lines.
Your original file has just a carriage-return (CR) at the end of each line when it should have a linefeed (LF) or possibly both CR and LF if it originated from a Windows system and you are reading it on Linux
Without any newlines to split up the data, @lines has only a single element which contains the entire file contents
Printing that text to the terminal results in all of the lines being displayed on top of one another as you have seen
You need to fix the creation of your file, but in the mean time you can read it correctly by changing Perl's record separator $/ like this
use strict;
use warnings 'all';
open my $fh, '<', 'file.txt' or die "Could not open the file: $!";
my @lines = do {
local $/ = "\r";
<$fh>;
};
chomp @lines;
print "$_\n" for @lines;
Please check that your original script and the posted script are the same.
You said that your example program prints only the last line. It won't; it will print every line.
Always put use warnings; and use strict; at the top of the program.
Also, storing the whole file in an array and then reading from the array is a poor method. Use a while loop instead.
open (my $myfile, "<", "file.txt") or die "Could not open the file: $!";
while (<$myfile>)
{
print; # Each line is read into the default variable $_, which print uses when given no argument.
}
The script below would produce the output you mentioned.
foreach (@lines)
{
$line = $_; # this or
@new = $_; # this
}
print $line; #last line
print @new; #last line
If you want to store particular data in another variable, look at concatenation for a string ($) and push or unshift for an array (@).

Change line in textfile using perl

I read other places on how to do this but they were confusing for me.
I want to read lines from a text file and when I come across a certain line I want to append something to it.
My code is:
open my $p, "$username_filename" or die "can not open $username_filename: $!";
foreach $line (<$p>){
if ($line =~ /^listen/){
`echo "whatever" >> $username_file`;
}
}
However when I run this I get this error
sh: -c: line 0: syntax error near unexpected token `newline' sh: -c: line 0: `echo "current_user" >> '
Is this way correct to edit the file and why am I getting this error?
Working with files is not like editing in a word processor. Lines are an illusion, a file is just a big string of characters. You can't change a line in the middle of a file for the same reason you can't change a line in the middle of a book, the words can't be moved around to make room.
Instead, like a book, if you want to change something you need to rewrite the whole thing.
The basic algorithm is to...
1. Open the file for reading.
2. Open a temporary file for writing.
3. Read a line, alter the line, write the line.
4. Repeat 3 until done reading.
5. Overwrite the file with the temp file.
Some other notes...
print writes to STDOUT by default, but you can give it a filehandle to write to instead.
foreach my $line (<$fh>) is unfortunately not optimized to read files. It will read the possibly enormous file into memory. while(my $line = <$fh>) reads one line at a time.
I've turned on strict. This forces you to declare your variables. It protects you from typos like the one you made of $username_file vs $username_filename.
You could use something like "$filename.tmp" but File::Temp provides temp files that are guaranteed to be temporary, unique and cleaned up when the program exits.
use strict;
use warnings;
use autodie; # because writing 'or die' gets old fast
use File::Temp; # provides safe temp files
my $filename = ...; # set it somehow
open my $read, "<", $filename;
my $temp = File::Temp->new;
while(my $line = <$read>) {
if( $line =~ /^listen/ ) {
chomp $line; # remove the newline
$line .= " whatever\n"; # add our content and put a newline back
}
# Write the line to the temp file
print $temp $line;
}
# Overwrite our file with the rewritten temp file
rename $temp->filename, $filename;
That's inside a program. If you just want to do it quickly, you can do it on the command line with -i and -p.
perl -i.bak -pe 'if( /^listen/ ) { chomp; $_ .= " whatever\n" }' filename
-p says to run the code on each line of the file. The line will be put into $_ and whatever is in $_ will be printed. -i says to edit the file in place. -i.bak makes a backup of the original file just in case you make a mistake.
There are a few problems with your attempt. The big one is that using echo >> file will append to the file, not insert at some arbitrary place inside the file.
Another problem is that you're trying to append to a file called $username_file, and you haven't declared or defined that variable.
I don't think perl lets you insert into the middle of a file. I think your best bet would be to read the file a line at a time, and on the correct line(s), append the text you want. Write each line to a new file, then swap the files around at the end.
For example:
#!/usr/bin/perl
my $in_filename = "in.txt";
my $out_filename = "out.txt";
open (my $in, "<", $in_filename) or die;
open (my $out, ">", $out_filename) or die;
while (my $lline = <$in>)
{
chomp $lline;
if ( $lline =~ /listen/ )
{
print $out "$lline whatever\n"; # write to the output file, not STDOUT
}
else
{
print $out "$lline\n";
}
}
close $in;
close $out;
rename $in_filename, "$in_filename.original";
rename $out_filename, $in_filename;
I use chomp to remove line endings, because <$in> gives us a line including its line ending, which otherwise messes up the append.
As always there are many ways to achieve this. I think using sed is probably a better option for this, but you specifically asked how to do it in perl, so perl it is.

Finding an amino acid sequence in a file

I have a FASTA file of a protein sequence. I want to find if the
sequence hxxhcxc is present in the file or not, if yes, then print the
stretch. Here, h=hydrophobic, c=charged, x=any (including remaining) residue/s.
How to do this in Perl?
What I could think of is make 3 arrays—of hydrophobic, charged and all residues.
Compare each array with the file having the FASTA sequence. I can't think of anything beyond this, especially how to maintain the order—that's the main thing. I am a beginner in Perl, so please make the explanation as simple as possible.
PS: Since this is just one sequence, I can simply copy the content to a .txt file, there is no compulsion to use a fasta file (in this case). Hydrophobic and charged are residues(amino acids)- there are 9 hydrophobic and 5 charged residues. It is the name of the amino acid that is in upper case single letter as you mentioned. So what I want to do is find a sequence: hydrophobic, any, any, hydrophobic, charged, any, charged (hxxhcxc) in that order in the protein sequence (.txt file/fasta file). I struggled to re frame my question-hope I'm a little better now.
I'm not familiar with Fasta files, but regular expressions certainly seem like the way to go here.
In words
If you open the file for reading, you can process the file line by line, print-ing only those lines if they match the regular expression you specified.
In code
use strict;
use warnings;
use autodie;
open my $fh, '<', 'file.fasta'; # Open filehandle in read mode
while ( my $line = <$fh> ) { # Loop over line by line
print $line # Print line if it matches pattern
if $line =~ /h..hc.c/; # '.' in a regular expression matches
# (almost) anything
}
close $fh; # Close filehandle
So, you'll have to decide which are the "hydrophobic" amino acids, but let's just start with either V(aline), I(soleucine), L(eucine), F, W, or C.
And the charged amino acids are E,D,R or K. Using this you can define
a regex (you'll see it below)
If you just have the whole sequence in a text file parse it like this:
#!/usr/bin/perl
open(IN, "yourfile.txt") || die("couldn't open the file: $!");
$sequence = "";
while(<IN>) {
chomp();
$sequence .= $_;
}
if($sequence =~ /[VILFWC]..[VILFWC][EDRK].[EDRK]/) {
print "Found it!\n";
} else {
print "Not there\n";
}
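Since the question also asks to print the stretch, here is a hedged sketch that reports every occurrence and its position. The residue classes are the ones assumed in the answer above (not a definitive list of the 9 hydrophobic and 5 charged residues), the sub name find_motifs is made up for illustration, and the demo sequence is invented.

```perl
use strict;
use warnings;

# Assumed residue classes, taken from the answer above; adjust to your own definition.
my $hydrophobic = '[VILFWC]';
my $charged     = '[EDRK]';
my $motif       = qr/$hydrophobic..$hydrophobic$charged.$charged/;

# Return [position, stretch] for every (possibly overlapping) occurrence of the motif.
sub find_motifs {
    my ($sequence) = @_;
    my @hits;
    # The lookahead matches without consuming text, so overlapping hits are found too.
    while ($sequence =~ /(?=($motif))/g) {
        push @hits, [pos($sequence) + 1, $1];   # 1-based start position
    }
    return @hits;
}

# Demo on an invented in-memory FASTA record.
my $fasta = ">demo protein\nMAVKKL\nEAEGG\n";
open my $fh, '<', \$fasta or die "Cannot open demo data: $!";
my $sequence = '';
while (my $line = <$fh>) {
    next if $line =~ /^>/;      # skip the FASTA header line
    $line =~ s/\s+//g;          # strip the newline and any other whitespace
    $sequence .= uc $line;      # join the lines so a motif can span a line break
}
close $fh;

print "$_->[0]\t$_->[1]\n" for find_motifs($sequence);
# prints:
# 3	VKKLEAE
```

Joining the lines first matters: in the demo the stretch VKKLEAE straddles two lines of the file, so a line-by-line match would miss it.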

How do I efficiently parse a CSV file in Perl?

I'm working on a project that involves parsing a large csv formatted file in Perl and am looking to make things more efficient.
My approach has been to split() the file by lines first, and then split() each line again by commas to get the fields. But this is suboptimal since at least two passes on the data are required (once to split by lines, then once again for each line). This is a very large file, so cutting processing in half would be a significant improvement to the entire application.
My question is, what is the most time efficient means of parsing a large CSV file using only built in tools?
note: Each line has a varying number of tokens, so we can't just ignore lines and split by commas only. Also we can assume fields will contain only alphanumeric ascii data (no special characters or other tricks). Also, i don't want to get into parallel processing, although it might work effectively.
edit
It can only involve built-in tools that ship with Perl 5.8. For bureaucratic reasons, I cannot use any third party modules (even if hosted on cpan)
another edit
Let's assume that our solution is only allowed to deal with the file data once it is entirely loaded into memory.
yet another edit
I just grasped how stupid this question is. Sorry for wasting your time. Voting to close.
The right way to do it -- by an order of magnitude -- is to use Text::CSV_XS. It will be much faster and much more robust than anything you're likely to do on your own. If you're determined to use only core functionality, you have a couple of options depending on speed vs robustness.
About the fastest you'll get for pure-Perl is to read the file line by line and then naively split the data:
my $file = 'somefile.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
chomp $line;
my @fields = split(/,/, $line);
push @data, \@fields;
}
This will fail if any fields contain embedded commas. A more robust (but slower) approach would be to use Text::ParseWords. To do that, replace the split with this:
my @fields = Text::ParseWords::parse_line(',', 0, $line);
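As a small illustration of the difference, the following compares a naive split with parse_line from the core Text::ParseWords module on a line containing a quoted comma. The sample line is made up for the demo.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $line = 'foo,bar,"baz,quux",123';

my @naive  = split /,/, $line;             # the comma inside the quotes is split too
my @quoted = parse_line(',', 0, $line);    # 0 = do not keep the quote characters

print scalar(@naive), " vs ", scalar(@quoted), "\n";
print join('|', @quoted), "\n";
# prints:
# 5 vs 4
# foo|bar|baz,quux|123
```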
Here is a version that also respects quotes (e.g. foo,bar,"baz,quux",123 -> "foo", "bar", "baz,quux", "123").
sub csvsplit {
my $line = shift;
my $sep = (shift or ',');
return () unless $line;
my @cells;
$line =~ s/\r?\n$//;
my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;
while($line =~ /$re/g) {
my $value = defined $1 ? $1 : $2;
push @cells, (defined $value ? $value : '');
}
return @cells;
}
Use it like this:
while(my $line = <FILE>) {
my @cells = csvsplit($line); # or csvsplit($line, $my_custom_seperator)
}
As other people mentioned, the correct way to do this is with Text::CSV, and either the Text::CSV_XS back end (for FASTEST reading) or Text::CSV_PP back end (if you can't compile the XS module).
If you're allowed to get extra code locally (eg, your own personal modules) you could take Text::CSV_PP and put it somewhere locally, then access it via the use lib workaround:
use lib '/path/to/my/perllib';
use Text::CSV_PP;
Additionally, if there's no alternative to having the entire file read into memory and (I assume) stored in a scalar, you can still read it like a file handle, by opening a handle to the scalar:
my $data = stupid_required_interface_that_reads_the_entire_giant_file();
open my $text_handle, '<', \$data
or die "Failed to open the handle: $!";
And then read via the Text::CSV interface:
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
while (my $row = $csv->getline($text_handle)) {
...
}
or the sub-optimal split on commas:
while (my $line = <$text_handle>) {
my @csv = split /,/, $line;
... # regular work as before.
}
With this method, the data is only copied a bit at a time out of the scalar.
You can do it in one pass if you read the file line by line. There is no need to read the whole thing into memory at once.
#(no error handling here!)
open FILE, $filename;
while (<FILE>) {
chomp;
@csv = split /,/;
# now parse the csv however you want.
}
Not really sure if this is significantly more efficient though, Perl is pretty fast at string processing.
YOU NEED TO BENCHMARK YOUR IMPORT to see what is causing the slowdown. If for example, you are doing a db insertion that takes 85% of the time, this optimization won't work.
Edit
Although this feels like code golf, the general algorithm is to read the whole file or part of the file into a buffer.
Iterate byte by byte through the buffer until you find a csv delimiter, or a new line.
When you find a delimiter, increment your column count.
When you find a newline increment your row count.
If you hit the end of your buffer, read more data from the file and repeat.
That's it. But reading a large file into memory is really not the best way, see my original answer for the normal way this is done.
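For what it's worth, the buffer-scanning steps above could be sketched roughly as follows. This is an illustration only: the sub name count_csv and the chunk size are assumptions, it just counts rows and fields rather than storing them, and it relies on the question's promise that fields contain no quotes or embedded commas.

```perl
use strict;
use warnings;

# Count rows and fields by scanning a filehandle in fixed-size chunks.
# Assumes plain fields with no quoting, as the question states.
sub count_csv {
    my ($fh, $chunk_size) = @_;
    $chunk_size ||= 64 * 1024;
    my ($rows, $fields, $in_row) = (0, 0, 0);
    # read() refills the buffer each time the previous chunk is used up
    while (read($fh, my $buffer, $chunk_size)) {
        for my $char (split //, $buffer) {
            if ($char eq "\n") {
                # a newline ends the current row and its last field
                if ($in_row) { $fields++; $rows++ }
                $in_row = 0;
            } else {
                $in_row = 1;
                $fields++ if $char eq ',';   # a delimiter ends one field
            }
        }
    }
    # account for a final row that lacks a trailing newline
    if ($in_row) { $fields++; $rows++ }
    return ($rows, $fields);
}

# Demo on in-memory data with a deliberately tiny chunk size,
# so that rows are split across buffer refills.
my $csv = "a,b,c\nd,e\n";
open my $fh, '<', \$csv or die "Cannot open demo data: $!";
my ($rows, $fields) = count_csv($fh, 4);
close $fh;
print "$rows rows, $fields fields\n";
# prints: 2 rows, 5 fields
```

Walking the buffer one character at a time in pure Perl is likely to be slower than split, so this is mainly useful for seeing how the algorithm fits together.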
Assuming that you have your CSV file loaded into $csv variable and that you do not need text in this variable after you successfully parsed it:
my $result=[[]];
while($csv=~s/(.*?)([,\n]|$)//s) {
push @{$result->[-1]}, $1;
push @$result, [] if $2 eq "\n";
last unless $2;
}
If you need to have $csv untouched:
local $_;
my $result=[[]];
foreach($csv=~/(?:(?<=[,\n])|^)(.*?)(?:,|(\n)|$)/gs) {
next unless defined $_;
if($_ eq "\n") {
push @$result, []; }
else {
push @{$result->[-1]}, $_; }
}
Answering within the constraints imposed by the question, you can still cut out the first split by slurping your input file into an array rather than a scalar:
open(my $fh, '<', $input_file_path) or die;
my @all_lines = <$fh>;
for my $line (@all_lines) {
chomp $line;
my @fields = split ',', $line;
process_fields(@fields);
}
And even if you can't install (the pure-Perl version of) Text::CSV, you may be able to get away with pulling up its source code on CPAN and copy/pasting the code into your project...

How can I print a matching line, one line immediately above it and one line immediately below?

From a related question asked by Bi, I've learnt how to print a matching line together with the line immediately below it. The code looks really simple:
#!perl
open(FH,'FILE');
while ($line = <FH>) {
if ($line =~ /Pattern/) {
print "$line";
print scalar <FH>;
}
}
I then searched Google for a different code that can print matching lines with the lines immediately above them. The code that would partially suit my purpose is something like this:
#!perl
@array;
open(FH, "FILE");
while ( <FH> ) {
chomp;
$my_line = "$_";
if ("$my_line" =~ /Pattern/) {
foreach( @array ){
print "$_\n";
}
print "$my_line\n"
}
push(@array,$my_line);
if ( "$#array" > "0" ) {
shift(@array);
}
};
Problem is I still can't figure out how to do them together. Seems my brain is shutting down. Does anyone have any ideas?
Thanks for any help.
UPDATE:
I think I'm sort of touched. You guys are so helpful! Perhaps a little Off-topic, but I really feel the impulse to say more.
I needed a Windows program capable of searching the contents of multiple files and of displaying the related information without having to separately open each file. I tried googling and two apps, Agent Ransack and Devas, have proved to be useful, but they display only the lines containing the matched query and I also want to peek at the adjacent lines. Then the idea of improvising a program popped into my head. Years ago I was impressed by a Perl script that could generate a Tomeraider format of Wikipedia so that I could handily search Wiki on my Lifedrive, and I've also read somewhere on the net that Perl is easy to learn, especially for some guy like me who has no experience in any programming language. Then I sort of started teaching myself Perl a couple of days ago. My first step was to learn how to do the same job as "Agent Ransack" does and it proved to be not so difficult using Perl. I first learnt how to search the contents of a single file and display the matching lines through the modification of an example used in the book titled "Perl by Example", but I was stuck there. I became totally clueless as to how to deal with multiple files. No similar examples were found in the book, or probably I was too impatient. And then I tried googling again and was led here and I asked my first question "How can I search multiple files for a string pattern in Perl?" here and I must say this forum is bloody AWESOME ;). Then I looked at more example scripts and came up with the following code yesterday, and it serves my original purpose quite well:
The code goes like this:
#!perl
$hits=0;
print "INPUT YOUR QUERY:";
chop ($query = <STDIN>);
$dir = 'f:/corpus/';
@files = <$dir/*>;
foreach $file (@files) {
open (txt, "$file");
while($line = <txt>) {
if ($line =~ /$query/i) {
$hits++;
print "$file \n $line";
print scalar <txt>;
}
}
}
close(txt);
print "$hits RESULTS FOUND FOR THIS SEARCH\n";
In the folder "corpus", I have a lot of text files including srt pdf doc files that contain such contents as follows:
Then I dumped the body.
J'ai mis le corps dans une décharge.
I know you have a wire.
Je sais que tu as un micro.
Now I'll tell you the truth.
Alors je vais te dire la vérité.
Basically I just need to search an English phrase and look at the French equivalent, so the script I finished yesterday is quite satisfying, except that it would be better if my script could display the line above in case I want to search a French phrase and check the English. So I'm trying to improve the code. Actually I knew the "print scalar <FH>" line is buggy, but it is neat and does the job of printing the subsequent line (at least most of the time). I was even expecting ANOTHER SINGLE magic line that prints the previous line instead of the subsequent :) Perl seems to be fun. I think I will spend more time trying to get a better understanding of it. And as suggested by daotoad, I'll study the code generously offered by you guys. Again, thank you guys!
It will probably be easier just to use grep for this as it allows printing of lines before and after a match. Use -B and -A to print context before and after the match respectively. See http://ss64.com/bash/grep.html
Here's a modernized version of Pax's excellent answer:
use strict;
use warnings;
open( my $fh, '<', 'qq.in')
or die "Error opening file - $!\n";
my $this_line = "";
my $do_next = 0;
while(<$fh>) {
my $last_line = $this_line;
$this_line = $_;
if ($this_line =~ /XXX/) {
print $last_line unless $do_next;
print $this_line;
$do_next = 1;
} else {
print $this_line if $do_next;
$last_line = "";
$do_next = 0;
}
}
close ($fh);
See Why is three-argument open calls with lexical filehandles a Perl best practice? for a discussion of the reasons for the most important changes.
Important changes:
3 argument open.
lexical filehandle
added strict and warnings pragmas.
variables declared with lexical scope.
Minor changes (issues of style and personal taste):
removed unneeded parens from post-fix if
converted an if-not construct into unless.
If you find this answer useful, be sure to up-vote Pax's original.
Given the following input file:
(1:first) Yes, this one.
(2) This one as well (XXX).
(3) And this one.
Not this one.
Not this one.
Not this one.
(4) Yes, this one.
(5) This one as well (XXX).
(6) AND this one as well (XXX).
(7:last) And this one.
Not this one.
this little snippet:
open(FH, "<qq.in");
$this_line = "";
$do_next = 0;
while(<FH>) {
$last_line = $this_line;
$this_line = $_;
if ($this_line =~ /XXX/) {
print $last_line if (!$do_next);
print $this_line;
$do_next = 1;
} else {
print $this_line if ($do_next);
$last_line = "";
$do_next = 0;
}
}
close (FH);
produces the following, which is what I think you were after:
(1:first) Yes, this one.
(2) This one as well (XXX).
(3) And this one.
(4) Yes, this one.
(5) This one as well (XXX).
(6) AND this one as well (XXX).
(7:last) And this one.
It basically works by remembering the last line read and, when it finds the pattern, it outputs it and the pattern line. Then it continues to output pattern lines plus one more (with the $do_next variable).
There's also a little bit of trickery in there to ensure no line is printed twice.
You always want to store the last line that you saw in case the next line has your pattern and you need to print it. Using an array like you did in the second code snippet is probably overkill.
my $last = "";
while (my $line = <FH>) {
if ($line =~ /Pattern/) {
print $last;
print $line;
print scalar <FH>; # next line
}
$last = $line;
}
grep -A 1 -B 1 "search line"
I am going to ignore the title of your question and focus on some of the code you posted because it is positively harmful to let this code stand without explaining what is wrong with it. You say:
code that can print matching lines with the lines immediately above them. The code that would partially suit my purpose is something like this
I am going to go through that code. First, you should always include
use strict;
use warnings;
in your scripts, especially since you are just learning Perl.
@array;
This is a pointless statement. With strict, you can declare @array using:
my @array;
Prefer the three-argument form of open unless there is a specific benefit in a particular situation to not using it. Use lexical filehandles because bareword filehandles are package global and can be the source of mysterious bugs. Finally, always check if open succeeded before proceeding. So, instead of:
open(FH, "FILE");
write:
my $filename = 'something';
open my $fh, '<', $filename
or die "Cannot open '$filename': $!";
If you use autodie, you can get away with:
open my $fh, '<', 'something';
Moving on:
while ( <FH> ) {
chomp;
$my_line = "$_";
First, read the FAQ (you should have done so before starting to write programs). See What's wrong with always quoting "$vars"?. Second, if you are going to assign the line that you just read to $my_line, you should do it in the while statement so you do not needlessly touch $_. Finally, you can be strict compliant without typing any more characters:
while ( my $line = <$fh> ) {
chomp $line;
Refer to the previous FAQ again.
if ("$my_line" =~ /Pattern/) {
Why interpolate $my_line once more?
foreach( @array ){
print "$_\n";
}
Either use an explicit loop variable or turn this into:
print "$_\n" for @array;
So, you interpolate $my_line again and add the newline that was removed by chomp earlier. There is no reason to do so:
print "$my_line\n"
And now we come to the line that motivated me to dissect the code you posted in the first place:
if ( "$#array" > "0" ) {
$#array is a number. 0 is a number. > is used to check if the number on the LHS is greater than the number on the RHS. Therefore, there is no need to convert both operands to strings.
Further, $#array is the last index of @array and its meaning depends on the value of $[. I cannot figure out what this statement is supposed to be checking.
Now, your original problem statement was
print matching lines with the lines immediately above them
The natural question, of course, is how many lines "immediately above" the match you want to print.
#!/usr/bin/perl
use strict;
use warnings;
use Readonly;
Readonly::Scalar my $KEEP_BEFORE => 4;
my $filename = $ARGV[0];
my $pattern = qr/$ARGV[1]/;
open my $input_fh, '<', $filename
or die "Cannot open '$filename': $!";
my @before;
while ( my $line = <$input_fh> ) {
$line = sprintf '%6d: %s', $., $line;
print @before, $line, "\n" if $line =~ $pattern;
push @before, $line;
shift @before if @before > $KEEP_BEFORE;
}
close $input_fh;
Command line grep is the quickest way to accomplish this, but if your goal is to learn some Perl then you'll need to produce some code.
Rather than providing code, as others have already done, I'll talk a bit about how to write your own. I hope this helps with the brain-lock.
Read my previous answer on how to write a program, it gives some tips about how to start working on your problem.
Go through each of the sample programs you have, as well as those offered here and comment out exactly what they do. Refer to the perldoc for each function and operator you don't understand. Your first example code has an error, if 2 lines in a row match, the line after the second match won't print. By error, I mean that either the code or the spec is wrong, the desired behavior in this case needs to be determined.
Write out what you want your program to do.
Start filling in the blanks with code.
Here's a sketch of a phase one write-up:
# This program reads a file and looks for lines that match a pattern.
# Open the file
# Iterate over the file
# For each line
# Check for a match
# If match print line before, line and next line.
But how do you get the next line and the previous line?
Here's where creative thinking comes in, there are many ways, all you need is one that works.
You could read in lines one at a time, but read ahead by one line.
You could read the whole file into memory and select previous and follow-on lines by indexing an array.
You could read the file and store the offset and length each line--keeping track of which ones match as you go. Then use your offset data to extract the required lines.
You could read in lines one at a time. Cache your previous line as you go. Use readline to read the next line for printing, but use seek and tell to rewind the handle so that the 'next' line can be checked for a match.
Any of these methods, and many more could be fleshed out into a functioning program. Depending on your goals, and constraints any one may be the best choice for that problem domain. Knowing how to select which one to use will come with experience. If you have time, try two or three different ways and see how they work out.
Good luck.
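As one possible way to flesh out the "cache your previous line" idea from the list above, here is a sketch that uses two flags instead of seek and tell. The sub name lines_with_context and the sample data are made up for illustration, and it assumes you want exactly one line of context on each side with no line printed twice.

```perl
use strict;
use warnings;

# Return every line that matches $pattern, plus one line of context on each side.
# Lines are never repeated, even when matches are adjacent.
sub lines_with_context {
    my ($fh, $pattern) = @_;
    my @out;
    my ($prev, $prev_printed) = (undef, 0);   # last line seen, and whether it was output
    my $print_next = 0;                       # set after a match so the following line is output
    while (my $line = <$fh>) {
        if ($line =~ $pattern) {
            push @out, $prev if defined $prev && !$prev_printed;
            push @out, $line;
            $prev_printed = 1;
            $print_next   = 1;
        } elsif ($print_next) {
            push @out, $line;                 # the line right after a match
            $prev_printed = 1;
            $print_next   = 0;
        } else {
            $prev_printed = 0;                # an ordinary line, not output (yet)
        }
        $prev = $line;
    }
    return @out;
}

# Demo on invented in-memory data with two adjacent matches.
my $text = "one\ntwo XXX\nthree\nfour\nfive\nsix XXX\nseven XXX\neight\nnine\n";
open my $fh, '<', \$text or die "Cannot open demo data: $!";
print lines_with_context($fh, qr/XXX/);
close $fh;
```

With the demo data this prints every line except "four" and "nine"; "five" appears once even though it is the line before a pair of adjacent matches.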
If you don't mind losing the ability to iterate over a filehandle, you could just slurp the file and iterate over the array:
#!/usr/bin/perl
use strict; # always do these
use warnings;
my $range = 1; # change this to print X lines before and after each match
open my $fh, '<', 'FILE' or die "Error: $!";
my @file = <$fh>;
close $fh;
for (0 .. $#file) {
if($file[$_] =~ /Pattern/) {
my @lines = grep { $_ >= 0 && $_ <= $#file } $_ - $range .. $_ + $range; # keep indices inside the file, including the first and last line
print @file[@lines];
}
}
This might get horribly slow for large files, but is pretty easy to understand (in my opinion). Only when you know how it works can you set about trying to optimize it. If you have any questions about any of the functions or operations I used, just ask.