Finding an amino acid sequence in a file - perl

I have a FASTA file of a protein sequence. I want to find if the
sequence hxxhcxc is present in the file or not, if yes, then print the
stretch. Here, h=hydrophobic, c=charged, x=any (including remaining) residue/s.
How to do this in Perl?
What I could think of is make 3 arrays—of hydrophobic, charged and all residues.
Compare each array with the file having the FASTA sequence. I can't think of anything beyond this, especially how to maintain the order—that's the main thing. I am a beginner in Perl, so please make the explanation as simple as possible.
PS: Since this is just one sequence, I can simply copy the content to a .txt file; there is no compulsion to use a FASTA file (in this case). Hydrophobic and charged are residues (amino acids): there are 9 hydrophobic and 5 charged residues. Each amino acid is named by a single upper-case letter, as you mentioned. So what I want to do is find a sequence: hydrophobic, any, any, hydrophobic, charged, any, charged (hxxhcxc) in that order in the protein sequence (.txt file/FASTA file). I struggled to reframe my question; I hope it is a little better now.

I'm not familiar with Fasta files, but regular expressions certainly seem like the way to go here.
In words
If you open the file for reading, you can process it line by line, printing only those lines that match the regular expression you specified.
In code
use strict;
use warnings;
use autodie;
open my $fh, '<', 'file.fasta'; # Open filehandle in read mode
while ( my $line = <$fh> ) { # Loop over line by line
print $line # Print line if it matches pattern
if $line =~ /h..hc.c/; # '.' in a regular expression matches
# (almost) anything
}
close $fh; # Close filehandle

So, you'll have to decide which amino acids count as "hydrophobic", but let's just start with V (valine), I (isoleucine), L (leucine), F, W, or C.
And the charged amino acids are E, D, R, or K. Using these you can define a regex (you'll see it below).
If you just have the whole sequence in a text file parse it like this:
#!/usr/bin/perl
open(IN, "yourfile.txt") || die("couldn't open the file: $!");
$sequence = "";
while(<IN>) {
chomp();
$sequence .= $_;
}
if($sequence =~ /[VILFWC]..[VILFWC][EDRK].[EDRK]/) {
print "Found it!\n";
} else {
print "Not there\n";
}
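The question also asks to print the matching stretch, not just report that it exists. The following is a minimal sketch of one way to do that, not a definitive implementation. The residue sets are the same assumed ones used above (the question mentions 9 hydrophobic and 5 charged residues without listing them, so adjust the character classes), and the names `find_motifs` and `read_fasta_sequence` are hypothetical helpers introduced only for this example.

```perl
use strict;
use warnings;

# Assumed residue classes: h = [VILFWC], c = [EDRK], x = any residue.
my $MOTIF = qr/[VILFWC]..[VILFWC][EDRK].[EDRK]/;

# Returns every stretch matching h-x-x-h-c-x-c, including overlapping ones.
sub find_motifs {
    my ($sequence) = @_;
    my @hits;
    # The lookahead (?=...) consumes nothing, so overlapping matches are found;
    # the inner capture group holds the matched stretch.
    while ( $sequence =~ /(?=($MOTIF))/g ) {
        push @hits, $1;
    }
    return @hits;
}

# Reads a FASTA file into one string, skipping header lines that start with '>'.
sub read_fasta_sequence {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open '$path': $!";
    my $sequence = '';
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^>/;
        $sequence .= $line;
    }
    close $fh;
    return $sequence;
}

# Demonstration on a made-up sequence.
print "$_\n" for find_motifs('MKVAALEKDGG');
```

To run it on a file, pass the result of `read_fasta_sequence('yourfile.txt')` to `find_motifs` instead of the literal string.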

Related

Check whether a field from a line of text line matches a value

I have been using the following Perl code to extract text from multiple text files. It works fine.
Example of a couple of lines in one of the input files:
Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX
Fa0/22 xyz MLS notconnect 1293 half 10 10/100BaseTX
What I need is to match the numbers in each line exactly (i.e. 129 is not matched by 1293) and print the corresponding lines.
It would also be nice to match a range of numbers while leaving specific numbers out, i.e. match 2 through 10 but not 11, then 12 through 20.
#!/perl/bin/perl
use warnings;
my @files = <c:/perl64/files/*>;
foreach $file ( @files ) {
open( FILE, "$file" );
while ( $line = <FILE> ) {
print "$file $line" if $line =~ /123/n;
}
close FILE;
}
Thank you for the suggestions, but can it be done using the code structure above?
I suggest that you take a look at perldoc perlre.
You need to anchor your regex pattern. The easiest way is probably using \b which is a zero-width boundary between alphanumerics and non-alphanumerics.
#!/perl/bin/perl
use warnings;
use strict;
foreach my $file ( glob "c:/perl64/files/*" ) {
open( my $input, '<', $file ) or die $!;
while (<$input>) {
print "$file $_" if m/\b123\b/;
}
close $input;
}
Note - you should use three-argument open with lexical file handles as above, because it is better practice.
I've also removed the n pattern modifier, as it appears redundant.
Following your edit though, to give us some source data. I'd suggest the solution is not to use a regex - your source data looks space delimited. (Maybe those are tabs?).
So I'd suggest you're better off using split and selecting the field you want, and testing it numerically, because you mention matching ranges. This is not a good fit for regexes because they don't understand the numeric content.
Instead:
while ( <$input> ) {
print if (split)[-4] == 129;
}
Note - I use -4 in the split, which indexes from the end of the list.
This is because the name column can contain spaces (as in "xyz MLS"), so splitting on whitespace yields a varying number of fields before the number. Counting down from the end of the list with a negative index gives the right field each time.
If your data is tab separated then you could use chomp and split /\t/. Or potentially split on /\s{2,}/ to split on 2-or-more spaces
But by selecting the field, you can do numeric tests on it, like
if $fields[-4] > 100 and $fields[-4] < 200
etc.
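As a rough sketch of the range test described above (including the "2 through 20 but not 11" case from the question), the check can be wrapped in a small function. The name `field_in_range` and its parameters are hypothetical, and the sketch assumes the number is always the fourth field from the end, as in the sample lines.

```perl
use strict;
use warnings;

# Returns 1 when the fourth-from-last whitespace-separated field is a number
# between $low and $high (inclusive) and is not one of @excluded; 0 otherwise.
sub field_in_range {
    my ( $line, $low, $high, @excluded ) = @_;
    my @fields = split ' ', $line;
    return 0 if @fields < 4;
    my $value = $fields[-4];
    return 0 unless $value =~ /^\d+$/;            # skip non-numeric fields
    return 0 if grep { $_ == $value } @excluded;  # explicitly left-out numbers
    return ( $value >= $low && $value <= $high ) ? 1 : 0;
}

# Demonstration on the two sample lines from the question.
my @sample = (
    'Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX',
    'Fa0/22 xyz MLS notconnect 1293 half 10 10/100BaseTX',
);
for my $line (@sample) {
    print "$line\n" if field_in_range( $line, 100, 200 );
}
```

A call such as `field_in_range( $line, 2, 20, 11 )` would then cover 2 through 10 and 12 through 20 in one test.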
I hope you don't get the answers you're asking for, which discard best practice because of your unfamiliarity with Perl. It is inappropriate to ask how to write an ugly solution because proper Perl is beyond your reach
As has been said repeatedly on this site, if you don't know how to do a job then you should hire someone who does know and pay them for their work. No other profession that I know has the expectation of getting quality work done for free
Here's a few notes on your code. Wherever you have learned your techniques, you have been looking at a very outdated resource
Do you really have a root directory perl, so that your compiler is /perl/bin/perl? That's very unusual, and there is no need to use a shebang line in Windows
You must always add use strict and use warnings 'all' at the top of every Perl program you write, and declare all of your variables using my as close as possible to their first point of use. For some reason you do this with @files but not with $file
It is better to replace <c:/perl64/files/*> with glob 'C:/perl64/files/*'. Otherwise the code is less clear because Perl overloads the <> operator
Don't put variable names inside double quotes. It is unnecessary at best, and may cause bugs. So "$file" should be $file
Always use the three-parameter version of open, so that the second parameter is the open mode
Don't use global file handles. And always test whether the file has been opened correctly, dying with a message including $!—the reason for the failure—if the open fails
open( FILE, "$file" )
should be something like
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!}
Don't rely on regex patterns for everything. In this case it looks like split would be a better option, or perhaps unpack if your records have fixed-width fields. In my solution below I have used split on "more than one space", but if your real data is different from what you have shown (tab-delimited?) then this is not going to work
Note that Fa0/129 will also be matched by your current approach
This Perl program filters your data, printing lines where the fourth field $fields[3] (delineated by more than one whitespace character) is numerically equal to 129
The output shown is produced when the input is the single file splitn.txt, containing the data shown in your question
use strict;
use warnings 'all';
for my $file ( glob 'C:/perl64/files/*' ) {
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
while ( my $line = <$fh> ) {
chomp $line;
my @fields = split /\s\s+/, $line;
print "$file $line\n" if $fields[3] == 129;
}
}
output
splitn.txt Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX
Your question is unclear. When you say:
What I need is to match numbers in the on each line exactly
That could mean a couple of things. It could mean that each line contains nothing but a single number which you want to match. In that case, using == is probably better than using a regular expression. Or it could mean that you have lots of text on a line and you only want to match complete numbers. In that case you should use \b (the "word boundary" anchor) - /\b123\b/.
If you're clearer in your questions (perhaps by giving us sample input) then people won't have to guess at your meaning.
A few more points on your code:
Always include both use strict and use warnings.
Always check the return value from open() and take appropriate action on failure.
Use lexical filehandles and 3-arg version of open().
No need to quote $file in your open() call.
Using $_ can simplify your code.
/n on the match operator has no effect unless your regex contains parentheses.
Putting that all together (and assuming my second interpretation of your question is correct), your code could look like this:
#!/perl/bin/perl
use strict;
use warnings;
my @files = <c:/perl64/files/*>;
foreach my $file (@files) {
open my $file_h, '<', $file
or die "Can't open $file: $!";
while (<$file_h>) {
print "$file $_" if /\b123\b/;
}
# No need to close $file_h as it is closed
# automatically when the variable goes out
# of scope.
}

Perl using regex to compare fields with multiple delimiters

I am studying Perl.
My data.txt file contains:
Lori:James Apple
Jamie:Eric Orange
My code below prints the first line "Lori:James Apple"
open(FILE,'data.txt');
while(<FILE>){
print if /James/;
}
But how do I modify my regular expression to search for a specific field?
For example, I'd like to use 2 delimiters ' ' and ':' to make each line contain 3 fields and check if the 3rd field of the first line is Apple. Which will be equivalent to awk -F'[ :]' '$3 = "Lori"' data.txt
One simple way with regex is to use the negated character class (also see it in perlretut)
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>)
{
my @fields = $line =~ /([^:\s]+)/g;
}
The [^...] matches any character other than those listed inside (after ^ which "negates"). The + quantifier means to match one-or-more times so the whole pattern matches a string of consecutive characters other than : and "white space." See docs for a precise description of \s. If you actually mean to skip only a single literal space use [^: ]. All this is captured by ().
The search keeps going through the string due to the global modifier /g, finding all such matches. Since it is in list context it returns the list of matches, which is assigned to the @fields array.
One can pick elements "on the fly" by indexing into the list, ($line =~ /([^:\s]+)/g)[2]. If we are matching $_ this is (/([^:\s]+)/g)[2].
I suggest a good read through perlretut, for starters.
On the other hand, it is often simpler and clearer to use split
my @fields = split /[:\s]/, $line;
This also uses regex for the pattern by which to split the string. The character class is not negated since here it specifies the delimiter itself, either : or \s (each delimiter may be either of these, they don't have to all be the same).
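To make the comparison concrete, here is a small sketch that extracts the fields both ways from the sample data and checks the third one. The function names `fields_by_match` and `fields_by_split` are hypothetical, and the split pattern uses `+` so that a run of delimiters counts as a single separator, which is an assumption about the data rather than something the question states.

```perl
use strict;
use warnings;

# Fields via a global match on runs of non-delimiter characters.
sub fields_by_match {
    my ($line) = @_;
    my @fields = $line =~ /([^:\s]+)/g;
    return @fields;
}

# Fields via split on runs of delimiter characters (':' or whitespace).
sub fields_by_split {
    my ($line) = @_;
    my @fields = split /[:\s]+/, $line;
    return @fields;
}

# Demonstration on the two lines of data.txt from the question:
# print the first field of every line whose third field is 'Apple'.
for my $line ( "Lori:James Apple\n", "Jamie:Eric Orange\n" ) {
    my @fields = fields_by_split($line);
    print "$fields[0]\n" if @fields >= 3 && $fields[2] eq 'Apple';
}
```

One difference to keep in mind: if a line starts with a delimiter, split returns an empty first field, while the match-based version does not.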
I would now like to answer the specific question, but the question isn't clear to me.
It asks to "check if the 3rd field of the first line is Apple", which can be done, for example, by
while (<$fh>)
{
if ( (/([^:\s]+)/g)[2] eq 'Apple' ) {
# ....
}
}
but it isn't clear what to do with it. Perhaps get the first field by what the third one is?
I suggest to get an array and then process. One can write a regex to identify and pick fields directly but that's more brittle and the regex itself then depends on the position (and number) of fields.
At this point we are in a guessing game. If you need more detail please clarify.
The given awk code would yield Lori James Lori and I don't see how that fits.
The short answer is - don't. Regular expressions are about pattern matching, and not context.
You can define a pattern that builds in delimiters and fields, but ... it's not the right tool for the job.
The answer is use split and then handle the fields separately.
open ( my $input, '<', 'data.txt' ) or die $!;
while(<$input>){
chomp;
my @fields = split /[\s:]/;
print if $fields[2] eq "Apple";
}
You can compact this further if you wish, but I'd advise caution - compressing your code at the expense of readability isn't a virtue.
Also - whilst we're at it:
open(FILE,'data.txt');
is bad style - it doesn't check for success, and it also uses a global file handle name. It would be much better to:
open ( my $input, '<', 'data.txt' ) or die $!;
The autodie pragma also does this implicitly.

Reading .fasta sequences to extract nucleotide data, and then writing to a TabDelimited file

Before I continue, I thought I'd refer readers to my previous problems with Perl, being a beginner to all of this.
These were my posts over the past few days, in chronological order:
How do I average column values from a tab-separated data... (Solved)
Why do I see no computed results in my output file? (Solved)
Using a .fasta file to compute relative content of sequences
Now as I've stated above, thanks to help from a few of you, I've managed to figure out the first two queries and I've really learnt from it. I'm truly grateful. For a person who knows nothing about this, and still feels like he doesn't, the help was practically a Godsend.
The last query remains unsolved and this is a continuation. I did have a look at some of the recommended texts, but as I'm trying to get this finished before Monday, I'm unsure if I've overlooked anything completely. Either way, I have had a go at attempting the task.
Just so you know, the task is to open and read a .fasta file (I think I've finally nailed something pretty well, hallelujah!), read each sequence, compute the relative G+C nucleotide content, and then write the names of the genes and their respective G+C content to a tab-delimited file.
Even though I've had a go at attempting this, I know that I am nowhere near ready to execute the program to provide the results that I'm after, which is why I'm reaching out to you guys again for some guidance, or examples of how to go about this. As with my previous, solved queries, I'd like it to be in a similar style to what I've already done them in, even though it might not be the most convenient/efficient way. It just allows me to know what I'm doing each step of the way, even though it seems like I'm spamming it up!
Anyway, the .fasta file reads something like:
>label
sequence
>label
sequence
>label
sequence
I'm unsure how to open the .fasta file, so I'm not sure what labels apply to which, but I know that the genes should be labelled either gag, pol, or env. Do I need to open the .fasta file to know what I'm doing, or can I do it 'blindly' by going with the above format?
It may be perfectly obvious, but I'm still struggling with all of this. I'm feeling like I should have caught on by now!
Anyway, the current code I have is as follows:
#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict;
my $infile = "Lab1_seq.fasta"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!"; # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt"; # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open
my $sequence = (); # This sequence variable stores the sequences from the .fasta file
my $GC = 0; # This variable checks for G + C content
my $line; # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
chomp $line; # This removes "\n" at the end of each line (this is invisible)
foreach my $line ($infile) {
if($line = ~/^\s*$/) { # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
next;
} elsif($line = ~/^\s*#/) { # This finds lines with spaces before the hash character. Removes .fasta comment
next;
} elsif($line = ~/^>/) { # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
next;
} else {
$sequence = $line;
}
}
{
$sequence =~ s/\s//g; # Whitespace characters are removed
return $sequence;
}
I'm not sure if anything's correct here, but executing it left me with a syntax error at line 35 (beyond the last line, and hence there isn't anything there!). It said at 'EOF'. That's about all I can point out. Otherwise I'm trying to figure out how to compute the quantities of the nucleotides G + C in each of the sequences, and then tabulating this properly in an output .txt file. I believe that's what is meant by a tab-delimited file?
In any case, I apologise if this query seems to be too lengthy, 'dumb' or a repeat, but in saying that, I couldn't find any information directly pertaining to this, so your help would be much appreciated, and the explanations for each step too if possible!!
Kindest.
You have an extra brace right near the end. This should work:
#!/usr/bin/perl -w
# This script reads several sequences and computes the relative content of G+C of each sequence.
use strict;
my $infile = "Lab1_seq.fasta"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!"; # This opens file, but if file isn't there it mentions this will not open
my $outfile = "Lab1_SeqOutput.txt"; # This is the file's output
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!"; # This opens the output file, otherwise it mentions this will not open
my $sequence = (); # This sequence variable stores the sequences from the .fasta file
my $GC = 0; # This variable checks for G + C content
my $line; # This reads the input file one-line-at-a-time
while ($line = <INFILE>) {
chomp $line; # This removes "\n" at the end of each line (this is invisible)
if($line =~ /^\s*$/) { # This finds lines with whitespaces from the beginning to the ending of the sequence. Removes blank line.
next;
} elsif($line =~ /^\s*#/) { # This finds lines with spaces before the hash character. Removes .fasta comment
next;
} elsif($line =~ /^>/) { # This finds lines with the '>' symbol at beginning of label. Removes .fasta label
next;
} else {
$sequence = $line;
}
$sequence =~ s/\s//g; # Whitespace characters are removed
print OUTFILE $sequence;
}
Also I edited your return line. Return will exit your loop. I suspect what you want is to print it to a file, so I have done that. You may need to do some further transformation first to get it into a tab separated format.
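For the remaining part of the task, computing G+C per sequence and writing label/value pairs separated by a tab, a minimal sketch could look like the following. It is only one possible approach: the helper names `gc_content` and `read_fasta` are hypothetical, the sample labels and sequences are made up, and it assumes a sequence may span several lines under its `>label` line.

```perl
use strict;
use warnings;

# Fraction of G and C among the characters of a sequence (0 for an empty one).
sub gc_content {
    my ($sequence) = @_;
    my $length = length $sequence;
    return 0 unless $length;
    my $gc = ( $sequence =~ tr/GCgc// );   # tr/// with an empty replacement only counts
    return $gc / $length;
}

# Parses FASTA text from a filehandle into a list of [label, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;          # skip blank lines
        if ( $line =~ /^>\s*(.*)/ ) {
            push @records, [ $1, '' ];     # start a new record at each label
        }
        elsif (@records) {
            $line =~ s/\s//g;              # drop stray whitespace
            $records[-1][1] .= $line;      # append to the current sequence
        }
    }
    return @records;
}

# Demonstration on an in-memory FASTA string instead of a real file.
my $fasta = ">gag\nATGC\nGGCC\n>pol\nATAT\n";
open my $in, '<', \$fasta or die "Cannot open in-memory file: $!";
for my $record ( read_fasta($in) ) {
    my ( $label, $sequence ) = @$record;
    printf "%s\t%.3f\n", $label, gc_content($sequence);
}
close $in;
```

To write to a file instead, open an output handle with the three-argument form of open and use `printf {$out} ...` in place of the plain `printf`.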

How to open/join more than one file (depending on user input) and then use 2 files simultaneously

EDIT: Sorry for the misunderstanding, I have edited a few things, to hopefully actually request what I want.
I was wondering if there was a way to open/join two or more files to run the rest of the program on.
For example, my directory has these files:
taggedchpt1_1.txt, parsedchpt1_1.txt, taggedchpt1_2.txt, parsedchpt1_2.txt etc...
The program must call a tagged and parsed simultaneously. I want to run the program on both of chpt1_1 and chpt1_2, preferably joined together in one .txt file, unless it would be very slow to do so. For instance run what would be accomplished having two files:
taggedchpt1_1_and_chpt1_2 and parsedchpt1_1_and_chpt1_2
Can this be done through Perl? Or should I just combine the text files myself(or automate that process, making chpt1.txt which would include chpt1_1, chpt1_2, chpt1_3 etc...)
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
print "Please type in the chapter and section NUMBERS in the form chp#_sec#:\n"; ##So the user inputs 31_3, for example
chomp (my $chapter_and_section = "chpt".<>);
print "Please type in the search word:\n";
chomp (my $search_key = <>);
open(my $tag_corpus, '<', "tagged${chapter_and_section}.txt") or die $!;
open(my $parse_corpus, '<', "parsed${chapter_and_section}.txt") or die $!;
For the rest of the program to work, I need to be able to have:
my @sentences = <$tag_corpus>; ##right now this is one file, I want to make it more
my @typeddependencies = <$parse_corpus>; ##same as above
EDIT2: Really sorry about the misunderstanding. In the program, after the steps shown, I do 2 for loops. Reading through the lines of the tagged and parsed.
What I want is to accomplish this with more files from the same directory, without having to re-input the next files. (ie. I can run taggedchpt31_1.txt and parsedchpt31_1.txt...... I want to run taggedchpt31 and parsedchpt31 - which includes ~chpt31_1, ~chpt31_2, etc...)
Ultimately, it would be best if I joined all the tagged files and all the parsed files that have a common chapter (in the end still requiring only two files I want to run) but not have to save the joined file to the directory... Now that I put it into words, I think I should just save files that include all the sections.
Sorry and Thanks for all your time! Look at FMc's breakdown of my question for more help.
You could iterate over the file names, opening and reading each one in turn. Or you could produce an iterator that knows how to read lines from a sequence of files.
sub files_reader {
# Takes a list of file names and returns a closure that
# will yield lines from those files.
my @handles = map { open(my $h, '<', $_) or die $!; $h } @_;
return sub {
shift @handles while @handles and eof $handles[0];
return unless @handles;
return readline $handles[0];
}
}
my $reader = files_reader('foo.txt', 'bar.txt', 'quux.txt');
while (my $line = $reader->()) {
print $line;
}
Or you could use Perl's built-in iterator that can do the same thing:
local @ARGV = ('foo.txt', 'bar.txt', 'quux.txt');
while (my $line = <>) {
print $line;
}
Edit in response to follow-up questions:
Perhaps it would help to break your problem down into smaller sub-tasks. As I understand it, you have three steps.
Step 1 is to get some input from the user -- perhaps a directory name, or maybe a couple of file name patterns (taggedchpt and parsedchpt).
Step 2 is for the program to find all of the relevant file names. For this task, glob() or readdir() might be useful. There are many questions on StackOverflow related to such issues. You'll end up with two lists of file names, one for the tagged files and one for the parsed files.
Step 3 is to process the lines across all of the files in each of the two sets. Most of the answers you have received, including mine, will help you with this step.
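Steps 2 and 3 can be sketched roughly as follows. This is an illustration under assumptions, not the asker's actual layout: it assumes files are named like taggedchpt31_1.txt, and the helper names `section_files` and `read_all_lines` are hypothetical. It uses `bsd_glob` from the core File::Glob module so that directory names containing spaces are handled.

```perl
use strict;
use warnings;
use File::Glob qw(bsd_glob);

# Returns the files "<dir>/<prefix>chpt<chapter>_<section>.txt",
# ordered numerically by section so that _10 comes after _2.
sub section_files {
    my ( $dir, $prefix, $chapter ) = @_;
    my @files = bsd_glob("$dir/${prefix}chpt${chapter}_*.txt");
    my %section_of;
    for my $file (@files) {
        $section_of{$file} = $file =~ /_(\d+)\.txt$/ ? $1 : 0;
    }
    return sort { $section_of{$a} <=> $section_of{$b} } @files;
}

# Reads all lines of the given files, in order, using the <> iterator.
sub read_all_lines {
    my @files = @_;
    return () unless @files;   # with an empty @ARGV, <> would read STDIN
    local @ARGV = @files;
    my @lines = <>;
    return @lines;
}
```

With these, something like `my @sentences = read_all_lines( section_files( '.', 'tagged', 31 ) );` would collect every section of chapter 31 without saving a joined file to disk.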
No one has mentioned the @ARGV hack yet? Ok, here it is.
{
local @ARGV = ('taggedchpt1_1.txt', 'parsedchpt1_1.txt', 'taggedchpt1_2.txt',
'parsedchpt1_2.txt');
while (<ARGV>) {
s/THIS/THAT/;
print FH $_;
}
}
ARGV is a special filehandle that iterates through all the filenames in @ARGV, closing a file and opening the next one as necessary. Normally @ARGV contains the command-line arguments that you passed to perl, but you can set it to anything you want.
You're almost there... this is a bit more efficient than discrete opens on each file...
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
print "Please type in the chapter and section NUMBERS in the form chp#_sec#:\n";
chomp (my $chapter_and_section = "chpt".<>);
print "Please type in the search word:\n";
chomp (my $search_key = <>);
open(FH, '>output.txt') or die $!; # Open an output file for writing
foreach ("tagged${chapter_and_section}.txt", "parsed${chapter_and_section}.txt") {
open FILE, "<$_" or die $!; # Read a filename (from the array)
foreach (<FILE>) {
$_ =~ s/THIS/THAT/g; # Regex replace each line in the open file (use
# whatever you like instead of "THIS" &
# "THAT"
print FH $_; # Write to the output file
}
}

How can I print a matching line, one line immediately above it and one line immediately below?

From a related question asked by Bi, I've learnt how to print a matching line together with the line immediately below it. The code looks really simple:
#!perl
open(FH,'FILE');
while ($line = <FH>) {
if ($line =~ /Pattern/) {
print "$line";
print scalar <FH>;
}
}
I then searched Google for a different code that can print matching lines with the lines immediately above them. The code that would partially suit my purpose is something like this:
#!perl
@array;
open(FH, "FILE");
while ( <FH> ) {
chomp;
$my_line = "$_";
if ("$my_line" =~ /Pattern/) {
foreach( @array ){
print "$_\n";
}
print "$my_line\n"
}
push(@array,$my_line);
if ( "$#array" > "0" ) {
shift(@array);
}
};
Problem is I still can't figure out how to do them together. Seems my brain is shutting down. Does anyone have any ideas?
Thanks for any help.
UPDATE:
I think I'm sort of touched. You guys are so helpful! Perhaps a little Off-topic, but I really feel the impulse to say more.
I needed a Windows program capable of searching the contents of multiple files and of displaying the related information without having to separately open each file. I tried googling and two apps, Agent Ransack and Devas, have proved to be useful, but they display only the lines containing the matched query and I also want to peek at the adjacent lines. Then the idea of improvising a program popped into my head. Years ago I was impressed by a Perl script that could generate a Tomeraider format of Wikipedia so that I can handily search Wiki on my Lifedrive and I've also read somewhere on the net that Perl is easy to learn especially for some guy like me who has no experience in any programming language. Then I sort of started teaching myself Perl a couple of days ago. My first step was to learn how to do the same job as "Agent Ransack" does and it proved to be not so difficult using Perl. I first learnt how to search the contents of a single file and display the matching lines through the modification of an example used in the book titled "Perl by Example", but I was stuck there. I became totally clueless as to how to deal with multiple files. No similar examples were found in the book or probably because I was too impatient. And then I tried googling again and was led here and I asked my first question "How can I search multiple files for a string pattern in Perl?" here and I must say this forum is bloody AWESOME ;). Then I looked at more example scripts and then I came up with the following code yesterday and it serves my original purpose quite well:
The code goes like this:
#!perl
$hits=0;
print "INPUT YOUR QUERY:";
chop ($query = <STDIN>);
$dir = 'f:/corpus/';
@files = <$dir/*>;
foreach $file (@files) {
open (txt, "$file");
while($line = <txt>) {
if ($line =~ /$query/i) {
$hits++;
print "$file \n $line";
print scalar <txt>;
}
}
}
close(txt);
print "$hits RESULTS FOUND FOR THIS SEARCH\n";
In the folder "corpus", I have a lot of text files including srt pdf doc files that contain such contents as follows:
Then I dumped the body.
J'ai mis le corps dans une décharge.
I know you have a wire.
Je sais que tu as un micro.
Now I'll tell you the truth.
Alors je vais te dire la vérité.
Basically I just need to search an English phrase and look at the French equivalent, so the script I finished yesterday is quite satisfying except that it would be better if my script could display the line above in case I want to search a French phrase and check the English. So I'm trying to improve the code. Actually I knew the "print scalar " is buggy, but it is neat and does the job of printing the subsequent line (at least most of the time). I was even expecting ANOTHER SINGLE magic line that prints the previous line instead of the subsequent :) Perl seems to be fun. I think I will spend more time trying to get a better understanding of it. And as suggested by daotoad, I'll study the code generously offered by you guys. Again, thank you guys!
It will probably be easier just to use grep for this as it allows printing of lines before and after a match. Use -B and -A to print context before and after the match respectively. See http://ss64.com/bash/grep.html
Here's a modernized version of Pax's excellent answer:
use strict;
use warnings;
open( my $fh, '<', 'qq.in')
or die "Error opening file - $!\n";
my $this_line = "";
my $do_next = 0;
while(<$fh>) {
my $last_line = $this_line;
$this_line = $_;
if ($this_line =~ /XXX/) {
print $last_line unless $do_next;
print $this_line;
$do_next = 1;
} else {
print $this_line if $do_next;
$last_line = "";
$do_next = 0;
}
}
close ($fh);
See Why is three-argument open calls with lexical filehandles a Perl best practice? for a discussion of the reasons for the most important changes.
Important changes:
3 argument open.
lexical filehandle
added strict and warnings pragmas.
variables declared with lexical scope.
Minor changes (issues of style and personal taste):
removed unneeded parens from post-fix if
converted an if-not construct into unless.
If you find this answer useful, be sure to up-vote Pax's original.
Given the following input file:
(1:first) Yes, this one.
(2) This one as well (XXX).
(3) And this one.
Not this one.
Not this one.
Not this one.
(4) Yes, this one.
(5) This one as well (XXX).
(6) AND this one as well (XXX).
(7:last) And this one.
Not this one.
this little snippet:
open(FH, "<qq.in");
$this_line = "";
$do_next = 0;
while(<FH>) {
$last_line = $this_line;
$this_line = $_;
if ($this_line =~ /XXX/) {
print $last_line if (!$do_next);
print $this_line;
$do_next = 1;
} else {
print $this_line if ($do_next);
$last_line = "";
$do_next = 0;
}
}
close (FH);
produces the following, which is what I think you were after:
(1:first) Yes, this one.
(2) This one as well (XXX).
(3) And this one.
(4) Yes, this one.
(5) This one as well (XXX).
(6) AND this one as well (XXX).
(7:last) And this one.
It basically works by remembering the last line read and, when it finds the pattern, it outputs it and the pattern line. Then it continues to output pattern lines plus one more (with the $do_next variable).
There's also a little bit of trickery in there to ensure no line is printed twice.
You always want to store the last line that you saw in case the next line has your pattern and you need to print it. Using an array like you did in the second code snippet is probably overkill.
my $last = "";
while (my $line = <FH>) {
if ($line =~ /Pattern/) {
print $last;
print $line;
print scalar <FH>; # next line
}
$last = $line;
}
grep -A 1 -B 1 "search line"
I am going to ignore the title of your question and focus on some of the code you posted because it is positively harmful to let this code stand without explaining what is wrong with it. You say:
code that can print matching lines with the lines immediately above them. The code that would partially suit my purpose is something like this
I am going to go through that code. First, you should always include
use strict;
use warnings;
in your scripts, especially since you are just learning Perl.
@array;
This is a pointless statement. With strict, you can declare @array using:
my @array;
Prefer the three-argument form of open unless there is a specific benefit in a particular situation to not using it. Use lexical filehandles because bareword filehandles are package global and can be the source of mysterious bugs. Finally, always check if open succeeded before proceeding. So, instead of:
open(FH, "FILE");
write:
my $filename = 'something';
open my $fh, '<', $filename
or die "Cannot open '$filename': $!";
If you use autodie, you can get away with:
open my $fh, '<', 'something';
Moving on:
while ( <FH> ) {
chomp;
$my_line = "$_";
First, read the FAQ (you should have done so before starting to write programs). See What's wrong with always quoting "$vars"?. Second, if you are going to assign the line that you just read to $my_line, you should do it in the while statement so you do not needlessly touch $_. Finally, you can be strict compliant without typing any more characters:
while ( my $line = <$fh> ) {
chomp $line;
Refer to the previous FAQ again.
if ("$my_line" =~ /Pattern/) {
Why interpolate $my_line once more?
foreach( @array ){
print "$_\n";
}
Either use an explicit loop variable or turn this into:
print "$_\n" for @array;
So, you interpolate $my_line again and add the newline that was removed by chomp earlier. There is no reason to do so:
print "$my_line\n"
And now we come to the line that motivated me to dissect the code you posted in the first place:
if ( "$#array" > "0" ) {
$#array is a number. 0 is a number. > is used to check if the number on the LHS is greater than the number on the RHS. Therefore, there is no need to convert both operands to strings.
Further, $#array is the last index of @array and its meaning depends on the value of $[. I cannot figure out what this statement is supposed to be checking.
Now, your original problem statement was
print matching lines with the lines immediately above them
The natural question, of course, is how many lines "immediately above" the match you want to print.
#!/usr/bin/perl
use strict;
use warnings;
use Readonly;
Readonly::Scalar my $KEEP_BEFORE => 4;
my $filename = $ARGV[0];
my $pattern = qr/$ARGV[1]/;
open my $input_fh, '<', $filename
or die "Cannot open '$filename': $!";
my @before;
while ( my $line = <$input_fh> ) {
$line = sprintf '%6d: %s', $., $line;
print @before, $line, "\n" if $line =~ $pattern;
push @before, $line;
shift @before if @before > $KEEP_BEFORE;
}
close $input_fh;
Command line grep is the quickest way to accomplish this, but if your goal is to learn some Perl then you'll need to produce some code.
Rather than providing code, as others have already done, I'll talk a bit about how to write your own. I hope this helps with the brain-lock.
Read my previous answer on how to write a program, it gives some tips about how to start working on your problem.
Go through each of the sample programs you have, as well as those offered here and comment out exactly what they do. Refer to the perldoc for each function and operator you don't understand. Your first example code has an error, if 2 lines in a row match, the line after the second match won't print. By error, I mean that either the code or the spec is wrong, the desired behavior in this case needs to be determined.
Write out what you want your program to do.
Start filling in the blanks with code.
Here's a sketch of a phase one write-up:
# This program reads a file and looks for lines that match a pattern.
# Open the file
# Iterate over the file
# For each line
# Check for a match
# If match print line before, line and next line.
But how do you get the next line and the previous line?
Here's where creative thinking comes in, there are many ways, all you need is one that works.
You could read in lines one at a time, but read ahead by one line.
You could read the whole file into memory and select previous and follow-on lines by indexing an array.
You could read the file and store the offset and length each line--keeping track of which ones match as you go. Then use your offset data to extract the required lines.
You could read in lines one at a time. Cache your previous line as you go. Use readline to read the next line for printing, but use seek and tell to rewind the handle so that the 'next' line can be checked for a match.
Any of these methods, and many more could be fleshed out into a functioning program. Depending on your goals, and constraints any one may be the best choice for that problem domain. Knowing how to select which one to use will come with experience. If you have time, try two or three different ways and see how they work out.
Good luck.
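As an illustration of the first approach listed above (reading one line at a time while reading ahead by one line), here is a minimal sketch. The function name `lines_with_context` is hypothetical, and the sketch settles the spec question in one particular way: a line is kept if it matches, or if the line directly before or after it matches, so consecutive matches never print a line twice.

```perl
use strict;
use warnings;

# Returns every line that matches $pattern, plus the line immediately
# before and after each match, in file order and without duplicates.
sub lines_with_context {
    my ( $fh, $pattern ) = @_;
    my @output;
    my $previous;                  # the line before $current, if any
    my $current = <$fh>;
    while ( defined $current ) {
        my $next = <$fh>;          # read ahead by one line
        my $wanted = $current =~ $pattern
            || ( defined $previous && $previous =~ $pattern )
            || ( defined $next     && $next     =~ $pattern );
        push @output, $current if $wanted;
        ( $previous, $current ) = ( $current, $next );
    }
    return @output;
}

# Demonstration on in-memory data shaped like the subtitle files in the question.
my $text = "Then I dumped the body.\n"
         . "J'ai mis le corps dans une decharge.\n"
         . "I know you have a wire.\n"
         . "Je sais que tu as un micro.\n";
open my $demo_fh, '<', \$text or die "Cannot open in-memory file: $!";
print lines_with_context( $demo_fh, qr/wire/ );
close $demo_fh;
```

Because each line is tested against its neighbours only once, the function needs no flag such as `$do_next`; the trade-off is that each line is matched against the pattern up to three times.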
If you don't mind losing the ability to iterate over a filehandle, you could just slurp the file and iterate over the array:
#!/usr/bin/perl
use strict; # always do these
use warnings;
my $range = 1; # number of context lines to print before and after each match
open my $fh, '<', 'FILE' or die "Error: $!";
my @file = <$fh>;
close $fh;
for (0 .. $#file) {
if($file[$_] =~ /Pattern/) {
my @lines = grep { $_ >= 0 && $_ <= $#file } $_ - $range .. $_ + $range;
print @file[@lines];
}
}
This might get horribly slow for large files, but is pretty easy to understand (in my opinion). Only when you know how it works can you set about trying to optimize it. If you have any questions about any of the functions or operations I used, just ask.