I would like to use
myscript.pl targetfolder/*
to read some number from ASCII files.
myscript.pl
@list = <@ARGV>;
# Is the whole file or only 1st line is loaded?
foreach $file ( @list ) {
open (F, $file);
}
# is this correct to judge if there is still file to load?
while ( <F> ) {
match_replace()
}
sub match_replace {
# if I want to read the 5th line in downward, how to do that?
# if I would like to read multi lines in multi array[row],
# how to do that?
if ( /^\sName\s+/ ) {
$name = $1;
}
}
I would recommend a thorough read of perlintro - it will give you a lot of the information you need. Additional comments:
Always use strict and warnings. The first will enforce some good coding practices (like for example declaring variables), the second will inform you about potential mistakes. For example, one warning produced by the code you showed would be readline() on unopened filehandle F, giving you the hint that F is not open at that point (more on that below).
@list = <@ARGV>;: This is a bit tricky, I wouldn't recommend it - you're essentially using glob, and expanding targetfolder/* is something your shell should be doing, and if you're on Windows, I'd recommend Win32::Autoglob instead of doing it manually.
foreach ... { open ... }: You're not doing anything with the files once you've opened them - the loop to read from the files needs to be inside the foreach.
"Is the whole file or only 1st line is loaded?" open doesn't read anything from the file, it just opens it and provides a filehandle (which you've named F) that you then need to read from.
I'd strongly recommend you use the more modern three-argument form of open and check it for errors, as well as use lexical filehandles since their scope is not global, as in open my $fh, '<', $file or die "$file: $!";.
"is this correct to judge if there is still file to load?" Yes, while (<$filehandle>) is a good way to read a file line-by-line, and the loop will end when everything has been read from the file. You may want to use the more explicit form while (my $line = <$filehandle>), so that your variable has a name, instead of the default $_ variable - it does make the code a bit more verbose, but if you're just starting out that may be a good thing.
match_replace(): You're not passing any parameters to the sub. Even though this code might still "work", it's passing the current line to the sub through the global $_ variable, which is not a good practice because it will be confusing and error-prone once the script starts getting longer.
if (/^\sName\s+/){$name = $1;}: Since you've named the sub match_replace, I'm guessing you want to do a search-and-replace operation. In Perl, that's called s/search/replacement/, and you can read about it in perlrequick and perlretut. As for the code you've shown, you're using $1, but you don't have any "capture groups" ((...)) in your regular expression - you can read about that in those two links as well. (There's a short illustrative sketch after these notes.)
"if I want to read the 5th line in downward , how to do that ?" As always in Perl, There Is More Than One Way To Do It (TIMTOWTDI). One way is with the range operator .. - you can skip the first through fourth lines by saying next if 1..4; at the beginning of the while loop, this will test those line numbers against the special $. variable that keeps track of the most recently read line number.
"and if I would like to read multi lines in multi array[row], how to do that ?" One way is to use push to add the current line to the end of an array. Since keeping the lines of a file in an array can use up more memory, especially with large files, I'd strongly recommend making sure you think through the algorithm you want to use here. You haven't explained why you would want to keep things in an array, so I can't be more specific here.
So, having said all that, here's how I might have written that code. I've added some debugging code using Data::Dumper - it's always helpful to see the data that your script is working with.
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper; # for debugging
$Data::Dumper::Useqq=1;
for my $file (@ARGV) {
print Dumper($file); # debug
open my $fh, '<', $file or die "$file: $!";
while (my $line = <$fh>) {
next if 1..4;
chomp($line); # remove line ending
match_replace($line);
}
close $fh;
}
sub match_replace {
my ($line) = @_; # get argument(s) to sub
my $name;
if ( $line =~ /^\sName\s+(.*)$/ ) {
$name = $1;
}
print Data::Dumper->Dump([$line,$name],['line','name']); # debug
# ... do more here ...
}
The above code is explicitly looping over @ARGV and opening each file, and I did say above that more verbose code can be helpful in understanding what's going on. I just wanted to point out a nice feature of Perl, the "magic" <> operator (discussed in perlop under "I/O Operators"), which will automatically open the files in @ARGV and read lines from them. (There's just one small thing: if I want to use the $. variable and have it count the lines per file, I need to use the continue block I've shown below; this is explained in eof.) This would be a more "idiomatic" way of writing that first loop:
while (<>) { # reads line into $_
next if 1..4;
chomp; # automatically uses $_ variable
match_replace($_);
} continue { close ARGV if eof } # needed for $. (and range operator)
Related
I want to read a file line by line, but it's reading just the first line. How do I read all lines?
My code:
open(file_E, $file_E);
while ( <file_E> ) {
/([^\n]*)/;
print $line1;
}
close($file_E);
Let's start by looking at your code.
open(file_E, $file_E);
while ( <file_E> ) {
/([^\n]*)/;
print $line1;
}
close($file_E);
On the first line you open a file named in $file_E using the bareword filehandle file_E. This should work so long as the file successfully opens. It would be better to also check the success of this operation in one of two ways: either put use autodie; at the top of your script (but then risk applying its semantics in places where your code is incompatible with this level of error handling), or change your open to look like this:
open(file_E, $file_E) or die "Failed to open $file_E: $!\n";
Now if you fail to open the file you will get an error message that will help track down the problem.
Next let's look at the while loop, because it's here that you have the issue that is causing the bug you are experiencing. On the first line of the while loop you have this:
while ( <file_E> ) {
By consulting perldoc perlsyn you will see that this line is special-cased to actually do this:
while (defined($_ = <file_E>)) {
So your code is implicitly assigning each line to $_ on successive iterations. Also by consulting perldoc perlop you'll find that when the match operator (/.../ or m/.../) is invoked without binding the match explicitly using =~, the match will bind against $_. Still, so far so good. However, you are not actually doing anything useful with the match. The match operator will return Boolean truth / falsehood for whether or not the match succeeded. And because your pattern contains capturing parentheses, it will capture something into the capture variable $1. But you are never testing for match success, nor are you ever referring to $1 again.
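As a side note, here is a tiny self-contained illustration (the string and variable names are made up) of binding a match explicitly with =~ and only using $1 after testing that the match succeeded:
use strict;
use warnings;
my $line = "alpha beta\n";
if ( $line =~ /^(\w+)/ ) {    # the (...) captures the first word
    my $first_word = $1;      # only read $1 after a successful match
    print "Matched: $first_word\n";
}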
On the line that follows, you do this: print $line1. Where, in your code, is $line1 being assigned a value? Because it is never being assigned a value in what you've shown us.
I can only guess that your intent is to iterate over the lines of the file, capture the line but without the trailing newline, and then print it. It seems that you wish to print it without any newlines, so that all of the input file is printed as a single line of output.
open my $input_fh_e, '<', $file_E or die "Failed to open $file_E: $!\n";
while(my $line = <$input_fh_e>) {
chomp $line;
print $line;
}
close $input_fh_e or die "Failed to close $file_E: $!\n";
No need to capture anything -- if all that the capture is doing is just grabbing everything up to the newline, you can simply strip off the newline with chomp to begin with.
In my example I used a lexical filehandle (a file handle that is lexically scoped, declared with my). This is generally a better practice in modern Perl as it avoids using a bareword, avoids possible namespace collisions, and assures that the handle will get closed as soon as the lexical scope closes.
I also used the 'three arg' version of open, which is safer because it eliminates the potential for $file_E to be used to open a pipe or do some other nefarious or simply unintended shell manipulation.
I suggest also starting your script with use strict;, because had you done so, you would have gotten an error message at compile time telling you that $line1 was never declared. Also start your script with use warnings;, so that you would get a warning when you try to print $line1 before assigning a value to it.
Most of the issues in your code will be discussed in perldoc perlintro, which you can arrive at from your command line simply by typing perldoc perlintro, assuming you have Perl installed. It typically takes 20-40 minutes to read through perlintro. If ever there were a document that should constitute required reading before getting started writing Perl code, that reading would probably include perlintro.
Another alternative. Note that $_ will include the newline, so you will need to chomp it if you don't want the newline in $line:
open(file_E, $file_E);
while ( <file_E> ) {
my $line = $_;
print $line;
}
close($file_E);
I have been using the following Perl code to extract text from multiple text files. It works fine.
Example of a couple of lines in one of the input files:
Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX
Fa0/22 xyz MLS notconnect 1293 half 10 10/100BaseTX
What I need is to match the numbers in each line exactly (i.e. 129 is not matched by 1293) and print the corresponding lines.
It would also be nice to match a range of numbers while leaving specific numbers out, i.e. match 2 through 10 but not 11, then 12 through 20.
#!/perl/bin/perl
use warnings;
my @files = <c:/perl64/files/*>;
foreach $file ( @files ) {
open( FILE, "$file" );
while ( $line = <FILE> ) {
print "$file $line" if $line =~ /123/n;
}
close FILE;
}
Thank you for the suggestions, but can it be done using the code structure above?
I suggest that you take a look at perldoc perlre.
You need to anchor your regex pattern. The easiest way is probably using \b which is a zero-width boundary between alphanumerics and non-alphanumerics.
#!/perl/bin/perl
use warnings;
use strict;
foreach my $file ( glob "c:/perl64/files/*" ) {
open( my $input, '<', $file ) or die $!;
while (<$input>) {
print "$file $_" if m/\b123\b/;
}
close $input;
}
Note - you should use three-argument open with lexical file handles as above, because it is better practice.
I've also removed the n pattern modifier, as it appears redundant.
Following your edit, though, which gives us some source data: I'd suggest the solution is not to use a regex at all - your source data looks space delimited. (Maybe those are tabs?)
So I'd suggest you're better off using split and selecting the field you want, and testing it numerically, because you mention matching ranges. This is not a good fit for regexes because they don't understand the numeric content.
Instead:
while ( <$input> ) {
print if (split)[-4] == 129;
}
Note - I use -4 in the split, which indexes from the end of the list.
This is because column 3 contains spaces, so splitting on whitespace is going to produce the wrong result unless we count down from the end of the array. Using a negative index we get the right field each time.
If your data is tab separated then you could use chomp and split /\t/. Or potentially split on /\s{2,}/ to split on 2-or-more spaces
But by selecting the field, you can do numeric tests on it, like
if $fields[-4] > 100 and $fields[-4] < 200
etc.
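Putting that together, here is one possible sketch of the range test from your question (match 2 through 10, skip 11, then match 12 through 20), building on the split approach above; the glob path is the one from your code:
use strict;
use warnings;
for my $file ( glob 'c:/perl64/files/*' ) {
    open my $input, '<', $file or die "$file: $!";
    while ( my $line = <$input> ) {
        my $n = ( split ' ', $line )[-4];           # the numeric column, counted from the end
        next unless defined $n and $n =~ /^\d+$/;   # skip lines with no number in that position
        print "$file $line"
            if ( $n >= 2 and $n <= 10 )             # 2 through 10 ...
            or ( $n >= 12 and $n <= 20 );           # ... or 12 through 20, but never 11
    }
    close $input;
}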
I hope you don't get the answers you're asking for, which discard best practice because of your unfamiliarity with Perl. It is inappropriate to ask how to write an ugly solution because proper Perl is beyond your reach
As has been said repeatedly on this site, if you don't know how to do a job then you should hire someone who does know and pay them for their work. No other profession that I know has the expectation of getting quality work done for free
Here's a few notes on your code. Wherever you have learned your techniques, you have been looking at a very outdated resource
Do you really have a root directory perl, so that your compiler is /perl/bin/perl? That's very unusual, and there is no need to use a shebang line in Windows
You must always add use strict and use warnings 'all' at the top of every Perl program you write, and declare all of your variables using my as close as possible to their first point of use. For some reason you do this with @files but not with $file
It is better to replace <c:/perl64/files/*> with glob 'C:/perl64/files/*'. Otherwise the code is less clear because Perl overloads the <> operator
Don't put variable names inside double quotes. It is unnecessary at best, and may cause bugs. So "$file" should be $file
Always use the three-parameter version of open, so that the second parameter is the open mode
Don't use global file handles. And always test whether the file has been opened correctly, dying with a message including $!—the reason for the failure—if the open fails
open( FILE, "$file" )
should be something like
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!}
Don't rely on regex patterns for everything. In this case it looks like split would be a better option, or perhaps unpack if your records have fixed-width fields. In my solution below I have used split on "more than one space", but if your real data is different from what you have shown (tab-delimited?) then this is not going to work (a rough unpack sketch follows these notes)
Note that Fa0/129 will also be matched by your current approach
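As an aside - and only if your records really are fixed-width - a rough unpack sketch might look like the following. The A10/A18/A13/A6 widths are invented for this one sample record and would have to be adjusted to your real layout
use strict;
use warnings;
# Build one sample record with sprintf so that the invented widths line up
my $line = sprintf '%-10s%-18s%-13s%-6s%s',
    'Fa0/19', 'CUTExyz', 'notconnect', '129', 'half  100 10/100BaseTX';
my ( $port, $desc, $status, $vlan ) = unpack 'A10 A18 A13 A6', $line;
print "port=$port vlan=$vlan\n";   # the A format strips trailing spaces
print "$line\n" if $vlan == 129;   # print the whole record when the VLAN field matches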
This Perl program filters your data, printing lines where the fourth field, $fields[3] (delimited by more than one whitespace character), is numerically equal to 129
The output shown is produced when the input is the single file splitn.txt, containing the data shown in your question
use strict;
use warnings 'all';
for my $file ( glob 'C:/perl64/files/*' ) {
open my $fh, '<', $file or die qq{Unable to open "$file" for input: $!};
while ( my $line = <$fh> ) {
chomp $line;
my @fields = split /\s\s+/, $line;
print "$file $line\n" if $fields[3] == 129;
}
}
output
splitn.txt Fa0/19 CUTExyz notconnect 129 half 100 10/100BaseTX
Your question is unclear. When you say:
What I need is to match the numbers on each line exactly
That could mean a couple of things. It could mean that each line contains nothing but a single number which you want to match. In that case, using == is probably better than using a regular expression. Or it could mean that you have lots of text on a line and you only want to match complete numbers. In that case you should use \b (the "word boundary" anchor) - /\b123\b/.
If you're clearer in your questions (perhaps by giving us sample input) then people won't have to guess at your meaning.
A few more points on your code:
Always include both use strict and use warnings.
Always check the return value from open() and take appropriate action on failure.
Use lexical filehandles and 3-arg version of open().
No need to quote $file in your open() call.
Using $_ can simplify your code.
/n on the match operator has no effect unless your regex contains parentheses.
Putting that all together (and assuming my second interpretation of your question is correct), your code could look like this:
#!/perl/bin/perl
use strict;
use warnings;
my @files = <c:/perl64/files/*>;
foreach my $file (@files) {
open my $file_h, '<', $file
or die "Can't open $file: $!";
while (<$file_h>) {
print "$file $_\n" if /\b123\b/;
}
# No need to close $file_h as it is closed
# automatically when the variable goes out
# of scope.
}
I am working on a program that takes information from a CSV file and uses it to search through a text file that contains "customer packages". I am getting odd counts on only some of the entries, and I can't seem to figure out what is causing the duplicate counts. Can anyone look through my code and tell me if my logic/syntax is off? (It probably is.) All I am trying to accomplish is to count the total occurrences in the text file of each entry in the CSV file (packageid,package_description).
Thanks for the help! I'm going nuts over here.
#!/usr/bin/perl
use strict;
use Text::CSV;
# Variables already declared in the other PL file ** Remove if consolidating **
my $file2 = 'master_plist.csv';
my $csv2 = Text::CSV->new(); # Create a Text::CSV object
open (CSV2, "<", $file2) or die $!; #open CSV file for parsing
while (<CSV2>) {
if ($csv2->parse($_)) {
my @columns2 = $csv2->fields(); # Parse CSV and load into an array for each row.
my $packID = $columns2[0];
my $packDESC = $columns2[1];
my $val = 'customer_packages_report.txt';
chomp ($val);
my $cnt=0;
open (HNDL, "$val") || die "wrong filename";
while ($val = <HNDL>)
{
while ($val =~ /$packID - $packDESC/ig)
{
$cnt++;
}
}
#if ($packDESC =~ /\(/g) {
# $packDESC =~ s/\(/\(/g;
#}
print "Total iterations of $packDESC: $cnt\n";
close (HNDL);
# End original code
} # Close IF
} # Close WHILE
close CSV;
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
# Variables already declared in the other PL file ** Remove if consolidating **
my $file2 = 'master_plist.csv';
my $csv2 = Text::CSV->new(); # Create a Text::CSV object
open (CSV2, "<", $file2) or die "I die while opening $file2! $!"; #open CSV file for parsing
while (my $each_csv2_line = <CSV2>) {
if ($csv2->parse($each_csv2_line)) {
my @columns2 = $csv2->fields(); # Parse CSV and load into an array for each row.
my $packID = $columns2[0];
my $packDESC = $columns2[1];
my $val = 'customer_packages_report.txt';
chomp ($val);
my $cnt=0;
open (HNDL,"<","$val") or die "wrong filename: $val! $!";
while (<HNDL>){
$cnt++ while (/$packID - $packDESC/ig);
}
#if ($packDESC =~ /\(/g) {
# $packDESC =~ s/\(/\(/g;
#}
print "Total iterations of $packDESC: $cnt\n";
close (HNDL);
# End original code
} # Close IF
} # Close WHILE
# end of script
close CSV2;
My recommendations:
Use $HNDL instead of HNDL - lexical filehandles are better than bareword ones.
Try to catch every failure case (check results with defined, == 0, and eq "").
I have tried to format your code and added some features that I sometimes use. Be better than me: first read Style Coding for Little Perl Monk. Then you can do more with this language and write code that is not write-only.
Example (and also a quote):
"The situation is exactly the same for the line-input operator, <>, although Perl does this for you automatically.
It looks like you’re testing the line from STDIN in this while:
while (<STDIN>) {
do_something($_);
}
However, this is a special case in which Perl automatically converts to check $_ for definedness:
while ( defined( $_ = <STDIN> ) ) { # implicitly done
do_something($_);
}
"
Effective Perl Programming, page 24.
You could do a number of things to improve your code:
use warnings;.
Use proper indentation.
Use descriptive variable names. Instead of $file2 (has no meaning, and why is there no file 1?), use $package_file or whatever makes sense.
if you are already using Text::CSV, you can use $csv->getline() to go through the file line by line. This will simplify your code. See the documentation for an example.
chomp($val) removes a newline from the end of a string. You are using it on a string literal you just declared, which has no newline. That doesn't make sense.
Never use the same variable ($val) to do two completely different things. This is extremely confusing.
Might the variables that you are interpolating in the regex contain special characters? If so, you need to escape them. For example, if $packDESC contained a period, it would match any character in the regex. To treat the contents of the variable literally, use \Q..\E, as in this example: /\Q$packID - $packDESC\E/ig.
You are opening customer_packages_report.txt and going through it line-by-line on every line of the csv file. You could simplify this by reading it in once and storing the results in an array.
You don't need a while loop to count matches: $cnt = () = /$packID - $packDESC/ig;. This puts the match in array context, returning an array of matches, then puts it back in scalar context to count the matches. A little bit tricky, but simpler.
It's hard to say exactly what is causing your problem without seeing the data. Might you have some unnecessary repetition that stems from your nested looping over both files? I would start by rewriting to improve your code, then see if the problem still exists.
Your code seems to compile with perl -c without errors, so that's good. If I were to guess, I would assume your problem lies in having meta characters in some of your fields. The regex /$packID - $packDESC/ is vulnerable to meta characters. For example
my $str = "foo? bar";
$str =~ /$str/; # returns false, because ? is a meta character
In the above example, the question mark ? is a quantifier which affects whatever comes before it, so that o? means "0 or 1 o". To solve the meta character problem, use the \Q ... \E escape:
$str =~ /\Q$str/; # will now match
Terminating the escape sequence with \E is optional.
Some other things to note:
It is very good that you use use strict. You should also always use warnings. Not doing so is not removing the issues with your code, only hiding them.
You create a Text::CSV object with default settings. Depending on your input, that may or may not be appropriate. Setting binary => 1 is recommended in the documentation.
Using the parse() function may not be the best option; the documentation has good things to say about getline (a small sketch follows these notes).
As loldop points out in the comments, you are reusing $val to read from your file. While technically that should work, it is asking for trouble.
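For reference, here is a minimal getline() sketch using the CSV file name from your question; I'm assuming the two columns are packID and packDESC as in your code:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new( { binary => 1 } )
    or die 'Cannot use Text::CSV: ' . Text::CSV->error_diag;
open my $csv_fh, '<', 'master_plist.csv' or die "master_plist.csv: $!";
while ( my $row = $csv->getline($csv_fh) ) {
    my ( $packID, $packDESC ) = @$row;   # same columns as your parse()/fields() code
    print "$packID => $packDESC\n";
}
$csv->eof or $csv->error_diag;           # report a parse error, if any
close $csv_fh;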
Style and practice notes and practical tips:
Using three-argument open and lexical file handles is a good thing to do. Three-argument in essence means to use an explicit open mode, which makes your script safer to use. Using lexical file handles means that you will not have global scope on your file handle, which is a good thing.
This code
my @columns2 = $csv2->fields();
my $packID = $columns2[0];
my $packDESC = $columns2[1];
Can be written like this
my ($packID, $packDESC) = $csv2->fields();
You are chomping $val right after you assign it. That is redundant, because chomp by default only removes newlines from the end of your strings, and you did not add any. It doesn't change anything, but it isn't required here. If you read something from stdin or a file, you would probably want to use chomp, though.
Using die without referring to the error $! is a sure way to make yourself annoyed.
Do not underestimate how much easier it becomes to write code when you use proper indentation. Use a text editor with automatic indentation and colouring. I can warmly recommend vim (gvim if you are using windows). Though it has a learning curve, it is a powerful editor that also often comes already installed on many systems.
Since so many people have already commented on your program itself, I'm going to talk about how you can become a better Perl programmer, and help write in such a way that will help eliminate many of your issues.
Take a look at Perl::Tidy and run your program through that. That will help improve your syntax and your Perl, and will help you catch a lot of the various issues you're having.
Also, you should get a copy of Perl Best Practices, which is where most of Perl Tidy is taken from. And, as someone already referenced, Effective Perl Programming is another excellent book.
The big issue with Perl is that few people learn it. Most are tossed into a situation where we had to pick it up ourselves. Plus, Perl is a fairly old and rather crufty language. Most Perl books still lean heavily on Perl 3.x ways of programming and fail to mention such basics as using use strict; and use warnings;.
You combine old programming practices with most people learning Perl by hacking their way through old programs with old syntax (and probably written by people who learned Perl by hacking their way through even older programs), and you can see why Perl has a reputation of being a write-only language.
You may want to use the getline method from Text::CSV, which saves a few lines of code.
The problem is likely to be because you have regex metacharacters in the strings you are searching for. Escape them with \Q...\E in the regex so that they are taken literally. In the rewrite below I have also added \s* instead of a literal space, just in case there isn't exactly one space on either side of the hyphen.
I have also changed the filehandles to lexical ones, which have the advantage that they will be closed automatically when the handle goes out of scope.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $file2 = 'master_plist.csv';
my $csv2 = Text::CSV->new();
open(my $csv_fh, '<', $file2) or die $!;
while (my $row = $csv2->getline($csv_fh)) {
my ($packID, $packDESC) = @$row;
my $val = 'customer_packages_report.txt';
chomp($val);
open(my $fh, '<', $val) or die "wrong filename";
my $cnt = 0;
while ($val = <$fh>) {
while ($val =~ /\Q$packID\E\s*-\s*\Q$packDESC\E/ig) {
$cnt++;
}
}
print "Total iterations of $packDESC: $cnt\n";
}
From a related question asked by Bi, I've learnt how to print a matching line together with the line immediately below it. The code looks really simple:
#!perl
open(FH,'FILE');
while ($line = <FH>) {
if ($line =~ /Pattern/) {
print "$line";
print scalar <FH>;
}
}
I then searched Google for a different code that can print matching lines with the lines immediately above them. The code that would partially suit my purpose is something like this:
#!perl
@array;
open(FH, "FILE");
while ( <FH> ) {
chomp;
$my_line = "$_";
if ("$my_line" =~ /Pattern/) {
foreach( @array ){
print "$_\n";
}
print "$my_line\n"
}
push(@array,$my_line);
if ( "$#array" > "0" ) {
shift(#array);
}
};
Problem is I still can't figure out how to do them together. Seems my brain is shutting down. Does anyone have any ideas?
Thanks for any help.
UPDATE:
I think I'm sort of touched. You guys are so helpful! Perhaps a little Off-topic, but I really feel the impulse to say more.
I needed a Windows program capable of searching the contents of multiple files and of displaying the related information without having to open each file separately. I tried googling, and two apps, Agent Ransack and Devas, have proved to be useful, but they display only the lines containing the matched query and I also want to peek at the adjacent lines. Then the idea of improvising a program popped into my head.

Years ago I was impressed by a Perl script that could generate a Tomeraider format of Wikipedia so that I could handily search Wiki on my Lifedrive, and I've also read somewhere on the net that Perl is easy to learn, especially for some guy like me who has no experience in any programming language. So I sort of started teaching myself Perl a couple of days ago. My first step was to learn how to do the same job as "Agent Ransack" does, and it proved to be not so difficult using Perl. I first learnt how to search the contents of a single file and display the matching lines by modifying an example used in the book titled "Perl by Example", but I was stuck there. I became totally clueless as to how to deal with multiple files; no similar examples were found in the book, or probably I was too impatient.

Then I tried googling again and was led here, where I asked my first question, "How can I search multiple files for a string pattern in Perl?", and I must say this forum is bloody AWESOME ;). Then I looked at more example scripts, and I came up with the following code yesterday; it serves my original purpose quite well:
The code goes like this:
#!perl
$hits=0;
print "INPUT YOUR QUERY:";
chop ($query = <STDIN>);
$dir = 'f:/corpus/';
@files = <$dir/*>;
foreach $file (@files) {
open (txt, "$file");
while($line = <txt>) {
if ($line =~ /$query/i) {
$hits++;
print "$file \n $line";
print scalar <txt>;
}
}
}
close(txt);
print "$hits RESULTS FOUND FOR THIS SEARCH\n";
In the folder "corpus", I have a lot of text files including srt pdf doc files that contain such contents as follows:
Then I dumped the body.
J'ai mis le corps dans une décharge.
I know you have a wire.
Je sais que tu as un micro.
Now I'll tell you the truth.
Alors je vais te dire la vérité.
Basically I just need to search an English phrase and look at the French equivalent, so the script I finished yesterday is quite satisfying, except that it would be better if my script could also display the line above in case I want to search a French phrase and check the English. So I'm trying to improve the code. Actually I knew the "print scalar" trick is buggy, but it is neat and does the job of printing the subsequent line (at least most of the time). I was even expecting ANOTHER SINGLE magic line that prints the previous line instead of the subsequent one :) Perl seems to be fun. I think I will spend more time trying to get a better understanding of it. And as suggested by daotoad, I'll study the code generously offered by you guys. Again, thank you guys!
It will probably be easier just to use grep for this as it allows printing of lines before and after a match. Use -B and -A to print context before and after the match respectively. See http://ss64.com/bash/grep.html
Here's a modernized version of Pax's excellent answer:
use strict;
use warnings;
open( my $fh, '<', 'qq.in')
or die "Error opening file - $!\n";
my $this_line = "";
my $do_next = 0;
while(<$fh>) {
my $last_line = $this_line;
$this_line = $_;
if ($this_line =~ /XXX/) {
print $last_line unless $do_next;
print $this_line;
$do_next = 1;
} else {
print $this_line if $do_next;
$last_line = "";
$do_next = 0;
}
}
close ($fh);
See Why is three-argument open calls with lexical filehandles a Perl best practice? for a discussion of the reasons for the most important changes.
Important changes:
3 argument open.
lexical filehandle
added strict and warnings pragmas.
variables declared with lexical scope.
Minor changes (issues of style and personal taste):
removed unneeded parens from post-fix if
converted an if-not construct into unless.
If you find this answer useful, be sure to up-vote Pax's original.
Given the following input file:
(1:first) Yes, this one.
(2) This one as well (XXX).
(3) And this one.
Not this one.
Not this one.
Not this one.
(4) Yes, this one.
(5) This one as well (XXX).
(6) AND this one as well (XXX).
(7:last) And this one.
Not this one.
this little snippet:
open(FH, "<qq.in");
$this_line = "";
$do_next = 0;
while(<FH>) {
$last_line = $this_line;
$this_line = $_;
if ($this_line =~ /XXX/) {
print $last_line if (!$do_next);
print $this_line;
$do_next = 1;
} else {
print $this_line if ($do_next);
$last_line = "";
$do_next = 0;
}
}
close (FH);
produces the following, which is what I think you were after:
(1:first) Yes, this one.
(2) This one as well (XXX).
(3) And this one.
(4) Yes, this one.
(5) This one as well (XXX).
(6) AND this one as well (XXX).
(7:last) And this one.
It basically works by remembering the last line read and, when it finds the pattern, it outputs it and the pattern line. Then it continues to output pattern lines plus one more (with the $do_next variable).
There's also a little bit of trickery in there to ensure no line is printed twice.
You always want to store the last line that you saw in case the next line has your pattern and you need to print it. Using an array like you did in the second code snippet is probably overkill.
my $last = "";
while (my $line = <FH>) {
if ($line =~ /Pattern/) {
print $last;
print $line;
print scalar <FH>; # next line
}
$last = $line;
}
grep -A 1 -B 1 "search line"
I am going to ignore the title of your question and focus on some of the code you posted because it is positively harmful to let this code stand without explaining what is wrong with it. You say:
code that can print matching lines with the lines immediately above them. The code that would partially suit my purpose is something like this
I am going to go through that code. First, you should always include
use strict;
use warnings;
in your scripts, especially since you are just learning Perl.
@array;
This is a pointless statement. With strict, you can declare @array using:
my @array;
Prefer the three-argument form of open unless there is a specific benefit in a particular situation to not using it. Use lexical filehandles because bareword filehandles are package global and can be the source of mysterious bugs. Finally, always check if open succeeded before proceeding. So, instead of:
open(FH, "FILE");
write:
my $filename = 'something';
open my $fh, '<', $filename
or die "Cannot open '$filename': $!";
If you use autodie, you can get away with:
open my $fh, '<', 'something';
Moving on:
while ( <FH> ) {
chomp;
$my_line = "$_";
First, read the FAQ (you should have done so before starting to write programs). See What's wrong with always quoting "$vars"?. Second, if you are going to assign the line that you just read to $my_line, you should do it in the while statement so you do not needlessly touch $_. Finally, you can be strict compliant without typing any more characters:
while ( my $line = <$fh> ) {
chomp $line;
Refer to the previous FAQ again.
if ("$my_line" =~ /Pattern/) {
Why interpolate $my_line once more?
foreach( @array ){
print "$_\n";
}
Either use an explicit loop variable or turn this into:
print "$_\n" for #array;
So, you interpolate $my_line again and add the newline that was removed by chomp earlier. There is no reason to do so:
print "$my_line\n"
And now we come to the line that motivated me to dissect the code you posted in the first place:
if ( "$#array" > "0" ) {
$#array is a number. 0 is a number. > is used to check if the number on the LHS is greater than the number on the RHS. Therefore, there is no need to convert both operands to strings.
Further, $#array is the last index of @array and its meaning depends on the value of $[. I cannot figure out what this statement is supposed to be checking.
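For reference, here is a tiny snippet showing the difference between the element count and $#array, and a clearer way to write that kind of test:
use strict;
use warnings;
my @array = ( 'a', 'b', 'c' );
print scalar(@array), "\n";                       # 3 -- the number of elements
print $#array, "\n";                              # 2 -- the index of the last element
print "more than one element\n" if @array > 1;    # clearer than comparing $#array to 0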
Now, your original problem statement was
print matching lines with the lines immediately above them
The natural question, of course, is how many lines "immediately above" the match you want to print.
#!/usr/bin/perl
use strict;
use warnings;
use Readonly;
Readonly::Scalar my $KEEP_BEFORE => 4;
my $filename = $ARGV[0];
my $pattern = qr/$ARGV[1]/;
open my $input_fh, '<', $filename
or die "Cannot open '$filename': $!";
my @before;
while ( my $line = <$input_fh> ) {
$line = sprintf '%6d: %s', $., $line;
print @before, $line, "\n" if $line =~ $pattern;
push @before, $line;
shift @before if @before > $KEEP_BEFORE;
}
close $input_fh;
Command line grep is the quickest way to accomplish this, but if your goal is to learn some Perl then you'll need to produce some code.
Rather than providing code, as others have already done, I'll talk a bit about how to write your own. I hope this helps with the brain-lock.
Read my previous answer on how to write a program, it gives some tips about how to start working on your problem.
Go through each of the sample programs you have, as well as those offered here, and comment exactly what they do. Refer to the perldoc for each function and operator you don't understand. Your first example code has an error: if 2 lines in a row match, the line after the second match won't print. By error, I mean that either the code or the spec is wrong; the desired behavior in this case needs to be determined.
Write out what you want your program to do.
Start filling in the blanks with code.
Here's a sketch of a phase one write-up:
# This program reads a file and looks for lines that match a pattern.
# Open the file
# Iterate over the file
# For each line
# Check for a match
# If match print line before, line and next line.
But how do you get the next line and the previous line?
Here's where creative thinking comes in, there are many ways, all you need is one that works.
You could read in lines one at a time, but read ahead by one line.
You could read the whole file into memory and select previous and follow-on lines by indexing an array.
You could read the file and store the offset and length each line--keeping track of which ones match as you go. Then use your offset data to extract the required lines.
You could read in lines one at a time. Cache your previous line as you go. Use readline to read the next line for printing, but use seek and tell to rewind the handle so that the 'next' line can be checked for a match (this approach is sketched below).
Any of these methods, and many more could be fleshed out into a functioning program. Depending on your goals, and constraints any one may be the best choice for that problem domain. Knowing how to select which one to use will come with experience. If you have time, try two or three different ways and see how they work out.
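For instance, the last approach above might be sketched roughly like this; the file name and pattern are placeholders, and consecutive matches would still need extra care to avoid printing some lines twice:
use strict;
use warnings;
open my $fh, '<', 'FILE' or die "FILE: $!";
my $previous = '';
while ( my $line = <$fh> ) {
    if ( $line =~ /Pattern/ ) {
        print $previous, $line;
        my $pos  = tell $fh;          # remember where we are
        my $next = <$fh>;             # peek at the line after the match
        print $next if defined $next;
        seek $fh, $pos, 0;            # rewind, so the peeked line is still tested for a match
    }
    $previous = $line;
}
close $fh;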
Good luck.
If you don't mind losing the ability to iterate over a filehandle, you could just slurp the file and iterate over the array:
#!/usr/bin/perl
use strict; # always do these
use warnings;
my $range = 1; # change this to print the first and last X lines
open my $fh, '<', 'FILE' or die "Error: $!";
my @file = <$fh>;
close $fh;
for (0 .. $#file) {
if($file[$_] =~ /Pattern/) {
my @lines = grep { $_ >= 0 && $_ <= $#file } $_ - $range .. $_ + $range;
print @file[@lines];
}
}
This might get horribly slow for large files, but is pretty easy to understand (in my opinion). Only when you know how it works can you set about trying to optimize it. If you have any questions about any of the functions or operations I used, just ask.
I have recently started learning Perl and one of my latest assignments involves searching a bunch of files for a particular string. The user provides the directory name as an argument and the program searches all the files in that directory for the pattern. Using readdir() I have managed to build an array with all the searchable file names and now need to search each and every file for the pattern, my implementation looks something like this -
sub searchDir($) {
my $dirN = shift;
my @dirList = glob("$dirN/*");
for(@dirList) {
push @fileList, $_ if -f $_;
}
@ARGV = @fileList;
while(<>) {
## Search for pattern
}
}
My question is - is it alright to manually load the @ARGV array as has been done above and use the <> operator to scan in individual lines or should I open / scan / close each file individually? Will it make any difference if this processing exists in a subroutine and not in the main function?
On the topic of manipulating @ARGV - that's definitely working code; Perl certainly allows you to do that. I don't think it's a good coding habit though. Most of the code I've seen that uses the "while (<>)" idiom is using it to read from standard input, and that's what I initially expect your code to do. A more readable pattern might be to open/close each input file individually:
foreach my $file (@files) {
open FILE, "<$file" or die "Error opening file $file ($!)";
my @lines = <FILE>;
close FILE or die $!;
foreach my $line (@lines) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
}
That would read more easily to me, although it is a few more lines of code. Perl allows you a lot of flexibility, but I think that makes it that much more important to develop your own style in Perl that's readable and understandable to you (and your co-workers, if that's important for your code/career).
Putting subroutines in the main function or in a subroutine is also mostly a stylistic decision that you should play around with and think about. Modern computers are so fast at this stuff that style and readability is much more important for scripts like this, as you're not likely to encounter situations in which such a script over-taxes your hardware.
Good luck! Perl is fun. :)
Edit: It's of course true that if he had a very large file, he should do something smarter than slurping the entire file into an array. In that case, something like this would definitely be better:
while ( my $line = <FILE> ) {
if ( $line =~ /$pattern/ ) {
# do something here!
}
}
The point when I wrote "you're not likely to encounter situations in which such a script over-taxes your hardware" was meant to cover that, sorry for not being more specific. Besides, who even has 4GB hard drives, let alone 4GB files? :P
Another Edit: After perusing the Internet on the advice of commenters, I've realized that there are hard drives that are much larger than 4GB available for purchase. I thank the commenters for pointing this out, and promise in the future to never-ever-ever try to write a sarcastic comment on the internet.
I would prefer this more explicit and readable version:
#!/usr/bin/perl -w
foreach my $file (<$ARGV[0]/*>){
open(F, $file) or die "$!: $file";
while(<F>){
# search for pattern
}
close F;
}
But it is also okay to manipulate #ARGV:
#!/usr/bin/perl -w
@ARGV = <$ARGV[0]/*>;
while(<>){
# search for pattern
}
Yes, it is OK to adjust the argument list before you start the 'while (<>)' loop; it would be more nearly foolhardy to adjust it while inside the loop. If you process option arguments, for instance, you typically remove items from @ARGV; here, you are adding items, but it still changes the original value of @ARGV.
It makes no odds whether the code is in a subroutine or in the 'main function'.
The previous answers cover your main Perl-programming question rather well.
So let me comment on the underlying question: How to find a pattern in a bunch of files.
Depending on the OS it might make sense to call a specialised external program, say
grep -l <pattern> <path>
on unix.
Depending on what you need to do with the files containing the pattern, and how big the hit/miss ratio is, this might save quite a bit of time (and re-uses proven code).
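For example, here is a rough sketch of delegating the search to grep and only handling the matching files in Perl; it assumes a Unix-like system, and the pattern and directory are placeholders:
use strict;
use warnings;
my $pattern = 'some pattern';   # placeholder
my $dir     = 'targetfolder';   # placeholder
# -r: recurse into the directory, -l: list matching file names only.
# Note: the pattern is passed through the shell, so escape metacharacters carefully.
chomp( my @matching_files = qx(grep -rl -e "$pattern" "$dir") );
for my $file (@matching_files) {
    print "pattern found in $file\n";
    # ... open $file here only if you need to do more with it ...
}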
The big issue with tweaking @ARGV is that it is a global variable. Also, you should be aware that while (<>) has special magic attributes (it reads each file in @ARGV, or processes STDIN if @ARGV is empty, and it tests for definedness rather than truth). To reduce the magic that needs to be understood, I would avoid it, except for quickie hack jobs.
You can get the filename of the current file by checking $ARGV.
You may not realize it, but you are actually affecting two global variables, not just @ARGV. You are also hitting $_. It is a very, very good idea to localize $_ as well.
You can reduce the impact of munging globals by using local to localize the changes.
BTW, there is another important, subtle bit of magic with <>. Say you want to return the line number of the match in the file. You might think, OK, check perlvar and find that $. gives the line number in the last handle accessed - great. But there is an issue lurking here: $. is not reset between @ARGV files. This is great if you want to know how many lines total you have processed, but not if you want a line number for the current file. Fortunately there is a simple trick with eof that will solve this problem.
use strict;
use warnings;
...
searchDir( 'foo' );
sub searchDir {
my $dirN = shift;
my $pattern = shift;
local $_;
my @fileList = grep { -f $_ } glob("$dirN/*");
return unless @fileList; # Don't want to process STDIN.
local @ARGV;
@ARGV = @fileList;
while(<>) {
my $found = 0;
## Search for pattern
if ( $found ) {
print "Match at $. in $ARGV\n";
}
}
continue {
# reset line numbering after each file.
close ARGV if eof; # don't use eof().
}
}
WARNING: I just modified your code in my browser. I have not run it, so it may have typos and probably won't work without a bit of tweaking.
Update: The reason to use local instead of my is that they do very different things. my creates a new lexical variable that is only visible in the contained block and cannot be accessed through the symbol table. local saves the existing package variable and aliases it to a new variable. The new localized version is visible in any subsequent code, until we leave the enclosing block. See perlsub: Temporary Values Via local().
In the general case of making new variables and using them, my is the correct choice. local is appropriate when you are working with globals, but you want to make sure you don't propagate your changes to the rest of the program.
This short script demonstrates local:
$foo = 'foo';
print_foo();
print_bar();
print_foo();
sub print_bar {
local $foo;
$foo = 'bar';
print_foo();
}
sub print_foo {
print "Foo: $foo\n";
}