issues for a code snippet to handle the input file - perl

I am studying a Perl program that includes the following segment for handling an input file. I do not understand what s/^\s+//; is used for. Also, what do '|' and '||' stand for in open(FILE, "cat $fileName |") || die "could not open file";?
open(FILE, "cat $fileName |") || die "could not open file";
while (<FILE>)
{
s/^\s+//;
my @line = split;
if ($line[0]!~ /\:/) {$mark=0}
my $var = $line[$mark];
## some other code
}

You can read the documentation for the various functions in perlfunc.
This code opens a file for reading in a rather roundabout way: it pipes the output of cat into the program instead of simply opening the file. The trailing | tells open to run the shell command cat $fileName and connect its output to the file handle, so reading from FILE reads whatever cat prints.
|| is simply or. Open the pipe, and if that fails, the program dies.
while(<FILE>) will read through every line of the input and assign each line to $_. That line is then used implicitly in the substitution and split below. I.e. s/^\s+// is equal to $_ =~ s/^\s+//, and split is equal to split(' ', $_).
s/^\s+//
Will remove leading whitespace. The split will split each line on whitespace, and the elements are stored in the array @line.
Because split is called implicitly on whitespace, stripping the leading whitespace with s/^\s+// is not really needed; split skips it automatically.
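A minimal sketch of that point, using a made-up sample line rather than anything from the original input file: splitting on ' ' (the default) gives the same fields whether or not the leading whitespace is stripped first.

```perl
use strict;
use warnings;

my $raw = "   alpha: beta gamma\n";   # hypothetical input line

# split ' ' (what a bare split uses) skips leading whitespace,
# so there is no empty first field
my @implicit = split ' ', $raw;

# stripping first and then splitting yields the same list
(my $stripped = $raw) =~ s/^\s+//;
my @explicit = split ' ', $stripped;

print scalar(@implicit), " fields: @implicit\n";
print "same result\n" if "@implicit" eq "@explicit";
```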
If the first element does not contain a colon :, $mark is set to 0. Otherwise, it is not set, and will presumably use the value from the previous iteration, since it is not defined inside the loop. Finally, $var is initialized as element number $mark, which is either 0 or whatever.
ETA: A rather insidious oops: if $mark is undefined (the first element contains a colon and $mark was never set earlier), then $var will still be assigned $line[0], since undef is converted to 0 when used as an array index, with a warning. If use warnings is not in effect, this error is silent, and therefore insidious.
This code seems to be written by someone who does not know too much about perl, and it might not be very safe to use.

The substitution trims leading whitespace that appears at the beginning of the line (^), leaving any non-whitespace characters as the first.
The || operator in open... || die ... is a high-precedence or. If open fails, die executes.
open(FILE, "cat $fileName |") is a waste of an external process. To read a file for input, simply do:
open FILE, '<', $filename or die qq{Could not open "$filename" for reading: $!};
The parentheses for the open call are optional because or does not bind tightly.
It is also better to use lexical file handles:
open my $fh, '<', $filename or die qq{Could not open "$filename" for reading: $!};
This file handle is assigned to a lexical variable that lives only within the scope it is declared. Once the program flow exits this scope, the file closes automatically.
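A small sketch of the scoping behaviour, assuming a scratch file name (lexical_fh_demo.txt) invented for this example: the handle is closed when its block ends, so the data is already on disk when the file is reopened.

```perl
use strict;
use warnings;

my $path = "lexical_fh_demo.txt";   # hypothetical scratch file

{
    open my $out, '>', $path or die qq{Could not open "$path" for writing: $!};
    print {$out} "first line\n";
}   # $out goes out of scope here, so the file is flushed and closed

open my $in, '<', $path or die qq{Could not open "$path" for reading: $!};
my $line = <$in>;
close $in;
unlink $path;                       # clean up the scratch file

print $line;
```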

Part of the confusion is that the developer is using the default variable, $_. Many Perl functions (I would say about 1/3 of them) act upon $_ when you don't pass them a variable. For example, these two statements are equivalent:
my $uppercase_name = uc($_);
my $uppercase_name = uc;
In both cases, the uc function will return the string in the $_ variable converted to upper case. In fact, even the print statement uses the $_ variable. Again, these are both the same:
print $_;
print;
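As a small illustration (the names below are made up for the example), map sets $_ to each element in turn, so uc with no argument and uc($_) produce the same list:

```perl
use strict;
use warnings;

my @names = qw(ann bob);               # sample data, not from the original program

my @implicit = map { uc } @names;      # uc works on $_ when given no argument
my @explicit = map { uc($_) } @names;  # the same thing, spelled out

print "@implicit\n";
print "@explicit\n";
```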
It's frowned upon to use the default variable in newer Perl scripts because it doesn't add clarity to the program and it doesn't make the program faster. I've rewritten the same code snippet you used in order to show the missing $_ variable. It might make the code easier to understand:
open(FILE, "cat $fileName |") || die "could not open file";
while ($_ = <FILE>)
{
$_ =~ s/^\s+//;
my @line = split ' ', $_;
if ($line[0] !~ /\:/) {
$mark = 0;
}
my $var = $line[$mark];
## some other code
}
Notice that the while statement is putting the value of the line read into the $_ variable and that the substitute command (the s/^\s+//) is also operating on the $_ variable. I hope that clarifies the code a bit for you.
Now for your questions:
[W]hat do '|' and '||' stand for?
The || means or as in do this or that. In practice, the or can be thought of as an if statement:
if (not open(FILE, "cat $fileName |")) {
die "could not open file";
}
That is, if the open statement failed, then execute the die statement. If the open statement did manage to open the file, then don't execute the die statement.
In Perl, you now see or instead of || in cases like this:
open(FILE, "cat $fileName |") or die "could not open file";
which makes the meaning a bit more obvious: Open the file, or kill the program.
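One practical difference between || and or is precedence. The sketch below assumes a file name (no_such_file_here.txt) that does not exist, and uses eval only to observe whether die ran; it is an illustration, not part of the original program.

```perl
use strict;
use warnings;

my $missing = "no_such_file_here.txt";   # hypothetical name, assumed not to exist

# Parentheses around the arguments: || tests the return value of open.
my $ok_paren = eval { open(my $fh, '<', $missing) || die "open failed\n"; 1 };

# No parentheses, low-precedence or: still tests the return value of open.
my $ok_or = eval { open my $fh, '<', $missing or die "open failed\n"; 1 };

# No parentheses, high-precedence ||: it binds to $missing instead.
# The name is a true string, so die can never run and the failure goes unnoticed.
my $ok_bad = eval { open my $fh, '<', $missing || die "open failed\n"; 1 };

printf "open(...) || die : %s\n", $ok_paren ? "continued" : "died";
printf "open ... or die  : %s\n", $ok_or    ? "continued" : "died";
printf "open ... || die  : %s\n", $ok_bad   ? "continued" : "died";
```

This is why the parentheses matter in the original open(FILE, ...) || die line, and why or is the safer choice when they are left out.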
The single pipe (|) at the end of the command string means: execute the command in the open statement (cat $fileName) and read from the output of this command. Imagine something like this:
open (COMMAND, "java -jar foo.war|") or die "Can't execute 'java -jar foo.war'";
Now, I'm running the command java -jar foo.war and using its output in my Perl script.
You can do this the other way around too:
open (MAIL, "|mail $recipient") or die "Can't mail $recipient";
print MAIL "Dear $recipient\n\n";
print MAIL "I hope everything is well.\n";
print MAIL "Sincerely,\n\nDavid";
close MAIL;
I'm now opening the command mail $recipient and writing to it with the print statements. In this case, I'm emailing $recipient with a simple message.
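The reading direction can also be written with the list form of open, where the mode '-|' means "read from this command" (and '|-' is the writing counterpart). The sketch below is only an illustration: it runs the current perl binary ($^X) as a stand-in child command so that it does not depend on cat, java or mail being installed.

```perl
use strict;
use warnings;

# $^X is the path of the running perl; the one-liner is a made-up child command
open my $from_cmd, '-|', $^X, '-e', 'print "hello from child\n"'
    or die "Can't start child: $!";

my $line = <$from_cmd>;               # read the child's output like a file
close $from_cmd or die "Child command failed: $! (status $?)";

print $line;
```

With the list form the command and its arguments are passed separately, so no shell is involved in interpreting them.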
I do not understand what is s/^\s+//; used for?
In the original program, it was on a line by itself:
s/^\s+//;
I've added the missing variable which should help clarify it a bit:
$_ =~ s/^\s+//;
This is the substitution operator in Perl. It takes the $_ variable and replaces whatever matches the regular expression ^\s+ with nothing. If you don't understand what regular expressions are, you should take a look at the Perldoc tutorial on the subject. Basically, this removes all spaces, tabs, and other forms of white space from the beginning of the line.

Related

Replace multiple lines in text file

I have text files containing the text below (amongst other text)
DIFF_COEFF= 1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,
1.000e+07,1.000e+07,1.000e+07,1.000e+07,1.000e+07,4.000e+05,
and I need to replace it with the following text:
DIFF_COEFF= 2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,
2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,8.000e+05,
Each line above corresponds to a new line in the text file.
After some googling, I thought making use of Perl in the following might work, but it did not. I got the error message
Illegal division by zero at -e line 1, <> chunk 1
s_orig='DIFF_COEFF=*4.000e+05,'
s_new='DIFF_COEFF= 2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,\n2.000e+07,2.000e+07,2.000e+07,2.000e+07,2.000e+07,8.000e+05,'
perl -0 -i -pe "s:\Q${s_orig}\E:${s_new}:/igs" file.txt
Does anyone here know the right way to do this?
Edit - some more details: the text after this block is "DIFF_COEFF_Q=" followed by the same set of numbers, so I need to search for and replace the specific lines shown. The text files are not very large in size.
Copy the file to a new one, but when you reach the range of text between these markers, skip the original lines and write the replacement text instead. Then move the new file over the original if you want to replace it, as the attempted perl -0 -i in the question suggests.
Note that when changing a file we have to build new content and then replace the file. There are a few ways to do this and modules that make it easier, shown further below.
The code below uses the range operator and the fact that it returns the counter for lines within the range, 1 for the first and the number ending with E0 for the last. So we don't copy lines inside that region while we write the replacement text (and the post-region-end marker) on the last line.
I consider the region of interest to end right before DIFF_COEFF_Q= line, per the question edit.
use warnings;
use strict;
use feature 'say';
use File::Copy 'move';
my $replacement = "replacement text";
my $file = 'input.txt';
my $out_file = 'new_' . $file;
open my $fh_out, '>', $out_file or die "Can't open $out_file: $!";
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>)
{
if (my $range_cnt = /^\s*DIFF_COEFF\s*=/ .. /^\s*DIFF_COEFF_Q\s*=/) #/
{
if ($range_cnt =~ /E0$/)
{
print $fh_out $replacement; # may need a newline
print $fh_out $_;
}
}
else {
print $fh_out $_;
}
}
close $fh or die "Can't close $file: $!"; # don't overwrite original
close $fh_out or die "Can't close $out_file: $!"; # if there are problems
#move $out_file, $file or die "Can't move $out_file to $file: $!";
Uncomment the move line once this has been tested well enough on your actual files, if you want to replace the original. You may or may not need a newline after $replacement, depending on whether the replacement text already ends with one.
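To see the counter that the range operator returns, here is a minimal sketch that runs the same pair of patterns over a few made-up lines held in an array instead of a file:

```perl
use strict;
use warnings;

# Made-up sample lines standing in for the real file
my @lines = (
    "OTHER= 0,\n",
    "DIFF_COEFF= 1.000e+07,\n",
    "1.000e+07,4.000e+05,\n",
    "DIFF_COEFF_Q= 1.000e+07,\n",
    "TAIL= 0,\n",
);

my @counters;
for (@lines) {
    # In scalar context .. is a flip-flop: it returns the empty string outside
    # the range and 1, 2, ... inside it, with "E0" appended on the closing line
    my $cnt = /^\s*DIFF_COEFF\s*=/ .. /^\s*DIFF_COEFF_Q\s*=/;
    push @counters, $cnt eq '' ? '-' : $cnt;
}

print "@counters\n";
```

The lines marked 1 and 2 are the ones that get dropped; the line whose counter ends in E0 is the DIFF_COEFF_Q= line, which is where the replacement is written.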
An alternative is to use flags for entering and leaving that range. But this won't be cleaner, since there are two distinct actions: stop copying when entering the range, and write the replacement when leaving it. Thus multiple flags need to be set and checked, which may end up messier.
If the files can't ever be huge, it is simpler to read and process the file in memory, then open the same file for writing and dump the new content:
my $text = do { # slurp file into a scalar
local $/;
open my $fh, '<', $file or die "Can't open $file: $!";
<$fh>
};
$text =~ s/^\s*DIFF_COEFF\s*=.*?(\n\s*DIFF_COEFF_Q)/$replacement$1/ms;
# Change $out_file to $file to overwrite
open my $fh_out, '>', $out_file or die "Can't open $out_file: $!";
print $fh_out $text;
Here the /m modifier enables multiline mode, in which ^ matches at the beginning of each line (not only at the start of the whole string), which is helpful here. The /s modifier makes . match a newline, too. Also note that we can slurp a file with Path::Tiny as simply as: my $text = path($file)->slurp;
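The effect of the two modifiers can be checked on a short in-memory string. The text and replacement below are shortened stand-ins for the real data, not the values from the question:

```perl
use strict;
use warnings;

# Shortened, made-up stand-in for the file content
my $text = "A= 1,\nDIFF_COEFF= 1,\n1,\nDIFF_COEFF_Q= 1,\n";
my $replacement = "DIFF_COEFF= 2,\n2,";

# /m lets ^ match at the start of the second line;
# /s lets .*? run across the newline inside the block
(my $new = $text) =~ s/^\s*DIFF_COEFF\s*=.*?(\n\s*DIFF_COEFF_Q)/$replacement$1/ms;

print $new;
```

The captured group puts the newline and the DIFF_COEFF_Q marker back, so only the block before it changes.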
Another option is to use Path::Tiny, which in newer versions has edit and edit_lines methods
use Path::Tiny;
# NOTE: edits $file in place (changes it)
path($file)->edit(
sub { s/DIFF_COEFF=.*?(\n\s*DIFF_COEFF_Q)/$replacement$1/s }
);
For more on this see, for example, this post and this post and this post.
The first and last way change the inode number of the file. See this post if that is a problem.
It's an interesting error that you've made and I can see what has led you to make it. But I don't think I've ever seen anyone else make the same mistake :-)
Your substitution statement is this:
s:\Q${s_orig}\E:${s_new}:/igs
So you've decided to use : as the delimiter of the substitution operator. But you want to use the options i, g and s and everywhere you've seen people talk about options on a substitution operator, they talk about using / to introduce the options. So you've added /igs to your substitution operator.
But what you've missed (and I completely understand why) is that the / that comes before the options is actually the closing delimiter of the standard, s/.../.../, version of the substitution operator. If you change the delimiter (as you have done) then your altered closing delimiter is all you need.
In your case, Perl doesn't expect the / as it has already seen the closing delimiter. It, therefore, decides that the / is a division operator and tries to divide the result of your substitution by igs. It interprets igs as zero and you get your error.
The fix is to remove that / so:
s:\Q${s_orig}\E:${s_new}:/igs
becomes:
s:\Q${s_orig}\E:${s_new}:igs
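A quick way to convince yourself, using a made-up string: the flags go directly after the closing delimiter, whichever delimiter is chosen.

```perl
use strict;
use warnings;

my $text = "Foo foo";                          # sample input

(my $with_colons  = $text) =~ s:foo:bar:ig;    # ':' as delimiter, flags right after it
(my $with_slashes = $text) =~ s/foo/bar/ig;    # the usual '/' delimiter

print "$with_colons\n";
print "$with_slashes\n";
```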

system grep inside a perl script gives unbalanced parenthesis

system ("grep -E 'Type|group|slack (' $a > temp.rpt");
The above line is giving me an error Unmatched ( or \(
What is wrong here?
I have tried a backslash before ( too. it shows the same error.
Since you are in a script, why not do that in Perl?
my $infile = '...';
my $outfile = 'temp.rpt';
open my $fh, '<', $infile or die "Can't open $infile: $!";
open my $out_fh, '>', $outfile or die "Can't open $outfile: $!";
while (<$fh>) {
print $out_fh $_ if /Type|group|slack \(/;
}
Adjust your regex as needed. Generally, it is far easier to change and tweak things now.
If the input file isn't too large, you can also process it in one line once the files are open:
print $out_fh grep { /Type|group|slack \(/ } <$fh>;
The grep imposes the list context on the <> operator so it reads and at once returns all lines in a list, and the ones that pass the condition are printed.
A comment on the regex. As it stands, it matches either Type or group or slack (. If, by any chance, you intend to match any one of the words followed by space-paren, you need grouping parentheses, /(?:Type|group|slack) \(/. The ?: is there so the group is not needlessly captured.
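The difference between the two patterns shows up on a few made-up sample strings:

```perl
use strict;
use warnings;

my @samples = ("Type A", "slack (ns)", "group (x)");   # hypothetical report lines
my @report;

for my $s (@samples) {
    # alternation without grouping: "Type" or "group" alone is enough
    my $loose = $s =~ /Type|group|slack \(/ ? 1 : 0;
    # grouped alternation: any of the words must be followed by " ("
    my $tight = $s =~ /(?:Type|group|slack) \(/ ? 1 : 0;
    push @report, "$s: loose=$loose tight=$tight";
}

print "$_\n" for @report;
```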
You need to use three backslashes
system ("grep -E 'Type|group|slack \\\(' $a ");
( is a regex metacharacter in grep -E. That's why you get the error.
Adding a single backslash doesn't fix it because that's processed by perl: "\(" is the same string as "(".
To fix it, you need to either use two backslashes ("\\(", which turns into the two character string \(, which is then interpreted by grep), or remove the -E option because ( isn't special in POSIX "basic" regexes (which is what grep uses by default).
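Printing the strings shows what Perl actually hands to the shell. This is only a sketch of the quoting rules; it does not run grep.

```perl
use strict;
use warnings;
no warnings 'misc';     # silence "Unrecognized escape \( passed through"

my $one   = "\(";       # Perl drops the backslash: the string is (
my $two   = "\\(";      # \\ becomes \, so the string is \(
my $three = "\\\(";     # \\ becomes \, then \( becomes (, so again \(

print "$one $two $three\n";
```

So the two-backslash and three-backslash spellings produce the same two-character string, which is why both suggestions work.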

perl + print to file_out in place to standard output

I have the following script
#!/usr/bin/perl
open IN, "/tmp/file";
s/(.*)=/$k{$1}++;"$1$k{$1}="/e and print while <IN>;
How can I print the output of the script to file_out instead of printing to standard output?
lidia
#!/usr/bin/perl
open IN, "/tmp/file";
open OUT, ">file_out.txt";
s/(.*)=/$k{$1}++;"$1$k{$1}="/e and print OUT while <IN>;
Explanation:
open IN, "/tmp/file"
open: command to open a file
IN: filehandle name
"/tmp/file": name of the file; with no mode specifier it is opened for reading
a leading <, i.e. "</tmp/file", also means reading
open OUT, ">file_out.txt"
open: command to open a file
OUT: filehandle name
">file_out.txt": name of the file plus a specifier that it is opened for writing
there must be a >, i.e. ">file_out.txt", to write
s/.../.../e your substitution (I assume you know what it does)
and is a boolean operator that short-circuits, meaning it only does the thing afterwards if the thing beforehand is true. In this case, it will only print if the substitution actually matched something.
print OUT print to the filehandle OUT
while <IN> for each line from the file behind filehandle IN
Note:
Used this way, it makes extensive use of the magical default variable $_. Do a search for $_ on the perlintro site. In short:
If you don't tell a s/// substitution what string to work on, it uses $_
If you don't tell a print what to print, it prints $_
If you don't tell a while loop going through a filehandle's data where to put each line, it gets put into $_
Your program could have been rewritten:
#!/usr/bin/perl
open IN, "/tmp/file";
open OUT, ">file_out.txt";
while( defined( $line = <IN> ) )
{
$line =~ s/(.*)=/$k{$1}++;"$1$k{$1}="/e or next;
print OUT $line;
}
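To see what the substitution itself produces, here is a sketch that runs the same s///e over a few made-up lines in memory instead of /tmp/file:

```perl
use strict;
use warnings;

my %k;                                                  # per-key counters
my @in = ("a=1\n", "a=2\n", "b=3\n", "no separator\n"); # hypothetical input
my @out;

for my $line (@in) {
    # /e evaluates the replacement as code: bump the counter for the key,
    # then build "key" . count . "=" as the new text
    $line =~ s/(.*)=/$k{$1}++; "$1$k{$1}="/e or next;   # skip lines without '='
    push @out, $line;
}

print @out;
```

Each key gets its own running number, and lines without an = are skipped, just as the and/or short-circuit does in the one-line version.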
Simply add the filehandle you are printing to after the print statement; opening for writing is a small change from opening for reading:
#!/usr/bin/perl -w
open IN, "/tmp/file";
open OUT, '>', "/tmp/file_out";
s/(.*)=/Sk_$1_++;"$1Sk_$1_="/ and print OUT while <IN>;
(I munged the replacement a bit, so it was easier for me to test.)

How can I create a new file using a variable value as the name in Perl?

Eg:
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable."."."txt";
open output,'>$file' or die "Can't open the output file!";
}
This doesn't work. Please suggest a new way.
Everyone here has it right, you are using single quotes in your call to open. Single quotes do not interpolate variables into the quoted string. Double quotes do.
my $foo = 'cat';
print 'Why does the dog chase the $foo?'; # prints: Why does the dog chase the $foo?
print "Why does the dog chase the $foo?"; # prints: Why does the dog chase the cat?
So far, so good. But, the others have neglected to give you some important advice about open.
The open function has evolved over the years, as has the way that Perl works with filehandles. In the old days, open was always called with the mode and the file name combined in the second argument. The first argument was always a global filehandle.
Experience showed that this was a bad idea. Combining the mode and the filename in one argument created security problems. Using global variables, well, is using global variables.
Since Perl 5.6.0 you can use a 3 argument form of open that is much more secure, and you can store your filehandle in a lexically scoped scalar.
open my $fh, '>', $file or die "Can't open $file - $!\n";
print $fh "Goes into the file\n";
There are many nice things about lexical filehandles, but one excellent property is that they are automatically closed when their refcount drops to 0 and they are destroyed. There is no need to explicitly close them.
Something else worth noting is that it is considered by most of the Perl community that it is a good idea to always use the strict and warnings pragmas. Using them helps catch many bugs early in the development process and can be a huge time saver.
use strict;
use warnings;
for my $base ( 10_001..10_003 ) {
my $file = "$base.txt";
print "file: $file\n";
open my $fh,'>', $file or die "Can't open the output file: $!";
# Do stuff with handle.
}
I simplified your code a bit too. I used the range operator to generate your base numbers for the file names. Since we are working with numbers and not strings, I was able to use the _, as the thousands separator to improve readability without impacting the final result. Finally, I used an idiomatic perl for loop instead of the C style for you had.
I hope you find this helpful.
use double quotes: ">$file". single quotes will not interpolate your variable.
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable."."."txt";
print "file: $file\n";
open $output,">$file" or die "Can't open the output file!";
close($output);
}
The problem is that you're using single quotes for the second argument to open, and single-quoted strings do not interpolate variables mentioned in them. Perl interpreted your code as though you wanted to open a file that really had a dollar sign for the first character of its name. (Check your disk; you should see an empty file named $file there.)
You can avoid the issue by using the three-argument version of open:
open output, '>', $file
Then the file-name argument can't accidentally interfere with the open-mode argument, and there's no unnecessary variable interpolation or concatenation.
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable . '.txt';
open output, ">$file" or die "Can't open the output file!";
}
This works once the open string is double-quoted; it creates
10001.txt
10002.txt and so on.
Use a file handle:
my $file = "whatevernameyouwant";
open (MYFILE, ">>$file");
print MYFILE "Bob\n";
close (MYFILE);
print '$file' yields $file, whereas print "$file" yields whatevernameyouwant.
You almost have it right, but there are a couple of issues.
1 - You need to use double quotes around the file you're opening.
open output,">$file" or die[...]
2 - Minor niggles, you don't close the files afterwards.
I'd rewrite your code something like this:
#!/usr/bin/perl
$variable = "10000";
for($i=0; $i<3;$i++) {
$variable++;
$file = $variable."."."txt";
open output,">$file" or die "Can't open the output file!";
close output;
}

How do I get a filehandle from the command line?

I have a subroutine that takes a filehandle as an argument. How do I make a filehandle from a file path specified on the command line? I don't want to do any processing of this file myself, I just want to pass it off to this other subroutine, which returns an array of hashes with all the parsed data from the file.
Here's what the command line input I'm using looks like:
$ ./getfile.pl /path/to/some/file.csv
Here's what the beginning of the subroutine I'm calling looks like:
sub parse {
my $handle = shift;
my @data = <$handle>;
while (my $line = shift(@data)) {
# do stuff
}
}
Command line arguments are available in the predefined @ARGV array. You can get the file name from there and use open to open a filehandle to it. Assuming that you want read-only access to the file, you would do it this way:
my $file = shift @ARGV;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
parse($fh);
Note that the or die... checks whether the call to open succeeded and dies with an error message if it did not. The built-in variable $! contains the (OS-dependent) error message on failure, which tells you why the call was unsuccessful, e.g. "Permission denied."
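The hand-off can be tried without a real file. In the sketch below the parse body and the CSV text are invented for illustration, and an in-memory filehandle stands in for the file named on the command line:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real parse(): one array ref of fields per line
sub parse {
    my $handle = shift;
    my @rows;
    while (my $line = <$handle>) {
        chomp $line;
        push @rows, [ split /,/, $line ];
    }
    return @rows;
}

my $csv = "name,qty\nbolt,4\n";                 # made-up file content
open my $fh, '<', \$csv or die "Can't open in-memory file: $!";
my @rows = parse($fh);                          # the sub only ever sees a handle
close $fh;

print scalar(@rows), " rows, last row: @{ $rows[-1] }\n";
```

The subroutine does not care where the handle came from, so the same call works with a handle opened on shift @ARGV.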
parse(*ARGV) is the simplest solution: the explanation is a bit long, but an important part of learning how to use Perl effectively is to learn Perl.
When you use a null filehandle (<>), it actually reads from the magical ARGV filehandle, which has special semantics: it reads from all the files named in @ARGV, or STDIN if @ARGV is empty.
From perldoc perlop:
The null filehandle <> is special: it can be used to emulate the
behavior of sed and awk. Input from <> comes either from standard
input, or from each file listed on the command line. Here’s how it
works: the first time <> is evaluated, the @ARGV array is checked, and
if it is empty, $ARGV[0] is set to "-", which when opened gives you
standard input. The @ARGV array is then processed as a list of
filenames. The loop
while (<>) {
... # code for each line
}
is equivalent to the following Perl-like pseudo code:
unshift(@ARGV, '-') unless @ARGV;
while ($ARGV = shift) {
open(ARGV, $ARGV);
while (<ARGV>) {
... # code for each line
}
}
except that it isn’t so cumbersome to say, and will actually work. It
really does shift the @ARGV array and put the current filename into the
$ARGV variable. It also uses filehandle ARGV internally--<> is just a
synonym for <ARGV>, which is magical. (The pseudo code above doesn’t
work because it treats <ARGV> as non-magical.)
You don't have to use <> in a while loop -- my $data = <> will read one line from the first non-empty file, my @data = <>; will slurp it all up at once, and you can pass *ARGV around as if it were a normal filehandle.
This is what the -n switch is for!
Take your parse method, and do this:
#!/usr/bin/perl -n
#do stuff
Each line is stored in $_. So you run
./getfile.pl /path/to.csv
And it does this.
See here and here for some more info about these. I like -p too, and have found the combo of -a and -F to be really useful.
Also, if you want to do some extra processing, add BEGIN and END blocks.
#!/usr/bin/perl -n
BEGIN {
$accumulator = 0; # package variable, so the loop body and END can see it
}
# do stuff
END {
print process_total($accumulator);
}
or whatever. This is very, very useful.
Am I missing something or are you just looking for the open() call?
open($fh, "<$ARGV[0]") or die "couldn't open $ARGV[0]: $!";
do_something_with_fh($fh);
close($fh);