Multiple text parsing and writing using the while statement, the diamond operator <> and the $ARGV variable in Perl

I have some text files inside a directory, and I want to parse their content and write it to a file. So far the code I am using is this:
#!/usr/bin/perl
# The while loop repeats the execution of a block as long as a certain condition is evaluated true
use strict;   # Always!
use warnings; # Always!

my $header = 1; # Flag to tell us to print the header
while (<*.txt>) { # read a line from a file
    if ($header) {
        # This is the first line, print the name of the file
        print "========= $ARGV ========\n";
        # reset the flag to a false value
        $header = undef;
    }
    # Print out what we just read in
    print;
}
continue { # This happens before the next iteration of the loop
    # Check if we finished the previous file
    $header = 1 if eof;
}
When I run this script I am only getting the headers of the files, plus a compiled.txt entry.
I also receive the following message in cmd: Use of uninitialized value $ARGV in concatenation (.) or string at concat.pl line 12
So I guess I am doing something wrong and $ARGV isn't set at all. Plus, instead of $header I should use something else in order to retrieve the text.
Need some assistance!

<*.txt> does not read a line from a file, even if you say so in a comment. It runs
glob '*.txt'
i.e. the while loop iterates over the file names, not over their contents. Use an empty <> to iterate over the lines of all the files named in @ARGV.
BTW, instead of $header = undef, you can use undef $header.
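Putting those points together, a minimal sketch of the corrected script (assuming the .txt files come from the current directory and the output is redirected to compiled.txt, as the question suggests):

#!/usr/bin/perl
use strict;
use warnings;

# Put the glob results into @ARGV so that an empty <> reads the
# files themselves, line by line, and sets $ARGV for us.
@ARGV = grep { $_ ne 'compiled.txt' } glob '*.txt';  # skip the output file

my $header = 1;
while (<>) {
    if ($header) {
        print "========= $ARGV ========\n";
        undef $header;
    }
    print;
}
continue {
    $header = 1 if eof;  # the next read starts a new file
}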

As I understand it, you want to print a header with the filename just before each file's first line, and concatenate them all into a new file. A one-liner could be enough for the task.
It checks for the first line with the $. variable and closes the filehandle to reset that counter between input files:
perl -pe 'printf qq|=== %s ===\n|, $ARGV if $. == 1; close ARGV if eof' *.txt
An example run on my machine yields:
=== file1.txt ===
one
=== file2.txt ===
one
two
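If you prefer a standalone script, the -p one-liner expands to roughly the following (a sketch; -p wraps the body in a while loop with an implicit print):

#!/usr/bin/perl
use strict;
use warnings;

# Run as: perl concat.pl *.txt > compiled.txt
while (<>) {
    printf "=== %s ===\n", $ARGV if $. == 1;
    print;                # what -p does implicitly
    close ARGV if eof;    # reset $. for the next file
}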

Related

Can one concatenate two Perl scripts which use different input record separators?

Two Perl scripts, using different input record separators, work together to convert a LaTeX file into something easily searched for human-readable phrases and sentences. Of course, they could be wrapped together by a single shell script. But I am curious whether they can be incorporated into a single Perl script.
The reason for these scripts: It would be a hassle to find "two three" inside short.tex, for instance. But after conversion, grep 'two three' will return the first paragraph.
For any LaTeX file (here, short.tex), the scripts are invoked as follows.
cat short.tex | try1.pl | try2.pl
try1.pl works on paragraphs. It gets rid of LaTeX comments. It makes sure that each word is separated from its neighbors by a single space, so that no sneaky tabs, form feeds, etc., lurk between words. The resulting paragraph occupies a single line, consisting of visible characters separated by single spaces --- and at the end, a sequence of at least two newlines.
try2.pl slurps the entire file. It makes sure that paragraphs are separated from each other by exactly two newlines. And it ensures that the last line of the file is non-trivial, containing visible character(s).
Can one elegantly concatenate two operations such as these, which depend on different input record separators, into a single Perl script, say big.pl? For instance, could the work of try1.pl and try2.pl be accomplished by two functions or bracketed segments inside the larger script?
Incidentally, is there a Stack Overflow keyword for "input record separator"?
###File try1.pl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = ""; # input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
while (<>) {
    s/[\x25].*\n/ /g;  # remove all LaTeX comments. They start with %
    s/[\t\f\r ]+/ /g;  # collapse each "run" of whitespace to one single space
    s/^\s*\n/\n/g;     # any line that looks blank is converted to a pure newline
    s/(.)\n/$1/g;      # any line that does not look blank is joined to the subsequent line
    print;
    print "\n\n";      # make sure each paragraph is separated from its fellows by newlines
}
###File try2.pl:
#!/usr/bin/perl
use strict;
use warnings;
use 5.18.2;
local $/ = undef; # input record separator: entire text or file is a single record.
while (<>) {
    s/[\n][\n]+/\n\n/g; # exactly 2 newlines (one blank line) separate paragraphs. Like cat -s
    s/[\n]+$/\n/;       # last line is nontrivial; no blank line at the end
    print;
}
###File short.tex:
\paragraph{One}
% comment
two % also 2
three % or 3
% comment
% comment
% comment
% comment
% comment
% comment
So they said%
that they had done it.
% comment
% comment
% comment
Fleas.
% comment
% comment
After conversion:
\paragraph{One} two three
So they said that they had done it.
Fleas.
To combine try1.pl and try2.pl into a single script you could try:
local $/ = "";
my #lines;
while (<>) {
[...] # Same code as in try1.pl except print statements
push #lines, $_;
}
$lines[-1] =~ s/\n+$/\n/;
print for #lines;
A pipe connects the output of one process to the input of another process. Neither one knows about the other nor cares how it operates.
But putting things together like this breaks the Unix pipeline philosophy of small tools that each excel at a very narrow job. Should you link these two things, you'll always have to do both tasks even if you want only one (although you could add configuration to turn one off, but that's a lot of work).
I process a lot of LaTeX, and I control everything through a Makefile. I don't really care about what the commands look like and I don't even have to remember what they are:
short-clean.tex: short.tex
	cat short.tex | try1.pl | try2.pl > $@
Let's do it anyways
I'll limit myself to the constraint of basic concatenation instead of complete rewriting or rearranging, mostly because there are some interesting things to show.
Consider what happens should you concatenate those two programs by simply adding the text of the second program at the end of the text of the first program.
The output from the original first program still goes to standard output and the second program now doesn't get that output as input.
The input to the program is likely exhausted by the original first program, and the second program now has nothing to read. That's fine, because otherwise it would have read the raw input meant for the first program.
There are various ways to fix this, but none of them make much sense when you already have two working programs that do their job. I'd shove that in the Makefile and forget about it.
But, suppose you do want it all in one file.
Rewrite the first section to send its output to a filehandle connected to a string. Its output is now in the program's memory. This basically uses the same interface, and you can even use select to make that the default filehandle.
Rewrite the second section to read from a filehandle connected to that string.
Alternately, you can do the same thing by writing to a temporary file in the first part, then reading that temporary file in the second part.
A much more sophisticated program would have the first part write to a pipe (inside the program) that the second part reads simultaneously. However, you have to pretty much rewrite everything so the two parts run concurrently.
Here's Program 1, which uppercases most of the letters:
#!/usr/bin/perl
use v5.26;
$|++;

while( <<>> ) { # safer line input operator
    print tr/a-z/A-Z/r;
}
and here's Program 2, which collapses whitespace:
#!/usr/bin/perl
use v5.26;
$|++;

while( <<>> ) { # safer line input operator
    print s/\s+/ /gr;
}
They work serially to get the job done:
$ perl program1.pl
The quick brown dog jumped over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D
$ perl program2.pl
The quick brown dog jumped over the lazy fox.
The quick brown dog jumped over the lazy fox.
^D
$ perl program1.pl | perl program2.pl
The quick brown dog jumped over the lazy fox.
THE QUICK BROWN DOG JUMPED OVER THE LAZY FOX.
^D
Now I want to combine those. First, I'll make some changes that don't affect the operation but will make it easier for me later. Instead of using implicit filehandles, I'll make those explicit and one level removed from the actual filehandles:
Program 1:
#!/usr/bin/perl
use v5.26;
$|++;

my $output_fh = \*STDOUT;
while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
}
Program 2:
#!/usr/bin/perl
$|++;

my $input_fh = \*STDIN;
while( <$input_fh> ) { # read lines from our filehandle
    print s/\s+/ /gr;
}
Now I have the chance to change what those filehandles are without disturbing the meat of the program. The while doesn't know or care what that filehandle is, so let's start by writing to a file in Program 1 and reading from that same file in Program 2:
Program 1:
#!/usr/bin/perl
use v5.26;
open my $output_fh, '>', 'program1.out' or die "$!";

while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;
Program 2:
#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";

while( <$input_fh> ) { # read lines from our filehandle
    print s/\h+/ /gr;
}
However, you can no longer run these in a pipeline because Program 1 doesn't use standard output and Program 2 doesn't read standard input:
% perl program1.pl
% perl program2.pl
You can, however, now join the programs, shebang and all:
#!/usr/bin/perl
use v5.26;
open my $output_fh, '>', 'program1.out' or die "$!";

while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;

#!/usr/bin/perl
$|++;
open my $input_fh, '<', 'program1.out' or die "$!";

while( <$input_fh> ) { # read lines from our filehandle
    print s/\h+/ /gr;
}
You can skip the file and use a string instead, but at this point, you've gone beyond merely concatenating files and need a little coordination for them to share the scalar with the data. Still, the meat of the program doesn't care how you made those filehandles:
#!/usr/bin/perl
use v5.26;
my $output_string;
open my $output_fh, '>', \ $output_string or die "$!";

while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;

#!/usr/bin/perl
$|++;
open my $input_fh, '<', \ $output_string or die "$!";

while( <$input_fh> ) { # read lines from our filehandle
    print s/\h+/ /gr;
}
So let's go one step further and do what the shell was already doing for us.
#!/usr/bin/perl
use v5.26;
pipe my $input_fh, my $output_fh;
$output_fh->autoflush(1);

while( <<>> ) { # safer line input operator
    print { $output_fh } tr/a-z/A-Z/r;
}
close $output_fh;

while( <$input_fh> ) { # read lines from our filehandle
    print s/\h+/ /gr;
}
From here it gets a bit tricky, and I'm not going to go to the next step with polling filehandles so one thing can write while the next thing reads. There are plenty of things that do that for you. And, you're now doing a lot of work to avoid something that was already simple and working.
Instead of all that pipe nonsense, the next step is to separate code into functions (likely in a library), and deal with those chunks of code as named things that hide their details:
use Local::Util qw(remove_comments minify);
while( <<>> ) {
    my $result = remove_comments($_);
    $result = minify( $result );
    ...
}
That can get even fancier where you simply go through a series of steps without knowing what they are or how many of them there will be. And, since all the baby steps are separate and independent, you're basically back to the pipeline notion:
use Local::Util qw(get_input remove_comments minify);
my $result;
my @steps = qw(get_input remove_comments minify);
while( ! eof() ) { # or whatever
    no strict 'refs';
    $result = &{$_}( $result ) for @steps;
}
A better way makes that an object so you can skip the soft reference:
use Local::Processor;
my @steps = qw(get_input remove_comments minify);
my $processor = Local::Processor->new( @steps );
my $result;
while( ! eof() ) { # or whatever
    $result = $processor->$_($result) for @steps;
}
Like I did before, the meat of the program doesn't care or know about the steps ahead of time. That means that you can move the sequence of steps to configuration and use the same program for any combination and sequence:
use Local::Config;
use Local::Processor;
my @steps = Local::Config->new->get_steps;
my $processor = Local::Processor->new;
my $result;
while( ! eof() ) { # or whatever
    $result = $processor->$_($result) for @steps;
}
I write quite a bit about this sort of stuff in Mastering Perl and Effective Perl Programming. But just because you can do it doesn't mean you should. This reinvents a lot that make can already do for you. I don't do this sort of thing without good reason; bash and make have to be pretty annoying to motivate me to go this far.
The motivating problem was to generate a "cleaned" version of a LaTeX file, which would be easy to search, using regex, for complex phrases or sentences.
The following single Perl script does the job, whereas previously I required one shell script and two Perl scripts, entailing three invocations of Perl. This new, single script incorporates three consecutive loops, each with a different input record separator.
First loop:
input = STDIN, or a file passed as an argument;
record separator = default, loop by line;
print result to fileafterperlLIN, a temporary file on the hard drive.
Second loop:
input = fileafterperlLIN;
record separator = "", loop by paragraph;
print result to fileafterperlPRG, a temporary file on the hard drive.
Third loop:
input = fileafterperlPRG;
record separator = undef, slurp entire file;
print result to STDOUT.
This has the disadvantage of printing to and reading from two files on the hard drive, which may slow it down. Advantages are that the operation seems to require only one process; and all the code resides in a single file, which should make it easier to maintain.
#!/usr/bin/perl
# 2019v04v05vFriv17h18m41s
use strict;
use warnings;
use 5.18.2;
my $diagnose;
my $diagnosticstring;
my $exitcode;
my $userName = $ENV{'LOGNAME'};
my $scriptpath;
my $scriptname;
my $scriptdirectory;
my $cdld;
my $fileafterperlLIN;
my $fileafterperlPRG;
my $handlefileafterperlLIN;
my $handlefileafterperlPRG;
my $encoding;
my $count;
sub diagnosticmessage {
    return unless ( $diagnose );
    print STDERR "$scriptname: ";
    foreach $diagnosticstring (@_) {
        printf STDERR "$diagnosticstring\n";
    }
}
# Routine setup
$scriptpath = $0;
$scriptname = $scriptpath;
$scriptname =~ s|.*\x2f([^\x2f]+)$|$1|;
$cdld = "$ENV{'cdld'}"; # A directory to hold temporary files used by scripts
$exitcode = system("test -d $cdld && test -w $cdld || { printf '%\n' 'cdld not a writeable directory'; exit 1; }");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$scriptdirectory = "$cdld/$scriptname"; # To hold temporary files used by this script
$exitcode = system("test -d $scriptdirectory || mkdir $scriptdirectory");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
diagnosticmessage ( "scriptdirectory=$scriptdirectory" );
$exitcode = system("test -w $scriptdirectory && test -x $scriptdirectory || exit 1;");
die "$scriptname: system returned exitcode=$exitcode: $scriptdirectory not writeable or not executable. bail\n" unless $exitcode == 0;
$fileafterperlLIN = "$scriptdirectory/afterperlLIN.tex";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
$exitcode = system("printf '' > $fileafterperlLIN;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
$fileafterperlPRG = "$scriptdirectory/afterperlPRG.tex";
diagnosticmessage ( "fileafterperlPRG=$fileafterperlPRG" );
$exitcode=system("printf '' > $fileafterperlPRG;");
die "$scriptname: system returned exitcode=$exitcode: bail\n" unless $exitcode == 0;
# This script's job: starting with a LaTeX file, which may compile beautifully in pdflatex but be difficult
# to read visually or search automatically,
# (1) convert any line that looks blank --- a "trivial line", containing only whitespace --- to a pure newline. This is because
# (a) LaTeX interprets any whitespace line following a non-blank or "nontrivial" line as end of paragraph, whereas
# (b) Perl needs two consecutive newlines to signal end of paragraph.
# (2) remove all LaTeX comments;
# (3) deal with the \unskip LaTeX construct, etc.
# The result will be
# (4) each LaTeX paragraph will occupy a unique line
# (5) exactly one pair of newlines --- visually, one blank line --- will divide each pair of consecutive paragraphs
# (6) first paragraph will be on first line (no opening blank line) and last paragraph will be on last line (no ending blank line)
# (7) whitespace in output will consist of only
# (a) a single space between readable strings, or
# (b) double newline between paragraphs
#
$handlefileafterperlLIN = undef;
$handlefileafterperlPRG = undef;
$encoding = ":encoding(UTF-8)";
diagnosticmessage ( "fileafterperlLIN=$fileafterperlLIN" );
open($handlefileafterperlLIN, ">> $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for appending: $!";
# Loop 1 / line:
# Default input record separator: loop through one line at a time, delimited by \n
$count = 0;
while (<>) {
    $count = $count + 1;
    diagnosticmessage ( "line $count" );
    s/^\s*\n/\n/mg; # Convert any trivial line to a pure newline.
    print $handlefileafterperlLIN $_;
}
close($handlefileafterperlLIN);
open($handlefileafterperlLIN, "< $encoding", $fileafterperlLIN) || die "$0: can't open $fileafterperlLIN for reading: $!";
open($handlefileafterperlPRG, ">> $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for appending: $!";
# Loop PRG / paragraph:
local $/ = ""; # Input record separator: loop through one paragraph at a time. position marker $ comes only at end of paragraph.
$count = 0;
while (<$handlefileafterperlLIN>) {
    $count = $count + 1;
    diagnosticmessage ( "paragraph $count" );
    s/(?<!\x5c)[\x25].*\n/ /g; # Remove all LaTeX comments.
    # They start with % not \% and extend to end of line or newline character. Join to next line.
    # s/(?<!\x5c)([\x24])/\x2a/g; # 2019v04v01vMonv13h44m09s any $ not preceded by backslash \, replace $ by * or something.
    # This would be only if we are going to run detex on the output.
    s/(.)\n/$1 /g; # Any line that has something other than newline, and then a newline, is joined to the subsequent line
    s|([^\x2d])\s*(\x2d\x2d\x2d)([^\x2d])|$1 $2$3|g; # consistent treatment of triple hyphen as em dash
    s|([^\x2d])(\x2d\x2d\x2d)\s*([^\x2d])|$1$2 $3|g; # consistent treatment of triple hyphen as em dash, continued
    s/[\x0b\x09\x0c\x20]+/ /gm; # collapse each "run" of whitespace other than newline, to a single space.
    s/\s*[\x5c]unskip(\x7b\x7d)?\s*(\S)/$2/g; # LaTeX whitespace-collapse across newlines
    s/^\s*//; # Any nontrivial line: no indenting; no whitespace in first column.
    print $handlefileafterperlPRG $_;
    print $handlefileafterperlPRG "\n\n"; # make sure each paragraph ends with 2 newlines, hence at least 1 blank line.
}
close($handlefileafterperlPRG);
open($handlefileafterperlPRG, "< $encoding", $fileafterperlPRG) || die "$0: can't open $fileafterperlPRG for reading: $!";
# Loop slurp
local $/ = undef; # Input record separator: entire file is a single record.
$count = 0;
while (<$handlefileafterperlPRG>) {
    $count = $count + 1;
    diagnosticmessage ( "slurp $count" );
    s/[\n][\n]+/\n\n/g; # Exactly 2 newlines (one blank line) separate paragraphs. Like cat -s
    s/[\n]+$/\n/; # Last line is visible or "nontrivial"; no trivial (blank) line at the end
    s/^[\n]+//; # No trivial (blank) line at the start. The first line is "nontrivial."
    print STDOUT;
}

Using Perl to find and fix errors in CSV files

I am dealing with very large amounts of data. Every now and then there is a slip-up. I want to identify each row with an error, under a condition of my choice. With that I want the line number along with the data of each erroneous row. I will be running this script on a handful of files, and I want to write the report to a single output file.
So here is my example data:
File_source,ID,Name,Number,Date,Last_name
1.csv,1,Jim,9876,2014-08-14,Johnson
1.csv,2,Jim,9876,2014-08-14,smith
1.csv,3,Jim,9876,2014-08-14,williams
1.csv,4,Jim,9876,not_a_date,jones
1.csv,5,Jim,9876,2014-08-14,dean
1.csv,6,Jim,9876,2014-08-14,Ruzyck
Desired output:
Row#5,4.csv,4,Jim,9876,not_a_date,jones (this is an erroneous row)
The condition I have chosen is print to output if anything in the date field is not a date.
As you can see, my desired output contains the line number where the error occurred, along with the data itself.
After I have my output showing the lines within each file that are in error, I want to grab each such line from the untouched original CSV file to redo (both modified and original files contain the same number of rows). Once I have a file of these redone rows, I can omit and clean up where needed to prevent interruption of an import.
Folder structure will contain:
Modified: 4.txt
Original: 4.csv
I have something started here, written in Perl, which by the logic will at least return the rows I need. However I believe my syntax is a little off and I do not know how to plug in the other subroutines.
Code:
$count = 1;
while (<>) {
    unless ($F[4] =~ /\d+[-]\d+[-]\d+/)
        print "Row#" . $count++ . "," . "$_";
}
The code above is supposed to give me my erroneous rows, but extracting them from the originals is beyond me. The above code also contains some syntax errors.
This will do as you ask.
Please be certain that none of the fields in the data can ever contain a comma; otherwise you will need to use Text::CSV to process it instead of a simple split.
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'example.csv';
<$fh>; # Skip header

while (<$fh>) {
    my @fields = split /,/;
    if ( $fields[4] !~ /^\d{4}-\d{2}-\d{2}$/ ) {
        print "Row#$.,$_";
    }
}
output
Row#5,4.csv,4,Jim,9876,not_a_date,jones
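If the fields can contain commas, a minimal sketch using the Text::CSV module instead of split (same file layout assumed) would be:

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag;

open my $fh, '<', 'example.csv' or die $!;
<$fh>;    # skip header

while ( my $row = $csv->getline($fh) ) {
    unless ( $row->[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
        print "Row#$.," . join(',', @$row) . "\n";
    }
}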
Update
If you want to process a number of files then you need this instead.
The close ARGV at the end of the loop is there so that the line counter $. is reset to 1 at the start of each file. Without it, the counter just continues upwards across all the files.
You would run this like
rob@Samurai-U:~$ perl findbad.pl *.csv
or you could list the files individually, separated by spaces.
For the test I have created files 1.csv and 2.csv which are identical to your example data except that the first field of each line is the name of the file containing the data.
You may not want the line in the output that announces each file name, in which case you should replace the entire first if block with just next if $. == 1.
use strict;
use warnings;
@ARGV = map { glob qq{"$_"} } @ARGV; # For Windows

while (<>) {
    if ($. == 1) {
        print "\n\nFile: $ARGV\n\n";
        next;
    }
    my @fields = split /,/;
    unless ( $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
        printf "Row#%d,%s", $., $_;
    }
    close ARGV if eof ARGV;
}
output
File: 1.csv
Row#5,1.csv,4,Jim,9876,not_a_date,jones
File: 2.csv
Row#5,2.csv,4,Jim,9876,not_a_date,jones
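The question's second half (pulling the flagged lines back out of the untouched originals) isn't covered above; a rough sketch, assuming the Modified: 4.txt / Original: 4.csv naming scheme from the question:

use strict;
use warnings;

# First pass: record the line numbers of bad rows in each modified file
my %bad;    # filename => { line number => 1 }
while (<>) {
    next if $. == 1;    # skip the header row
    my @fields = split /,/;
    $bad{$ARGV}{$.} = 1 unless $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/;
    close ARGV if eof ARGV;
}

# Second pass: fetch those same lines from the untouched originals
for my $modified (sort keys %bad) {
    (my $original = $modified) =~ s/\.txt$/.csv/;   # assumed naming scheme
    open my $fh, '<', $original or die "$original: $!";
    while (<$fh>) {
        print "$original,Row#$.,$_" if $bad{$modified}{$.};
    }
}

Run it as perl redo.pl *.txt so $ARGV names the modified files.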

How to print result STDOUT to a temporary blank new file in the same directory in Perl?

I'm new to Perl, so this is maybe a very basic case that I still can't understand.
Case:
The program tells the user to type a file name.
The user types the file name(s) (1 or more files).
The program reads the content of the input file(s).
If it's a single-file input, it just prints that file's entire content.
If it's a multi-file input, it combines the contents of each file in sequence.
It then prints the result to a temporary new file, located in the same directory as program.pl.
file1.txt:
head
a
b
end
file2.txt:
head
c
d
e
f
end
SINGLE INPUT program ioSingle.pl:
#!/usr/bin/perl
print "File name: ";
$userinput = <STDIN>;
chomp ($userinput);

# read content from input file
open ("FILEINPUT", $userinput) or die ("can't open file");

# print the content while there is more in the file
while (<FILEINPUT>) {
    print;
}
close FILEINPUT;
SINGLE RESULT in cmd:
>perl ioSingle.pl
File name: file1.txt
head
a
b
end
I found tutorial code that combines content from multiple input files, but I cannot adapt its while loop to the code above:
while ($userinput = <>) {
    print ($userinput);
}
I am stuck at making it work for multi-file input.
How am I supposed to reformat the code so my program gives a result like this?
EXPECTED MULTIFILES RESULT in cmd:
>perl ioMulti.pl
File name: file1.txt file2.txt
head
a
b
end
head
c
d
e
f
end
I appreciate your response :)
A good way to start working on a problem like this, is to break it down into smaller sections.
Your problem seems to break down to this:
get a list of filenames
for each file in the list
display the file contents
So think about writing subroutines that do each of these tasks. You already have something like a subroutine to display the contents of the file.
sub display_file_contents {
    # filename is the first (and only) argument to the sub
    my $filename = shift;

    # Use a lexical filehandle and three-arg open
    open my $filehandle, '<', $filename or die $!;

    # Shorter version of your code
    print while <$filehandle>;
}
The next task is to get our list of files. You already have some of that too.
sub get_list_of_files {
    print 'File name(s): ';
    my $files = <STDIN>;
    chomp $files;

    # We might have more than one filename. Need to split input.
    # Assume filenames are separated by whitespace
    # (Might need to revisit that assumption - filenames can contain spaces!)
    my @filenames = split /\s+/, $files;

    return @filenames;
}
We can then put all of that together in the main program.
#!/usr/bin/perl
use strict;
use warnings;
my @list_of_files = get_list_of_files();

foreach my $file (@list_of_files) {
    display_file_contents($file);
}
By breaking the task down into smaller tasks, each one becomes easier to deal with. And you don't need to carry the complexity of the whole program in your head at one time.
p.s. But like JRFerguson says, taking the list of files as command line parameters would make this far simpler.
The easy way is to use the diamond operator <> to open and read the files specified on the command line. This would achieve your objective:
while (<>) {
    chomp;
    print "$_\n";
}
Thus: ioSingle.pl file1.txt file2.txt
If this is the sole objective, you can reduce this to a command line script using the -p or -n switch like:
perl -pe '1' file1.txt file2.txt
perl -ne 'print' file1.txt file2.txt
These switches create implicit loops around the -e commands. The -p switch prints $_ after every loop as if you had written:
LINE:
while (<>) {
    # your code...
} continue {
    print;
}
Using -n creates:
LINE:
while (<>) {
    # your code...
}
Thus, -p adds an implicit print statement.
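Finally, the part of the question about printing the result to a temporary new file in the same directory isn't covered above; a minimal sketch (assuming the core File::Temp and FindBin modules are acceptable) would redirect the prints there:

#!/usr/bin/perl
use strict;
use warnings;
use File::Temp;
use FindBin;

# Create a temporary output file next to the script.
# UNLINK => 0 keeps it around after the program exits.
my $tmp = File::Temp->new(
    DIR    => $FindBin::Bin,
    SUFFIX => '.txt',
    UNLINK => 0,
);

while (<>) {            # files named on the command line
    print {$tmp} $_;    # combined contents, in sequence
}

print "Combined output written to ", $tmp->filename, "\n";
close $tmp;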

Perl - while (<>) file handling [duplicate]

This question already has an answer here:
Which file is Perl's diamond operator (null file handle) currently reading from?
(1 answer)
Closed 10 years ago.
A simple program with while (<>) handles files given as arguments (./program 1.file 2.file 3.file) as well as standard input on Unix systems.
I think it concatenates them together into one stream and works through it line by line. The problem is: how do I know that I'm working with the first file? And then with the second one?
For a simple example, I want to print each file's content on one line.
while( <> ){
    print "\n" if (it's the second file already);
    print $_;
}
The diamond operator does not concatenate the files; it just opens and reads them consecutively. How you control this depends on how you need it controlled. A simple way to check when we have read the last line of a file is to use eof:
while (<>) {
    chomp;              # remove newline
    print;              # print the line
    print "\n" if eof;  # at end of file, print a newline
}
You can also keep a counter to track which file (in order) you are processing:
$counter++ if eof;
Note that this count will increase by one at the last line of the file, so do not use it prematurely.
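A sketch of that counter applied to the original question (printing a separating newline before every file after the first; this is my illustration, not the answer's code):

my $files_done = 0;
while (<>) {
    chomp;
    print "\n" if $. == 1 && $files_done > 0;   # starting a second (or later) file
    print;
    if (eof) {
        $files_done++;
        close ARGV;    # reset $. for the next file
    }
}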
If you want to keep track of line number $. in the current file handle, you can close the ARGV file handle to reset this counter:
while (<>) {
    print "line $. : ", $_;
    close ARGV if eof;
}
The <> is a special case of the readline operator. It usually takes a filehandle: <$fh>.
If the filehandle is left out, then the magic ARGV filehandle is used.
If no command line arguments are given, then ARGV is STDIN. If command line arguments are given, then ARGV will be opened to each of those in turn. This is similar to:
# Pseudocode
while ($ARGV = shift @ARGV) {
    open ARGV, $ARGV or do {
        warn "Can't open $ARGV: $!";
        next;
    };
    while (<ARGV>) {
        ...; # your code
    }
}
The $ARGV variable is real, and holds the filename of the file currently opened.
Please be aware that the two-arg form of open (which is probably used here behind the scenes) is quite unsafe. The filename rm -rf * | may not do what you want.
The name of the current file for <> is contained in the special $ARGV variable.
You can cross-match your list of files from the @ARGV parameter array with the current file name to get the file's position in the list. Assuming the only parameters you expect are filenames, you can simply do:
my %filename_positions = map { ( $ARGV[$_] => $_ ) } 0 .. $#ARGV;
while (<>) {
    my $file_number = $filename_positions{$ARGV};
    # ... if ($file_number == 0) { ... } # first file
}

perl split on empty file

I have basically the following perl I'm working with:
open I, $coupon_file or die "Error: File $coupon_file will not Open: $! \n";
while (<I>) {
    $lctr++;
    chomp;
    my @line = split /,/;
    if (!@line) {
        print E "Error: $coupon_file is empty!\n\n";
        $processFile = 0;
        last;
    }
}
I'm having trouble determining what the split /,/ function returns if an empty file is given to it. The code block if (!@line) is never executed. If I change that to
if (@line)
then the code block is executed. I've read the information on the Perl split function over at
http://perldoc.perl.org/functions/split.html and the discussion here about testing for an empty array, but I'm not sure what is going on here.
I am new to Perl so am probably missing something straightforward here.
If the file is empty, the while loop body will not run at all.
Evaluating an array in scalar context returns the number of elements in the array.
split /,/ returns a list of one or more elements as long as $_ is non-empty; after chomp, an empty line yields an empty list.
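You can verify this at the command line (my own quick check, not from the original answer):
$ perl -le 'print scalar(my @a = split /,/, "a,b")'
2
$ perl -le 'print scalar(my @a = split /,/, "")'
0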
You might try some debugging:
...
chomp;

use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper( { "line is" => $_ } );

my @line = split /,/;
print Dumper( { "split into" => \@line } );

if (!@line) {
...
Below are a few tips to make your code more idiomatic:
The special variable $. already holds the current line number, so you can likely get rid of $lctr.
Are empty lines really errors, or can you ignore them?
Pull apart the list returned from split and give the pieces names.
Let Perl do the opening with the "diamond operator":
The null filehandle <> is special: it can be used to emulate the behavior of sed and awk. Input from <> comes either from standard input, or from each file listed on the command line. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is empty, $ARGV[0] is set to "-", which when opened gives you standard input. The @ARGV array is then processed as a list of filenames. The loop
while (<>) {
... # code for each line
}
is equivalent to the following Perl-like pseudo code:
unshift(@ARGV, '-') unless @ARGV;
while ($ARGV = shift) {
    open(ARGV, $ARGV);
    while (<ARGV>) {
        ... # code for each line
    }
}
except that it isn't so cumbersome to say, and will actually work.
Say your input is in a file named input and contains
Campbell's soup,0.50
Mac & Cheese,0.25
Then with
#! /usr/bin/perl
use warnings;
use strict;

die "Usage: $0 coupon-file\n" unless @ARGV == 1;

while (<>) {
    chomp;
    my($product,$discount) = split /,/;
    next unless defined $product && defined $discount;
    print "$product => $discount\n";
}
that we run as below on Unix:
$ ./coupons input
Campbell's soup => 0.50
Mac & Cheese => 0.25
Empty file or empty line? Regardless, try this test instead of !@line.
if (scalar(@line) == 0) {
    ...
}
The scalar function returns the array's length in Perl.
Some clarification:
if (@line) {
}
Is the same as:
if (scalar(@line)) {
}
In a scalar context, arrays (@line) return the length of the array. So scalar(@line) forces @line to be evaluated in scalar context and returns the length of the array.
I'm not sure whether you're trying to detect if the line is empty (which your code is trying to) or whether the whole file is empty (which is what the error says).
If the line: fix your error text, and the logic should be as the other posters said (or you can use if ($line =~ /^\s*$/) as your test).
If the file, you simply need to test if (!$lctr) {} after the end of your loop - as noted in another answer, the loop will not be entered if there's no lines in the file.
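A sketch of that last suggestion, reusing the question's filehandles (my illustration; adjust to your real error handling):

my $lctr = 0;
while (<I>) {
    $lctr++;
    # ... process each line ...
}
if (!$lctr) {
    print E "Error: $coupon_file is empty!\n\n";
    $processFile = 0;
}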