Is there a way to designate the line token delimiter in Perl's file reader?

I'm reading a text file in via CGI, in Perl, and noticing that when the file is saved in the Mac's TextEdit the line separator is recognized, but when I upload a CSV that is exported straight from Excel, it is not. I'm guessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator token to be, if I didn't want the one it's looking for by default.

Yes. You'll want to overwrite the value of $/. From perlvar:
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)
local $/; # enable "slurp" mode
local $_ = <FH>; # whole file now here
s/\n[ \t]+/ /g;
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 bytes from $fh. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.
On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
See also "Newlines" in perlport. Also see $..
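To illustrate the fixed-size record reads described in the perlvar excerpt above, here is a minimal sketch. It assumes an in-memory filehandle and a made-up ten-byte string, neither of which comes from the original answer; a real file would behave the same way on non-VMS systems.

```perl
use strict;
use warnings;

# Hypothetical sample data, read through an in-memory filehandle.
my $data = 'abcdefghij';
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my @chunks;
{
    local $/ = \4;      # reference to an integer: records of at most 4 bytes
    @chunks = <$fh>;    # list context returns every record
}
close $fh;

print join('|', @chunks), "\n";    # abcd|efgh|ij
```

The last chunk is shorter because the data runs out before a full record is available.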

The variable has multiple names:
$/
$RS
$INPUT_RECORD_SEPARATOR
For the longer names, you need:
use English;
Remember to localize carefully:
{
local($/) = "\r\n";
...code to read...
}
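As a rough illustration of why the localization matters, the sketch below reads one CRLF-terminated record inside a block and then one default line outside it. The sample string and variable names are assumptions made up for this example.

```perl
use strict;
use warnings;

# Hypothetical data with mixed terminators, read from an in-memory handle.
my $data = "one\r\ntwo\r\nthree\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";

my $first;
{
    local $/ = "\r\n";    # applies only inside this block
    $first = <$fh>;       # reads "one\r\n"
    chomp $first;         # chomp strips the current $/, so "\r\n" is removed
}

my $rest = <$fh>;         # $/ is "\n" again here, so this reads "two\r\n"
close $fh;

print "first: $first\n";
```

Outside the block the carriage return stays in $rest, because the default separator only knows about "\n".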

If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
This will transparently convert \r\n sequences into \n sequences.
You can also apply this translation to an existing handle by doing:
binmode( $fh, ':crlf' );
:crlf mode is typically default in Win32 Perl environments and works very well in practice.
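A small self-contained check of the :crlf layer might look like the sketch below. It writes a throwaway CRLF file with File::Temp first; the file contents and variable names are hypothetical and only there to make the example runnable.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small CRLF-terminated file in raw mode (hypothetical sample data).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "name,qty\r\napple,3\r\n";
close $out or die "Cannot close $path: $!";

# Read it back through the :crlf layer, so each line ends in a plain "\n".
open my $in, '<:crlf', $path or die "Cannot open $path: $!";
chomp(my @rows = <$in>);
close $in;

print "$_\n" for @rows;
```

After the chomp no stray carriage return is left at the end of each row, which is the usual symptom when the layer is missing.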

For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.
But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabytes file, that should be the safest and easiest way.
So:
perl -ln -0777 -e 'my @lines = split /\R/;
print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE
or in your script:
{
local $/ = undef;
open my $fh, '<', $YOUR_FILE or die "Cannot open $YOUR_FILE: $!";
@lines = split /\R/, <$fh>;
close $fh;
}
\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.
From the perldoc :
\R matches a generic newline; that is, anything considered a linebreak
sequence by Unicode. This includes all characters matched by \v
(vertical whitespace), and the multi character sequence "\x0D\x0A"
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files
opened in binary mode)
Or see this much nicer and more exhaustive explanation of \R in brian d foy's article, "The \R generic line ending", which even has a couple of fun videos.
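To make the behaviour of \R concrete, here is a short sketch that splits a string containing all three line-ending styles. The sample string is invented for the example.

```perl
use strict;
use warnings;
use v5.10;    # \R and say need Perl 5.10 or later

# Hypothetical text mixing LF, CRLF and bare CR line endings.
my $text  = "unix\nwindows\r\nold mac\rlast";
my @lines = split /\R/, $text;

say scalar(@lines), " lines: ", join(', ', @lines);
# 4 lines: unix, windows, old mac, last
```

Note that \R treats "\r\n" as a single line break, so no empty element appears between "windows" and "old mac".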

Related

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated with a plain old double space (\r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
my $inputFile = $_[0]; #read first argument from the commandline as fileName
open INPUTFILE, "+<", $inputFile or die $!; #Open File
my $singleLine;
my @singleRecord;
my $recordCounter = 0;
while (<INPUTFILE>) { # loop through the input file line-by-line
$singleLine = $_;
push(@singleRecord, $singleLine); # start adding each line to a record array
if ($singleLine =~ m/\r\n/) { # check for carriage return and new line
$recordCounter += 1;
createHashTable(@singleRecord); # send record make a hash table
@singleRecord = (); # empty the current record to start a new record
}
}
print "total records : $recordCounter \n";
close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
my ($input_file) = @_;
open my $fh, '<:crlf', $input_file or die $!;
local $/ = '';
my $record_counter = 0;
while (my $record = <$fh>) {
chomp $record;
++$record_counter;
create_hash_table(split /\n/, $record);
}
close $fh;
print "Total records : $record_counter\n";
}
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
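As a sketch of the explicit-separator approach, the example below counts records in a made-up string that uses "\r\n\r\n" between records. The data, the in-memory filehandle, and the variable names are assumptions for illustration; the binmode call is there so that no CRLF translation interferes on platforms where it is the default.

```perl
use strict;
use warnings;

# Hypothetical input: two records separated by an empty CRLF line.
my $data = "record 1\r\nsome text\r\n\r\nrecord 2\r\nsome text\r\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
binmode $fh;    # keep the raw bytes, CR included

my $count = 0;
{
    local $/ = "\r\n\r\n";    # one read now returns one whole record
    while (my $record = <$fh>) {
        chomp $record;        # strips the trailing "\r\n\r\n", if present
        $count++;
    }
}
close $fh;

print "total records: $count\n";    # total records: 2
```

The final record is still returned even though it is not followed by the full separator.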
If your files are not huge multi-gigabytes files, the easiest and safest way is to read the whole file, and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.

Perl shortcuts "$/" and "$\"

I am confused by the perl shortcuts as to how they are used exactly.
I am much more confused about the variables $/ and $\.
Can you please help me in this as I am new to perl scripting.
For $/: It is the input record separator. When you read from an input source (e.g. a file) with my $line = <FILEHANDLE>, Perl will read data from the file until it encounters the content of $/. It defaults to the newline character "\n", which gives us the normal understanding of what a line is.
However, when you unset $/ then Perl will read the whole input stream in one call. It's therefore a common idiom to unset $/ locally and read the whole file, e.g.
my $whole_file = do {
local $/;
<FILE_HANDLE>
};
or something similar.
$\ on the other hand is always appended after each call to print. It is by default undefined, meaning you have to add things like newline characters yourself.
All those things are explained in detail in the perlvar documentation page.
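A quick way to see $\ in action is to capture the output of print in a string. The sketch below does that with an in-memory filehandle; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Capture everything printed to $out in $captured.
my $captured = '';
open my $out, '>', \$captured or die "Cannot open in-memory handle: $!";

{
    local $\ = "\n";          # output record separator: appended by every print
    print {$out} 'first';
    print {$out} 'second';
}
print {$out} 'third';         # $\ is undefined again, nothing is appended
close $out;

print $captured, "\n";
```

The captured string is "first\nsecond\nthird": the two prints inside the block gained a newline, the one outside did not.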

extracting paragraphs from text with perl

I want to extract the paragraphs from a text variable that is retrieved from the DB.
To extract the paragraphs from a filehandle, I use the code below:
local $/ = undef;
@paragraphs = <STDIN>;
What is the best way to extract paragraphs from a text variable using Perl, and is there a module on CPAN that does this type of task?
You're almost there. Setting $/ to undef will slurp in the entire text in one go.
What you want is local $/ = ""; to enable paragraph mode, as per perldoc perlvar (emphasis my own):
$/
The input record separator, newline by default. This influences Perl's
idea of what a "line" is. Works like awk's RS variable, including
treating empty lines as a terminator if set to the null string (an
empty line cannot contain any spaces or tabs). You may set it to a
multi-character string to match a multi-character terminator, or to
undef to read through the end of file. Setting it to "\n\n" means
something slightly different than setting to "" , if the file contains
consecutive empty lines. Setting to "" will treat two or more
consecutive empty lines as a single empty line. Setting to "\n\n"
will blindly assume that the next input character belongs to the next
paragraph, even if it's a newline.
Of course, it is possible to get a filehandle to read from a string instead of a file:
use strict;
use warnings;
use autodie;
my $text = <<TEXT;
This is a paragraph.

Here's another one that
spans over multiple lines.

Last paragraph
TEXT
local $/ = "";
open my $fh, '<', \$text;
while ( <$fh> ) {
print "New Paragraph: $_";
}
close $fh;
Output
New Paragraph: This is a paragraph.
New Paragraph: Here's another one that
spans over multiple lines.
New Paragraph: Last paragraph
You already have the answer for a script (local $/ = "";), but it may be worth noting that there is a shortcut for one-liners: the -00 option.
perl -00 -ne '$count++; END {print "Counted $count paragraphs\n"}' somefile.txt
From man perlrun :
-0[octal/hexadecimal]
specifies the input record separator ($/) [...]
The special value 00 will cause Perl to slurp files in paragraph
mode.
If the text is in a variable, for example:
$text = "Here is a paragraph.\nHere is another paragraph.";
or:
$text = 'Paragraph 1
Paragraph2';
You can simply get the paragraphs by splitting the text with "\n".
@paragraphs = split("\n",$text);
If your paragraphs are separated by double newlines or a combination of \n and \r (like in Windows) you can change the split command accordingly.
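For the blank-line case, one possible adjustment of the split is sketched below. The pattern treats two or more consecutive line breaks (LF or CRLF) as a paragraph boundary; the sample text and the trailing-whitespace cleanup are assumptions added for the example.

```perl
use strict;
use warnings;

# Hypothetical text with CRLF and LF paragraph breaks mixed together.
my $text = "Paragraph one,\r\nline two.\r\n\r\nParagraph two.\n\n\nParagraph three.\n";

# Two or more consecutive line breaks separate paragraphs.
my @paragraphs = split /(?:\r?\n){2,}/, $text;

# Drop any trailing whitespace left on the individual paragraphs.
s/\s+\z// for @paragraphs;

print scalar(@paragraphs), " paragraphs\n";    # 3 paragraphs
```

Single line breaks inside a paragraph are left alone, so the first paragraph keeps its internal "\r\n".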

Perl New Line separator issue

I have a file that uses CR/LF to separate records, but individual records sometimes contain a LF.
while (<$in>)
{
#extract record data
}
I am trying to read this file as above, and this (as I would expect) splits the records that contain a LF only. I would, however, have expected that reassigning $/ would resolve this issue, but it appears to cause the complete file to be read in one iteration.
$/ = "\r\n";
while (<$in>)
{
#extract record data
}
Anyone here who can suggest a working solution?
I am using Activestate Perl on Windows.
On Windows, Perl converts the incoming CRLF line endings to LF only, which makes it impossible to distinguish CRLF from LF when reading the data as text (see perlport). Therefore, you have to read your data in binary mode by calling binmode on your filehandle:
binmode($in);
After that, you can set the input record separator to "\015\012" and read-in your records as usual:
$/ = "\015\012";
while (<$in>) {
...
}
greets, Matthias
PS: I have no chance to test that locally, at the moment, so I regret if it does not work.
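The approach above can be checked with a small self-contained sketch. It writes a throwaway file whose records end in CRLF and whose first record contains a bare LF; the file contents and variable names are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical data: CRLF ends a record, a bare LF may appear inside one.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "id=1;note=line a\nline b\r\nid=2;note=single\r\n";
close $out or die "Cannot close $path: $!";

open my $in, '<', $path or die "Cannot open $path: $!";
binmode $in;                    # no CRLF translation on read

my @records;
{
    local $/ = "\015\012";      # records end in CR LF only
    while (my $record = <$in>) {
        chomp $record;          # removes the trailing CR LF
        push @records, $record;
    }
}
close $in;

print scalar(@records), " records\n";    # 2 records
```

The embedded LF stays inside the first record instead of splitting it.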
Try setting $/ to "\n". From Newlines in perlport:
Perl uses \n to represent the "logical" newline, where what is logical
may depend on the platform in use. In MacPerl, \n always means \015.
In DOSish perls, \n usually means \012, but when accessing a file in
"text" mode, perl uses the :crlf layer that translates it to (or from)
\015\012, depending on whether you're reading or writing.
Try this before the while loop:
binmode($in);

CR vs LF perl parsing

I have a perl script which parses a text file and breaks it up per line into an array.
It works fine when each line is terminated by LF, but when lines are terminated by CR my script does not handle them properly.
How can I modify this line to fix this?
my @allLines = split(/^/, $entireFile);
Edit:
My file has a mixture of lines ending in either LF or CR. The script just collapses all the lines that end in CR.
Perl can handle both CRLF and LF line-endings with the built-in :crlf PerlIO layer:
open(my $in, '<:crlf', $filename);
will automatically convert CRLF line endings to LF, and leave LF line endings unchanged. But CR-only files are the odd-man out. If you know that the file uses CR-only, then you can set $/ to "\r" and it will read line-by-line (but it won't change the CR to a LF).
If you have to deal with files of unknown line endings (or even mixed line endings in a single file), you might want to install the PerlIO::eol module. Then you can say:
open(my $in, '<:raw:eol(LF)', $filename);
and it will automatically convert CR, CRLF, or LF line endings into LF as you read the file.
Another option is to set $/ to undef, which will read the entire file in one slurp. Then split it on /\r\n?|\n/. But that assumes that the file is small enough to fit in memory.
If you have mixed line endings, you can normalize them by matching a generalized line ending:
use v5.10;
$entireFile =~ s/\R/\n/g;
You can also open a filehandle on a string and read lines just like you would from a file:
open my $fh, '<', \ $entireFile;
my @lines = <$fh>;
close $fh;
You can even open the string with the layers that cjm shows.
You can probably just handle the different line endings when doing the split, e.g.:
my @allLines = split(/\r\n|\r|\n/, $entireFile);
Perl will automatically split the input into lines if you read with <>, but you need to change $/ to "\r".
$/ is the "input record separator". see perldoc perlvar for details.
There is not any way to change what a regular expression considers to be the end-of-line - it's always newline.
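The point about regular expressions can be demonstrated with a short sketch: under /m, the ^ and $ anchors only recognise "\n", so CR-separated lines go unnoticed until the line endings are normalized. The sample string and variable names are assumptions for this example.

```perl
use strict;
use warnings;
use v5.10;    # for \R and say

my $text = "alpha\rbeta\rgamma";    # hypothetical CR-separated sample

# ^ and $ under /m only look for "\n", so no complete line is matched here.
my $before = () = $text =~ /^\w+$/mg;

# Normalize every kind of line ending to "\n", then try again.
(my $normalized = $text) =~ s/\R/\n/g;
my $after = () = $normalized =~ /^\w+$/mg;

say "matches before: $before, after: $after";
# matches before: 0, after: 3
```

Normalizing first keeps the rest of the parsing code free of line-ending special cases.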