CR vs LF Perl parsing

I have a Perl script that parses a text file and splits it line by line into an array.
It works fine when each line is terminated by LF, but when lines are terminated by CR my script does not handle them properly.
How can I modify this line to fix this?
my @allLines = split(/^/, $entireFile);
Edit:
My file has a mixture of lines ending in either LF or CR. The split collapses all the lines together when they end in CR.

Perl can handle both CRLF and LF line-endings with the built-in :crlf PerlIO layer:
open(my $in, '<:crlf', $filename);
will automatically convert CRLF line endings to LF, and leave LF line endings unchanged. But CR-only files are the odd man out. If you know that the file uses CR only, then you can set $/ to "\r" and it will read line by line (but it won't change the CR to an LF).
If you have to deal with files of unknown line endings (or even mixed line endings in a single file), you might want to install the PerlIO::eol module. Then you can say:
open(my $in, '<:raw:eol(LF)', $filename);
and it will automatically convert CR, CRLF, or LF line endings into LF as you read the file.
Another option is to set $/ to undef, which will read the entire file in one slurp. Then split it on /\r\n?|\n/. But that assumes that the file is small enough to fit in memory.
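As a minimal sketch of the slurp-and-split approach just described: the sample data and the in-memory filehandle below are assumptions made so the example is self-contained; with a real file you would open $filename instead.

```perl
use strict;
use warnings;

# Hypothetical sample data mixing LF, CRLF and CR line endings.
my $raw = "one\ntwo\r\nthree\rfour\n";

# An in-memory filehandle stands in for the real file here.
open my $in, '<', \$raw or die "Can't open in-memory handle: $!";

# Setting $/ to undef inside the do-block slurps the whole input at once.
my $entireFile = do { local $/; <$in> };
close $in;

# \r\n? matches CRLF or a lone CR; the second alternative matches a lone LF.
my @allLines = split /\r\n?|\n/, $entireFile;

print scalar(@allLines), " lines: @allLines\n";
```

Note that split discards the line terminators and drops trailing empty fields, so a final newline does not produce an extra empty element.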

If you have mixed line endings, you can normalize them by matching a generalized line ending:
use v5.10;
$entireFile =~ s/\R/\n/g;
You can also open a filehandle on a string and read lines just like you would from a file:
open my $fh, '<', \ $entireFile;
my @lines = <$fh>;
close $fh;
You can even open the string with the layers that cjm shows.

You can probably just handle the different line endings when doing the split, e.g.:
my @allLines = split(/\r\n|\r|\n/, $entireFile);

It will automatically split the input into lines if you read with <>, but you need to change $/ to \r.
$/ is the "input record separator". see perldoc perlvar for details.
There is not any way to change what a regular expression considers to be the end-of-line - it's always newline.

Related

How to remove newline from the end of a file using Perl

I have a file that reads like this:
dog cat mouse
apple orange pear
red yellow green
There is a tab \t separating the words on each row, and a newline \n separating each of the rows. Below the last line, red yellow green there is a blank line due to a newline \n after green.
I would like to use Perl to remove the newline.
I have seen a few articles, like "How can I delete a newline if it is the last character in a file?", that give solutions for Perl, but I would like to do this directly in code so that I can incorporate it into my Perl script.
I don't know if this might be possible using chomp, or if chomp works on each line separately (I would like to keep the newline between lines).
Also I have seen previously comments that suggest maintaining a newline at the end of a file because Unix commands work better when a file ends with a newline. However, I have created a script which relies on input files not ending with a newline, therefore I really feel removing the newlines is necessary for my work.
You can try this:
perl -pe 'chomp if eof' file.txt
Here is another simple way, if you need it in a script:
open my $fh, '<', 'file.txt' or die "Can't open file.txt: $!";
my @lines = <$fh>; # read all lines and store in array
close $fh;
chomp $lines[-1] if @lines; # remove newline from last line
print @lines;
Or something like this (in script), as suggested by jnhc for the command line:
open my $fh, '<', 'file.txt' or die "Can't open file.txt: $!";
while (<$fh>) {
    chomp if eof $fh;   # strip the newline from the final line only
    print;
}
close $fh;

Perl 5.12.3 fails to loop CSV file line by line

I'm sure someone has an explanation as to what is happening with the following script:
Please note, the file I specify is available and is opening. I know this because the last line of the file is output when the program is run, but it is only the last line.
Note about the .csv file: it's generated on windows (I'm using OS X 10.7.4 with Perl 5.12.3) and uses \r line breaks. I attempted to tell perl that the line break character was \r at the top of the script but it does not work. I know they're \r as the grep search finds them in a text editor.
The script runs and only prints the last line of the file. If I plug in a regular expression it will grab the first matching field from the first line and echo it fine, but I cannot iterate over the entire file.
Any clarification is appreciated as I am new to perl.
#!/usr/bin/perl
use warnings;
print "Please enter your filename:";
my ($dataline);
open(INFO,'./expensereport.csv') || die("can't open datafile: $!");
while (my $line = <INFO>) {
chomp $line;
print $line;
}
print $!;
The carriage returns without linefeed are causing print to overwrite each line on the same line, so all you see is the last.
Run dos2unix on your input file before processing.
There are several ways to tell perl that your input file is windows-style :crlf.
perldoc -f binmode or perldoc -f open
open(INFO, '<:crlf', './expensereport.csv')
...
Ahh, that's clear! :)
Look, you have a file with \r (carriage return, literally) and \n (newline). chomp cuts off \n (new line). So you print over the same line (remember "carriage return") again and again.
Use print "$line\n"; instead

Opening a CSV file created in Mac Excel with Perl

I'm having a bit of trouble with the Perl code below. I can open and read in a CSV file that I've made manually, but if I try to open any Mac Excel spreadsheet that I save as a CSV file, the code below reads it all as a single line.
#!/usr/bin/perl
use strict;
use warnings;
open F, "file.csv";
foreach (<F>)
{
($first, $second, undef, undef) = split (',', $_);
}
print "$first : $second\n";
close(F);
Always use a specialised module (such as Text::CSV or Text::CSV_XS) for this purpose as there are lots of cases where split-ing will not help (for example when the fields contain a comma which is not a field separator but is within quotes).
Traditional Macintosh (System 9 and previous) uses CR (0x0D, \r) as the line separator. Mac OS X (Unix based) uses LF(0x0A, \n) as the default line separator, so the perl script, being a Unix tool, is probably expecting LF but is getting CR. Since there are no line separators in the file perl thinks there is only one line. If it had Windows line endings (CR,LF) you'd probably be getting an invisible CR at the end of each line.
A quick loop over the input replacing 0x0D with 0x0A should fix your problem.
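A rough sketch of that replacement, assuming the file content is already in a string (the sample data is hypothetical). Perl's tr operator does the byte-for-byte swap without an explicit loop:

```perl
use strict;
use warnings;

# Hypothetical CR-only content, as classic Mac software would write it.
my $text = "a,b\rc,d\re,f\r";

# tr replaces every CR (0x0D) with LF (0x0A) in place and returns the count.
my $replaced = ($text =~ tr/\r/\n/);

my @lines = split /\n/, $text;
print "replaced $replaced CRs, got ", scalar(@lines), " lines\n";
```

This is only appropriate for CR-only input. On a CRLF file it would turn every line ending into two LFs and so introduce blank lines.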
I've directly experienced this problem with Excel 2004 for Mac. The line endings are indeed \r, and IIRC, the text uses the MacRoman character set, rather than Latin-1 or UTF-8 as you might expect.
So as well as the good advice to use Text::CSV / Text::CSV_XS and splitting on \r, you will want to open the file using the MacRoman encoding like so:
open my $fh, "<:encoding(MacRoman)", $filename
or die "Can't read $filename: $!";
Likewise, when reading a file exported with Excel on Windows, you may wish to use :encoding(cp1252) instead of :encoding(MacRoman) in that code.
Not sure about Mac excel, but certainly the windows version tends to enclose all values in quotes: "like","this". Also, you need to take into account the possibility of there being a quote in the value, which would show up "like""this" (there's only a single " in that value).
To actually answer your question however, it's likely that it's using a different newline character from what you'd expect. It's probably saving as \r\n instead of \n, or vice versa.
As others have suspected, your line endings are probably to blame. On my Linux-based system there are built-in utilities to change these line endings. mac2unix (which I think is just a wrapper around dos2unix) will read your file and change the line endings for you. You should have something similar on both Linux and Mac (Microsoft may not care about you).
If you want to handle this in Perl, look into setting the $/ variable, the "input record separator", from "\n" to "\r" (if that's the right ending). Try local $/ = "\r" before you read the file. Read more about it in perldoc perlvar (near $/) or in perldoc perlport (devoted to writing portable Perl code).
P.S. if I have some part of this incorrect let me know, I don't use Mac, I just think I know the theory
If you set $/, the special variable that controls what Perl considers a line ending, to "\r", you'll be able to read one line at a time: $/ = "\r"; In this particular case Perl's default line ending on the Mac is \n, but the file is probably using \r. This builds on what Flynn1179 & Mark Thalman said, but shows you what to do to keep using while (<F>)-style reading.
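A small sketch of reading a CR-only file line by line this way. The sample data and the in-memory filehandle are assumptions standing in for the real CSV file, and for real CSV data a module such as Text::CSV is still the better parser:

```perl
use strict;
use warnings;

# Hypothetical CR-only CSV data; an in-memory filehandle stands in for the file.
my $data = "name,qty\rapple,3\rpear,5\r";
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";

my @first_fields;
{
    local $/ = "\r";               # records now end at CR; restored after this block
    while (my $line = <$fh>) {
        chomp $line;               # chomp removes the current $/, here the CR
        my ($first) = split /,/, $line;
        push @first_fields, $first;
    }
}
close $fh;

print join("|", @first_fields), "\n";
```

Because chomp removes whatever $/ currently holds, it must run while the localized value is still in effect.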

is there a way to designate the line token delimiter in Perl's file reader?

I'm reading a text file in via CGI, in Perl, and noticing that when the file is saved in Mac's TextEdit the line separator is recognized, but when I upload a CSV that is exported straight from Excel, the line separators are not. I'm guessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator token to be if I didn't want the one it looks for by default.
Yes. You'll want to overwrite the value of $/. From perlvar
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)
local $/; # enable "slurp" mode
local $_ = <FH>; # whole file now here
s/\n[ \t]+/ /g;
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.
On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
See also "Newlines" in perlport. Also see $..
The variable has multiple names:
$/
$RS
$INPUT_RECORD_SEPARATOR
For the longer names, you need:
use English;
Remember to localize carefully:
{
local($/) = "\r\n";
...code to read...
}
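To illustrate the record-mode behaviour from the perlvar excerpt together with the advice to localize, here is a minimal sketch. The 10-byte input, the 4-byte record size and the in-memory filehandle are all assumptions chosen to keep the example short:

```perl
use strict;
use warnings;

# Hypothetical 10-byte input read through an in-memory filehandle.
my $data = "abcdefghij";
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";

my @chunks;
{
    local $/ = \4;                 # reference to an integer: at most 4 bytes per read
    while (my $chunk = <$fh>) {
        push @chunks, $chunk;      # the last chunk may be shorter than 4 bytes
    }
}
close $fh;

# Outside the block $/ is back to its previous value.
print "chunks: @chunks\n";
print "separator restored\n" if $/ eq "\n";
```

Keeping the local inside a bare block means other code that reads lines later in the program is not affected by the changed separator.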
If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
This will transparently convert \r\n sequences into \n sequences.
You can also apply this translation to an existing handle by doing:
binmode( $fh, ':crlf' );
:crlf mode is typically default in Win32 Perl environments and works very well in practice.
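A self-contained sketch of the :crlf layer in action. The temporary file and its contents are assumptions made for the demonstration; File::Temp is a core module:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a hypothetical CRLF-terminated file in raw mode so the bytes are exact.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':raw');
print {$out} "alpha\r\nbeta\r\n";
close $out or die "Can't close $path: $!";

# Read it back through the :crlf layer, which turns each CRLF into LF.
open my $in, '<:crlf', $path or die "Can't read $path: $!";
my @lines = <$in>;
close $in;

chomp @lines;                      # only the LF is left to remove
print join("|", @lines), "\n";
```

Without the layer, on a Unix-like system each element would still end in a carriage return after chomp.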
For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.
But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabytes file, that should be the safest and easiest way.
So:
perl -ln -0777 -e 'my @lines = split /\R/;
print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE
or in your script:
{
local $/ = undef;
open F, $YOUR_FILE or die;
@lines = split /\R/, <F>;
close F;
}
\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.
From the perldoc :
\R matches a generic newline; that is, anything considered a linebreak
sequence by Unicode. This includes all characters matched by \v
(vertical whitespace), and the multi character sequence "\x0D\x0A"
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files
opened in binary mode)
Or see this much nicer and more exhaustive explanation of \R in brian d foy's article, The \R generic line ending, which even has a couple of fun videos.
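As a quick illustration of why \R is convenient for mixed input, the sketch below compares it with a naive character class. The sample string is an assumption:

```perl
use v5.10;     # \R is available from Perl 5.10 onwards
use strict;
use warnings;

# Hypothetical string mixing LF, CRLF and CR line endings.
my $mixed = "unix\nwindows\r\nold mac\rlast";

# \R treats CRLF as a single line break.
my @lines = split /\R/, $mixed;

# A plain character class sees CR and LF as two separate breaks,
# which yields an empty field in the middle of each CRLF.
my @naive = split /[\r\n]/, $mixed;

say scalar(@lines), " lines with \\R";
say scalar(@naive), " fields with [\\r\\n]";
```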

How do I process lines with CRLF, NEL line terminators?

I need to process a file with shift_jis encoding. However, the line terminators are in a format that I'm not familiar with.
> file record.CSV
record.CSV: Non-ISO extended-ASCII text, with CRLF, NEL line terminators
I'm using the usual:
open my $CSV_FILE, "<:encoding(shift_jis)", $filename or die "Could not open: $filename : $!";
while (<$CSV_FILE>) {
chomp;
# do stuff
}
However it is still leaving a CR at the end of each record.
What is the correct way to terminate files of these types?
Why not do $_ =~ s/\r// manually?
Edit: apparently, you can also do
require Encode;
use Unicode::Normalize;
s/\x{0085}//g;
to remove the NEL: Next Line, U+0085 characters.
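The manual carriage-return removal suggested at the start of this answer might look like the sketch below. The sample records are hypothetical, and a plain in-memory filehandle is used instead of the :encoding(shift_jis) layer to keep the example self-contained:

```perl
use strict;
use warnings;

# Hypothetical CRLF-terminated records; the real file would be opened
# with the :encoding(shift_jis) layer as in the question.
my $data = "id,name\r\n1,foo\r\n2,bar\r\n";
open my $fh, '<', \$data or die "Can't open in-memory handle: $!";

my @records;
while (my $line = <$fh>) {
    chomp $line;          # removes the trailing LF
    $line =~ s/\r\z//;    # removes the CR that chomp leaves behind
    push @records, $line;
}
close $fh;

print join("|", @records), "\n";
```

Anchoring the substitution with \z limits it to a carriage return at the very end of the record, so any CR inside a field is left alone.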
You need to consider who's consuming the data and learn more about the environment which produced these files. If it's a plain-vanilla CSV output file you're after in the end, use any old string manipulation you like to get rid of them (and produce CRLF terminators in their stead) and you'll be fine.