Perl New Line separator issue - perl

I have a file that uses CR/LF to separate records, but individual records sometimes contain a LF.
while (<$in>)
{
#extract record data
}
I am trying to read this file as above, and this (as I would expect) splits the records that contain a bare LF. I would, however, have expected that reassigning $/ would resolve the issue, but it appears to cause the complete file to be read in one iteration.
$/ = "\r\n";
while (<$in>)
{
#extract record data
}
Anyone here who can suggest a working solution?
I am using Activestate Perl on Windows.

On Windows, Perl converts the incoming CRLF line endings to LF only when data is read as text, which makes it impossible to distinguish CRLF from LF (see perlport). Therefore, you have to read your data in binary mode by calling binmode on your filehandle:
binmode($in);
After that, you can set the input record separator to "\015\012" and read-in your records as usual:
$/ = "\015\012";
while (<$in>) {
...
}
greets, Matthias
PS: I have no chance to test that locally, at the moment, so I regret if it does not work.
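A minimal sketch of the approach above, using a temporary file as hypothetical sample data (the file contents and variable names are illustrative and not from the original question). It writes two CRLF-terminated records, the first containing a bare LF, and reads them back in binary mode:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: two CRLF-terminated records,
# the first one containing an embedded LF.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "field 1\012still record 1\015\012record 2\015\012";
close $tmp_fh or die "close failed: $!";

open my $in, '<', $tmp_name or die "Cannot open $tmp_name: $!";
binmode $in;                          # no CRLF translation, even on Windows

my @crlf_records;
{
    local $/ = "\015\012";            # a record ends at CRLF only
    while (my $record = <$in>) {
        chomp $record;                # chomp removes the current value of $/
        push @crlf_records, $record;
    }
}
close $in;

print scalar(@crlf_records), " records read\n";
```

Scoping the change with local keeps the modified separator from leaking into other code that reads files.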

Try setting $/ to "\n". From Newlines in perlport:
Perl uses \n to represent the "logical" newline, where what is logical
may depend on the platform in use. In MacPerl, \n always means \015.
In DOSish perls, \n usually means \012, but when accessing a file in
"text" mode, perl uses the :crlf layer that translates it to (or from)
\015\012, depending on whether you're reading or writing.

Try this before the while loop:
binmode($in);

Related

Counting records separated by CR/LF (carriage return and newline) in Perl

I'm trying to create a simple script to read a text file that contains records of book titles. Each record is separated with a plain old double space (\r\n\r\n). I need to count how many records are in the file.
For example here is the input file:
record 1
some text
record 2
some text
...
I'm using a regex to check for carriage return and newline, but it fails to match. What am I doing wrong? I'm at my wits' end.
sub readInputFile {
    my $inputFile = $_[0]; #read first argument from the commandline as fileName
    open INPUTFILE, "+<", $inputFile or die $!; #Open File
    my $singleLine;
    my @singleRecord;
    my $recordCounter = 0;
    while (<INPUTFILE>) { # loop through the input file line-by-line
        $singleLine = $_;
        push(@singleRecord, $singleLine); # start adding each line to a record array
        if ($singleLine =~ m/\r\n/) { # check for carriage return and new line
            $recordCounter += 1;
            createHashTable(@singleRecord); # send record make a hash table
            @singleRecord = (); # empty the current record to start a new record
        }
    }
    print "total records : $recordCounter \n";
    close(INPUTFILE);
}
It sounds like you are processing a Windows text file on Linux, in which case you want to open the file with the :crlf layer, which will convert all CRLF line-endings to the standard Perl \n ending.
If you are reading Windows files on a Windows platform then the conversion is already done for you, and you won't find CRLF sequences in the data you have read. If you are reading a Linux file then there are no CR characters in there anyway.
It also sounds like your records are separated by a blank line. Setting the built-in input record separator variable $/ to a null string will cause Perl to read a whole record at a time.
I believe this version of your subroutine is what you need. Note that people familiar with Perl will thank you for using lower-case letters and underscore for variables and subroutine names. Mixed case is conventionally reserved for package names.
You don't show create_hash_table so I can't tell what data it needs. I have chomped and split the record into lines, and passed a list of the lines in the record with the newlines removed. It would probably be better to pass the entire record as a single string and leave create_hash_table to process it as required.
sub read_input_file {
my ($input_file) = @_;
open my $fh, '<:crlf', $input_file or die $!;
local $/ = '';
my $record_counter = 0;
while (my $record = <$fh>) {
chomp $record;
++$record_counter;
create_hash_table(split /\n/, $record);
}
close $fh;
print "Total records : $record_counter\n";
}
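As a hedged, self-contained sketch of paragraph mode combined with the :crlf layer, the following writes a small hypothetical sample file (the contents and the read_records name are illustrative) and returns each record as an array reference of its lines rather than calling create_hash_table:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: two records of two lines each, CRLF line endings,
# separated by a blank line.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "record 1\015\012some text\015\012\015\012",
                "record 2\015\012some text\015\012";
close $tmp_fh or die "close failed: $!";

sub read_records {
    my ($input_file) = @_;
    open my $fh, '<:crlf', $input_file or die "Cannot open $input_file: $!";
    local $/ = '';                    # paragraph mode: blank lines end a record
    my @records;
    while (my $record = <$fh>) {
        chomp $record;                # in paragraph mode this strips all trailing newlines
        push @records, [ split /\n/, $record ];
    }
    close $fh;
    return @records;
}

my @records = read_records($tmp_name);
printf "Total records: %d\n", scalar @records;
```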
You can do this more succinctly by changing Perl's record-separator, which will make the loop return a record at a time instead of a line at a time.
E.g. after opening your file:
local $/ = "\r\n\r\n";
my $recordCounter = 0;
$recordCounter++ while(<INPUTFILE>);
$/ holds Perl's global record-separator, and scoping it with local allows you to override its value temporarily until the end of the enclosing block, when it will automatically revert back to its previous value.
But it sounds like the file you're processing may actually have "\n\n" record-separators, or even "\r\r". You'd need to set the record-separator correctly for whatever file you're processing.
If your files are not huge multi-gigabyte files, the easiest and safest way is to read the whole file and use the generic newline metacharacter \R.
This way, it also works if some file actually uses LF instead of CRLF (or even the old Mac standard CR).
Use it with split if you also need the actual records:
perl -ln -0777 -e 'my @records = split /\R\R/; print scalar(@records)' $Your_File
Or if you only want to count the records:
perl -ln -0777 -e 'my $count=()=/\R\R/g; print $count' $Your_File
For more details, see also my other answer here to a similar question.
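The same idea in script form, as a rough sketch: the count_records name and the three sample strings are made up for illustration, and \R needs Perl 5.10 or later. Each sample holds two blank-line-separated records using a different newline convention:

```perl
use strict;
use warnings;

# Count blank-line-separated records regardless of newline convention.
sub count_records {
    my ($text) = @_;
    my @records = split /\R\R/, $text;   # split drops trailing empty fields
    return scalar @records;
}

# Hypothetical samples: the same two records with CRLF, LF and CR endings.
my %samples = (
    crlf => "record 1\015\012some text\015\012\015\012record 2\015\012some text\015\012",
    lf   => "record 1\012some text\012\012record 2\012some text\012",
    cr   => "record 1\015some text\015\015record 2\015some text\015",
);

for my $style (sort keys %samples) {
    printf "%-4s => %d records\n", $style, count_records($samples{$style});
}
```

Because \R treats CRLF as a single unit, a lone CRLF is never mistaken for two consecutive line breaks.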

Perl shortcuts "$/" and "$\"

I am confused by the perl shortcuts as to how they are used exactly.
I am much more confused about the variables $/ and $\.
Can you please help me in this as I am new to perl scripting.
For $/: It is the input record separator. When you read from an input source (e.g. a file) with my $line = <FILEHANDLE>, Perl reads data from the file until it encounters the content of $/. It defaults to the newline character "\n", which gives us the normal understanding of what a line is.
However, when you unset $/ then Perl will read the whole input stream in one call. It's therefore a common idiom to unset $/ locally and read the whole file, e.g.
my $whole_file = do {
local $/;
<FILE_HANDLE>
};
or something similar.
$\ on the other hand is always appended after each call to print. It is by default undefined, meaning you have to add things like newline characters yourself.
All those things are explained in detail in the perlvar documentation page.
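A small sketch of both variables at work, using in-memory filehandles so that it runs without any external file (the sample strings and variable names are illustrative only):

```perl
use strict;
use warnings;

my $data = "line 1\nline 2\nline 3\n";

# Default $/ ("\n"): each read returns one line.
open my $in, '<', \$data or die "Cannot open in-memory file: $!";
my $first_line = <$in>;
close $in;

# $/ undefined inside the do block: one read returns everything.
open $in, '<', \$data or die "Cannot open in-memory file: $!";
my $whole_file = do { local $/; <$in> };
close $in;

# $\ is appended after every print while it is set.
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory file: $!";
{
    local $\ = "\n";
    print {$out} 'alpha';
    print {$out} 'beta';
}
close $out;

print "first line: $first_line";
print "slurped ", length($whole_file), " characters\n";
print "buffer: $buffer";
```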

Properly detect line-endings of a file in Perl?

Problem: I have data (mostly in CSV format) produced on both Windows and *nix, and processed mostly on *nix. Windows uses CRLF for line endings and Unix uses LF. For any particular file I don't know whether it has windows or *nix line endings. Up until now, I've been writing something like this to handle the difference:
while (<$fh>){
tr/\r\n//d;
my @fields = split /,/, $_;
# ...
}
On *nix the \n part is equivalent to chomping, and additionally gets rid of \r (CR) if it's a windows-produced file.
But now I want to use Text::CSV_XS b/c I'm starting to get weirder data files with quoted data, potentially with embedded line-breaks, etc. In order to get this module to read such files, Text::CSV_XS::getline() requires that you specify the end-of-line characters. (I can't read each line as above, tr/\n\r//d, and then parse it with Text::CSV b/c that wouldn't handle embedded line-breaks properly). How do I properly detect whether an arbitrary file uses windows or *nix style line endings, so I can tell Text::CSV_XS::eol() how to chomp()?
I couldn't find a module on CPAN that simply detects line endings. I don't want to first convert all my datafiles via dos2unix, b/c the files are huge (hundreds of gigabytes), and spending 10+ minutes for each file to deal with something so simple seems silly. I thought about writing a function which reads the first several hundred bytes of a file and counts LF's vs CRLF's, but I refuse to believe this doesn't have a better solution.
Any help?
Note: all files have either entirely windows line endings or entirely *nix endings, i.e., the two are never mixed in a single file.
You could just open the file using the :crlf PerlIO layer and then tell Text::CSV_XS to use \n as the line ending character. This will silently map any CR/LF pairs to single line feeds, but that's presumably what you want.
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );
open( my $fh, '<:crlf', 'data.csv' ) or die $!;
while ( my $row = $csv->getline( $fh ) ) {
# do something with $row
}
Since Perl 5.10, you can use this to check general line endings,
s/\R//g;
It should work in all cases, both *nix and Windows.
Read in the first line of each file and look at its last-but-one character. If it is \r, the file comes from Windows; if not, it is *nix. Then seek to the beginning and start processing.
If it is possible for a file to have mixed line endings (e.g. a different type for embedded newlines), you can only guess.
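The first-line check could be sketched roughly as follows. The detect_eol and write_sample names are hypothetical, the file is reopened instead of seeking back, and the guess is only as good as the assumption that the first line ends with the file's real line terminator:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Guess the line ending of a file from its first line only.
sub detect_eol {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/ = "\012";                 # read up to and including the first LF
    my $first_line = <$fh>;
    close $fh;
    return undef unless defined $first_line;   # empty file: nothing to guess from
    return $first_line =~ /\015\012\z/ ? "\015\012" : "\012";
}

# Hypothetical helper that writes sample content byte-for-byte.
sub write_sample {
    my ($content) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $content;
    close $fh or die "close failed: $!";
    return $name;
}

my $eol_dos  = detect_eol(write_sample("a,b\015\012c,d\015\012"));
my $eol_unix = detect_eol(write_sample("a,b\012c,d\012"));

printf "DOS sample:  %s\n", $eol_dos  eq "\015\012" ? 'CRLF' : 'LF';
printf "Unix sample: %s\n", $eol_unix eq "\015\012" ? 'CRLF' : 'LF';
```

The returned string is the kind of value the question wants to hand to the CSV parser as its end-of-line setting.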
In theory, line endings cannot be determined reliably: is this file a single line with DOS line endings and embedded \ns, or is it a bunch of lines with a few stray \r characters at the end of some lines?
foo\n
ba\r\n
versus
foo\nba\r\n
If statistical analysis is not an option because it is too inaccurate and expensive (it takes time to scan such huge files), you have to actually know what the encoding is.
It would be best to specify the exact file format if you have control over the producing applications or to use some kind of metadata to keep track of the platform the data was produced on.
In Perl, what the character \n represents is platform dependent: \n/\012 on *nix machines, \r/\015 on old Macs and the sequence \r\n/\015\012 on DOS-descendants aka Windows. So to do reliable processing, you should use the octal values.
You can use the PERLIO variable. This has the advantage of not having to modify the source code of your scripts depending on the platform.
If you're dealing with DOS text files, set the environment variable PERLIO to :unix:crlf:
$ PERLIO=:unix:crlf my-script.pl dos-text-file.txt
If you're mainly dealing with DOS text files (e.g. on Cygwin), you could put this in your .bashrc:
export PERLIO=:unix:crlf
(I think that value should be the default for PERLIO on Cygwin, but apparently it's not.)

Opening a CSV file created in Mac Excel with Perl

I'm having a bit of trouble with the Perl code below. I can open and read in a CSV file that I've made manually, but if I try to open any Mac Excel spreadsheet that I save as a CSV file, the code below reads it all as a single line.
#!/usr/bin/perl
use strict;
use warnings;
open F, "file.csv";
foreach (<F>)
{
($first, $second, undef, undef) = split (',', $_);
}
print "$first : $second\n";
close(F);
Always use a specialised module (such as Text::CSV or Text::CSV_XS) for this purpose as there are lots of cases where split-ing will not help (for example when the fields contain a comma which is not a field separator but is within quotes).
Traditional Macintosh (Mac OS 9 and earlier) uses CR (0x0D, \r) as the line separator. Mac OS X (Unix based) uses LF (0x0A, \n) as the default line separator, so the perl script, being a Unix tool, is probably expecting LF but is getting CR. Since there are no LF separators in the file, perl thinks there is only one line. If it had Windows line endings (CR,LF) you'd probably be getting an invisible CR at the end of each line.
A quick loop over the input replacing 0x0D with 0x0A should fix your problem.
I've directly experienced this problem with Excel 2004 for Mac. The line endings are indeed \r, and IIRC, the text uses the MacRoman character set, rather than Latin-1 or UTF-8 as you might expect.
So as well as the good advice to use Text::CSV / Text::CSV_XS and splitting on \r, you will want to open the file using the MacRoman encoding like so:
open my $fh, "<:encoding(MacRoman)", $filename
or die "Can't read $filename: $!";
Likewise, when reading a file exported with Excel on Windows, you may wish to use :encoding(cp1252) instead of :encoding(MacRoman) in that code.
Not sure about Mac excel, but certainly the windows version tends to enclose all values in quotes: "like","this". Also, you need to take into account the possibility of there being a quote in the value, which would show up "like""this" (there's only a single " in that value).
To actually answer your question however, it's likely that it's using a different newline character from what you'd expect. It's probably saving as \r\n instead of \n, or vice versa.
As others have suspected, your line endings are probably to blame. On my Linux-based system there are builtin utilities to change these line endings. mac2unix (which I think is just a wrapper around dos2unix) will read your file and change the line endings for you. You should have something similar both on Linux and Mac (Microsoft may not care about you).
If you want to handle this in Perl, look into setting the $/ variable to change the "input record separator" from "\n" to "\r" (if that's the right ending). Try local $/ = "\r" before you read the file. Read more about it in perldoc perlvar (near $/) or in perldoc perlport (devoted to writing portable Perl code).
P.S. if I have some part of this incorrect let me know, I don't use Mac, I just think I know the theory
If you set the "special variable" that controls what Perl considers a newline to \r, you'll be able to read one line at a time: $/="\r"; In this particular case, the default newline for Perl on the Mac is \n, but the file is probably using \r. This builds on what Flynn1179 & Mark Thalman said but shows you what to do to use the while () style reading.
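A minimal sketch of reading CR-terminated lines by setting $/, using an in-memory string as a stand-in for the exported file (the sample data is hypothetical). The plain split on commas is only for illustration; as noted above, a CSV module is the better choice for real data:

```perl
use strict;
use warnings;

# Hypothetical stand-in for a CSV file with classic Mac (CR) line endings.
my $csv_text = "first,second,third\015one,two,three\015";
open my $fh, '<', \$csv_text or die "Cannot open in-memory file: $!";

my @rows;
{
    local $/ = "\015";                # a line ends at CR
    while (my $line = <$fh>) {
        chomp $line;                  # removes the trailing CR
        push @rows, [ split /,/, $line ];
    }
}
close $fh;

print "$_->[0] : $_->[1]\n" for @rows;
```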

is there a way to designate the line token delimiter in Perl's file reader?

I'm reading in a text file via CGI, in Perl, and noticing that when the file is saved in Mac's TextEdit the line separator is recognized, but when I upload a CSV that is exported straight from Excel, it is not. I'm guessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator token to be, if I didn't want the one it's looking for by default.
Yes. You'll want to overwrite the value of $/. From perlvar
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)
local $/; # enable "slurp" mode
local $_ = <FH>; # whole file now here
s/\n[ \t]+/ /g;
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.
On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
See also "Newlines" in perlport. Also see $..
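To make the quoted behaviour concrete, here is a small sketch that reads the same in-memory text with different values of $/ (the read_all helper and the sample strings are hypothetical). It contrasts paragraph mode with a literal "\n\n" separator, and shows fixed-size record reads through a reference to an integer:

```perl
use strict;
use warnings;

# Read every record from an in-memory string using the given separator.
sub read_all {
    my ($text, $separator) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = $separator;
    my @records = <$fh>;
    close $fh;
    return @records;
}

# Two paragraphs separated by three empty lines.
my $text = "para 1\n\n\n\npara 2\n";

my @paragraph_mode = read_all($text, '');      # runs of empty lines collapse
my @double_newline = read_all($text, "\n\n");  # every "\n\n" ends a record
my @fixed_size     = read_all('abcdefghij', \4);  # records of at most 4 bytes

printf "paragraph mode: %d records\n", scalar @paragraph_mode;
printf "\"\\n\\n\" mode:    %d records\n", scalar @double_newline;
printf "fixed size:     %s\n", join '|', @fixed_size;
```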
The variable has multiple names:
$/
$RS
$INPUT_RECORD_SEPARATOR
For the longer names, you need:
use English;
Remember to localize carefully:
{
local($/) = "\r\n";
...code to read...
}
If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
This will transparently convert \r\n sequences into \n sequences.
You can also apply this translation to an existing handle by doing:
binmode( $fh, ':crlf' );
:crlf mode is typically default in Win32 Perl environments and works very well in practice.
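A short sketch of applying the layer to an already opened handle, with a temporary file as hypothetical sample data:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file with CRLF line endings, written byte-for-byte.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "alpha\015\012beta\015\012";
close $tmp_fh or die "close failed: $!";

open my $fh, '<', $tmp_name or die "Cannot open $tmp_name: $!";
binmode($fh, ':crlf') or die "Cannot set :crlf layer: $!";

my @lines = <$fh>;                    # each line now ends in "\n", not "\r\n"
close $fh;
chomp @lines;

print join(', ', @lines), "\n";
```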
For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.
But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabyte file, that should be the safest and easiest way.
So:
perl -ln -0777 -e 'my @lines = split /\R/;
print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE
or in your script:
{
local $/ = undef;
open F, $YOUR_FILE or die;
@lines = split /\R/, <F>;
close F;
}
\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.
From the perldoc :
\R matches a generic newline; that is, anything considered a linebreak
sequence by Unicode. This includes all characters matched by \v
(vertical whitespace), and the multi character sequence "\x0D\x0A"
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files
opened in binary mode)
Or see this much nicer and exhaustive explanation about \R in Brian D Foy's article : The \R generic line ending which even has a couple of fun videos.