Compare two UTF-8 text files and ignore lines that are blank or all whitespace - perl

I am an author maintaining Kindle (HTML) and OpenOffice versions of a book. I sometimes forget to make changes to one or the other, and the documents are diverging.
My procedure is to copy the text from each and paste it into separate UTF-8 text files (using Paste and Match Style in TextEdit), then perform a differencing operation. However, the HTML paste adds blank lines between paragraphs.
I have a file differencing tool, but it has no option to ignore blank lines. My thought was to write a Perl script to remove the blank lines. However, the output of that script corrupts the special characters, such as en dashes and curly quotes. I have tried using binmode and other tricks, to no avail.
I will accept a pointer to a free comparator for Mac OS X that ignores blank lines, or a way to get Perl to leave the UTF-8 special characters intact. I am using Perl 5.14. I prefer answers that do not rely upon newer features, but if I have to install a new Perl, I will.
UPDATE:
This does not work:
use open IO => ":encoding(iso-8859-7)";
open(FILE, "From HTML.txt") or die "$!\n";
open(OUT, ">From HTML - no blank lines.txt") or die "$!\n";
while(<FILE>) {
next if /^\s*$/;
print OUT $_;
}
close FILE; close OUT;
I also tried calling binmode(OUT, ":utf8");
UPDATE: Tried without success this tip from another Stackoverflow question:
open(my $fh, "<:encoding(UTF-8)", "filename");

GNU diff has -B/--ignore-blank-lines and -b/--ignore-space-change.

Err, that "use open" says that your data is not UTF-8. Try binmode on both FILE and OUT?
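A minimal sketch of what that suggestion could look like, assuming both files really are UTF-8: drop the iso-8859-7 pragma and put an explicit :encoding(UTF-8) layer on both the input and the output handle. The sub name strip_blank_lines is made up for illustration; everything else is core Perl and should work on 5.14.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Copy $in to $out, dropping lines that are empty or whitespace-only.
# Both handles decode/encode UTF-8 explicitly, so characters such as
# en dashes and curly quotes should pass through unchanged.
# (strip_blank_lines is a hypothetical helper name, not from the question.)
sub strip_blank_lines {
    my ($in, $out) = @_;
    open my $src, '<:encoding(UTF-8)', $in  or die "Cannot read $in: $!\n";
    open my $dst, '>:encoding(UTF-8)', $out or die "Cannot write $out: $!\n";
    while (my $line = <$src>) {
        next if $line =~ /^\s*$/;    # skip blank and all-whitespace lines
        print {$dst} $line;
    }
    close $src;
    close $dst or die "Cannot close $out: $!\n";
}

# Usage: perl strip_blank.pl "From HTML.txt" "From HTML - no blank lines.txt"
strip_blank_lines(@ARGV) if @ARGV == 2;
```

If the output still looks wrong, it may be worth checking that the viewer or diff tool is itself reading the result as UTF-8.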

I ended up using the Xcode text editor. By selecting a newline and pasting it into the search/replace dialog, I was able to replace all double newlines with single newlines.
Then I saved the file and used my Compare utility.

Related

Different behaviors of reading from files generated on different machines

I have a folder of several hundred text files. Each file has the same format; for instance, the file named ATextFile1.txt reads
ATextFile1.txt 09 Oct 2013
1
2
3
4
...
I have a simplified Perl script that is supposed to read the file and print it back out in the terminal window:
#!/usr/bin/perl
use warnings;
use strict;
my $fileName = shift(@ARGV);
open(my $INFILE, "<:encoding(UTF-8)", $fileName) || die("Cannot open $fileName: $!.\n");
foreach (<$INFILE>){
print("$_"); # Uses the newline character from the file
}
When I use this script on files generated by the Windows version of the program that produces ATextFile1.txt, the output is exactly what I'd expect (the content of the text file). However, when I run it on files generated by the Mac version of that program, the output looks like the following:
2016tFile1.txt 09 Oct 2013
After some testing, it seems that it is only printing the first line of the text where the first 4 characters are overwritten by what can be expressed in RegEx as /[0-9][0-9]16/. If in my Perl script, I replace the output statement with print("\t$_");, I get the following line printed to STDOUT:
2016 ATextFile1.txt 09 Oct 2013
Each of these files can be read normally using any standard text editor but for some reason, my Perl script can't seem to properly read and write from the file. Any help would be greatly appreciated (I'm hoping it's something obvious that I'm missing). Thanks in advance!
Note that if you are printing UTF-8 characters to STDOUT you will need to use
binmode STDOUT, ':encoding(utf8)';
beforehand.
It looks as if your Mac files have just CR as the line ending. As I understand it, recent versions of Mac OS X use LF as the line ending (the same as Linux), but Mac OS 9 used just CR. Windows uses the two characters CR LF inside the file, which the PerlIO layer converts to just LF when perl is running on a Windows platform.
If there are no linefeeds in the file, then Perl will read the entire file as a single record, and printing it will overlay all lines on top of one another.
As long as the files are relatively small, the easiest way to read either file format with the same Perl code is to read the whole file and split it on either CR or LF. Anything else will need different code according to the source of the input files.
Try this version of your code.
use strict;
use warnings;
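my @contents = do {
    open my $fh, '<:encoding(utf8)', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
    local $/;                      # slurp mode: read the whole file at once
    my $contents = <$fh>;
    split /[\r\n]+/, $contents;    # split on any run of CR and/or LF
};
print "$_\n" for @contents;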
Update
One alternative you might try is to use the PerlIO::eol module, which provides a PerlIO layer that translates any line ending to LF when the record is read. I'm not certain that it plays nice with UTF-8, but as long as you add it after the encoding layer it should be fine.
It is not a core module so you will probably need to install it, but after that the program becomes just
use strict;
use warnings;
open my $fh, '<:encoding(UTF-8):eol(LF)', $ARGV[0];
binmode STDOUT, ':encoding(utf8)';
print while <$fh>;
I have created Windows, Linux, and Mac-style text files and this program works fine with all of them, but I have been unable to check whether a UTF-8 character that has 0x0D or 0x0A as part of its encoding is passed through properly, so be careful.
Update 2
After thinking briefly about this, of course there are no UTF-8 encodings that contain CR or LF apart from those characters themselves. All characters outside the ASCII range are encoded using only bytes with the top bit set, so every byte is 0x80 or higher and can never be 0x0D or 0x0A.
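A quick way to convince yourself of this is to encode a few characters and look at the bytes. This is only a sketch using the core Encode module; the helper name has_cr_or_lf_byte and the sample code points are arbitrary choices, picked because their code point numbers end in 0D.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return 1 if the UTF-8 encoding of $char contains a CR (0x0D) or LF (0x0A)
# byte, 0 otherwise. (has_cr_or_lf_byte is a hypothetical helper name.)
sub has_cr_or_lf_byte {
    my ($char) = @_;
    my $bytes = encode('UTF-8', $char);
    return $bytes =~ /[\x0D\x0A]/ ? 1 : 0;
}

# Sample code points: en dash, left curly quote, and two whose numbers end in 0D.
for my $cp (0x2013, 0x201C, 0x010D, 0x0A0D) {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', chr $cp);
    printf "U+%04X => %s (CR/LF byte: %s)\n", $cp, $hex,
        has_cr_or_lf_byte(chr $cp) ? 'yes' : 'no';
}
```

Only the actual CR and LF characters (U+000D and U+000A) should ever report "yes".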

Properly detect line-endings of a file in Perl?

Problem: I have data (mostly in CSV format) produced on both Windows and *nix, and processed mostly on *nix. Windows uses CRLF for line endings and Unix uses LF. For any particular file I don't know whether it has windows or *nix line endings. Up until now, I've been writing something like this to handle the difference:
while (<$fh>){
tr/\r\n//d;
my @fields = split /,/, $_;
# ...
}
On *nix the \n part is equivalent to chomping, and additionally gets rid of \r (CR) if it's a windows-produced file.
But now I want to use Text::CSV_XS b/c I'm starting to get weirder data files with quoted data, potentially with embedded line-breaks, etc. In order to get this module to read such files, Text::CSV_XS::getline() requires that you specify the end-of-line characters. (I can't read each line as above, tr/\n\r//d, and then parse it with Text::CSV b/c that wouldn't handle embedded line-breaks properly.) How do I properly detect whether an arbitrary file uses Windows or *nix style line endings, so I can tell Text::CSV_XS::eol() how to chomp()?
I couldn't find a module on CPAN that simply detects line endings. I don't want to first convert all my datafiles via dos2unix, b/c the files are huge (hundreds of gigabytes), and spending 10+ minutes for each file to deal with something so simple seems silly. I thought about writing a function which reads the first several hundred bytes of a file and counts LF's vs CRLF's, but I refuse to believe this doesn't have a better solution.
Any help?
Note: each file has either entirely Windows line endings or entirely *nix line endings, i.e., the two are never mixed in a single file.
You could just open the file using the :crlf PerlIO layer and then tell Text::CSV_XS to use \n as the line ending character. This will silently map any CR/LF pairs to single line feeds, but that's presumably what you want.
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );
open( my $fh, '<:crlf', 'data.csv' ) or die $!;
while ( my $row = $csv->getline( $fh ) ) {
# do something with $row
}
Since Perl 5.10, you can use this to strip any kind of line ending:
s/\R//g;
It should work in all cases, both *nix and Windows.
Read in the first line of each file and look at its last-but-one character. If it is \r, the file comes from Windows; if not, it is *nix. Then seek to the beginning and start processing.
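Roughly, that first-line check might look like the sketch below. It leans on the asker's note that files are never mixed, and on the further assumption that the first record contains no quoted, embedded line break. The sub name detect_eol is invented here; rather than seeking, it simply closes the file again so the caller can reopen it and pass the result on as the CSV parser's eol setting.

```perl
use strict;
use warnings;

# Guess the line ending of $path from its first line only.
# Returns "\r\n" for Windows-style files and "\n" otherwise.
# Assumes one style per file and no embedded line break in the first record.
# (detect_eol is a hypothetical helper name.)
sub detect_eol {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    local $/ = "\n";             # read up to the first LF, whatever precedes it
    my $first = <$fh>;
    close $fh;
    return (defined $first && $first =~ /\r\n\z/) ? "\r\n" : "\n";
}
```

Since only one line is read, the cost should not depend on the size of the file, provided that first line is of ordinary length.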
If it is possible for a file to have mixed line endings (e.g. a different type for embedded newlines), you can only guess.
In theory line endings cannot be determined reliably: Is this file a single line with DOS line endings and embedded \ns, or is it a bunch of lines with a few stray \r characters at the end of some lines?
foo\n
ba\r\n
versus
foo\nba\r\n
If statistical analysis is not an option because it is too inaccurate and expensive (it takes time to scan such huge files), you have to actually know what the encoding is.
It would be best to specify the exact file format if you have control over the producing applications or to use some kind of metadata to keep track of the platform the data was produced on.
In Perl, what the character \n represents is platform dependent: \n/\012 on *nix machines, \r/\015 on old Macs, and the sequence \r\n/\015\012 on DOS descendants such as Windows. So to do reliable processing, you should use the octal values.
You can use the PERLIO variable. This has the advantage of not having to modify the source code of your scripts depending on the platform.
If you're dealing with DOS text files, set the environment variable PERLIO to :unix:crlf:
$ PERLIO=:unix:crlf my-script.pl dos-text-file.txt
If you're mainly dealing with DOS text files (e.g. on Cygwin), you could put this in your .bashrc:
export PERLIO=:unix:crlf
(I think that value should be the default for PERLIO on Cygwin, but apparently it's not.)

Opening a CSV file created in Mac Excel with Perl

I'm having a bit of trouble with the Perl code below. I can open and read in a CSV file that I've made manually, but if I try to open any Mac Excel spreadsheet that I save as a CSV file, the code below reads it all as a single line.
#!/usr/bin/perl
use strict;
use warnings;
my ($first, $second);
open F, "file.csv" or die "Cannot open file.csv: $!";
foreach (<F>)
{
($first, $second, undef, undef) = split (',', $_);
}
print "$first : $second\n";
close(F);
Always use a specialised module (such as Text::CSV or Text::CSV_XS) for this purpose as there are lots of cases where split-ing will not help (for example when the fields contain a comma which is not a field separator but is within quotes).
Traditional Macintosh (System 9 and previous) uses CR (0x0D, \r) as the line separator. Mac OS X (Unix based) uses LF (0x0A, \n) as the default line separator, so the perl script, being a Unix tool, is probably expecting LF but is getting CR. Since there are no LF line separators in the file, perl thinks there is only one line. If it had Windows line endings (CR, LF) you'd probably be getting an invisible CR at the end of each line.
A quick loop over the input replacing 0x0D with 0x0A should fix your problem.
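One way to sketch that replacement in Perl, under the assumption that the file is small enough to read into memory. The helper names normalize_newlines and read_lines are invented here, and the substitution is a slight variation on a plain 0x0D to 0x0A swap: it also folds a CR LF pair into a single LF so Windows-style files do not end up with empty lines.

```perl
use strict;
use warnings;

# Turn CR and CR LF line endings into LF. Plain LF is left alone.
# (normalize_newlines and read_lines are hypothetical helper names.)
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}

# Slurp $path as raw bytes and return its lines without line endings.
sub read_lines {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    my $text = do { local $/; <$fh> };
    close $fh;
    return split /\n/, normalize_newlines($text);
}
```

Each returned line could then be handed to a CSV module rather than to split.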
I've directly experienced this problem with Excel 2004 for Mac. The line endings are indeed \r, and IIRC, the text uses the MacRoman character set, rather than Latin-1 or UTF-8 as you might expect.
So as well as the good advice to use Text::CSV / Text::CSV_XS and splitting on \r, you will want to open the file using the MacRoman encoding like so:
open my $fh, "<:encoding(MacRoman)", $filename
or die "Can't read $filename: $!";
Likewise, when reading a file exported with Excel on Windows, you may wish to use :encoding(cp1252) instead of :encoding(MacRoman) in that code.
Not sure about Mac excel, but certainly the windows version tends to enclose all values in quotes: "like","this". Also, you need to take into account the possibility of there being a quote in the value, which would show up "like""this" (there's only a single " in that value).
To actually answer your question however, it's likely that it's using a different newline character from what you'd expect. It's probably saving as \r\n instead of \n, or vice versa.
As others have suspected, your line endings are probably to blame. On my Linux-based system there are builtin utilities to change these line endings. mac2unix (which I think is just a wrapper around dos2unix) will read your file and change the line endings for you. You should have something similar both on Linux and Mac (Microsoft may not care about you).
If you want to handle this in Perl, look into setting the $/ variable to change the "input record separator" from "\n" to "\r" (if that's the right ending). Try local $/ = "\r" before you read the file. Read more about it in perldoc perlvar (near $/) or in perldoc perlport (devoted to writing portable Perl code).
P.S. if I have some part of this incorrect let me know, I don't use Mac, I just think I know the theory
If you set the "special variable" that controls what Perl considers a newline to \r, you'll be able to read one line at a time: $/="\r";. In this particular case Perl's default newline on the Mac is \n, but the file is probably using \r. This builds off what Flynn1179 & Mark Thalman said but shows you what to do to use the while (<F>) style of reading.
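Put together, the $/ approach might look something like this. It is a sketch only: it assumes the file really does use bare \r line endings, it keeps the question's naive comma split (so no quoted fields), and the sub name first_two_fields is made up for illustration.

```perl
use strict;
use warnings;

# Read a CR-terminated CSV file record by record and return the first two
# fields of each record as array refs. Naive split: no quoted-field handling.
# (first_two_fields is a hypothetical helper name.)
sub first_two_fields {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!\n";
    local $/ = "\r";             # records end in CR; restored when the sub returns
    my @pairs;
    while (my $line = <$fh>) {
        chomp $line;             # chomp removes the current $/, i.e. the CR
        my ($first, $second) = split /,/, $line;
        push @pairs, [$first, $second];
    }
    close $fh;
    return @pairs;
}

# Usage: perl fields.pl file.csv
if (@ARGV) {
    printf "%s : %s\n", @$_ for first_two_fields($ARGV[0]);
}
```

Because $/ is localized inside the sub, the rest of the program keeps reading "\n"-terminated lines as usual.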

is there a way to designate the line token delimiter in Perl's file reader?

I'm reading a text file in Perl via CGI, and I've noticed that when the file is saved in Mac's TextEdit the line separators are recognized, but when I upload a CSV that is exported straight from Excel, they are not. I'm guessing it's a \n vs. \r issue, but it got me thinking that I don't know how to specify what I would like the line terminator token to be, if I didn't want the one it's looking for by default.
Yes. You'll want to overwrite the value of $/. From perlvar
$/
The input record separator, newline by default. This influences Perl's idea of what a "line" is. Works like awk's RS variable, including treating empty lines as a terminator if set to the null string. (An empty line cannot contain any spaces or tabs.) You may set it to a multi-character string to match a multi-character terminator, or to undef to read through the end of file. Setting it to "\n\n" means something slightly different than setting to "", if the file contains consecutive empty lines. Setting to "" will treat two or more consecutive empty lines as a single empty line. Setting to "\n\n" will blindly assume that the next input character belongs to the next paragraph, even if it's a newline. (Mnemonic: / delimits line boundaries when quoting poetry.)
local $/; # enable "slurp" mode
local $_ = <FH>; # whole file now here
s/\n[ \t]+/ /g;
Remember: the value of $/ is a string, not a regex. awk has to be better for something. :-)
Setting $/ to a reference to an integer, scalar containing an integer, or scalar that's convertible to an integer will attempt to read records instead of lines, with the maximum record size being the referenced integer. So this:
local $/ = \32768; # or \"32768", or \$var_containing_32768
open my $fh, "<", $myfile or die $!;
local $_ = <$fh>;
will read a record of no more than 32768 bytes from FILE. If you're not reading from a record-oriented file (or your OS doesn't have record-oriented files), then you'll likely get a full chunk of data with every read. If a record is larger than the record size you've set, you'll get the record back in pieces. Trying to set the record size to zero or less will cause reading in the (rest of the) whole file.
On VMS, record reads are done with the equivalent of sysread, so it's best not to mix record and non-record reads on the same file. (This is unlikely to be a problem, because any file you'd want to read in record mode is probably unusable in line mode.) Non-VMS systems do normal I/O, so it's safe to mix record and non-record reads of a file.
See also "Newlines" in perlport. Also see $..
The variable has multiple names:
$/
$RS
$INPUT_RECORD_SEPARATOR
For the longer names, you need:
use English;
Remember to localize carefully:
{
local($/) = "\r\n";
...code to read...
}
If you are reading in a file with CRLF line terminators, you can open it with the CRLF discipline, or set the binmode of the handle to do automatic translation.
open my $fh, '<:crlf', 'the_csv_file.csv' or die "Oh noes $!";
This will transparently convert \r\n sequences into \n sequences.
You can also apply this translation to an existing handle by doing:
binmode( $fh, ':crlf' );
:crlf mode is typically default in Win32 Perl environments and works very well in practice.
For reading a CSV file, follow Robert-P's advice in his comment, and use a CSV module.
But for the general case of reading lines from a file with different line-endings, what I generally do is slurp the file whole and split it on \R. If it's not a multi-gigabytes file, that should be the safest and easiest way.
So:
perl -ln -0777 -e 'my @lines = split /\R/;
print length($_), " bytes split into ", scalar(@lines), " lines."' $YOUR_FILE
or in your script:
{
local $/ = undef;
open F, $YOUR_FILE or die;
@lines = split /\R/, <F>;
close F;
}
\R works with Unix LF (\x0A), Windows/Internet CRLF, and also with CR (\x0D) which was used by Macs in the nineties, but is in fact still used by some Mac programs.
From the perldoc :
\R matches a generic newline; that is, anything considered a linebreak
sequence by Unicode. This includes all characters matched by \v
(vertical whitespace), and the multi character sequence "\x0D\x0A"
(carriage return followed by a line feed, sometimes called the network
newline; it's the end of line sequence used in Microsoft text files
opened in binary mode)
Or see this much nicer and exhaustive explanation about \R in Brian D Foy's article : The \R generic line ending which even has a couple of fun videos.

How do I process lines with CRLF, NEL line terminators?

I need to process a file with shift_jis encoding. However, the line terminators are in a format that I'm not familiar with.
> file record.CSV
record.CSV: Non-ISO extended-ASCII text, with CRLF, NEL line terminators
I'm using the usual approach:
open my $CSV_FILE, "<:encoding(shift_jis)", $filename or die "Could not open $filename: $!";
while (<$CSV_FILE>) {
chomp;
# do stuff
}
However it is still leaving a CR at the end of each record.
What is the correct way to terminate files of these types?
Why not do $_ =~ s/\r// manually?
Edit: apparently, you can also do
require Encode;
use Unicode::Normalize;
s/\x{0085}//g;
to remove the NEL: Next Line, U+0085 characters.
You need to consider who's consuming the data and learn more about the environment which produced these files. If it's a plain-vanilla CSV output file you're after in the end, use any old string manipulation you like to get rid of them (and produce CRLF terminators in their stead) and you'll be fine.
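As a sketch of the "strip it yourself" route: chomp only removes the trailing "\n", so the CR is left behind. Replacing chomp with a substitution on \R (Perl 5.10+) removes a trailing CR LF as one unit. The sub name read_sjis_records is invented for illustration, and whether the NEL reported by file(1) is a real line terminator or just a byte inside a Shift_JIS character is something you would need to check against your data.

```perl
use strict;
use warnings;

# Read a Shift_JIS file and return its records with any trailing
# LF, CR LF, or CR removed.
# (read_sjis_records is a hypothetical helper name.)
sub read_sjis_records {
    my ($filename) = @_;
    open my $fh, '<:encoding(shift_jis)', $filename
        or die "Could not open $filename: $!\n";
    my @records;
    while (my $line = <$fh>) {
        $line =~ s/\R\z//;       # \R matches CR LF as a single line break
        push @records, $line;
    }
    close $fh;
    return @records;
}
```

Opening the file with an additional :crlf layer, as described in the threads above, would be another option for the CR LF part.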