Different behaviors of reading from files generated on different machines - perl

I have a folder of several hundred text files. Each file has the same format; for instance, the file named ATextFile1.txt reads
ATextFile1.txt 09 Oct 2013
1
2
3
4
...
I have a simplified Perl script that is supposed to read the file and print it back out in the terminal window:
#!/usr/bin/perl
use warnings;
use strict;
my $fileName = shift(@ARGV);
open(my $INFILE, "<:encoding(UTF-8)", $fileName) || die("Cannot open $fileName: $!.\n");
foreach (<$INFILE>) {
    print("$_"); # Uses the newline character from the file
}
When I use this script on files generated by the Windows version of the program that generates ATextFile1.txt, my output is exactly as I'd expect (the content of the text file). However, when I run it on files generated by the Mac version of the same program, the output looks like the following:
2016tFile1.txt 09 Oct 2013
After some testing, it seems that it only prints the first line of the file, with the first four characters overwritten by something matching the regex /[0-9][0-9]16/. If, in my Perl script, I replace the output statement with print("\t$_");, I get the following line printed to STDOUT:
2016 ATextFile1.txt 09 Oct 2013
Each of these files can be read normally in any standard text editor, but for some reason my Perl script can't seem to read the file and print it back properly. Any help would be greatly appreciated (I'm hoping it's something obvious that I'm missing). Thanks in advance!

Note that if you are printing UTF-8 characters to STDOUT you will need to use
binmode STDOUT, ':encoding(utf8)';
beforehand.
It looks as if your Mac files have just CR as the line ending. I understood that recent Macintosh systems used LF as the line ending (the same as Linux), but Mac OS 9 and earlier use just CR, while Windows uses the two characters CR LF inside the file, which is converted to just LF by the PerlIO layer when perl is running on a Windows platform.
If there are no linefeeds in the file, then Perl will read the entire file as a single record, and printing it will overlay all lines on top of one another.
As long as the files are relatively small, the easiest way to read either file format with the same Perl code is to read the whole file and split it on either CR or LF. Anything else will need different code according to the source of the input files.
Try this version of your code.
use strict;
use warnings;
my @contents = do {
    open my $fh, '<:encoding(utf8)', $ARGV[0];
    local $/;
    my $contents = <$fh>;
    split /[\r\n]+/, $contents;
};
print "$_\n" for @contents;
Update
One alternative you might try is to use the PerlIO::eol module, which provides a PerlIO layer that translates any line ending to LF when the record is read. I'm not certain that it plays nice with UTF-8, but as long as you add it after the encoding layer it should be fine.
It is not a core module so you will probably need to install it, but after that the program becomes just
use strict;
use warnings;
open my $fh, '<:encoding(UTF-8):eol(LF)', $ARGV[0];
binmode STDOUT, ':encoding(utf8)';
print while <$fh>;
I have created Windows, Linux, and Mac-style text files and this program works fine with all of them, but I have been unable to check whether a UTF-8 character that has 0x0D or 0x0A as part of its encoding is passed through properly, so be careful.
Update 2
After thinking briefly about this: of course there are no UTF-8 encodings that contain CR or LF apart from those characters themselves. Every byte of a character outside the ASCII range has its top bit set, so it is at least 0x80 and can never be 0x0D or 0x0A.
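For anyone who wants to see that claim in bytes, here is a minimal sketch (the character é is just an arbitrary example):
use Encode qw(encode);
# U+00E9 ("é") encodes in UTF-8 as the two bytes 0xC3 0xA9; both have the top bit set
printf "0x%02X\n", $_ for unpack 'C*', encode('UTF-8', "\x{E9}");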

Related

Reading file breaks encoding in Perl

I have a script for reading HTML files in Perl. It works, but it breaks the encoding.
This is my script:
use utf8;
use Data::Dumper;
open my $fr, '<', 'file.html' or die "Can't open file $!";
my $content_from_file = do { local $/; <$fr> };
print Dumper($content_from_file);
Content of file.html:
<span class="previews-counter">Počet hodnotení: [%product.rating_votes%]</span>
[%L10n.msg('Zobraziť recenzie')%]
Output from reading:
<span class=\"previews-counter\">Po\x{10d}et hodnoten\x{ed}: [%product.rating_votes%]</span>
[%L10n.msg('Zobrazi\x{165} recenzie')%]
As you can see, a lot of characters are escaped. How can I read this file and show its content as it is?
You open the file with perl's default encoding:
open my $fh, '<', ...;
If that encoding doesn't match the actual encoding, Perl might translate some characters incorrectly. If you know the encoding, specify it in the open mode:
open my $fh, '<:utf8', ...;
You aren't done yet, though. Now that you have a (probably) decoded string, you want to output it. You have the same problem again: the encoding on the standard output filehandle has to match whatever you are printing to. If you've set up your terminal (or whatever) to expect UTF-8, you need to actually output UTF-8. One way to fix that is to make the standard filehandles use UTF-8:
use open qw(:std :utf8);
You have use utf8, but that only signals the encoding for your program file.
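Putting both fixes together, the script might look like the following sketch. It uses the stricter :encoding(UTF-8) spelling, assumes file.html really is UTF-8, and prints the string directly rather than through Data::Dumper, since Dumper escapes non-ASCII characters by design:
use utf8;                            # the source code itself is UTF-8
use open qw(:std :encoding(UTF-8));  # decode input handles, encode STDOUT/STDERR
open my $fr, '<', 'file.html' or die "Can't open file: $!";
my $content_from_file = do { local $/; <$fr> };
print $content_from_file;            # shows the characters, not \x{...} escapes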
I've written a much longer primer for Perl and Unicode in the back of Learning Perl. The StackOverflow question Why does modern Perl avoid UTF-8 by default? has lots of good advice.

Properly detect line-endings of a file in Perl?

Problem: I have data (mostly in CSV format) produced on both Windows and *nix, and processed mostly on *nix. Windows uses CRLF for line endings and Unix uses LF. For any particular file I don't know whether it has windows or *nix line endings. Up until now, I've been writing something like this to handle the difference:
while (<$fh>) {
    tr/\r\n//d;
    my @fields = split /,/, $_;
    # ...
}
On *nix the \n part is equivalent to chomping, and additionally gets rid of the \r (CR) if it's a Windows-produced file.
But now I want to use Text::CSV_XS because I'm starting to get weirder data files with quoted data, potentially with embedded line breaks, etc. In order to get this module to read such files, Text::CSV_XS::getline() requires that you specify the end-of-line characters. (I can't read each line as above with tr/\r\n//d and then parse it with Text::CSV, because that wouldn't handle embedded line breaks properly.) How do I properly detect whether an arbitrary file uses Windows or *nix style line endings, so I can tell Text::CSV_XS::eol() how to chomp()?
I couldn't find a module on CPAN that simply detects line endings. I don't want to first convert all my data files via dos2unix, because the files are huge (hundreds of gigabytes), and spending 10+ minutes per file to deal with something so simple seems silly. I thought about writing a function which reads the first several hundred bytes of a file and counts LFs vs CRLFs, but I refuse to believe this doesn't have a better solution.
Any help?
Note: all files either have entirely Windows line endings or entirely *nix endings; i.e., the two are not mixed within a single file.
You could just open the file using the :crlf PerlIO layer and then tell Text::CSV_XS to use \n as the line ending character. This will silently map any CR/LF pairs to single line feeds, but that's presumably what you want.
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } );
open( my $fh, '<:crlf', 'data.csv' ) or die $!;
while ( my $row = $csv->getline($fh) ) {
    # do something with $row
}
Since Perl 5.10, you can use this to strip any style of line ending,
s/\R//g;
It works in all cases, both *nix and Windows.
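For example, as a portable chomp inside a read loop (a sketch; the filename is illustrative):
open my $fh, '<', 'data.csv' or die $!;
while (my $line = <$fh>) {
    $line =~ s/\R\z//;   # strips CRLF, CR, or LF, whatever the source platform
    # ... process $line ...
}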
Read in the first line of each file and look at its second-to-last character. If it is \r, the file comes from Windows; if not, it is *nix. Then seek back to the beginning and start processing, as in the sketch below.
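A sketch of that approach, feeding the detected ending to Text::CSV_XS (the filename is illustrative; it assumes only Windows or *nix endings, per the question's note):
use Text::CSV_XS;
open my $fh, '<', 'data.csv' or die $!;
my $first = <$fh>;
my $eol = (defined $first && $first =~ /\r\n\z/) ? "\r\n" : "\n";
seek $fh, 0, 0 or die $!;   # rewind and process from the start
my $csv = Text::CSV_XS->new( { binary => 1, eol => $eol } );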
If it is possible for a file to have mixed line endings (e.g. a different type for embedded newlines), you can only guess.
In theory line endings cannot be determined reliably: is this file a single line with DOS line endings and embedded \ns, or is it a bunch of lines with a few stray \r characters at the end of some lines?
foo\n
ba\r\n
versus
foo\nba\r\n
If statistical analysis is not an option because it is too inaccurate and expensive (it takes time to scan such huge files), you have to actually know what the encoding is.
It would be best to specify the exact file format if you have control over the producing applications or to use some kind of metadata to keep track of the platform the data was produced on.
In Perl, what the character \n represents is platform dependent: the line ending is \n/\012 on *nix machines, \r/\015 on old Macs, and the sequence \r\n/\015\012 on DOS descendants, aka Windows. So to do reliable processing, you should use the octal values.
You can use the PERLIO variable. This has the advantage of not having to modify the source code of your scripts depending on the platform.
If you're dealing with DOS text files, set the environment variable PERLIO to :unix:crlf:
$ PERLIO=:unix:crlf my-script.pl dos-text-file.txt
If you're mainly dealing with DOS text files (e.g. on Cygwin), you could put this in your .bashrc:
export PERLIO=:unix:crlf
(I think that value should be the default for PERLIO on Cygwin, but apparently it's not.)

Compare two UTF-8 text files and ignore lines that are blank or all whitespace

I am an author maintaining Kindle(HTML) and Open Office versions of a book. I sometimes forget to make changes to one or the other, and the documents are diverging.
My procedure is to copy the text from each and paste into separate text files (using paste and match style in TextEdit) in UTF-8, then perform a differencing operation. However the HTML paste adds blank lines between paragraphs.
I have a file differencing tool, but it has no option to ignore blank lines. My thought was to write a Perl script to remove the blank lines. However, the output of that script screws up the special characters, like en dashes, curly quotes, etc. I have tried using binmode and other tricks, to no avail.
I will accept a pointer to a free comparator for Mac OS X that ignores blank lines, or a way to get Perl to not screw up the UTF-8 special characters. I am using Perl 5.14. I prefer answers that do not rely on newer features, but if I have to install a new Perl, I will.
UPDATE:
This does not work:
use open IO => ":encoding(iso-8859-7)";
open(FILE, "From HTML.txt") or die "$!\n";
open(OUT, ">From HTML - no blank lines.txt") or die "$!\n";
while (<FILE>) {
    next if /^\s*$/;
    print OUT $_;
}
close FILE; close OUT;
I also tried calling binmode(OUT, ":utf8");
UPDATE: Tried without success this tip from another Stackoverflow question:
open(my $fh, "<:encoding(UTF-8)", "filename");
GNU diff has -B/--ignore-blank-lines and -b/--ignore-space-change.
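For instance (the first filename is taken from the question's script; the second is assumed):
$ diff -B "From HTML.txt" "From OO.txt"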
Err, that "use open" says that your data is not UTF-8. Try binmode on both FILE and OUT?
I ended up using the Xcode text editor. By selecting a newline and pasting it into the search/replace dialog, I was able to replace all double newlines with single newlines.
Then I saved the file and used my Compare utility.

Opening a CSV file created in Mac Excel with Perl

I'm having a bit of trouble with the Perl code below. I can open and read in a CSV file that I've made manually, but if I try to open any Mac Excel spreadsheet that I save as a CSV file, the code below reads it all as a single line.
#!/usr/bin/perl
use strict;
use warnings;

my ($first, $second);
open F, '<', 'file.csv' or die "Cannot open file.csv: $!";
foreach (<F>) {
    ($first, $second, undef, undef) = split(',', $_);
}
print "$first : $second\n";
close(F);
Always use a specialised module (such as Text::CSV or Text::CSV_XS) for this purpose, as there are lots of cases where splitting will not help (for example when a field contains a comma that is within quotes rather than being a field separator).
Traditional Macintosh (System 9 and previous) uses CR (0x0D, \r) as the line separator. Mac OS X (Unix based) uses LF (0x0A, \n) as the default line separator, so the Perl script, being a Unix tool, is probably expecting LF but is getting CR. Since there are no line separators in the file, perl thinks there is only one line. If the file had Windows line endings (CR LF) you'd probably be getting an invisible CR at the end of each line.
A quick loop over the input replacing 0x0D with 0x0A should fix your problem.
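For example, as an in-place one-liner (the filename is illustrative; this is only safe on the CR-only files, since a CRLF file would end up with doubled newlines):
$ perl -pi -e 'tr/\r/\n/' file.csv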
I've directly experienced this problem with Excel 2004 for Mac. The line endings are indeed \r, and IIRC, the text uses the MacRoman character set, rather than Latin-1 or UTF-8 as you might expect.
So as well as the good advice to use Text::CSV / Text::CSV_XS and splitting on \r, you will want to open the file using the MacRoman encoding like so:
open my $fh, "<:encoding(MacRoman)", $filename
or die "Can't read $filename: $!";
Likewise, when reading a file exported with Excel on Windows, you may wish to use :encoding(cp1252) instead of :encoding(MacRoman) in that code.
Not sure about Mac Excel, but the Windows version certainly tends to enclose all values in quotes: "like","this". Also, you need to take into account the possibility of there being a quote in the value, which would show up "like""this" (there's only a single " in that value).
To actually answer your question however, it's likely that it's using a different newline character from what you'd expect. It's probably saving as \r\n instead of \n, or vice versa.
As others have suspected, your line endings are probably to blame. On my Linux-based system there are built-in utilities to change these line endings. mac2unix (which I think is just a wrapper around dos2unix) will read your file and change the line endings for you. You should have something similar on both Linux and Mac (Microsoft may not care about you).
If you want to handle this in Perl, look into setting the $/ variable to change the "input record separator" from "\n" to "\r" (if that's the right ending). Try local $/ = "\r" before you read the file. Read more about it in perldoc perlvar (near $/) or in perldoc perlport (devoted to writing portable Perl code).
P.S. If I have some part of this incorrect, let me know. I don't use a Mac; I just think I know the theory.
if you set the "special variable" that handles what it considers a newline to \r you'll be able to read one line at a time: $/="\r"; in this particular case the mac new line for perl is default \n but the file is probably using \r. This builds off what Flynn1179 & Mark Thalman said but shows you what to do to use the while () style reading.

With a utf8-encoded Perl script, can it open a filename encoded as GB2312?

I'm not talking about reading in the file content in utf-8 or non-utf-8 encoding and stuff. It's about file names. Usually I save my Perl script in the system default encoding, "GB2312" in my case and I won't have any file open problems. But for processing purposes, I'm now having some Perl script files saved in utf-8 encoding. The problem is: these scripts cannot open the files whose names consist of characters encoded in "GB2312" encoding and I don't like the idea of having to rename my files.
Does anyone happen to have any experience in dealing with this kind of situation? Thanks, as always, for any guidance.
Edit
Here's the minimized code to demonstrate my problem:
#!perl -w
# I'm running ActivePerl 5.10.1 on Windows XP (Simplified Chinese version).
# The file system is NTFS.
use autodie;
my $file = "./测试.txt"; # the file name consists of two Chinese characters
open my $in, '<', $file;
while (<$in>) {
    print;
}
This test script runs fine if saved in "ANSI" encoding (I assume ANSI encoding is the same as GB2312, which is used to display Chinese characters). But it won't work if saved as "UTF-8", and the error message is as follows:
Can't open './娴嬭瘯.txt' for reading: 'No such file or directory'.
In this warning message, "娴嬭瘯" are meaningless junk characters.
Update
I tried first encoding the file name as GB2312 but it does not seem to work :(
Here's what I tried:
#!perl -w
use autodie;
use Encode;
my $file = "./测试.txt";
encode("gb2312", decode("utf-8", $file));
open my $in, '<', $file;
while (<$in>) {
    print;
}
My current thinking is: the file name in my OS is 测试.txt but it is encoded as GB2312. In the Perl script the file name looks the same to human eyes, still 测试.txt. But to Perl, they are different because they have different internal representations. But I don't understand why the problem persists when I already converted my file name in Perl to GB2312 as shown in the above code.
Update
I made it, finally made it :)
@brian's suggestion is right. I made a mistake in the above code: I didn't assign the encoded file name back to $file.
Here's the solution:
#!perl -w
use autodie;
use Encode;
my $file = "./测试.txt";
$file = encode("gb2312", decode("utf-8", $file));
open my $in, '<', $file;
while (<$in>) {
    print;
}
If you
use utf8;
in your Perl script, that merely tells perl that the source is in UTF-8. It doesn't affect how perl deals with the outside world. Are you turning on any other Perl Unicode features?
Are you having problems with every filename, or just some of them? Can you give us some examples, or a small demonstration script? I don't have a filesystem that encodes names as GB2312, but have you tried encoding your filenames as GB2312 before you call open?
If you want specific strings encoded with a specific encoding, you can use the Encode module. Try that with your filenames that you give to open.