Remove or completely suppress null character \0 - perl

I have a script, MM.pl, which is the “workhorse”, and a simple “patchfile” that it reads from. In this case, the patch file targets an .ini file for search and replace. Simple enough. It took me 5 days to realize the ini is encoded with null (\0) characters between each letter. Since then, I have tried every option I could find in code snippets, module (use) functions, and regular expressions.
The only reason I found it was that I used use Data::Printer; to dump several values. In Notepad++, the ini appears to be encoded as UCS-2 LE. It is important that MM.pl handles the task instead of asking the user to “fix” the issue.
Update: This may provide a clue: \xFF\xFE are the first 2 characters in the ini file, and they appear after processing. The swap is not actually changing anything else like it is supposed to, but it "reveals" these 2 hidden characters.

As you noticed, those nulls aren't just junk to be stripped; they're part of the file's character encoding. So decode it:
open my $fh, '<:encoding(UCS-2)', 'file.ini';
Write it back out the same way once you're done.
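Putting the read and the write together, here is a rough sketch (not tested against your .ini; the file name and the search/replace pair are placeholders). With an endian-specific layer such as UTF-16LE, the BOM comes through as an ordinary U+FEFF character, so strip it on input and put it back on output:
use strict;
use warnings;

my $file = 'file.ini';    # placeholder name

open my $in, '<:encoding(UTF-16LE)', $file or die "Can't read $file: $!";
my $text = do { local $/; <$in> };    # slurp the whole file as characters
close $in;

$text =~ s/^\x{FEFF}//;               # drop the BOM character
$text =~ s/oldvalue/newvalue/g;       # placeholder search and replace

open my $out, '>:encoding(UTF-16LE)', $file or die "Can't write $file: $!";
print {$out} "\x{FEFF}", $text;       # put the BOM back
close $out;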

When you read the file, set the encoding:
use IO::File;

my $fh = IO::File->new( "< something.ini" )
    or die "Can't open something.ini: $!";
binmode( $fh, ":encoding(UTF-16LE)" );
And when you output, you can write back in whichever encoding you like, e.g.
my $out = IO::File->new( "> something-new.ini" )
    or die "Can't open something-new.ini: $!";
binmode( $out, ":encoding(UTF-8)" );
Or even if you're dumping to the terminal
binmode( STDOUT, ":encoding(UTF-8)" );

To be honest this really is not a solution but a cop-out. After 4 weeks of trying and retrying methods, and reading and reading and reading, I have put it in park and switched to Python to build the app. Several references in the perldocs mention that UTF-16 is "problematic" and also mention situations in which it is treated differently.

making sure that my handling of utf8 is correct

I am using Perl for a module that involves processing a lot of Unicode documents. I started getting nervous because I'm not opening and closing files with the utf8 layers like open (OUT, '>:utf8', $textfile). However, I have been thoroughly testing and the output was still as expected. So I want to better understand why.
In a nutshell, my Perl module passes a document to an external service and gets a response. The response will be in UTF-8. It uses LWP::UserAgent for this. When it gets the response, it just writes it to a file:
my $fh;
open($fh, '>', $outputpath) or die "Could not open file '$outputpath' $!";
print $fh $response->content;
close $fh;
I have diffed these files against Unicode files representing the "expected" output and it is fine. And yet, you can see in my open command that I was not using the utf8 layer. So why is that?
What if I just returned $response->content to some other process, instead of printing it? Would it still be proper Unicode then?
I also have a separate process that I would like to ask about, very similar question. In this case I am trying to build a new service which replaces an old one. The old one read from a file like open(my $fh, '<:utf8', $inputfile) and wrote to a new file like open(my $fh, '>:utf8', $outputfile). The new service will still read the same way, but will not write to the output file anymore. It will send the string to another server using HTTP, and on that server it will be printed to a file using open(my $fh, '>', $outputfile) so no utf8 layer. I can't change that code immediately.
I want the file contents to be the exact same as they would otherwise have been (none of the other processing rules are changing). Should I be nervous about losing the layer?
I think maybe it would help if I understood better what these layers are doing.
There is no "handling of utf8" in the code in the main question, and that in itself isn't right.
The whole thing works, because the server is sending UTF-8 as you say, in the following way.
The content method used on $response is from HTTP::Message
The content() method sets the raw content if an argument is given. If no argument is given the content is not touched. In either case the original raw content is returned.
Since you don't specify layers† in open the default is used, likely :unix:perlio for Unix, with no encoding (see PerlIO). So you are dumping the original bytes to the disk, unchanged.
Looking further down the page, at decoded_content( %options ), we see the default
default_charset
This override the default charset guessed by content_charset() or if that fails "ISO-8859-1".
and you can establish what you are getting by printing it:
say 'Content type: ', $response->content_charset;
where you should get Content type: UTF-8. But if you receive a different encoding from the server, that will wind up in the file and any code that expects UTF-8 will break.
One should always decode all input and encode all output. Then we know exactly what is going on. As input is decoded the program carries on with character strings (not bytes in whatever encoding was sent). In the end encode suitably for output. This Effective Perler article should be useful. Here you'd use decoded_content and write files opened with :encoding(UTF-8).
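For the case in the question, that might look like this (a sketch using the question's own $response and $outputpath):
my $content = $response->decoded_content;    # characters, not raw bytes

open my $fh, '>:encoding(UTF-8)', $outputpath
    or die "Could not open file '$outputpath' $!";
print $fh $content;
close $fh;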
With use open ":std", ":encoding(UTF-8)"; all I/O via standard streams in the lexical scope of this pragma will be handled as UTF-8. (This can be overridden for other specific uses, say by specifying layers in the three-argument open.)
See open pragma.
As for the other question, you need to properly encode what you intend to "send to another server." How to do that depends on how you are "sending" it.
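For example, if you are sending it with an HTTP POST via LWP::UserAgent (hypothetical $ua, $url, and $string; adjust to whatever client you actually use), encode the character string to UTF-8 bytes yourself and say so in the header:
use Encode qw(encode);

my $bytes = encode('UTF-8', $string);    # characters -> UTF-8 bytes
my $res   = $ua->post(
    $url,
    'Content-Type' => 'text/plain; charset=UTF-8',
    Content        => $bytes,
);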
†   With PerlIO the I/O "layers" can be set so that encoding of input and output is done as needed behind the scenes, as data is read or written. The work is done by Encode. For a nice explanation of the process see Encode::PerlIO.
Also see perlunitut, perlunifaq, and perluniintro.

setting BOM to Unicode U code UTF8 perl

This question is similar to others that have been posted before; however, after trying all combinations, nothing is working.
I need to have my Excel file read in Unicode UTF-8, so I am attempting to set my BOM:
my $csv = Text::CSV->new ({binary=>1, eol =>$/})
or die "cannot use CSV: ".Text::CSV->error_diag ();
open my $csvFile, ">:encoding(UTF-8)", "teht.csv" or die "teht.csv: $!";
print($csvFile "\x{FEBBBF}");
however this gets an error saying "0xFEBBBF is not Unicode..."
all information that I have found indicates that the code for utf8 should read
print($csvFile "\N{U+FEBBBF}") or ... "\xFE\xBB\xBF" or similar.
Is it possible to force Excel recognize UTF-8 CSV files automatically? is one source which says this many times.
https://stackoverflow.com/a/22711105/6557829 is another source.
So far I have actually been able to get UTF-16 to work with the same print statement: print($csvFile "\N{U+FEFF}"); however that is more space than I mean to use.
Thanks in advance for any help you can give me.
The BOM is U+FEFF, not U+FEBBBF. Replace
"\x{FEBBBF}"
with any one of the following:
chr(0xFEFF)
"\x{FEFF}"
"\N{U+FEFF}"
"\N{BOM}"
This will create a string with a single character (FEFF), which print will encode using UTF-8 as requested (EF BB BF).
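Putting it together, a sketch (the file name and the row data are placeholders):
use strict;
use warnings;
use utf8;            # so the non-ASCII literals below are read as characters
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, eol => $/ })
    or die "cannot use CSV: " . Text::CSV->error_diag();

open my $csvFile, '>:encoding(UTF-8)', 'teht.csv' or die "teht.csv: $!";
print $csvFile "\x{FEFF}";                       # the BOM, written as EF BB BF
$csv->print($csvFile, [ 'Zürich', 'Tōkyō' ]);    # placeholder row
close $csvFile;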

Compare two UTF-8 text files and ignore lines that are blank or all whitespace

I am an author maintaining Kindle(HTML) and Open Office versions of a book. I sometimes forget to make changes to one or the other, and the documents are diverging.
My procedure is to copy the text from each and paste into separate text files (using paste and match style in TextEdit) in UTF-8, then perform a differencing operation. However the HTML paste adds blank lines between paragraphs.
I have a file differencing tool, but it has no option to ignore blank lines. My thought was to write a Perl script to remove the blank lines. However, the output of that script screws up the special characters - like ndashes, curly quotes, etc. I have tried using BINMODE and other tricks, to no avail.
I will accept a pointer to a free comparator for Mac OS X that ignores blank lines, or a way to get Perl to not screw up the UTF-8 special characters. I am using Perl 5.14. I prefer answers that do not rely upon newer features, but if I have to install a new Perl, I will.
UPDATE:
This does not work:
use open IO => ":encoding(iso-8859-7)";
open(FILE, "From HTML.txt") or die "$!\n";
open(OUT, ">From HTML - no blank lines.txt") or die "$!\n";
while (<FILE>) {
    next if /^\s*$/;
    print OUT $_;
}
close FILE; close OUT;
I also tried calling binmode(OUT, ":utf8");
UPDATE: Tried without success this tip from another Stackoverflow question:
open(my $fh, "<:encoding(UTF-8)", "filename");
GNU diff has -B/--ignore-blank-lines and -b/--ignore-space-change.
Err, that "use open" says that your data is not UTF-8. Try binmode on both FILE and OUT?
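For reference, here is a sketch of the blank-line filter with explicit UTF-8 layers on both handles (file names as in the question); this leaves the ndashes and curly quotes alone:
use strict;
use warnings;

open my $in,  '<:encoding(UTF-8)', 'From HTML.txt' or die "$!\n";
open my $out, '>:encoding(UTF-8)', 'From HTML - no blank lines.txt' or die "$!\n";

while (my $line = <$in>) {
    next if $line =~ /^\s*$/;    # skip blank or whitespace-only lines
    print {$out} $line;
}

close $in;
close $out;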
I ended up using the XCode text editor. By selecting a newline and pasting it into the search/replace dialog, I was able to replace all double newlines with single newlines.
Then I saved the file and used my Compare utility.

Opening a CSV file created in Mac Excel with Perl

I'm having a bit of trouble with the Perl code below. I can open and read in a CSV file that I've made manually, but if I try to open any Mac Excel spreadsheet that I save as a CSV file, the code below reads it all as a single line.
#!/usr/bin/perl
use strict;
use warnings;

open my $fh, '<', 'file.csv' or die "Can't open file.csv: $!";
my ($first, $second);
foreach (<$fh>) {
    ($first, $second, undef, undef) = split(',', $_);
}
print "$first : $second\n";
close($fh);
Always use a specialised module (such as Text::CSV or Text::CSV_XS) for this purpose, as there are lots of cases where splitting will not help (for example, when a field contains a comma that is not a field separator but is within quotes).
Traditional Macintosh (System 9 and earlier) uses CR (0x0D, \r) as the line separator. Mac OS X (Unix based) uses LF (0x0A, \n) as the default line separator, so the Perl script, being a Unix tool, is probably expecting LF but is getting CR. Since there are no LF line separators in the file, Perl thinks there is only one line. If the file had Windows line endings (CR,LF) you'd probably be getting an invisible CR at the end of each line.
A quick loop over the input replacing 0x0D with 0x0A should fix your problem.
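For example, a one-liner along those lines (the file name is a placeholder; for a CR-only file the whole file is read as a single record, so it must fit in memory):
# normalize CR or CRLF line endings to LF, editing the file in place
perl -pi -e 's/\015\012?/\012/g' file.csv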
I've directly experienced this problem with Excel 2004 for Mac. The line endings are indeed \r, and IIRC, the text uses the MacRoman character set, rather than Latin-1 or UTF-8 as you might expect.
So as well as the good advice to use Text::CSV / Text::CSV_XS and splitting on \r, you will want to open the file using the MacRoman encoding like so:
open my $fh, "<:encoding(MacRoman)", $filename
or die "Can't read $filename: $!";
Likewise, when reading a file exported with Excel on Windows, you may wish to use :encoding(cp1252) instead of :encoding(MacRoman) in that code.
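A combined sketch of that advice (not tested against a real Excel for Mac export; the file name is a placeholder): MacRoman decoding, a CR record separator, and Text::CSV for the parsing.
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag();

binmode STDOUT, ':encoding(UTF-8)';    # so decoded characters print cleanly

local $/ = "\r";                       # old-style Mac line endings

open my $fh, '<:encoding(MacRoman)', 'file.csv'
    or die "Can't read file.csv: $!";

while (my $line = <$fh>) {
    chomp $line;                       # chomp honours $/, so this strips the CR
    next unless $csv->parse($line);
    my ($first, $second) = $csv->fields;
    print "$first : $second\n";
}
close $fh;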
Not sure about Mac Excel, but the Windows version certainly tends to enclose all values in quotes: "like","this". Also, you need to take into account the possibility of there being a quote in the value, which would show up as "like""this" (there's only a single " in that value).
To actually answer your question however, it's likely that it's using a different newline character from what you'd expect. It's probably saving as \r\n instead of \n, or vice versa.
As others have suspected, your line endings are probably to blame. On my Linux-based system there are built-in utilities to change these line endings. mac2unix (which I think is just a wrapper around dos2unix) will read your file and change the line endings for you. You should have something similar on both Linux and Mac (Microsoft may not care about you).
If you want to handle this in Perl, look into setting the $/ variable, the "input record separator", from "\n" to "\r" (if that's the right ending). Try local $/ = "\r" before you read the file. Read more about it in perldoc perlvar (near $/) or in perldoc perlport (devoted to writing portable Perl code).
P.S. If I have some part of this incorrect, let me know; I don't use a Mac, I just think I know the theory.
If you set the "special variable" that controls what Perl considers a newline to \r, you'll be able to read one line at a time: $/ = "\r". In this particular case Perl's default newline is \n, but the file is probably using \r. This builds on what Flynn1179 and Mark Thalman said, but shows what to do to use the while (<$fh>) style of reading.

Perl: Encoding messed up after text concatenation

I have encountered a weird situation while updating/upgrading some legacy code.
I have a variable which contains HTML. Before I can output it, it has to be filled with lots of data. In essence, I have the following:
for my $line (@lines) {
    $output = loadstuff($line, $output);
}
Inside of loadstuff(), there is the following
sub loadstuff {
    my ($line, $output) = @_;

    # here the process is simplified for better understanding.
    my $stuff  = getOtherStuff($line);
    my $result = $output . $stuff;

    return $result;
}
This function builds a page which consists of different areas. Each area is loaded up independently; that's why there is a for-loop.
Trouble starts right about here. When I load the page from ground up (click on a link, Perl executes and delivers HTML), everything is loaded fine. Whenever I load a second page via AJAX for comparison, that HTML has broken encoding.
I tracked down the problem to this line: my $result = $output . $stuff. Before the concatenation, $output and $stuff are fine. But afterward, the encoding in $result is messed up.
Does somebody have a clue why concatenation messes up my encoding? While we are on the subject, why does it only happen when the call is done via AJAX?
Edit 1
The Perl and the AJAX call both execute the very same functions for building up a page. So, whenever I fix it for AJAX, it is broken for freshly reloaded pages. It really seems to happen only if AJAX starts the call.
The only difference in this particular case is that the current values for the page are compared with older ones (it is a backup/restore function). From here, everything is the same. The encoding in the variables (as far as I can tell) is ok. I even tried the Encode functions only on the values loaded from AJAX, but to no avail. The files themselves seem to be utf8 according to "Kate".
Besides that, I have another function with the same behavior which uses the EXACT same functions, values and files. When the call is started from Perl/Apache, the encoding is ok. Via AJAX, again, it is messed up.
I have been examining the AJAX request (jQuery) and could not find anything odd. The encoding seems to be utf8 too.
Perl has a “utf8” flag for every scalar value, which may be “on” or “off”. When the flag is on, perl treats the value as a string of Unicode characters.
If you take a string with utf8 flag off and concatenate it with a string that has utf8 flag on, perl converts the first one to Unicode. This is the usual source of problems.
You need to either convert both variables to bytes with Encode::encode() or to perl's internal format with Encode::decode() before concatenation.
See perldoc Encode.
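A sketch of what that can look like in loadstuff() (this assumes getOtherStuff() returns UTF-8 encoded bytes while $output already holds decoded characters; swap the roles if it is the other way around):
use Encode qw(decode);

sub loadstuff {
    my ($line, $output) = @_;

    my $stuff  = decode('UTF-8', getOtherStuff($line));  # bytes -> characters
    my $result = $output . $stuff;                       # both operands are character strings now

    return $result;
}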
Expanding on the previous answer, here's a little more information that I found useful when I started messing with character encodings in Perl.
This is an excellent introduction to Unicode in perl: http://perldoc.perl.org/perluniintro.html. The section "Perl's Unicode Model" is particularly relevant to the issue you're seeing.
A good rule to use in Perl is to decode data into Perl characters on its way in and encode it into bytes on its way out. You can do this explicitly using Encode::encode and Encode::decode. If you're reading from/writing to a file handle, you can specify an encoding on the filehandle by using binmode and setting a layer: perldoc -f binmode
You can tell which of the strings in your example has been decoded into Perl characters using Encode::is_utf8:
use Encode qw( is_utf8 );
print is_utf8($stuff) ? 'characters' : 'bytes';
A colleague of mine found the answer to this problem. It really had something to do with the fact that AJAX started the call.
The file structure is as follows:
1 Handler, accessed by Apache
1 Handler, accessed by Apache, but which only contains AJAX responders. We call it the AJAX-Handler
1 package, which contains functions relevant for the entire software and which accesses yet other packages from our own framework
Inside of the AJAX-Handler, we print the result as follows:
sub handler {
    my $r = shift;

    # processing output

    $r->print($output);
    return Apache2::Const::OK;
}
Now, when I replace $r->print($output); by print($output);, the problem disappears! I know that this is not the recommended way to print stuff in mod_perl, but this seems to work.
Still, any ideas how to do this the proper way are welcome.