perl save utf-8 text problem - perl

I am playing around with pplog, a single-file, file-based blog.
The code that writes to the file:
open(FILE, ">$config_postsDatabaseFolder/$i.$config_dbFilesExtension");
my $date = getdate($config_gmt);
print FILE $title.'"'.$content.'"'.$date.'"'.$category.'"'.$i; # 0: Title, 1: Content, 2: Date, 3: Category, 4: FileName
print 'Your post '. $title.' has been saved. Go to Index';
close FILE;
The input text:
春眠不覺曉,處處聞啼鳥. 夜來風雨聲,花落知多小.
After store to file, it becomes:
春眠不覺�›�,處處聞啼鳥. 夜來風�›�聲,花落知多小.
I can use Eclipse to edit the file and make it render to normal. The problem exists during printing to the file.
Some basic info:
Strawberry perl 5.12
without use utf8;
tried use utf8;, it doesn't have any effect.
Thank you.
--- EDIT ---
Thanks for comments. I traced the code:
Codes add new content:
# Blog Add New Entry Page
my $pass = r('pass');
#BK 7JUL09 patch from fedekun, fix post with no title that caused zero-byte message...
my $title = r('title');
my $content = '';
if($config_useHtmlOnEntries == 0)
{
$content = bbcode(r('content'));
}
else
{
$content = basic_r('content');
}
my $category = r('category');
my $isPage = r('isPage');
sub r
{
escapeHTML(param($_[0]));
}
sub r forwards the call to a CGI.pm function.
In CGI.pm
sub escapeHTML {
    # hack to work around earlier hacks
    push @_,$_[0] if @_==1 && $_[0] eq 'CGI';
    my ($self,$toencode,$newlinestoo) = CGI::self_or_default(@_);
    return undef unless defined($toencode);
    $toencode =~ s{&}{&amp;}gso;
    $toencode =~ s{<}{&lt;}gso;
    $toencode =~ s{>}{&gt;}gso;
    if ($DTD_PUBLIC_IDENTIFIER =~ /[^X]HTML 3\.2/i) {
        # $quot; was accidentally omitted from the HTML 3.2 DTD -- see
        # <http://validator.w3.org/docs/errors.html#bad-entity> /
        # <http://lists.w3.org/Archives/Public/www-html/1997Mar/0003.html>.
        $toencode =~ s{"}{&#34;}gso;
    }
    else {
        $toencode =~ s{"}{&quot;}gso;
    }
    # Handle bug in some browsers with Latin charsets
    if ($self->{'.charset'}
        && (uc($self->{'.charset'}) eq 'ISO-8859-1' # This line causes trouble: it treats Chinese chars as ISO-8859-1
            || uc($self->{'.charset'}) eq 'WINDOWS-1252')) {
        $toencode =~ s{'}{&#39;}gso;
        $toencode =~ s{\x8b}{&#8249;}gso;
        $toencode =~ s{\x9b}{&#8250;}gso;
        if (defined $newlinestoo && $newlinestoo) {
            $toencode =~ s{\012}{&#10;}gso;
            $toencode =~ s{\015}{&#13;}gso;
        }
    }
    return $toencode;
}
Tracing the problem further, I found that the browser defaults to ISO-8859-1; even when manually set to UTF-8, it sends the string back to the server as ISO-8859-1.
Finally,
print header(-charset => qw(utf-8)), '<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
Adding the -charset => qw(utf-8) param to header fixed it. The Chinese poem is still a Chinese poem.
Thanks to Schwern for the comments; they inspired me to trace the problem and learn a lesson.
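The corruption shown in the question is consistent with the \x9b substitution in escapeHTML running over raw UTF-8 bytes. The following is a minimal sketch of that effect, assuming the form value reaches escapeHTML as undecoded UTF-8 bytes; the variable names are illustrative and not taken from pplog or CGI.pm.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+66C9 and U+96E8 are the two characters that came out garbled in the post.
my @chars = ("\x{66C9}", "\x{96E8}");

my @report;
for my $char (@chars) {
    # Raw UTF-8 bytes, as they would arrive if the input is never decoded.
    my $bytes = encode('UTF-8', $char);
    my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;

    # The Latin-charset workaround from escapeHTML, applied byte by byte.
    (my $escaped = $bytes) =~ s{\x9b}{&#8250;}g;

    my $damaged = $escaped ne $bytes ? 1 : 0;
    push @report, { hex => $hex, damaged => $damaged };
    print "$hex damaged=$damaged\n";
}
```

Both characters contain the byte 9B in their UTF-8 form, so the substitution splits each of them, while the other characters in the poem pass through untouched.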

Getting utf8 really working in Perl involves flipping on a lot of individual features. use utf8 only makes your code utf8 (strings, variables, regexes...); you have to do file handles separately.
It's complicated, and the simplest thing is to use utf8::all, which will make utf8 the default for your code, your files, @ARGV, STDIN, STDOUT and STDERR. utf8 support is constantly improving in Perl, and utf8::all will add it as it becomes available.

I'm unsure of how your code can produce that output—for example, the quote marks are missing. Of course, this could be due to "corruption" somewhere between your file and me seeing the page. SO may filter corrupted UTF-8. I suggest providing hex dumps in the future!
Anyway, to get UTF-8 output working in Perl, there are several approaches:
Work with character data, that is, let Perl know that your variables contain Unicode. This is probably the best method. Confirm that utf8::is_utf8($var) is true (you do not need use utf8 for this, and should not add it). If it is not true, look into the Encode module's decode function to make Perl know it's Unicode. Once Perl knows your data is characters, that print will give warnings (which you do have enabled, right?). To fix, enable the :utf8 or :encoding(utf8) layer on your file (the latter version provides error checking). You can do this in your open (open FILE, '>:utf8', "$fname") or alternatively enable it with binmode (binmode FILE, ':utf8'). Note that you can also use other encodings; see the encoding and PerlIO::encoding docs.
Treat your Unicode as opaque binary data. utf8::is_utf8($var) must be false. You must be very careful when manipulating strings; for example, if you've got UTF-16-BE, this would be a bad idea: print "$data\n", because you actually need print "$data\0\n". UTF-8 has fewer of these issues, but you need to be aware of them.
I suggest reading the perluniintro, perlunitut, perlunicode, and perlunifaq manpages/pods.
Also, use utf8; just tells Perl that your script is written in UTF-8. Its effects are very limited; see its pod docs.
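The first approach (decode on input, encoding layer on output) might look like the sketch below. It assumes the text arrives as UTF-8 bytes; the temporary file and the variable names are placeholders for illustration, not part of the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# Bytes as they might arrive from a form: UTF-8 for U+6625 U+7720.
my $raw = "\xE6\x98\xA5\xE7\x9C\xA0";

# Decode to characters first, so Perl knows this is text.
my $text = decode('UTF-8', $raw);

# Placeholder output file for the example.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# The :encoding(UTF-8) layer turns characters back into bytes on output.
open my $out, '>:encoding(UTF-8)', $path or die "Cannot write $path: $!";
print {$out} $text;
close $out or die "Cannot close $path: $!";

# Read the file back as raw bytes to see what actually landed on disk.
open my $in, '<:raw', $path or die "Cannot read $path: $!";
my $on_disk = do { local $/; <$in> };
close $in;

printf "characters: %d, bytes on disk: %d\n", length($text), length($on_disk);
```

The decoded string has 2 characters and the file holds the same 6 bytes that came in, with no "Wide character" warning from print.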

You're not showing the code that is actually running. I successfully processed the text you supplied as input with both 5.10.1 on Cygwin and 5.12.3 on Windows. So I suspect a bug in your code. Try narrowing down the problem by writing a short, self-contained test case.

Related

Perl drop down menus and Unicode

I've been going around on this for some time now and can't quite get it. This is Perl 5 on Ubuntu. I have a drop down list on my web page:
$output .= start_form . "Student: " . popup_menu(-name=>'student', -values=>['', @students], -labels=>\%labels, -onChange=>'Javascript:submit()') . end_form;
It's just a set of names in the form "Last, First" that are coming from a SQL Server table. The labels are created from the SQL columns like so:
$labels{uc($record->{'id'})} = $record->{'lastname'} . ", " . $record->{'firstname'};
The issue is that the drop down isn't displaying some Unicode characters correctly. For instance, "Søren" shows up in the drop down as "SÃ¸ren". I have in my header:
use utf8;
binmode(STDOUT, ":utf8");
...and I've also played around with various takes on the "decode( )" function, to no avail. To me, the funny thing is that if I pull $labels into a test script and print the list to the console, the names appear just fine! So what is it about the drop down that is causing this? Thank you in advance.
EDIT:
This is the relevant functionality, which I've stripped down to this script that runs in the console and yields the correct results for three entries that have Unicode characters:
#!/usr/bin/perl
use DBI;
use lib '/home/web/library';
use mssql_util;
use Encode;
binmode(STDOUT, ":utf8");
$query = "[SQL query here]";
$dbh = &connect;
$sth = $dbh->prepare($query);
$result = $sth->execute();
while ($record = $sth->fetchrow_hashref())
{
if ($record->{'id'})
{
$labels{uc($record->{'id'})} = Encode::decode('UTF-8', $record->{'lastname'} . ", " . $record->{'nickname'} . " (" . $record->{'entryid'} . ")");
}
}
$sth->finish();
print "$labels{'ST123'}\n";
print "$labels{'ST456'}\n";
print "$labels{'ST789'}\n";
The difference in what the production script is doing is that instead of printing to the console like above, it's printing to HTTP:
$my_output = "<p>$labels{'ST123'}</p><br>
<p>$labels{'ST456'}</p><br>
<p>$labels{'ST789'}</p>";
$template =~ s/\$body/$my_output/;
print header(-cookie=>$cookie) . $template;
This gives, e.g., strings like "ZoÃ«" and "SÃ¸ren" on the page. BUT, if I remove binmode(STDOUT, ":utf8"); from the top of the production script, then the strings appear just fine on the page (i.e. I get "Zoë" and "Søren").
I believe that the binmode( ) line is necessary when writing UTF-8 to output, and yet removing it here produces the correct results. What gives?
Problem #1: Decoding inputs
53.C3.B8.72.65.6E is the UTF-8 encoding for Søren. When you instruct Perl to encode it all over again (by printing it to handle with the :utf8 layer), you are producing garbage.
You need to decode your inputs ($record->{id}, $record->{lastname}, $record->{firstname}, etc)! This will transform the UTF-8 bytes 53.C3.B8.72.65.6E ("encoded text") into the Unicode Code Points 53.F8.72.65.6E ("decoded text").
In this form, you will be able to use uc, regex matches, etc. You will also be able to print them out to a handle with an encoding layer (e.g. :encoding(UTF-8), or the improper :utf8).
You let on that these inputs come from a database. Most DBD have a flag that causes strings to be decoded. For example, if it's a MySQL database, you should pass mysql_enable_utf8mb4 => 1 to connect.
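A small sketch of the difference between encoding undecoded bytes and decoding first, using the byte sequence quoted above. The variable names are illustrative, and the string literal stands in for a value fetched from the database.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Søren" as UTF-8 bytes, standing in for a value read from the database.
my $from_db = "S\xC3\xB8ren";

# Encoding undecoded bytes treats each byte as a character: double encoding.
my $double = encode('UTF-8', $from_db);

# Decoding first, then encoding on output, round-trips cleanly.
my $decoded = decode('UTF-8', $from_db);
my $correct = encode('UTF-8', $decoded);

# Helper to show a byte string as dotted hex.
my $hex = sub { join '.', map { sprintf '%02X', ord } split //, $_[0] };

print 'double:  ', $hex->($double),  "\n";
print 'correct: ', $hex->($correct), "\n";
print 'decoded length: ', length($decoded), "\n";
```

The double-encoded form is 53.C3.83.C2.B8.72.65.6E, which a browser reading UTF-8 shows as the garbled name; the decoded string is 5 characters long and encodes back to the original 6 bytes.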
Problem #2: Communicating encoding
If you're going to output UTF-8, don't tell the browser it's ISO-8859-1!
$ perl -e'use CGI qw( :standard ); print header()'
Content-Type: text/html; charset=ISO-8859-1
Fixed:
$ perl -e'use CGI qw( :standard ); print header( -type => "text/html; charset=UTF-8" )'
Content-Type: text/html; charset=UTF-8
Hard to give a definitive solution as you don't give us much useful information. But here are some pointers that might help.
use utf8 only tells Perl that your source code is encoded as UTF-8. It does nothing useful here.
Reading perldoc perlunitut would be a good start.
Do you know how your database tables are encoded?
Do you know whether your database connection is configured to automatically decode data coming from the database into Perl characters?
What encoding are you telling the browser that you have encoded your HTTP response in?

Encode module and inverted commas

I am scraping a web page, and extracting a specific section from it. That section includes inverted commas (’, character 146). I'm trying to print my extracted data to a text file, but it's giving me â€™ instead of the inverted comma. I have tried the following:
$content =~ s/’/'/g;
my $invComma = chr 146;
$content =~ s/$invComma/'/g;
$content =~ s/\x{0092}/'/g;
None of it has worked. I can't decode('UTF-8', $content) because it has wide characters. When I try to encode('UTF-8', $content) the â€™ changes to Ã¢â‚¬â„¢ instead. I have already tried use utf8 as well, to no effect.
I know that my text file viewer can display inverted commas, because I printed one to a test file and opened it. The problem is therefore in my script.
What am I doing wrong, and how do I fix it?
UPDATE: I am able to do $content =~ s/â€™/'/g to replace it with a simple apostrophe, but I still don't know why nothing else works. I'd also like a fix that actually solves the problem, instead of just solving one of the symptoms.
UPDATE 2: I have been informed by hobbs that the character is actually U+2019 RIGHT SINGLE QUOTATION MARK and changed my regex to use chr 0x2019 which now works.
The character you're trying to replace is only 0x92 / 146 in the Windows-1252 encoding. Perl uses Unicode, where that character is U+2019 RIGHT SINGLE QUOTATION MARK, aka "\x{2019}", chr(0x2019), or chr(8217).
Start by finding out what $content contains. You can use the following:
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
warn(Dumper($content));
If you get the following, $content is decoded
$VAR1 = "...\x{2019}...";
Any of the following will work.
use utf8; # Source code is encoded using UTF-8.
$content =~ s/’/'/g;
$content =~ s/\x{2019}/'/g;
$content =~ s/\N{U+2019}/'/g;
$content =~ s/\N{RIGHT SINGLE QUOTATION MARK}/'/g;
If you get the following, $content is encoded using UTF-8.
$VAR1 = "...\342\200\231...";
Start by decoding the value of $content using either of the following:
utf8::decode($content) or die;
use Encode qw( decode_utf8 );
$content = decode_utf8($content);
Then use any of the solutions for decoded content (above).
If you get the following, $content is encoded using cp1252.
$VAR1 = "...\222...";
Start by decoding the value of $content.
use Encode qw( decode );
$content = decode("cp1252", $content);
Then use any of the solutions for decoded content (above).
By the way, â€™ is what the UTF-8 encoding of ’ (E2 80 99) would look like if decoded as cp1252.
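That last observation can be checked with a few lines of Encode. This is only a sketch to reproduce the mojibake; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $quote = "\x{2019}";                 # RIGHT SINGLE QUOTATION MARK
my $utf8  = encode('UTF-8', $quote);    # the bytes E2 80 99
my $wrong = decode('cp1252', $utf8);    # the same bytes misread as Windows-1252

# List the code points that the misreading produces.
my @code_points = map { sprintf 'U+%04X', ord } split //, $wrong;
print "@code_points\n";
```

The three resulting code points are U+00E2, U+20AC and U+2122, i.e. the three characters seen in the output file.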
The problem was not in my script, it was in my editor. The script works properly, and the question is based on false pretenses. I was using gVim on Windows, which did not play nicely with Unicode. My script was properly decoding the content, but when I opened the output file in gVim, it mangled the text and displayed it incorrectly. My attempts to use regular expressions to change the characters failed because I was using the wrong codepoint - it wasn't 0x92, it was 0x2019. This was another failing of gVim. Thanks to hobbs and ikegami for helping me figure this out.

How to replace Ã with a space using perl

Apologies if this is a dupe (I tried all manner of searches!). This is driving me nuts...
I need a quick fix to replace Ã with a space.
I've tried the following, with no success:
$str =~ s/Ã/ /g;
$str =~ s/\xC3/ /g;
What am I doing wrong here ?
The statement "replace Ã with a space" is meaningless, because the statement does not specify which encoding is used for the character in question.
The context of this statement could be using the UTF-8 encoding, for example, as well as one of several ISO-8859 encodings. Or, maybe even UTF-16 or UTF-32.
So, for starters, you need to specify, at least, which encoding you are using. And after that, it's also necessary to specify where the input or the output is coming from.
Assuming:
1) You are using UTF-8 encoding
2) You are reading/writing STDIN and STDOUT
Then here's a short example of a filter that shows how to replace this character with a space. Assuming, of course, that the Perl script itself is also encoded in UTF-8.
use utf8;
use feature 'unicode_strings';
binmode(STDIN, ":utf8");
binmode(STDOUT, ":utf8");
while (<STDIN>)
{
s/Ã/ /g;
print;
}
You need to specify that you want UNICODE and not Latin-1 (or another encoding).
If you're reading from a file then:
#!/usr/bin/perl
open INFILE, '<:encoding(UTF-8)', '/mypath/file';
while(<INFILE>) {
s/\xc3/ /g;
print;
}
I'll break that down better for you:
In <:encoding(UTF-8) you are specifying that you want to read (the <), and that you want UNICODE (the :encoding(UTF-8) part).
If you weren't using unicode you would use:
open INFILE, '<', '/mypath/file';
or
open INFILE, '/mypath/file';
because by default perl will read. If you want to write you use >:encoding(UTF-8) and if you want to append (because the > overwrites the file) you use >>:encoding(UTF-8).
Hope it helped!
There is another answer that shows how to use binmode(STDIN, ":utf8") if you're trying to read Unicode from STDIN.
Following this, for the simple "quick fix" Wonko was looking for:
tr/ -~//cd;
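For reference, here is a short sketch of what that tr does to a sample string. Note that it deletes rather than replaces, and that anything outside the range from space to tilde goes, including newlines and tabs. The sample string is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# Undecoded UTF-8 bytes for U+00C3, then " b" and a newline.
my $str = "\xC3\x83 b\n";

# c = complement the listed range, d = delete what matches:
# every character outside space..tilde is removed.
(my $clean = $str) =~ tr/ -~//cd;

print "[$clean]\n";
```

Here $clean ends up as " b": both high bytes and the trailing newline are gone, and no space is inserted in their place.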

PERL: String Replacement on file

I am working on a script that does string replacement in a file. It reads the variables, values and file names from a configuration file and then does the replacement.
Here is my logic to do a string replacement.
sub expansion($$$){
    my $f = shift(@_) ; # file name
    my $vname = shift(@_) ; # variable name for pattern match
    my $value = shift(@_) ; # value to replace
    my $n = "$f".".new";
    open ( O, "<$f") or print( "Can't open $f file: $!");
    open ( N ,">$n" ) or print( "Can't open $n file: $!");
    while (<O>)
    {
        $_ =~ s/$vname/$value/g; #check for pattern
        print N "$_" ;
    }
    close (O);
    close (N);
}
My logic reads the input file ($f) line by line, applies the pattern, and writes to a new file ($n).
Instead of writing to a new file, is there any way to do the string replacement in the original file? When I try to do that, I end up with an empty file.
Do not. Never, ever[1]. Don't you dare, don't even think of it: do not use subroutine prototypes. They are horribly broken (that is, they don't do what you think they do) and dangerous.
Now, we got that out of the way:
Yes, you can do what you want. You can open a file for both reading and writing by using the mode +<. So far, so good.
However, due to buffering, you cannot use the standard read and write methods to read and write to the file. Instead, you need to use sysread and syswrite.
Then, what you need to do is read the line, use sysseek to go back to the start of where you read, and then write to that spot.
Not only is it very complex to do, but it is full of peril. Let's take a simple example. I have a document, and I want to replace my curly quotes with straight quotes.
$line =~ s/“|”/"/g;
That should work. I'm replacing one character with another. What could go wrong?
If this is a UTF-8 file (what Macs and Linux systems use by default), those curly quotes are three-byte characters and that straight quote is a single-byte character. I would be writing back a line that is shorter than the line I read in. My buffer is going to be off.
Back in the days when computer memory and storage were measured in kilobytes, and you used serial devices like reel-to-reel tapes, this type of operation was quite common. However, in this age where storage is vast, it's simply not worth the complexity and the error-prone process that this entails. Stick with reading from one file and writing to another. Then use unlink and rename to delete the original and to rename the copy to the original's name.
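A minimal sketch of that read, write, unlink and rename sequence follows. The helper name replace_in_file, the ".new" suffix and the sample placeholders are hypothetical; the pattern is matched literally via \Q...\E, which is a choice made for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: replace a literal string in a file via a temporary copy.
sub replace_in_file {
    my ($file, $find, $replacement) = @_;
    my $new = "$file.new";

    open my $in,  '<', $file or die "Can't open $file: $!";
    open my $out, '>', $new  or die "Can't open $new: $!";
    while (my $line = <$in>) {
        $line =~ s/\Q$find\E/$replacement/g;    # \Q...\E: match literally
        print {$out} $line;
    }
    close $in;
    close $out or die "Can't close $new: $!";

    # Remove the original, then move the copy into its place.
    unlink $file or die "Can't remove $file: $!";
    rename $new, $file or die "Can't rename $new to $file: $!";
    return;
}

# Usage on a throwaway file.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "host=__HOST__\nport=__PORT__\n";
close $fh;

replace_in_file($path, '__HOST__', 'example.org');

open my $check, '<', $path or die "Can't open $path: $!";
my $result = do { local $/; <$check> };
close $check;
print $result;
```

Because the original is only removed after the copy has been written and closed successfully, a failure part-way through leaves the original file intact.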
A few more pointers:
Don't print if the file can't be opened. Use die. Otherwise, your program will simply continue on blithely unaware that it is not working. Even better, use the pragma use autodie;, and you won't have to worry about testing whether or not a read/write failed.
Use scalars for file handles.
That is instead of
open OUT, ">my_file.txt";
use
open my $out_fh, ">my_file.txt";
And, it is highly recommended to use the three parameter open:
Use
open my $out_fh, ">", "my_file.txt";
If you aren't already, always add use strict; and use warnings;.
In fact, your Perl syntax is a bit ancient. You need to get a book on Modern Perl. Perl was originally written as a hack language to replace shell and awk programming. However, Perl has morphed into a full-fledged language that can handle complex data types, object orientation, and large projects. Learning the modern syntax of Perl will help you find errors and become a better developer.
[1] Like all rules, this can be broken, but only if you have a clear and careful understanding of what is going on. It's like those shows that say "Don't do this at home. We're professionals."
sub inplace_expansion($$$){
    my $f = shift(@_) ; # file name
    my $vname = shift(@_) ; # variable name for pattern match
    my $value = shift(@_) ; # value to replace
    local @ARGV = ( $f );
    local $^I = '';
    while (<>)
    {
        s/\Q$vname/$value/g; #check for pattern
        print;
    }
}
or, my preference would run closer to this (basically equivalent, changes mostly in formatting, variable names, etc.):
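use English;
sub inplace_expansion {
    my ( $filename, $pattern, $replacement ) = @_;
    # localize both variables; a comma after the first assignment
    # would leave $INPLACE_EDIT changed globally
    local @ARGV = ( $filename );
    local $INPLACE_EDIT = '';
    while ( <> ) {
        s/\Q$pattern/$replacement/g;
        print;
    }
}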
The trick with local basically simulates a command-line script (as one would run with perl -e); for more details, see perldoc perlrun. For more on $^I (aka $INPLACE_EDIT), see perldoc perlvar.
(For the business with \Q (in the s// expression), see perldoc -f quotemeta. This is unrelated to your question, but good to know. Also be aware that passing regex patterns around in variables—as opposed to, e.g., using literal regexes exclusively— can be vulnerable to injection attacks; Perl's built-in taint mode is useful here.)
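To see why \Q matters, compare the same substitution with and without it. This is an illustrative sketch; the sample line and pattern are invented for the example.

```perl
use strict;
use warnings;

my $line    = "price: a.b and axb\n";
my $pattern = 'a.b';

# Without \Q the dot is a regex wildcard, so "axb" is replaced too.
(my $as_regex = $line) =~ s/$pattern/X/g;

# With \Q the pattern is matched as a literal string.
(my $literal = $line) =~ s/\Q$pattern\E/X/g;

print $as_regex, $literal;
```

The first result is "price: X and X" and the second is "price: X and axb", so only the quoted form leaves "axb" alone.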
EDIT: David W. is right about prototypes.

Unicode in Perl not working

I have some text files which I am trying to transform with a Perl script on Windows. The text files look normal in Notepad++, but all the regexes in my script were failing to match. Then I noticed that when I open the text files in Notepad++, the status bar says "UCS-2 Little Endia" (sic). I am assuming this corresponds to the encoding UCS-2LE. So I created "readFile" and "writeFile" subs in Perl, like so:
use PerlIO::encoding;
my $enc = ':encoding(UCS-2LE)';
sub readFile {
my ($fName) = @_;
open my $f, "<$enc", $fName or die "can't read $fName\n";
local $/;
my $txt = <$f>;
close $f;
return $txt;
}
sub writeFile {
my ($fName, $txt) = @_;
open my $f, ">$enc", $fName or die "can't write $fName\n";
print $f $txt;
close $f;
}
my $fName = 'someFile.txt';
my $txt = readFile $fName;
# ... transform $txt using s/// ...
writeFile $fName, $txt;
Now the regexes match (although less often than I expect), but the output contains long strings of Asian-looking characters interspersed with long strings of the correct text. Is my code wrong? Or perhaps Notepad++ is wrong about the encoding? How should I proceed?
OK, I figured it out. The problem was being caused by a disconnect between the encoding translation done by the "encoding..." parameter of the "open" call and the default CRLF translation done by Perl on Windows. What appeared to be happening was that LF was being translated to CRLF on output after the encoding had already been done, which threw off the "parity" of the 16-bit encoding for the following line. Once the next line was reached, the "parity" got put back. That would explain the "long strings of Asian-looking characters interspersed with long strings of the correct text"... every other line was being messed up.
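The "parity" effect can be reproduced without touching any file. The sketch below simulates the suspected layer order by encoding first and then doing a byte-level LF to CRLF substitution; this is an assumption about what happens on Windows, not a trace of the actual PerlIO layers.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "ab\ncd\n";

# Step 1: what the encoding layer produces (two bytes per character).
my $encoded = encode('UTF-16LE', $text);

# Step 2: a byte-level LF -> CRLF translation applied afterwards,
# simulated here with a plain substitution on the bytes.
(my $translated = $encoded) =~ s/\n/\r\n/g;

printf "encoded: %d bytes, translated: %d bytes\n",
    length($encoded), length($translated);

# Decode the damaged bytes the way an editor expecting UTF-16LE would.
my @cp = map { sprintf 'U+%04X', ord } split //, decode('UTF-16LE', $translated);
print "@cp\n";
```

The first inserted CR is a single byte, so "cd" is read as U+6300 U+6400 (CJK characters); the second inserted CR shifts the stream back onto a 16-bit boundary, which matches the every-other-line pattern described above.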
To correct it, I took out the encoding parameter in my "open" call and added a "binmode" call, as follows:
open my $f, $fName or die "can't read $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
binmode apparently has a concept of "layered" I/O handling that is somewhat complicated.
One thing I can't figure out is how to get my CRLF translation back. If I leave out :raw or add :crlf, the "parity" problem returns. I've tried re-ordering as well and can't get it to work.
(I added this as a separate question: CRLF translation with Unicode in Perl)
I don't have the Notepad++ editor to check, but it may be a BOM problem: your output encoding may not contain a BOM.
http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
Maybe you need to encode $txt using a byte order mark as described above.
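As a rough illustration of the BOM behaviour described on that page: Encode writes a BOM only when the endianness is left unspecified, so with an explicit little-endian encoding one option is to prepend U+FEFF yourself. The helper and variable names below are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Helper to show a byte string as spaced hex.
my $hex = sub { join ' ', map { sprintf '%02X', ord } split //, $_[0] };

# Endianness given explicitly: no BOM is written.
my $le = encode('UTF-16LE', 'A');

# Endianness not given: Encode prepends a BOM and uses big-endian.
my $with_bom = encode('UTF-16', 'A');

# One way to get a little-endian BOM: encode U+FEFF yourself.
my $le_bom = encode('UTF-16LE', "\x{FEFF}A");

print 'UTF-16LE:          ', $hex->($le),       "\n";
print 'UTF-16:            ', $hex->($with_bom), "\n";
print 'UTF-16LE + U+FEFF: ', $hex->($le_bom),   "\n";
```

The three results are 41 00, FE FF 00 41 and FF FE 41 00 respectively.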