Template Toolkit's somevar.substr() and UTF-8 - perl

We use Template Toolkit in a Catalyst app. We configured TT to use UTF-8 and had no problems with it before.
Now I am calling the substr() method on a string variable. Unfortunately it splits the string after n bytes instead of n characters. If the n-th and (n+1)-th bytes form a single Unicode character, the character is split and only its first byte ends up in the substr() result.
How can I fix or work around that behaviour?
[% string = "fööbär";
   string.length;       # prints 9
   string.substr(0, 5); # prints "föö" (1 ASCII + 2x 2-byte Unicode)
   string.substr(0, 4); # prints "fö?" (1 ASCII, 1x 2-byte Unicode, 1 broken char)
%]
Until now we have had no problems with Unicode characters, neither with ones coming from the database nor with text in the templates.
Edit: This is how I configure the Catalyst::View::TT module in my Catalyst app:
__PACKAGE__->config(
    # DEBUG            => DEBUG_ALL,
    DEFAULT_ENCODING   => 'utf-8',
    INCLUDE_PATH       => My::App->path_to( 'root', 'templates' ),
    TEMPLATE_EXTENSION => '.tt',
    WRAPPER            => "wrapper/default.tt",
    render_die         => 1,
);

I did a quick test with the Template module under Perl 5.12.2 for MSWin32.
It handled all of these substr operations properly.
This is my test code:
use Template;

# some useful options (see below for full list)
my $config = {
    # DEFAULT_ENCODING => 'utf-8',
    INCLUDE_PATH => 'd:/devel/perl',   # or list ref
    INTERPOLATE  => 1,                 # expand "$var" in plain text
    EVAL_PERL    => 1,                 # evaluate Perl code blocks
};

# create Template object
my $template = Template->new($config);

# define template variables for replacement
my $vars = {
    var1 => "abcdef",
};

# specify input filename, or file handle, text reference, etc.
my $input = 'ttmyfile.txt';

# process input template, substituting variables
print $template->process($input, $vars);
ttmyfile.txt
Var = [% var1 %]
[% string = "fööbär" -%]
[% string.length %] # prints 6
[% string.substr(0, 5) %] # prints "fööbä"
[% string.substr(0, 4) %] # prints "fööb"
Output:
Var = abcdef
6 # prints 6
fööbä # prints "fööbä"
fööb # prints "fööb"
1
All works fine, even without use utf8 or DEFAULT_ENCODING. Key things here:
Make sure your template .tt files are encoded as UTF-8 with a BOM (Byte Order Mark). This is a must, because Template Toolkit detects the Unicode file encoding from the BOM.
You can use Windows Notepad to save a file with a BOM: just do File --> Save As --> Encoding: "UTF-8".
You can do it in Vim as well by entering :set fenc=utf8 and :set bomb, then saving the file; it will then start with a BOM.
Setting the ENCODING parameter, as in Template->new({ ENCODING => 'utf-8' }), will force Template to load template files as 'utf-8' (see the sketch after this list).
I suggest having use utf8 in your script; it ensures your inline string literals are decoded properly.
Because Catalyst::View::TT relies on Template, I believe it should work there as well. Good luck!
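Applied to the Catalyst config from the question, that would look something like the sketch below. ENCODING is Template's own option, and Catalyst::View::TT passes its config through to Template->new; whether this alone resolves the substr behaviour depends on the versions involved, so treat it as a starting point:
__PACKAGE__->config(
    ENCODING           => 'utf-8',   # Template option: decode .tt files as UTF-8
    INCLUDE_PATH       => My::App->path_to( 'root', 'templates' ),
    TEMPLATE_EXTENSION => '.tt',
    WRAPPER            => "wrapper/default.tt",
    render_die         => 1,
);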

The Wikipedia article on UTF-8 provides a table that shows how non-ASCII characters are encoded. That table illustrates the following simple rules for UTF-8:
If the highest bit of a byte is 0, then the byte denotes an ASCII character.
If the two highest bits of a byte are 11, then this is the start of a multi-byte character, and the number of consecutive 1 bits starting from the highest order bit indicates the total number of bytes in the multi-byte character. Thus, a byte whose bit representation is 110xxxxx is the start of a 2-byte character, 1110xxxx is the start of a 3-byte character, and 11110xxx is the start of a 4-byte character. (You can ignore the hypothetical 5-byte and 6-byte characters because Unicode is limited to being a 21-bit character set rather than a 32-bit character set.)
If the two highest bits of a byte are 10, then this byte is part of a multi-byte character (but not the first byte of that character).
That information should be enough for you to write your own utility functions that are like string.length and string.substring() but work in terms of characters instead of bytes.
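For instance, here is a minimal sketch of such utilities in Perl, operating directly on a raw UTF-8 byte string (the function names are made up for illustration, and malformed input is not handled):
# Count characters by counting non-continuation bytes
# (continuation bytes match the bit pattern 10xxxxxx).
sub utf8_char_length {
    my ($bytes) = @_;
    my $count = 0;
    for my $i (0 .. length($bytes) - 1) {
        my $b = ord(substr($bytes, $i, 1));
        $count++ if ($b & 0xC0) != 0x80;
    }
    return $count;
}

# Split into complete UTF-8 sequences, then take a character-oriented slice.
sub utf8_substr {
    my ($bytes, $offset, $len) = @_;
    my @chars = $bytes =~ /([\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]*)/g;
    my $end = $offset + $len - 1;
    $end = $#chars if $end > $#chars;
    return join '', @chars[$offset .. $end];
}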
Update: The question did not specify the programming language being used, and I was not aware that "Template Toolkit" implied the use of Perl. Once I realised that, I did a Google search and discovered that your problem is likely due to the need to add a use utf8 directive to your source code. You can find a discussion about this here.
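For illustration, this is the difference the pragma makes to a literal like the one in the question:
# Without "use utf8", the two bytes of each ö would count separately (length 9).
use utf8;   # string literals in the source are decoded into characters
my $s = "fööbär";
print length($s), "\n";   # prints 6: characters, not bytes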

The answer is pretty simple (in Perl), fortunately:
use Encode qw{encode decode};
The way this works is that you decode UTF-8 byte strings into Perl character strings, whereupon you can use substr() and length() the way you expect, and then you encode them again for output.
With that header:
my $unicodeString = "f\xC3\xB6\xC3\xB6b\xC3\xA4r";   # 'fööbär' as raw UTF-8 bytes
my $perlString    = decode('UTF-8', $unicodeString);

printf "%d\n", length($perlString);         # should be 6
printf "%s\n", substr($perlString, 0, 3);   # should be 'föö'

# whatever other processing you want here with $perlString . . .

# Then, you want to re-encode that back to a proper UTF-8 byte string:
$unicodeString = encode('UTF-8', $perlString);
Would that help?

Related

In Perl, how to create a "mixed-encoding" string (or a raw sequence of bytes) in a scalar?

In a Perl script of mine, I have to write a mix of UTF-8 and raw bytes into files.
I have a big string in which everything is encoded as UTF-8. In that "source" string, UTF-8 characters are just as they should be (that is, UTF-8-valid byte sequences), while the "raw bytes" have been stored as if they were codepoints of the value held by the raw byte. So, in the source string, a "raw" byte of 0x50 would be stored as one 0x50 byte, whereas a "raw" byte of 0xFF would be stored as the UTF-8-valid two-byte sequence 0xC3 0xBF. When I write these "raw" bytes back, I need to put them back into single-byte form.
I have other data structures allowing me to know which parts of the string represent what kind of data. A list of fields, types, lengths, etc.
When writing in a plain file, I write each field in turn, either directly (if it's UTF-8) or by encoding its value to ISO-8859-1 if it's meant to be raw bytes. It works perfectly.
Now, in some cases, I need to write the value not directly to a file, but as a record of a BerkeleyDB (Btree, but that's mostly irrelevant) database.
To do that, I need to write ALL the values that compose my record, in a single write operation. Which means that I need to have a scalar that holds a mix of UTF-8 and raw bytes.
Example:
Input Scalar (all hex values): 61 C3 8B 00 C3 BF
Expected Output Format: 2 UTF-8 characters, then 2 raw bytes.
Expected Output: 61 C3 8B 00 FF
At first, I created the string by concatenating, onto an empty string, the same values I was writing to my file. Then I tried writing that very string to a "standard" file without adding any encoding layer. I got '?' characters instead of all my raw bytes over 0x7F (because, obviously, Perl decided to consider my string to be UTF-8).
Then, to try to tell Perl that it was already encoded, and to "please not try to be smart about it", I encoded the UTF-8 parts into "UTF-8", encoded the binary parts into "ISO-8859-1", and concatenated everything. Then I wrote it out. This time the bytes looked perfect, but the parts that were already UTF-8 had been "double-encoded": each byte of a multi-byte character had been treated as a codepoint of its own...
I thought Perl wasn't supposed to re-encode "internal" UTF-8 into "encoded" UTF-8, if it was internally marked as UTF-8. The string holding all the values in UTF-8 comes from a C API, which sets the UTF-8 marker (or is supposed to, at the very least), to let Perl know it is already decoded.
Any idea about what I did miss there?
Is there a way to tell Perl what I want to do is just put a bunch of bytes one after another, and to please not try to interpret them in any way? The file I write to is opened as ">:raw" for that very reason, but I guess I need a way to specify that a given scalar is "raw" too?
Epilogue: I found the cause of the problem. The $bigInputString was supposed to be entirely composed of UTF-8 encoded data. But it did contain raw bytes with big values, because of a bug in C (turns out a "char" (not "unsigned char") is best tested with bitwise operators, instead of a " > 127"... ahem). So, "big" bytes weren't split into a two-bytes UTF-8 sequence, in the C API.
Which means the $bigInputString, created from the bad C data, didn't have the expected contents, and Perl rightfully didn't like it either.
After I corrected the bug, the string correctly encoded to UTF-8 (for the parts I wanted to keep as UTF-8) or LATIN-1 (for the "raw bytes" I wanted to convert back), and I got no further problems.
Sorry for wasting your time, guys. But I still learned some things, so I'll keep this here. Moral of the story: Devel::Peek is GOOD for debugging (thanks ikegami), and one should always double-check instead of assuming. Granted, I was in a hurry on Friday, but the fault is still mine.
So, thanks to everyone who helped, or tried to, and special thanks to ikegami (again), who used quite a bit of his time helping me.
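Since Devel::Peek came up: for anyone else debugging this kind of issue, here is a minimal sketch of the check it enables.
use Devel::Peek;

my $s = "f\xC3\xB6";   # raw bytes
Dump($s);              # the FLAGS line shows no UTF8 flag
utf8::decode($s);
Dump($s);              # FLAGS now includes UTF8; the PV line shows the decoded form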
Assuming you have a Unicode string where you know how each codepoint is supposed to be stored (as a UTF-8 sequence or as a single byte), and a way to create a template string where each character says which form the corresponding character of the Unicode string should use (U for UTF-8, C for a single byte, to keep things simple), you can use pack:
#!/usr/bin/env perl
use strict;
use warnings;

sub process {
    my ($str, $formats) = @_;
    # The leading "C0" keeps pack in byte mode, so each "U" emits the UTF-8
    # bytes of its codepoint and each "C" emits a single raw byte.
    my $template = "C0$formats";
    my @chars = map { ord } split(//, $str);
    pack $template, @chars;
}

my $str = "\x61\xC3\x8B\x00\xC3\xBF";
utf8::decode($str);
print process($str, "UUCC"); # Outputs 0x61 0xc3 0x8b 0x00 0xff
So you have
my $in = "\x61\xC3\x8B\x00\xC3\xBF";
and you want
my $out = "\x61\xC3\x8B\x00\xFF";
This is the result of decoding only some parts of the input string, so you want something like the following:
sub decode_utf8 { my ($s) = @_; utf8::decode($s) or die("Invalid Input"); $s }

my $out = join "",
    substr($in, 0, 3),
    decode_utf8(substr($in, 3, 1)),
    decode_utf8(substr($in, 4, 2));
Tested.
Alternatively, you could decode the entire thing and re-encode the parts that should be encoded.
sub encode_utf8 { my ($s) = @_; utf8::encode($s); $s }

utf8::decode($in) or die("Invalid Input");
my $out = join "",
    encode_utf8(substr($in, 0, 2)),
    substr($in, 2, 1),
    substr($in, 3, 1);
Tested.
You have not indicated how you know which parts to decode and which not to, but you indicated that you have this information.

Perl unicode file with non-unicode content

A piece of software is producing UTF-8 files, but writes content to the file that isn't Unicode. I can't change that software and have to take the output as it is now. Don't know if this will show up here correctly, but the German umlaut "ä" appears in the file as "Ã¤".
If I open the file in Notepad++, it tells me the file is UTF-8 (without BOM) encoded. Now, if I say "convert to ANSI" in Notepad++ and then switch the file encoding back to UTF-8 (without converting), the German umlauts in the file are correct. How can I achieve the exact same behaviour in Perl? Whatever I tried up to now, the umlaut mess just got worse.
To reproduce, create yourself a UTF-8 encoded file and write this content to it:
MÃ¤nner SchÃ¼ler VÃ¶gel SÃ¼ÃŸ
Then, on a UTF-8 MySQL database, create a table with a varchar field and utf8_unicode_ci collation. Now, use this script:
use utf8;
use DBI;
use Encode;

if (open FILE, "test.csv") {
    my $db = DBI->connect(
        'DBI:mysql:your_db;host=127.0.0.1;mysql_compression=1', 'root', 'Yourpass',
        { PrintError => 1 }
    );
    my $sql = qq{SET NAMES 'utf8';};
    $db->do($sql);
    while (my $line = <FILE>) {
        my $sth = $db->prepare("INSERT IGNORE INTO testtable (testline) VALUES (?);");
        $sth->execute($line);
    }
}
The exact contents of the file get written to the database. But the output I expect in the database is with correct German umlauts:
Männer Schüler Vögel Süß
So, how can I convert that correctly?
It's ironic: as I see it, the software you talk about is not writing "non-unicode content" (that's nonsense); it encodes the content as UTF-8 twice. Let's take this ä character as an example: it's represented by two bytes in UTF-8, 0xC3 0xA4. But then something in that program decides to treat these bytes as Latin-1 instead: thus they become two separate characters (which will eventually be encoded into UTF-8 again, and that's what gets saved into the file).
I suppose the simplest way of reversing this is making Perl treat the string read from the file as a series of bytes (and not as a sequence of characters). It can be done as simply (and as ugly) as...
open my $fh, '<:utf8', $file_name or die $!;
my $string = <$fh>;    # a sequence of characters...
utf8::decode($string); # ...treated in place as a sequence of octets and decoded again
Sounds like something is converting it a second time, assuming it to be something like ISO 8859-15 and then converting that to UTF-8. You can reverse this by converting UTF-8 to ISO 8859-15 (or whichever encoding seems to make sense for your data).
As seen on http://www.fileformat.info/info/unicode/char/E4/index.htm the bytes 0xC3 0xA4 are the valid UTF-8 encoding of ä. When viewed as ISO 8859-15 (or 8859-1, or Windows-1252, or a number of other 8-bit encodings) they display as the string Ã¤.
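For completeness, a minimal sketch of that reversal with Encode, assuming Latin-1 (ISO 8859-1) was the encoding of the bad pass; substitute ISO 8859-15 or Windows-1252 as appropriate:
use Encode qw(encode decode);

my $text  = "M\x{C3}\x{A4}nner";   # the double-encoded "Männer", read back as characters
my $fixed = decode('UTF-8', encode('ISO-8859-1', $text));
print $fixed eq "M\x{E4}nner" ? "fixed\n" : "still broken\n";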

Convert bit vector to binary in Perl

I'm not sure of the best way to describe this.
Essentially I am attempting to write to a buffer which requires a certain protocol. The first two bytes I would like to write are "10000001" and "11111110" (bit by bit). How can I write these two bytes to a file handle in Perl?
To convert spelled-out binary to actual bytes, you want the pack function with either B or b (depending on the order you have the bits in):
print FILE pack('B*', '1000000111111110');
However, if the bytes are constant, it's probably better to convert them to hex values and use the \x escape with a string literal:
print FILE "\x81\xFE";
How about
# open my $fh, ...
print $fh "\x81\xFE"; # 10000001 and 11111110
Since version 5.6.0 (released in March 2000), perl has supported binary literals as documented in perldata:
Numeric literals are specified in any of the following floating point or integer formats:
12345
12345.67
.23E-10 # a very small number
3.14_15_92 # a very important number
4_294_967_296 # underscore for legibility
0xff # hex
0xdead_beef # more hex
0377 # octal (only numbers, begins with 0)
0b011011 # binary
You are allowed to use underscores (underbars) in numeric literals between digits for legibility. You could, for example, group binary digits by threes (as for a Unix-style mode argument such as 0b110_100_100) or by fours (to represent nibbles, as in 0b1010_0110) or in other groups.
You may be tempted to write
print $fh 0b10000001, 0b11111110;
but the output would be
129254
because 10000001₂ = 129₁₀ and 11111110₂ = 254₁₀.
You want a specific representation of the literals’ values, namely as two unsigned bytes. For that, use pack with a template of "C2", i.e., octet times two. Adding underscores for readability and wrapping it in a convenient subroutine gives
sub write_marker {
    my ($fh) = @_;
    print $fh pack "C2", 0b1000_0001, 0b1111_1110;
}
As a quick demo, consider
binmode STDOUT or die "$0: binmode: $!\n"; # we'll send binary data
write_marker *STDOUT;
When run as
$ ./marker-demo | od -t x1
the output is
0000000 81 fe
0000002
In case it’s unfamiliar, the od utility is used here for presentational purposes, because the output contains a control character and þ (Latin small letter thorn) in my system’s encoding.
The invocation above commands od to render in hexadecimal each byte from its input, which is the output of marker-demo. Note that 10000001₂ = 81₁₆ and 11111110₂ = FE₁₆. The numbers in the left-hand column are offsets: the special marker bytes start at offset zero (that is, immediately), and there are exactly two of them.

Filtering microsoft 1252 characters out of an ASCII text file opened in utf8 mode in Perl

I have a reasonable size flat file database of text documents mostly saved in 8859 format which have been collected through a web form (using Perl scripts). Up until recently I was negotiating the common 1252 characters (curly quotes, apostrophes etc.) with a simple set of regex's:
$line=~s/\x91/\&\#8216\;/g; # smart apostrophe left
$line=~s/\x92/\&\#8217\;/g; # smart apostrophe right
... etc.
However, since I decided I ought to be going Unicode and have converted all my scripts to read in and output UTF-8 (which works a treat for all new material), the regexes for these (existing) 1252 characters no longer work, and my Perl HTML output literally outputs the four characters '\x92', '\x93', etc. At least, that's how it appears in a browser in UTF-8 mode; after downloading (FTP, not HTTP) and opening the file in a text editor (TextPad) it's different: a single undefined character remains. Opening the output file in Firefox's default (no Content-Type header) 8859 mode renders the correct character.
The new utf8 pragmas at the start of the script are:
use CGI qw(-utf8);
use open IO => ':utf8';
I understand this is due to UTF-8 mode making those characters double-byte instead of single-byte (it applies to the chars in the 0x80 to 0xFF range), having read up on the Wikibooks article relating to this; however, I was none the wiser as to how to filter them. Ideally I know I ought to re-save all the documents in UTF-8 (since the flat-file database now contains a mixture of 8859 and UTF-8), but I will need some kind of filter in the first place if I'm going to do that anyway.
And I could be wrong about the two-byte internal storage, since the article did seem to imply that Perl handles things very differently depending on circumstances.
If anybody could provide me with a regex solution I would be very grateful. Or some other method. I have been tearing my hair out for weeks on this with various attempts and failed hacking. There are only about six 1252 characters that commonly need replacing, and with a filter method I could re-save the whole flippin' lot in UTF-8 and forget there ever was a 1252...
Encoding::FixLatin was specifically written to help fix data broken in the same manner as yours.
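For reference, its use is essentially a one-liner (fix_latin is the module's main export):
use Encoding::FixLatin qw(fix_latin);

my $mixed = "smart quote \x91 plus utf-8 \xC3\xB6";   # CP1252 byte mixed with UTF-8 bytes
my $clean = fix_latin($mixed);   # returns a Perl character string with both repaired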
Ikegami already mentioned the Encoding::FixLatin module.
Another way to do it, if you know that each string will be either UTF-8 or CP1252, but not a mixture of both, is to read it as a binary string and do:
unless ( utf8::decode($string) ) {
    require Encode;
    $string = Encode::decode(cp1252 => $string);
}
Compared to Encoding::FixLatin, this has two small advantages: a slightly lower chance of misinterpreting CP1252 text as UTF-8 (because the entire string must be valid UTF-8) and the possibility of replacing CP1252 with some other fallback encoding. A corresponding disadvantage is that this code could fall back to CP1252 on strings that are not entirely valid UTF-8 for some other reason, such as because they were truncated in the middle of a multi-byte character.
You could also use Encode.pm's support for fallback.
use Encode qw[decode];

my $octets = "\x91 Foo \xE2\x98\xBA \x92";
my $string = decode('UTF-8', $octets, sub {
    my ($ordinal) = @_;
    return decode('Windows-1252', pack 'C', $ordinal);
});

printf "<%s>\n",
    join ' ', map { sprintf 'U+%.4X', ord $_ } split //, $string;
Output:
<U+2018 U+0020 U+0046 U+006F U+006F U+0020 U+263A U+0020 U+2019>
Did you recode the data files? If not, opening them as UTF-8 won't work. You can simply open them as
open $filehandle, '<:encoding(cp1252)', $filename or die ...;
and everything (tm) should work.
If you did recode, something seems to have gone wrong, and you need to analyze what it is and fix it. I recommend using hexdump to find out what actually is in a file. Text consoles and editors sometimes lie to you; hexdump never lies.
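If you want the same check without leaving Perl, a rough sketch (the real hexdump utility remains the gold standard):
# Read the file as raw bytes and print each one in hex.
open my $fh, '<:raw', $filename or die $!;
my $data = do { local $/; <$fh> };
print join(' ', map { sprintf '%02X', ord } split //, $data), "\n";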

Convert a UTF8 string to ASCII in Perl

I've tried everything Google and StackOverflow have recommended (that I could find) including using Encode. My code works but it just uses UTF8 and I get the wide character warnings. I know how to work around those warnings but I'm not using UTF8 for anything else so I'd like to just convert it and not have to adapt the rest of my code to deal with it. Here's my code:
my $xml = XMLin($content);

# Populate the @titles array with each item title.
my @titles;
for my $item (@{ $xml->{channel}->{item} }) {
    my $title = Encode::decode_utf8($item->{title});
    #my $title = $item->{title};
    #utf8::downgrade($title, 1);
    Encode::from_to($title, 'utf8', 'iso-8859-1');
    push @titles, $title;
}
return @titles;
Commented out you can see some of the other things I've tried. I'm well aware that I don't know what I'm doing here. I just want to end up with a plain old ASCII string though. Any ideas would be greatly appreciated. Thanks.
The answer depends on how you want to use the title. There are 3 basic ways to go:
Bytes that represent a UTF-8 encoded string.
This is the format that should be used if you want to store the UTF-8 encoded string outside your application, be it on disk or sending it over the network or anything outside the scope of your program.
A string of Unicode characters.
The concept of characters is internal to Perl. When you perform Encode::decode_utf8, a bunch of bytes is converted into a string of characters as seen by Perl. The Perl VM (and the programmer writing Perl code) cannot externalize that concept except by decoding UTF-8 bytes on input and encoding them to UTF-8 bytes on output. For example, say your program receives two bytes as input that you know represent UTF-8 encoded character(s), 0xC3 0xB6. In that case decode_utf8 returns a representation that, instead of two bytes, sees one character: ö.
You can then proceed to manipulate that string in Perl. To illustrate the difference further, consider the following code:
use feature 'say';
use Encode qw(decode_utf8);

my $bytes = "\xC3\xB6";
say length($bytes);    # prints "2"

my $string = decode_utf8($bytes);
say length($string);   # prints "1"
The special case of ASCII, a subset of UTF-8.
ASCII is a very small subset of Unicode, where characters in that range are represented by a single byte. Converting Unicode into ASCII is an inherently lossy operation, as most Unicode characters are not ASCII characters. When trying to coerce a Unicode string to ASCII, you're forced either to drop every character in your string that is not ASCII, or to map each Unicode character to its closest ASCII equivalent (which isn't possible in the vast majority of cases).
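Two hedged sketches of those options (Text::Unidecode is a separate CPAN module, not part of Encode):
# Option 1: drop every non-ASCII character.
(my $ascii = $string) =~ s/[^\x00-\x7F]//g;

# Option 2: rough transliteration via Text::Unidecode.
use Text::Unidecode;
my $approx = unidecode($string);   # e.g. "fööbär" becomes "foobar"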
Since you have wide character warnings, it means that you're trying to manipulate (possibly output) Unicode characters that cannot be represented as ASCII or ISO-8859-1.
If you do not need to manipulate the title from your XML document as a string, I'd suggest you leave it as UTF-8 bytes (I'd mention that you should be careful not to mix bytes and characters in strings). If you do need to manipulate it, then decode, manipulate, and on output encode it in UTF-8.
For further reading, please use perldoc to study perlunitut, perlunifaq, perlunicode, perluniintro, and Encode.
Although this is an old question, I just spent several hours (!) trying to do more or less the same thing! That is: read data from a UTF-8 XML file, and convert that data into the Windows-1252 codepage (I could also have used Latin1, ISO-8859-1 etc.) in order to be able to create filenames with accented letters.
After much experimentation, and even more searching, I finally managed to get the conversion working. The "trick" is to use Encode::encode instead of Encode::decode.
For example, given the code in the original question, the correct (or at least one :-) way to convert from UTF-8 would be:
my $title = Encode::encode("Windows-1252", $item->{title});
or
my $title = Encode::encode("ISO-8859-1", $item->{title});
or
my $title = Encode::encode("<your-favourite-codepage-here>", $item->{title});
I hope this helps others having similar problems!
You can use the following line to simply get rid of the warning. This assumes that you want to use UTF8, which shouldn't normally be a problem.
binmode(STDOUT, ":encoding(utf8)");