I get an extra CR using TT (Perl Template Toolkit) - perl

I use Perl v5.10 (on Windows 7) + TT v2.22. When I use TT, for each source line I get an extra CR in the produced HTML:
Source text (windows format):
"Some_html" CR LF
Output text :
"Some_html" CR
CR LF
However, when I convert my source file to Unix format and then run TT, I get:
Source text (unix format):
"Some_html" LF
Output text :
"Some_html" CR LF
(I use Notepad++ to show the CR & LF characters, and also to switch the source template between Unix and Windows formats.)
When I google the problem, I find a few posts about an extra ^M on Windows, but I couldn't find an explanation of the root cause, nor a true solution (just some workarounds for getting rid of the extra ^M).
Although not a real problem, I find it quite "unclean".
Is there some configuration that I should turn on? (I reviewed www.template-toolkit.org/docs/manual/Config.html but could not find anything.)
Or some other solution, other than post-fixing the output file?
Thanks

Template Toolkit reads the template source files in binary mode, but writes the output in text mode. Data from the template (which already contains CR LF) is translated again on output in text mode, so each LF becomes CR LF and you end up with CR CR LF.
The easiest solution is to open the output file in binary mode (note the :raw layer in the open call):
my $tt = Template->new;
my $output_file = 'some_file.txt';
open my $out_fh, '>:raw', $output_file or die "$output_file: $!\n";
$tt->process('template', \%data, $out_fh) or die $tt->error();
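If you would rather let TT open the output file itself, process() also accepts the same binmode option in its optional fourth argument (the option used in the next answer's call to process()). A minimal sketch, assuming a template file named 'template' and an output file some_file.txt:
use strict;
use warnings;
use Template;

# Sketch of the same fix without opening the file handle yourself: the
# fourth argument to process() is an options hash, and its binmode entry
# is applied to the output file that TT opens for you.
my $tt = Template->new({ OUTPUT_PATH => '.' }) || die "$Template::ERROR\n";
my %data = ();   # your template variables

$tt->process('template', \%data, 'some_file.txt', { binmode => ':raw' })
    or die $tt->error();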

bvr's solution unfortunately doesn't work for output generated using [% FILTER redirect(...) %]. On Windows 10, the template
[% FILTER redirect("bar.txt") %]
This text is for bar.txt.
[% END %]
This text is for foo.txt.
(with DOS-style CR-LF line endings) expanded through
#! /bin/perl
use strict;
use warnings;
use Template;
my $tt = Template->new({
    OUTPUT_PATH => '.',
    RELATIVE    => 1,
}) || die "$Template::ERROR\n";
my $srcfile = 'foo.txt.tt';
my $tgtfile = 'foo.txt';
open my $ofh, '>:raw', $tgtfile or die;
$tt->process($srcfile, {}, $ofh, { binmode => ':raw' })
    || die $tt->error . "\n";
creates output file foo.txt with the expected CR-LF line endings, but creates bar.txt with bad CR-CR-LF line endings:
> od -c bar.txt
0000000 \r \r \n T h i s t e x t i s
0000020 f o r b a r . t x t . \r \r \n
0000037
I reported this problem to the TT author at https://github.com/abw/Template2/issues/63.
I found a simple workaround: in sub Template::_output (in Template.pm), change
my $bm = $options->{ binmode };
to
my $bm = $options->{ binmode } // $BINMODE;
Then, in your main Perl script, set
$Template::BINMODE = ':raw';
Then you can process the template using
$tt->process($srcfile, {}, $tgtfile) || die $tt->error . "\n";
and get CR-LF line endings in both the main and redirected output.
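For reference, here is the whole patched workflow assembled into one script. It assumes the one-line change to Template::_output shown above has been applied, so $Template::BINMODE supplies the default binmode for both the main output and the redirect output:
use strict;
use warnings;
use Template;

# Only honoured with the one-line patch to Template::_output above.
$Template::BINMODE = ':raw';

my $tt = Template->new({
    OUTPUT_PATH => '.',
    RELATIVE    => 1,
}) || die "$Template::ERROR\n";

# Both foo.txt and the redirected bar.txt now get CR-LF line endings.
$tt->process('foo.txt.tt', {}, 'foo.txt')
    || die $tt->error . "\n";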

Have a good day,
I found a very simple solution for this:
my $tt = Template->new({
    ...,
    PRE_CHOMP  => 1,
    POST_CHOMP => 1,
    ...
});
This configuration tells the template engine to remove the whitespace and newlines (the CR LF) immediately before and after each directive in the template, so the directive lines themselves no longer leave blank lines in the output.
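For illustration, a minimal sketch of the chomping behaviour, using a hypothetical inline template with the directives on their own lines:
use strict;
use warnings;
use Template;

# With PRE_CHOMP/POST_CHOMP enabled, the newlines around the directive
# lines are swallowed, so they don't leave blank lines in the output.
my $tt = Template->new({ PRE_CHOMP => 1, POST_CHOMP => 1 });

my $template = "[% IF name %]\nHello [% name %]!\n[% END %]\n";
my $out = '';
$tt->process(\$template, { name => 'world' }, \$out) or die $tt->error();
print $out;    # "Hello world!" with no surrounding blank lines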

Related

How to get Perl to match \r in files with windows EOL characters

I'm trying to write a perl script to identify files with Windows EOL characters, but \r matching doesn't seem to work.
Here's my test script:
#!/usr/bin/perl
use File::Slurp;
$winfile = read_file('windows-newlines.txt');
if($winfile =~ m/\r/) {
print "winfile has windows newlines!\n"; # I expect to get here
}
else {
print "winfile has unix newlines!\n"; # But I actually get here
}
$unixfile = read_file('unix-newlines.txt');
if($unixfile =~ m/\r/) {
print "unixfile has windows newlines!\n";
}
else {
print "unixfile has unix newlines!\n";
}
Here's what it outputs:
winfile has unix newlines!
unixfile has unix newlines!
I'm running this on Windows, and I can confirm in Notepad++ that the files definitely have the correct EOL characters:
Unless binmode is set (which it is not in your code), read_file will change \r\n to \n on Windows. From the File::Slurp source:
# line endings if we're on Windows
${$buf_ref} =~ s/\015\012/\012/g if ${$buf_ref} && $is_win32 && !$opts->{binmode};
To keep the original bytes (and thus the CR LF line endings), set binmode as shown in the documentation:
my $bin = read_file('/path/file', { binmode => ':raw' });
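Applied to the test script from the question, a minimal sketch (same file names assumed) could look like this:
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp;

# Reading with binmode => ':raw' keeps the CR bytes, so \r matching works.
for my $file ('windows-newlines.txt', 'unix-newlines.txt') {
    my $content = read_file($file, binmode => ':raw');
    if ($content =~ /\r\n/) {
        print "$file has windows newlines!\n";
    }
    else {
        print "$file has unix newlines!\n";
    }
}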

Perl chomp doesn't remove the newline

I want to read a string from the first line of a file, then print it n times to the console, where n is specified on the second line of the file.
Simple I think?
#!/usr/bin/perl
open(INPUT, "input.txt");
chomp($text = <INPUT>);
chomp($repetitions = <INPUT>);
print $text x $repetitions;
Where input.txt is as follows
Hello
3
I expected the output to be
HelloHelloHello
But the words come out separated by newlines, even though chomp is used.
Hello
Hello
Hello
You can try it on the following online Perl fiddle: CompileOnline
The strange thing is that if the code is as follows:
#!/usr/bin/perl
open(INPUT, "input.txt");
chomp($text = <INPUT>);
print $text x 3;
It will work fine and displays
HelloHelloHello
Am I misunderstanding something, or is it a problem with the online compiler?
You have an issue with line endings; chomp removes a trailing $/ (the input record separator) from $text, and that separator varies by platform. You can instead remove any trailing whitespace from the strings with a regex:
open(my $INPUT, "<", "input.txt");
my $text = <$INPUT>;
my $repetitions = <$INPUT>;
s/\s+\z// for $text, $repetitions;
print $text x $repetitions;
I'm using an online Perl editor/compiler as mentioned in the initial post http://compileonline.com/execute_perl_online.php
The reason for your output is that the string Hello\rHello\rHello\r is interpreted differently in HTML (\r acts like a line break), while in a console \r just returns the cursor to the beginning of the current line.
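Coming back to the fix: if you prefer to strip only line endings rather than all trailing whitespace, a minimal sketch (assuming the same input.txt) is:
#!/usr/bin/perl
use strict;
use warnings;

# Strip either Unix (\n) or DOS (\r\n) line endings explicitly.
open(my $input, "<", "input.txt") or die "input.txt: $!";
my $text        = <$input>;
my $repetitions = <$input>;
s/\r?\n\z// for $text, $repetitions;
print $text x $repetitions;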

Losing encoding when opening and saving a file

I'm trying to open a file with regular HTML and special Unicode characters such as "ÖÄÅ öäå" (Swedish), format it and then output it to a file.
So far everything works out great, I can open the file, find the parts I need and output into a file.
But here is the point:
I can't save the input Unicode data to the file without losing my encoding (e.g. an 'ö' comes out garbled).
If I enter the characters manually into the code itself, I can both match them with a regex and output them with the correct encoding; but not when I'm reading a file, formatting it and then writing the output.
Example of the working approach using octal escapes (this outputs to the file without the encoding problem):
my $charsSWE = "öäåÅÄÖ";
# \344 = ä
# \345 = å
# \305 = Å
# \304 = Ä
# \326 = Ö
# \366 = ö
my $SwedishLetters = '\344 \345 \305 \304 \326 \366';
if($charsSWE =~ /([$SwedishLetters]+)/){
print "Output: $1\n";
}
The approach below does not work because the encoding is lost (this is a quick illustration of that part of the code, but the concept is the same, i.e. open the file, fetch and output):
open(FH, 'swedish.htm') or die("File could not be opened");
while(<FH>)
{
    my @List = /([$SwedishLetters]+)/g;
    message($List[0]) if @List;
}
close(FH);
use Encode;
open FILE1, "<:encoding(UTF-8)", "swedish.htm" or die $!;
#do stuff
open FILE2, ">:encoding(UTF-8)", "output.htm" or die $!;
You may need to use a different encoding.
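Assuming the file really is UTF-8, a minimal sketch of the whole round trip (read, match the Swedish letters, write) could look like this; the file names follow the question:
use strict;
use warnings;
use utf8;   # the pattern below contains literal Swedish characters

# Read the HTML as UTF-8, pick out runs of Swedish letters, and write
# them back out as UTF-8 so the characters survive untouched.
open my $in,  "<:encoding(UTF-8)", "swedish.htm" or die $!;
open my $out, ">:encoding(UTF-8)", "output.htm"  or die $!;

while (my $line = <$in>) {
    while ($line =~ /([åäöÅÄÖ]+)/g) {
        print {$out} "Output: $1\n";
    }
}
close $in;
close $out;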

Perl Text::CSV_XS Encoding Issues

I'm having issues with Unicode characters in Perl. When I receive data from the web, I often get garbled sequences such as √¢¬Ä¬ú (which should be a quotation mark) or √¢¬Ç¬¨ (which should be the Euro symbol).
Now I can easily substitute in the correct values in Perl and print the corrected words to the screen, but when I output to a .CSV file all the substitutions I have done are for nothing and I get garbage in my .CSV file. (The quotes work, I'm guessing because it's such a common character.) 'Numéro', for example, also comes out garbled. The examples are endless.
I wrote a small program to try to figure this issue out, but am not sure what the problem is. I read in another Stack Overflow thread that you can import the .CSV into Excel and choose UTF-8 encoding, but that option does not appear for me. I'm wondering if I can just encode it into whatever Excel's native character set is (UTF-16BE???), or if there is another solution. I have tried many variations of this short program; let me say again that it's just for testing Unicode problems, not part of a legit program. Thanks.
use strict;
use warnings;
require Text::CSV_XS;
use Encode qw/encode decode/;
my $text = 'Numéro Numéro Numéro Orkos Capital SAS (√¢¬Ä¬úOrkos√¢¬Ä¬ù) 325M√¢¬Ç¬¨ in 40 companies headquartered';
print("$text\n\n\n");
$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;
print("$text\n\n\n");
my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV->error_diag();
open my $OUTPUT, ">:encoding(utf8)", "unicode.csv" or die "unicode.csv: $!";
my @row = ($text);
$CSV->print($OUTPUT, \@row);
$OUTPUT->autoflush(1);
I've also tried these two lines to no avail:
$text = decode("Guess", $text);
$text = encode("UTF-16BE", $text);
First, your strings are encoded in MacRoman. When you interpret the second one (√¢¬Ç¬¨) as a byte sequence, you get C3 A2 C2 82 C2 AC. This looks like UTF-8, and the decoded form is E2 82 AC. This again looks like UTF-8, and when you decode it you get €. So what you need to do is:
$step1 = decode("MacRoman", $text);
$step2 = decode("UTF-8", $step1);
$step3 = decode("UTF-8", $step2);
Don't ask me by what mysterious route this encoding was created in the first place. Your first character decodes as U+201C, which is indeed LEFT DOUBLE QUOTATION MARK.
Note: If you are on a Mac, the first decoding step may be unnecessary since the encoding is only in the "presentation layer" (when you copied the Perl source into the HTML form and your browser did the encoding-translation for you) and not in the data itself.
So I figured out the answer; the comment from Roland Illig helped me get there (thanks again!). Decoding more than once causes the 'wide character' errors, and therefore should not be done.
The key here is decoding the UTF-8 text and then encoding the output in MacRoman. To send the .CSV files to my Windows friends I have to save them as .XLSX first so that the encoding doesn't get all screwy again.
$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;
$text = decode("UTF-8", $text);
print("$text\n\n\n");
my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV->error_diag();
open my $OUTPUT, ">:encoding(MacRoman)", "unicode.csv" or die "unicode.csv: $!";
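Not part of the answers above, but a commonly used alternative when the file only has to open cleanly in a reasonably recent Excel is to write the CSV as UTF-8 with a leading byte order mark, which Excel uses to detect the encoding. A minimal sketch:
use strict;
use warnings;
use utf8;   # literal é and € below
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" })
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

open my $out, ">:encoding(UTF-8)", "unicode.csv" or die "unicode.csv: $!";
print {$out} "\x{FEFF}";              # BOM; the :encoding layer writes EF BB BF
$csv->print($out, ['Numéro', '325M€']);
close $out;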

How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

For example, given an empty file テスト.txt, how would I make a copy called テスト.txt.copy?
My first crack at it managed to access the file and create the new filename, but the copy came out with a garbled (mojibake) name instead of テスト.txt.copy.
Here was my first crack at it:
#!/usr/bin/env perl
use strict;
use warnings;
use English '-no_match_vars';
use File::Basename;
use Getopt::Long;
use File::Copy;
use Win32;
my (
$output_relfilepath,
) = process_command_line();
open my $fh, '>', $output_relfilepath or die $!;
binmode $fh, ':utf8';
foreach my $short_basename ( glob( '*.txt') ) {
# skip the output basename if it's in the glob
if ( $short_basename eq $output_relfilepath ) {
next;
}
my $long_basename = Win32::GetLongPathName( $short_basename );
my $new_basename = $long_basename . '.copy';
print {$fh} sprintf(
"short_basename = (%s)\n" .
" long_basename = (%s)\n" .
" new_basename = (%s)\n",
$short_basename,
$long_basename,
$new_basename,
);
copy( $short_basename, $new_basename );
}
printf(
"\n%s done! (%d seconds elapsed)\n",
basename( $0 ),
time() - $BASETIME,
);
# === subroutines ===
sub process_command_line {
# default arguments
my %args
= (
output_relfilepath => 'output.txt',
);
GetOptions(
'help' => sub { print usage(); exit },
'output_relfilepath=s' => \$args{output_relfilepath},
);
return (
$args{output_relfilepath},
);
}
sub usage {
my $script_name = basename $0;
my $usage = <<END_USAGE;
======================================================================
Test script to copy files with a UTF-8 filenames to files with
different UTF-8 filenames. This example tries to make copies of all
.txt files with versions that end in .txt.copy.
usage: ${script_name} (<options>)
options:
-output_relfilepath <s> set the output relative file path to <s>.
this file contains the short, long, and
new basenames.
(default: 'output.txt')
----------------------------------------------------------------------
examples:
${script_name}
======================================================================
END_USAGE
return $usage;
}
Here are the contents of output.txt after execution:
short_basename = (BD9A~1.TXT)
long_basename = (テスト.txt)
new_basename = (テスト.txt.copy)
I've tried replacing File::Copy's copy command with a system call:
my $cmd = "copy \"${short_basename}\" \"${new_basename}\"";
print `$cmd`;
and with Win32::CopyFile:
Win32::CopyFile( $short_basename, $new_basename, 'true' );
Unfortunately, I get the same garbled result in both cases. For the system call, the print shows '1 file(s) copied.' as expected.
Notes:
I'm running Perl 5.10.0 via Strawberry Perl on Windows 7 Professional
I use the Win32 module to access long filenames
The glob returns short filenames, which I have to use to access the file
テスト = test (tesuto) in katakana
I've read perlunitut and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
This should be possible with the CopyFileW function from Win32API::File, which should be included with Strawberry. I've never messed with Unicode filenames myself, so I'm not sure of the details. You might need to use Encode to manually convert the filename to UTF-16LE (encode('UTF-16LE', $filename)).
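A minimal sketch of that suggestion; the exact calling convention (UTF-16LE strings with a trailing NUL, third argument meaning "fail if the target already exists") is my assumption from the Win32API::File conventions, so treat it as a starting point:
use strict;
use warnings;
use utf8;                     # literal テスト below
use Encode qw(encode);
use Win32API::File qw(CopyFileW);

my $src = "テスト.txt";
my $dst = "テスト.txt.copy";

# CopyFileW is assumed to want UTF-16LE, NUL-terminated path strings;
# the third argument is "fail if the target exists" (0 = allow overwrite).
CopyFileW(
    encode('UTF-16LE', $src) . "\0\0",
    encode('UTF-16LE', $dst) . "\0\0",
    0,
) or die "CopyFileW failed: $^E\n";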
You're getting the long filename using Win32, which gives you a UTF-8-encoded string.
However, you're then setting the long filename using plain copy, which uses the C stdlib IO functions. The stdlib functions use the default filesystem encoding.
On modern Linuxes that's usually UTF-8, but on Windows it (sadly) never is, because the system default code page cannot be set to UTF-8. So you'll get your UTF-8 string interpreted as a code page 1252 string on a Western European Windows install, as has happened here. (On a Japanese machine it'd get interpreted as code page 932 — like Shift-JIS — which would come out something like 繝�せ繝�.)
I've not done this in Perl, but I'd suspect the Win32::CopyFile function would be more likely to be able to handle the kind of Unicode paths returned elsewhere in the Win32 module.
Use Encode::Locale, which works out the file system's locale encoding (the ANSI code page on Windows) so the file names can be encoded into the byte form the C runtime expects:
use Encode::Locale;
use Encode;
use File::Copy;
copy( encode(locale_fs => $short_basename),
      encode(locale_fs => $new_basename) ) || die $!;
I successfully duplicated your problem on my Windows machine (Win XP Simplified Chinese version) and my conclusion is that the problem is caused by the font. Choose a TrueType font rather than Raster Fonts and see if everything is okay.
My experiment is this:
I first changed the code page of my Windows Console from the default 936 (GBK) to 65001 (UTF-8).
by typing C:>chcp 65001
I wrote a script that contains the code $a = "テスト"; print $a; and saved it as UTF-8.
I ran the script from the console and found that "テスト" came out garbled, which is exactly the same symptom you described in your question.
I changed the console font from Raster Fonts to Lucida Console, and the console screen gave me "テストストトト", which is still not quite right, but I assume it is getting closer to the core of the problem.
So although I'm not 100% sure, the problem is probably caused by the font.
Hope this helps.
See https://metacpan.org/pod/Win32::Unicode
#!/usr/bin/perl --
use utf8;
use strict;
use warnings;
my @kebabs = (
    "\x{45B}\x{435}\x{432}\x{430}\x{43F}.txt",               ## ћевап.txt
    "ra\x{17E}nji\x{107}.txt",                               ## ražnjić.txt
    "\x{107}evap.txt",                                       ## ćevap.txt
    "\x{43A}\x{435}\x{431}\x{430}\x{43F}\x{447}\x{435}.txt", ## кебапче.txt
    "kebab.txt",
);
{
    use Win32::Unicode qw/ -native /;
    printW "I \x{2665} Perl"; # unicode console out
    mkpathW 'meat';
    chdirW 'meat';
    for my $kebab ( @kebabs ){
        printW "kebabing the $kebab\n";
        open my($fh), '>:raw', $kebab or dieW Fudge($kebab);
        print $fh $kebab or dieW Fudge($kebab);
        close $fh or dieW Fudge($kebab);
    }
}
sub Fudge {
    use Errno();
    join qq/\n/,
        "Error @_",
        map { " $_" } int( $! ) . q/ / . $!,
                      int( $^E ) . q/ / . $^E,
                      grep( { $!{$_} } keys %! ),
                      q/ /;
}