Why does Encode::decode with non-latin letter locales blow up on localised strftime output? - perl

On Ubuntu with Perl 5.26.1 I have encountered the following problem when working on Dancer::Logger::Console. I've lifted this code out of Dancer2::Core::Role::Logger.
In order to run this, you need to generate the following locales:
sudo locale-gen de_DE.UTF-8
sudo locale-gen ko_KR.UTF-8
This example code uses the Korean locale, and fails without an error message. $@ is empty.
$ LC_ALL=ko_KR.UTF-8 perl -MPOSIX -MEncode -E 'eval {
say Encode::decode("UTF-8", strftime("%b", localtime))
};
say $@;
'
Wide character at -e line 1.
When run with a German locale, it succeeds (but throws a wide character warning, which we can ignore for this test).
$ LC_ALL=de_DE.UTF-8 perl -MPOSIX -MEncode -E 'eval {
say Encode::decode("UTF-8", strftime("%b", localtime))
};
say $@;
'
Wide character in say at -e line 2.
M�r
The %b conversion specifier produces the abbreviated month name in the current locale (see http://strftime.net/).
If we don't Encode::decode("UTF-8", ...), it works, and the version above with Korean produces 3월.
What's going on here?

Under ko_KR.UTF-8, strftime("%b", localtime(1552997524)) returns 20.33.C6D4. When interpreted as Unicode Code Points, this is "␠3월" ("March", with a leading space).
Under de_DE.UTF-8, strftime("%b", localtime(1552997524)) returns 4D.E4.72. When interpreted as Unicode Code Points, this is "Mär" (short form of "März", "March").
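So it seems decoded text (Unicode Code Points) is being returned, which is perfect. Passing that text to Encode::decode is the mistake, because decode expects encoded bytes, not characters. All that's left to do is to encode the outputs.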
$ LC_ALL=ko_KR.UTF-8 perl -CSD -MPOSIX -e'CORE::say strftime("%b", localtime)'
3월
$ LC_ALL=de_DE.UTF-8 perl -CSD -MPOSIX -e'CORE::say strftime("%b", localtime)'
Mär
In a program (as opposed to a one-liner), you could use something like the following instead of -CSD:
use open ':std', ':encoding(UTF-8)';

Related

Replacing Windows CRLF with Unix LF using Perl -- `Unrecognized switch: -g`?

Problem Background
We have several thousand large (more than 10M lines) text files of tabular data produced by a Windows machine which we need to prepare for upload to a database.
We need to change the file encoding of these files from cp1252 to utf-8, replace any bare Unix LF sequences (i.e. \n) with spaces, then replace the DOS line end sequences ("CR-LF", i.e. \r\n) with Unix line end sequences (i.e. \n).
The dos2unix utility is not available for this task.
We initially had a bash function that packaged these operations together using iconv and sed, with iconv doing the encoding and sed dealing with the LF/CRLF sequences. I'm trying to replace part of this bash function with a perl command.
Example Code
Based on some helpful code review, I want to change this function to a perl script.
The author of the code review suggested the following perl to replace CRLF (i.e. "\r\n") with LF ("\n").
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;'
The explanation for why this is better than what we had previously makes perfect sense, but this line fails for me with:
Unrecognized switch: -g (-h will show valid options).
More interestingly, the author of the code review also suggests it is possible to perform the decode/recode in a perl script, too, but I am completely unsure where to start.
Questions
Please can someone explain why the suggested answer fails with Unrecognized switch: -g (-h will show valid options)?
If it helps, the line is supposed to receive piped input from iconv as follows (though I am interested in learning how to use perl to do the decoding/recoding step, too):
iconv --from-code=CP1252 --to-code=UTF-8 "$1" | \
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' \
> "$2"
(Highly simplified) example input for testing:
apple|orange|\n|lemon\r\nrasperry|strawberry|mango|\n\r\n
Desired output:
apple|orange| |lemon\nrasperry|strawberry|mango| \n
Perl v5.36.0 added the command-line switch -g ("gulp mode") as an alias for -0777, which reads the whole input at once.
This works in Perl version v5.36.0:
s=$(printf "Line 1\nStill Line 1\r\nLine 2\r\nLine 3\r\n")
perl -g -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
Prints:
Line 1 Still Line 1
Line 2
Line 3
With any version of perl earlier than v5.36.0, you would use -0777 instead:
perl -0777 -pe 's/(?<!\r)\n/ /g; s/\r\n/\n/g;' <<<"$s"
# same
BTW, the conversion you are looking for is way easier in this case with awk, since it is close to the defaults.
Just do this:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' <<<"$s"
Line 1 Still Line 1
Line 2
Line 3
Or, if you have a file:
awk -v RS="\r\n" '{gsub(/\n/," ")} 1' file
This is superior to the posted perl solution since the file is processed record by record (each block of text separated by \r\n) instead of having to read the entire file into memory.
(On Windows you may need to do awk -v RS="\r\n" -v ORS="\n" '...')
Another note:
You can get similar behavior from Perl by:
Setting the input record separator to the fixed string "\r\n" ($/="\r\n") in a BEGIN block;
Using the -l switch so every line has the input record separator removed;
Using tr for speedy replacement of \n with ' ';
Possibly setting the output record separator, $\="\n", on Windows.
Full command:
perl -lpE 'BEGIN{$/="\r\n"} tr/\n/ /' file
The error message is about the command-line switch -g you use in perl -g -pe .... It is not about the /g modifier on the regex, which is valid (but useless without slurp mode, since -p reads line by line and each line contains only a single \n anyway).
This switch simply does not exist with the perl version you are using. It was only added with perl 5.36, so you are likely using an older version. Try -0777 instead.

Perl regex replacement of logical unicode characters

Here is a simple substitution that adds parentheses around upper-case characters in a Unicode string. As you can see, the result is rather ugly:
~$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5
My understanding is that the regex operates on "code points" instead of "logical characters", which splits my 'é' into meaningless characters. Is there a way to force the regex to work on whole logical Unicode characters?
Thanks,
As illustrated by the other answers, turning on UTF-8 in Perl is a piecemeal process. There's use utf8 for the syntax and raw strings. Then you have to make sure all your filehandles are UTF-8. What about @ARGV? readdir? glob? The output from ``?
There's nothing worse than having half your program working in ASCII and the other half working in UTF-8. utf8::all to the rescue!
Install it, add use utf8::all, and it will turn on UTF-8... all of it. Someone else figured it out, you don't have to worry about it.
$ echo "Whatéver 5" | perl -ape "use utf8::all; s/(\p{Upper})/(\1)/g"
(W)hatéver 5
You haven't told Perl to expect UTF-8 input, so it is treating each byte of the encoding as a separate character.
Within a program you can set the default encoding for the three standard IO channels like this:
use open ':std' => ':encoding(UTF-8)';
On the command line, the option -CS does the same thing, so this should work for you. I have removed the unnecessary autosplit option and replaced \1 with the correct $1 in the replacement string. The Perl code is in single quotes so that the shell does not expand $1 before Perl sees it.
echo "Whatéver 5" | perl -CS -pe 's/(\p{Upper})/($1)/g'
Assuming that your terminal uses UTF-8 encoding,
$ echo -n "é" | perl -ne 'printf "%vX\n", $_'
gives
C3.A9
so the input to the Perl program has not been converted internally to Unicode (it is still a string of UTF-8 bytes)
To convert the input to a Perl string, add a UTF-8 layer on the standard input stream using option -CI :
$ echo -n "é" | perl -CI -ne 'printf "%vX\n", $_'
the output is now
E9
However, if you also try to print the character back to standard output, you will not get é but a Unicode replacement character � from the terminal. This is because Perl writes the character U+00E9 as the single byte 0xE9, but the terminal expects UTF-8, and a lone 0xE9 byte is not valid UTF-8:
$ echo -n "é" | perl -CI -nE 'printf "%s: %vX\n", $_, $_'
�: E9
To get correct output, you can add a UTF-8 encoding layer on the standard output stream also (using the -CO flag):
$ echo -n "é" | perl -CIO -nE 'printf "%s: %vX\n", $_, $_'
é: E9
According to perlunicode
"Upper" is a synonym for "Uppercase" , and we could have written
\p{Uppercase} equivalently as \p{Upper}
and
For instance, \p{Uppercase} matches any single character with the
Unicode "Uppercase" property
It seems that using \p{Upper} on a byte string produces no warnings from Perl. Also, bytes in the range 0xC0 to 0xDE (except 0xD7, the multiplication sign) will match the uppercase property. Try
perl -E 'for $i (0x80..0xFF) {$_=chr $i; printf "%x\n", $i if /\p{Upper}/}'
This explains the output you got:
$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5
Here, the letter é is represented as 2 bytes (in UTF-8) 0xC3 and 0xA9, and 0xC3 will match the Unicode Upper property.
A solution to your problem is therefore to add UTF-8 encoding layers on the standard input and output (you can combine -CI and -CO using -CS):
echo "Whatéver 5" | perl -CS -ape "s/(\p{Upper})/(\1)/g"
with output:
(W)hatéver 5

Perl: Replace text parameter by current timestamp

I have a utility to generate code documentation every night. I would like to add a timestamp in order to be aware how old the generated documentation is. I would like to use perl.
I've seen that with the following command I can replace a placeholder (%1) with any value I want
perl -pi.bak -e 's/%1/date/g' footer.html
And with this other one I can get the system timestamp:
perl -MPOSIX -we "print POSIX::strftime('%d/%m/%Y %H:%M:%S', localtime)"
My question is whether there is any way to merge both instructions into a single command.
Thank you very much
Try doing this :
perl -MPOSIX -pi.bak -e 'BEGIN{$date = strftime("%d/%m/%Y %H:%M:%S", localtime);} s/%1/$date/g' file.html
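sh command:
perl -i.bak -MPOSIX -pe's{%1}{strftime("%d/%m/%Y %H:%M:%S", localtime)}eg' footer.html
cmd command:
perl -i.bak -MPOSIX -pe"s{%1}{strftime('%d/%m/%Y %H:%M:%S', localtime)}eg" footer.html
/e causes the replacement expression to be treated as Perl code to execute, the result of which is the replacement text. The substitution uses s{...}{...} delimiters because the slashes inside the date format would otherwise end an s/.../.../ expression early.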

(e)grep: accented characters not recognised as part of a word

I would like to use (e)grep to match a whole word using the -w switch. I've set the locale, but accented characters are being treated as word boundaries as in this example:
$ locale
LANG=es_VE.utf8
LC_CTYPE="es_VE.utf8"
LC_NUMERIC="es_VE.utf8"
LC_TIME="es_VE.utf8"
LC_COLLATE="es_VE.utf8"
LC_MONETARY="es_VE.utf8"
LC_MESSAGES="es_VE.utf8"
LC_ALL=es_VE.utf8
$ echo -e "cáñamo\namo" | egrep -w amo
cáñamo
amo
How can I find amo while ignoring cáñamo?
Which code points count as a word-class character is not locale-dependent in Unicode, and LATIN SMALL LETTER N WITH TILDE is always a word character.
Here’s an all-UTF8 workflow demonstrating searching for amo after a word boundary, and after a non-(word-boundary):
$ perl -Mutf8 -CSDA -e 'print "cáñamo\namo\n"' |
perl -Mutf8 -CSDA -ne 'print if /\bamo\b/'
amo
$ perl -Mutf8 -CSDA -e 'print "cáñamo\namo\n"' |
perl -Mutf8 -CSDA -ne 'print if /\Bamo\b/'
cáñamo
I cannot help but be amused by your choice of search strings. Thanks for the chuckle.

How to get the default encoding of current OS in perl script?

How can I get the default encoding used by current platform?
Is there any available module in CPAN or with the distribution of Perl itself?
I can't find the solution on perl.org.
See I18N::Langinfo.
$ LANG=en_US.UTF-8 perl -MI18N::Langinfo=langinfo,CODESET -E 'say langinfo(CODESET())'
UTF-8
$ LANG=C perl -MI18N::Langinfo=langinfo,CODESET -E 'say langinfo(CODESET())'
ANSI_X3.4-1968
$ LANG=ja_JP.eucjp perl -MI18N::Langinfo=langinfo,CODESET -E 'say langinfo(CODESET())'
EUC-JP
This is probably what you're looking for. If you follow the code in I18N::Langinfo, you can see how it discovers what locale to use for returning this.