How can I reverse a string that contains combining characters in Perl? - perl

I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters, so I wind up getting "\x{0301}emus\x{0301}er" ( ́emuśer). How can I reverse the string, but still respect the combining characters?

You can use the \X special escape (match a non-combining character and all of the following combining characters) with split to make a list of graphemes (with empty strings between them), reverse the list of graphemes, then join them back together:
#!/usr/bin/perl
use strict;
use warnings;
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";

The best answer is to use Unicode::GCString, as Sinan points out
I modified Chas's example a bit:
Set the encoding on STDOUT to avoid "wide character in print" warnings;
Use a positive lookahead assertion (and no separator retention mode) in split (doesn't work after 5.10, apparently, so I removed it)
It's basically the same thing with a couple of tweaks.
use strict;
use warnings;
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print <<HERE;
original: [$original]
wrong: [$wrong]
right: [$right]
HERE

You can use Unicode::GCString:
Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);
use Unicode::GCString;
my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse #{ $y->as_arrayref };
say "$x -> $wrong";
say "$y -> $correct";
Output:
résumé -> ́emuśer
résumé -> émusér

Perl6::Str->reverse also works.
In the case of the string résumé, you can also use the Unicode::Normalize core module to change the string to a fully composed form (NFC or NFKC) before reverseing; however, this is not a general solution, because some combinations of base character and modifier have no precomposed Unicode codepoint.

Some of the other answers contain elements that don't work well. Here is a working example tested on Perl 5.12 and 5.14. Failing to specify the binmode will cause the output to generate error messages. Using a positive lookahead assertion (and no separator retention mode) in split will cause the output to be incorrect on my Macbook.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'unicode_strings';
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";

Related

How do I search and replace with "OR" condition

This is a trivial issue, but I hope someone can point me to the right way to do it. I have a string "Thunderstorms" which I replace with "T/storms".
s/Thunderstorms/T\/Storms/gi
It so happens that "Thunderstorms" is sometimes written as "Thunder Storms". Instead of writing two search and replace commands, I am looking for replacing "Thunderstorms" or "Thunder storms" with "T/Storms" in one command.
You can use \s* to match zero or more whitespaces:
use warnings;
use strict;
while (<DATA>) {
s{Thunder\s*storms}{T/Storms}gi;
print;
}
__DATA__
Thunderstorms
Thunder Storms
Thunder storms
Outputs:
T/Storms
T/Storms
T/Storms
I used different delimeters for the substitution operator (s{}{}) to avoid escaping the /.
use |.
s/Thunderstorms|Thunder\sStorms/T\/Storms/gi
sample code:
use strict;
use warnings;
my $str = 'Thunderstorms foo Thunder Storms bar';
$str =~ s/Thunderstorms|Thunder\sStorms/T\/Storms/gi;
print $str;

Why doesn't Text::Balanced::extract_bracketed extract the text inside a LaTeX tag?

I am trying to parse balanced text (actually, text written in LaTeX) using extract_bracketed from Text::Balanced. However, I did not get a correct match with the following code:
use Text::Balanced qw(extract_bracketed);
my $data = 'xxx \footnote{...} yyy';
(my $ext, my $rem, my $pre) = extract_bracketed($data, '{}', '\footnote');
print "\$ext = $ext\n";
print "\$rem = $rem\n";
print "\$pre = $pre\n";
This prints:
$ext =
$rem = xxx \footnote{...} yyy
$pre =
According to the documentation, this output means that a failure occurred, but I do not understand why.
What I actually want to extract is ..., i.e. the contents of the \footnote command.
Why is this happening and how can I fix it?
Text::Balanced sets $# on failure so you can get details about the cause:
use strict;
use warnings 'all';
use 5.010;
use Text::Balanced qw(extract_bracketed);
my $text = 'xxx \footnote{...} yyy';
my ($substring, $remainder, $prefix) = extract_bracketed($text, '{}', '\footnote');
warn $# if $#;
Output:
Did not find prefix: /\footnote/, detected at offset 0 at balanced line 12.
The prefix didn't match because:
it has to match from the beginning of the string all the way to the first occurrence of the delimiter
\f matches a form feed, not a literal backslash followed by the letter f
The following prefix matches everything up to the first curly brace:
use strict;
use warnings 'all';
use 5.010;
use Text::Balanced qw(extract_bracketed);
my $text = 'xxx \footnote{...} yyy';
my ($substring, $remainder, $prefix) = extract_bracketed($text, '{}', '[^{}]*');
say "<$_>" for $prefix, $substring, $remainder;
Output:
<xxx \footnote>
<{...}>
< yyy>
To actually remove a nested footnote tag from the text, leaving its contents, you need to use extract_tagged:
use strict;
use warnings 'all';
use 5.010;
use Text::Balanced qw(extract_tagged);
my $text = '\footnote{abc \footnote{...} def \emph{!!!} ghi}';
my #pieces = extract_tagged(
$text,
'\\\footnote{',
'}',
'(?s).*\\\footnote{.*(?=\\\footnote{)'
);
my ($remainder, $prefix, $contents) = #pieces[1, 2, 4];
say $prefix . $contents . $remainder;
Output:
\footnote{abc ... def \emph{!!!} ghi}
Note that this approach works for the simple input you gave, but won't work as a general-purpose LaTeX parser. There are a couple of LaTeX parsers on CPAN, but LaTeX::TOM looks fairly limited and LaTeX::Parser hasn't been updated since 2000.
If you need to do more complex parsing, you may need to write your own parser.

Perl unicode conversion

I'm using this code:
use Unicode::UTF8 qw[decode_utf8 encode_utf8];
my $d = "opposite Spencer\u2019s Aliganj, Lucknow";
my $string = decode_utf8($d);
my $octets = encode_utf8($d);
print "\nSTRING :: $string";
I want output like
opposite Spencer's Aliganj, Lucknow
what to do ?
If you just want unicode #2019 to become ’ you can use one of this ways:
use strict;
use warnings;
use open ':std', ':encoding(utf-8)';
print chr(0x2019);
print "\x{2019}"; # for characters 0x100 and above
print "\N{U+2019}";
\u \U in perl translates to uppercase in perl:
Case translation operators use the Unicode case translation tables
when character input is provided. Note that uc(), or \U in
interpolated strings, translates to uppercase, while ucfirst, or \u in
interpolated strings, translates to titlecase in languages that make
the distinction (which is equivalent to uppercase in languages without
the distinction).
You're trying to parse butchered JSON.
You could parse it yourself.
use Encode qw( decode );
my $incomplete_json = "opposite Spencer\u2019s Aliganj, Lucknow";
my $string = $incomplete_json;
$string =~ s{\\u([dD][89aAbB]..)\\u([dD][cCdDeEfF]..)|\\u(....)}
{ $1 ? decode('UTF-16be', pack('H*', $1.$2)) : chr(hex($3)) }eg;
Or you could fix it up then use an existing parser
use JSON::XS qw( decode_json );
my $incomplete_json = "opposite Spencer\u2019s Aliganj, Lucknow";
my $json = $incomplete_json;
$json =~ s/"/\\"/g;
$json = qq{["$json"]};
my $string = decode_json($json)->[0];
Untested. You may have to handle other slashes. Which solution is simpler depends on how you have to handle the other slashes.

searching words in Greek in Unix and Perl

I have txt files that are greek and now I want to search specific words in them using perl and bash ... the words are like ?a?, t?, e??
I was searching for words in english and now want to replace them by greek but all I get is ??? mostly... for Perl:
my %word = map { $_ => 1 } qw/name date birth/;
and for bash
for X in name date birth
do
can someone please help me?
#!/usr/bin/perl
use strict;
use warnings;
# Tell Perl your code is encoded using UTF-8.
use utf8;
# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';
my #words = qw( καί τό εἰς );
my %words = map { $_ => 1 } #words;
my $pat = join '|', map quotemeta, keys %words;
while (<>) {
if (/$pat/) {
print;
}
}
Usage:
script.pl file.in >file.out
Notes:
Make sure the source code is encoded using UTF-8 and that you use use utf8;.
Make sure you use the use open line and specify the appropriate encoding for your data file. (If it's not UTF-8, change it.)

Can I use the y operator to do a non-one-to-one transliteration in Perl?

The y operator in Perl does character-by-character transliteration. For example, if we do y/abc/dfg to the string "foobar", we get "foofdr". But what if I want to transliterate "ā" to "ei" and "ä" to "a:" and "ō" to "әu" and "o" to "ɒ".
I tried the following line of code but no luck:(
y/āäōo/(ei)(a:)(әu)ɒ/
Do we hopefully have a workaround for this problem? Or do I have to repeatedly use the s operator and do a lot of cumbersome substitutions?
Thanks in advance for any guidance :)
In this case, create a hash and go from the keys to the strings easily.
use warnings;
use strict;
use utf8;
binmode STDOUT, ":utf8";
my $string = "āäōo";
my %trans = qw/ā ei ä a: ō u o ɒ/;
my $keys = join '', keys %trans;
$string =~ s/([$keys])/$trans{$1}/g;
print "$string\n";
You need to alter this if your keys are more than one character long by sorting the keys in order of decreasing length and joining them using ( | | ) instead of [ ].
It sounds like you're trying to do something similar to Text::Unaccent::PurePerl.