Searching for Greek words in Unix and Perl

I have text files that are in Greek and I want to search for specific words in them using Perl and bash ... words like καί, τό, εἰς.
I was searching for English words and now want to replace them with Greek ones, but mostly all I get is ???. For Perl I had:
my %word = map { $_ => 1 } qw/name date birth/;
and for bash
for X in name date birth
do
Can someone please help me?

#!/usr/bin/perl
use strict;
use warnings;
# Tell Perl your code is encoded using UTF-8.
use utf8;
# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';
my @words = qw( καί τό εἰς );
my %words = map { $_ => 1 } @words;
my $pat = join '|', map quotemeta, keys %words;
while (<>) {
    if (/$pat/) {
        print;
    }
}
Usage:
script.pl file.in >file.out
Notes:
Make sure the source code is encoded using UTF-8 and that it includes use utf8;.
Make sure you use the use open line and specify the appropriate encoding for your data file. (If it's not UTF-8, change it.)
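To see why the use utf8; line matters, here is a minimal check (a sketch, assuming this file itself is saved as UTF-8). Without use utf8;, Perl would treat the literal καί as six separate bytes rather than three characters.
#!/usr/bin/perl
# Minimal sketch: why `use utf8;` matters for the Greek literals above.
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';

my $word = 'καί';
print length($word), "\n";    # 3 (characters, not the 6 bytes of its UTF-8 form)
if ('καί τό εἰς' =~ /\Q$word\E/) {
    print "match\n";
}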

Text::SpellChecker module and Unicode

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );
while ( my $word = $checker->next_word ) {
    print "Bad word is $word\n";
}
Output: Bad word is rdinator
Desired: Bad word is coördinator
The module breaks if I have Unicode in $text. Any idea how this can be solved?
I have Aspell 0.50.5 installed, which is used by this module. I think this might be the culprit.
Edit: As Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell and installed Hunspell and Text::Hunspell, then ran:
$ hunspell -d en_US -l < badword.txt
coördinator
That shows the correct result, which means there's something wrong either with my code or with Text::SpellChecker.
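For what it's worth, the same check can be made from Perl by talking to Hunspell directly through Text::Hunspell; here is a minimal sketch (the dictionary paths and the encode step are assumptions for a typical en_US install, adjust for your system):
#!/usr/local/bin/perl
# Sketch only: ask Hunspell directly whether it accepts the word.
# The .aff/.dic paths are assumptions for a typical en_US install.
use strict;
use warnings;
use utf8;
use Encode qw(encode);
use Text::Hunspell;

my $speller = Text::Hunspell->new(
    '/usr/share/hunspell/en_US.aff',    # assumed path
    '/usr/share/hunspell/en_US.dic',    # assumed path
);
# Hunspell expects bytes in the dictionary's encoding (often UTF-8).
my $word = encode('UTF-8', 'coördinator');
print $speller->check($word) ? "accepted\n" : "misspelled\n";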
Taking Miller's suggestion into consideration, I did the following:
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text = "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
    print "Bad word is $word\n";
}
OUTPUT:
Flag is 1
Text is coördinator
Bad word is rdinator
Does this mean the module is not able to handle utf8 characters properly?
It is a Text::SpellChecker bug: the current version assumes ASCII-only words.
http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm
#
# next_word
#
# Get the next misspelled word.
# Returns false if there are no more.
#
sub next_word {
    ...
    while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {
IMHO the best fix would be to use a per-language/locale word-splitting regular expression, or to leave word splitting to the underlying library. aspell list reports coördinator as a single word.
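For illustration, a more Unicode-aware pattern might look like the sketch below (this only shows the shape of such a fix, not the code that actually went into the module):
#!/usr/bin/perl
# Sketch of a Unicode-aware word splitter; an illustration only.
use strict;
use warnings;
use utf8;
use open ':std', ':encoding(UTF-8)';

my $text = "coördinator and καί";
# \p{L} matches any Unicode letter, \p{M} picks up combining marks
# (relevant when the text is in decomposed form).
while ($text =~ m/([\p{L}\p{M}]+(?:'[\p{L}\p{M}]+)?)/g) {
    print "word: $1\n";    # coördinator, and, καί
}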
I've incorporated Chankey's solution and released version 0.12 to the CPAN; give it a try.
The validity of diaeresis in words like coördinator is interesting. The default aspell and hunspell dictionaries seem to mark it as incorrect, though some publications may disagree.
best,
Brian

How to use utf8::encode with the open pragma

I have a problem with utf8::encode when I use the pragma use open qw(:std :utf8);
Example
#!/usr/bin/env perl
use v5.16;
use utf8;
use open qw(:std :utf8);
use Data::Dumper;
my $word = "+банк";
say Dumper($word);
say utf8::is_utf8($word) ? 1 : 0;
utf8::encode($word);
say Dumper($word);
say utf8::is_utf8($word) ? 1 : 0;
Output
$VAR1 = "+\x{431}\x{430}\x{43d}\x{43a}";
1
$VAR1 = '+банк';
0
When I remove this pragma use open qw(:std :utf8);, everything is OK.
$VAR1 = "+\x{431}\x{430}\x{43d}\x{43a}";
1
$VAR1 = '+банк';
0
Thank you in advance!
If you're going to replace utf8::encode($word); with use open qw(:std :utf8);, you'll actually need to remove the utf8::encode($word);. In the version that doesn't work, you're encoding twice.
utf8::encode is not what you want if you are going to print to a filehandle upon which perl expects to output utf8.
utf8::encode says: take this string and give me a string where each character is one byte of the UTF-8 encoding of the input. You would normally do this only if you are then going to use the string somewhere Perl won't automatically convert it to UTF-8 for you.
If you add a say length($word); after the encode, you will see that $word is 9 characters, not the original 5.
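A small sketch of what is happening (the character counts match the explanation above):
#!/usr/bin/env perl
# Sketch of the double encoding described above.
use v5.16;
use utf8;
use open qw(:std :utf8);

my $word = "+банк";
say length($word);     # 5 characters
utf8::encode($word);   # $word is now a byte string: the UTF-8 encoding
say length($word);     # 9 (one "character" per byte)
say $word;             # the :utf8 layer encodes those bytes again -> mojibake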

Comparing two non-ascii strings in perl

I am unable to compare two non-ASCII strings, although both strings appear the same on the console. Below is what I tried. Please let me know what is missing here so that the two variables compare as equal.
if ($lineContent[7] ne $name) {
    # control comes here
    print "###### Values MIS-MATCHED\n";
} else {
    print "###### Values MATCHED\n";
}
$lineContent[7] is from a CSV file
$name is from an XML file
When Putty's console is in the default Characterset
CSV Val: ENB69-åºå°å±
XML Val: ENB69-åºå°å±
When Putty's Console is set to UTF-8
CSV Val: ENB69-基地局
XML Val: ENB69-基地局
#!/usr/bin/perl
use warnings;
use strict;
use Encode;
binmode STDOUT, ":encoding(utf8)";
open F1, "<:utf8", "$ARGV[0]" or die "$!";
open F2, "<", "$ARGV[0]" or die "$!";
my $a1 = <F1>;
chomp $a1;
my $a2 = <F2>;
chomp $a2;
if ($a1 eq $a2) {
    print "$a1=$a2 is true\n";
} else {
    print "$a1=$a2 is false\n";
}
my $b = decode("utf-8", $a2);
if ($a1 eq $b) {
    print "$a1=$b is true\n";
} else {
    print "$a1=$b is false\n";
}
I wrote the test program listed above and created a text file with one line: 基地局.
When you run the program with this text file, you get one false and one true.
I don't know what's in your program, but I guess the CSV file is read as plain text without any parser or encode/decode step, whereas the XML file is parsed by some library, so the internal encoding mechanism differs between the two string variables.
Simply put, you can try encoding or decoding one of the two string variables and see if they then match.
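For example, a minimal sketch of that idea, assuming the CSV value came in as raw UTF-8 bytes while the XML value is already a decoded character string:
# Sketch: decode the byte string so both sides are character strings.
# Assumes $lineContent[7] holds raw UTF-8 bytes and $name is already decoded.
use Encode qw(decode);

my $csv_val = decode('UTF-8', $lineContent[7]);
if ($csv_val eq $name) {
    print "###### Values MATCHED\n";
} else {
    print "###### Values MIS-MATCHED\n";
}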
By the way, this is my first answer here, hope it can be a little bit helpful to you ;-)
From your dump results it's obvious: the first variable stores nine characters that are the bytes of the UTF-8 encoding of 基地局, while the second variable stores three characters. They have the same byte stream, so they are equal in a byte-stream view but not in a character-based comparison.
Using decode/encode can solve your problem.
Your inputs:
"ENB13-\345\237\272\345\234\260\345\261\200"
"ENB13-\x{57fa}\x{5730}\x{5c40}"
As you can see, these are clearly not the same. Specifically, the first is the UTF-8 encoding of the other. Always decode inputs. Always encode outputs.
use strict;
use warnings;
use utf8; # Source code is saved as UTF-8
use open ':std', ':encoding(UTF-8)'; # Terminal expects UTF-8
my $name = "ENB69-基地局";
while (my $line = <STDIN>) {
    chomp $line;
    my @lineContent = split /\t/, $line;
    print $lineContent[7] eq $name ? 1 : 0, "\n";   # 1
}
Personally, I would be a little more careful if you know you are comparing Unicode strings. Unicode::Collate is the module for the job.
Of course you should also read tchrist's now-famous SO post on the topic of enabling unicode in Perl, https://stackoverflow.com/a/6163129/468327, but utf8::all does an admirable job of turning on proper unicode support. Note that better unicode handling was added to the Perl core in version 5.14 so I require that here as well.
Finally, here is a quick script that does the comparison; of course, you would populate the variables by reading the files as needed:
#!/usr/bin/env perl
use v5.14;
use strict;
use warnings;
use utf8::all;
use Unicode::Collate;
my $collator = Unicode::Collate->new;
my $csv = "ENB69-基地局";
my $xml = "ENB69-基地局";
say $collator->eq($csv, $xml) ? "equal" : "unequal";

Perl's YAML::XS and unicode

I am trying to use Perl's YAML::XS module with Unicode letters and it doesn't seem to work the way it should.
I wrote this in the script (which is saved as UTF-8):
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Instead of something sane, mojibake is printed. According to this link, though, it should be working fine.
Yes, when I YAML::XS::Load it back I get the correct strings again, but I don't like the fact that the dumped string seems to be in the wrong encoding.
Am I doing something wrong? I am always unsure about Unicode in Perl, to be frank...
Clarification: my console supports UTF-8. Also, when I print it to a file opened with a utf8 handle via open $file, ">:utf8", instead of STDOUT, it still doesn't print correct UTF-8 letters.
Yes, you're doing something wrong. You've misunderstood what the link you mentioned means. Dump & Load work with raw UTF-8 bytes; i.e. strings containing UTF-8 but with the UTF-8 flag off.
When you print those bytes to a filehandle with the :utf8 layer, they get interpreted as Latin-1 and converted to UTF-8, producing double-encoded output (which can be read back successfully as long as you double-decode it). You want to binmode STDOUT, ':raw' instead.
Another option is to call utf8::decode on the string returned by Dump. This will convert the raw UTF-8 bytes to a character string (with the UTF-8 flag on). You can then print the string to a :utf8 filehandle.
So, either
use utf8;
binmode STDOUT, ":raw";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
print $s;
Or
use utf8;
binmode STDOUT, ":utf8";
my $hash = {č => "ř"}; #czech letters with unicode codes U+010D and U+0159
use YAML::XS;
my $s = YAML::XS::Dump($hash);
utf8::decode($s);
print $s;
Likewise, when reading from a file, you want to read in :raw mode or use utf8::encode on the string before passing it to Load.
When possible, you should just use DumpFile & LoadFile, letting YAML::XS deal with opening the file correctly. But if you want to use STDIN/STDOUT, you'll have to deal with Dump & Load.
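For the file case, a minimal sketch using DumpFile/LoadFile (the filename is just an example):
use strict;
use warnings;
use utf8;
use YAML::XS qw(DumpFile LoadFile);

my $hash = {č => "ř"};
# DumpFile/LoadFile open the file themselves, so there is no I/O layer to get wrong.
DumpFile('data.yml', $hash);              # 'data.yml' is just an example name
my $roundtrip = LoadFile('data.yml');
print "ok\n" if $roundtrip->{č} eq "ř";   # the characters survive the round trip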
It works if you don't use binmode STDOUT, ":utf8";. Just don't ask me why.
I'm using the following for UTF-8 JSON and YAML. There is no error handling, but it shows how to do it.
The code below allows me to:
use NFC normalisation on input and no NFD on output, simply using everything in NFC
edit the YAML/JSON files with UTF-8-enabled vim and bash tools
have things like \w regexes, lc, uc and so on work "inside" Perl (at least for my needs)
keep the source code in UTF-8, so I can write regexes like /á/
My "boilerplate"...
use 5.014;
use warnings;
use utf8;
use feature qw(unicode_strings);
use charnames qw(:full);
use open qw(:std :utf8);
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD NFC);
use File::Slurp;
use YAML::XS;
use JSON::XS;
run();
exit;
sub run {
    my $yfilein  = "./in.yaml";     # input yaml
    my $jfilein  = "./in.json";     # input json
    my $yfileout = "./out.yaml";    # output yaml
    my $jfileout = "./out.json";    # output json
    my $ydata = load_utf8_yaml($yfilein);
    my $jdata = load_utf8_json($jfilein);
    # the "uc" is not "fully correct" but works for my needs
    $ydata->{$_} = uc($ydata->{$_}) for keys %$ydata;
    $jdata->{$_} = uc($jdata->{$_}) for keys %$jdata;
    save_utf8_yaml($yfileout, $ydata);
    save_utf8_json($jfileout, $jdata);
}
# using File::Slurp to read/write files
# NFC only on input - and not NFD on output (change this if you want)
# this ensures that I can edit and copy/paste filenames without problems
sub load_utf8_yaml { return YAML::XS::Load(encode_nfc_read(shift)) }
sub load_utf8_json { return decode_json(encode_nfc_read(shift)) }
sub encode_nfc_read { return encode 'utf8', NFC read_file shift, { binmode => ':utf8' } }
# more efficient
sub rawsave_utf8_yaml { return write_file shift, {binmode=>':raw'}, YAML::XS::Dump shift }
# and similarly for yaml and json
sub save_utf8_yaml { return write_file shift, {binmode=>':utf8'}, decode 'utf8', YAML::XS::Dump shift }
sub save_utf8_json { return write_file shift, {binmode=>':utf8'}, JSON::XS->new->pretty(1)->encode(shift) }
You can try it with the following in.yaml:
---
á: ä
č: ď
é: ě
í: ĺ
ľ: ň
ó: ô
ö: ő
ŕ: ř
š: ť
ú: ů
ü: ű
ý: ž

How can I reverse a string that contains combining characters in Perl?

I have the string "re\x{0301}sume\x{0301}" (which prints like this: résumé) and I want to reverse it to "e\x{0301}muse\x{0301}r" (émusér). I can't use Perl's reverse because it treats combining characters like "\x{0301}" as separate characters, so I wind up getting "\x{0301}emus\x{0301}er" ( ́emuśer). How can I reverse the string, but still respect the combining characters?
You can use the \X special escape (match a non-combining character and all of the following combining characters) with split to make a list of graphemes (with empty strings between them), reverse the list of graphemes, then join them back together:
#!/usr/bin/perl
use strict;
use warnings;
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";
The best answer is to use Unicode::GCString, as Sinan points out.
I modified Chas's example a bit:
Set the encoding on STDOUT to avoid "wide character in print" warnings;
Use a positive lookahead assertion (and no separator retention mode) in split (doesn't work after 5.10, apparently, so I removed it)
It's basically the same thing with a couple of tweaks.
use strict;
use warnings;
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print <<HERE;
original: [$original]
wrong: [$wrong]
right: [$right]
HERE
You can use Unicode::GCString:
Unicode::GCString treats Unicode string as a sequence of extended grapheme clusters defined by Unicode Standard Annex #29 [UAX #29].
#!/usr/bin/env perl
use utf8;
use strict;
use warnings;
use feature 'say';
use open qw(:std :utf8);
use Unicode::GCString;
my $x = "re\x{0301}sume\x{0301}";
my $y = Unicode::GCString->new($x);
my $wrong = reverse $x;
my $correct = join '', reverse @{ $y->as_arrayref };
say "$x -> $wrong";
say "$y -> $correct";
Output:
résumé -> ́emuśer
résumé -> émusér
Perl6::Str->reverse also works.
In the case of the string résumé, you can also use the Unicode::Normalize core module to change the string to a fully composed form (NFC or NFKC) before reversing; however, this is not a general solution, because some combinations of base character and modifier have no precomposed Unicode codepoint.
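A sketch of that approach for this particular string:
#!/usr/bin/perl
# Compose with NFC first, then reverse. This works here because é has a
# precomposed codepoint; as noted above, not every base+mark combination does.
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

binmode STDOUT, ':utf8';
my $original = "re\x{0301}sume\x{0301}";
my $reversed = reverse NFC($original);
print "$reversed\n";    # émusér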
Some of the other answers contain elements that don't work well. Here is a working example tested on Perl 5.12 and 5.14. Failing to specify the binmode will cause the output to generate error messages. Using a positive lookahead assertion (and no separator retention mode) in split will cause the output to be incorrect on my Macbook.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'unicode_strings';
binmode STDOUT, ":utf8";
my $original = "re\x{0301}sume\x{0301}";
my $wrong = reverse $original;
my $right = join '', reverse split /(\X)/, $original;
print "original: $original\n",
"wrong: $wrong\n",
"right: $right\n";