Text::SpellChecker module and Unicode - perl

#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
my $text = "coördinator";
my $checker = Text::SpellChecker->new( text => $text );
while ( my $word = $checker->next_word ) {
print "Bad word is $word\n";
}
Output: Bad word is rdinator
Desired: Bad word is coördinator
The module breaks if $text contains Unicode. Any idea how this can be solved?
I have Aspell 0.50.5 installed, which this module uses. I think it might be the culprit.
Edit: As Text::SpellChecker requires either Text::Aspell or Text::Hunspell, I removed Text::Aspell and installed Hunspell, Text::Hunspell, then:
$ hunspell -d en_US -l < badword.txt
coördinator
Shows correct result. This means there's something wrong either with my code or Text::SpellChecker.
Taking Miller's suggestion into consideration, I tried the following:
#!/usr/local/bin/perl
use strict;
use warnings;
use Text::SpellChecker;
use utf8;
binmode STDOUT, ":encoding(utf8)";
my $text = "coördinator";
my $flag = utf8::is_utf8($text);
print "Flag is $flag\n";
print "Text is $text\n";
my $checker = Text::SpellChecker->new(text => $text);
while (my $word = $checker->next_word) {
print "Bad word is $word\n";
}
OUTPUT:
Flag is 1
Text is coördinator
Bad word is rdinator
Does this mean the module is not able to handle utf8 characters properly?

This is a bug in Text::SpellChecker: the current version assumes ASCII-only words.
http://cpansearch.perl.org/src/BDUGGAN/Text-SpellChecker-0.11/lib/Text/SpellChecker.pm
#
# next_word
#
# Get the next misspelled word.
# Returns false if there are no more.
#
sub next_word {
...
while ($self->{text} =~ m/([a-zA-Z]+(?:'[a-zA-Z]+)?)/g) {
IMHO the best fix would be to use a per-language/locale word-splitting regular expression, or to leave word splitting to the underlying library. aspell list reports coördinator as a single word.
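As a rough illustration of the first option, the ASCII classes in that pattern could be swapped for the \p{L} letter property. This is only a sketch, not necessarily what the module does or should do: the helper name split_words is made up for the example, and it assumes the text is already a decoded character string.

```perl
use strict;
use warnings;
use utf8;    # the source below contains a literal ö

# Hypothetical helper: same shape as the module's pattern, but with
# \p{L} (any Unicode letter) in place of [a-zA-Z].
sub split_words {
    my ($text) = @_;
    return $text =~ m/(\p{L}+(?:'\p{L}+)?)/g;
}

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for split_words("The coördinator isn't here");
```

A single Unicode-wide pattern is still not language-aware, so it sits somewhere between the current behaviour and a true per-locale splitter.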

I've incorporated Chankey's solution and released version 0.12 to CPAN; give it a try.
The validity of diaeresis in words like coördinator is interesting. The default aspell and hunspell dictionaries seem to mark it as incorrect, though some publications may disagree.
best,
Brian

Related

searching words in Greek in Unix and Perl

I have text files in Greek, and I want to search them for specific words using Perl and bash. The words are like καί, τό, εἰς.
I was searching for English words and now want to replace them with Greek ones, but mostly all I get is ???. For Perl:
my %word = map { $_ => 1 } qw/name date birth/;
and for bash
for X in name date birth
do
can someone please help me?
#!/usr/bin/perl
use strict;
use warnings;
# Tell Perl your code is encoded using UTF-8.
use utf8;
# Tell Perl input and output is encoded using UTF-8.
use open ':std', ':encoding(UTF-8)';
my @words = qw( καί τό εἰς );
my %words = map { $_ => 1 } @words;
my $pat = join '|', map quotemeta, keys %words;
while (<>) {
if (/$pat/) {
print;
}
}
Usage:
script.pl file.in >file.out
Notes:
Make sure the source code is encoded using UTF-8 and that you use use utf8;.
Make sure you use the use open line and specify the appropriate encoding for your data file. (If it's not UTF-8, change it.)
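To see why the decoding step matters, here is a small sketch using the core Encode module. The helper name matches_any and the sample line are made up for the example; the point is only that a pattern built from decoded Greek characters matches decoded text but not the raw UTF-8 bytes.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Hypothetical helper: true if the line contains any of the words.
sub matches_any {
    my ($line, @words) = @_;
    my $pat = join '|', map quotemeta, @words;
    return $line =~ /$pat/ ? 1 : 0;
}

my @words = qw( καί τό εἰς );
my $bytes = encode('UTF-8', "λόγος καί ἔργον");   # stand-in for raw file data
my $chars = decode('UTF-8', $bytes);              # what an :encoding layer would hand you

print "raw bytes: ", matches_any($bytes, @words), "\n";
print "decoded:   ", matches_any($chars, @words), "\n";
```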

How can I get case-insensitive completion with Term::ReadLine::Gnu?

I can't seem to get case-insensitive completion when using Term::ReadLine::Gnu. Take this example script:
use strict;
use warnings;
use 5.010;
use Term::ReadLine;
my $term = Term::ReadLine->new('test');
say "Using " . $term->ReadLine;
if (my $attr = $term->Attribs) {
$term->ornaments(0);
$attr->{basic_word_break_characters} = ". \t\n";
$attr->{completer_word_break_characters} = " \t\n";
$attr->{completion_function} = \&complete_word;
} # end if attributes
my @words = qw(apple approve Adam America UPPER UPPERCASE UNUSED);
sub complete_word
{
my ($text, $line, $start) = @_;
return grep(/^$text/i, @words);
} # end complete_word
while (1) {
$_ = $term->readline(']');
last unless /\S/; # quit on empty input
} # end while 1
Note that complete_word does case-insensitive matching. If I run this with Term::ReadLine::Perl (by doing PERL_RL=Perl perl script.pl) it works as I expect. Typing a<TAB><TAB> lists all 4 words. Typing u<TAB><TAB> converts u to U and lists 3 words.
When I use Term::ReadLine::Gnu instead (PERL_RL=Gnu perl script.pl or just perl script.pl), it only does case-sensitive completion. Typing a<TAB> gives app. Typing u<TAB><TAB> doesn't list any completions.
I even have set completion-ignore-case on in my /etc/inputrc, but it still doesn't work here. (It works fine in bash, though.)
Is there any way to get Term::ReadLine::Gnu to do case-insensitive completion?
It would appear the problem is in the Term::ReadLine::Gnu::XS::_trp_completion_function() (a wrapper for the user-defined completion function).
Your matches are retrieved correctly from your complete_word() function, but then the following snippet from the wrapper does its own case-sensitive match:
for (; $_i <= $#_matches; $_i++) {
return $_matches[$_i] if ($_matches[$_i] =~ /^\Q$text/);
}
where @_matches is the result of your complete_word() and $text is the completed text so far.
So it looks like the answer is no, there is no supported way to get Term::ReadLine::Gnu to do case-insensitive completion. You would have to override the private Term::ReadLine::Gnu::XS::_trp_completion_function (an ugly hack to be sure) -- or modify XS.pm directly (arguably an even uglier hack).
EDIT: Term::ReadLine::Gnu version used: 1.20
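The effect of that second check can be reproduced without readline at all. The sketch below is only an approximation: wrapper_filter is a made-up name that applies the same case-sensitive regex as the quoted snippet, but with grep rather than the wrapper's index loop.

```perl
use strict;
use warnings;

my @words = qw(apple approve Adam America UPPER UPPERCASE UNUSED);

# Step 1: the user's case-insensitive completion function.
sub complete_word {
    my ($text) = @_;
    return grep { /^\Q$text/i } @words;
}

# Step 2: an approximation of the wrapper's case-sensitive re-check.
sub wrapper_filter {
    my ($text, @matches) = @_;
    return grep { /^\Q$text/ } @matches;
}

for my $text (qw(a u)) {
    my @candidates = complete_word($text);
    my @survivors  = wrapper_filter($text, @candidates);
    print "$text: offered [@candidates], kept [@survivors]\n";
}
```

For a, only apple and approve survive, which is why the completion stops at app; for u, nothing survives.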

how can i fetch the whole word on the basis of index no of that string in perl

I have one string of line like
comments:[I#1278327] is related to office communicator.i fixed the bug to declare it null at first time.
Here I am searching for the index of I#, and then I want the whole word, i.e. [I#1278327]. I'm doing it like this:
open(READ1,"<letter.txt");
while(<READ1>)
{
if(index($_,"I#")!=-1)
{
$indexof=index($_,"I#");
print $indexof,"\n";
$string=substr($_,$indexof);##i m cutting that string first from index of I# to end then...
$string=substr($string,0,index($string," "));
$lengthof=length($string);
print $lengthof,"\n";
print $string,"\n";
print $_,"\n";
}
}
Is there any API in Perl to find the word length directly after finding the index of I# in that line?
You could do something like:
$indexof=index($_,"I#");
$index2 = index($_,' ',$indexof);
$lengthof = $index2 - $indexof;
However, the bigger issue is you are using Perl as if it were BASIC. A more perlish approach to the task of printing selected lines:
use strict;
use warnings;
open my $read, '<', 'letter.txt' or die "Cannot open letter.txt: $!"; # safer version of open
LINE:
while (<$read>) {
print "$1 - $_" if (/(I#.*?) /);
}
I would use a regex instead, a regex will allow you to match a pattern ("I#") and also capture other data from the string:
$_ =~ m/I#(\d+)/;
The line above will match and set $1 to the number.
See perldoc perlre
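Putting the regex approach together for the sample line in the question, a minimal sketch might look like this. The helper name find_token is made up, and the pattern assumes the token is always I# followed by digits inside square brackets.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bracketed token and its length,
# or an empty list if the line has no such token.
sub find_token {
    my ($line) = @_;
    return unless $line =~ /(\[I#\d+\])/;
    return ($1, length $1);
}

my $line = 'comments:[I#1278327] is related to office communicator.';
if (my ($token, $len) = find_token($line)) {
    print "$token ($len characters)\n";
}
```

Because the capture already holds the whole word, there is no need to compute its end from a second index call.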

Read from a file and compare the content with a variable

#!/usr/bin/perl
some code........
..................
system ("rpm -q iptables > /tmp/checkIptables");
my $iptables = open FH, "/tmp/checkIptables";
The above code checks whether iptables is installed on your Linux machine. If it is installed, the command rpm -q iptables will give the output shown below:
iptables-1.4.7-3.el6.x86_64
Now I have redirected this output to the file named as checkIptables.
Now I want to check whether the variable $iptables matches with the output given above or not. I do not care about version numbers.
It should be something like
if ($iptables eq iptables*){
...............
.......................}
But iptables* gives error.
You could use a regex to check the string:
$iptables =~ /^iptables/
Also, you do not need a tmp file, you can instead open a pipe:
use strict;
use warnings;
use autodie;
open my $fh, '-|', "rpm -q iptables";
my $line = <$fh>;
if ($line =~ /^iptables/) {
print "iptables is installed";
}
This will read the first line of the output, and check it against the regex.
Or you can use backticks:
my $lines = `rpm -q iptables`;
if ($lines =~ /^iptables/) {
print "iptables is installed";
}
Note that backticks may return more than one line of data, so you may need to compensate for that.
I think what you're looking for is a regular expression or a "pattern match". You want the string to match a pattern, not a particular thing.
if ( $iptables =~ /^iptables\b/ ) {
...
}
=~ is the binding operator and tells the supplied regular expression that its source is that variable. The regular expression simply says look at the beginning of the string for the sequence "iptables" followed by a "word-break". Since '-' is a "non-word" character (not alphanumeric or '_') it breaks the word. You could use '-' as well:
/^iptables-/
But you can probably do the whole thing with this statement:
use strict;
use warnings;
use List::MoreUtils qw<any>;
...
if ( any { m/^iptables-/ } `rpm -q iptables` ) {
...
}
This pipes the output directly into a list via backticks and searches through that list via any (see List::MoreUtils::any).
Why not just look at the return value of "rpm -q", which will be 0 if the package is installed and 1 if it is not?
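A minimal sketch of the exit-status approach, using the list form of system so that no shell is involved. The helper name command_succeeds is made up for the example; note that rpm's own output still goes to the terminal.

```perl
use strict;
use warnings;

# Hypothetical helper: runs a command (list form of system) and
# reports whether it exited with status 0.
sub command_succeeds {
    my @cmd = @_;
    return system(@cmd) == 0;
}

if (command_succeeds('rpm', '-q', 'iptables')) {
    print "iptables is installed\n";
}
else {
    print "iptables is not installed (or rpm could not be run)\n";
}
```

system returns -1 when the command cannot be started at all, so a missing rpm binary lands in the else branch as well.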

What's an easy way to print a multi-line string without variable substitution in Perl?

I have a Perl program that reads in a bunch of data, munges it, and then outputs several different file formats. I'd like to make Perl be one of those formats (in the form of a .pm package) and allow people to use the munged data within their own Perl scripts.
Printing out the data is easy using Data::Dump::pp.
I'd also like to print some helper functions to the resulting package.
What's an easy way to print a multi-line string without variable substitution?
I'd like to be able to do:
print <<EOL;
sub xyz {
my $var = shift;
}
EOL
But then I'd have to escape all of the $'s.
Is there a simple way to do this? Perhaps I can create an actual sub and have some magic pretty-printer print the contents? The printed code doesn't have to match the input or even be legible.
Enclose the name of the delimiter in single quotes and interpolation will not occur.
print <<'EOL';
sub xyz {
my $var = shift;
}
EOL
You could use a templating package like Template::Toolkit or Text::Template.
Or, you could roll your own primitive templating system that looks something like this:
my %vars = qw( foo 1 bar 2 );
print Write_Code(\%vars);
sub Write_Code {
my $vars = shift;
my $code = <<'END';
sub baz {
my $foo = <%foo%>;
my $bar = <%bar%>;
return $foo + $bar;
}
END
while ( my ($key, $value) = each %$vars ) {
$code =~ s/<%$key%>/$value/g;
}
return $code;
}
This looks nice and simple, but there are various traps and tricks waiting for you if you DIY. Did you notice that I failed to use quotemeta on my key names in the substitution?
I recommend that you use a time-tested templating library, like the ones I mentioned above.
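For what it's worth, the quotemeta trap mentioned above is easy to demonstrate. In this sketch (fill_template and the key a.b are made up for the example), \Q...\E keeps the dot in the key from matching any character:

```perl
use strict;
use warnings;

# Hypothetical helper: replaces <%name%> placeholders with values.
sub fill_template {
    my ($template, $vars) = @_;
    for my $key (keys %$vars) {
        # \Q...\E keeps regex metacharacters in the key literal
        $template =~ s/<%\Q$key\E%>/$vars->{$key}/g;
    }
    return $template;
}

my $code = fill_template(<<'END', { foo => 1, 'a.b' => 2 });
my $foo = <%foo%>;
my $bar = <%a.b%>;
my $baz = <%axb%>;
END
print $code;
```

Without the \Q...\E, the key a.b would also replace the <%axb%> placeholder.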
You can actually continue a string literal on the next line, like this:
my $mail = "Hello!
Blah blah.";
Personally, I find that more readable than heredocs (the <<EOL thing mentioned elsewhere).
Double quote " interpolates variables, but you can use '. Note you'll need to escape any ' in your string for this to work.
Perl is actually quite rich in convenient things to make things more readable, e.g. other quote-operations. qq and q correspond to " and ' and you can use whatever delimiter makes sense:
my $greeting = qq/Hello there $name!
Nice to meet you/; # Interpolation
my $url = q|http://perlmonks.org/|; # No need to escape /
(note how the syntax coloring here didn't quite keep up)
Read perldoc perlop (find in page: "Quote and Quote-like Operators") for more information.
Use a data section to store the Perl code:
#!/usr/bin/perl
use strict;
use warnings;
print <DATA>;
#print munged data
__DATA__
package MungedData;
use strict;
use warnings;
sub foo {
print "foo\n";
}
Try writing your code as an actual perl subroutine, then using B::Deparse to get the source code at runtime.
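A minimal sketch of that idea, using the core B::Deparse module. The sub xyz is just a stand-in for a real helper, and the regenerated source will generally not match the original text exactly (for example, pragmas in effect may show up inside the body).

```perl
use strict;
use warnings;
use B::Deparse;

# The helper to export, written as ordinary Perl.
sub xyz {
    my $var = shift;
    return $var * 2;
}

# coderef2text returns the body of the sub as a brace-delimited block.
my $deparse = B::Deparse->new;
my $body    = $deparse->coderef2text(\&xyz);

print "sub xyz $body\n";
```

This fits the question's remark that the printed code doesn't have to match the input or even be legible.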