Convert UTF8 string into numeric values in Perl - perl

For example,
my $str = '中國c'; # Chinese language of china
I want to print out the numeric values
20013,22283,99

unpack will be more efficient than split and ord, because it doesn't have to make a bunch of temporary 1-character strings:
use utf8;
my $str = '中國c'; # Chinese language of china
my #codepoints = unpack 'U*', $str;
print join(',', #codepoints) . "\n"; # prints 20013,22283,99
A quick benchmark shows it's about 3 times faster than split+ord:
use utf8;
use Benchmark 'cmpthese';
my $str = '中國中國中國中國中國中國中國中國中國中國中國中國中國中國c';
cmpthese(0, {
'unpack' => sub { my #codepoints = unpack 'U*', $str; },
'split-map' => sub { my #codepoints = map { ord } split //, $str },
'split-for' => sub { my #cp; for my $c (split(//, $str)) { push #cp, ord($c) } },
'split-for2' => sub { my $cp; for my $c (split(//, $str)) { $cp = ord($c) } },
});
Results:
Rate split-map split-for split-for2 unpack
split-map 85423/s -- -7% -32% -67%
split-for 91950/s 8% -- -27% -64%
split-for2 125550/s 47% 37% -- -51%
unpack 256941/s 201% 179% 105% --
The difference is less pronounced with a shorter string, but unpack is still more than twice as fast. (split-for2 is a bit faster than the other splits because it doesn't build a list of codepoints.)

See perldoc -f ord:
foreach my $c (split(//, $str))
{
print ord($c), "\n";
}
Or compressed into a single line: my #chars = map { ord } split //, $str;
Data::Dumpered, this produces:
$VAR1 = [
20013,
22283,
99
];

To have utf8 in your source code recognized as such, you must use utf8; beforehand:
$ perl
use utf8;
my $str = '中國c'; # Chinese language of china
foreach my $c (split(//, $str))
{
print ord($c), "\n";
}
__END__
20013
22283
99
or more tersely,
print join ',', map ord, split //, $str;

http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html
#!/usr/bin/env perl
use utf8; # so literals and identifiers can be in UTF-8
use v5.12; # or later to get "unicode_strings" feature
use strict; # quote strings, declare variables
use warnings; # on by default
use warnings qw(FATAL utf8); # fatalize encoding glitches
use open qw(:std :utf8); # undeclared streams in UTF-8
# use charnames qw(:full :short); # unneeded in v5.16
# http://perldoc.perl.org/functions/sprintf.html
# vector flag
# This flag tells Perl to interpret the supplied string as a vector of integers, one for each character in the string.
my $str = '中國c';
printf "%*vd\n", ",", $str;

Related

How to find the number of vowels in a string using Perl

sub Solution{
my $n=$_[0];
my $m=lc $_[1];
my #chars=split("",$m);
my $result=0;
my #vowels=("a","e","i","o","u");
#OUTPUT [uncomment & modify if required]
for(my $i=0;$i<$n;$i=$i+1){
for(my $j=0;$j<5;$j=$j+1){
if($chars[$i]==$vowels[$j]){
$result=$result+1;
last;
}
}
}
print $result;
}
#INPUT [uncomment & modify if required]
my $n=<STDIN>;chomp($n);
my $m=<STDIN>;chomp($m);
Solution($n,$m);
So I wrote this solution to find the number of vowels in a string. $n is the length of the string and $m is the string.
However, for the input 3 nam I always get the input as 3.
Can someone help me debug it?
== compares numbers. eq compares strings. So instead of $chars[$i]==$vowels[$j] you should write $chars[$i] eq $vowels[$j]. If you had used use warnings;, which is recommended, you'd have gotten a warning about that.
And by the way, there's no need to work with extra variables for the length. You can get the length of a string with length() and of an array for example with scalar(). Also, the last index of an array #a can be accessed with $#a. Or you can use foreach to iterate over all elements of an array.
A better solution is using a tr operator which, in scalar context, returns the number of replacements:
perl -le 'for ( #ARGV ) { $_ = lc $_; $n = tr/aeiouy//; print "$_: $n"; }' Use Perl to count how many vowels are in each string
use: 2
perl: 1
to: 1
count: 2
how: 1
many: 2
vowels: 2
are: 2
in: 1
each: 2
string: 1
I included also y, which is sometimes a vowel, see: https://simple.wikipedia.org/wiki/Vowel
Let me suggest a better approach to count letters in a text
#!/usr/bin/env perl
#
# vim: ai:ts=4:sw=4
#
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $debug = 0; # debug flag
my %count;
my #vowels = qw/a e i o u/;
map{
chomp;
my #chars = split '';
map{ $count{$_}++ } #chars;
} <DATA>;
say Dumper(\%count) if $debug;
foreach my $vowel (#vowels) {
say "$vowel: $count{$vowel}";
}
__DATA__
So I wrote this solution to find the number of vowels in a string. $n is the length of the string and $m is the string. However, for the input 3 nam I always get the input as 3.
Can someone help me debug it?
Output
a: 7
e: 18
i: 12
o: 12
u: 5
Your code is slightly modified form
#!/usr/bin/env perl
#
# vim: ai:ts=4:sw=4
#
use strict;
use warnings;
use feature 'say';
my $input = get_input('Please enter sentence:');
say "Counted vowels: " . solution($input);
sub get_input {
my $prompt = shift;
my $input;
say $prompt;
$input = <STDIN>;
chomp($input);
return $input;
}
sub solution{
my $str = lc shift;
my #chars=split('',$str);
my $count=0;
my #vowels=qw/a e i o u/;
map{
my $c=$_;
map{ $count++ if $c eq $_} #vowels;
} #chars;
return $count;
}

Is there any way to make printf/sprintf handle combining characters correctly?

Combining characters appear to count as whole characters in printf and sprintf's calculations:
[ é]
[ é]
The text above was created by the following code:
#!/usr/bin/perl
use strict;
use warnings;
binmode STDOUT, ":utf8";
for my $s ("\x{e9}", "e\x{301}") {
printf "[%5s]\n", $s;
}
I expected the code to print:
[ é]
[ é]
I don't see any discussion of Unicode, let alone combining characters, in the function descriptions. Are printf and sprintf useless in the face of Unicode? Is this just a bug in Perl 5.20.1 that could be fixed? Is there a replacement someone has written?
It looks like the answer is to use Unicode::GCString
#!/usr/bin/perl
use strict;
use warnings;
use Unicode::GCString;
binmode STDOUT, ":utf8";
for my $s ("\x{e9}", "e\x{301}", "e\x{301}\x{302}") {
printf "[%s]\n", pad($s, 5);
}
sub pad {
my ($s, $length) = #_;
my $gcs = Unicode::GCString->new($s);
return((" " x ($length - $gcs->columns)) . $s);
}
You should probably be aware of the Perl Unicode Cookbook. In particular ℞ #34, which deals with this very issue. As a bonus, Perl v5.20.2 has it available as perldoc unicook.
In any case: The code included in that article is as follows:
use Unicode::GCString;
use Unicode::Normalize;
my #words = qw/crème brûlée/;
#words = map { NFC($_), NFD($_) } #words;
for my $str (#words) {
my $gcs = Unicode::GCString->new($str);
my $cols = $gcs->columns;
my $pad = " " x (10 - $cols);
say str, $pad, " |";
}

Bug with parsing by Text::CSV_XS?

Tried to use Text::CSV_XS to parse some logs. However, the following code doesn't do what I expected -- split the line into pieces according to separator " ".
The funny thing is, if I remove the double quote in the string $a, then it will do splitting.
Wonder if it's a bug or I missed something. Thanks!
use Text::CSV_XS;
$a = 'id=firewall time="2010-05-09 16:07:21 UTC"';
$userDefinedSeparator = Text::CSV_XS->new({sep_char => " "});
print "$userDefinedSeparator\n";
$userDefinedSeparator->parse($a);
my $e;
foreach $e ($userDefinedSeparator->fields) {
print $e, "\n";
}
EDIT:
In the above code snippet, it I change the = (after time) to be a space, then it works fine. Started to wonder whether this is a bug after all?
$a = 'id=firewall time "2010-05-09 16:07:21 UTC"';
You have confused the module by leaving both the quote character and the escape character set to double quote ", and then left them embedded in the fields you want to split.
Disable both quote_char and escape_char, like this
use strict;
use warnings;
use Text::CSV_XS;
my $string = 'id=firewall time="2010-05-09 16:07:21 UTC"';
my $space_sep = Text::CSV_XS->new({
sep_char => ' ',
quote_char => undef,
escape_char => undef,
});
$space_sep->parse($string);
for my $field ($space_sep->fields) {
print "$field\n";
}
output
id=firewall
time="2010-05-09
16:07:21
UTC"
But note that you have achieved exactly the same things as print "$_\n" for split ' ', $string, which is to be preferred as it is both more efficient and more concise.
In addition, you must always use strict and use warnings; and never use $a or $b as variable names, both because they are used by sort and because they are meaningless and undescriptive.
Update
As #ThisSuitIsBlackNot points out, your intention is probably not to split on spaces but to extract a series of key=value pairs. If so then this method puts the values straight into a hash.
use strict;
use warnings;
my $string = 'id=firewall time="2010-05-09 16:07:21 UTC"';
my %data = $string =~ / ([^=\s]+) \s* = \s* ( "[^"]*" | [^"\s]+ ) /xg;
use Data::Dump;
dd \%data;
output
{ id => "firewall", time => "\"2010-05-09 16:07:21 UTC\"" }
Update
This program will extract the two name=value strings and print them on separate lines.
use strict;
use warnings;
my $string = 'id=firewall time="2010-05-09 16:07:21 UTC"';
my #fields = $string =~ / (?: "[^"]*" | \S )+ /xg;
print "$_\n" for #fields;
output
id=firewall
time="2010-05-09 16:07:21 UTC"
If you are not actually trying to parse csv data, you can get the time field by using Text::ParseWords, which is a core module in Perl 5. The benefit to using this module is that it handles quotes very well.
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
my $str = 'id=firewall time="2010-05-09 16:07:21 UTC"';
my #fields = quotewords(' ', 0, $str);
print Dumper \#fields;
my %hash = map split(/=/, $_, 2), #fields;
print Dumper \%hash;
Output:
$VAR1 = [
'id=firewall',
'time=2010-05-09 16:07:21 UTC'
];
$VAR1 = {
'time' => '2010-05-09 16:07:21 UTC',
'id' => 'firewall'
};
I also included how you can make the data more accessible by adding it to a hash. Note that hashes cannot contain duplicate keys, so you need a new hash for each new time key.

Pairs as hash keys

Does anyone know how to make a hash with pairs of strings serving as keys in perl?
Something like...
{
($key1, $key2) => $value1;
($key1, $key3) => $value2;
($key2, $key3) => $value3;
etc....
You can't have a pair of scalars as a hash key, but you can make a multilevel hash:
my %hash;
$hash{$key1}{$key2} = $value1;
$hash{$key1}{$key3} = $value2;
$hash{$key2}{$key3} = $value3;
If you want to define it all at once:
my %hash = ( $key1 => { $key2 => $value1, $key3 => $value2 },
$key2 => { $key3 => $value3 } );
Alternatively, if it works for your situation, you could just concatenate your keys together
$hash{$key1 . $key2} = $value1; # etc
Or add a delimiter to separate the keys:
$hash{"$key1:$key2"} = $value1; # etc
You could use an invisible separator to join the coordinates:
Primarily for mathematics, the Invisible Separator (U+2063) provides a separator between characters where punctuation or space may be omitted such as in a two-dimensional index like i⁣j.
#!/usr/bin/env perl
use utf8;
use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open qw(:std :utf8);
use charnames qw(:full :short);
use YAML;
my %sparse_matrix = (
mk_key(34,56) => -1,
mk_key(1200,11) => 1,
);
print Dump \%sparse_matrix;
sub mk_key { join("\N{INVISIBLE SEPARATOR}", #_) }
sub mk_vec { map [split "\N{INVISIBLE SEPARATOR}"], #_ }
~/tmp> perl mm.pl |xxd
0000000: 2d2d 2d0a 3132 3030 e281 a331 313a 2031 ---.1200...11: 1
0000010: 0a33 34e2 81a3 3536 3a20 2d31 0a .34...56: -1.
Usage: Multiple keys of a single value in a hash can be used for implementing a 2D matrix or N-dimensional matrix!
#!/usr/bin/perl -w
use warnings;
use strict;
use Data::Dumper;
my %hash = ();
my ($a, $b, $c) = (2,3,4);
$hash{"$a, $b ,$c"} = 1;
$hash{"$b, $c ,$a"} = 1;
foreach(keys(%hash) )
{
my #a = split(/,/, $_);
print Dumper(#a);
}
I do this:
{ "$key1\x1F$key2" => $value, ... }
Usually with a helper method:
sub getKey() {
return join( "\x1F", #_ );
}
{ getKey( $key1, $key2 ) => $value, ... }
----- EDIT -----
Updated the code above to use the ASCII Unit Separator per the recommendation from #chepner above
Use $; implicitly (or explicitly) in your hash keys, used for multidimensional emulation, like so:
my %hash;
$hash{$key1, $key2} = $value; # or %hash = ( $key1.$;.$key2 => $value );
print $hash{$key1, $key2} # returns $value
You can even set $; to \x1F if needed (the default is \034, from SUBSEP in awk):
local $; = "\x1F";

extract every nth number

i want to extract every 3rd number ( 42.034 , 41.630 , 40.158 as so on ) from the file
see example-
42.034 13.749 28.463 41.630 12.627 28.412 40.158 12.173 30.831 26.823
12.596 32.191 26.366 13.332 32.938 25.289 12.810 32.419 23.949 13.329
Any suggestions using perl script ?
Thanks,
dac
You can split file's contents to separate numbers and use the modulo operator to extract every 3rd number:
my $contents = do { local $/; open my $fh, "file" or die $!; <$fh> };
my #numbers = split /\s+/, $contents;
for (0..$#numbers) {
$_ % 3 == 0 and print "$numbers[$_]\n";
}
use strict;
use warnings;
use 5.010; ## for say
use List::MoreUtils qw/natatime/;
my #vals = qw/42.034 13.749 28.463 41.630 12.627 28.412 40.158 12.173 30.831
26.823 12.596 32.191 26.366 13.332 32.938 25.289 12.810 32.419 23.949 13.329/;
my $it = natatime 3, #vals;
say while (($_) = $it->());
This is probably the shortest way to specify that. If #list is your list of numbers
#list[ grep { $_ % 3 == 0 } 0..$#list ]
It's a one-liner!
$ perl -lane 'print for grep {++$i % 3 == 1} #F' /path/to/your/input
-n gives you line-by-line processing, -a autosplitting for field processing, and $i (effectively initialized to zero for our purposes) keeps count of the number of fields processed...
This method avoids reading the entire file into memory at once:
use strict;
my #queue;
while (<>) {
push #queue, / ( \d+ (?: \. \d* ) ? ) /gx;
while (#queue >= 3) {
my $third = (splice #queue, 0, 3)[2];
print $third, "\n"; # Or do whatever with it.
}
}
If the file has 10 numbers in every line you can use this:
perl -pe 's/([\d.]+) [\d.]+ [\d.]+/$1/g;' file
It's not a clean solution but it should "do the job".
Looks like this post lacked a solution that didn't read the whole file and used grep.
#!/usr/bin/perl -w
use strict;
my $re = qr/-?\d+(?:\.\d*)/; # Insert a more precise regexp here
my $n = 3;
my $count = 0;
while (<>) {
my #res = grep { not $count++ % $n } m/($re)/go;
print "#res\n";
};
I believe you’ll find that this work per spec, behaves politely, and never reads in more than it needs to.
#!/usr/bin/env perl
use 5.010_001;
use strict;
use autodie;
use warnings qw[ FATAL all ];
use open qw[ :std IO :utf8 ];
END { close STDOUT }
use Regexp::Common;
my $real_num_rx = $RE{num}{real};
my $left_edge_rx = qr{
(?: (?<= \A ) # or use \b
| (?<= \p{White_Space} ) # or use \D
)
}x;
my $right_edge_rx = qr{
(?= \z # or use \b
| \p{White_Space} # or use \D
)
}x;
my $a_number_rx = $left_edge_rx
. $real_num_rx
. $right_edge_rx
;
if (-t STDIN && #ARGV == 0) {
warn "$0: reading numbers from stdin,"
. " type ^D to end, ^C to kill\n";
}
$/ = " ";
my $count = 0;
while (<>) {
while (/($a_number_rx)/g) {
say $1 if $count++ % 3 == 0;
}
}