I am attempting to create a hashmap from a text file. The way the text file is set up is as follows.
(integer)<-- varying white space --> (string value)
. . .
. . .
. . .
(integer)<-- varying white space --> (string value)
eg:
5 this is a test
23 this is another test
123 this is the final test
What I want to do is assign the key to the integer, and then the entire string following to the value. I was trying something along the lines of
%myHashMap;
while(my $info = <$fh>){
chomp($info);
my ($int, $string) = split/ /,$info;
$myHashMap{$int} = $string;
}
This doesn't work though because I have spaces in the string. Is there a way to clear the initial white space, grab the integer, assign it to $int, then clear white space till you get to the string, then take the remainder of the text on that line and place it in my $string value?
You could replace
split / /, $info # Fields are separated by a space.
with
split / +/, $info # Fields are separated by spaces.
or the more general
split /\s+/, $info # Fields are separated by whitespace.
but you'd still face with the problem of the leading spaces. To ignore those, use
split ' ', $info
This special case splits on whitespace, ignoring leading whitespace.
Don't forget to tell Perl that you expect at most two fields!
$ perl -E'say "[$_]" for split(" ", " 1 abc def ghi", 2)'
[1]
[abc def ghi]
The other option would be to use the following:
$info =~ /^\s*(\S+)\s+(\S.*)/
You just need to split each line of text on whitespace into two fields
This example program assumes that the input file is passed as a parameter on the command line. I have used Data::Dump only to show the resulting hash structure
use strict;
use warnings 'all';
my %data;
while ( <DATA> ) {
s/\s*\z//;
my ($key, $val) = split ' ', $_, 2;
next unless defined $val; # Ensure that there were two fields
$data{$key} = $val;
}
use Data::Dump;
dd \%data;
output
{
5 => "this is a test",
23 => "this is another test",
123 => "this is the final test",
}
First you clear initial white space use this
$info =~ s/^\s+//g;
second you have more than 2 spaces in between integer and string so use split like this to give 2 space with plus
split/ +/,$info;
The code is
use strict;
use warnings;
my %myHashMap;
while(my $info = <$fh>){
chomp($info);
$info =~ s/^\s+//g;
my ($int, $string) = split/ +/,$info;
$myHashMap{$int} = $string;
}
Using awk, I can print a number with commas as thousands separators.
(with a export LC_ALL=en_US.UTF-8 beforehand).
awk 'BEGIN{printf("%\047d\n", 24500)}'
24,500
I expected the same format to work with Perl, but it does not:
perl -e 'printf("%\047d\n", 24500)'
%'d
The Perl Cookbook offers this solution:
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}
However I am assuming that since the printf option works in awk, it should also work in Perl.
The apostrophe format modifier is a non-standard POSIX extension.
The documentation for Perl's printf has this to say about such extensions
Perl does its own "sprintf" formatting: it emulates the C
function sprintf(3), but doesn't use it except for
floating-point numbers, and even then only standard modifiers
are allowed. Non-standard extensions in your local sprintf(3)
are therefore unavailable from Perl.
The Number::Format module will do this for you, and it takes its default settings from the locale, so is as portable as it can be
use strict;
use warnings 'all';
use v5.10.1;
use Number::Format 'format_number';
say format_number(24500);
output
24,500
A more perl-ish solution:
$a = 12345678; # no comment
$b = reverse $a; # $b = '87654321';
#c = unpack("(A3)*", $b); # $c = ('876', '543', '21');
$d = join ',', #c; # $d = '876,543,21';
$e = reverse $d; # $e = '12,345,678';
print $e;
outputs 12,345,678.
I realize this question was from almost 4 years ago, but since it comes up in searches, I'll add an elegant native Perl solution I came up with. I was originally searching for a way to do it with sprintf, but everything I've found indicates that it can't be done. Then since everyone is rolling their own, I thought I'd give it a go, and this is my solution.
$num = 12345678912345; # however many digits you want
while($num =~ s/(\d+)(\d\d\d)/$1\,$2/){};
print $num;
Results in:
12,345,678,912,345
Explanation:
The Regex does a maximal digit search for all leading digits. The minimum number of digits in a row it'll act on is 4 (1 plus 3). Then it adds a comma between the two. Next loop if there are still 4 digits at the end (before the comma), it'll add another comma and so on until the pattern doesn't match.
If you need something safe for use with more than 3 digits after the decimal, use this modification: (Note: This won't work if your number has no decimal)
while($num =~ s/(\d+)(\d\d\d)([.,])/$1\,$2$3/){};
This will ensure that it will only look for digits that ends in a comma (added on a previous loop) or a decimal.
Most of these answers assume that the format is universal. It isn't. CLDR uses Unicode information to figure it out. There's a long thread in How to properly localize numbers?.
CPAN has the CLDR::Number module:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my $locale = $ARGV[0] // 'en';
my #numbers = qw(
123
12345
1234.56
-90120
);
my $cldr = CLDR::Number->new( locale => $locale );
my $decf = $cldr->decimal_formatter;
foreach my $n ( #numbers ) {
say $decf->format($n);
}
Here are a few runs:
$ perl comma.pl
123
12,345
1,234.56
-90,120
$ perl comma.pl es
123
12.345
1234,56
-90.120
$ perl comma.pl bn
১২৩
১২,৩৪৫
১,২৩৪.৫৬
-৯০,১২০
It seems heavyweight, but the output is correct and you don't have to allow the user to change the locale you want to use. However, when it's time to change the locale, you are ready to go. I also prefer this to Number::Format because I can use a locale that's different from my local settings for my terminal or session, or even use multiple locales:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my #locales = qw( en pt bn );
my #numbers = qw(
123
12345
1234.56
-90120
);
my #formatters = map {
my $cldr = CLDR::Number->new( locale => $_ );
my $decf = $cldr->decimal_formatter;
[ $_, $cldr, $decf ];
} #locales;
printf "%10s %10s %10s\n" . '=' x 32 . "\n", #locales;
foreach my $n ( #numbers ) {
printf "%10s %10s %10s\n",
map { $_->[-1]->format($n) } #formatters;
}
The output has three locales at once:
en pt bn
================================
123 123 ১২৩
12,345 12.345 ১২,৩৪৫
1,234.56 1.234,56 ১,২৩৪.৫৬
-90,120 -90.120 -৯০,১২০
Here's an elegant Perl solution I've been using for over 20 years :)
1 while $text =~ s/(.*\d)(\d\d\d)/$1\.$2/g;
And if you then want two decimal places:
$text = sprintf("%0.2f", $text);
1 liner: Use a little loop whith a regex:
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
Example:
use strict;
use warnings;
my #numbers = (12321, 12.12, 122222.3334, '1234abc', '1.1', '1222333444555,666.77');
for(#numbers) {
my $number = $_;
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
print "$_ -> $number\n";
}
Output:
12321 -> 12,321
12.12 -> 12.12
122222.3334 -> 122,222.3334
1234abc -> 1,234abc
1.1 -> 1.1
1222333444555,666.77 -> 1,222,333,444,555,666.77
Pattern:
(\d+)(\d{3})
-> Take all numbers but the last 3 in group 1
-> Take the remaining 3 numbers in group2 on the beginning of $number
-> Followed is ignored
Substitution
$1,$2
-> Put a seperator sign (,) between group 1 and 2
-> The rest remains unchanged
So if you have 12345.67 the numers the regex uses are 12345. The '.' and all followed is ignored.
1. run (12345.67):
-> matches: 12345
-> group 1: 12,
group 2: 345
-> substitute 12,345
-> result: 12,345.67
2. run (12,345.67):
-> does not match!
-> while breaks.
Parting from #Laura's answer, I tweaked the pure perl, regex-only solution to work for numbers with decimals too:
while ($formatted_number =~ s/^(-?\d+)(\d{3}(?:,\d{3})*(?:\.\d+)*)$/$1,$2/) {};
Of course this assumes a "," as thousands separator and a "." as decimal separator, but it should be trivial to use variables to account for that for your given locale(s).
I used the following but it does not works as of perl v5.26.1
sub format_int
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}
The form that worked for me was:
sub format_int
{
my $num = shift;
return scalar reverse(join(",",unpack("(A3)*", reverse int($num))));
}
But to use negative numbers the code must be:
sub format_int
{
if ( $val >= 0 ) {
return scalar reverse join ",", unpack( "(A3)*", reverse int($val) );
} else {
return "-" . scalar reverse join ",", unpack( "(A3)*", reverse int(-$val) );
}
}
Did somebody say Perl?
perl -pe '1while s/(\d+)(\d{3})/$1,$2/'
This works for any integer.
# turning above answer into a function
sub format_float
# returns number with commas..... and 2 digit decimal
# so format_float(12345.667) returns "12,345.67"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num)))) . sprintf(".%02d",int(100*(.005+($num - int($num)))));
}
sub format_int
# returns number with commas.....
# so format_int(12345.667) returns "12,345"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}
I wanted to print numbers it in a currency format. If it turned out even, I still wanted a .00 at the end. I used the previous example (ty) and diddled with it a bit more to get this.
sub format_number {
my $num = shift;
my $result;
my $formatted_num = "";
my #temp_array = ();
my $mantissa = "";
if ( $num =~ /\./ ) {
$num = sprintf("%0.02f",$num);
($num,$mantissa) = split(/\./,$num);
$formatted_num = reverse $num;
#temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', #temp_array);
$result = $formatted_num . '.'. $mantissa;
} else {
$formatted_num = reverse $num;
#temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', #temp_array);
$result = $formatted_num . '.00';
}
return $result;
}
# Example call
# ...
printf("some amount = %s\n",format_number $some_amount);
I didn't have the Number library on my default mac OS X perl, and I didn't want to mess with that version or go off installing my own perl on this machine. I guess I would have used the formatter module otherwise.
I still don't actually like the solution all that much, but it does work.
This is good for money, just keep adding lines if you handle hundreds of millions.
sub commify{
my $var = $_[0];
#print "COMMIFY got $var\n"; #DEBUG
$var =~ s/(^\d{1,3})(\d{3})(\.\d\d)$/$1,$2$3/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3$4/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4$5/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4,$5$6/;
#print "COMMIFY made $var\n"; #DEBUG
return $var;
}
A solution that produces a localized output:
# First part - Localization
my ( $thousands_sep, $decimal_point, $negative_sign );
BEGIN {
my ( $l );
use POSIX qw(locale_h);
$l = localeconv();
$thousands_sep = $l->{ 'thousands_sep' };
$decimal_point = $l->{ 'decimal_point' };
$negative_sign = $l->{ 'negative_sign' };
}
# Second part - Number transformation
sub readable_number {
my $val = shift;
#my $thousands_sep = ".";
#my $decimal_point = ",";
#my $negative_sign = "-";
sub _readable_int {
my $val = shift;
# a pinch of PERL magic
return scalar reverse join $thousands_sep, unpack( "(A3)*", reverse $val );
}
my ( $i, $d, $r );
$i = int( $val );
if ( $val >= 0 ) {
$r = _readable_int( $i );
} else {
$r = $negative_sign . _readable_int( -$i );
}
# If there is decimal part append it to the integer result
if ( $val != $i ) {
( undef, $d ) = ( $val =~ /(\d*)\.(\d*)/ );
$r = $r . $decimal_point . $d;
}
return $r;
}
The first part gets the symbols used in the current locale to be used on the second part.
The BEGIN block is used to calculate the sysmbols only once at the beginning.
If for some reason there is need to not use POSIX locale, one can ommit the first part and uncomment the variables on the second part to hardcode the sysmbols to be used ($thousands_sep, $thousands_sep and $thousands_sep)
I'm having some trouble with parsing a file.
Two lines in the file contain the word ' Mapped', and I would like to extract the number that is in those two lines.
And this is my code:
my %cellHash = ();
my $mapped = 0;
my $alnPairs = 0;
my #mappedReads = ();
while (<ALIGN_SUMMARY>) {
chomp($_);
if (/Mapped/) {
print "\n$_\n";
$mapped = (split / /, $_)[2];
push(#mappedReads, $mapped);
}
if (/Aligned pairs/) {
print "\n$_\n";
$alnPairs = (split / /, $_)[4];
}
}
{ $cellHash{$cellDir} } = (
'MappedR1' => $mappedReads[0] ,
'MappedR2' => $mappedReads[1] ,
'AlnPairs' => $alnPairs ,
);
foreach my $cellName ( keys %cellHash){
print OUTPUT $cellName,
"\t", ${ $cellHash{$cellName} }{"LibSize"},
"\t", ${ $cellHash{$cellName} }{"MappedR1"},
"\t", ${ $cellHash{$cellName} }{"MappedR2"},
"\t", ${ $cellHash{$cellName} }{"AlnPairs"},
"\n";
}
But the OUTPUT file only has the 'AlignedPairs' column and never anything in MappedR1 or MappedR2.
What am I doing wrong? Thanks!
When I look at the file, it looks like there is more than a single space. Here is an example of what I mean and what I did to extract the number.
my $test = "blah : 123455";
my #test_ary = split(/ /, $test);
print scalar #test_ary . "\n"; # Prints the size of the array
$number = $1 if $test =~ m/([0-9]+)/;
print "$number\n"; # Prints the extracted number
Output of run:
Size of array: 8
The extracted number: 123455
Hope this helps.
First off, paste in your actual input and output if you want anyone to actually test somethnig for you, not an image.
Second, you're not splitting on whitespace, you're splitting on a single literal space. Use the special case of
split ' ', $_;
to split on arbitrary length whitespace, discarding leading and trailing whitespace.
I need to handle URI (i.e. percent) encoding and decoding in my Perl script. How do I do that?
This is a question from the official perlfaq. We're importing the perlfaq to Stack Overflow.
This is the official FAQ answer minus subsequent edits.
Those % encodings handle reserved characters in URIs, as described in RFC 2396, Section 2. This encoding replaces the reserved character with the hexadecimal representation of the character's number from the US-ASCII table. For instance, a colon, :, becomes %3A.
In CGI scripts, you don't have to worry about decoding URIs if you are using CGI.pm. You shouldn't have to process the URI yourself, either on the way in or the way out.
If you have to encode a string yourself, remember that you should never try to encode an already-composed URI. You need to escape the components separately then put them together. To encode a string, you can use the URI::Escape module. The uri_escape function returns the escaped string:
my $original = "Colon : Hash # Percent %";
my $escaped = uri_escape( $original );
print "$escaped\n"; # 'Colon%20%3A%20Hash%20%23%20Percent%20%25'
To decode the string, use the uri_unescape function:
my $unescaped = uri_unescape( $escaped );
print $unescaped; # back to original
If you wanted to do it yourself, you simply need to replace the reserved characters with their encodings. A global substitution is one way to do it:
# encode
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%0x", ord $1 /eg;
#decode
$string =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
DIY encode (improving above version):
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%02x", ord $1 /eg;
(note the '%02x' rather than only '%0x')
DIY decode (adding '+' -> ' '):
$string =~ s/\+/ /g; $string =~ s/%([A-Fa-f\d]{2})/chr hex $1/eg;
Coders helping coders - bartering knowledge!
Maybe this will help deciding which method to choose.
Benchmarks on perl 5.32. Every function returns same result for given $input.
Code:
#!/usr/bin/env perl
my $input = "ala ma 0,5 litra 40%'owej vodki :)";
use Net::Curl::Easy;
my $easy = Net::Curl::Easy->new();
use URI::Encode qw( uri_encode );
use URI::Escape qw( uri_escape );
use Benchmark(cmpthese);
cmpthese(-3, {
'a' => sub {
my $string = $input;
$string =~ s/([^^A-Za-z0-9\-_.!~*'()])/ sprintf "%%%0x", ord $1 /eg;
},
'b' => sub {
my $string = $input;
$string = $easy->escape( $string );
},
'c' => sub {
my $string = $input;
$string = uri_encode( $string, {encode_reserved => 1} );
},
'd' => sub {
my $string = $input;
$string = uri_escape( $string );
},
});
And results:
Rate c d a b
c 5618/s -- -98% -99% -100%
d 270517/s 4716% -- -31% -80%
a 393480/s 6905% 45% -- -71%
b 1354747/s 24016% 401% 244% --
Not surprising. A specialized C solution is the fast. An in-place regex with no sub calls is quite fast, followed closely by a copying regex with a sub call. I didn't look into why uri_encode was so much worse than uri_escape.
use URI and it will make URLs that just work.
I was working on parsing an Excel file that has Japanese in some of the cells. By using Spreadsheet::ParseExcel (Ver. 0.15) (which I know is older than current version), some of the cells with the characters below:
<設定B-1コース>
are appearing as:
print Dumper $oWkc->{_Value};
$VAR1 = "\x{ff1c}\x{8a2d}\x{5b9a}B-\x{ff11}\x{30b3}\x{30fc}\x{30b9}\x{ff1e}";
and
print $oWkc->{Val} . "\n";
[-0
$VAR1 = "\x{ff1c}\x{8a2d}\x{5b9a}B-\x{ff13}\x{30b3}\x{30fc}\x{30b9}\x{ff1e}";
[-0
If I want to get these values printed in the actual foramat, I am setting the STDOUT File handle to ":utf8", and my terminal to display utf-8 encoding (otherwise I am getting some "wide character" warning). Here I have to pick cells with B-1 or B-2 , but I am not sure what should be set inside my script so that these characters can be treated as what I am able to see them on STDOUT.
Currently I am using a regular expression to convert these wide characters to their corresponding ASCII value. As an example if I want to match B-1 which is stored as 'B-\x{ff11}', I will be
$oWkc->{_Value} =~ /([AB]-)(\x{ff11}|\x{ff12}|\x{ff13}/
my $lookup = $1.$2;
$lookup =~ s/\x{ff11}/1/;
$lookup =~ s/\x{ff12}/2/;
$lookup =~ s/\x{ff13}/3/;
For reference, B-1, A-2 etc these values are coming from some other source, and currently are ranging from A|B-[1-3].
What is the standard way to deal with these wide characters? I am not able to use encode/decode etc. Can any one give me some direction?
Currently though I am able to get the work done using regex...
While I did not verify it (I am not going to install a module from March 2001), the module apparently already decodes to Perl native strings, so you do not have to do much. The straightforward way works just fine, no need to overcomplicate things by those substitutions.
use utf8;
my $val = '<設定B-1コース>';
# does it match A or B, followed by a dash, followed by a fullwidth 1,2 or 3?
$val =~ /(?:A|B)-[123]/; # returns true/1
To deal with multi-byte characters in Spreadsheet::ParseExcel you should update to the latest version and use the FmtJapan formatter. Several bug fixes around Japanese formatting went into recent versions.
Here is an example:
#!/usr/bin/perl
use warnings;
use strict;
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::FmtJapan;
my $filename = 'Test2000J.xls';
my $parser = Spreadsheet::ParseExcel->new();
my $formatter = Spreadsheet::ParseExcel::FmtJapan->new();
my $workbook = $parser->parse($filename, $formatter);
if ( !defined $workbook ) {
die "Parsing error: ", $parser->error(), ".\n";
}
# Set your output encoding.
binmode STDOUT, ':encoding(cp932)';
# Or maybe this:
#binmode STDOUT, ':utf8';
for my $worksheet ( $workbook->worksheets() ) {
print "Worksheet name: ", $worksheet->get_name(), "\n\n";
my ( $row_min, $row_max ) = $worksheet->row_range();
my ( $col_min, $col_max ) = $worksheet->col_range();
for my $row ( $row_min .. $row_max ) {
for my $col ( $col_min .. $col_max ) {
my $cell = $worksheet->get_cell( $row, $col );
next unless $cell;
print " Row, Col = ($row, $col)\n";
print " Value = ", $cell->value(), "\n";
print " Unformatted = ", $cell->unformatted(), "\n";
print "\n";
}
}
}