Cutting apart string in Perl


I have a string in Perl that is 23 digits long. I need to cut it apart into different pieces: the first 2 digits in one variable, the next 3 in another, the next 4 in another, and so on. Basically, the 23 digits need to end up in 6 separate variables of (2,3,4,4,3,7) characters, in that order.
Any ideas how I can cut the string up like this?

There are lots of ways to do it, but the shortest is probably unpack:
my $string = '1' x 23;
my @values = unpack 'A2A3A4A4A3A7', $string;
If you need separate variables, you can use a list assignment:
my ($v1, $v2, $v3, $v4, $v5, $v6) = unpack 'A2A3A4A4A3A7', $string;

Expanding on Alex's method, rather than specify each start and end, use the list you gave of lengths.
#!/usr/bin/env perl
use strict;
use warnings;
my $string = "abcdefghijklmnopqrstuvw";
my $pos = 0;
my @split = map {
my $start = $pos;
my $end = $_;
$pos += $end;
substr( $string, $start, $end);
} (2,3,4,4,3,7);
print "$_\n" for @split;
That said, you should probably look at unpack, which is designed for fixed-width fields. I have no experience with it, though.

You could use a regex, viz:
$string =~ /\d{2}\d{3}\d{4}\d{4}\d{3}\d{7}/
and capture each part by surrounding it with parentheses ().
You can then find each capture in the variables $1, $2, ...
or get them all in the list returned by the match.
See perldoc perlre.
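A minimal sketch of the capturing version, assuming the string consists only of digits. The sample value and the variable names $v1 to $v6 are illustrative, and the ^ and $ anchors are an addition so that strings of the wrong length fail to match:

```perl
use strict;
use warnings;

# Sample 23-digit string (illustrative value).
my $string = '12345678901234567890123';

# Each parenthesised group captures one fixed-width field.
# In list context the match returns the captures ($1 .. $6).
my ($v1, $v2, $v3, $v4, $v5, $v6) =
    $string =~ /^(\d{2})(\d{3})(\d{4})(\d{4})(\d{3})(\d{7})$/
    or die "String is not 23 digits long\n";

print join('|', $v1, $v2, $v3, $v4, $v5, $v6), "\n";
```

The list assignment returns the number of captures, so the `or die` branch only runs when the match fails.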

You want to use perldoc substr.
$substring = substr($string, $start, $length);
I'd also use map on a list of [start, length] pairs to make your life easier:
$string = "123456789";
@values = map {substr($string, $_->[0], $_->[1])} ([1, 3], [4, 2] , ...);

Here's a sub that will do it, using the already discussed unpack.
sub string_slices {
my $str = shift;
return unpack( join( 'A', '', @_ ), $str );
}
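A usage sketch for the sub above, repeated here so the example runs on its own. The sample string is an assumption; the widths are the ones from the question:

```perl
use strict;
use warnings;

# join('A', '', 2, 3, 4) yields 'A2A3A4': the empty first element
# puts an 'A' in front of every width.
sub string_slices {
    my $str = shift;
    return unpack( join( 'A', '', @_ ), $str );
}

my @parts = string_slices( '12345678901234567890123', 2, 3, 4, 4, 3, 7 );
print "$_\n" for @parts;
```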

Related

How to substring a string with several position with Perl?

I have several places where I want to cut my string in several parts.
For example:
$string= "AACCAAGTAA";
@cut_places= {0,4, 8 };
My $string should look like this: AACC AAGT AA;
How can I do that?

To populate an array, use round parentheses, not curly brackets (they're used for hash references).
One possible way is to use substr where the first argument is the position, so you can use the array elements. You just need to compute the length by subtracting the position from the following one; and to be able to compute the last length, you need the length of the whole string, too:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = 'AACCAAGTAA';
my @cut_places = (0, 4, 8);
push @cut_places, length $string;
my @parts = map {
substr $string, $cut_places[$_], $cut_places[$_+1] - $cut_places[$_]
} 0 .. $#cut_places - 1;
say for @parts;
If the original array contained lengths instead of positions, the code would be much easier.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = 'AACCAAGTAA';
my @lengths = (4, 4, 2); # 4, 4, 4 would work, too
my @parts = unpack join("", map "A$_", @lengths), $string;
say for @parts;
See unpack for details.

Here's a solution that starts by calculating the forward differences in the list of positions. The length of the string is first appended to the end of the list if it doesn't already span the full string.
The differences are then used to build an unpack format string, which is used to build the required sequence of substrings.
I have written the functionality as a do block, which would be simple to convert to a subroutine if desired.
use strict;
use warnings 'all';
use feature 'say';
my $string = 'AACCAAGTAA';
my @cut_places = ( 0, 4, 8 );
my @parts = do {
my @places = @cut_places;
my $len = length $string;
push @places, $len unless $places[-1] >= $len;
my @w = map { $places[$_]-$places[$_-1] } 1 .. $#places;
my $patt = join ' ', map { "A$_" } @w;
unpack $patt, $string;
};
say "@parts";
output
AACC AAGT AA

Work out the lengths of the needed parts first; then all methods are easier. Here a regex is used:
use warnings;
use strict;
use feature 'say';
my $string = 'AACCAAGTAA';
my @pos = (0, 4, 8);
my @lens = do {
my $prev = shift @pos;
"$prev", map { my $e = $_ - $prev; $prev = $_; $e } @pos;
};
my $patt = join '', map { '(.{'.$_.'})' } @lens;
my $re = qr/$patt/;
my @parts = grep { /./ } $string =~ /$re(.*)/g;
say for @parts;
The lengths @lens are computed by subtracting successive positions (the second minus the first, the third minus the second, and so on). I use do merely so that the $prev variable, unneeded elsewhere, doesn't "pollute" the rest of the code.
The "$prev" is quoted so that it is evaluated first, before it changes in map.
The matches returned by regex are passed through grep to filter out empty string(s) due to the 0 position (or whenever successive positions are the same).
This works for position arrays of any lengths, as long as positions are consistent with a string.

Perl printf to use commas as thousands-separator

Using awk, I can print a number with commas as thousands separators.
(with an export LC_ALL=en_US.UTF-8 beforehand).
awk 'BEGIN{printf("%\047d\n", 24500)}'
24,500
I expected the same format to work with Perl, but it does not:
perl -e 'printf("%\047d\n", 24500)'
%'d
The Perl Cookbook offers this solution:
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}
However I am assuming that since the printf option works in awk, it should also work in Perl.

The apostrophe format modifier is a non-standard POSIX extension.
The documentation for Perl's printf has this to say about such extensions
Perl does its own "sprintf" formatting: it emulates the C
function sprintf(3), but doesn't use it except for
floating-point numbers, and even then only standard modifiers
are allowed. Non-standard extensions in your local sprintf(3)
are therefore unavailable from Perl.
The Number::Format module will do this for you, and it takes its default settings from the locale, so is as portable as it can be
use strict;
use warnings 'all';
use v5.10.1;
use Number::Format 'format_number';
say format_number(24500);
output
24,500

A more perl-ish solution:
$a = 12345678; # no comment
$b = reverse $a; # $b = '87654321';
@c = unpack("(A3)*", $b); # @c = ('876', '543', '21');
$d = join ',', @c; # $d = '876,543,21';
$e = reverse $d; # $e = '12,345,678';
print $e;
outputs 12,345,678.

I realize this question was from almost 4 years ago, but since it comes up in searches, I'll add an elegant native Perl solution I came up with. I was originally searching for a way to do it with sprintf, but everything I've found indicates that it can't be done. Then since everyone is rolling their own, I thought I'd give it a go, and this is my solution.
$num = 12345678912345; # however many digits you want
while($num =~ s/(\d+)(\d\d\d)/$1\,$2/){};
print $num;
Results in:
12,345,678,912,345
Explanation:
The Regex does a maximal digit search for all leading digits. The minimum number of digits in a row it'll act on is 4 (1 plus 3). Then it adds a comma between the two. Next loop if there are still 4 digits at the end (before the comma), it'll add another comma and so on until the pattern doesn't match.
If you need something safe for use with more than 3 digits after the decimal, use this modification: (Note: This won't work if your number has no decimal)
while($num =~ s/(\d+)(\d\d\d)([.,])/$1\,$2$3/){};
This ensures that it only matches digit groups that end in a comma (added on a previous loop) or a decimal point.
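To make the explanation of the first loop concrete, here is a small trace of its passes, assuming an integer held as a string. The @steps array exists only for illustration:

```perl
use strict;
use warnings;

my $num = '12345678912345';
my @steps;

# Each pass inserts one comma. The greedy (\d+) backtracks just far
# enough to leave three digits for (\d\d\d), so the commas appear
# from right to left.
while ( $num =~ s/(\d+)(\d\d\d)/$1,$2/ ) {
    push @steps, $num;
}

print "$_\n" for @steps;
```

The loop stops once no run of four or more consecutive digits remains.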

Most of these answers assume that the format is universal. It isn't. CLDR uses Unicode information to figure it out. There's a long thread in How to properly localize numbers?.
CPAN has the CLDR::Number module:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my $locale = $ARGV[0] // 'en';
my @numbers = qw(
123
12345
1234.56
-90120
);
my $cldr = CLDR::Number->new( locale => $locale );
my $decf = $cldr->decimal_formatter;
foreach my $n ( @numbers ) {
say $decf->format($n);
}
Here are a few runs:
$ perl comma.pl
123
12,345
1,234.56
-90,120
$ perl comma.pl es
123
12.345
1234,56
-90.120
$ perl comma.pl bn
১২৩
১২,৩৪৫
১,২৩৪.৫৬
-৯০,১২০
It seems heavyweight, but the output is correct and you don't have to allow the user to change the locale you want to use. However, when it's time to change the locale, you are ready to go. I also prefer this to Number::Format because I can use a locale that's different from my local settings for my terminal or session, or even use multiple locales:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my @locales = qw( en pt bn );
my @numbers = qw(
123
12345
1234.56
-90120
);
my @formatters = map {
my $cldr = CLDR::Number->new( locale => $_ );
my $decf = $cldr->decimal_formatter;
[ $_, $cldr, $decf ];
} @locales;
printf "%10s %10s %10s\n" . '=' x 32 . "\n", @locales;
foreach my $n ( @numbers ) {
printf "%10s %10s %10s\n",
map { $_->[-1]->format($n) } @formatters;
}
The output has three locales at once:
        en         pt         bn
================================
       123        123        ১২৩
    12,345     12.345     ১২,৩৪৫
  1,234.56   1.234,56   ১,২৩৪.৫৬
   -90,120    -90.120    -৯০,১২০

Here's an elegant Perl solution I've been using for over 20 years :)
1 while $text =~ s/(.*\d)(\d\d\d)/$1\.$2/g;
And if you then want two decimal places:
$text = sprintf("%0.2f", $text);

One-liner: use a little loop with a regex:
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
Example:
use strict;
use warnings;
my @numbers = (12321, 12.12, 122222.3334, '1234abc', '1.1', '1222333444555,666.77');
for(@numbers) {
my $number = $_;
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
print "$_ -> $number\n";
}
Output:
12321 -> 12,321
12.12 -> 12.12
122222.3334 -> 122,222.3334
1234abc -> 1,234abc
1.1 -> 1.1
1222333444555,666.77 -> 1,222,333,444,555,666.77
Pattern:
(\d+)(\d{3})
-> Take all digits but the last 3 in group 1
-> Take the remaining 3 digits in group 2, anchored at the beginning of $number
-> Whatever follows is ignored
Substitution
$1,$2
-> Put a separator sign (,) between group 1 and 2
-> The rest remains unchanged
So if you have 12345.67, the digits the regex uses are 12345. The '.' and everything after it are ignored.
1. run (12345.67):
-> matches: 12345
-> group 1: 12,
group 2: 345
-> substitute 12,345
-> result: 12,345.67
2. run (12,345.67):
-> does not match!
-> while breaks.

Starting from @Laura's answer, I tweaked the pure-Perl, regex-only solution to work for numbers with decimals too:
while ($formatted_number =~ s/^(-?\d+)(\d{3}(?:,\d{3})*(?:\.\d+)*)$/$1,$2/) {};
Of course this assumes a "," as thousands separator and a "." as decimal separator, but it should be trivial to use variables to account for that for your given locale(s).
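As a rough sketch of that idea, the separators can be pulled out into variables and quoted with \Q...\E so that a "." is not treated as a regex metacharacter. The sub name commify_with and its parameters are hypothetical, and the optional fraction is written with ? rather than * here:

```perl
use strict;
use warnings;

# Hypothetical helper: $sep is the thousands separator to insert,
# $dec is the decimal separator already present in the input.
sub commify_with {
    my ( $number, $sep, $dec ) = @_;
    1 while $number =~
        s/^(-?\d+)(\d{3}(?:\Q$sep\E\d{3})*(?:\Q$dec\E\d+)?)$/$1$sep$2/;
    return $number;
}

my $us = commify_with( '-1234567.891', ',', '.' );
my $de = commify_with( '1234567,5',    '.', ',' );
print "$us\n$de\n";
```

The input must already use the decimal separator passed in $dec; converting between decimal separators is outside the scope of this sketch.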

I used the following, but it does not work as of perl v5.26.1:
sub format_int
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}
The form that worked for me was:
sub format_int
{
my $num = shift;
return scalar reverse(join(",",unpack("(A3)*", reverse int($num))));
}
But to handle negative numbers the code must be:
sub format_int
{
my $val = shift;
if ( $val >= 0 ) {
return scalar reverse join ",", unpack( "(A3)*", reverse int($val) );
} else {
return "-" . scalar reverse join ",", unpack( "(A3)*", reverse int(-$val) );
}
}

Did somebody say Perl?
perl -pe '1while s/(\d+)(\d{3})/$1,$2/'
This works for any integer.

# turning above answer into a function
sub format_float
# returns number with commas..... and 2 digit decimal
# so format_float(12345.667) returns "12,345.67"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num)))) . sprintf(".%02d",int(100*(.005+($num - int($num)))));
}
sub format_int
# returns number with commas.....
# so format_int(12345.667) returns "12,345"
{
my $num = shift;
return scalar reverse(join(",",unpack("(A3)*", reverse int($num))));
}

I wanted to print numbers in a currency format. If the amount turned out even, I still wanted a .00 at the end. I used the previous example (ty) and diddled with it a bit more to get this.
sub format_number {
my $num = shift;
my $result;
my $formatted_num = "";
my @temp_array = ();
my $mantissa = "";
if ( $num =~ /\./ ) {
$num = sprintf("%0.02f",$num);
($num,$mantissa) = split(/\./,$num);
$formatted_num = reverse $num;
@temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', @temp_array);
$result = $formatted_num . '.'. $mantissa;
} else {
$formatted_num = reverse $num;
@temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', @temp_array);
$result = $formatted_num . '.00';
}
return $result;
}
# Example call
# ...
printf("some amount = %s\n",format_number $some_amount);
I didn't have the Number library on my default mac OS X perl, and I didn't want to mess with that version or go off installing my own perl on this machine. I guess I would have used the formatter module otherwise.
I still don't actually like the solution all that much, but it does work.

This is good for money, just keep adding lines if you handle hundreds of millions.
sub commify{
my $var = $_[0];
#print "COMMIFY got $var\n"; #DEBUG
$var =~ s/(^\d{1,3})(\d{3})(\.\d\d)$/$1,$2$3/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3$4/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4$5/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4,$5$6/;
#print "COMMIFY made $var\n"; #DEBUG
return $var;
}

A solution that produces a localized output:
# First part - Localization
my ( $thousands_sep, $decimal_point, $negative_sign );
BEGIN {
my ( $l );
use POSIX qw(locale_h);
$l = localeconv();
$thousands_sep = $l->{ 'thousands_sep' };
$decimal_point = $l->{ 'decimal_point' };
$negative_sign = $l->{ 'negative_sign' };
}
# Second part - Number transformation
sub readable_number {
my $val = shift;
#my $thousands_sep = ".";
#my $decimal_point = ",";
#my $negative_sign = "-";
sub _readable_int {
my $val = shift;
# a pinch of PERL magic
return scalar reverse join $thousands_sep, unpack( "(A3)*", reverse $val );
}
my ( $i, $d, $r );
$i = int( $val );
if ( $val >= 0 ) {
$r = _readable_int( $i );
} else {
$r = $negative_sign . _readable_int( -$i );
}
# If there is decimal part append it to the integer result
if ( $val != $i ) {
( undef, $d ) = ( $val =~ /(\d*)\.(\d*)/ );
$r = $r . $decimal_point . $d;
}
return $r;
}
The first part gets the symbols used in the current locale to be used on the second part.
The BEGIN block is used to calculate the symbols only once at the beginning.
If for some reason there is a need to avoid the POSIX locale, one can omit the first part and uncomment the variables in the second part to hardcode the symbols to be used ($thousands_sep, $decimal_point and $negative_sign).

Perl int to array

I have a number stored in a Perl variable and I want to 'pass/convert/store' its digits in the different positions of an array. An example for a better sight:
I have, let's say, this number stored:
$hello = 429384
And I need a new array with the digits stored in it, so:
$hello2[0] = 4
$hello2[1] = 2
$hello2[2] = 9
Etc..
I can probably make it with a couple of loops, but I want to know if there is an efficient and fast way to do it. Thx in advance!

my @hello = split //, $hello;
In Perl, if you use a number with a string operator, the conversion is done automatically.
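A short sketch of that automatic conversion. As an added caveat, a sign or decimal point also ends up as an element, since split works on the number's string form:

```perl
use strict;
use warnings;

my $hello = 429384;

# split treats $hello as the string "429384";
# the empty pattern // splits between every character.
my @digits = split //, $hello;
print "$digits[0] $digits[-1]\n";

# Non-digit characters survive too:
my @chars = split //, -12.5;
print "@chars\n";
```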

$hello = 429384;
@hello = split //, $hello;
print $hello[0];

Using only Regex and without using any inbuilt function:
#!/usr/bin/perl
use strict;
use warnings;
my $string=429384;
my @numbers = $string =~ /./g; # dot matches a single character at a time
#and returns it
print "@numbers \n";

This is significantly faster than the regexp way:
$string = '1234567890';
$_-=48 for @digits = unpack 'C*',$string;
benchmark:
use Time::HiRes;
$string = '1234567890';
$start_time = [Time::HiRes::gettimeofday()];
for (1.. 100000){
$_-=48 for @digits= unpack 'C*',$string;
}
$diff = Time::HiRes::tv_interval($start_time);
print "\n\n$diff\n";
$start_time = [Time::HiRes::gettimeofday()];
for (1.. 100000){
@digits = split //, $string;
}
$diff = Time::HiRes::tv_interval($start_time);
print "\n\n$diff\n";
output:
0.265814
0.314735

How to get consecutive pairs of words in Perl

With this sentence:
my $sent = "Mapping and quantifying mammalian transcriptomes RNA-Seq";
We want to get all possible consecutive pairs of words.
my $var = ['Mapping and',
'and quantifying',
'quantifying mammalian',
'mammalian transcriptomes',
'transcriptomes RNA-Seq'];
Is there a compact way to do it?

Yes.
my $sent = "Mapping and quantifying mammalian transcriptomes RNA-Seq";
my @pairs = $sent =~ /(?=(\S+\s+\S+))\S+/g;

A variation that (perhaps unwisely) relies on operator evaluation order but doesn't rely on fancy regexes or indices:
my @words = split /\s+/, $sent;
my $last = shift @words;
my @var;
push @var, $last . ' ' . ($last = $_) for @words;

This works:
my @sent = split(/\s+/, $sent);
my @var = map { $sent[$_] . ' ' . $sent[$_ + 1] } 0 .. $#sent - 1;
i.e. just split the original string into an array of words, and then use map to iteratively produce the desired pairs.

I don't have it as a single line, but the following code should give you somewhere to start. Basically, it does it with a push and a regex with /g.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Indent = 1;
my $t1 = 'aa bb cc dd ee ff';
my $t2 = 'aa bb cc dd ee';
foreach my $txt ( $t1, $t2 )
{
my @a;
push( @a, $& ) while( $txt =~ /\G\S+(\s+\S+|)\s*/g );
print Dumper( \@a );
}
One-liner, thanks to the syntax from @ysth:
my @a = $txt =~ /\G(\S+(?:\s+\S+|))\s*/g;
My regex is slightly different in that if you have an odd number of words, the last word still gets an entry.

Split on comma, but only when not in parenthesis

I am trying to do a split on a string with comma delimiter
my $string='ab,12,20100401,xyz(A,B)';
my @array=split(',',$string);
If I do a split as above the array will have values
ab
12
20100401
xyz(A,
B)
I need values as below.
ab
12
20100401
xyz(A,B)
(should not split xyz(A,B) into 2 values)
How do I do that?

use Text::Balanced qw(extract_bracketed);
my $string = "ab,12,20100401,xyz(A,B(a,d))";
my @params = ();
while ($string) {
if ($string =~ /^([^(]*?),/) {
push @params, $1;
$string =~ s/^\Q$1\E\s*,?\s*//;
} else {
my ($ext, $pre);
($ext, $string, $pre) = extract_bracketed($string,'()','[^()]+');
push @params, "$pre$ext";
$string =~ s/^\s*,\s*//;
}
}
This one supports:
nested parentheses;
empty fields;
strings of any length.

Here is one way that should work.
use Regexp::Common;
my $string = 'ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(?:$RE{balanced}{-parens=>'()'}|[^,])+/g);
Regexp::Common can be installed from CPAN.
There is a bug in this code, coming from the depths of Regexp::Common. Be warned that this will (unfortunately) fail to match the lack of space between ,,.

Well, old question, but I just happened to wrestle with this all night, and the question was never marked answered, so in case anyone arrives here by Google as I did, here's what I finally got. It's a very short answer using only built-in PERL regex features:
my $string='ab,12,20100401,xyz(A,B)';
$string =~ s/((\((?>[^)(]*(?2)?)*\))|[^,()]*)(*SKIP),/$1\n/g;
my @array=split('\n',$string);
Commas that are not inside parentheses are changed to newlines and then the array is split on them. This will ignore commas inside any level of nested parentheses, as long as they're properly balanced with a matching number of open and close parens.
This assumes you won't have newline \n characters in the initial value of $string. If you need to, either temporarily replace them with something else before the substitution line and then use a loop to replace back after the split, or just pick a different delimiter to split the array on.

Limit the number of elements it can be split into:
split(',', $string, 4)

Here's another way:
my $string='ab,12,20100401,xyz(A,B)';
my @array = ($string =~ /(
[^,]*\([^)]*\) # comma inside parens is part of the word
|
[^,]*) # split on comma outside parens
(?:,|$)/gx);
Produces:
ab
12
20100401
xyz(A,B)

Here is my attempt. It should handle depth well and could even be extended to include other bracketed symbols easily (though harder to be sure that they MATCH). This method will not in general work for quotation marks rather than brackets.
#!/usr/bin/perl
use strict;
use warnings;
my $string='ab,12,20100401,xyz(A(2,3),B)';
print "$_\n" for parse($string);
sub parse {
my ($string) = @_;
my @fields;
my @comma_separated = split(/,/, $string);
my @to_be_joined;
my $depth = 0;
foreach my $field (@comma_separated) {
my @brackets = $field =~ /(\(|\))/g;
foreach (@brackets) {
$depth++ if /\(/;
$depth-- if /\)/;
}
if ($depth == 0) {
push @fields, join(",", @to_be_joined, $field);
@to_be_joined = ();
} else {
push @to_be_joined, $field;
}
}
return @fields;
}
