extract multiple substr and match & replace using perl - perl

I need to extract multiple substrings at fixed positions from a line and, at the same time, replace whitespace at another position.
For example, I have a string '01234567890 '. I want to extract the characters at positions 1, 2, 6, 7, 8 and, at the same time, if positions 12 and 13 are whitespace, replace them with 0101. It is all position based.
What is the best way to achieve this using Perl?
I can use substr and string comparison and then concatenate the results, but the code looks rather chunky.

I would probably split (or: explode) the string into an array of single chars:
my @chars = split //, $string; # // is special with split
Now we can do array slices: extracting multiple arguments at once.
use List::MoreUtils qw(all);
if (all {/\s/} @chars[12, 13]) {
@chars[12, 13] = (0, 1);
my @extracted_chars = @chars[1, 2, 6..8];
# do something with extracted data.
}
We can then turn @chars back into a string like this:
$string = join "", @chars;
If you want to remove certain chars instead of extracting them, you would have to use slices inside a loop, an ugly undertaking.
Complete sub with nice interface to do this kind of thing
sub extract (%) {
my ($at, $ws, $ref) = @{{@_}}{qw(at if_whitespace from)};
$ws //= [];
my @chars = split //, $$ref;
if (all {/\s/} @chars[@$ws]) {
@chars[@$ws] = (0, 1) x int(@$ws / 2 + 1);
$$ref = join "", @chars;
return @chars[@$at];
}
return +();
}
my $string = "0123456789ab \tef";
my @extracted = extract from => \$string, at => [1,2,6..8], if_whitespace => [12, 13];
say "@extracted";
say $string;
Output:
1 2 6 7 8
0123456789ab01ef

This is two separate operations, and should be coded as such. This code seems to do what you need.
use strict;
use warnings;
my $str = 'abcdefghijab  efghij';
my @extracted = map { substr $str, $_, 1 } 1, 2, 6, 7, 8;
print "@extracted\n";
for (substr $str, 12, 2) {
$_ = '01' if $_ eq '  ';
}
print $str, "\n";
output
b c g h i
abcdefghijab01efghij
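As a variation on the answer above, here is a minimal sketch that keeps the two operations separate but uses the four-argument form of substr for the replacement and a regex for the whitespace test, so that a tab counts as whitespace too. The sample string and the choice of '01' as the replacement are assumptions made for illustration.

```perl
use strict;
use warnings;

# Sample input (an assumption): offsets 12 and 13 hold a space and a tab.
my $str = "abcdefghijab \tefghij";

# Operation 1: extract single characters at fixed offsets.
my @extracted = map { substr $str, $_, 1 } 1, 2, 6, 7, 8;

# Operation 2: overwrite offsets 12-13 only if both are whitespace.
# The four-argument form of substr replaces the span in place.
substr( $str, 12, 2, '01' ) if substr( $str, 12, 2 ) =~ /\A\s{2}\z/;

print "@extracted\n";   # b c g h i
print "$str\n";         # abcdefghijab01efghij
```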

Related

How to substring a string with several position with Perl?

I have several places where I want to cut my string in several parts.
For example:
$string= "AACCAAGTAA";
@cut_places= {0,4, 8 };
My $string should look like this: AACC AAGT AA;
How can I do that?
To populate an array, use round parentheses, not curly brackets (they're used for hash references).
One possible way is to use substr where the first argument is the position, so you can use the array elements. You just need to compute the length by subtracting the position from the following one; and to be able to compute the last length, you need the length of the whole string, too:
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = 'AACCAAGTAA';
my @cut_places = (0, 4, 8);
push @cut_places, length $string;
my @parts = map {
substr $string, $cut_places[$_], $cut_places[$_+1] - $cut_places[$_]
} 0 .. $#cut_places - 1;
say for @parts;
If the original array contained lengths instead of positions, the code would be much easier.
#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
my $string = 'AACCAAGTAA';
my @lengths = (4, 4, 2); # 4, 4, 4 would work, too
my @parts = unpack join("", map "A$_", @lengths), $string;
say for @parts;
See unpack for details.
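To make the comment about 4, 4, 4 concrete, here is a small sketch comparing an exact template with one whose last width overshoots the string. The variable names @exact and @loose are illustrative only.

```perl
use strict;
use warnings;

my $string = 'AACCAAGTAA';

# Exact widths: 4 + 4 + 2 = 10 characters.
my @exact = unpack 'A4A4A2', $string;

# Over-long final width: the last field simply takes what is left.
my @loose = unpack 'A4A4A4', $string;

print "@exact\n";   # AACC AAGT AA
print "@loose\n";   # AACC AAGT AA
```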
Here's a solution that starts by calculating the forward differences in the list of positions. The length of the string is first appended to the end of the list if it doesn't already span the full string.
The differences are then used to build an unpack format string, which is used to build the required sequence of substrings.
I have written the functionality as a do block, which would be simple to convert to a subroutine if desired.
use strict;
use warnings 'all';
use feature 'say';
my $string = 'AACCAAGTAA';
my @cut_places = ( 0, 4, 8 );
my @parts = do {
my @places = @cut_places;
my $len = length $string;
push @places, $len unless $places[-1] >= $len;
my @w = map { $places[$_]-$places[$_-1] } 1 .. $#places;
my $patt = join ' ', map { "A$_" } @w;
unpack $patt, $string;
};
say "@parts";
output
AACC AAGT AA
Work out the lengths of the needed parts first; then all methods are easier. Here a regex is used.
use warnings;
use strict;
use feature 'say';
my $string = 'AACCAAGTAA';
my @pos = (0, 4, 8);
my @lens = do {
my $prev = shift @pos;
"$prev", map { my $e = $_ - $prev; $prev = $_; $e } @pos;
};
my $patt = join '', map { '(.{'.$_.'})' } @lens;
my $re = qr/$patt/;
my @parts = grep { /./ } $string =~ /$re(.*)/g;
say for @parts;
The lengths @lens are computed by subtracting successive positions (the second minus the first, the third minus the second, and so on). I use do merely so that the $prev variable, unneeded elsewhere, doesn't "pollute" the rest of the code.
The "$prev" is quoted so that it is evaluated first, before it changes in map.
The matches returned by regex are passed through grep to filter out empty string(s) due to the 0 position (or whenever successive positions are the same).
This works for position arrays of any lengths, as long as positions are consistent with a string.

Splitting and tallying substrings within mixed integer-string data

Input Data (example):
40A3B35A3C
30A5B28A2C2B
Desired output (per-line) is a single number determined by the composition of the code 40A3B35A3C and the following rules:
if A - add the preceding number to the running total
if B - add the preceding number to the running total
if C - subtract the preceding number from the running total
40A 3B 35A 3C would thus produce 40 + 3 + 35 - 3 = 75.
Output from both lines:
75
63
Is there an efficient way to achieve this for a particular column (such as $F[2]) in a tab-delimited .txt file using a one-liner? I have considered splitting the entire code into individual characters, then performing if statement checks to detect A/B/C, but my Perl knowledge is limited and I am unsure how to go about this.
When you use split with a capture, the captured group is returned from split, too.
perl -lane '
@ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * ("C" eq $op ? -1 : 1) while ($n, $op) = splice @ar, 0, 2;
print $s
' < input
Or maybe more declarative:
BEGIN { %one = ( A => 1,
B => 1,
C => -1 ) }
@ar = split /([ABC])/, $F[2];
$s = 0;
$s += $n * $one{$op} while ($n, $op) = splice @ar, 0, 2;
print $s
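To see what split with a capturing group actually returns, here is a standalone sketch of the same idea outside the one-liner. The helper name tally and the %sign hash are hypothetical names introduced for illustration; they are not part of the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answer): total one code string.
sub tally {
    my ($code) = @_;
    my %sign = ( A => 1, B => 1, C => -1 );

    # The capturing group makes split return the letters as well:
    # '40A3B35A3C' becomes (40, 'A', 3, 'B', 35, 'A', 3, 'C').
    my @tokens = split /([ABC])/, $code;

    my $total = 0;
    while ( my ( $n, $op ) = splice @tokens, 0, 2 ) {
        $total += $n * $sign{$op};
    }
    return $total;
}

print join( '|', split /([ABC])/, '40A3B35A3C' ), "\n";   # 40|A|3|B|35|A|3|C
print tally($_), "\n" for '40A3B35A3C', '30A5B28A2C2B';   # 75, then 63
```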
When working through a string like this, it's useful to know that regular expressions can return a list of results.
E.g.
my @matches = $str =~ m/(\d+[A-C])/g; # will catch repeated instances
So you can do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
my $total;
#break the string into digit+letter groups.
for (m/(\d+[A-C])/g) {
#separate out this group into num and code.
my ( $num, $code ) = m/(\d+)([A-C])/;
print "\t",$num, " => ", $code, "\n";
if ( $code eq "C" ) {
$total -= $num;
}
else {
$total += $num;
}
}
print $total, " => ", $_;
}
__DATA__
40A3B35A3C
30A5B28A2C2B
perl -lne 'push @a,/([\d]+)[AB]/g;
push @b,/([\d]+)[C]/g;
$sum+=$_ for(@a);$sum-=$_ for(@b);
print $sum;@a=@b=();undef $sum' Your_file
How it works:
Use the command-line argument as the input.
Set the hash %op to the operation for each letter.
Substitute the letters with operators in the input.
Evaluate the substituted input as an expression.
use strict;
use warnings;
my %op=qw(A + B + C -);
$ARGV[0] =~ s/(\d+)(A|B|C)/$op{$2} $1/g;
print eval($ARGV[0]);

How to get values in different array from main array splitting by keyword in perl? [duplicate]

I have one string FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)
I want to store these values in different arrays whenever A/D is found, using Perl.
Eg.
Array1=1,10,A
Array2=11,20,D
Array3=31,5,BI,A
Array4=36,9,NU,D
Array5=46,9,D
It is not known in advance whether each group will have 3 or 4 values!
Currently I am splitting the array with split
#!/usr/bin/perl
use strict;
use warnings;
@main = "FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)";
my @val = split(/,/,$1);
print "Val Array = @val\n";
But how to proceed further?
# Grab the stuff inside the parens.
my $input = "FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)";
my ($vals_str) = $input =~ /\(([^)]+)\)/;
# Get substrings of interest.
my @groups = $vals_str =~ /[^,].+?,[AD](?=,|$)/g;
# Split those into your desired arrays.
my @forces = map [split /,/, $_], @groups;
Note that this regex-based approach is reasonable for situations when you can assume that your input data is fairly clean. If you need to handle messier data and need your code to perform validation, I would suggest that you consider a different parsing strategy (as suggested in other answers).
my $str = 'FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)';
my ($list) = $str =~ /^[^=]*=\(([^()]*)\)$/
or die("Unexpected format");
my @list = split(/,/, $list);
my @forces;
while (@list) {
my @force;
while (1) {
die('No "A" or "D" value found') if !@list;
push @force, shift(@list);
last if $force[-1] eq 'A' || $force[-1] eq 'D';
}
push @forces, \@force;
}
Result:
@{$forces[0]} = ( 1, 10, 'A' );
@{$forces[1]} = ( 11, 20, 'D' );
@{$forces[2]} = ( 31, 5, 'BI', 'A' );
@{$forces[3]} = ( 36, 9, 'NU', 'D' );
@{$forces[4]} = ( 46, 9, 'D' );
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils 'part';
# Grab the stuff inside the parens.
my $input = "FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)";
my ($vals_str) = $input =~ /\(([^)]+)\)/;
my @val = split(/,/,$vals_str);
print "Val Array = @val\n";
my $i = 0;
my @partitions = part { $_ eq 'A' || $_ eq 'D' ? $i++ : $i } @val;
This creates an array @partitions where each element is a reference to an array with the 3 or 4 elements you want grouped.
Let's start with some issues:
@main = "FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)";
You have use strict, but you never declare @main, and @main is an array, yet you're assigning it a single string.
my @val = split(/,/,$1);
Where does $1 come from?
print "Val Array = @val\n";
This might actually work, if @val had anything in it.
You have:
Array1=1,10,A
Array2=11,20,D
Array3=31,5,BI,A
Array4=36,9,NU,D
Array5=46,9,D
As your desired results. Are these scalar variables, or are these sub-arrays?
I'm going to assume the following:
You need to convert your FORCE string into an array.
You need your results in various arrays.
Because of this, I'm going to use an Array of Arrays which means I'm going to be using References.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
# Convert the string into an array
my $force = "FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)";
$force =~ s/FORCE=\((.*)\)/$1/; # Remove the "FORCE=(" prefix and the ")" suffix
my @main = split /,/, $force; # Convert string into an array
my @array_of_arrays; # Where I'm storing the arrays of arrays
my $array_of_arrays_number = 0; # Array number I'm using for @arrays
while (@main) { # Going through my "@main" array one character at a time
# Take a character from the @main array and put it onto whatever array of arrays you're pushing items into
my $character = shift @main;
push @{ $array_of_arrays[$array_of_arrays_number] }, $character;
# If Character is 'A' or 'D', start a new array_of_arrays
if ( $character eq 'A' or $character eq 'D' ) {
$array_of_arrays_number += 1;
}
}
# Let's print out these arrays
for my $array_number ( 0..$#array_of_arrays ) {
say "Array$array_number = ", join ", ", @{ $array_of_arrays[$array_number] };
}
I like a functional approach, so here is a version that computes the slice indices first and then generates the array of subarrays:
use strict;
use warnings;
use Carp;
sub splice_force ($) {
my $str = shift;
croak "Unexpected format" unless $str =~ /^FORCE=\(([^()]*)\)/;
my @list = split ',', $1;
# find end positions for each splice
my @ends = grep $list[$_] =~ /^[AD]$/, 0 .. $#list;
# make array with starting positions
my @starts = ( 0, map $_ + 1, @ends );
# finally make splices (ignore last @starts element so iterate by @ends)
map [ @list[ shift(@starts) .. $_ ] ], @ends;
}
my $str = 'FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)';
print "@$_\n" for splice_force $str;
You can do this without creating intermediate arrays:
#!/usr/bin/env perl
use strict;
use warnings;
my $input = q{FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)};
my @groups = ([]);
while ($input =~ / ([A-Z0-9]+) ( [,)] ) /xg) {
my ($token, $sep) = ($1, $2);
push @{ $groups[-1] }, $token;
$token =~ /\A(?:A|D)\z/
or next;
$sep eq ')'
and last;
push @groups, [];
}
use YAML::XS;
print Dump \@groups;
Output:
---
- - '1'
  - '10'
  - A
- - '11'
  - '20'
  - D
- - '31'
  - '5'
  - BI
  - A
- - '36'
  - '9'
  - NU
  - D
- - '46'
  - '9'
  - D
There is no need for anything more than split. This solution checks that the string has the expected form and extracts the characters between the parentheses. Then that is split on commas that are preceded by a field that contains A or D, and the result is split again on commas.
use strict;
use warnings;
use 5.014; # For \K regex pattern
my $str = 'FORCE=(1,10,A,11,20,D,31,5,BI,A,36,9,NU,D,46,9,D)';
my @parts;
if ( $str =~ /FORCE \s* = \s* \( ( [^)]+ ) \)/x ) {
@parts = map [ split /,/ ], split / [AD] [^,]* \K , /x, $1;
}
use Data::Dump;
dd \@parts;
output
[
[1, 10, "A"],
[11, 20, "D"],
[31, 5, "BI", "A"],
[36, 9, "NU", "D"],
[46, 9, "D"],
]

Cutting apart string in Perl

I have a string in Perl that is 23 digits long. I need to cut it apart into different pieces. First 2 digits in one variable, next 3 in another variable, next 4 into another variable, etc. Basically the 23 digits needs to end up as 6 separate variables (2,3,4,4,3,7) characters, in that order.
Any ideas how I can cut the string up like this?
There are lots of ways to do it, but the shortest is probably unpack:
my $string = '1' x 23;
my @values = unpack 'A2A3A4A4A3A7', $string;
If you need separate variables, you can use a list assignment:
my ($v1, $v2, $v3, $v4, $v5, $v6) = unpack 'A2A3A4A4A3A7', $string;
Expanding on Alex's method, rather than specify each start and end, use the list you gave of lengths.
#!/usr/bin/env perl
use strict;
use warnings;
my $string = "abcdefghijklmnopqrstuvw";
my $pos = 0;
my @split = map {
my $start = $pos;
my $end = $_;
$pos += $end;
substr( $string, $start, $end);
} (2,3,4,4,3,7);
print "$_\n" for @split;
That said, you should probably look at unpack, which is used for fixed-width fields. I have no experience with it, though.
You could use a regex, viz:
$string =~ /\d{2}\d{3}\d{4}\d{4}\d{3}\d{7}/
and capture each part by surrounding with brackets ().
You then find each capture in the variables $1, $2 ...
or get them all in the returned list
See perldoc perlre
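A minimal sketch of the capture-group version described above, taking the captures from the list returned by the match. The sample digits and the anchors \A and \z are assumptions added so that a string of the wrong length is rejected.

```perl
use strict;
use warnings;

my $string = '12345678901234567890123';   # 23 digits (sample data)

# In list context a successful match returns the captured groups.
my @parts = $string =~ /\A(\d{2})(\d{3})(\d{4})(\d{4})(\d{3})(\d{7})\z/
    or die "Expected exactly 23 digits\n";

print "@parts\n";   # 12 345 6789 0123 456 7890123
```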
You want to use substr (see perldoc -f substr).
$substring = substr($string, $start, $length);
I'd also use `map' on a list of [start, length] pairs to make your life easier:
$string = "123456789";
@values = map {substr($string, $_->[0], $_->[1])} ([1, 3], [4, 2] , ...);
Here's a sub that will do it, using the already discussed unpack.
sub string_slices {
my $str = shift;
return unpack( join( 'A', '', @_ ), $str );
}
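For illustration, this is how the sub might be called with the widths from the question. The sample string is an assumption; the comment on the join shows why the leading empty string is needed.

```perl
use strict;
use warnings;

sub string_slices {
    my $str = shift;
    # join 'A', '', 2, 3, 4 yields the template "A2A3A4":
    # the leading '' puts an 'A' in front of the first width.
    return unpack( join( 'A', '', @_ ), $str );
}

my @fields = string_slices( 'abcdefghijklmnopqrstuvw', 2, 3, 4, 4, 3, 7 );
print "@fields\n";   # ab cde fghi jklm nop qrstuvw
```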

Is there any way to grab a slice to the end of an anonymous array in Perl?

So this has been making me go bonkers for the last half hour. Is there any way for me to grab an array slice to the end of an anonymous array? I've tried:
(split(' ',$test_line))[1..$#_]
and I've tried:
(split(' ',$test_line))[1..-1]
but aggravatingly, neither of those work. I really don't want to have an extra temp variable instantiated to the intermediate array (which I don't need). And I really don't want to use an ugly and unreadable one liner (found a few of those online). Is there really no straight forward way to do this?
A list, which is what you have in your example, cannot be sliced from the end. This is mainly because lists are not proper data structures in Perl, but more a construct that the interpreter uses to move data around. So, knowing that you can only slice a list from the beginning, your options are to either put it in an array variable and then slice, change your algorithm to return what you want, or the following:
If you are assigning this value to something, you can use undef in each slot you don't want:
my (undef, #list) = split ' ' => $test_line;
If you post some more code, I can revise.
Alternatively, you can use some tools from functional programming. The pair of functions drop and take can be useful to resize a list without additional variables:
sub take {
my $n = shift;
@_[0..$n-1]
}
sub drop {
my $n = shift;
@_[$n..$#_]
}
and then your example becomes
drop 1, split ' ' => $test_line;
drop 1 is also commonly called tail
sub tail {drop 1, @_}
and of course, since all of these are so short, if you wanted to inline it:
sub {shift; @_}->(split ' ' => ...)
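A short usage sketch of take and drop on the result of a split. The sample line is an assumption; the subs are defined before use so they can be called on a list directly.

```perl
use strict;
use warnings;

sub take { my $n = shift; @_[ 0 .. $n - 1 ] }   # first $n elements
sub drop { my $n = shift; @_[ $n .. $#_ ] }     # everything after the first $n

my $test_line = 'alpha beta gamma delta';

my @rest  = drop( 1, split ' ', $test_line );
my @first = take( 2, split ' ', $test_line );

print "@rest\n";    # beta gamma delta
print "@first\n";   # alpha beta
```

Note that take, as written, returns undef elements if asked for more items than the list holds.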
When the OP said slice, I thought of splice:
@allTheWordsExceptTheFirstTwo = splice @{[split' ', $test_line]}, 2;
@allExceptTheFirstAndLastTwo = splice @{[split' ', $test_line]}, 2, -2;
You can use negative ranges in the array subscript to address an arbitrary number of elements from the end:
my $x = join ' ' => 'a' .. 'z';
my @x = (split ' ', $x)[-13 .. -1];
However, this requires you to know the total number of elements in the result of split to eliminate just the first element.
If this happens in only one place, using a do block should work:
my $x = join ' ', 'a' .. 'z';
my @x = do { my @y = (split ' ', $x); @y[1 .. $#y] };
In your case, I would factor out the whole operation into a subroutine if it is supposed to be used frequently, passing the string rather than the result of the split to the subroutine (this can be further generalized by allowing the user to pass the split pattern as well):
my $x = join ' ', 'a' .. 'g';
my @x = skip_first_n_from_split(3, $x);
print Dump \@x;
sub skip_first_n_from_split {
my ($n, $x) = @_;
my @y = split ' ', $x;
return @y[$n .. $#y];
}
Having fun:
#!/usr/bin/perl
use strict; use warnings;
my $x = join ' ', 1 .. 8;
my @skippers = map make_skipper(' ', $_), 0 .. 7;
print "@$_\n" for map $_->($x), @skippers;
sub make_skipper {
my ($pattern, $n) = @_;
return sub {
my $string = shift;
my $i = 0;
return [ grep $i++ >= $n, split $pattern, $string ];
}
}
Output:
1 2 3 4 5 6 7 8
2 3 4 5 6 7 8
3 4 5 6 7 8
4 5 6 7 8
5 6 7 8
6 7 8
7 8
8
I don't believe you can specify an index for the last element of an arbitrary list expression, but how about:
split(' ', (split ' ', $test_line, 2)[1])
By the way, there are no anonymous arrays here (or in your original question), only lists.
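A quick sketch of the double split suggested above, with a sample line assumed for illustration. The inner split uses a limit of 2, so its second element is the whole remainder of the line.

```perl
use strict;
use warnings;

my $test_line = 'alpha beta gamma delta';

# Inner split with LIMIT 2 yields ('alpha', 'beta gamma delta');
# element [1] is the remainder, which the outer split breaks up.
my @rest = split( ' ', ( split ' ', $test_line, 2 )[1] );

print "@rest\n";   # beta gamma delta
```

If the line holds only one word, element [1] is undef and the outer split will warn under warnings, so a real caller may want to guard against that case.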
This just got answered 'neatly' on Perlmonks by BrowserUK, this is what you'd do:
my @slice = sub{ @_[1..$#_] }->( split ' ', $test_line );
if you're open to splitting twice:
my @g = (split ' ', $test_str)[1..split(' ', $test_str)];
or, more correctly (since split returns the number of fields found, which is one more than the last field's index because indexing is 0-based):
my @g = (split ' ', $test_str)[1..split(' ', $test_str)-1];
Unfortunately, these throw a deprecation warning under the warnings pragma and clobber the contents of @_ (unless you're using 5.12, in which case you're fine; otherwise go with a temporary variable, an inline sub, or a loop).