perl splitting a string - perl

I realize that I did not make the title good enough, but I am not getting any better.
Assume there is a string
$str = "(aa)(bb)(cc)(dd)(ee)";
That is, there are substrings, enclosed in parenthesis, there is no space between the parenthesis groups, i.e. like ()(), but inside the parenthesis, where i wrote aa, bb, cc, etc, there can be spaces. Parenthesis may be nested, but that is not absolutely important. but there are unknown number of parenthesis groups.
Now I want to split the string in an array of strings, each element having a (balanced) parenthesis enclosed element . that is,
# #arr contains now ("(aa)", "(bb)", "(cc)" .. etc)
of course i can implement a counter based method, but wont perl, being perl, have some built in methods for this? I dont quite know how this particular operation called, so i dont know what to look for, string splitting is too general, no?
edit: splitting parentheses delimmited string in perl <--- searching this is not returning me anything useful, i guess this is due to the fact it is not really DELIMITED, enclosed?

#arr= map { "$_)" } split /\)/, $str;
This method strips the ending parenthesee, but then adds it back.
Another way is with the 'global' flag on a regex, which returns all matches.
#arr= ( $str =~ /\([^)]*\)/g )

There are several suggestions.
For example the first:
use strict;
my $str = "(aa)(bb)(cc)(dd)(ee)";
my #arr;
while ($str =~ /(\(.*?\))/ig) {
push #arr, $1;
};

If we ignore nesting, what you want to do is split between ) and (.
my #arr = split /(?<=\()(?=\()/, $str;
Instead of splitting, you could also extract the parts.
my #arr = $str =~ /( \( [^()]* \) )/xg;
Matching nested parens is just a matter of applying this regex pattern recursively.
my #arr = $str =~ /\G ( \( (?: [^()]++ | (?1) )* \) )/xg;

Related

Matching special character (###!~`%^&()[]}{;') and replace it with _ (underscore) in perl

I want to remove all special characters except this 2 character .-
$name=~s/[^\w\d\.-]/_/g ;
But the line above it not only removes the special character but also non-alphabet characters e.g Arabic or other none alphabet characters.
How to remove only these characters (###!~`%^&()[]}{;',)
There are a few things to consider here.
First, do \d and \w really do what you think they do? Recent perls are Unicode aware (and in some cases locale aware), and those character classes aren't the same in every situation.
Since you know what you want to exclude, you can just put that directly into the character class. You need escape only the ] so it doesn't end the character class:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/[(###!~`%^&()[\]}{;',)]/_/g;
say $name;
Mark Jason Dominus has written about the "American" and "Prussian" approaches to cleansing data. You can specify what to exclude, or what to include.
If you specify the things to exclude, you potentially pass through some things that you should have excluded but did not. This may be because you forgot or didn't even know you should exclude it. These unintended situations may bite you.
If you specify only the things that are safe, you potentially miss out on things you should pass through, but bad things don't get through by mistakes of omission.
You then might try this, where you don't use the character class shortcuts:
$name =~ s/[^0-9A-Za-z.-]/_/g;
But the output is a bit weird because this also replaces whitespace. You might add the \s shortcut in there:
$name =~ s/[^0-9A-Za-z\s.-]/_/g;
But the meaning of \s has also changed over time too (vertical tab!) and is also Unicode aware. You could list the whitespace you would accept:
$name =~ s/[^0-9A-Za-z\x20.-]/_/g;
But no this is getting a bit weird. There's another way. You can go back to the ASCII versions of the character class shortcuts with the /a flag:
$name =~ s/[^\d\w\s.-]/_/ga;
The regex operator flags are in perlop since they apply to an operator. But, for as long as I've been using Perl and telling that to people in classes, someone I still go to perlre first.
Transliterate
Second, the substitution operator may be more than you need though. If you want to change single characters into other single characters, the transliteration operator may be what you need. It changes the character on the left with the corresponding character on the right:
$name =~ tr/abc/XYZ/; # a -> X, b -> Y, c -> Z
If you don't have enough characters to match up on the right, it reuses the last character:
$name =~ tr/abc/XY/; # a -> X, b -> Y, c -> Y
So, in your case with one underscore:
$name =~ tr/##!~`%^&()[]}{;',/_/;
Since the sequence of characters in tr/// aren't a regular expression, you don't worry about metacharacters.
Just for giggles
If this pattern is something you want to use in multiple places, you might want to give it a name with a user-defined Unicode property. Once it has a name, you use that everywhere and can update for everyone at the same time:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/\p{IsForbidden}/_/g;
say $name;
sub IsForbidden {
# see https://perldoc.perl.org/perlunicode#User-Defined-Character-Properties
state $exclude = q|##!~`%^&()[]}{;',|;
state $string =
join '',
map { sprintf "%X\n", ord }
split( //, $exclude );
return $string;
}
Building on Gene's comment, specify what you want to replace but I'd escape each special character. Note, to replace #, use \#\# in character array as shown in line 2:
$name = "# # R ! ~## ` % ^ & ( O ){{();,'`## { } ;!!! ' N , ";
$name =~ s/[\#\!\~\`\%\&\^\(\)\{\}\;\'\,\#\#]//g;
$name =~ s/ *//g;
print $name;
### Outputs RON

How can I remove all the vowels unless they are in word beginnings?

$text = "I like apples more than oranges\n";
#words = split /” “/, $text;
foreach (#words) [1..] {
if $words "AEIOUaeiou";
$words =~ tr/A E I O U a e i o u//d;
}
print "$words\n";
"I like apples more than oranges" will become "I lk appls mr thn orngs". "I" in "I", "a" in "appls" and "o" in "orngs" will stay because they are the first letter in the word.
This is my research assignment as a first year student. I am allowed to ask questions and later cite them. Please don't be mean.
I know you say you are not allowed to use a regex, but for everyone else that shows up here I'll show the use of proper tools. But, then I'll do something just as useful with tr///.
One of the tricks of programming (and mathematics) decomposing what look like hard problems into easier problems, especially if you already have solutions for the easy problems. (Read about Parnas decomposition, for example).
So, the question is "How can I remove all the vowels unless they are in word beginnings?" (after I made your title a bit shorter). This led the answers to think about words, so they split up the input, did some work to ensure they weren't working on the first character, and then reassembled the result.
But, another way to frame the problem is "How do I remove all the vowels that come after another letter?". The only letter that doesn't come after another letter is the first letter of a word.
The regex for a vowel that comes after another letter is simple (but I'll stick to ASCII here, although it is just as simple for any Unicode letter):
[a-z][aeiou]
That only matches when there is a vowel after the first letter. Now you want to replace all of those with nothing. Use the substitution operator, s///. The /g flag makes all global substitutions and the /i makes it case insensitive:
s/[a-z][aeiou]//gi;
But, there's a problem. It also replaces that leading letter. That's easy enough to fix. The \K in a substitution says to ignore the part of the pattern before it in the replacement. Anything before the \K is not replaced. So, this only replaces the vowels:
s/[a-z]\K[aeiou]//gi;
But, maybe there are vowels next to each other, so throw in the + quantifier for "one or more" of the preceding item:
s/[a-z]\K[aeiou]+//gi;
You don't need to care about words at all.
Some other ways
Saying that a letter must follow another letter has a special zero-width assertion: the non-word boundary, \B (although that also counts digits and underscore as "letters"):
s/\B[aeiou]+//gi;
The \K was introduced v5.10 and was really a nifty trick to have a variable-width lookbehind. But, the lookbehind here is fixed width: it's one character:
s/(?<=[a-z])[aeiou]+//gi;
But, caring about words
Suppose you need to handle each word separately, for some other requirement. It looks like you've mixed a little Python-ish sort of code, and it would be nice if Perl could do that :). The problem doesn't change that much because you can do the same thing for each individual word.
foreach my $word ( split /\s+/, $x ) {
.... # same thing for each word
}
But, here's an interesting twist? How do you put it all back together? The other solutions just use a single space assuming that's the separator. Maybe there should be two spaces, or tabs, or whatever. The split has a special "separator retention mode" that can keep whatever was between the pieces. When you have captures in the split pattern, those capture values are part of the output list:
my #words_and_separators = split /(\s+)/, $x;
Since you know that none of the separators will have vowels, you can make substitutions on them knowing they won't change. This means you can treat them just like the words (that is, there is no special case, which is another thing to think about as you decompose problems). To get your final string with the original spacing, join on the empty string:
my $ending_string = join '', #words_and_separators;
So, here's how that might all look put together. I'll add the /r flag on the substitution so it returns the modified copy instead of working on the original (don't modify the control variable!):
my #words;
foreach my $word ( split /(\s+)/, $x ) {
push #words, $word =~ s/\B[aeiou]+//gr;
}
my $ending_string = join '', #words;
But, that foreach is a bit annoying. This list pipeline is the same, and it's easier to read these bottom to top. Each thing produces a list that flows into the thing above it. This is how I'd probably express it in real code:
my $ending_string =
join '',
map { s/\B[aeiou]+//gr } # each item is in $_
split /(\s+)/, $x;
Now, here's the grand finale. What if we didn't split thing up on whitespace but on whitespace and the first letter of each word? With separator retention mode we know that we only have to affect every other item, so we count them as we do the map:
my $n = 0;
my $ending_string =
join '',
map { ++$n % 2 ? tr/aeiouAEIOU//dr : $_ }
split /((?:^|\s+)[a-z])/i, $x;
But, I wouldn't write this technique in this way because someone would ultimately find me and exact their revenge. Instead, that foreach I found annoying before may soothe the angry masses:
my $n = 0;
foreach ( split /((?:^|\s+)[a-z])/i, $x ) {
print ++$n % 2 ? tr/aeiouAEIOU//dr : $_;
}
This now remembers the actual separators from the original string and leaves alone the first character of the "word" because it's not in the element we will modify.
The code in the foreach doesn't need to use the conditional operator, ?: or some of the other features. The important part is skipping every other element. That split pattern is a bit of a puzzler if you haven't seen it before, but that's what you get with those sorts of requirements. I think modifying a portion of the substring is just as likely to trip up people on a first read.
I mean, if they are going to make you do it the wrong way in the homework, strike back with something that will take up a bit of their time. :)
Oh, this is fun
I had another idea, because tr/// has another task beyond transliteration. It also counts. Because it returns the number of replacements, if you replace anything with itself, you get a count of the occurrences of that thing. You can count vowels, for instance:
my $has_vowels = $string =~ tr/aeiou/aeiou/; # counts vowels
But, with a string of one letter, that means you have a way to tell if it is a vowel:
my $is_vowel = substr( $string, $i, 1 ) =~ tr/aeiou/aeiou/;
You also can know things about the previous character:
my $is_letter = substr( $string, $i - 1, 1 ) =~ tr/a-zA-Z/a-zA-Z/;
Put that together and you can look at any position and know if it's a vowel that follows a letter. If so, you skip that letter. Otherwise, you add that letter to the output:
use v5.10;
$x = "I like apples more than oranges oooooranges\n";
my $output = substr $x, 0, 1; # avoid the -1 trap (end of string!)
for( my $i = 1; $i < length $x; $i++ ) {
if( substr( $x, $i, 1 ) =~ tr/aeiou/aeiou/ ) { # is a vowel
next if substr( $x, $i - 1, 1 ) =~ tr/a-zA-Z/a-zA-Z/;
}
$output .= substr $x, $i, 1;
}
say $output;
This has the fun consequence of using the recommended operator but completely bypassing the intent. But, this is a proper and intended use of tr///.
It appears that you need to put a little more effort into learning Perl before taking on challenges like this. Your example contains a lot of code that simply isn't valid Perl.
$x = "I like apples more than oranges\n"; #the original sentence
foreach $i in #x[1..] {
You assign your text to the scalar variable $x, but then try to use the array variable #x. In Perl, these are two completely separate variables that have no connection whatsoever. Also, in Perl, the range operator (..) needs values at both ends.
If you had an array called #x (and you don't, you have a scalar) then you could do what you're trying to do here with foreach $i (#x)
if $i "AEIOUaeiou";
I'm not sure what you're trying to do here. I guess the nearest useful Perl expression I can see would be something like:
if ($i =~ /^[AEIOUaeiou]$/)
Which would test if $i is a vowel. But that's a regex, so you're not allowed to use it.
Obviously, I'd solve this problem with a regex, but as those are banned, I've reached for some slightly more obscure Perl features in my code below (that's so your teacher won't believe this is your solution if you just cut and paste it):
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $text = "I like apples more than oranges\n";
# Split the string into an array of words
my #words = split /\s+/, $text;
# For each word...
for (#words) {
# Get a substring that omits the first character
# and use tr/// to remove vowels from that substring
substr($_, 1) =~ tr/AEIOUaeiou//d;
}
# Join the array back together
$text = join ' ', #words;
say $text;
Update: Oh, and notice that I've used tr/AEIUOaeiou//d where you have tr/A E I O U a e i o u//d. It probably won't make any difference here (depending on your approach - but you'll probably be applying it to strings that don't contain spaces) but it's good practice to only include the characters that you want to remove.
We can go over the input string from the end and remove any vowel that's not preceded by a space. We go from right to left so we don't have to adjust the position after each deletion. We don't need to check the very first letter, it shouldn't be ever removed. To remove a vowel, we can use tr///d on the substr of the original string.
for my $i (reverse 1 .. length $x) {
substr($x, $i, 1) =~ tr/aeiouAEIOU//d
if substr($x, $i - 1, 1) ne ' ';
}
Firstly your if statement is wrong.
Secondly this is not a Perl code.
Here is a piece of code that will work, but there is a better way to do it
my $x = "I like apples more than oranges\n";
my $new = "";
my #arr;
foreach my $word (split(' ', $x)) {
#arr = split('', $word);
foreach (my $i; $i<scalar #arr; $i++){
if ($i == 0){
$new .= $arr[$i];
}
elsif (index("AEIOUaeiou", $arr[$i]) == -1) {
$new .= $arr[$i];
}
}
$new .= " ";
}
print "$new\n";
Here I am splitting the string in order to get an array, then I am checking if the given char is a vowel, if it's not, I am appending it to a new string.
Always include
use strict;
use warnings;
on top of your code.
Clearly this is an exercise in lvalues. Obviously. Indubitably!
#!/usr/bin/env perl
# any old perl will do
use 5.010;
use strict;
use warnings;
# This is not idomatic nor fantastic code. Idiotastic?
$_='I am yclept Azure-Orange, queueing to close a query. How are YOU?';
# My little paws typed "local pos" and got
# "Useless localization of match position" :(
# so a busy $b keeps/restores that value
while (/\b./g) {
substr($_,$b=pos,/\b/g && -$b+pos)
# Suggestion to use tr is poetic, not pragmatic,
# ~ tr is sometimes y and y is sometimes a vowel
=~ y/aeiouAEIOU//d;
pos=$b;
}
say
# "say" is the last word.
Was there an embargo against using s/// substitution, or against using all regex? For some reason I thought matching was OK, just not substitution. If matches are OK, I have an idea that "improves" upon this by removing $b through pattern matching side effects. Will see if it pans out. If not, should be pretty easy to replace /\b/ and pos with index and variables, though the definition of word boundary over-simplifies in that case.
(edit) here it is a little more legible with nary a regex
my $text="YO you are the one! The-only-person- asking about double spaces.
Unfortunate about newlines...";
for (my $end=length $text;
$end > 0 && (my $start = rindex $text,' ',$end);
$end = $start-1) {
# y is a beautiful letter, using it for vowels is poetry.
substr($text,2+$start,$end-$start) =~ y/aeiouUOIEA//d;
}
say $text;
Maybe more devious minds will succeed with vec, unpack, open, fork?
You can learn about some of these techniques via
perldoc -f substr
perldoc -f pos
perldoc re
As for my own implementer notes, the least important thing is ending without punctuation so nothing can go after

Perl - Convert integer to text Char(1,2,3,4,5,6)

I am after some help trying to convert the following log I have to plain text.
This is a URL so there maybe %20 = 'space' and other but the main bit I am trying convert is the char(1,2,3,4,5,6) to text.
Below is an example of what I am trying to convert.
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
What I have tried so far is the following while trying to added into the char(in here) to convert with the chr($2)
perl -pe "s/(char())/chr($2)/ge"
All this has manage to do is remove the char but now I am trying to convert the number to text and remove the commas and brackets.
I maybe way off with how I am doing as I am fairly new to to perl.
perl -pe "s/word to remove/word to change it to/ge"
"s/(char(what goes in here))/chr($2)/ge"
Output try to achieve is
select -x1-Q-,-x2-Q-,-x3-Q-
Or
select%20-x1-Q-,-x2-Q-,-x3-Q-
Thanks for any help
There's too much to do here for a reasonable one-liner. Also, a script is easier to adjust later
use warnings;
use strict;
use feature 'say';
use URI::Escape 'uri_unescape';
my $string = q{select%20}
. q{char(45,120,49,45,81,45),char(45,120,50,45,81,45),}
. q{char(45,120,51,45,81,45)};
my $new_string = uri_unescape($string); # convert %20 and such
my #parts = $new_string =~ /(.*?)(char.*)/;
$parts[1] = join ',', map { chr( (/([0-9]+)/)[0] ) } split /,/, $parts[1];
$new_string = join '', #parts;
say $new_string;
this prints
select -x1-Q-,-x2-Q-,-x3-Q-
Comments
Module URI::Escape is used to convert percent-encoded characters, per RFC 3986
It is unspecified whether anything can follow the part with char(...)s, and what that might be. If there can be more after last char(...) adjust the splitting into #parts, or clarify
In the part with char(...)s only the numbers are needed, what regex in map uses
If you are going to use regex you should read up on it. See
perlretut, a tutorial
perlrequick, a quick-start introduction
perlre, the full account of syntax
perlreref, a quick reference (its See Also section is useful on its own)
Alright, this is going to be a messy "one-liner". Assuming your text is in a variable called $text.
$text =~ s{char\( ( (?: (?:\d+,)* \d+ )? ) \)}{
my #arr = split /,/, $1;
my $temp = join('', map { chr($_) } #arr);
$temp =~ s/^|$/"/g;
$temp
}xeg;
The regular expression matches char(, followed by a comma-separated list of sequences of digits, followed by ). We capture the digits in capture group $1. In the substitution, we split $1 on the comma (since chr only works on one character, not a whole list of them). Then we map chr over each number and concatenate the result into a string. The next line simply puts quotation marks at the start and end of the string (presumably you want the output quoted) and then returns the new string.
Input:
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
Output:
select%20"-x1-Q-","-x2-Q-","-x3-Q-"
If you want to replace the % escape sequences as well, I suggest doing that in a separate line. Trying to integrate both substitutions into one statement is going to get very hairy.
This will do as you ask. It performs the decoding in two stages: first the URI-encoding is decoded using chr hex $1, and then each char() function is translated to the string corresponding to the character equivalents of its decimal parameters
use strict;
use warnings 'all';
use feature 'say';
my $s = 'select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)';
$s =~ s/%(\d+)/ chr hex $1 /eg;
$s =~ s{ char \s* \( ( [^()]+ ) \) }{ join '', map chr, $1 =~ /\d+/g }xge;
say $s;
output
select -x1-Q-,-x2-Q-,-x3-Q-

Split functions

I want to get the split characters. I tried the below coding, but I can able to get the splitted text only. However if the split characters are same then it should be returned as that single characters
For example if the string is "asa,agas,asa" then only , should be returned.
So in the below case I should get as "| : ;" (joined with space)
use strict;
use warnings;
my $str = "Welcome|a:g;v";
my #value = split /[,;:.%|]/, $str;
foreach my $final (#value) {
print $final, "\n";
}
split splits a string into elements when given what separates those elements, so split is not what you want. Instead, use:
my #punctuations = $str =~ /([,;:.%|])/g;
So you want to get the opposite of split
try:
my #value=split /[^,;:.%|]+/,$str;
It will split on anything but the delimiters you set.
Correction after commnets:
my #value=split /[^,;:.%|]+/,$str;
shift #value;
this works fine, and gives unique answers
#value = ();
foreach(split('',",;:.%|")) { push #value,$_ if $str=~/$_/; }
To extract all the separators only once, you need something more elaborate
my #punctuations = keys %{{ map { $_ => 1 } $str =~ /[,;:.%|]/g }};
Sounds like you call "split characters" what the rest of us call "delimiters" -- if so, the POSIX character class [:punct:] might prove valuable.
OTOH, if you have a defined list of delimiters, and all you want to do is list the ones present in the string, it's much more efficient to use m// rather than split.

How to extract a number from a string in Perl?

I have
print $str;
abcd*%1234$sdfsd..#d
The string would always have only one continuous stretch of numbers, like 1234 in this case. Rest all will be either alphabets or other special characters.
How can I extract the number (1234 in this case) and store it back in str?
This page suggests that I should use \d, but how?
If you don't want to modify the original string, you can extract the numbers by capturing them in the regex, using subpatterns. In list context, a regular expression returns the matches defined in the subpatterns.
my $str = 'abc 123 x456xy 789foo';
my ($first_num) = $str =~ /(\d+)/; # 123
my #all_nums = $str =~ /(\d+)/g; # (123, 456, 789)
$str =~ s/\D//g;
This removes all nondigit characters from the string. That's all that you need to do.
EDIT: if Unicode digits in other scripts may be present, a better solution is:
$str =~ s/[^0-9]//g;
If you wanted to do it the destructive way, this is the fastest way to do it.
$str =~ tr/0-9//cd;
translate all characters in the complement of 0-9 to nothing, delete them.
The one caveat to this approach, and Phillip Potter's, is that were there another group of digits further down the string, they would be concatenated with the first group of digits. So it's not clear that you would want to do this.
The surefire way to get one and only one group of digits is
( $str ) = $str =~ /(\d+)/;
The match, in a list context returns a list of captures. The parens around $str are simply to put the expression in a list context and assign the first capture to $str.
Personally, I would do it like this:
$s =~ /([0-9]+)/;
print $1;
$1 will contain the first group matched the given regular expression (the part in round brackets).