split function extension - perl

I am working through some sample code for the split function.
Sample code:
#!C:\Perl\bin\perl.exe
use strict;
use warnings;
my $info = "Caine:Michael:Actor:14, Leafy Drive";
my @personal = split(/:/, $info);
# @personal = ("Caine", "Michael", "Actor", "14, Leafy Drive");
If I change $info to "Caine Michael Actor /* info data */", how can I use split(/ /, $info) to get the result below?
# @personal = ("Caine", "Michael", "Actor", "info data");
Thank you.

Alternative approach:
Have you considered using the 3-parameter version of split?
$info = "Caine Michael Actor /* info data */";
@personal = split(' ', $info, 4);
resulting in
@personal = ('Caine', 'Michael', 'Actor', '/* info data */');
You would then have to remove the /* and */ delimiters to get your result.
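A minimal sketch of that two-step approach, assuming the comment is always the fourth field and is delimited by /* and */ (the separator '|' in the print is only there to make the field boundaries visible):

```perl
use strict;
use warnings;

my $info = "Caine Michael Actor /* info data */";

# A limit of 4 keeps everything after the third field together.
my @personal = split ' ', $info, 4;

# Strip the comment delimiters and the whitespace next to them.
$personal[3] =~ s{^/\*\s*|\s*\*/$}{}g;

print join('|', @personal), "\n";    # Caine|Michael|Actor|info data
```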

It really is better to use regex for this:
$info = "Caine Michael Actor /* info data */";
$info =~ /(\w+)\s+(\w+)\s+(\w+).*\/\*(.+)\*\//;
@personal = ($1, $2, $3, $4);
Mainly because your input string has ambiguities related to word separators not easily handled by split.
In case you're wondering how to read the regex:
/
(\w+) # CAPTURE a sequence of one or more word characters into $1
\s+ # MATCH one or more white space
(\w+) # CAPTURE a sequence of one or more word characters into $2
\s+ # MATCH one or more white space
(\w+) # CAPTURE a sequence of one or more word characters into $3
.* # MATCH zero or more of anything
\/\* # MATCH the opening of C-like comment /*
(.+) # CAPTURE a sequence of one or more of anything into $4
\*\/ # MATCH the closing of C-like comment */
/x
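For reference, here is a runnable sketch of the same pattern written with the /x modifier. It differs from the version above in two ways that are my own additions, not part of the original answer: it checks that the match succeeded before using the captures, and it trims the whitespace inside the comment so that the fourth field is "info data" rather than " info data ".

```perl
use strict;
use warnings;

my $info = "Caine Michael Actor /* info data */";
my @personal;

if ( $info =~ m{
        (\w+) \s+ (\w+) \s+ (\w+)   # three words separated by white space
        .*?                         # anything up to the comment
        /\* \s* (.+?) \s* \*/       # comment body without surrounding spaces
    }x )
{
    @personal = ( $1, $2, $3, $4 );
}

print join( ', ', @personal ), "\n";    # Caine, Michael, Actor, info data
```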

Since there isn't an answer yet that handles the general case, here goes:
split isn't your best bet here. Since the delimiter can be both a matched and a non-matched character, it is clearest to invert the problem and describe what you do want to match, which in this case is either a string of non-space characters or the contents of a C-style comment.
use strict;
use warnings;
use feature 'say';
my $info = "Caine Michael Actor /* info data */";
my @personal = grep {defined} $info =~ m! /\* \s* (.+?) \s* \*/ | (\S+) !xg;
say join ', ' => @personal;
That will return a list of words and comment contents, in whatever order they occur in the input. The syntax highlighter doesn't highlight the above regex properly; the regex is everything between the ! delimiters.

Cooked something up :). It only works for your example and does not generalize.
use strict;
use warnings;
my $info = "Caine Michael Actor /* info data */";
if($info=~m{/\*\s*(.*?)\s*\*/})
{
my $temp = $1;
$temp=~s{\s+}{##}g;
$info=~s{/\*\s*(.*?)\s*\*/}{$temp};
}
my @personal = split(/ /, $info);
foreach(@personal)
{
s{##}{ }g;
print "$_\n";
}
Output:
C:>perl a.pl
Caine
Michael
Actor
info data

Related

Perl - Convert integer to text Char(1,2,3,4,5,6)

I am after some help trying to convert the following log I have to plain text.
This is a URL, so there may be %20 (a space) and other escapes, but the main part I am trying to convert to text is the char(1,2,3,4,5,6).
Below is an example of what I am trying to convert.
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
What I have tried so far is the following, attempting to capture what is inside char(...) and convert it with chr($2):
perl -pe "s/(char())/chr($2)/ge"
All this has managed to do is remove the char, but now I am trying to convert the numbers to text and remove the commas and brackets.
I may be way off with how I am doing this, as I am fairly new to Perl.
perl -pe "s/word to remove/word to change it to/ge"
"s/(char(what goes in here))/chr($2)/ge"
The output I am trying to achieve is
select -x1-Q-,-x2-Q-,-x3-Q-
Or
select%20-x1-Q-,-x2-Q-,-x3-Q-
Thanks for any help
There's too much to do here for a reasonable one-liner. Also, a script is easier to adjust later.
use warnings;
use strict;
use feature 'say';
use URI::Escape 'uri_unescape';
my $string = q{select%20}
. q{char(45,120,49,45,81,45),char(45,120,50,45,81,45),}
. q{char(45,120,51,45,81,45)};
my $new_string = uri_unescape($string); # convert %20 and such
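my @parts = $new_string =~ /(.*?)(char.*)/;
# Split into the individual char(...) groups, turn the numbers of each group into a string, rejoin with commas
$parts[1] = join ',', map { join '', map chr, /([0-9]+)/g } split /\),/, $parts[1];
$new_string = join '', @parts;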
say $new_string;
this prints
select -x1-Q-,-x2-Q-,-x3-Q-
Comments
Module URI::Escape is used to convert percent-encoded characters, per RFC 3986
It is unspecified whether anything can follow the part with char(...)s, and what that might be. If there can be more after the last char(...), adjust the splitting into @parts, or clarify
In the part with char(...)s only the numbers are needed, which is what the regex in map extracts
If you are going to use regex you should read up on it. See
perlretut, a tutorial
perlrequick, a quick-start introduction
perlre, the full account of syntax
perlreref, a quick reference (its See Also section is useful on its own)
Alright, this is going to be a messy "one-liner". Assuming your text is in a variable called $text.
$text =~ s{char\( ( (?: (?:\d+,)* \d+ )? ) \)}{
my @arr = split /,/, $1;
my $temp = join('', map { chr($_) } @arr);
$temp =~ s/^|$/"/g;
$temp
}xeg;
The regular expression matches char(, followed by a comma-separated list of sequences of digits, followed by ). We capture the digits in capture group $1. In the substitution, we split $1 on the comma (since chr only works on one character, not a whole list of them). Then we map chr over each number and concatenate the result into a string. The next line simply puts quotation marks at the start and end of the string (presumably you want the output quoted) and then returns the new string.
Input:
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
Output:
select%20"-x1-Q-","-x2-Q-","-x3-Q-"
If you want to replace the % escape sequences as well, I suggest doing that in a separate line. Trying to integrate both substitutions into one statement is going to get very hairy.
This will do as you ask. It performs the decoding in two stages: first the URI-encoding is decoded using chr hex $1 on each pair of hex digits after a %, and then each char() function is translated to the string corresponding to the character equivalents of its decimal parameters.
use strict;
use warnings 'all';
use feature 'say';
my $s = 'select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)';
$s =~ s/%([0-9A-Fa-f]{2})/ chr hex $1 /eg;
$s =~ s{ char \s* \( ( [^()]+ ) \) }{ join '', map chr, $1 =~ /\d+/g }xge;
say $s;
output
select -x1-Q-,-x2-Q-,-x3-Q-

Perl parsing Text File with regular expression

I have a file with the following random structures:
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"
or
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"
I am trying to parse it with perl to get the values like the following:
1362224754632;00966590832186;580;AAA;L2
Below is the code:
if($Record =~ /USMS (.*?)|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" FORMAT="(.*?)" THRESHOLDID="(.*?)" TEXT="(.*?)"/)
{
print LOGFILE "$1;$2;$3;$4;$5;$6;$7\n";
}
elsif($Record =~ /USMS (.*?)|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" FORMAT="(.*?)" TEXT="(.*?)"/)
{
print LOGFILE "$1;$2;$3;$4;$5;$6\n";
}
But I always get:
;;;;;
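Pipe (|) is a special character in regular expressions (it means alternation). Escape it as \| and it will work. Note also that in your sample data THRESHOLDID comes before FORMAT, so the pattern has to list the attributes in that order:
if($Record =~ /USMS (.*?)\|<REQ MSISDN="(.*?)" CONTRACT="(.*?)" SUBSCRIPTION="(.*?)" THRESHOLDID="(.*?)" FORMAT="(.*?)" TEXT="(.*?)"/)
and the same for the else branch.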
Instead of using a single regex, I would split the data into its separate sections first, then approach them separately.
my($usms_part, $request) = split / \s* \|<REQ \s* /x, $Record;
my($usms_id) = $usms_part =~ /^USMS (\d+)$/;
my %request;
while( $request =~ /(\w+)="(.*?)"/g ) {
$request{$1} = $2;
}
Rather than having to hard code all the possible key/value pairs, and their possible orderings, you can parse them generically in one piece of code.
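As a sketch of how the parsed hash could produce the semicolon-separated line asked for in the question: the field list in @wanted and the variable $line are my own illustrative names, and the choice of fields is taken from the desired output shown in the question.

```perl
use strict;
use warnings;

my $Record = 'USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" '
           . 'SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"';

# Split the record into the USMS part and the attribute list.
my ( $usms_part, $request ) = split / \s* \|<REQ \s* /x, $Record;
my ($usms_id) = $usms_part =~ /^USMS (\d+)$/;

# Collect every KEY="value" pair, whatever order they come in.
my %request;
while ( $request =~ /(\w+)="(.*?)"/g ) {
    $request{$1} = $2;
}

# Pick the wanted fields by name with a hash slice.
my @wanted = qw(MSISDN CONTRACT SUBSCRIPTION TEXT);
my $line   = join ';', $usms_id, @request{@wanted};

print "$line\n";    # 1362224754632;00966590832186;580;BBB;L2
```

Because the fields are looked up by name, optional attributes such as THRESHOLDID do not need a second regex.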
Change
(.*?)
to
([a-zA-Z0-9]*)
It looks like all you want is the fields contained in double-quotes.
That looks like this
use strict;
use warnings;
while (<DATA>) {
my @values = /"([^"]+)"/g;
print join(';', @values), "\n";
}
__DATA__
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="AAA" FORMAT="ascii" TEXT="L2"
USMS 1362224754632|<REQ MSISDN="00966590832186" CONTRACT="580" SUBSCRIPTION="BBB" THRESHOLDID="1" FORMAT="ascii" TEXT="L2"
output
00966590832186;580;AAA;ascii;L2
00966590832186;580;BBB;1;ascii;L2

Perl parsing - mixture of chars, tabs and spaces

I have the following types of line in my code:
MMAPI_CLOCK_OUTPUTS = 1, /*clock outputs system*/
MMAPI_SYSTEM_MANAGEMENT = 0, /*sys man system*/
I want to parse them to get:
'MMAPI_CLOCK_OUTPUTS'
'1'
'clock outputs system'
So I tried:
elsif($TheLine =~ /\s*(.*)s*=s*(.*),s*\/*(.*)*\//)
but this doesn't get the last string 'clock outputs system'
What should the parsing code actually be?
You should escape the slashes, the stars and the s for spaces. Instead of writing /, * or s in your regex, write \/, \* and \s (keeping the * quantifier after each \s):
/\s*(.*)\s*=\s*(.*),\s*\/\*(.*)\*\//
if($TheLine =~ m%^(\S+)\s+=\s+(\d+),\s+/\*(.*)\*/%) {
print "$1 $2 $3\n"
}
This uses % as an alternative delimiter in order to avoid leaning toothpick syndrome when you escape the / characters.
Try this regex: /^\s*(.*?)\s*=\s*(\d+),\s*\/\*(.*?)\*\/$/
Here is an example in which you can test it:
#!/usr/bin/perl
use strict;
use warnings;
my $str = "MMAPI_CLOCK_OUTPUTS = 1, /*clock outputs system*/\n
MMAPI_SYSTEM_MANAGEMENT = 0, /*sys man system*/";
while ($str =~ /^\s*(.*?)\s*=\s*(\d+),\s*\/\*(.*?)\*\/$/gm) {
print "$1 $2 $3 \n";
}
# Output:
# MMAPI_CLOCK_OUTPUTS 1 clock outputs system
# MMAPI_SYSTEM_MANAGEMENT 0 sys man system

Extracting specific lines with Perl

I am writing a Perl program to extract the lines between two patterns that I am matching. For example, the text file below has 6 lines. I am matching load balancer and end, and I want to get the 4 lines in between.
**load balancer**
new
old
good
bad
**end**
My question is: how do you extract the lines between load balancer and end into an array? Any help is greatly appreciated.
You can use the flip-flop operator to tell you when you are between the markers. It will also include the actual markers, so you'll need to exclude them from the data collection.
Note that this will mash together all the records if you have several, so in that case you need to store and reset @array after each record.
use strict;
use warnings;
my @array;
while (<DATA>) {
if (/^load balancer$/ .. /^end$/) {
push @array, $_ unless /^(load balancer|end)$/;
}
}
print @array;
__DATA__
load balancer
new
old
good
bad
end
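If the input holds several such blocks, one possible way to keep them apart is to collect each block into its own array reference. This is only a sketch: the names @records and $current and the sample text are my own, and it relies on the flip-flop returning 1 on the start line and a value ending in E0 on the end line.

```perl
use strict;
use warnings;

my $text = <<'END_TEXT';
load balancer
new
old
end
noise
load balancer
good
bad
end
END_TEXT

my @records;    # one array reference per load balancer ... end block
my $current;

for my $line ( split /\n/, $text ) {
    my $range = ( $line =~ /^load balancer$/ ) .. ( $line =~ /^end$/ );
    next unless $range;                 # outside any block

    if    ( $range eq '1' )    { $current = [] }               # start marker
    elsif ( $range =~ /E0$/ )  { push @records, $current }     # end marker
    else                       { push @$current, $line }       # payload line
}

print join( ',', @$_ ), "\n" for @records;
# new,old
# good,bad
```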
You can use the flip-flop operator.
Additionally, you can use the return value of the flip-flop to filter out the boundary lines. The return value is a sequence number (starting with 1), and the last number has the string E0 appended to it.
# Define the marker regexes separately, cuz they're ugly and it's easier
# to read them outside the logic of the loop.
my $start_marker = qr{^ \s* \*\*load \s balancer\*\* \s* $}x;
my $end_marker = qr{^ \s* \*\*end\*\* \s* $}x;
while( <DATA> ) {
# False until the first regex is true.
# Then it's true until the second regex is true.
next unless my $range = /$start_marker/ .. /$end_marker/;
# Flip-flop likes to work with $_, but it's bad form to
# continue to use $_
my $line = $_;
print $line if $range !~ /^1$|E/;
}
__END__
foo
bar
**load balancer**
new
old
good
bad
**end**
baz
biff
Outputs:
new
old
good
bad
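To see the sequence numbers themselves, this small sketch records what the flip-flop returns for each item. The words START and END stand in for the real markers and are purely illustrative.

```perl
use strict;
use warnings;

my @trace;
for my $line (qw(foo START new old END baz)) {
    # Scalar context makes .. a flip-flop rather than a list range.
    my $range = ( $line eq 'START' ) .. ( $line eq 'END' );
    push @trace, "$line=[$range]";
}

print join( ' ', @trace ), "\n";
# foo=[] START=[1] new=[2] old=[3] END=[4E0] baz=[]
```

The empty value outside the range is what makes `next unless $range` work, and the 1 and E0 values are what the `/^1$|E/` test filters out.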
If you prefer a command line variation:
perl -ne 'print if m{\*load balancer\*}..m{\*end\*} and !m{\*load|\*end}' file
For files like this, I often use a change in the Record Separator ( $/ or $RS from English )
use English qw<$RS>;
local $RS = "\nend\n";
my $record = <$open_handle>;
When you chomp it, you get rid of that line.
chomp( $record );
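A self-contained sketch of the record-separator idea, using an in-memory filehandle in place of a real file. It assumes the data starts with the load balancer line, which is then dropped with shift; the names $text, $fh and @lines are illustrative.

```perl
use strict;
use warnings;

my $text = "load balancer\nnew\nold\ngood\nbad\nend\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @lines;
{
    local $/ = "\nend\n";    # a "line" now ends at the end marker
    my $record = <$fh>;
    chomp $record;           # chomp removes the current value of $/
    @lines = split /\n/, $record;
    shift @lines;            # drop the "load balancer" marker line
}

print "$_\n" for @lines;     # new, old, good, bad on separate lines
```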

How can I detect symbols using a regular expression in Perl?

How can I use a regular expression to check whether a word starts or ends with a symbol character, and how can I process the text within the symbols?
Example:
(text) or te-xt, or tex't. or text?
change it to
(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?
help me out?
Thanks
I assume that "word" means alphanumeric characters from your example? If you have a list of permitted characters which constitute a valid word, then this is enough:
my $string = "x1 .text1; 'text2 \"text3;\"";
$string =~ s/([a-zA-Z0-9]+)/<t>$1<\/t>/g;
# Add more to character class [a-zA-Z0-9] if needed
print "$string\n";
# OUTPUT: <t>x1</t> .<t>text1</t>; '<t>text2</t> "<t>text3</t>;"
UPDATE
Based on your example you seem to want to DELETE dashes and apostrophes. If you want to delete them globally (i.e. whether they are inside a word or not), do this before the first regex:
$string =~ s/['-]//g;
I am using DVK's approach here, but with a slight modification. The difference is that her/his code would also put the tags around all words that don't contain/are next to a symbol, which (according to the example given in the question) is not desired.
#!/usr/bin/perl
use strict;
use warnings;
sub modify {
my $input = shift;
my $text_char = 'a-zA-Z0-9\-\''; # characters that are considered text
# if there is no symbol, don't change anything
if ($input =~ /^[a-zA-Z0-9]+$/) {
return $input;
}
else {
$input =~ s/([$text_char]+)/<t>$1<\/t>/g;
return $input;
}
}
my $initial_string = "(text) or te-xt, or tex't. or text?";
my $expected_string = "(<t>text</t>) or <t>te-xt</t>, or <t>tex't</t>. or <t>text</t>?";
# version BEFORE edit 1:
#my @aux;
# take the initial string apart and process it one word at a time
#my @string_list = split/\s+/, $initial_string;
#
#foreach my $string (@string_list) {
# $string = modify($string);
# push @aux, $string;
#}
#
# put the string together again
#my $final_string = join(' ', @aux);
# ************ EDIT 1 version ************
my $final_string = join ' ', map { modify($_) } split/\s+/, $initial_string;
if ($final_string eq $expected_string) {
print "it worked\n";
}
This strikes me as a somewhat long-winded way of doing it, but it seemed quicker than drawing up a more sophisticated regex...
EDIT 1: I have incorporated the changes suggested by DVK (using map instead of foreach). Now the syntax highlighting is looking even worse than before; I hope it doesn't obscure anything...
This reads standard input, processes it, and prints the result to standard output.
while (<>) {
s {
( [a-zA-Z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;
print ;
}
You might need to change the character class to match your concept of a word.
I have used the x modifier to allow the regex to be spread over more than one line.
If the input is in a Perl variable, try
$string =~ s{
( [a-zA-Z]+ ) # word
(?= [,.)?] ) # a symbol
}
{<t>$1</t>}gx ;