Why wrong output from the RegEx? - perl

When I run the script below, I get
$VAR1 = [
'ok0.ok]][[file:ok1.ok',
undef,
undef,
'ok2.ok|dgdfg]][[file:ok3.ok',
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef,
undef
];
where I was hoping for ok0.ok ok1.ok ok2.ok ok3.ok and ideally also ok4.ok ok5.ok ok6.ok ok7.ok
Question
Can anyone see what I am doing wrong?
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $html = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";
my @seen = ($html =~ /file:(.*?) |\||\]/g);
print Dumper \@seen;

A negated character class can simplify things a bit, I think. Be explicit as to your anchors (file:, or media:), and explicit as to what terminates the sequence (a space, pipe, or closing bracket). Then capture.
my @seen = $html =~ m{(?:file|media):([^\|\s\]]+)}g;
Explained:
my @seen = $html =~ m{
(?:file|media): # Match either 'file' or 'media', don't capture, ':'
( [^\|\s\]]+ ) # Match and capture one or more, anything except |\s]
}gx;
Capturing stops as soon as ], |, or \s is encountered.

It looks like you are trying to match everything starting with file: and ending with either a space, a pipe or a closing square bracket.
Your OR-statement at the end of the regexp needs to be between (square) brackets itself though:
my @seen = ($html =~ /file:(.*?)[] |]/g);
If you want the media: blocks as well, OR the file part. You might want a non-capturing group here:
my @seen = ($html =~ /(?:file|media):(.*?)[] |]/g);
How it works
The first statement will capture everything between 'file:' and a ], a | or a space.
The second statement does the same, but with both file and media. We use a non-capturing group (?:group) instead of (group) so the word is not put into your @seen.
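To see why the original pattern returns so many undef entries, it can help to run both patterns side by side. The following is only a sketch on a shortened input string (made up for illustration, not the one from the question). Without brackets, the pattern is read as three top-level alternatives, and only the first one contains a capture group, so every match of a bare | or ] contributes an undef.

```perl
use strict;
use warnings;

# Shortened sample input, assumed for this sketch only.
my $html = '[[file:ok0.ok]][[file:ok1.ok ]] [[media:ok4.ok|x]]';

# Original pattern: parsed as  file:(.*?)<space>  OR  \|  OR  \]
# Only the first alternative captures anything.
my @broken = $html =~ /file:(.*?) |\||\]/g;

# Terminators grouped in a character class, prefix in a non-capturing group.
my @fixed = $html =~ /(?:file|media):(.*?)[] |]/g;

my $undef_count = grep { !defined } @broken;
print "original pattern: ", scalar(@broken), " items, $undef_count of them undef\n";
print "fixed pattern: @fixed\n";
```

The first element of @broken runs across two links because the lazy .*? can only stop at a space in that alternative.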

Try with
my @seen = ($html =~ /\[\[\w+:(\w+\.\w+)\s*(?:\|[^\]]*)?\]\]/g);

this is what your regex does:
...
my $ss = qr {
file: # start with file + colon as anchor
( # start capture group
.*? # use any character in a non-greedy sweep
) # end capture group
\s # end non-greedy search on a **white space**
| # OR: everything up to here, alternated with:
\| # => | character
| # OR: everything up to here, alternated with:
\] # => ] character
}x;
my @seen = $html =~ /$ss/g;
...
and this is what your regex is supposed to do:
...
my $rb = qr {
\w : # alphanumeric + colon as front anchor
( # start capture group
[^]| ]+ # one or more characters up to the terminating ], | or space
) # end capture group
}x;
my @seen = $html =~ /$rb/g;
...
If you want a short, concise regex and know what you are doing, you could drop the capturing group altogether and use the full match in list context together with a positive lookbehind:
...
my @seen = $html =~ /(?<=(?:.file|media):)[^] |]+/g; # no capture group ()
...
or, if there is no other structure in your data than what is shown, use the : as the only anchor:
...
my @seen = $html =~ /(?<=:)[^] |]+/g; # no capture group and short
...
Regards
rbo

Depending on the possible characters in the file name, I think you probably want
my @seen = $html =~ /(?:file|media):([\w.]+)/g;
which captures all of ok0.ok through to ok7.ok.
It relies on the file names containing alphanumeric characters plus underscore and dot.

I hope this is what you required.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $string = "sdfgdfg[[file:ok0.ok]][[file:ok1.ok ]] [[file:ok2.ok|dgdfg]][[file:ok3.ok |dfgdfgg]] [[media:ok4.ok]] [[media:ok5.ok ]] [[media:ok6.ok|dgdfg]] [[media:ok7.ok |dfgdfgg]]ggg";
my @matches;
@matches = $string =~ m/ok\d\.ok/g;
print Dumper @matches;
Output:
$VAR1 = 'ok0.ok';
$VAR2 = 'ok1.ok';
$VAR3 = 'ok2.ok';
$VAR4 = 'ok3.ok';
$VAR5 = 'ok4.ok';
$VAR6 = 'ok5.ok';
$VAR7 = 'ok6.ok';
$VAR8 = 'ok7.ok';
Regards,
Kiran.

Related

Bug with parsing by Text::CSV_XS?

Tried to use Text::CSV_XS to parse some logs. However, the following code doesn't do what I expected -- split the line into pieces according to separator " ".
The funny thing is, if I remove the double quote in the string $a, then it will do splitting.
Wonder if it's a bug or I missed something. Thanks!
use Text::CSV_XS;
$a = 'id=firewall time="2010-05-09 16:07:21 UTC"';
$userDefinedSeparator = Text::CSV_XS->new({sep_char => " "});
print "$userDefinedSeparator\n";
$userDefinedSeparator->parse($a);
my $e;
foreach $e ($userDefinedSeparator->fields) {
print $e, "\n";
}
EDIT:
In the above code snippet, if I change the = (after time) to a space, then it works fine. I have started to wonder whether this is a bug after all.
$a = 'id=firewall time "2010-05-09 16:07:21 UTC"';
You have confused the module by leaving both the quote character and the escape character set to double quote ", and then left them embedded in the fields you want to split.
Disable both quote_char and escape_char, like this
use strict;
use warnings;
use Text::CSV_XS;
my $string = 'id=firewall time="2010-05-09 16:07:21 UTC"';
my $space_sep = Text::CSV_XS->new({
sep_char => ' ',
quote_char => undef,
escape_char => undef,
});
$space_sep->parse($string);
for my $field ($space_sep->fields) {
print "$field\n";
}
output
id=firewall
time="2010-05-09
16:07:21
UTC"
But note that you have achieved exactly the same thing as print "$_\n" for split ' ', $string, which is to be preferred as it is both more efficient and more concise.
In addition, you must always use strict and use warnings; and never use $a or $b as variable names, both because they are used by sort and because they are meaningless and undescriptive.
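As a minimal sketch of the split alternative mentioned above, using the same input string (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $string = 'id=firewall time="2010-05-09 16:07:21 UTC"';

# split ' ' is the special "awk mode": it splits on runs of whitespace
# and ignores leading whitespace. Quotes get no special treatment.
my @fields = split ' ', $string;

print "$_\n" for @fields;
```

Like the Text::CSV_XS version with quoting disabled, this breaks the quoted timestamp into separate pieces, so it only helps when that is really what is wanted.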
Update
As @ThisSuitIsBlackNot points out, your intention is probably not to split on spaces but to extract a series of key=value pairs. If so then this method puts the values straight into a hash.
use strict;
use warnings;
my $string = 'id=firewall time="2010-05-09 16:07:21 UTC"';
my %data = $string =~ / ([^=\s]+) \s* = \s* ( "[^"]*" | [^"\s]+ ) /xg;
use Data::Dump;
dd \%data;
output
{ id => "firewall", time => "\"2010-05-09 16:07:21 UTC\"" }
Update
This program will extract the two name=value strings and print them on separate lines.
use strict;
use warnings;
my $string = 'id=firewall time="2010-05-09 16:07:21 UTC"';
my @fields = $string =~ / (?: "[^"]*" | \S )+ /xg;
print "$_\n" for @fields;
output
id=firewall
time="2010-05-09 16:07:21 UTC"
If you are not actually trying to parse csv data, you can get the time field by using Text::ParseWords, which is a core module in Perl 5. The benefit to using this module is that it handles quotes very well.
use strict;
use warnings;
use Data::Dumper;
use Text::ParseWords;
my $str = 'id=firewall time="2010-05-09 16:07:21 UTC"';
my @fields = quotewords(' ', 0, $str);
print Dumper \@fields;
my %hash = map split(/=/, $_, 2), @fields;
print Dumper \%hash;
Output:
$VAR1 = [
'id=firewall',
'time=2010-05-09 16:07:21 UTC'
];
$VAR1 = {
'time' => '2010-05-09 16:07:21 UTC',
'id' => 'firewall'
};
I also included how you can make the data more accessible by adding it to a hash. Note that hashes cannot contain duplicate keys, so you need a new hash for each new time key.

How to skip splitting for some part of the line

Say I have a line lead=george wife=jane "his boy"=elroy. I want to split on spaces, but without splitting the "his boy" part. It should be considered as one.
A normal split also splits "his boy", taking "his" as one part and "boy" as a second. How can I avoid this?
Following this i tried
split " ", $_
Just came to know that this will work
use strict; use warnings;
my $string = q(hi my name is 'john doe');
my @parts = $string =~ /'.*?'|\S+/g;
print map { "$_\n" } @parts;
But it does not look good. Is there any other simple way with split itself?
You could use Text::ParseWords for this
use Text::ParseWords;
$list = "lead=george wife=jane \"his boy\"=elroy";
@words = quotewords('\s+', 0, $list);
$i = 0;
foreach (@words) {
print "$i: <$_>\n";
$i++;
}
output:
0: <lead=george>
1: <wife=jane>
2: <his boy=elroy>
sub split_space {
my ( $text ) = @_;
while (
$text =~ m/
( # group ($1)
\"([^\"]+)\" # first try find something in quotes ($2)
|
(\S+?) # else minimal non-whitespace run ($3)
)
=
(\S+) # then maximum non-whitespace run ($4)
/xg
) {
my $key = defined($2) ? $2 : $3;
my $value = $4;
print( "key=$key; value=$value\n" );
}
}
split_space( 'lead=george wife=jane "his boy"=elroy' );
Outputs:
key=lead; value=george
key=wife; value=jane
key=his boy; value=elroy
PP posted a good solution. But just to show that there is another neat way to do it, here is my solution:
my $string = q~lead=george wife=jane "his boy"=elroy~;
my @split = split / (?=")/,$string;
my @split2;
foreach my $sp (@split) {
if ($sp !~ /"/) {
push @split2, $_ foreach split / /, $sp;
} else {
push @split2,$sp;
}
}
use Data::Dumper;
print Dumper @split2;
Output:
$VAR1 = 'lead=george';
$VAR2 = 'wife=jane';
$VAR3 = '"his boy"=elroy';
I use a lookahead here to first split off the parts whose keys are inside quotes " ". After that, I loop through the resulting array and split all the other parts, which are normal key=value pairs.
You can get the required result using a single regexp, which extract the keys and the values and put the result inside a hash table.
(\w+|"[\w ]+") will match both a single and multiple word in the key side.
The regexp captures only the key and the value, so the result of the match operation will be a list with the following content: key #1, value #1, key #2, value#2, etc.
The hash is automatically initiated with the appropriate keys and values, when the match result is assigned to it.
here is the code
my $str = 'lead=george wife=jane "hello boy"=bye hello=world';
my %hash = ($str =~ m/(?:(\w+|"[\w ]+")=(\w+)(?:\s|$))/g);
## outputs the hash content
foreach my $key (keys %hash) {
print "$key => $hash{$key}\n";
}
and here is the output of this script
lead => george
wife => jane
hello => world
"hello boy" => bye

Matching in Perl

I am trying to get text in between two dots of a line, but my program returns the entire line.
For example: I have text which looks like:
My sampledata 1,2 for perl .version 1_1.
I used the following match statement
$x =~ m/(\.)(.*)(\.)/;
My output for $x should be version 1_1, but I am getting the entire line as my match.
In your code, the value of $x will not change after the match.
When $x is successfully matched with m/(\.)(.*)(\.)/, your three capture groups will contain '.', 'version 1_1' and '.' respectively (in the order given). $2 will give you 'version 1_1'.
Since you probably only want the part 'version 1_1', you need not capture the two dots. This code will give you the same result:
$x =~ m/\.(.*)\./;
print $1;
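A small sketch of the point made above, that a match does not modify $x and the text between the dots has to be read from the capture group. The name $between is only for illustration:

```perl
use strict;
use warnings;

my $x = 'My sampledata 1,2 for perl .version 1_1.';

# In list context a match returns its capture groups,
# so the first capture can be assigned directly.
my ($between) = $x =~ m/\.(.*)\./;

print "x is unchanged: $x\n";
print "captured: $between\n" if defined $between;
```

Checking that $between is defined before using it avoids relying on a stale $1 from some earlier successful match.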
Try this:
my $str = "My sampledata 1,2 for perl .version 1_1.";
$str =~ /\.\K[^.]+(?=\.)/;
print $&;
The period must be escaped outside a character class.
\K resets all that has been matched before (you can replace it by a lookbehind (?<=\.))
[^.] means any character except a period.
For several results, you can do this:
my $str = "qwerty .target 1.target 2.target 3.";
my @matches = ($str =~ /\.\K[^.]+(?=\.)/g);
print join("\n", @matches);
If you don't want the same period to be used twice (as the end of one match and the start of the next), you can do this:
my $str = "qwerty .target 1.target 2.target 3.";
my @matches = ($str =~ /\.([^.]+)\./g);
print join("\n", @matches)."\n";
It should be simple enough to do something like this:
#!/usr/bin/perl
use warnings;
use strict;
my @tests = (
"test one. get some stuff. extra",
"stuff with only one dot.",
"another test line.capture this. whatever",
"last test . some data you want.",
"stuff with only no dots",
);
for my $test (@tests) {
# For this example, I skip $test if the match fails,
# otherwise, I move on and do stuff with $want
next if $test !~ /\.(.*)\./;
my $want = $1;
print "got: $want\n";
}
Output
$ ./test.pl
got: get some stuff
got: capture this
got: some data you want

Deparsing/Decomposing - step-by-step this obfuscated perl script

As the title says: can anyone please explain how the following scripts work?
This one prints the text "Perl guys are smart":
''=~('(?{'.('])@@^{'^'-[).*[').'"'.('-[)@{:__({:)[{(-:)^}'^'}>[,[]*&[[[[>[[@[[*_').',$/})')
This one prints only "b":
use strict;
use warnings;
''=~('(?{'.('_/).+{'^'/]@@_[').'"'.('=^'^'_|').',$/})')
the perl -MO=Deparse shows only this:
use warnings;
use strict 'refs';
'' =~ m[(?{print "b",$/})];
but I have no idea why... ;(
What is the recommended way of decomposing scripts like this? Where do I start?
So, I tried this:
'' =~
(
'(?{'
.
(
'])@@^{' ^ '-[).*['
)
.
'"'
.
(
'-[)@{:__({:)[{(-:)^}' ^ '}>[,[]*&[[[[>[[@[[*_'
)
.
',$/})'
)
Several parts are concatenated by the . operator, and the result of the bitwise ^ probably gives the text parts. The command:
perl -e "print '-[)@{:__({:)[{(-:)^}' ^ '}>[,[]*&[[[[>[[@[[*_'"
prints "Perl guys are smart", and the first ^ generates "print".
But when I rewrite it to:
'' =~
(
'(?{'
.
(
'print'
)
.
'"'
.
(
'Perl guys are smart'
)
.
',$/})'
)
My perl told me:
panic: top_env
Strange, this is the first time I have seen an error message like that...
Does that mean it isn't allowed to replace 'str1' ^ 'str2' with its result? I don't understand why, or why perl prints the panic message.
my perl:
This is perl 5, version 12, subversion 4 (v5.12.4) built for darwin-multi-2level
Ps: examples are generated here
In the line
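.('_/).+{' ^ '/]@@_[
each character of the left string is XORed with the character at the same position in the right string. For example, '_' ^ '/' gives the letter p (as does ']' ^ '-' in your first script). ^ applied to two strings is a bitwise string operation, so working through the strings letter by letter gives the result string.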
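The string XOR can be checked in isolation. A minimal sketch, using the operands from the first script in the question (the variable names are only for illustration):

```perl
use strict;
use warnings;

# When both operands are strings, ^ works character by character:
# ord(']') is 0x5D and ord('-') is 0x2D, and 0x5D ^ 0x2D is 0x70, which is 'p'.
my $one = ']' ^ '-';

# The same operation applied to whole strings decodes the hidden keyword.
my $word = '])@@^{' ^ '-[).*[';

print "single character: $one\n";
print "whole string: [$word]\n";
```

Printing each XOR result separately like this is one way to start decomposing such a script before looking at what the assembled pattern does.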
Check my script, it works like your example. I hope it will help you.
use v5.14;
# actually we obfuscated print and your word + "
# it looks like this: (print).'"'.(your_word")
my $print = 'print';
my $string = 'special for stackoverflow by fxzuz"';
my $left = get_obfuscated($print);
my $right = get_obfuscated($string);
# prepare result regexp
my $result = "'' =~ ('(?{'.($left).'\"'.($right).',\$/})');";
say 'result obfuscated ' . $result;
eval $result;
sub get_obfuscated {
my $string = shift;
my @letters = split //, $string;
# all symbols like :,&? etc (exclude ' and \)
# we use them for obfuscation
my @array = (32..38, 40..47, 58..64, 91, 93..95, 123..126);
my $left_str = '';
my $right_str = '';
# obfuscated letter by letter
for my $letter (@letters) {
my @result;
# get right xor letters
for my $symbol (@array) {
# prepare xor results
my $result = ord $letter ^ $symbol;
push @result, { left => $result, right => $symbol } if $result ~~ @array;
}
my $rand_elem = $result[rand $#result];
$left_str .= chr $rand_elem->{left};
$right_str .= chr $rand_elem->{right};
}
my $obfuscated = "'$left_str' ^ '$right_str'";
say "$string => $obfuscated";
return $obfuscated;
}
The trick to understanding what's going on here is to look at the string being constructed by the XORs and concatenations:
(?{print "Perl guys are smart",$/})
This is an experimental regular expression feature of the form (?{ code }). So what you see printed to the terminal is the result of
print "Perl guys are smart",$/
being invoked by ''=~.... $/ is Perl's input record separator, which by default is a newline.
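To see the (?{ code }) construct without the obfuscation, here is a minimal sketch. The variable $ran is a name chosen for this example; the code block is written literally in the pattern, so no run-time interpolation is involved:

```perl
use strict;
use warnings;

our $ran = 0;    # package variable, set from inside the pattern

# The pattern matches the empty string; the embedded block
# runs as a side effect of the match attempt.
my $matched = '' =~ /(?{ $ran = 1 })/;

print "match succeeded\n" if $matched;
print "code block ran\n"  if $ran;
```

The obfuscated scripts work the same way, except that the pattern text is assembled from XORed string constants instead of being written out.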

perl regex warning: \1 better written as $1 at (eval 1) line 1

use strict;
use warnings;
my $newPasswd = 'abc123';
my #lines = ( "pwd = abc", "pwd=abc", "password=def", "name= Mike" );
my %passwordMap = (
'pwd(\\s*)=.*' => 'pwd\\1= $newPasswd',
'password(\\s*)=.*' => 'password\\1= $newPasswd',
);
print "#lines\n";
foreach my $line (#lines) {
while ( my ( $key, $value ) = each(%passwordMap) ) {
if ( $line =~ /$key/ ) {
my $cmdStr = "\$line =~ s/$key/$value/";
print "$cmdStr\n";
eval($cmdStr);
last;
}
}
}
print "#lines";
run it will give me the correct results:
pwd = abc pwd=abc password=def name= Mike
$line =~ s/pwd(\s*)=.*/pwd\1= $newPasswd/
\1 better written as $1 at (eval 2) line 1 (#1)
$line =~ s/password(\s*)=.*/password\1= $newPasswd/
\1 better written as $1 at (eval 3) line 1 (#1)
pwd = abc123 pwd=abc password= abc123 name= Mike
I don't want to see the warnings, tried to use $1 instead of \1, but it does not work. What should I do? Thanks a lot.
\1 is a regex pattern that means "match what was captured by the first set of capturing parens." It makes absolutely no sense to use that in a replacement expression. To get the string captured by the first set of capturing parens, use $1.
$line =~ s/pwd(\s*)=.*/pwd\1= $newPasswd/
should be
$line =~ s/pwd(\s*)=.*/pwd$1= $newPasswd/
so
'pwd(\\s*)=.*' => 'pwd\\1= $newPasswd',
'password(\\s*)=.*' => 'password\\1= $newPasswd',
should be
'pwd(\\s*)=.*' => 'pwd$1= $newPasswd',
'password(\\s*)=.*' => 'password$1= $newPasswd',
or better yet
qr/((?:pwd|password)\s*)=.*/ => '$1= $newPasswd',
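The difference between \1 and $1 can also be shown without the eval. This is only a sketch: the ^ anchor and the second capture group are assumptions added here, not part of the code above.

```perl
use strict;
use warnings;

my $newPasswd = 'abc123';
my @lines = ( "pwd = abc", "pwd=abc", "password=def", "name= Mike" );

# \1 belongs inside the pattern: here it means "the same character again".
my $doubled = 'aa' =~ /(a)\1/ ? 1 : 0;

# $1 and $2 belong in the replacement. The ^ anchor is an assumption
# added for this sketch so that only keys at the start of a line match.
for my $line (@lines) {
    $line =~ s/^(pwd|password)(\s*)=.*/$1$2= $newPasswd/;
}

print "backreference matched: $doubled\n";
print "$_\n" for @lines;
```

Because the substitution is written directly rather than built as a string and passed to eval, the "\1 better written as $1" warning has no opportunity to appear.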
I see a lot of repetition in your code.
Assuming you're using Perl 5.10 or later, this is how I would have written your code.
use strict;
use warnings;
use 5.010;
my $new_pass = 'abc123';
my @lines = ( "pwd = abc", "pwd=abc", "password=def", "name= Mike" );
my @match = qw'pwd password';
my $match = '(?:'.join( '|', @match ).')';
say for @lines;
say '';
s/$match \s* = \K .* /$new_pass/x for @lines;
# which is essentially the same as:
# s/($match \s* =) .* /$1$new_pass/x for @lines;
say for @lines;
Assuming that the pattern of your pattern matching map stays the same, why not get rid of it and simply say:
$line =~ s/\s*=.*/=$newPasswd/