Parsing HTML-attributes like strings - perl

I have strings that look like HTML attributes:
key1="value1 value2" key2="va3" key4
I need to parse such a string into a hash of arrays (HoA):
$parsed = {
key1 => [
'value1',
'value2'
],
key2 => [ 'val3' ], #or key2 => 'val3' doesn't matter..
key4 => undef,
};
I tried writing the parsing code myself:
#!/usr/bin/env perl
use 5.014;
use strict;
use warnings;
use Data::Dumper;
while(<DATA>) {
my $parsed;
chomp;
next if m/\A\s*#/;
while( m/(\w+)(\s*=\s*(["'])(.*?)(\3))?/g ) {
my $k = $1;
if( $4 ) {
my @v = split(/\s+/, $4);
$parsed->{$k} = \@v;
}
else {
$parsed->{$k} = undef;
}
}
say Dumper $parsed;
}
__DATA__
key1="value1 value2" key2 key3="val3"
key1='value1 "value2"' key8 key3='val3'
key1='value1 i\'m' key2 key3="val3"
key1='value1 value2' key8 key3=val3
This works and prints the correct results for the first two lines:
$VAR1 = {
'key1' => [
'value1',
'value2'
],
'key3' => [
'val3'
],
'key2' => undef
};
$VAR1 = {
'key1' => [
'value1',
'"value2"'
],
'key3' => [
'val3'
],
'key8' => undef
};
Unfortunately, it fails on the third line: I don't know how to handle the escaped quotes. (I also just realized that key=val, without quotes, is valid too.)
Additionally, I don't want to reinvent the wheel. A CPAN module for this probably exists, but I have no idea what to search for. ;(
EDIT
@mpapec suggested a module that could greatly help with parsing the RHS part of the "assignment". My problem is that the string contains multiple space-delimited LHS=RHS pairs, where the RHS can be quoted (with single or double quotes) or unquoted (in the case of a single value), and the values inside the quotes are space-delimited too:
key1="value1 value2" key2="va3" key4 key5=val5 key6='val6' key7='val x\'y zzz'
So I don't know how to break the string into its LHS=RHS parts: I can't split on spaces, and I can't use my regex because it fails on escaped quotes. (Maybe a more complicated regex that handles escapes could work.)
Any suggestions, please?

You can use Text::ParseWords as mpapec suggested:
use strict;
use warnings;
use 5.010;
use Data::Dumper;
use Text::ParseWords;
$Data::Dumper::Sortkeys = 1;
my $string = q{key1="value1 value2" key2="va3" key4 key5=val5 key6='val6' key7='val x\'y zzz'};
my @words = shellwords $string;
my %parsed;
foreach my $word (@words) {
my ($key, $values) = split /=/, $word, 2;
$parsed{$key} //= [];
# A bare key has no value part; pass '' to avoid an "uninitialized" warning
push @{ $parsed{$key} }, $_ for shellwords( $values // '' );
}
print Dumper \%parsed;
Output:
$VAR1 = {
'key1' => [
'value1',
'value2'
],
'key2' => [
'va3'
],
'key4' => [],
'key5' => [
'val5'
],
'key6' => [
'val6'
],
'key7' => [
'val',
'x\'y',
'zzz'
]
};
Note that for consistency, I assigned keys without values an empty array instead of undef. I think this will make the data structure easier to use.
Also note that I called shellwords twice. I did this to remove the backslashes from escaped quotes, so that
key7='val x\'y zzz'
gets split into
val x'y zzz
instead of
val x\'y zzz
(The backslash in x\'y in the output above is added by Data::Dumper; there's no backslash in the variable itself.)

To fix your current issue, you can set up an alternation that handles backslashes specially.
#!/usr/bin/env perl
use 5.014;
use strict;
use warnings;
use Data::Dump;
my $parsed;
while (<DATA>) {
chomp;
next if m/\A\s*#/;
while (
m{
(\w+)
(?:
\s* = \s*
(["'])
( (?: (?!\2)[^\\] | \\. )* )
\2
)?
}gx
)
{
my $k = $1;
if ($2) {
( my $val = $3 ) =~ s/\\(.)/$1/g; # Unescape backslashes
$parsed->{$k} = [ split /\s+/, $val ]; # Split words
} else {
$parsed->{$k} = undef;
}
}
dd $parsed;
print "\n";
}
__DATA__
key1="value1 value2" key2 key3="val3"
key1='value1 "value2"' key2 key3='val3'
key1='value1 i\'m' key2 key3="val3"
Outputs:
{ key1 => ["value1", "value2"], key2 => undef, key3 => ["val3"] }
{ key1 => ["value1", "\"value2\""], key2 => undef, key3 => ["val3"] }
{ key1 => ["value1", "i'm"], key2 => undef, key3 => ["val3"] }
There are still other issues to take into account, but perhaps this will help you get further along.
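One of the remaining issues mentioned in the question is the unquoted key=val form. A minimal sketch of how the same regex could be extended with a second alternative for bare values; the sub name parse_attrs and the rule that a bare value is any run of characters other than whitespace, quotes and = are assumptions for illustration, not part of the original answer:

```perl
use strict;
use warnings;

# Hypothetical helper: parse one line of attribute-like pairs into a hash ref.
sub parse_attrs {
    my ($line) = @_;
    my %parsed;
    while (
        $line =~ m{
            (\w+)                                  # key
            (?:
                \s* = \s*
                (?:
                    (["'])                         # opening quote
                    ( (?: (?!\2)[^\\] | \\. )* )   # quoted body, escapes allowed
                    \2                             # matching closing quote
                  |
                    ([^\s"'=]+)                    # bare value (assumed rule)
                )
            )?
        }gx
      )
    {
        my ( $key, $quoted, $bare ) = ( $1, $3, $4 );
        if ( defined $quoted ) {
            ( my $val = $quoted ) =~ s/\\(.)/$1/g;    # unescape backslashes
            $parsed{$key} = [ split ' ', $val ];      # split words
        }
        elsif ( defined $bare ) {
            $parsed{$key} = [$bare];
        }
        else {
            $parsed{$key} = undef;                    # key without a value
        }
    }
    return \%parsed;
}

my $r = parse_attrs(q{key4 key5=val5 key7='val x\'y zzz'});
print join( '|', @{ $r->{key7} } ), "\n";
```

Testing the captures with defined rather than for truth means a value such as "0" or an empty quoted string is still treated as present.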

You might consider something based on Parser::MGC. Your examples look like a nice simple loop on
my $key = $self->token_ident;
$self->expect( '=' );
my $value = $self->token_string;

Related

perl: subroutine returns 0 instead of specified array

I have a hash of hashes like this:
my %HoH = (
flintstones => {
1 => "fred",
2 => "barney",
},
jetsons => {
1 => "george",
2 => "jane",
},
simpsons => {
1 => "homer",
2 => "marge",
},
);
My subroutine is meant to search through the values of a specified key, e.g. search all 2s for e and return the value for key 1 in each case.
It works in that it can print those things just fine, and I can also print them to a text file. I also want the same lines to be pushed to an array @output.
Why does my subroutine return zero, which is saved in $hej in this case?
sub search_hash {
# Arguments are
#
# $hash=hash ref
# $parameter1=key no. to search in
# $parameter2=value to find
# $parameter3=name of text file to write to
my ( $hash, $parameter1, $parameter2, $parameter3 ) = @_;
# Loop over the keys in the hash
foreach ( keys %{$hash} ) {
# Get the value for the current key
my $value = $hash->{$_};
my $value2 = $hash->{'1'};
search_hash( $value, $parameter1, $parameter2, $parameter3 );
for my $key ( $parameter1 ) {
my @output; #create array for loop outputs to be saved
if ( $value =~ $parameter2 ) {
push @output, "$value2"; #push lines to array
print "Value: $value\n";
print "Name: $value2\n";
open( my $fh, '>>', $parameter3 );
print $fh ( "$value2\n" );
close $fh;
}
return @output;
}
}
}
my $hej = search_hash( \%HoH, "2", 'e', 'data3.txt' );
print $hej;
Output:
Can't use string ("fred") as a HASH ref while "strict refs" in use
There is no key "1" at the first level of your hash, so this line cannot find a name there, and a recursive subroutine is not a good choice here:
my $value2 = $hash->{'1'};
Borodin's one-line code is great, but it should search the 2s, as the question asks:
search all 2 s for e and return the value for key 1 in each case.
As a summary, search_hash.pl:
use strict;
use warnings;
use utf8;
my %HoH = (
Flintstones => { 1 => "Fred", 2 => "Barney" },
Jetsons => { 1 => "George", 2 => "Jane" },
Simpsons => { 1 => "Homer", 2 => "Marge" }
);
my @output2 = map { $_->{1} } grep { $_->{2} =~ /e/ } values %HoH;
open( my $fh, '>', "data3.txt" ) or die "Cannot open data3.txt: $!";
print $fh ( "$_\n" ) foreach @output2;
close $fh;
And
perl search_hash.pl
cat data3.txt
OUTPUT:
Fred
Homer
George
The return expression of a subroutine is evaluated in the same context as the subroutine call itself. Since you're assigning the result of the subroutine to a scalar, the subroutine is called in scalar context, and @output is evaluated in scalar context. In scalar context, an array returns the number of elements it contains. In this case, @output happened to be empty, so search_hash returned zero.
If you want the elements of @output instead of the number of elements in @output, you will need to call the subroutine in list context. Assigning the result to an array is one way of doing that.
This is how I fixed the problem in the rewrite posted below. Note that I replaced the scalar $hej with the array @hej.
I also fixed other problems. The values for key 1 from all three second-level hashes are now returned, because each of them has a value for key 2 that contains e (the value to find). See below.
use strict;
use warnings;
my %HoH = (
Flintstones => { 1 => "Fred", 2 => "Barney" },
Jetsons => { 1 => "George", 2 => "Jane" },
Simpsons => { 1 => "Homer", 2 => "Marge" }
);
sub search_hash {
# Arguments:
# $hash: hash ref
# $search_key: key to search in each 2-nd level hash
# $search_string: value to find
my ( $hash, $search_key, $search_string ) = @_;
my @output;
foreach ( keys %{$hash} ) {
#print "Key: $_\n";
my $hash2 = $hash->{$_}; # 2-nd level hash (reference to)
my $search_val = $hash2->{$search_key}; # Value for key == parameter1
#print "Value: $search_val\n";
if ($search_val =~ /\Q$search_string/) {
my $id = $hash2->{'1'};
#print "Name: $id\n";
push @output, $id;
}
}
return @output;
}
my @hej = search_hash( \%HoH, '2', 'e' );
print "Result: @hej\n";
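The context rule described in this answer can be seen in isolation with a few lines. This is a minimal sketch; the sub names_list and its fixed return values are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sub that returns an array, like search_hash does.
sub names_list {
    my @output = ( 'Fred', 'Homer' );
    return @output;
}

my $count   = names_list();    # scalar context: number of elements
my @all     = names_list();    # list context: the elements themselves
my ($first) = names_list();    # list context: first element, rest discarded

print "$count\n";              # prints 2
print "@all\n";                # prints Fred Homer
print "$first\n";              # prints Fred
```

The parentheses around $first are what turn the assignment into a list assignment; without them the call would be in scalar context again.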

"=~" and "!~" operator makes odd hash count in Perl when passed in a subroutine

I am making a subroutine that accepts an array of hashes that have the same keys with different values. As a requirement, there are values that have conditional operations.
sample:
use strict;
use warnings;
use Data::Dumper;
sub test {
my @data = @_;
print Dumper(@data);
}
test(
{
'value' => 1 == 2
},
{
'value2' => 4 == 4
}
);
output:
$VAR1 = {
'value' => ''
};
$VAR2 = {
'value2' => 1
};
but when I use =~ or !~ operators, the interpreter outputs this error:
Odd number of elements in anonymous hash at ...
test(
{
'value' => 1 == 2
},
{
'value2' => 'a' =~ /b/
}
);
output:
$VAR1 = {
'value' => ''
};
$VAR2 = {
'value2' => undef
};
It seems that for a failed match the hash value comes back as undef, not ''.
I also tried putting undef directly in the hash value, and that works fine.
Question:
Why does Perl behave this way?
What is the best solution for this?
The match operator in list context returns an empty list on failure, so the anonymous hash receives a key with no value (hence the "Odd number of elements" warning). You could use
scalar( 'a' =~ /b/ )
or
'a' =~ /b/ ? 1 : 0
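A short sketch of the difference, using the same failing match as in the question; the variable names are made up for illustration:

```perl
use strict;
use warnings;

# A failed match in list context yields an empty list ...
my @list = ( 'a' =~ /b/ );
print scalar(@list), "\n";    # prints 0

# ... so force scalar context inside the hash constructor.
my $forced  = { value2 => scalar( 'a' =~ /b/ ) };    # defined but false
my $ternary = { value2 => ( 'a' =~ /b/ ? 1 : 0 ) };  # explicit 0

print defined $forced->{value2} ? "defined\n" : "undef\n";    # prints defined
print $ternary->{value2}, "\n";                                # prints 0
```

Both forms keep the key/value pairing intact; the ternary is the better choice if the value must be a literal 1 or 0.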

Get all values in a hash from string after equals sign Perl

I have a string like this: "Test string has tes value like abc="123",bcd="345",or it it can be xyz="4567" and ytr="434"".
Now I want to get the values after the equals signs. The hash structure would be like this:
$hash->{abc} =123,
$hash->{bcd} =345,
$hash->{xyz} =4567,
I have tried this: $str =~ / (\S+) \s* = \s* (\S+) /xg
In list context the regex returns the captured pairs, which can be assigned to a hash (here an anonymous one).
use warnings 'all';
use strict;
use feature 'say';
my $str = 'Test string has tes value like abc="123",bcd="345",or it '
. 'it can be xyz="4567" and ytr="434"';
my $rh = { $str =~ /(\w+)="(\d+)"/g };
say "$_ => $rh->{$_}" for keys %$rh ;
Prints
bcd => 345
abc => 123
ytr => 434
xyz => 4567
Following a comment – for possible spaces around the = sign, change it to \s*=\s*.
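A minimal sketch of that variant; the sample string with spaces around some of the = signs is made up for illustration:

```perl
use strict;
use warnings;

# Assumed input: same key="digits" pairs, some with spaces around '='.
my $spaced = 'abc = "123", bcd="345" and xyz ="4567"';

my %pairs = $spaced =~ /(\w+)\s*=\s*"(\d+)"/g;

# Sort the keys so the output order is stable.
print "$_ => $pairs{$_}\n" for sort keys %pairs;
```

This prints abc => 123, bcd => 345 and xyz => 4567, one per line.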
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $string = q{Test string has tes value like abc="123",bcd="345" and xyz="523"};
my %hash = $string =~ /(\w+)="(\d*)"/g;
print Dumper \%hash;
Output
$VAR1 = {
'xyz' => '523',
'abc' => '123',
'bcd' => '345'
};
Your test string looks like this (editing slightly to fix the quoting problems).
'Test string has tes value like abc="123",bcd="345",or it it can be xyz="4567" and ytr="434"'
I used this code to test your regex:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Data::Dumper;
my $text = 'Test string has tes value like abc="123",bcd="345",or it it can be xyz="4567" and ytr="434"';
my %hash = $text =~ /(\S+)\s*=\s*(\S+)/g;
say Dumper \%hash;
Which gives this output:
$VAR1 = {
'abc="123",bcd' => '"345",or',
'ytr' => '"434"',
'xyz' => '"4567"'
};
The problem is that \S+ matches any run of non-whitespace characters, and that's too much. You need to be more specific about the valid characters.
Your keys all appear to be letters, and your values are all digits, but the values are surrounded by quote characters that you don't want. So try this regex instead: /([a-z]+)\s*=\s*"(\d+)"/g.
That gives:
$VAR1 = {
'bcd' => '345',
'abc' => '123',
'ytr' => '434',
'xyz' => '4567'
};
Which looks correct to me.

Print the word after match from Perl array

The array @array has:
Sex: M
Name: John, oliver
Age is 33
Has no experience
is 5 feet tall
I want to print the text after Name:, which is John, oliver in this case.
The code below works on $string; how do I do the same on an @array?
my ($name) = $string =~ /Name: (.+)$/;
print $name;
You will have to iterate over the array and apply the regex to each element. Something like:
my @array = ( 'Sex: M', 'Name: John, oliver', 'Age: 33' );
foreach my $item ( @array ) {
if( $item =~ /Name: (.+)$/ ) {
print $1;
}
}
I think a hash is a better data structure for storing your data.
To find results in an array, the tool for the job is grep.
foreach my $name_lines ( grep { m/Name/ } @array ) {
my ($name) = $name_lines =~ /Name: (.+)$/; # match against the loop variable, not $_
print $name,"\n";
}
Here I've assumed there might be multiple matches. You don't have to do that, though, and could instead write:
my ($name) = map { m/Name: (.+$)/ } @stuff;
print $name;
This uses map to transform the array, but because we assign the result to a list containing a single scalar, any second match is discarded.
That said, I'd suggest this isn't the best approach: an array of keys and values isn't particularly useful compared to a hash.
If you have an array:
my @stuff = ( "Sex: M",
"Name: John, oliver",
"Age is 33",
"Has no experience",
"is 5 feet tall" );
Then you can transform it with map:
my %stuff_hash = map { /(\w+):? (.*)$/ } @stuff;
Which gives you a data structure looking like this:
$VAR1 = {
'Age' => 'is 33',
'Sex' => 'M',
'is' => '5 feet tall',
'Name' => 'John, oliver',
'Has' => 'no experience'
};
So you can:
print $stuff_hash{'Name'},"\n";
Or alternatively, join your array back together into a string and use a multi-line regex:
my ($name) = join( "\n", @stuff ) =~ m/Name: (.*)$/m;
print $name;
Let the structure fit your data:
my @arr = (
{
Sex => 'M',
Name => 'John Doe',
Age => 21,
},
{
Sex => 'F',
Name => 'Jane Doe',
Age => 20,
},
);
Then you can more easily find the data:
for (@arr) {
print "$_->{Name}\n";
}
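With records stored as hashes, filtering by one field and reading another becomes a grep plus a map. A small sketch under the same layout; the array name @people and the filter on Sex are chosen for illustration:

```perl
use strict;
use warnings;

my @people = (
    { Sex => 'M', Name => 'John Doe', Age => 21 },
    { Sex => 'F', Name => 'Jane Doe', Age => 20 },
);

# Names of all records whose Sex field is 'F'.
my @names = map { $_->{Name} } grep { $_->{Sex} eq 'F' } @people;

print "$_\n" for @names;    # prints Jane Doe
```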

Parsing multiple keyword value pairs in one line - mixed delimiters - Text::Balanced

I have the following file structure (see the DATA section):
#!/usr/bin/perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);
my($ALL, $name, $pairs);
while(defined($name = <DATA>) && defined($pairs = <DATA>)) {
$ALL->{$name} = parse_pairs($pairs);
}
sub parse_pairs {
my $str = shift;
my($extracted, $remainder) = extract_bracketed($str,'{}'); # how to?
}
__DATA__
name1
key1 val1 key2 {val2a val2b} key3 val3
name2
key2 val2 key3 val3
name3
key1 {val1a val1b val1c} key2 {val2a val2b}
That is, every odd line contains a unique "name", and every even line contains multiple space-delimited "key value" pairs.
The key is always one word (\w+).
The value can be:
one string (\S+), or
multiple space-delimited strings enclosed in braces { }
I need to get the above file into a Perl structure, either:
$ALL => {
name1 => {
key1 => ["val1"],
key2 => ["val2a", "val2b"],
key3 => ["val3"]
},
[.......]
or
$ALL => {
name1 => {
key1 => {
val1 => undef,
},
key2 => {
val2a => undef,
val2b => undef,
},
key3 => {
val3 => undef,
}
},
[.......]
This is probably a job for Text::Balanced, but I have no idea how to use it here: the values are mixed (some are simple words, some are balanced, brace-enclosed lists), and I don't know how to repeat the extraction. ;(
I need some hints on how to write the parse_pairs sub in the source above.
Here's what I have. It's not using Text::Balanced, however. It's using Regexp::Common:
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Common;
my($ALL, $name, $pairs);
while(defined($name = <DATA>) && defined($pairs = <DATA>)) {
chomp $name;
chomp $pairs;
$ALL->{$name} = parse_pairs($pairs);
}
use Data::Dump; dd $ALL;
sub parse_pairs {
my $str = shift;
my @key_values = $str =~ /
(\w+) # key
\s*
(\w+|$RE{balanced}{-parens=>'{}'}) # value
\s*/xg;
my $r;
while (@key_values)
{
$key_values[1] =~ s/^\{//;
$key_values[1] =~ s/\}$//;
$r->{$key_values[0]} = [ split /\s+/, $key_values[1] ];
splice @key_values, 0, 2;
}
$r;
}
__DATA__
name1
key1 val1 key2 {val2a val2b} key3 val3
name2
key2 val2 key3 val3
name3
key1 {val1a val1b val1c} key2 {val2a val2b}
This seems to produce the output you're looking for (option 1).
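Since the question asked about Text::Balanced specifically, here is a rough sketch of how parse_pairs might look with extract_bracketed instead of Regexp::Common. It is not the answer's method: the peel-one-pair-per-iteration loop and the decision to stop on unbalanced braces are assumptions made for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

sub parse_pairs {
    my ($str) = @_;
    my %pairs;

    # Peel one "key value" pair off the front of $str per iteration.
    while ( $str =~ s/^\s*(\w+)\s+// ) {
        my $key = $1;
        my @values;
        if ( $str =~ /^\{/ ) {
            # In list context extract_bracketed leaves $str untouched and
            # returns the bracketed text plus the remainder.
            my ( $extracted, $remainder ) = extract_bracketed( $str, '{}' );
            last unless defined $extracted;    # unbalanced braces: give up
            $str = $remainder;
            $extracted =~ s/^\{\s*|\s*\}$//g;  # drop the outer braces
            @values = split /\s+/, $extracted;
        }
        elsif ( $str =~ s/^(\S+)// ) {
            @values = ($1);                    # plain single-word value
        }
        $pairs{$key} = \@values;
    }
    return \%pairs;
}

my $r = parse_pairs("key1 val1 key2 {val2a val2b} key3 val3\n");
print "$_ => @{ $r->{$_} }\n" for sort keys %$r;
```

Text::Balanced ships with Perl, so this variant needs no CPAN install; it also copes with nested braces inside a value, which a simple [^}]* pattern would not.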