Parsing HTML with Mojolicious User Agent - perl

I have HTML something like this:
<h1>My heading</h1>
<p class="class1">
<strong>SOMETHING</strong> INTERESTING (maybe not).
</p>
<div class="mydiv">
<p class="class2">
interesting link </p>
<h2>Some other heading</h2>
The content between h1 and h2 varies. I know I can use CSS selectors in Mojo::DOM to, say, select the content of h1 or h2, or p tags, but how do I select everything between h1 and h2? Or more generally, everything between any two given sets of tags?

It's pretty straightforward. You can just select all interesting elements in a Mojo::Collection object (this is what Mojo::DOM's children method does for example) and do some kind of a state-machine like match while iterating over that collection.
Probably the most magic way to do this is to use Perl's range operator .. in scalar context:
In scalar context, ".." returns a boolean value. The operator is bistable, like a flip-flop, and emulates the line-range (comma) operator of sed, awk, and various editors. Each ".." operator maintains its own boolean state, even across calls to a subroutine that contains it. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, AFTER which the range operator becomes false again. It doesn't become false till the next time the range operator is evaluated.
Here's a simple example:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
# slurp all DATA lines
my $dom = Mojo::DOM->new(do { local $/; <DATA> });
# select all children of <div id="yay"> into a Mojo::Collection
my $yay = $dom->at('#yay')->children;
# select interesting ('..' operator in scalar context: flip-flop)
my $interesting = $yay->grep(sub { my $e = shift;
# tag is the element name (Mojolicious releases before 6.0 called this method type)
$e->tag eq 'h1' .. $e->tag eq 'h2';
});
say $interesting->join("\n");
__DATA__
<div id="yay">
<span>This isn't interesting</span>
<h1>INTERESTING STARTS HERE</h1>
<strong>SOMETHING INTERESTING</strong>
<span>INTERESTING TOO</span>
<h2>END OF INTERESTING</h2>
<span>This isn't interesting</span>
</div>
Output
<h1>INTERESTING STARTS HERE</h1>
<strong>SOMETHING INTERESTING</strong>
<span>INTERESTING TOO</span>
<h2>END OF INTERESTING</h2>
Explanation
So I'm using Mojo::Collection's grep to filter the collection object $yay. Since grep only tests the callback's return value for truth, the .. operator is evaluated in scalar context and acts like a flip-flop. It becomes true when it first sees an h1 element and becomes false again after it first sees an h2 element, so you get all elements between those two headings, including the headings themselves.
Since I think you know some Perl and you can use arbitrary tests together with .. I hope this helps to solve your problem!
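The flip-flop behaviour described above does not depend on Mojolicious. Here is a minimal sketch with core Perl's grep on a plain list; the list contents and the BEGIN/END markers are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical input: only the items from 'BEGIN' through 'END' are wanted.
my @lines = ('skip', 'BEGIN', 'a', 'b', 'END', 'skip too');

# grep evaluates its block in scalar context, so '..' acts as a flip-flop:
# it turns true on 'BEGIN' and stays true up to and including 'END'.
my @between = grep { ($_ eq 'BEGIN') .. ($_ eq 'END') } @lines;

print join(',', @between), "\n";
```

Because the operator keeps its state between evaluations, an input with no closing marker would leave it switched on for the rest of the list.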

Related

How do map and grep work?

I came across this code in a script, can you please explain what map and grep does here?
open FILE, '<', $file or die "Can't open file $file: $!\n";
my @sets = map {
chomp;
$_ =~ m/use (\w+)/;
$1;
}
grep /^use/, ( <FILE> );
close FILE;
The file pointed by $file has:
use set_marvel;
use set_caprion;
and so on...
Despite the fact that your question doesn't show any research effort, I'm going to answer it anyway, because it might be helpful for future readers who come across this page.
According to perldoc, map:
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value composed of the results
of each such evaluation. In scalar context, returns the total number
of elements so generated. Evaluates BLOCK or EXPR in list context, so
each element of LIST may produce zero, one, or more elements in the
returned value.
The definition for grep, on the other hand:
Evaluates the BLOCK or EXPR for each element of LIST (locally setting
$_ to each element) and returns the list value consisting of those
elements for which the expression evaluated to true. In scalar
context, returns the number of times the expression was true.
So they're similar in their input values, their return values, and the fact that they both localize $_.
In your specific code, going from right to left:
<FILE> slurps the lines in the file pointed to by the FILE filehandle and returns a list
In the context of grep, /^use/ looks at each line and returns true for the ones that match the regular expression. The return value of grep, therefore, is a list of lines that start with use.
In the BLOCK of your map (which is only considering lines that passed the earlier grep test):
chomp removes any trailing string from $_ that corresponds to the current value of $/ (i.e., the newline). This is unnecessary, because as you'll see below, \w will never match a newline.
$_ =~ m/use (\w+)/ is a regular expression that looks for use followed by a space, followed by one or more word characters ([0-9a-zA-Z_]) in a capture group. The $_ =~ is redundant, since the match operator m// binds to $_ by default.
$1 is the first matching capture group from the previous expression. Since it's the last expression in the BLOCK, it bubbles up as the return value for each list item that was evaluated.
The end result is stored in an array named @sets, which should contain 'set_marvel', 'set_caprion', etc.
Equivalently, your code could be rewritten without map and grep like this, which may make it easier for you to understand:
my @sets;
while (<FILE>) {
next unless /^use (\w+)/;
push(@sets, $1);
}
The grep takes the <FILE> as input and uses the regular expression ^use to copy all of the lines that start with use into an array that is passed to map.
The map loops through each array entry, puts it in $_, and calls chomp, which operates on $_ implicitly. Then $_ =~ m/use (\w+)/; matches a regular expression against $_ that captures the word after use into $1. Finally, $1 is the last expression in the block, so its value is what ends up in @sets.
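Since map may return zero elements for an input item, the filtering and the extraction can also be folded into a single map. This is only a sketch of that alternative, using an in-memory list (made up here) in place of the file:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines read from the file.
my @lines = ("use set_marvel;\n", "use set_caprion;\n", "my \$x = 1;\n");

# For each line, return the captured word if it matches, or the empty
# list (no element at all) if it does not.
my @sets = map { /^use (\w+)/ ? $1 : () } @lines;

print "$_\n" for @sets;
```

Returning the empty list for non-matching lines is what makes the separate grep unnecessary here.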

Perl dereferencing in non-strict mode

In Perl, if I have:
no strict;
@ARY = (58, 90);
To operate on an element of the array, say the 2nd one, I would write (possibly as part of a larger expression):
$ARY[1] # The most common way found in Perldoc's idioms.
Though, for some reason these also work:
@ARY[1]
@{ARY[1]}
All resulting in the same object:
print (\$ARY[1]);
print (\@ARY[1]);
print (\@{ARY[1]});
Output:
SCALAR(0x9dbcdc)
SCALAR(0x9dbcdc)
SCALAR(0x9dbcdc)
What are the syntax rules that enable this sort of construct? How reliably could one write program code with each of these constructs, or with a mix of them? How interchangeable are these expressions? (Always speaking of a non-strict context.)
To justify how I came to this question: I agree that "use strict" is the better practice, but I'm still interested in how non-strict expressions are built up.
In an attempt to resolve this myself, I came across:
The notion that "no strict;" stops Perl complaining about undeclared variables and quirky syntax.
The prefix dereference having higher precedence than the subscript [] (perldsc § "Caveat on precedence").
The clarification on when to use @ instead of $ (perldata § "Slices").
The lack of a "[]" (array subscript / slice) description among Perl's operators (perlop), which led me to think it is not an operator... (yet it has to be something. But what?)
From what I have learned, none of these hints, even put together, helps me understand my issue.
Thanks in advance.
Quotation from perlfaq4:
What is the difference between $array[1] and @array[1]?
The difference is the sigil, that special character in front of the array name. The $ sigil means "exactly one item", while the @ sigil means "zero or more items". The $ gets you a single scalar, while the @ gets you a list.
Please see: What is the difference between $array[1] and @array[1]?
@ARY[1] is indeed a slice, in fact a slice of only one member. The difference is it creates a list context:
@ar1[0] = qw( a b c ); # List context.
$ar2[0] = qw( a b c ); # Scalar context, the last value is returned.
print "<@ar1> <@ar2>\n";
Output:
<a> <c>
Besides using strict, turn warnings on, too. You'll get the following warning:
Scalar value @ar1[0] better written as $ar1[0]
In perlop, you can read that "Perl's prefix dereferencing operators are typed: $, @, %, and &." The standard syntax is SIGIL { ... }, but in the simple cases, the curly braces can be omitted.
See Can you use string as a HASH ref while "strict refs" in use? for some fun with no strict refs and its emulation under strict.
Extending choroba's answer, to check a particular context, you can use wantarray
sub context { return wantarray ? "LIST" : "SCALAR" }
print $ary1[0] = context(), "\n";
print @ary1[0] = context(), "\n";
Outputs:
SCALAR
LIST
Nothing you did requires no strict; other than to hide your error of doing
@ARY = (58, 90);
when you should have done
my @ARY = (58, 90);
The following returns a single element of the array. Since EXPR is to return a single index, it is evaluated in scalar context.
$array[EXPR]
e.g.
my @array = qw( a b c d );
my $index = 2;
my $ele = $array[$index]; # my $ele = 'c';
The following returns the elements identified by LIST. Since LIST is to return 0 or more elements, it must be evaluated in list context.
@array[LIST]
e.g.
my @array = qw( a b c d );
my @indexes = ( 1, 2 );
my @slice = @array[@indexes]; # my @slice = qw( b c );
\( $ARY[$index] ) # Returns a ref to the element returned by $ARY[$index]
\( @ARY[@indexes] ) # Returns refs to each element returned by @ARY[@indexes]
${foo} # Weird way of writing $foo. Useful in literals, e.g. "${foo}bar"
@{foo} # Weird way of writing @foo. Useful in literals, e.g. "@{foo}bar"
${foo}[...] # Weird way of writing $foo[...].
Most people don't even know you can use these outside of string literals.
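To make the difference between an element access, a slice, and references to slice elements concrete, here is a small sketch under use strict; the variable names are made up for illustration:

```perl
use strict;
use warnings;

my @array = qw( a b c d );

my $element = $array[2];          # single element: 'c'
my @slice   = @array[ 1, 2 ];     # slice, copied into a new array: ('b', 'c')

# \( LIST ) yields one reference per element of the slice.
my @refs = \( @array[ 1, 2 ] );

# Writing through the first reference changes the original array element,
# while the earlier copy in @slice is unaffected.
${ $refs[0] } = 'B';

print "$element @slice $array[1]\n";
```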

Perl Hashes and regex

I am working on code that splits a sentence into individual words and then looks each word up among the keys of a hash. My code finds terms that are 100% identical to a key; after a match I tag the word from the sentence with the value that corresponds to the matching key. The problem is that the code tags terms with seemingly random values rather than the ones I expect. Also, there are situations where the term and the hash key are similar but not 100% identical.
How can I write a regular expression to match my terms with the keys?
Note: I have stemmed the hash keys to their root forms.
I can provide an example: if the term from the sentence is Synergistic or anti-synergistic, and my hash key is Synerg, how can I match the above term with Synerg?
My code is as follows:
open IN, "C:\\Users\\Desktop\\TM\\clean_cells.txt" or die "import file absent";
my %hash=();
use Tie::IxHash;
tie %hash => "Tie::IxHash";
while(<IN>)
{
chomp $_;
$line=lc $_;
@Organs=split/\t/, $line;
$hash{$Organs[0]}=$Organs[1];
}
$Sentence="Lymphoma is Lymph Heart and Lung";
@list=split/ /,$Sentence;
@array=();
foreach $term(@list)
{
chomp $term;
for $keys(keys %hash)
{
if($hash{$term})
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(@array, $cell);
}
elsif($term=~m/\b\Q$keys(\w+)\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(@array, $cell);
}
elsif($term=~m/\b\Q(\w+)$keys\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(@array, $cell);
}
elsif($term=~m/\b\Q(\w+)$keys(\w+)\E\b/)
{
$cell="<$hash{$keys}>$term<\/$hash{$keys}>";
push(@array, $cell);
}
}
}
print @array;
for example: hash looks like this: %hash={
TF1 => Lymph
Thoracic_duct => Lymph
SK-MEL-1 => Lymph
Brain => Brain
Cerebellum => Brain
};
So if the term TF1 is found, it should be replaced with <Lymph>TF1</Lymph>.
I found two big problems that were preventing your code from working:
You are making the keys to your hash lowercase, but you are not doing the same for the terms in $Sentence. Thus, uppercase words from $Sentence will never match.
The \Q...\E modifier disables regex meta-characters. While it is often good to do this when interpolating a variable, you cannot use expressions like (\w+) in there--that will look for the literal characters (\w+). Those regexes need to be rewritten like this: m/\b\Q$keys\E(\w+)\b/.
There are other design issues with your code, as well:
You are using undeclared global variables all over the place. You should declare all variables with my. Always turn on use strict; use warnings;, which will force you to do this correctly.
There doesn't appear to be any reason for Tie::IxHash, which causes your hash to be ordered. You don't use this ordering in any way in your code. The output is ordered by @list. I would do away with this unnecessary module.
Your if/elsif statements are redundant. if($term=~m/\b(\w*)\Q$keys\E(\w*)\b/) will accomplish the same thing as all of them combined. Note that I replaced \w+ with \w*, which allows the groups before and after to match zero or more characters instead of one or more, and that only $keys sits inside \Q...\E, so the groups keep their regex meaning.
Note: I didn't bother testing with Tie::IxHash, since I don't have that module and it appears unnecessary. It's possible that using this module is also introducing other problems in your code.
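Putting those fixes together, a minimal sketch could look like the following. The hash contents, the sentence, and the names %tag_for and @tagged are made up for illustration; the real data would come from the tab-separated file:

```perl
use strict;
use warnings;

# Hypothetical stemmed, lowercased keys mapped to their tags.
my %tag_for = ( synerg => 'Effect', lymph => 'Lymph' );

my $sentence = "Lymphoma is anti-synergistic";
my @tagged;

for my $term ( split ' ', $sentence ) {
    my $tag;
    for my $key ( sort keys %tag_for ) {
        # Lowercase the term to match the lowercased keys; only the key
        # is quoted with \Q...\E, the \w* parts stay regex syntax.
        if ( lc($term) =~ /\b\w*\Q$key\E\w*\b/ ) {
            $tag = $tag_for{$key};
            last;
        }
    }
    push @tagged, defined $tag ? "<$tag>$term</$tag>" : $term;
}

print "@tagged\n";
```

Untagged words are kept as they are, so the output preserves the order of the sentence. If several keys could match one term, this sketch simply takes the first in sorted order, which may or may not be what you want.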

How to parse text that include something like table using PERL

Tables like this:
<p>
Porpertity History
<p>
class Rate data
<p>
A1 5% 10
<p>
B1 3.5% 8
How to parse them into a hash or array?
Thanks a lot.
There are three ways we could parse this table. First, we need to open it and get to the data. I'm assuming that if the second column ends in a %, it's a valid column:
#! /usr/bin/env perl
use strict;
use warnings;
open (MY_FILE, "data.txt")
or die qq(Can't open "data.txt" for reading\n);
my %myHash;
my %dataHash;
my %rateHash;
while (my $line = <MY_FILE>) {
my ($class, $rate, $data) = split (/\s+/, $line);
next unless defined $rate and $rate =~ /%$/; # skip lines without a rate column
That part of the code will split the three items, and then the question is how to structure the hash. We could create two hashes (one for rate and one for data) and use the same key:
$rateHash{$class} = $rate;
$dataHash{$class} = $data;
Or, we could have our hash as a hash of hashes, and put both pieces in the same hash:
$myHash{$class}->{RATE} = $rate;
$myHash{$class}->{DATA} = $data;
You can now pull up either the rate or data in the same hash. You can also do it in one go:
%{$myHash{$class}} = ("RATE" => $rate, "DATA" => "$data");
I personally prefer the first one.
Another possibility is to combine the two into a single scalar:
$myHash{$class} = "$rate:$data"; #Assuming ":" doesn't appear in either.
My preference is to make it a hash of hashes (like in the second example). Then, create a class to handle the dirty work (Simple to do using Moose).
However, depending upon your programming skill, you might feel more comfortable with the two-hash idea, or with combining the two values into a single scalar.
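For completeness, here is a self-contained sketch of the hash-of-hashes variant. It reads from an in-memory list that mimics the sample input instead of data.txt, and the name %table is made up for illustration:

```perl
use strict;
use warnings;

# Stand-in for the lines of the input file.
my @lines = ( '<p>', 'class Rate data', '<p>', 'A1 5% 10', '<p>', 'B1 3.5% 8' );

my %table;
for my $line (@lines) {
    my ( $class, $rate, $data ) = split ' ', $line;

    # Keep only rows whose second column looks like a percentage.
    next unless defined $rate and $rate =~ /%$/;

    $table{$class} = { RATE => $rate, DATA => $data };
}

for my $class ( sort keys %table ) {
    print "$class: rate=$table{$class}{RATE} data=$table{$class}{DATA}\n";
}
```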

What does the Perl split function return when there is no value between tokens?

I'm trying to split a string using the split function but there isn't always a value between tokens.
Ex: ABC,123,,,,,,XYZ
I don't want to skip the multiple tokens though. These values are in specific positions in the string. However, when I do a split, and then try to step through my resulting array, I get "Use of uninitialized value" warnings.
I've tried comparing the value using $splitvalues[x] eq "" and I've tried using defined($splitvalues[x]) , but I can't for the life of me figure out how to identify what the split function is putting in to my array when there is no value between tokens.
Here's the snippet of my code (now with more crunchy goodness):
my @matrixDetail = ();
#some other processing happens here that is based on matching data from the
#@oldDetail array with the first field of the @matrixLine array. If it does
#match, then I do the split
if($IHaveAMatch)
{
@matrixDetail = split(',', $matrixLine[1]);
}
else
{
@matrixDetail = ('','','','','','','');
}
}
my $newDetailString =
(($matrixDetail[0] eq '') ? $oldDetail[0] : $matrixDetail[0])
. (($matrixDetail[1] eq '') ? $oldDetail[1] : $matrixDetail[1])
.
.
.
. (($matrixDetail[6] eq '') ? $oldDetail[6] : $matrixDetail[6]);
Because these are just snippets, I've left some of the other logic out, but the if statement is inside a sub that technically returns the @matrixDetail array. If I don't find a match in my matrix and set the array to the list of empty strings manually, then I get no warnings. They only appear when the split populates @matrixDetail.
Also, I should mention, I've been writing code for nearly 15 years, but only very recently have I needed to work with Perl. The logic in my script is sound (or at least, it works), I'm just being anal about cleaning up my warnings and trying to figure out this little nuance.
#!perl
use warnings;
use strict;
use Data::Dumper;
my $str = "ABC,123,,,,,,XYZ";
my @elems = split ',', $str;
print Dumper \@elems;
This gives:
$VAR1 = [
'ABC',
'123',
'',
'',
'',
'',
'',
'XYZ'
];
It puts in an empty string.
Edit: Note that the documentation for split() states that "by default, empty leading fields are preserved, and empty trailing ones are deleted." Thus, if your string is ABC,123,,,,,,XYZ,,,, then your returned list will be the same as the above example, but if your string is ,,,,ABC,123, then you will have a list with four empty strings in elements 0 through 3 (in addition to 'ABC' and '123').
Edit 2: Try dumping out the @matrixDetail and @oldDetail arrays. It's likely that one of those isn't the length that you think it is. You might also consider checking the number of elements in those two lists before trying to use them to make sure you have as many elements as you're expecting.
I suggest to use Text::CSV from CPAN. It is a ready made solution which already covers all the weird edge cases of parsing CSV formatted files.
Delimiters with nothing between them give empty strings when split. Empty strings evaluate as false in boolean context.
If you know that your "details" input will never contain "0" (or other scalar that evaluates to false), this should work:
my @matrixDetail = split(',', $matrixLine[1]);
die if @matrixDetail > @oldDetail;
my $newDetailString = "";
for my $i (0..$#oldDetail) {
$newDetailString .= $matrixDetail[$i] || $oldDetail[$i]; # thanks canSpice
}
say $newDetailString;
(Besides the empty string, the scalars that evaluate to false are undef, the number 0, and the string "0"; strings such as "0.0" or "00" are true.)
TMTOWTDI:
$matrixDetail[$_] ||= $oldDetail[$_] for 0..$#oldDetail;
my $newDetailString = join("", @matrixDetail);
edit: for loops now go from 0 to $#oldDetail instead of $#matrixDetail since trailing ",,," are not returned by split.
edit2: if you can't be sure that real input won't evaluate as false, you could always just test the length of your split elements. This is safer, definitely, though perhaps less elegant ^_^
Empty fields in the middle will be ''. Empty fields on the end will be omitted, unless you specify a third parameter to split large enough (or -1 for all).
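A short sketch of both points, the negative limit and the length test suggested earlier. The sample line and the contents of @oldDetail are made up for illustration:

```perl
use strict;
use warnings;

my @oldDetail = qw( a b c d );    # hypothetical fallback values
my $line      = 'X,0,,';          # hypothetical input with trailing empty fields

my @default      = split /,/, $line;        # trailing empty fields are dropped
my @matrixDetail = split /,/, $line, -1;    # a negative limit keeps all of them

# Test the length instead of the truth value, so that a legitimate "0"
# is kept and only empty (or missing) fields fall back to @oldDetail.
my $newDetailString = join '', map {
    length $matrixDetail[$_] ? $matrixDetail[$_] : $oldDetail[$_]
} 0 .. $#oldDetail;

print scalar(@default), " ", scalar(@matrixDetail), " $newDetailString\n";
```

With the || version shown above, the "0" in the second field would have been replaced by the old value, which is why the length test is the safer choice when zeros are valid data.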