HTML::Parser handler sends undefined parameter to callback function?

How it's being declared:
my $HTML_GRABBER = HTML::Parser->new('api_version' => 2,
'handlers' => {
'start' => [\&start_tag,"tagname,text"],
'text' => [\&read_text,"tagname, text"],
'end' => [\&end_tag,"tagname"]
}
);
callback function:
sub read_text {
    print Dumper(@_);
    die "\n";
    my ($tag, $stuff) = @_;
    if (($DO_NOTHING == 0) && ($tag eq $current_tag))
    {
        push @{$data_queue}, $stuff;
    }
}
result:
$VAR1 = undef;
$VAR2 = '
';
So it apparently passes an undefined value for the tag and an empty (newline-only) string for the text. This is reading from a saved HTML file on my hard drive, and I don't know why it happens.
I had something like this in mind:
#DOC structure:
#(
# "title"=> {"text"=>@("text")}
# "div" => [
# {
# "p"=> [
# {
# "class" => string
# "id" => string
# "style" => string
# "data"=>["first line", "second line"]
# }
# ],
# "class" => string
# "id" => string
# "style" => string
# }
# ]
#)

You've told it to.
You specified which parameters should be passed to the text handler:
'text' => [\&read_text,"tagname, text"],
Well, there is no tagname for a text token, and therefore it passes you undef as the first parameter.
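One way to still know the enclosing tag inside the text handler is to track it yourself from the start and end events and to request only the text. The following is a minimal sketch, assuming well-nested markup without void elements such as <br>; the helper name collect_text and the sample HTML are made up for illustration and are not part of the original code:

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: return the text chunks found directly inside
# elements named $wanted.
sub collect_text {
    my ($html, $wanted) = @_;
    my @open;     # stack of currently open tag names
    my @found;    # collected text chunks

    my $parser = HTML::Parser->new(
        api_version   => 3,
        unbroken_text => 1,    # deliver each text run in one piece
        # Start tag: remember which element we are now inside.
        start_h => [ sub { push @open, $_[0] }, 'tagname' ],
        # End tag: leave the element if it is the innermost open one.
        end_h   => [ sub { pop @open if @open && $open[-1] eq $_[0] }, 'tagname' ],
        # Text: request only the decoded text; the tag comes from the stack.
        text_h  => [
            sub {
                my ($text) = @_;
                push @found, $text
                    if @open && $open[-1] eq $wanted && $text =~ /\S/;
            },
            'dtext',
        ],
    );
    $parser->parse($html);
    $parser->eof;
    return @found;
}

print "$_\n" for collect_text('<p>one</p><div>skip</div><p>two</p>', 'p');
```

With the sample string above this should print the two paragraph texts and skip the div. Real-world HTML with unclosed tags would need a more forgiving stack.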
What exactly are you trying to do? If you describe your actual goal, we might be able to suggest a better solution instead of just pointing out the flaws in your current implementation. Check out: What is an XY Problem?
Addendum about Mojo::DOM
There are modern modules like Mojo::DOM that are much better for navigating a document structure and finding specific data. Check out Mojocast Episode 5 for a helpful 8 minute introductory video.
You appear to be prematurely worried about efficiency of the parse. Initially, I'd advise you to just store the raw html in the database, and reparse it whenever you need to pull new information.
If you Benchmark and decide this is too slow, then you can use Storable to save a serialized copy of the parsed $dom object. However, this should definitely be in addition to the saved html.
use strict;
use warnings;
use Mojo::DOM;
use Storable qw(freeze thaw);
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
# Serializing to memory - Can then put it into a DB if you want
my $serialized = freeze $dom;
my $newdom = thaw($serialized);
# Load the title from the deserialized DOM (at() returns the first match)
print $newdom->at('title')->text;
__DATA__
<html>
<head><title>My Title</title></head>
<body>
<h1>My Header one</h1>
<p>My Paragraph One</p>
<p>My Paragraph Two</p>
</body>
</html>
Outputs:
My Title

Related

Using Web::Scraper to pull HTML tags with a data marker

The purpose of this code is to parse an HTML file and return content that is wrapped with tags that have the data-reader attribute.
This works as desired, but I would also like to get the associated HTML tag, but I don't know how to have it returned in the scrape data.
Is this possible?
#!/usr/bin/perl -T
use strict;
use warnings;
use Web::Scraper;
my $html = do { local $/; <DATA> };
my $s = scraper {
process '*', 'links[]' => '@data-reader';
process '*', 'content[]' => 'text';
};
my $res = $s->scrape($html);
for my $i (0 .. $#{ $res->{links} } ) {
if ($res->{links}[$i]) {
print "<??>$res->{content}[$i]</??>\n";
}
}
exit;
__DATA__
<h1 data-reader="on">Hello <em>world</em></h1>
<h2>This is subheading</h2>
<h3 style="color:#000;" data-reader="on" class="phead">Paragraph Heading</h3>
Output:
<??>Hello world</??>
<??>Paragraph Heading</??>

#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;
use HTML::Entities;
my $html = do { local $/; <DATA> };
my $s = scraper {
process '[data-reader]', 'list[]' => {
tag => sub { $_->tag },
content => 'TEXT',
};
result 'list';
};
my $results = $s->scrape($html);
for my $part (@$results) {
print "<$part->{tag}>" . encode_entities($part->{content}) . "</$part->{tag}>\n";
}
__DATA__
<h1 data-reader="on">Hello <em>world</em></h1>
<h2>This is subheading</h2>
<h3 style="color:#000;" data-reader="on" class="phead">Paragraph Heading</h3>
Output:
<h1>Hello world</h1>
<h3>Paragraph Heading</h3>
The ability to pass a raw subroutine as the extractor specification seems to be undocumented, but the Web::Scraper documentation is spotty in general, and it's used in at least one example, so I don't feel too bad about using it.
I'm re-encoding $part->{content} as HTML to avoid issues in case someone does e.g.
<div data-reader="on"><script>alert(42)</script></div>
If you were to just print $part->{content}, it would give you <script>alert(42)</script>, which is probably not what you want.
In detail:
my $s = scraper {
process '[data-reader]', 'list[]' => {
tag => sub { $_->tag },
content => 'TEXT',
};
result 'list';
};
scraper takes a block of code and wraps it in an object. Every time the scrape method of this object is called, the block of code is run. In theory you can do anything you want there, but the only sensible things are calls to process and result.
process takes three (or more) arguments. The first argument is a CSS selector (or an XPath expression if it starts with // or id(). In this case ([data-reader]) we're selecting all elements that have a data-reader attribute.
The remaining arguments are key/value pairs. scraper provides an implicit context (also known as "stash"), which is simply a hash where results are placed. The "key" argument specifies under which hash key the results of the extraction should be placed. If the "key" argument ends with [], it is stripped and the value is not a single result, but a reference to an array of results.
Here we use list[] as the "key" argument, which means that we're accumulating results under the list key of the stash.
The "value" argument specifies what value we want to store under our key. Possible values include TEXT (the text value of a node) and @foo (the value of the foo attribute of the element in question).
Here we're using a hash reference, which means we want to construct a nested subhash. Each key/value pair of our hash is interpreted as described above. We get entries for tag (containing the tag name as returned by the tag method) and content (containing the text value of our element).
The effect is as if scrape contained the following loop:
my %stash;
for my $node (@found_nodes) {
    push @{$stash{list}}, {
        tag => $node->tag,
        content => get_plain_text_somehow($node),
    };
}
Normally scrape returns the stash, but if the scrape block contains result (which must be the last statement in the block), you can make it return just a single key (or if you pass multiple strings to result, a hash containing just a subset of keys). That is, because of result 'list', instead of
return \%stash;
we effectively get
return $stash{list};

Iterate directories in Perl, getting introspectable objects as result

I'm about to start a script that may have some file lookups and manipulation, so I thought I'd look into some packages that would assist me; mostly, I'd like the results of the iteration (or search) to be returned as objects, which would have (base)name, path, file size, uid, modification time, etc as some sort of properties.
The thing is, I don't do this all that often, and tend to forget APIs; when that happens, I'd rather let the code run on an example directory, and dump all of the properties in an object, so I can remind myself what is available where (obviously, I'd like to "dump", in order to avoid having to code custom printouts). However, I'm aware of the following:
list out all methods of object - perlmonks.org
"Out of the box Perl doesn't do object introspection. Class wrappers like Moose provide introspection as part of their implementation, but Perl's built in object support is much more primitive than that."
Anyways, I looked into:
"Files and Directories Handling in Perl - Perl Beginners' Site" http://perl-begin.org/topics/files-and-directories/
... and started looking into the libraries referred there (also related link: rjbs's rubric: the speed of Perl file finders).
So, for one, File::Find::Object seems to work for me; this snippet:
use Data::Dumper;
my @targetDirsToScan = ("./");
use File::Find::Object;
my $tree = File::Find::Object->new({}, @targetDirsToScan);
while (my $robh = $tree->next_obj()) {
#print $robh ."\n"; # prints File::Find::Object::Result=HASH(0xa146a58)}
print Dumper($robh) ."\n";
}
... prints this:
# $VAR1 = bless( {
# 'stat_ret' => [
# 2054,
# 429937,
# 16877,
# 5,
# 1000,
# 1000,
# 0,
# '4096',
# 1405194147,
# 1405194139,
# 1405194139,
# 4096,
# 8
# ],
# 'base' => '.',
# 'is_link' => '',
# 'is_dir' => 1,
# 'path' => '.',
# 'dir_components' => [],
# 'is_file' => ''
# }, 'File::Find::Object::Result' );
# $VAR1 = bless( {
# 'base' => '.',
# 'is_link' => '',
# 'is_dir' => '',
# 'path' => './test.blg',
# 'is_file' => 1,
# 'stat_ret' => [
# 2054,
# 423870,
# 33188,
# 1,
# 1000,
# 1000,
# 0,
# '358',
# 1404972637,
# 1394828707,
# 1394828707,
# 4096,
# 8
# ],
# 'basename' => 'test.blg',
# 'dir_components' => []
... which is mostly what I wanted, except the stat results are an array, and I'd have to know its layout (($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) stat - perldoc.perl.org) to make sense of the printout.
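The stat layout quoted above can be given names with a plain hash slice, so the dump becomes self-describing. This is a minimal sketch using only core Perl; the helper name named_stat is made up for illustration:

```perl
use strict;
use warnings;
use Data::Dumper;

# Field order as documented for Perl's built-in stat().
my @STAT_FIELDS = qw(
    dev ino mode nlink uid gid rdev size
    atime mtime ctime blksize blocks
);

# Hypothetical helper: return a hash reference of named stat fields
# for $path, or nothing if stat() fails.
sub named_stat {
    my ($path) = @_;
    my @values = stat $path;
    return unless @values;

    my %named;
    @named{@STAT_FIELDS} = @values;    # hash slice: names => values
    return \%named;
}

local $Data::Dumper::Sortkeys = 1;
print Dumper(named_stat('.'));
```

The core File::stat module offers a similar by-name interface through methods such as ->mtime and ->size, if an object is preferred over a hash.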
Then I looked into IO::All, which I like because of utf-8 handling (but also, say, socket functionality, which would be useful to me for an unrelated task in the same script); and I was thinking I'd use this package instead. The problem is, I have a very hard time discovering what the available fields in the object returned are; e.g. with this code:
use Data::Dumper;
my @targetDirsToScan = ("./");
use IO::All -utf8;
my $io = io(@targetDirsToScan);
my @contents = $io->all(0);
for my $contentry ( @contents ) {
#print Dumper($contentry) ."\n";
# $VAR1 = bless( \*Symbol::GEN298, 'IO::All::File' );
# $VAR1 = bless( \*Symbol::GEN307, 'IO::All::Dir' ); ...
#print $contentry->uid . " -/- " . $contentry->mtime . "\n";
# https://stackoverflow.com/q/24717210/printing-ret-of-ioall-w-datadumper
print Dumper \%{*$contentry}; # doesn't list uid
}
... I get a printout like this:
# $VAR1 = {
# '_utf8' => 1,
# 'constructor' => sub { "DUMMY" },
# 'is_open' => 0,
# 'io_handle' => undef,
# 'name' => './test.blg',
# '_encoding' => 'utf8',
# 'package' => 'IO::All'
# };
# $VAR1 = {
# '_utf8' => 1,
# 'constructor' => sub { "DUMMY" },
# 'mode' => undef,
# 'name' => './testdir',
# 'package' => 'IO::All',
# 'is_absolute' => 0,
# 'io_handle' => undef,
# 'is_open' => 0,
# '_assert' => 0,
# '_encoding' => 'utf8'
... which clearly doesn't show attributes like mtime, etc. - even if they exist (which you can see if you uncomment the respective print line).
I've also tried Data::Printer's (How can I perform introspection in Perl?) p() function - it prints exactly the same fields as Dumper. I also tried to use print Dumper \%{ref ($contentry) . "::"}; (list out all methods of object - perlmonks.org), and this prints stuff like:
'O_SEQUENTIAL' => *IO::All::File::O_SEQUENTIAL,
'mtime' => *IO::All::File::mtime,
'DESTROY' => *IO::All::File::DESTROY,
...
'deep' => *IO::All::Dir::deep,
'uid' => *IO::All::Dir::uid,
'name' => *IO::All::Dir::name,
...
... but only if you use the print $contentry->uid ... line beforehand; else they are not listed! I guess that relates to this:
introspection - How do I list available methods on a given object or package in Perl? #911294
In general, you can't do this with a dynamic language like Perl. The package might define some methods that you can find, but it can also make up methods on the fly that don't have definitions until you use them. Additionally, even calling a method (that works) might not define it. That's the sort of things that make dynamic languages nice. :)
Still, that prints the name and type of the field - I'd want the name and value of the field instead.
So, I guess my main question is - how can I dump an IO::All result, so that all fields (including stat ones) are printed out with their names and values (as is mostly the case with File::Find::Object)?
(I noticed the IO::All results can be of type, say, IO::All::File, but its docs defer to "See IO::All", which doesn't discuss IO::All::File explicitly much at all. I thought, if I could "cast" \%{*$contentry} to an IO::All::File, maybe then mtime and the other fields would be printed - but is such a "cast" possible at all?)
If that is problematic, are there other packages, that would allow introspective printout of directory iteration results - but with named fields for individual stat properties?

Perl does introspection in the fact that an object will tell you what type of object it is.
if ( $object->isa("Foo::Bar") ) {
say "Object is of a class of Foo::Bar, or is a subclass of Foo::Bar.";
}
if ( ref $object eq "Foo::Bar" ) {
say "Object is of the class Foo::Bar.";
}
else {
say "Object isn't a Foo::Bar object, but may be a subclass of Foo::Bar";
}
You can also see if an object can do something:
if ( $object->can("quack") ) {
say "Object looks like a duck!";
}
What Perl can't do directly is give you a list of all the methods that a particular object can do.
You might be able to work around this in some way. Perl classes are package namespaces, which live in the symbol table, and methods are implemented as ordinary Perl subroutines. It may be possible to go through the package namespace and find all the subroutines.
However, I can see several issues. First, private methods (the ones you're not supposed to use) and non-method subroutines would also be included. There's no way to know which is which. Also, parent methods won't be listed.
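The symbol-table walk described above could look roughly like this. It is a sketch with exactly the caveats just mentioned: private subs show up, inherited methods do not, and methods created on the fly are only visible once they exist. The helper list_subs and the class Demo::Duck are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: list the names of all subroutines currently
# defined directly in package $class (no inheritance, no AUTOLOAD magic).
sub list_subs {
    my ($class) = @_;
    no strict 'refs';    # symbolic references are needed to reach the stash
    return sort
        grep { !/::\z/ && defined &{"${class}::$_"} }    # skip nested packages and non-subs
        keys %{"${class}::"};
}

{
    package Demo::Duck;    # hypothetical class, for illustration only
    sub new     { return bless {}, shift }
    sub quack   { return 'Quack!' }
    sub _waddle { return 1 }    # "private" by convention only
}

print join(', ', list_subs('Demo::Duck')), "\n";
```

Note that _waddle is listed alongside the public methods, which is the "no way to know which is which" problem in practice.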
Many languages can generate such a list of methods for their objects (I believe both Python and Ruby can), but these usually give you a list without an explanation what these do. For example, File::Find::Object::Result (which is returned by the next_obj method of File::Find::Object) has a base method. What does it do? Maybe it's like basename and gives me the name of the file. Nope, it's like dirname and gives me the name of the directory.
Again, some languages could give a list of those methods for an object along with a description. However, those descriptions depend upon the programmer to maintain them and make sure they're correct. There's no guarantee of that.
Perl doesn't have introspection, but all Perl modules stored in CPAN must be documented via POD embedded documentation, and this is printable from the command line:
$ perldoc File::Find::Object
This is the documentation you see in CPAN pages, in http://Perldoc.perl.org and in ActiveState's Perl documentation.
It's not bad. It's not true introspection, but the documentation is usually pretty good. After all, if the documentation stunk, I probably wouldn't have installed that module in the first place. I use perldoc all the time. I can barely remember my kids' names let alone the way to use Perl classes that I haven't used in a few months, but I find that using perldoc works pretty well.
What you should not do is use Data::Dumper to dump out objects and try to figure out what they contain and what methods they might have. Some clever programmers are using Inside-Out Objects to thwart peeping toms.
So no, Perl doesn't list methods of a particular class like some languages can, but perldoc comes pretty close to doing what you need. I haven't used File::Find::Object in a long while, but going over the perldoc, I probably could write up such a program without much difficulty.

As I answered to your previous question, it is not a good idea to go relying on the guts of objects in Perl. Instead just call methods.
If IO::All doesn't offer a method that gives you the information that you need, you might be able to write your own method for it that assembles that information using just the documented methods provided by IO::All...
use IO::All;
# Define a new method for IO::All::Base to use, but
# define it in a lexical variable!
#
my $dump_info = sub {
use Data::Dumper ();
my $self = shift;
local $Data::Dumper::Terse = 1;
local $Data::Dumper::Sortkeys = 1;
return Data::Dumper::Dumper {
name => $self->name,
mtime => $self->mtime,
mode => $self->mode,
ctime => $self->ctime,
};
};
$io = io('/tmp');
for my $file ( $io->all(0) ) {
print $file->$dump_info();
}

OK, this is more or less an exercise (and a reminder for me); below is some code, where I've tried to define a class (File::Find::Object::StatObj) with accessor fields for all of the stat fields. Then, I have the hack for IO::All::File from "Replacing a class in Perl ("overriding"/"extending" a class with same name)?", where a mtimef field is added which corresponds to mtime, just as a reminder.
Then, just to see what sort of interface I could have between the two libraries, I have IO::All doing the iterating; and the current file path is passed to File::Find::Object, from which we obtain a File::Find::Object::Result - which has been "hacked" to also show the File::Find::Object::StatObj; but that one is only generated after a call to the hacked Result's full_components (that might as well have been a separate function). Notice that in this case, you won't get full_components/dir_components of File::Find::Object::Result -- because apparently it is not File::Find::Object doing the traversal here, but IO::All. Anyways, the result is something like this:
# $VAR1 = {
# '_utf8' => 1,
# 'mtimef' => 1403956165,
# 'constructor' => sub { "DUMMY" },
# 'is_open' => 0,
# 'io_handle' => undef,
# 'name' => 'img/test.png',
# '_encoding' => 'utf8',
# 'package' => 'IO::All'
# };
# img/test.png
# > - $VAR1 = bless( {
# 'base' => 'img/test.png',
# 'is_link' => '',
# 'is_dir' => '',
# 'path' => 'img/test.png',
# 'is_file' => 1,
# 'stat_ret' => [
# 2054,
# 426287,
# 33188,
# 1,
# 1000,
# 1000,
# 0,
# '37242',
# 1405023944,
# 1403956165,
# 1403956165,
# 4096,
# 80
# ],
# 'basename' => undef,
# 'stat_obj' => bless( {
# 'blksize' => 4096,
# 'ctime' => 1403956165,
# 'rdev' => 0,
# 'blocks' => 80,
# 'uid' => 1000,
# 'dev' => 2054,
# 'mtime' => 1403956165,
# 'mode' => 33188,
# 'size' => '37242',
# 'nlink' => 1,
# 'atime' => 1405023944,
# 'ino' => 426287,
# 'gid' => 1000
# }, 'File::Find::Object::StatObj' ),
# 'dir_components' => []
# }, 'File::Find::Object::Result' );
I'm not sure how correct this would be, but what I like about this is that I could forget where the fields are; then I could rerun the dumper, and see that I could get, say, the size via (*::Result)->stat_obj->size - and that seems to work (here I'd need just to read these, not to set them).
Anyways, here is the code:
use Data::Dumper;
my @targetDirsToScan = ("./");
use IO::All -utf8 ; # Turn on utf8 for all io
# try to "replace" the IO::All::File class
{ # https://stackoverflow.com/a/24726797/277826
package IO::All::File;
use IO::All::File; # -base; # just do not use `-base` here?!
# hacks work if directly in /usr/local/share/perl/5.10.1/IO/All/File.pm
# NB: field is a sub in /usr/local/share/perl/5.10.1/IO/All/Base.pm
field mtimef => undef; # hack
sub file {
my $self = shift;
bless $self, __PACKAGE__;
$self->name(shift) if @_;
$self->mtimef($self->mtime); # hack
#print("!! *haxx0rz'd* file() reporting in\n");
return $self->_init;
}
1;
}
use File::Find::Object;
# based on /usr/local/share/perl/5.10.1/File/Find/Object/Result.pm;
# but inst. from /usr/local/share/perl/5.10.1/File/Find/Object.pm
{
package File::Find::Object::StatObj;
use integer;
use Tie::IxHash;
#use Data::Dumper;
sub ordered_hash { # https://stackoverflow.com/a/3001400/277826
#my (@ar) = @_; #print("# ". join(",",@ar) . "\n");
tie my %hash => 'Tie::IxHash';
%hash = @_; #print Dumper(\%hash);
\%hash
}
my $fields = ordered_hash(
# from http://perldoc.perl.org/functions/stat.html
(map { $_ => $_ } (qw(
dev ino mode nlink uid gid rdev size
atime mtime ctime blksize blocks
)))
); #print Dumper(\%{$fields});
use Class::XSAccessor
#accessors => %{$fields}, # cannot - is seemingly late
# ordered_hash gets accepted, but doesn't matter in final dump;
#accessors => { (map { $_ => $_ } (qw(
accessors => ordered_hash( (map { $_ => $_ } (qw(
dev ino mode nlink uid gid rdev size
atime mtime ctime blksize blocks
))) ),
#))) },
;
use Fcntl qw(:mode);
sub new
{
#my $self = shift;
my $class = shift;
my @stat_arr = @_; # the rest
my $ic = 0;
my $self = {};
bless $self, $class;
for my $k (keys %{$fields}) {
$fld = $fields->{$k};
#print "$ic '$k' '$fld' ".join(", ",$stat_arr[$ic])." ; ";
$self->$fld($stat_arr[$ic]);
$ic++;
}
#print "\n";
return $self;
}
1;
}
# try to "replace" the File::Find::Object::Result
{
package File::Find::Object::Result;
use File::Find::Object::Result;
#use File::Find::Object::StatObj; # no, has no file!
use Class::XSAccessor replace => 1,
accessors => {
(map { $_ => $_ } (qw(
base
basename
is_dir
is_file
is_link
path
dir_components
stat_ret
stat_obj
)))
}
;
#use Fcntl qw(:mode);
#sub new # never gets called
sub full_components
{
my $self = shift; #print("NEWCOMP\n");
my $sobj = File::Find::Object::StatObj->new(@{$self->stat_ret()});
$self->stat_obj($sobj); # add stat_obj and its fields
return
[
@{$self->dir_components()},
($self->is_dir() ? () : $self->basename()),
];
}
1;
}
# main script start
my $io = io($targetDirsToScan[0]);
my @contents = $io->all(0); # Get all contents of dir
for my $contentry ( @contents ) {
print Dumper \%{*$contentry};
print $contentry->name . "\n"; # img/test.png
# get a File::Find::Object::Result - must instantiate
# a File::Find::Object; just item_obj() will return undef
# right after instantiation, so must give it "next";
# no instantition occurs for $tro, though!
#my $tffor = File::Find::Object->new({}, ($contentry->name))->next_obj();
my $tffo = File::Find::Object->new({}, ("./".$contentry->name));
my $tffos = $tffo->next(); # just a string!
$tffo->_calc_current_item_obj(); # unfortunately, this will not calculate dir_components ...
my $tffor = $tffo->item_obj();
# ->full_components doesn't call new, either!
# must call full_compoments, to generate the fields
# (assign to unused variable triggers it fine)
# however, $arrref_fullcomp will be empty, because
# File::Find::Object seemingly calcs dir_components only
# if it is traversing a tree...
$arrref_fullcomp = $tffor->full_components;
#print("# ".$tffor->stat_obj->size."\n"); # seems to work
print "> ". join(", ", @$arrref_fullcomp) ." - ". Dumper($tffor);
}

Perl HTML::Tokeparser get raw html between tags

I am using TokeParser to extract tag contents.
...
$text = $p->get_text("/td") ;
...
Usually it returns the text cleaned up. What I want is to return everything between td and /td, including all other HTML elements. How can I do that?
I am using the example in this tutorial. Thanks.
In the example,
my( $tag, $attr, $attrseq, $rawtxt) = @{ $token };
I believe there is some trick to do with $rawtxt.

HTML::TokeParser does not have a built-in feature to do this. However, it's possible by looking at each token between <td>s individually.
#!/usr/bin/perl
use strictures;
use HTML::TokeParser;
use 5.012;
# dispatch table with subs to handle the different types of tokens
my %dispatch = (
S => sub { $_[0]->[4] }, # Start tag
E => sub { $_[0]->[2] }, # End tag
T => sub { $_[0]->[1] }, # Text
C => sub { $_[0]->[1] }, # Comment
D => sub { $_[0]->[1] }, # Declaration
PI => sub { $_[0]->[2] }, # Process Instruction
);
# create the parser
my $p = HTML::TokeParser->new( \*DATA ) or die "Can't open: $!";
# fetch all the <td>s
TD: while ( $p->get_tag('td') ) {
# go through all tokens ...
while ( my $token = $p->get_token ) {
# ... but stop at the end of the current <td>
next TD if ( $token->[0] eq 'E' && $token->[1] eq 'td' );
# call the sub corresponding to the current type of token
print $dispatch{$token->[0]}->($token);
}
} continue {
# each time next TD is called, print a newline
print "\n";
}
__DATA__
<html><body><table>
<tr>
<td><strong>foo</strong></td>
<td><em>bar</em></td>
<td><font size="10"><font color="#FF0000">frobnication</font></font>
<p>Lorem ipsum dolor set amet fooofooo foo.</p></td>
</tr></table></body></html>
This program will parse the HTML document in the __DATA__ section and print everything including HTML between <td> and </td>. It will print one line per <td>. Let's go through it step by step.
After reading the documentation, I learned that each token from HTML::TokeParser has a type associated with it. There are six types: S, E, T, C, D and PI. The doc says:
This method will return the next token found in the HTML document, or
undef at the end of the document. The token is returned as an array
reference. The first element of the array will be a string denoting
the type of this token: "S" for start tag, "E" for end tag, "T" for
text, "C" for comment, "D" for declaration, and "PI" for process
instructions. The rest of the token array depend on the type like
this:
["S", $tag, $attr, $attrseq, $text]
["E", $tag, $text]
["T", $text, $is_data]
["C", $text]
["D", $text]
["PI", $token0, $text]
We want to access the $text stored in these tokens, because there is no other way to grab stuff that looks like HTML tags. I therefore created a dispatch table to handle them in %dispatch. It stores a bunch of code refs that get called later.
We read the document from __DATA__, which is convenient for this example.
First of all, we need to fetch the <td>s by using the get_tag method. @nrathaus's comment pointed me that way. It will move the parser to the next token after the opening <td>. We don't care about what get_tag returns since we only want the tokens after the <td>.
We use the method get_token to fetch the next token and do stuff with it:
But we only want to do that until we find the corresponding closing </td>. If we see that, we next the outer while loop labelled TD.
At that point, the continue block gets called and prints a newline.
If we are not at the end, the magic happens in the dispatch table. As we saw earlier, the first element in the token array ref holds the type. There is a code ref for each of these types in %dispatch. We call it and pass the complete array ref $token by going $coderef->(@args). We print the result on the current line.
This will produce stuff like <strong>, foo, </strong> and so on in each run.
Please note that this will only work for one table. If there is a table within a table (something like <td> ... <td></td> ... </td>) this will break. You would have to adjust it to keep track of how many levels deep it is.
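Keeping track of the nesting level could be done with a simple depth counter. The following is a rough sketch rather than a drop-in replacement for the program above; the helper name td_contents, the %RAW_INDEX table and the sample HTML are made up for illustration, and the index positions are taken from the token layouts quoted from the documentation:

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Position of the raw source text inside each token type.
my %RAW_INDEX = ( S => 4, E => 2, T => 1, C => 1, D => 1, PI => 2 );

# Hypothetical helper: return the raw HTML inside each outermost <td>,
# keeping any nested <td>...</td> pairs as part of the content.
sub td_contents {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
    my @cells;
    my $depth  = 0;     # how many <td> elements are currently open
    my $buffer = '';

    while ( my $token = $p->get_token ) {
        my ( $type, $tag ) = @$token;
        if ( $type eq 'S' && $tag eq 'td' ) {
            # Outermost <td>: start a new cell and skip the tag itself.
            if ( $depth++ == 0 ) { $buffer = ''; next; }
        }
        elsif ( $type eq 'E' && $tag eq 'td' && $depth > 0 ) {
            # Matching outermost </td>: the cell is complete.
            if ( --$depth == 0 ) { push @cells, $buffer; next; }
        }
        $buffer .= $token->[ $RAW_INDEX{$type} ] if $depth > 0;
    }
    return @cells;
}

my $sample = '<table><tr><td>a<table><tr><td>b</td></tr></table>c</td>'
           . '<td><em>x</em></td></tr></table>';
print "$_\n" for td_contents($sample);
```

This still assumes every <td> has an explicit closing tag; HTML that omits </td> would need extra handling.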
Another approach would be to use miyagawa's excellent Web::Scraper. That way, we have a lot less code:
#!/usr/bin/perl
use strictures;
use Web::Scraper;
use 5.012;
my $s = scraper {
process "td", "foo[]" => 'HTML'; # grab the raw HTML for all <td>s
result 'foo'; # return the array foo where the raw HTML is stored
};
my $html = do { local $/ = undef; <DATA> }; # read HTML from __DATA__
my $res = $s->scrape( $html ); # scrape
say for @$res; # print each line of HTML
This approach can also handle multi-dimensional tables like a charm.

Perl - reading in multi-line records from config file

I'm trying to read in a multi-line config file with records into a perl hash array
Example Config File:
record_1
phone=5551212
data=1234234
end_record_1

record_2
people_1=bob
people_2=jim
data=1234
end_record_2

record_3
people_1=sue
end_record_3
here's what I'm looking for:
$myData{1}{"phone"} <--- 5551212
$myData{1}{"data"} <--- 1234234
$myData{2}{"people_1"} <--- bob
... etc
What's the best way to read this in? Module? Regex with multi-line match? Brute force? I'm up in the air on where to head next.

Here's one option with your data set:
use strict;
use warnings;
use Data::Dumper;
my %hash;
{
local $/ = '';
while (<DATA>) {
my ($rec) = /record_(\d+)/;
$hash{$rec}{$1} = $2 while /(\S+)=(.+)/g;
}
}
print Dumper \%hash;
__DATA__
record_1
phone=5551212
data=1234234
end_record_1

record_2
people_1=bob
people_2=jim
data=1234
end_record_2

record_3
people_1=sue
end_record_3
Output:
$VAR1 = {
'1' => {
'data' => '1234234',
'phone' => '5551212'
},
'3' => {
'people_1' => 'sue'
},
'2' => {
'people_1' => 'bob',
'data' => '1234',
'people_2' => 'jim'
}
};
Setting local $/ = '' puts Perl into paragraph mode, so an empty line is treated as the "record separator" in your data set and each read returns one whole record. We can then use regexes on those records to grab the information for the hash keys/values.
Hope this helps!

There are a number of modules for this, so the best practice is (as usual) to use them rather than re-invent the wheel.
From the snippet of the config file you posted, it looks like Config::Simple may be the best choice. If you can simplify the config format, then Config::Tiny would be easier to use. If things get more complicated, then you may have to use Config::General.
http://metacpan.org/pod/Config::Tiny
http://metacpan.org/pod/Config::Simple
http://metacpan.org/pod/Config::General

Read it in one line at a time.
When you see a new record, add a new empty associative array to myData and grab a reference to it - this will be your "current record".
Now when you see key/value pairs on a line, you can add that to the current record array (if there is one)
When you see the end of a record, you just clear the reference to the current record.
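The steps above can be sketched as follows. This is one possible reading of them using only core Perl; the helper name parse_records is made up, and the regexes assume the record_N / end_record_N / key=value layout from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: read records line by line from a filehandle and
# return a hash reference keyed by record number.
sub parse_records {
    my ($fh) = @_;
    my %data;
    my $current;    # reference to the record being filled, if any

    while ( my $line = <$fh> ) {
        chomp $line;
        if ( $line =~ /^record_(\d+)\s*$/ ) {
            # New record: create an empty hash and remember it.
            $current = $data{$1} = {};
        }
        elsif ( $line =~ /^end_record_\d+\s*$/ ) {
            # End of record: forget the current record.
            undef $current;
        }
        elsif ( $current && $line =~ /^\s*([^=\s]+)\s*=\s*(.*?)\s*$/ ) {
            # key=value pair inside a record.
            $current->{$1} = $2;
        }
    }
    return \%data;
}

my $config = <<'END';
record_1
phone=5551212
data=1234234
end_record_1
record_2
people_1=bob
people_2=jim
data=1234
end_record_2
record_3
people_1=sue
end_record_3
END

# Read from an in-memory filehandle; a real script would open the config file.
open my $fh, '<', \$config or die "Can't open in-memory file: $!";
my $data = parse_records($fh);
close $fh;

print $data->{1}{phone}, "\n";
```

Unlike the paragraph-mode approach, this does not depend on blank lines between records, and key=value lines outside any record are silently ignored.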

perl: how to parse an xml file sequentially

I have an XML file which describes the data-structure that I can exchange on a UDP channel.
For example:
Here is my input XML file describing my data-structure.
<ds>
<uint32 name='a'/>
<uint32 name='b'/>
<string name='c'/>
<int16 name='d'/>
<uint32 name='e'/>
</ds>
Parsing this XML file using Perl's XML::Simple allows me to generate the following hash:
$VAR1 = {
'uint32' => {
'e' => {},
'a' => {},
'b' => {}
},
'int16' => {
'name' => 'd'
},
'string' => {
'name' => 'c'
}
};
As you can see, after parsing I have no way to figure out what will be the relative position of field 'e' relative to the start of the datastructure.
I would like to find out offsets of each of these elements.
I tried searching for a perl XML parser which allows me to parse an XML file sequentially, something like a 'getnexttag()' kind of a functionality, but could not find any.
What is the best way to do this programmatically? If not perl, then which other language is best suited to do this work?

You'll need to use a streaming parser with the appropriate callbacks. This will also improve parsing speed (and reduce memory consumption, if done correctly) when it comes to larger sets of data, which is a good thing.
I recommend XML::SAX; an introduction to the module is available under the following link:
XML::SAX::Intro
Provide callbacks for start_element, this way you can read the value of each element one at a time.
Could you write me an easy example?
Yes, and I already have! ;-)
The below snippet will parse the data OP provided and print the name of each element, as well as the attributes key/value.
It should be quite easy to comprehend but if you got any questions feel free to add them as a comment and I'll update this post with more detailed information.
use warnings;
use strict;
use XML::SAX;
my $parser = XML::SAX::ParserFactory->parser(
Handler => ExampleHandler->new
);
$parser->parse_string (<<EOT
<ds>
<uint32 name='a'/>
<uint32 name='b'/>
<string name='c'/>
<int16 name='d'/>
<uint32 name='e'/>
</ds>
EOT
);
# # # # # # # # # # # # # # # # # # # # # # # #
package ExampleHandler;
use base ('XML::SAX::Base');
sub start_element {
my ($self, $el) = @_;
print "found element: ", $el->{Name}, "\n";
for my $attr (values %{$el->{Attributes}}) {
print " '", $attr->{Name}, "' = '", $attr->{Value}, "'\n";
}
print "\n";
}
output
found element: ds
found element: uint32
'name' = 'a'
found element: uint32
'name' = 'b'
found element: string
'name' = 'c'
found element: int16
'name' = 'd'
found element: uint32
'name' = 'e'
I'm not satisfied with XML::SAX, are there any other modules available?
Yes, there are plenty to choose from. Read the following list and choose the one that you find fitting for your specific problem:
perl-xml.sourceforge.net/faq - cpan modules
What is the difference between different methods of parsing?
I also recommend reading the following FAQ regarding XML parsing. It covers the pros and cons of using a tree parser (such as XML::Parser::Simple) or a streaming parser:
Perl-XML Frequently Asked Questions, Tree VS Stream

It most certainly is possible with Perl.
Here's an example with XML::LibXML :
use strict;
use warnings;
use feature 'say';
use XML::LibXML;
my $xml = XML::LibXML->load_xml( location => 'test.xml' );
my ( $dsNode ) = $xml->findnodes( '/ds' );
my @kids = $dsNode->nonBlankChildNodes; # The indices of this array will
# give the offset
my $first_kid = shift @kids; # Pull off the first kid
say $first_kid->toString; # "<uint32 name='a'/>"
my $second = $first_kid->nextNonBlankSibling();
my $third = $second->nextNonBlankSibling();
say $third->toString; # "<string name="c"/>"

Here is an example using XML::Twig
use XML::Twig;
XML::Twig->new( twig_handlers => { 'ds/*' => \&each_child } )
->parse( $your_xml_data );
sub each_child {
my ($twig, $child) = @_;
printf "tag %s : name = %s\n", $child->name, $child->{att}->{name};
}
This outputs:
tag uint32 : name = a
tag uint32 : name = b
tag string : name = c
tag int16 : name = d
tag uint32 : name = e
