Trouble pinpointing child elements while using Mojo::DOM - perl

I'm trying to extract text from an old vBulletin forum using WWW::Mechanize and Mojo::DOM.
vBulletin doesn't use HTML and CSS for semantic markup, and I'm having trouble using Mojo::DOM->children to get at certain elements.
These vBulletin posts are structured differently depending on their content.
Single message:
<div id="postid_12345">The quick brown fox jumps over the lazy dog.</div>
Single message quoting another user:
<div id="postid_12345">
<div>
<table>
<tr>
<td>
<div>Quote originally posted by Bob</div>
<div>Everyone knows the sky is blue.</div>
</td>
</tr>
</table>
</div>
I disagree with you, Bob. It's obviously green.
</div>
Single message with spoilers:
<div id="postid_12345">
<div class="spoiler">Yoda is Luke's father!</div>
</div>
Single message quoting another user, with spoilers:
<div id="postid_12345">
<div>
<table>
<tr>
<td>
<div>Quote originally posted by Fred</div>
<div class="spoiler">Yoda is Luke's father!</div>
</td>
</tr>
</table>
</div>
<div class="spoiler">No waaaaay!</div>
</div>
Assuming the above HTML and an array packed with the necessary post IDs:
for (@post_ids) {
$mech->get($full_url_of_specific_forum_post);
my $dom = Mojo::DOM->new($mech->content);
my $div_id = '#postid_' . $_;
say $dom->at($div_id)->children('div')->first;
say $dom->at($div_id)->text;
}
Using $dom->at($div_id)->all_text gives me everything in an unbroken line, which makes it difficult to tell what's quoted and what's original in the post.
Using $dom->at($div_id)->text skips all of the child elements, so quoted text and spoilers are not picked up.
I've tried variations of $dom->at($div_id)->children('div')->first, but this gives me everything, including the HTML.
Ideally, I'd like to be able to pick up all the text in each post, with each child element on its own line, e.g.
POSTID12345:
+ Quote originally posted by Bob
+ Everyone knows the sky is blue.
I disagree with you, Bob. It's obviously green.
I'm new to Mojo and rusty with Perl. I wanted to solve this on my own, but after looking over the documentation and fiddling with it for a few hours, my brain is mush and I'm at a loss. I'm just not getting how Mojo::DOM and Mojo::Collection work.
Any help will be greatly appreciated.

Looking at the source of Mojo::DOM, the all_text method basically walks the DOM recursively and extracts all the text. Use that source as a model for your own DOM-walking function. Its recursive function returns a single string; yours could return an array with whatever context you need.
EDIT:
After some discussion on IRC, the web scraping example in the cookbook has been updated; it might help guide you. http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
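The "walk the DOM and return an array" idea can be sketched outside Mojo::DOM as well. The following is a minimal illustration in Python using only the standard library's html.parser, not the Mojo::DOM API; the names PostTextCollector and post_lines are made up for this example, and it assumes the markup is reasonably well balanced. Text sitting directly inside the post div is kept as-is, and text from nested elements (quotes, spoilers) gets a "+ " marker.

```python
from html.parser import HTMLParser

# Tags that never get a closing tag, so they must not change the depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PostTextCollector(HTMLParser):
    """Collect every text node inside one post div, one entry per node."""

    def __init__(self, post_id):
        super().__init__()
        self.post_id = post_id
        self.depth = 0   # 0 = outside the post, 1 = directly inside it
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.post_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())
        if self.depth and text:
            # Text in nested elements (quotes, spoilers) gets a "+ " marker.
            self.lines.append(("+ " if self.depth > 1 else "") + text)


def post_lines(html, post_id):
    """Return the post's text as a list of lines (hypothetical helper)."""
    parser = PostTextCollector(post_id)
    parser.feed(html)
    parser.close()
    return parser.lines


SAMPLE = """
<div id="postid_12345">
<div><table><tr><td>
<div>Quote originally posted by Bob</div>
<div>Everyone knows the sky is blue.</div>
</td></tr></table></div>
I disagree with you, Bob. It's obviously green.
</div>
"""

for line in post_lines(SAMPLE, "postid_12345"):
    print(line)
```

Returning a list rather than one joined string is what keeps quoted and original text distinguishable; the same structure should carry over to a recursive Perl function that walks child nodes.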

There is a module for flattening an HTML tree: HTML::Linear.
The explanation of why you would flatten an HTML tree is a bit long and boring, so here's a picture showing the output of the xpathify tool, which is bundled with that module:
As you can see, the HTML tree nodes become a single key/value list, where the key is the XPath for that node and the value is the node's text attribute.
In a few keystrokes, this is how you use HTML::Linear:
#!/usr/bin/env perl
use strict;
use utf8;
use warnings;
use Data::Printer;
use HTML::Linear;
my $hl = HTML::Linear->new;
$hl->parse_file(q(vboard.html));
for my $el ($hl->as_list) {
my $hash = $el->as_hash;
next unless keys %{$hash};
p $hash;
}
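To make the path/text idea concrete without the picture, here is a rough sketch in Python (standard library only). It is not HTML::Linear and does not reproduce its output format; PathFlattener and flatten are hypothetical names, and the sketch assumes balanced tags. Each text node is paired with a positional path to the element that contains it.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they are skipped when building paths.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathFlattener(HTMLParser):
    """Turn an HTML fragment into a flat list of (path, text) pairs."""

    def __init__(self):
        super().__init__()
        self.path = []        # current path, e.g. ["div[1]", "div[2]"]
        self.counts = [{}]    # per-level counters of child tag names
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        seen = self.counts[-1]
        seen[tag] = seen.get(tag, 0) + 1
        self.path.append(f"{tag}[{seen[tag]}]")
        self.counts.append({})

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.path:
            return
        self.path.pop()
        self.counts.pop()

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.pairs.append(("/" + "/".join(self.path), text))


def flatten(html):
    """Return (path, text) pairs for every non-empty text node."""
    parser = PathFlattener()
    parser.feed(html)
    parser.close()
    return parser.pairs


POST = (
    '<div id="postid_12345">'
    '<div class="spoiler">Yoda is Luke\'s father!</div>'
    '<div class="spoiler">No waaaaay!</div>'
    "Plain reply."
    "</div>"
)

for path, text in flatten(POST):
    print(path, "=>", text)
```

Because every value carries its own path, deciding what is quoted and what is original becomes a matter of matching on the path string.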

Related

Html coders + zend form

I use Zend Framework, but I don't use the Zend_Form component on any commercial projects.
HTML and JS coders work on the frontend and I don't want anything to do with it; I just hook the final view into my ZF-based application.
So for example I have in view a html form which looks like this:
http://pastie.org/1668143
The form design, elements and classes can be changed by the frontend coders, so I want to know whether there is an easy way for me to use Zend_Form with this approach (the form code looks different in each project, of course).
My main problem here is code duplication in my controllers, which in this case looks something like this:
http://pastie.org/1668023 (please don't take the exceptions and loggedMember seriously; they are only in this snippet to demonstrate the problem I have)
So what would be the best solution to this problem? What do you think? :)
If you have absolutely no control over the form's html structure, but still want to maximize the use of Zend_Form's features, use Zend_Form_Decorator_ViewScript.
http://framework.zend.com/manual/en/zend.form.standardDecorators.html (last section)
$element->setDecorators(array(array('ViewScript', array(
'viewScript' => '_element.phtml',
'class' => 'form element'
))));
I would do it like this:
create a form class that has all elements, validators and filters
create an instance of the form in your action and set the view script(s); this way you can change them per controller and still have very little duplicated definition code.
Splendid, I don't understand why you would have a problem with code duplication. In your second link you perform your checks, then check whether the request is a POST, then perform the checks again. Yes, that is duplicated, but I don't understand what you are trying to show by doing this.
As for the form, it's up to you how you use it. You could create the form object and, instead of ever outputting it, simply pass it the data from your designers' form and use it for validation.
Or you could use custom templates for the form. That gives the designers a little less freedom than letting them design the form while you sort out the results, but they can still do their best with it.
This is the setup I use. After all, as the programmer I am in charge of the functionality; the designers just make what the user sees look good.
So for example, if I want to create an input element I can:
$arrival_time = $this->createElement('text', 'arrival_time', array(
'required' => true,
'label' => 'Arrival Time:',
));
$arrival_time->removeDecorator('HtmlTag');
$this->addElement($arrival_time);
Notice I have removed the HtmlTag decorators here - I don't need them for the markup as the designers will be arranging things for me.
Next thing to do is tell the form to use the layout the designers have made:
$this->setDecorators(array(array('ViewScript', array('viewScript' => 'NewArrivalsForm.phtml'))));
Here my template is within the views/scripts directory.
Now the designers have a few options. They could do:
<table border="0" width="100%">
<tr>
<td>
<?php echo $this->element->arrival_time; ?>
</td>
This will give you the following output:
<td>
<dt id="arrival_time"><label for="arrival_time" class="required">Arrival Time:</label></dt>
<input type="text" name="arrival_time" id="arrival_time" value="" />
</td>
If there were an error, that would be presented as well. You could also remove the 'Label', 'Description' and 'Errors' decorators to leave just an input box:
<td>
<input type="text" name="arrival_time" id="arrival_time" value="" />
</td>
Even once you have removed the decorators, the designers could still use for example:
<tr>
<td>
<?php echo $this->element->time_on_site->getLabel(); ?>
</td>
</tr>
<tr>
<td>
<?php echo $this->element->time_on_site ?>
</td>
This will allow them to lay the form out exactly as they want to, while still allowing you to use the full power of Zend_Form for your final validation checks. Take a look inside Zend/Form/Element.php for all the methods you and your designers can use on a form element.

customizing size of form fields created using field.custom.widget

I used the following code to generate the form in the attached image.
Is it possible to change the size of the fields in the form?
I want to decrease the size of the Estimated Time input field and of the drop-down field to the right of it.
{{=form.custom.begin}}
<table>
<tr>
<td><b>Type :</b></td><td><div>{{=form.custom.widget.type}}</div></td>
</tr><tr>
<td><b>Title :</b></td><td><div>{{=form.custom.widget.title}}</div></td>
</tr><tr>
<td><b>Description :</b></td><td><div>{{=form.custom.widget.description}}</div></td>
</tr><tr>
<td><b>Estimated Time :</b></td><td><div>{{=form.custom.widget.estimated_time}}{{=form.custom.widget.estimated_time_unit}}</div></td>
</tr>
<tr>
<td></td><td><div align='center'>{{=form.custom.submit}}</div></td>
</tr>
</table>
{{=form.custom.end}}
Yes, you can, and there are many ways.
The recommended way is to look at the generated HTML. You will find it follows a naming convention described in the book. You can use CSS to change the look and feel of every widget.
input[name=estimated_time] { width:50px }
Similarly, you can use JS/jQuery (I would recommend you do this in the view).
jQuery(document).ready(function(){ jQuery('input[name=estimated_time]').css('width','50px');});
You can also use jQuery-like syntax in Python in the controller:
form.element('input[name=estimated_time]')['_style']='width:50px'

Why are some of my tags being removed (GWT)?

I'm adding an element to a document with the following:
Element parent = getParentElement(); // Returns the right thing.
HTML html = new HTML();
html.setHTML( "<td>BLAH</td>" );
parent.appendChild( html.getElement() );
When I view the resulting document with Firebug, though, the parent's child looks like this:
<div class="gwt-HTML"> BLAH </div>
I can use Firebug to add the <td> elements manually, and then all my formatting applies, etc. Does anyone know why the HTML element seems to be removing my <td> tags?
It turns out that it's Firefox that's stripping them out. If I just use plain old JavaScript to create a div, or a tr, and set innerHTML to <td>BLAH</td>, the tags still get stripped. A couple of others have noticed this as well: http://www.jtanium.com/2009/10/28/firefox-gotcha-innerhtml-strips-td-tags/
If I use JavaScript to create a <table> tag and add it to the DOM, I can then place the <td> in that. Of course, it helpfully creates a <tbody><tr> for me as well, so I'm not really getting back what I put in....

xPath Groupings how?

OK, so I'm learning XPath for a basic application that effectively scrapes data from another website.
I need to find out each person's country/suburb/area.
In some instances you get Australia/Victoria/Melbourne, for example.
Others may just be Australia/Melbourne.
Or even just Melbourne or just Australia.
I'm currently able to read the code below and pull out all of the information with the XPath //table/tr/td/table/tr/td/font/a. This returns every entry, but what I really want is to group each person's entries separately.
I hope someone out there on planet Earth understands what I just tried to explain and can help.
Good day!
The source document contains data like this:
<tr>
<td>
<font face="arial" size="2">
<strong>Location:</strong>
Australia,
<a href='http://maps.google.com/maps?q=Australia%20Victoria'target="mapblast" style='text-decoration:none'>Victoria</a>,
<a href='http://maps.google.com/maps?q=Australia%20Melbourne%20Victoria'target="mapblast" style='text-decoration:none'>Melbourne</a>
</font>
</td>
</tr>
To find each person's record, the XPath query is //table/tr/td/table/tr/td/font, or you could use //td/font[strong = 'Location:']. This will return a collection containing 1 element for each person.
To find the a elements under a particular font element, evaluate the relative XPath a with that font element as the context node. The same can be done by iterating over the element's children collection.
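The two-step query described above (select one font element per person, then select its a children relative to it) can be sketched as follows. This is only an illustration using Python's xml.etree.ElementTree, which supports a limited XPath subset and needs well-formed input; the sample is a tidied-up version of the markup above, and the second record, its URL, and the name locations_per_person are made up for the example.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the page; the second record is invented.
SAMPLE = """
<table><tr><td><table>
  <tr><td><font face="arial" size="2">
    <strong>Location:</strong>
    Australia,
    <a href="http://maps.google.com/maps?q=Australia%20Victoria">Victoria</a>,
    <a href="http://maps.google.com/maps?q=Australia%20Melbourne%20Victoria">Melbourne</a>
  </font></td></tr>
  <tr><td><font face="arial" size="2">
    <strong>Location:</strong>
    <a href="http://maps.google.com/maps?q=Melbourne">Melbourne</a>
  </font></td></tr>
</table></td></tr></table>
"""


def locations_per_person(xml_text):
    """Return one list of link texts per person record."""
    root = ET.fromstring(xml_text)
    groups = []
    # Step 1: one font element per person.
    for font in root.findall(".//td/font[strong='Location:']"):
        # Step 2: relative query for the a children of this font only.
        groups.append([a.text for a in font.findall("a")])
    return groups


for group in locations_per_person(SAMPLE):
    print("/".join(group))
```

Note that plain text such as "Australia," is not inside an a element, so it is not captured by the relative query; it would have to be read from the text following the strong element. A real scraped page is unlikely to be well-formed XML, so an HTML-aware parser would be needed in practice.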

php - splitting a string with HTML by the first instance of a table cell

I am checking HTML content on a page, and I've split it down so that the variable is left with this content:
">
<td>Oklahoma City</td>
<td>Oklahoma</td>
<td>OK</td>
<td>405</td>
<td>CST</td>
</tr>
</table>
<div id="
These are dynamic pages, so the data will always be different, but the layout stays the same.
How can I get the value out of the second <td> if that HTML is in one variable (a string)?
It was a full page. I used explode twice to remove everything above a div and everything below the last div id, so some open HTML tags are left over, because I did not know how to get rid of them along the way and be left with just this:
<td>Oklahoma City</td>
<td>Oklahoma</td>
<td>OK</td>
<td>405</td>
<td>CST</td>
</tr>
</table>
Can you tell me how to get that out? I just need the second one because it is the county and that is what I'm checking on...