Extracting text and HTML from the contents of an element using Web::Scraper - perl

Given the following HTML snippet:
<span itemprop="streetAddress">59 Court St.<br>City Hall</span>
I want to extract the contents of the span including the <br> tag. I can extract the text with the following:
process 'span[itemprop="streetAddress"]', address => 'TEXT';
But this leaves out the <br> tag.

OK, I cracked open the source code of Web::Scraper and saw that you can set the value of the second argument to 'RAW', like so:
process 'span[itemprop="streetAddress"]', address => 'RAW';

Related

tinymce <span> gets removed when containing <br />

I'm using Tiny 4.9.10 to dynamically generate reports based on templates. Users can create templates which contain placeholders. These placeholders then get swapped out for their actual values when generating the actual report. The placeholders get their style (including font, which is the main issue here) from their enclosing <span>-tag.
When replacing the placeholder with their actual value, we use <br />-tags to insert new lines, since some of the placeholders are almost full reports on their own which need to be structured.
After the placeholders have all been replaced, we inject this dynamically generated content back into a Tiny editor, so as to allow users to make ad hoc changes to the content.
At this point however we noticed that the <span>-tag around a piece of generated content containing <br />-tags gets removed. This is a problem, because the style info that was enclosed in this tag gets removed as well, resulting in problems further down the line when generating a PDF.
What I've tried to work around this:
setting verify_html to false
adding +span[br]/+span[br /] to valid_children
setting forced_root_block to div
The first two options did nothing to help me, and while the last one looked promising, it didn't help, because even when using <div>, font info gets enclosed into a child <span>.
I know this is expected behavior, because <span> is an inline tag and so it shouldn't have <br /> tags as children, but I'm currently at a loss for a workaround which allows me to include <br /> tags into my dynamically generated content without losing the style (most importantly the font) of the parent tag.
So I solved this by replacing the <span> tags with <div> tags when we swap out the placeholders, using a regex that looks for spans enclosing a <p>...</p> or a <br />. This stops Tiny from throwing away the wrapper, and the style it carries, when it contains either of these tags.
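The regex swap described above might look roughly like the following sketch. The function name spansToDivs and both patterns are assumptions for illustration, not the code actually used; it also assumes spans are not nested inside each other, since a regex cannot reliably match nested tags.

```javascript
// Hypothetical helper: turn <span ...>...</span> into <div ...>...</div>
// when the span's content contains a <br> or a <p> element.
// Assumes spans are not nested inside other spans.
function spansToDivs(html) {
  // Capture the attribute string and the inner HTML of each span.
  const spanPattern = /<span\b([^>]*)>([\s\S]*?)<\/span>/gi;
  // Matches <br>, <br/>, <br /> or the start of a <p> tag.
  const blockChild = /<br\s*\/?>|<p[\s>]/i;
  return html.replace(spanPattern, (match, attrs, inner) =>
    blockChild.test(inner) ? `<div${attrs}>${inner}</div>` : match
  );
}

const sample = '<span style="font-family: Arial">Line 1<br />Line 2</span>';
console.log(spansToDivs(sample));
```

Spans without a <br /> or <p> inside are returned untouched, so ordinary inline styling keeps its <span> wrapper.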
TinyMCE considers the <span> <br /> </span> construct an empty space and deletes it in favor of optimization.
I may be late, but you can also try using this callback in the setup option to stop the editor from removing empty spans:
setup: function(editor) {
  editor.on('PreInit', function() {
    editor.schema.getElementRule('span').removeEmpty = false;
  });
}

ValidForm Builder file type validation

Does anybody know how to validate the file type of a file input?
I have modified (hard-coded) the input in class.vf_file.php:
$strOutput .= "<input accept=\".pdf,.doc,.docx\" etc----/>\n";
This helps with Google Chrome, but Safari and Firefox ignore the modification.
Preventing users from submitting the form if any other type of file is detected would be the ideal solution.
Thank you
I would recommend using a third party file uploading library like Plupload. We always use ValidForm Builder together with Plupload; works like a charm.
However you can use the meta array to implement custom attributes in the <input> tag without having to hardcode anything:
$objForm->addField(
    "upload-document",
    "Upload Document",
    ValidForm::VFORM_FILE,
    array(), // Validation array
    array(), // Error handling array
    array( // Meta array
        "fielddata-extensions" => "pdf,doc,docx"
    )
);
By prefixing meta keys with 'field', you add that specific meta to the <input> field itself instead of its wrapping <div class='vf__optional'></div>.
A field defined this way will output markup like the following (shown here for a field named upload-image):
<div class="vf__optional">
    <label for="upload-image">Upload Image</label>
    <input type="hidden" name="MAX_FILE_SIZE" value="2097152">
    <input type="file" value="" name="upload-image[]" id="upload-image" class="vf__file" data-extensions="pdf,doc,docx">
</div>
So using a combination of meta and a third party file upload handler, you can actually do pretty cool stuff.
That being said -- I must admit that the file upload field didn't get as much attention as the other field types lately.
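As a rough sketch of how the data-extensions attribute could be used on the client side without a third party library, the check below compares each selected file name with the list. The function name hasAllowedExtension and the submit wiring are assumptions for illustration, not part of ValidForm Builder; the element id upload-image is taken from the output above.

```javascript
// Hypothetical helper: check a file name against a comma-separated
// extension list such as the data-extensions value "pdf,doc,docx".
function hasAllowedExtension(fileName, extensionList) {
  const allowed = extensionList
    .split(",")
    .map((ext) => ext.trim().toLowerCase())
    .filter((ext) => ext.length > 0);
  const dot = fileName.lastIndexOf(".");
  // No dot, or a leading dot only (e.g. ".htaccess"): treat as no extension.
  if (dot <= 0) return false;
  return allowed.includes(fileName.slice(dot + 1).toLowerCase());
}

// Browser-only wiring (skipped when no DOM is available):
// block the submit if any selected file has a disallowed extension.
if (typeof document !== "undefined") {
  const input = document.getElementById("upload-image");
  if (input && input.form) {
    input.form.addEventListener("submit", (event) => {
      const list = input.dataset.extensions || "";
      const ok = Array.from(input.files).every((file) =>
        hasAllowedExtension(file.name, list)
      );
      if (!ok) event.preventDefault();
    });
  }
}
```

A client-side check like this only improves the user experience; it can be bypassed, so the file type still has to be verified on the server.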

rendering description saved using tinymce editor

I have a problem with pagination.
In my website, I have a function productList which generates a paginated list of products.
The code for the paginate function is shown below:
$this->paginate = array(
    'conditions' => array($otherconditions, $statuscondition, $active_condition, $catcondition),
    'order' => $order,
    'limit' => 10
);
In the view file, the paginator helper is used as follows:
<div class="paginator">
    <span class="info">
        <?php echo $this->Paginator->counter(array('format' => __('Page <strong>{:page}</strong> of <strong>{:pages}</strong>')), array('escape'=>false));
        ?>
    </span>
    <ul>
        <?php echo $this->Paginator->numbers(array('separator' => '', 'tag'=>'li'));?>
    </ul>
</div>
Suppose there are 150 products; with a limit of 10, there will be 15 pages.
The problem is that the page numbers are not displayed on the last pages, such as page 15. That is, when I click on page number 15, the products are displayed but the paginator counter and the page number links are not.
I have looked all over the net for a solution but could not find one. Please guide me.
PS: the paginate variable depends on my parameters, like the selected category and other conditions.
I don't think it's a problem with the paginate syntax, because the pagination works for a small number of pages, like 2 or 3.
I think the problem is with the paginator helper.
Thanks in advance.
I found the following when I inspected the HTML. The page numbers and counter were automatically commented out inside the following comment:
<!--[if gte mso 9]><xml>
<w:WordDocument>
<w:View>Normal</w:View>
<w:Zoom>0</w:Zoom>
<w:TrackMoves/>
<w:TrackFormatting/>
<w:PunctuationKerning/>
<w:ValidateAgainstSchemas/>
<w:SaveIfXMLInvalid>false</w:SaveIfXMLInvalid>
<w:IgnoreMixedContent>false</w:IgnoreMixedContent>
<w:AlwaysShowPlaceholderText>false</w:AlwaysShowPlaceholderText>
<w:DoNotPromoteQF/>
<w:LidThemeOther>EN-US</w:LidThemeOther>
<w:LidThemeAsian>X-NONE</w:LidThemeAsian>
<w:LidThemeComplexScript>X-NONE</w:LidThemeComplexScript>
<w:Compatibility>
<w:BreakWrappedTables/>
<w:SnapToGridInCell/>
<w:WrapTextWithPunct/>
< ...</p></div>
After this comes the HTML for the page counter and the page number links, and also the script files. What is happening? I looked up its meaning and found that mso 9 refers to Microsoft Office 9 or something, which does not make much sense here. Please guide me.
I solved this by putting a closing comment marker before the closing p and div tags, so that the comment opened by <!--[if gte mso 9] > closes before </p></div>.
I also used the PHP function strip_tags to strip all tags from the body of text.
This is a duct-tape method. The real problem, I found, lies in the MCE editor. When a user copies and pastes a paragraph from a Word document into an MCE editor running in Firefox, the above-mentioned conditional comment is added to the text. So, when the user's text is displayed, the conditional comment comments out the rest of the HTML after the text.
It would be better to use an MCE editor callback to remove the conditional comment, which I will be working on now.
Hope this helps someone...
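A minimal sketch of the cleanup step described above, which could run on the pasted text before it is saved or rendered. The function name stripConditionalComments and both patterns are assumptions for illustration, not an editor API. It assumes the input is only the user's description, because an unterminated conditional comment is removed through to the end of the string.

```javascript
// Hypothetical helper: remove Word-style conditional comments such as
// <!--[if gte mso 9]> ... <![endif]--> from a string of HTML.
function stripConditionalComments(html) {
  // Well-formed blocks: opener, any content, matching <![endif]-->.
  const closed = /<!--\[if[^\]]*\]\s*>[\s\S]*?<!\[endif\]\s*-->/gi;
  // An opener with no <![endif]--> would comment out everything after it,
  // so everything from the opener to the end of the string is dropped.
  const unclosed = /<!--\[if[^\]]*\]\s*>[\s\S]*$/i;
  return html.replace(closed, "").replace(unclosed, "");
}

const pasted =
  "Product description<!--[if gte mso 9]><xml><w:WordDocument>" +
  "<w:View>Normal</w:View></w:WordDocument></xml><![endif]--> continues here.";
console.log(stripConditionalComments(pasted));
```

Applying this to the stored description before output means the page counter and page number links that follow it can no longer be swallowed by a pasted comment.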

Trouble pinpointing child elements while using Mojo::DOM

I'm trying to extract text from an old vBulletin forum using WWW::Mechanize and Mojo::DOM.
vBulletin doesn't use HTML and CSS for semantic markup, and I'm having trouble using Mojo::DOM->children to get at certain elements.
These vBulletin posts are structured differently depending on their content.
Single message:
<div id="postid_12345">The quick brown fox jumps over the lazy dog.</div>
Single message quoting another user:
<div id="postid_12345">
    <div>
        <table>
            <tr>
                <td>
                    <div>Quote originally posted by Bob</div>
                    <div>Everyone knows the sky is blue.</div>
                </td>
            </tr>
        </table>
    </div>
    I disagree with you, Bob. It's obviously green.
</div>
Single message with spoilers:
<div id="postid_12345">
    <div class="spoiler">Yoda is Luke's father!</div>
</div>
Single message quoting another user, with spoilers:
<div id="postid_12345">
    <div>
        <table>
            <tr>
                <td>
                    <div>Quote originally posted by Fred</div>
                    <div class="spoiler">Yoda is Luke's father!</div>
                </td>
            </tr>
        </table>
    </div>
    <div class="spoiler">No waaaaay!</div>
</div>
Assuming the above HTML and an array packed with the necessary post IDs:
for (@post_ids) {
    $mech->get($full_url_of_specific_forum_post);
    my $dom = Mojo::DOM->new($mech->content);
    # CSS ID selector for the post's outer div, e.g. "#postid_12345"
    my $div_id = '#postid_' . $_;
    say $dom->at($div_id)->children('div')->first;
    say $dom->at($div_id)->text;
}
Using $dom->at($div_id)->all_text gives me everything in an unbroken line, which makes it difficult to tell what's quoted and what's original in the post.
Using $dom->at($div_id)->text skips all of the child elements, so quoted text and spoilers are not picked up.
I've tried variations of $dom->at($div_id)->children('div')->first, but this gives me everything, including the HTML.
Ideally, I'd like to be able to pick up all the text in each post, with each child element on its own line, e.g.
POSTID12345:
+ Quote originally posted by Bob
+ Everyone knows the sky is blue.
I disagree with you, Bob. It's obviously green.
I'm new to Mojo and rusty with Perl. I wanted to solve this on my own, but after looking over the documentation and fiddling with it for a few hours, my brain is mush and I'm at a loss. I'm just not getting how Mojo::DOM and Mojo::Collection work.
Any help will be greatly appreciated.
Looking at the source of Mojo::DOM, the all_text method basically walks the DOM recursively and extracts all text. Use that source as a model for writing your own DOM-walking function. Its recursive function depends on returning a single string; yours might return an array with whatever context you need.
EDIT:
After some discussion on IRC, the web scraping example has been updated; it might help guide you. http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
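The idea of a walker that returns an array instead of a single string can be sketched independently of Mojo::DOM. The JavaScript sketch below uses a made-up plain-object tree; the node shape and the names collectLines and formatPost are assumptions for illustration, not Mojo::DOM API. It records the nesting depth of every piece of text, so text that sits directly in the post can be told apart from text inside child elements.

```javascript
// Assumed node shape, for illustration only:
//   text node:    { text: "..." }
//   element node: { tag: "div", children: [ ... ] }

// Walk the tree depth-first and record each non-empty text node
// together with how deeply it is nested.
function collectLines(node, depth = 0, lines = []) {
  if (typeof node.text === "string") {
    const trimmed = node.text.trim();
    if (trimmed) lines.push({ depth, text: trimmed });
    return lines;
  }
  for (const child of node.children || []) {
    collectLines(child, depth + 1, lines);
  }
  return lines;
}

// Text directly inside the post is at depth 1; anything deeper came from
// a child element (quote, spoiler) and gets a "+ " prefix.
function formatPost(post) {
  return collectLines(post).map((l) => (l.depth > 1 ? "+ " : "") + l.text);
}

// Simplified version of the "quoting another user" post from the question.
const post = {
  tag: "div",
  children: [
    {
      tag: "div",
      children: [
        { tag: "div", children: [{ text: "Quote originally posted by Bob" }] },
        { tag: "div", children: [{ text: "Everyone knows the sky is blue." }] },
      ],
    },
    { text: " I disagree with you, Bob. It's obviously green. " },
  ],
};

console.log(formatPost(post).join("\n"));
```

The same structure carries over to Perl: a recursive sub that pushes onto an array reference instead of concatenating a string.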
There is a module for flattening an HTML tree: HTML::Linear.
The explanation of the purpose of flattening an HTML tree is a bit long and boring, so here's a picture showing the output of the xpathify tool, which is bundled with that module:
[Picture: output of the xpathify tool]
As you can see, HTML tree nodes become a single key/value list, where the key is the XPath for that node and the value is the node's text attribute.
In a few keystrokes, this is how you use HTML::Linear:
#!/usr/bin/env perl
use strict;
use utf8;
use warnings;
use Data::Printer;
use HTML::Linear;
my $hl = HTML::Linear->new;
$hl->parse_file(q(vboard.html));
for my $el ($hl->as_list) {
    my $hash = $el->as_hash;
    next unless keys %{$hash};
    p $hash;
}

php - splitting a string with HTML by the first instance of a table cell

I am checking HTML content on my page, and I've got the split down so that the variable is left with this content:
">
<td>Oklahoma City</td>
<td>Oklahoma</td>
<td>OK</td>
<td>405</td>
<td>CST</td>
</tr>
</table>
<div id="
Those are dynamic pages I'm checking, so the data will always be different, but the layout is the same...
How can I get the value out of the second <td> if that HTML is in one variable (a string)?
It was a full page; I used explode twice to remove everything above a div field and everything below the last div field id, so it still has some open HTML tags left, because I did not know how to get rid of them along the way to be left with just this:
<td>Oklahoma City</td>
<td>Oklahoma</td>
<td>OK</td>
<td>405</td>
<td>CST</td>
</tr>
</table>
Can you tell me how to get that out? I just need the second one because it is the county and that is what I'm checking on...