Is there a (already implemented) solution that decodes already encoded syntax from googles code-prettify?
<?php
$foo = "bar";
gets to
<pre class="prettyprint prettyprinted"><span class="pun"><?</span><span class="pln">php
$foo </span><span class="pun">=</span><span class="pln"> </span><span class="str">"bar"</span><span class="pun">;</span></pre>
Now I want decode it again to
<?php
$foo = "bar";
The textContent DOM property will let you access the plain text content of a DOM node, so just get the prettified <pre> element.
document.getElementsByClassName('prettyprinted')[0].textContent
Since textContent just concatenates text nodes, it works for prettyprinted <pre> and <xmp> elements and <code> elements that have CSS style whitespace:pre.
It should work for elements with linenums where prettify inserts an <li> element per line because the text still contains the original line-breaks, but I would test that separately if you rely upon that.
Related
Given the following HTML snippet:
<span itemprop="streetAddress">59 Court St.<br>City Hall</span>
I want to extract the contents of the span including the <br> tag. I can extract the text with the following:
process 'span[itemprop="streetAddress"]', address => 'TEXT';
But this leaves out the <br> tag.
OK, I cracked open the source code Web::Scraper and saw that you can set the value of the second argument to 'RAW' like so:
process 'span[itemprop="streetAddress"]', address => 'RAW';
I'm trying to extract text from an old vBulletin forum using WWW::Mechanize and Mojo::DOM.
vBulletin doesn't use HTML and CSS for semantic markup, and I'm having trouble using Mojo::DOM->children to get at certain elements.
These vBulletin posts are structured differently depending on their content.
Single message:
<div id="postid_12345">The quick brown fox jumps over the lazy dog.<div>
Single message quoting another user:
<div id="postid_12345">
<div>
<table>
<tr>
<td>
<div>Quote originally posted by Bob</div>
<div>Everyone knows the sky is blue.</div>
</td>
</tr>
</table>
</div>
I disagree with you, Bob. It's obviously green.
</div>
Single message with spoilers:
<div id="postid_12345">
<div class="spoiler">Yoda is Luke's father!</div>
</div>
Single message quoting another user, with spoilers:
<div id="postid_12345">
<div>
<table>
<tr>
<td>
<div>Quote originally posted by Fred</div>
<div class="spoiler">Yoda is Luke's father!</div>
</td>
</tr>
</table>
</div>
<div class="spoiler">No waaaaay!</div>
</div>
Assuming the above HTML and an array packed with the necessary post IDs:
for (#post_ids) {
$mech->get($full_url_of_specific_forum_post);
my $dom = Mojo::DOM->new($mech->content);
my $div_id = 'postid_' . $_;
say $dom->at($div_id)->children('div')->first;
say $dom->at($div_id)->text;
}
Using $dom->at($div_id)->all_text gives me everything in an unbroken line, which makes it difficult to tell what's quoted and what's original in the post.
Using $dom->at($div_id)->text skips all of the child elements, so quoted text and spoilers are not picked up.
I've tried variations of $dom->at($div_id)->children('div')->first, but this gives me everything, including the HTML.
Ideally, I'd like to be able to pick up all the text in each post, with each child element on its own line, e.g.
POSTID12345:
+ Quote originally posted by Bob
+ Everyone knows the sky is blue.
I disagree with you, Bob. It's obviously green.
I'm new to Mojo and rusty with Perl. I wanted to solve this on my own, but after looking over the documentation and fiddling with it for a few hours, my brain is mush and I'm at a loss. I'm just not getting how Mojo::DOM and Mojo::Collections work.
Any help will be greatly appreciated.
Looking at the source of Mojo::DOM, basically the all_text method recursively walks the DOM and extracts all text. Use that source to write your own walking the DOM function. Its recursive function depends on returning a single string, in yours you might have it return an array with whatever context you need.
EDIT:
After some discussion on IRC, the web scraping example has been updated, it might help you guide you. http://mojolicio.us/perldoc/Mojolicious/Guides/Cookbook#Web_scraping
There is a module to flattern HTML tree, HTML::Linear.
The explanation of purpose for flatterning HTML tree is a bit long and boring, so here's a picture showing the output of the xpathify tool, bound with that module:
As you see, HTML tree nodes become single key/value list, where the key is the XPath for that node, and the value is the node's text attribute.
In a few keystrokes, this is how you use HTML::Linear:
#!/usr/bin/env perl
use strict;
use utf8;
use warnings;
use Data::Printer;
use HTML::Linear;
my $hl = HTML::Linear->new;
$hl->parse_file(q(vboard.html));
for my $el ($hl->as_list) {
my $hash = $el->as_hash;
next unless keys %{$hash};
p $hash;
}
I'm adding an element to a document with the following:
Element parent = getParentElement(); // Returns the right thing.
HTML html = new HTML();
html.setHTML( "<td>BLAH</td>" );
parent.appendChild( html.getElement() );
When I view the resulting document with FireBug though, the parent's child looks like this:
<div class="gwt-HTML"> BLAH </div>
I can use FireBug to add in the <td> elements manually, and all my formatting applies, etc. Does anyone know why the HTML element seems to be removing my <td> tags?
It turns out that it's FireFox that's stripping it out. If I just use plain old javascript to create a div, or a tr, and set innerHTML to be <td>BLAH</td>, it still gets stripped. A couple of others have noticed this as well: http://www.jtanium.com/2009/10/28/firefox-gotcha-innerhtml-strips-td-tags/
If I use javascript to create a <table> tag, and add it to the DOM, I can then place the <td> in that. Of course, it helpfully creates a <tbody><tr> for me as well, so I'm not really getting back what I put in....
Greetings.
Here is an XML object returned by my server in the responseXML object:
<tableRoot>
<table>
<caption>howdy!</caption>
<tr>
<td>hello</td>
<td>world</td>
</tr>
<tr>
<td>another</td>
<td>line</td>
</tr>
</table>
Now I attach this fragment to my document tree like so:
getElementById('entryPoint').appendChild(responseXML.firstChild.firstChild);
But instead of being rendered as a table, I get the following text:
howdy! helloworldanotherline
The same result occurs of I replace firstChild.firstChild with just firstChild.
It seems like I'm just getting the nodeValues, and all of the tags are stripped out?!
Am I fundamentally misunderstanding what the responseXML object is supposed to represent?
This works, BTW, if I take out the 'root' tags, and set innerHTML to responseText.
Can someone please enlighten me on the correct way to use responseXML?
You get the text instead of a table, because you use pure DOM for manipulations and your response XML doesn't have the namespaces declarations. So when appending an XML element browser doesn't know whether your "table" tag is from HTML, XUL, SVG or else from.
1) Add namespace declaration:
<table xmlns="http://www.w3.org/1999/xhtml">
2) Instead of directly inserting a reffered XML DOM Element, you should first import that node into your HTML DOM Document:
var element = document.importNode(responseXML.firstChild.firstChild, true);
document.getElementById('entryPoint').appendChild(element);
Hope this helps!
You can create an element at the position you want to insert and than do
element.innerHTML = request.responseText
I have the following Sightly expression:
<li data-sly-call="${linkTemplate.dynamicLink # section='education',
url='/en/life-career-events.html', text=${'comp.masthead.navigation.home' # i18n}}">
</li>
The dynamiclink template is as follows:
<div data-sly-template.dynamicLink="${# section, url, text}"
data-sly-use.membersNav="${'com.comp.cms.component.masthead.MembersNavigation' # section=section}">
<a data-sly-attribute.class="${membersNav.cssClass}" href="${url}">${text}</a>
</div>
This doesn't work because text=${'comp.masthead.navigation.home' # i18n} isn't evaluated as a string and then passed into the dynamiclink.
Is this possible? Can I evaluate and assign to a variable or do I have to create a new template when I want to evaluate i18n lookups?
Sightly 1.1 doesn't allow to have expressions within expressions, and there's no plan to change that for now.
Hacking the solution:
There's a trick: data-sly-test can be (ab)used to set variables. It's not really a recommended way to do though, unless you have a real condition, because this will mislead someone who reads the template in thinking that the intention was to have an actual condition.
The trick goes like that: an identifier can be provided to data-sly-test, which will expose the result of the test as a variable. Additionally, data-sly-test will be considered true, unless the resulting string is empty.
For e.g.:
<p data-sly-test.spanishAsset="${'Asset' # i18n, locale='es'}">${spanishAsset}</p>
Outputs:
<p>Recurso</p>
So in your case, you could write:
<li data-sly-test.linkText="${'comp.masthead.navigation.home' # i18n}"
data-sly-call="${linkTemplate.dynamicLink # section='education',
url='/en/life-career-events.html', text=linkText}">
</li>
A cleaner solution
As you probably don't want to explain to all the users of this template that they have to write such a hack, and instead of having two separated templates for translated and non-translated texts, you could instead leverage optional templates parameters. So you might for e.g. have a noI18n optional parameter.
The call would then be as simple as it can be:
<!--/* Translating would be the default behavior */-->
<li data-sly-call="${linkTemplate.dynamicLink #
section='education',
url='/en/life-career-events.html',
text='comp.masthead.navigation.home'}"></li>
<!--/* Or translating could be turned off */-->
<li data-sly-call="${linkTemplate.dynamicLink #
section='education',
url='/en/life-career-events.html',
text='my text...',
noI18n=true}"></li>
The template would then have two data-sly-test conditions for the two cases (note that the data-sly-unwrap attributes can be dropped in AEM 6.1+):
<div data-sly-template.dynamicLink="${# section, url, text, noI18n}"
data-sly-use.membersNav="${'com.comp.cms.component.masthead.MembersNavigation'
# section=section}">
<a href="${url}" data-sly-attribute.class="${membersNav.cssClass}">
<sly data-sly-test="${noI18n}" data-sly-unwrap>${membersNav.text}</sly>
<sly data-sly-test="${!noI18n}" data-sly-unwrap>${membersNav.text # i18n}</sly>
</a>
</div>
Optionally, to keep the template as simple as possible and to remove those conditions, you could also make the Use-API do the translation, depending on the noI18n optional parameter:
<div data-sly-template.dynamicLink="${# section, url, text, noI18n}"
data-sly-use.membersNav="${'com.comp.cms.component.masthead.MembersNavigation'
# section=section, noI18n=noI18n}">
<a href="${url}" data-sly-attribute.class="${membersNav.cssClass}">
${membersNav.text}
</a>
</div>
The proper code for the logic to translate a string is:
Locale pageLang = currentPage.getLanguage(false);
I18n i18n = new I18n(slingRequest.getResourceBundle(pageLang));
String text = i18n.get("Enter a search keyword");