Explicit HTML is incorrectly parsed by JSoup - macros

I'm using JSoup to parse HTML. It generally works well, but for one specific input the order of the elements in the HTML changes after parsing. Here is the code:
String str1 = originalHtmlFragment;
Document doc = Jsoup.parseBodyFragment(str1);
String str2 = doc.html();
Here are the values of str1 and str2.
str1:
<table>
<tbody>
<tr>
<th>
<p> </p>
<p>10</p>
</th>
</tr>
<tr>
<td colspan="1">
<p>
<ac:macro ac:name="my-macro">
<ac:parameter ac:name="outer-values">Page content</ac:parameter>
<ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter>
<ac:rich-text-body>
<p>a1</p>
</ac:rich-text-body>
</ac:macro>
</p>
</td>
</tr>
</tbody>
</table>
str2:
<html>
<head></head>
<body>
<table>
<tbody>
<tr>
<th>
<p> </p>
<p>10</p>
</th>
</tr>
<tr>
<td colspan="1">
<p>
<ac:macro ac:name="my-macro">
<ac:parameter ac:name="outer-values">Page content</ac:parameter>
<ac:parameter ac:name="atlassian-macro-output-type">INLINE</ac:parameter>
<ac:rich-text-body></ac:rich-text-body>
</ac:macro>
</p>
<p>a1</p>
<p>
</p>
</td>
</tr>
</tbody>
</table>
</body>
</html>
Note that a1 is outside ac:macro tag in the second code example.
How can I work around that in JSoup?

The String you are parsing is not valid HTML: ac:macro is not a standard HTML tag name, and HTML does not allow a <p> nested inside another <p>. JSoup's HTML parser tries to repair the markup, and in your case that repair is what moves <p>a1</p> out of the macro. If you can switch to the XML parser, which does not apply HTML repair rules, you get the results you expect:
Document doc = Jsoup.parse(str1,"",Parser.xmlParser());
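To see why an XML parser helps, here is a minimal sketch that parses a shortened version of the fragment as XML and checks that a1 is still inside ac:rich-text-body. It deliberately uses the JDK's built-in DOM parser rather than JSoup, so that it runs without extra libraries; it is an illustration of the principle, not a replacement for the Parser.xmlParser() call above. The class and method names (XmlNestingDemo, textInside) are made up for this example. Unlike JSoup's XML parser, the JDK parser requires well-formed input with a single root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlNestingDemo {
    // Returns the trimmed text found inside the first element named `tag`.
    static String textInside(String xml, String tag) {
        try {
            // The default factory is not namespace-aware, so the undeclared
            // "ac:" prefix is simply treated as part of the element name.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tag).item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not read <" + tag + "> from input", e);
        }
    }

    public static void main(String[] args) {
        // Shortened version of str1: a <p> nested inside the macro body.
        String fragment =
            "<p><ac:macro ac:name=\"my-macro\">"
          + "<ac:rich-text-body><p>a1</p></ac:rich-text-body>"
          + "</ac:macro></p>";
        // An XML parser keeps the nesting exactly as written.
        System.out.println(textInside(fragment, "ac:rich-text-body")); // prints: a1
    }
}
```

An HTML parser would instead close the outer <p> as soon as it reaches the inner one, which is the reordering shown in str2.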

Related

How to scrape all texts from <a href> to List with net.ruippeixotog.scalascraper

This is the html:
<tr class="countries" valign="top">
<td nowrap> </td>
<td nowrap>
<img src="/images/flags/ar.png">
Argentina <br>
<img src="/images/flags/au.png">
Australia <br>
<img src="/images/flags/at.png">
Austria <br>
</td>
</tr>
I want to get a list of the text elements between <a href ...> and </a>. When I write:
items >> allText("a")
then I get a List of 1 element:
ArgentinaAustraliaAustria
How do I get those texts as an n-element List?
You can use the texts method as follows:
(items >> texts("a")).filter(_.nonEmpty)
Which produces:
List(Argentina, Australia, Austria)
The filtering is needed for cases like
<img src="/images/flags/at.png">
since the <a> tag around the image contains no text

Perl Mechanize identify content between span tag within specific div tag

Perl WWW::Mechanize::Firefox has successfully retrieved the contents of the web page and stored them in the scalar variable $content.
my $url = 'http://finance.yahoo.com/quote/AAPL/financials?p=AAPL';
$mech->get($url);
my $content= $mech->content();
In examining $content, I'm interested in identifying and saving all the information between the span tags inside the table.
There are various classes that I have no interest in.
Attempt # 1 did not work.
my $tree = HTML::TreeBuilder->new_from_content($txtRawData);
my @list = $mech->find('span');
foreach ( @list ) {
print $_->as_HTML();
}
Attempt # 2 did not work.
foreach my $tag ($tree->look_down(_tag => 'span')) {
my $value = $tag->as_text;
}
The HTML table of interest is:
<div class="Mt(10px)">
<table class="Lh(1.7) W(100%) M(0)">
<tbody>
<tr class="Bdbw(1px) Bdbc($lightGray) Bdbs(s) H(36px)">
<td class="Fw(b) Fz(15px)">
<span>Revenue</span>
</td>
<td class="C($gray) Ta(end)">
<span>9/24/2016</span>
</td>
<td class="C($gray) Ta(end)">
<span>9/26/2015</span>
</td>
<td class="C($gray) Ta(end)">
<span>9/27/2014</span>
</td>
</tr>
<tr class="Bdbw(1px) Bdbc($lightGray) Bdbs(s) H(36px)">
<td class="Fz(s) H(35px) Va(m)">
<span>Total Revenue</span>
</td>
<td class="Fz(s) Ta(end)">
<span>
<span>215,639,000</span>
</span>
</td>
<td class="Fz(s) Ta(end)">
<span>
<span>233,715,000</span>
</span>
</td>
<td class="Fz(s) Ta(end)">
<span>
<span>182,795,000</span>
</span>
</td>
</tr>
<tr class="Bdbw(1px) Bdbc($lightGray) Bdbs(s) H(36px)">
<td class="Fz(s) H(35px) Va(m)">
<span>Cost of Revenue</span>
</td>
<td class="Fz(s) Ta(end)">
<span>
<span>131,376,000</span>
</span>
</td>
<td class="Fz(s) Ta(end)">
<span>
<span>140,089,000</span>
</span>
</td>
<td class="Fz(s) Ta(end)">
<span>
<span>112,258,000</span>
</span>
</td>
</tr>
<tr class="Bdbw(0px)! H(36px)">
<td class="Fw(b) Fz(s) Pb(20px)">
<span>Gross Profit</span>
</td>
<td class="Fw(b) Fz(s) Ta(end) Pb(20px)">
<span>
<span>84,263,000</span>
</span>
</td>
<td class="Fw(b) Fz(s) Ta(end) Pb(20px)">
<span>
<span>93,626,000</span>
</span>
</td>
<td class="Fw(b) Fz(s) Ta(end) Pb(20px)">
<span>
<span>70,537,000</span>
</span>
</td>
</tr>
</tbody>
</table>
</div>
What is the best way to select (set focus upon) one specific table (there could be multiple tables inside the $content variable), and save the text between the span tags to an array (to be passed to the next procedure, which inserts it into a database table)?
I would also like to highlight that:
Sometimes, the text is inside two (double) sets of span tags.
There is no table header row (or th tags).
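Your first attempt works if you call find on $tree rather than on $mech.
Combining it with as_text from your second attempt works nicely.
use strict;
use warnings;
use feature 'say';
use HTML::TreeBuilder;
# Read the sample HTML from the __DATA__ section and build the tree.
my $tree = HTML::TreeBuilder->new_from_content(my @foo = <DATA>);
# find returns every element with the given tag name.
my @list = $tree->find('span');
foreach ( @list ) {
    say $_->as_text();
}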
__DATA__
<div class="Mt(10px)">
<table class="Lh(1.7) W(100%) M(0)">
...
This outputs a list of span contents. You should be able to clean those up and work with them.
Revenue
9/24/2016
9/26/2015
9/27/2014
...
Of course as an actual table (array-of-arrays) it would probably make more sense, but for that we'd have to know what it is you are trying to do.
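As a rough illustration of the array-of-arrays idea, here is a sketch that groups the cell texts by table row, so each row becomes one list. It is written in Java rather than Perl and uses the JDK's built-in XML parser, which only works here because the fragment shown in the question happens to be well-formed; a real page would need a proper HTML parser. The class and method names (SpanTable, rows) are hypothetical. Taking the text of each td makes the single and double span cases come out the same.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SpanTable {
    // Returns one list per <tr>, holding the trimmed text of each <td>.
    static List<List<String>> rows(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            List<List<String>> rows = new ArrayList<>();
            NodeList trs = doc.getElementsByTagName("tr");
            for (int i = 0; i < trs.getLength(); i++) {
                NodeList tds = ((Element) trs.item(i)).getElementsByTagName("td");
                List<String> row = new ArrayList<>();
                for (int j = 0; j < tds.getLength(); j++) {
                    // getTextContent flattens nested spans into plain text.
                    row.add(tds.item(j).getTextContent().trim());
                }
                rows.add(row);
            }
            return rows;
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        String table =
            "<table><tbody>"
          + "<tr><td><span>Revenue</span></td><td><span>9/24/2016</span></td></tr>"
          + "<tr><td><span>Total Revenue</span></td>"
          + "<td><span><span>215,639,000</span></span></td></tr>"
          + "</tbody></table>";
        // Each row is printed as its own list of cell values.
        for (List<String> row : rows(table)) {
            System.out.println(row);
        }
    }
}
```

With the data in this shape, the first row can serve as the column labels (the dates) and every later row as a label followed by its values, which maps naturally onto database inserts.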

Magento order email missing style

I have a question regarding the Magento order emails.
I have created my own order template by loading the default template and modifying it. When I look at it via the 'preview template' button, the email appears without order information (of course) but with all of its styling.
However, when my customer gets the email, it is plain text with no styling at all.
What could I be doing wrong?
I don't know if it helps, but here is my email template:
{{template config_path="design/email/header"}}
{{inlinecss file="email-inline.css"}}
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td>
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td style='background: #fff' class="email-heading">
<h1 style='color: #68883e;'>Thank you for your order</h1>
<p>Your order is being process right now</p>
</td>
</tr>
</table>
</td>
</tr>
<tr>
<td class="order-details">
<h3 style='display: inline; float: left'>Your order <span class="no-link">#{{var order.increment_id}}</span></h3>
<p style='display: inline; float: right'>Placed on {{var order.getCreatedAtFormated('long')}}</p>
</td>
</tr>
<tr class="order-information">
<td>
{{if order.getEmailCustomerNote()}}
<table cellspacing="0" cellpadding="0" class="message-container">
<tr>
<td>{{var order.getEmailCustomerNote()}}</td>
</tr>
</table>
{{/if}}
{{layout handle="sales_email_order_items" order=$order}}
<table cellpadding="0" cellspacing="0" border="0">
<tr>
<td class="address-details">
<h6>Bill to:</h6>
<p><span class="no-link">{{var order.getBillingAddress().format('html')}}</span></p>
</td>
<td class="address-details">
<h6>Client:</h6>
<p><span class="no-link">{{var order.getclientfirstname().format('html')}}</span></p>
</td>
</tr>
<tr>
{{depend order.getIsNotVirtual()}}
<td class="method-info">
<h6>Shipping method:</h6>
<p>{{var order.shipping_description}}</p>
</td>
{{/depend}}
<td class="method-info">
<h6>Payment method:</h6>
{{var payment_html}}
</td>
</tr>
</table>
</td>
</tr>
</table><!-- asd-->
{{layout handle="sales_email_order_items" order=$order}}
{{template config_path="design/email/footer"}}
You shouldn't rely on CSS classes to style email templates.
You also shouldn't rely on a separate stylesheet file, since external stylesheets are not supported by the majority of email services.
Instead, apply the styling directly with inline style attributes, or put the styles in the email code itself (using the <style> tag).
Here is a guide to the CSS that email clients support:
https://www.campaignmonitor.com/css/

Using TinyMCE with handlebars template

I am trying to use TinyMCE to let my users modify a Handlebars report template. The template contains several elements that TinyMCE does not consider valid HTML, and they are being moved around. See the {{#each Details}} and {{/each}} below.
Here is the correct HTML code for my Handlebars template:
<table class="table table-bordered">
<thead>
<tr>
<th><h4>Item</h4></th>
<th><h4 class="text-right">Quantity</h4></th>
<th><h4 class="text-right">Rate/Price</h4></th>
<th><h4 class="text-right">Sub Total</h4></th>
</tr>
</thead>
<tbody>
{{#each Details}}
<tr>
<td>{{Item}}<br><small>{{Description}}</small></td>
<td class="text-right">{{Quantity}}</td>
<td class="text-right">{{Rate}} {{UnitOfMeasure}}</td>
<td class="text-right">{{Amount}}</td>
</tr>
{{/each}}
</tbody>
</table>
After I paste the code into TinyMCE, it turns into the following:
{{#each Details}}{{/each}}
<table class="table table-bordered">
<thead>
<tr><th>
<h4>Item</h4>
</th><th>
<h4 class="text-right">Quantity</h4>
</th><th>
<h4 class="text-right">Rate/Price</h4>
</th><th>
<h4 class="text-right">Sub Total</h4>
</th></tr>
</thead>
<tbody>
<tr>
<td>{{Item}}<br /><small>{{Description}}</small></td>
<td class="text-right">{{Quantity}}</td>
<td class="text-right">{{Rate}} {{UnitOfMeasure}}</td>
<td class="text-right">{{Amount}}</td>
</tr>
</tbody>
</table>
Has anyone run across a plugin or something else that may help me?
I just ran into this... I have an order confirmation email that I need to be configurable with a list of order items in a table; same issue.
I did just realize I probably shouldn't be using tables anyway, since they are not responsive, but I ultimately was able to solve the problem with HTML comments, like this:
<tr style="font-weight: bold;">
<td style="width: 145px;">Qty</td>
<td>Item</td>
<td>Unit Price</td>
<td>Total</td>
</tr>
<!--{{#order.line_items}} -->
<tr repeat="">
<td style="width: 145px;">{{quantity}}</td>
<td>{{product.name}}</td>
<td>{{currency unit_price}}</td>
<td>{{currency total}}</td>
</tr>
<!--{{/order.line_items}} -->
<tr>
<td style="width: 145px;"> </td>
<td> </td>
<td><strong>Subtotal:</strong></td>
<td>{{currency order.subtotal}}</td>
</tr>
I was able to solve this by putting the block tags in a custom attribute on my element:
<tr repeat="{{#each Details}}">
</tr repeat="{{/each}}">

How Do I Display Document ID Using Meteor Spacebars?

I am new to Meteor, and I have an app that outputs data into rows of a table. I want a column for the ObjectID just for testing purposes (I will disable it in production), but my Handlebars template does not seem to output the _id at all. Any ideas are appreciated!
Here is my template:
<template name="Fillup">
{{#each FillupArr}}
<tr class="fillup row">
<td> <div class="btnEdit">edit</div> <div class="btnSave" >save</div></td>
<td class="">{{Fillup_id.toHexString}}</td>
<td class="dateResult">{{Date}}</td>
<td class="mpg">{{MPG}}</td>
<td class="tripResult">{{Trip}}</td>
<td class="ppg">{{PPG}}</td>
<td class="ppm">{{PPM}}</td>
<td class="galResult">{{Gal}}</td>
<td class="priceResult">{{Price}}</td>
<td class="stationResult">{{Station}}</td>
<td> <div class="btnRemove">remove</div> <div class="btnCancel">cancel</div></td>
</tr>
{{/each}}
</template>
I figured it out: you don't need to include the collection name in the Handlebars template, so it changed from:
<td class="">{{Fillup_id.toHexString}}</td>
to
<td class="">{{_id}}</td>
and now it works!