I need some help with sed. I'm using it on Windows and Mac OS X. I need sed to add a
</tr>
<tr>
every 4 lines after the first <tr> found, and to stop when it reaches </tr>.
I just can't find a way to do this.
Every file will have up to 20 tables, so I need to do it automatically.
I want to change this:
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
to this
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
Is it possible with sed? If not, what tool should I use?
Thanks
I don't like the idea of using sed to handle HTML code. That said, try this:
Content of script.sed:
## For every line between '<tr>' and '</tr>' do ...
/<tr>/,/<\/tr>/ {
## Omit range edges.
/<\/\?tr>/ b;
## Append '<td>...</td>' to Hold Space (HS).
H;
## Get HS to Pattern Space (PS) to work with it.
x;
## If there are at least four newline characters, there are four '<td>'
## tags too, so add a '<tr>' before them and a '</tr>' after them,
## print, and delete them (already processed).
/\(\n[^\n]*\)\{4\}/ {
s/^\(\n\)/<tr>\1/;
s/$/\n<\/tr>/;
p
s/^.*$//;
}
## Save the '<td>'s to HS again and read next line.
x;
b;
}
## Print all lines out of the range.
p;
Assuming infile contains the data posted in the question, run the script like this:
sed -nf script.sed infile
That yields:
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
Try awk:
awk '{print}; /<td>/ && ++i==4 {print "</tr>\n<tr>"; i=0}' file
print the line
if it's a <td>, increment i
if i is 4, print </tr><tr> and reset i
Testing with the given input returns the desired output,
with the only "problem" being that an extra <tr></tr> appears at the end of the list.
This is fixable, but I'm running out of time here.
When I get back I can look into it if you think it is needed.
... part of the end of the result file
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
<tr> <-- extra <tr></tr> here
</tr>
</table>
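The trailing empty row comes from printing the break right after every fourth cell. One way around it is to emit the break lazily, just before the first cell of the next group. A minimal Python sketch of that idea follows; Python is not used in this thread, and `rewrap_rows` and `sample` are hypothetical names introduced only for illustration. It assumes one tag per line, as in the question's input.

```python
def rewrap_rows(lines, cells_per_row=4):
    """Insert a </tr>/<tr> pair between groups of <td> lines.

    The break is emitted just before the first cell of the next group,
    so no empty row is produced after the last full group.
    """
    out = []
    cells_in_row = 0
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<td>"):
            if cells_in_row == cells_per_row:
                out.extend(["</tr>", "<tr>"])
                cells_in_row = 0
            cells_in_row += 1
        elif stripped in ("<tr>", "</tr>"):
            # A row boundary already present in the input resets the count.
            cells_in_row = 0
        out.append(line)
    return out


# Eight cells in a single row, one tag per line.
sample = (["<table>", "<tr>"]
          + ["<td>%d</td>" % n for n in range(8)]
          + ["</tr>", "</table>"])
result = rewrap_rows(sample)
```

Because the counter resets at every `<tr>` and `</tr>`, the same pass should handle a file with several tables.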
You can try regular expressions. You can test the following expression at:
http://gskinner.com/RegExr/
Catch expression:
(<td>.*?</td>.<td>.*?</td>.<td>.*?</td>.<td>.*?</td>)(?!.</tr>)
Replace expression:
$1\n</tr>\n<tr>
Flags checked:
global, ignorecase, dotall
Result:
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
You can use an editor like Notepad++ to batch-replace across many files at once (the syntax will be a little different).
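For a scripted run instead of an editor, the same idea can be sketched with Python's `re` module. This is an illustration only: `ROW_OF_FOUR` and `split_rows` are hypothetical names, and the pattern assumes four `<td>` elements separated by exactly one character (the newline), with `re.DOTALL` and `re.IGNORECASE` standing in for the "dotall" and "ignorecase" flags.

```python
import re

# Four consecutive <td> cells, each separated by a single character,
# not immediately followed by a closing </tr>.
ROW_OF_FOUR = re.compile(
    r"(<td>.*?</td>.<td>.*?</td>.<td>.*?</td>.<td>.*?</td>)(?!.</tr>)",
    re.IGNORECASE | re.DOTALL,
)


def split_rows(html):
    """Append a row break after each group of four cells."""
    return ROW_OF_FOUR.sub(r"\1\n</tr>\n<tr>", html)


sample = "<tr>\n" + "\n".join("<td>%d</td>" % n for n in range(8)) + "\n</tr>"
result = split_rows(sample)
```

One caveat with this approach: with dotall enabled, a lazy `.*?` can still stretch across a table boundary when the lookahead rejects the last group, so a file holding several tables may need to be split per table first.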
sed '\!<td>!,\!</table!{N;N;N;i\
</tr>\
<tr>
}' input_file
Perl solution, still using regular expressions instead of parsing HTML:
perl -pe '
undef $inside if m{</tr>};
if ($inside and ($. % 4) == $tr_line) {
print "</tr>\n<tr>\n";
}
$inside = 1 if defined $tr_line;
$tr_line = ($. + 1) % 4 if /<tr>/;
' file
Using xsh:
open :F html file ; # Open as html.
while //table/tr[count(td)>4] wrap :U position()=8 tr //table/tr/td ; # Wrap four td's into a tr.
xmove :r //table/tr/tr before .. ; # Unwrap the extra tr.
remove //table/tr[last()] ; # Remove the extra tr.
I'm trying to figure out why html2text is breaking my HTML:
<div><table> <tbody> <tr> <td> <span><strong><span>About</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>Contact</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>FAQ</span></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>
Processing it with:
cat "/home/spider/original-file.txt" | html2text -utf8 -nobs -style pretty
When I run that, I get:
Input recoding failed due to invalid input sequence. Unconverted part
of text follows. ▒Contact ▒Maths Games Order ▒FAQ
s Broadbent Maths Ltd 3 High Street, Welbourn, Lincoln, LN5 0NH
When I run Devel::Peek::Dump() (Perl), I see the string as:
SV = PV(0x564c0a72c860) at 0x564c09967c80
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
PV = 0x564c0a58bc60 "\n<div><table> <tbody> <tr> <td> <span><strong><span>About</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>Contact</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>FAQ</span></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"\0 [UTF8 "\n<div><table> <tbody> <tr> <td> <span><strong><span>About</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>Contact</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>FAQ</span></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"]
CUR = 725
LEN = 736
COW_REFCNT = 1
If I remove the first bit:
<div><table>
It works fine! I don't get why it's breaking there, though. It all seems OK to me.
OK, I think I've worked it out. In this case, for some reason the • character was breaking it. I replaced it with "-", and it works now:
html2text -utf8 -nobs -o test-out.txt test.co.uk.txt
It's a bit weird that html2text breaks with HTML entities though?
UPDATE: The problem turned out to be that while the page declared utf-8 in its meta tag, the server was actually sending it as iso-8859-1. So I now parse the charset out of the server header and compare it before saving; if it is windows-1252, I use this command instead:
html2text -ansi -nobs -o test-out.txt test.co.uk.txt
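The header check described in the update could look roughly like the following Python sketch. It is an assumption-laden illustration rather than the poster's code: `charset_from_content_type` and `html2text_encoding_flag` are hypothetical helpers, and the mapping simply mirrors the two flags used in this post (`-utf8` and `-ansi`).

```python
import re


def charset_from_content_type(header_value):
    """Return the lowercase charset named in a Content-Type header, or None."""
    match = re.search(r'charset\s*=\s*"?([\w.:-]+)', header_value or "",
                      re.IGNORECASE)
    return match.group(1).lower() if match else None


def html2text_encoding_flag(header_value):
    """Pick the html2text flag from this post that matches the server charset."""
    charset = charset_from_content_type(header_value)
    if charset in ("utf-8", "utf8"):
        return "-utf8"
    if charset in ("windows-1252", "iso-8859-1"):
        return "-ansi"
    return None  # unknown or missing charset: leave the decision to the caller
```

Trusting the server header over the page's meta tag matches what the update found; the opposite priority may suit other sites.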
I am building a report for the invoice lines. I purchased a module from the third-party Odoo store, and it does its job well.
But I can't see the discount on the invoice line.
I think the module is preventing this, but I no longer have developer support.
What I need is for the discount (from the price list) to be shown on the invoice line.
Which table or field holds the invoice line discount?
Here is the code I have in the report:
''''
<tbody class="invoice_tbody">
<tr t-foreach="invoice_lines[0]" t-as="line">
<td><b><span t-esc="line['client_ref']"/></b>
<span t-esc="line['description']"/></td>
<td class="text-right">
<span t-esc="line['qty']"/>
</td>
<td class="text-right">
<span t-esc="line['price_unit']"/>
</td>
<td t-if="display_discount" class="text-right">
</td>
<td class="text-right" id="subtotal">
<t t-if="line['price_subtotal']">
<span t-esc="line['price_subtotal']" t-options="{&quot;widget&quot;: &quot;monetary&quot;, &quot;display_currency&quot;: o.currency_id}"/></t>
</td>
</tr>
<tr t-foreach="range(max(5-len(o.invoice_line_ids),0))" t-as="l">
<td t-translation="off">&amp;nbsp;</td>
<td class="hidden"/>
<td/>
<td/>
<td t-if="display_discount"/>
<td/>
<td/>
</tr>
</tbody>
</t>
'''
Yes, this parameter is in the report
"view/report_invoice_document"
But the report that I am trying to modify is this one:
report_invoice_document_inherit
<?xml version="1.0"?>
<data inherit_id="account.report_invoice_document">
<xpath expr="//table[@name='invoice_line_table']/tbody" position="replace">
<t t-if="res_company.is_group_by_so">
<t t-set="invoice_lines" t-value="o.get_invoice_lines()"/>
<tbody class="invoice_tbody">
<tr t-foreach="invoice_lines[0]" t-as="line">
<td><b><span t-esc="line['client_ref']"/></b>
<span t-esc="line['description']"/></td>
<!-- <td class="hidden"><span t-esc="line['client_ref']"/></td> -->
<td class="text-right">
<span t-esc="line['qty']"/>
<!-- <span t-field="l.uom_id" groups="product.group_uom"/> -->
</td>
<td class="text-right">
<span t-esc="line['price_unit']"/>
</td>
</td>
<td t-if="display_discount" class="text-right">
<!-- <span t-esc="line['price_unit']"/> -->
</td>
<td class="text-right" id="subtotal">
<t t-if="line['price_subtotal']">
<span t-esc="line['price_subtotal']" t-options="{&quot;widget&quot;: &quot;monetary&quot;, &quot;display_currency&quot;: o.currency_id}"/></t>
</td>
</tr>
<tr t-foreach="range(max(5-len(o.invoice_line_ids),0))" t-as="l">
<td t-translation="off"> </td>
<td class="hidden"/>
<td/>
<td/>
<td t-if="display_discount"/>
<td/>
<td/>
</tr>
</tbody>
</t>
<t t-else="">
<tbody class="invoice_tbody">
<tr t-foreach="o.invoice_line_ids" t-as="l">
<td><span t-field="l.name"/></td>
<td class="hidden"><span t-field="l.origin"/></td>
<td class="text-right">
<span t-field="l.quantity"/>
<span t-field="l.uom_id" groups="product.group_uom"/>
</td>
<td class="text-right">
<span t-field="l.price_unit"/>
</td>
<td t-if="display_discount" class="text-right">
<span t-field="l.discount"/>
</td>
<td class="text-right">
<span t-esc="', '.join(map(lambda x: (x.description or x.name), l.invoice_line_tax_ids))"/>
</td>
<td class="text-right" id="subtotal">
<span t-field="l.price_subtotal" t-options="{&quot;widget&quot;: &quot;monetary&quot;, &quot;display_currency&quot;: o.currency_id}"/>
</td>
</tr>
<tr t-foreach="range(max(5-len(o.invoice_line_ids),0))" t-as="l">
<td t-translation="off"> </td>
<td class="hidden"/>
<td/>
<td/>
<td t-if="display_discount"/>
<td/>
<td/>
</tr>
</tbody>
</t>
</xpath>
</data>
I have tried to modify the second report, and I have also looked at the Python code in case something there is the cause:
invoice_report_grouped_by\report\account_invoice.py
# -*- coding: utf-8 -*-
from odoo import api, models
from datetime import datetime


class AccountInvoice(models.Model):
    _inherit = "account.invoice"

    def get_notation_amt(self, amt):
        '''This method help us to return the value of the product pricing'''
        # Swap the decimal point for a comma, e.g. "12.500" -> "12,500".
        amount = str(amt).split('.')
        if len(amount) == 2:
            amount = amount[0] + "," + amount[1]
            return amount
        return amt

    @api.multi
    def get_product_invoice_lines(self, client_ref=False):
        '''This method helps to get the data for the following Invoice Line.'''
        product_invoices = []
        client_order_ref = []
        # Pair every invoice line with its sale order (or False if it has none).
        for line in self.invoice_line_ids:
            sale_line = (False, line)
            if line.sale_line_ids:
                sale_line = (line.sale_line_ids[0].order_id, line)
            client_order_ref.append(sale_line)
        if client_order_ref:
            for ref in client_order_ref:
                if (client_ref == ref[0]):
                    product_invoices.append({'price_subtotal': ref[1].price_unit * ref[1].quantity,
                                             'default_code': ref[1].product_id.default_code,
                                             'client_ref': False,
                                             'discount': ref[1].discount,
                                             'taxes': ",".join(map(lambda x: (x.description or x.name), ref[1].invoice_line_tax_ids)),
                                             'description': ref[1].name,
                                             'qty': self.get_notation_amt(ref[1].quantity),
                                             'price_unit': self.get_notation_amt("{0:.3f}".format(ref[1].price_unit)),
                                             })
        else:
            for line in self.invoice_line_ids:
                product_invoices.append({'price_subtotal': line.price_unit * line.quantity,
                                         'default_code': line.product_id.default_code,
                                         'client_ref': False,
                                         'discount': line.discount,
                                         # The original used ref[1] here, which is undefined in this branch.
                                         'taxes': ",".join(map(lambda x: (x.description or x.name), line.invoice_line_tax_ids)),
                                         'description': line.name,
                                         'qty': self.get_notation_amt(line.quantity),
                                         'price_unit': self.get_notation_amt("{0:.3f}".format(line.price_unit)),
                                         })
        return product_invoices

    @api.multi
    def get_invoice_lines(self):
        '''This method help to get the invoice line group by Sale order'''
        vals = []
        sale_order_lines = []
        false_sale_order_lines = []
        for line in self.invoice_line_ids:
            sale_line = False
            if line.sale_line_ids:
                sale_line = line.sale_line_ids[0].order_id
            if sale_line:
                sale_order_lines.append(sale_line)
            else:
                false_sale_order_lines.append(sale_line)
        sale_order_lines = list(set(sale_order_lines))
        false_sale_order_lines = list(set(false_sale_order_lines))
        for sale_order in sale_order_lines:
            if sale_order and self.origin:
                # NOTE: left as posted; str() does not accept a date format,
                # so datetime.strptime was probably intended here.
                confirmation_date = str(
                    sale_order.confirmation_date, '%d-%m-%Y %H:%M:%S').strftime('%d/%m/%Y')
                client_ref = sale_order.name + ' - ' + confirmation_date
                if sale_order.client_order_ref:
                    client_ref = client_ref + ' - ' + sale_order.client_order_ref
                # Header row for the sale order; the product rows follow it.
                vals.append({'price_subtotal': False, 'default_code': False,
                             'client_ref': client_ref, 'description': False,
                             'qty': False, 'price_unit': False, 'taxes': False, 'discount': False})
            vals.extend(self.get_product_invoice_lines(client_ref=sale_order))
        # for sort false sale order, display manually invoice line at last
        for so in false_sale_order_lines:
            vals.extend(self.get_product_invoice_lines(client_ref=so))
        return [vals, len(vals)]
You can see the default report here:
https://github.com/odoo/odoo/blob/06f9baae968674547cb2592b1c22147bfb2e8ba9/addons/account/views/report_invoice.xml#L49
<t t-set="display_discount" t-value="any([l.discount for l in o.invoice_line_ids])"/>
This means that if any line has a discount, it should display it.
I think there are two options to disable it. One is to remove that line from the report, or the second option is to set display_discount to false.
Knowing the module that breaks your report, the problem should be easy to find.
But the exact reason is hard to tell without seeing your module.
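To make the `display_discount` expression concrete, here is a small standalone Python sketch. Plain dictionaries stand in for the invoice line records, and `display_discount` is written as a hypothetical function purely for illustration; in the real report it is a QWeb variable set by the `t-set` line quoted above.

```python
def display_discount(invoice_lines):
    """Mirror of any([l.discount for l in o.invoice_line_ids]) on plain dicts."""
    return any(line["discount"] for line in invoice_lines)


# Stand-in data; real invoice lines are Odoo records, not dicts.
lines_with_discount = [
    {"name": "Widget", "discount": 0.0},
    {"name": "Gadget", "discount": 10.0},
]
lines_without_discount = [{"name": "Widget", "discount": 0.0}]
```

The flag only decides whether the discount column is rendered. In the inherited template posted in the question, the cell guarded by `display_discount` in the grouped branch contains nothing but a commented-out span, so that column would likely stay empty even when the flag is true; the dictionaries built by `get_product_invoice_lines` do carry a `'discount'` key that could be shown there.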
This is the div content I want to get from a web page:
<div class="result clearfix table-responsive">
<table class="table table-striped">
<thead>
<tr>
<th>Giải thưởng</th>
<th>Trùng khớp</th>
<th>Số lượng giải</th>
<th style="text-align: left; width: 22%;">Giá trị giải (đồng)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Jackpot</td>
<td>Trùng 6 số</td>
<td>0</td>
<td style="text-align: left"><span>27.868.784.500</span></td>
</tr>
<tr>
<td>Giải nhất</td>
<td>Trùng 5 số</td>
<td>18</td>
<td style="text-align: left"><span>10.000.000</span></td>
</tr>
<tr>
<td>Giải nhì</td>
<td>Trùng 4 số</td>
<td>613</td>
<td style="text-align: left"><span>300.000</span></td>
</tr>
<tr>
<td>Giải ba</td>
<td>Trùng 3 số</td>
<td>11047</td>
<td style="text-align: left"><span>30.000</span></td>
</tr>
</tbody>
</table>
<p class="role-result">
<span>Thời hạn lĩnh thưởng của vé trúng thưởng: là 60 (sáu mươi) ngày, kể từ ngày xác định kết quả trúng thưởng hoặc kể từ ngày hết hạn lưu hành của vé xổ số tự chọn số điện toán (nếu có). Quá thời hạn trên, các vé trúng thưởng không còn giá trị lĩnh thưởng.</span>
</p>
<div>
<a class="view-more" href="winning-numbers">Các lần quay trước</a>
</div>
</div>
This is my code to get the div content and echo it on my site:
$kqxsmega = file_get_contents ("http://vietlott.vn/vi/trung-thuong/ket-qua-trung-thuong/mega-6-45/");
$dom = new DomDocument();
$dom->loadHTML($kqxsmega);
$finder = new DomXPath($dom);
$classname="result clearfix table-responsive";
$divContent = $finder->query("//*[contains(@class, '$classname')]");
My code runs fine, and I want to convert $divContent to a string so that I can echo it.
Right now, echoing $divContent shows nothing:
echo $divContent ;
Please help me.
Thank you.
My table rows in HTML are as follows:
<TR bgcolor="#FFFFFF" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#FFFFFF';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">DRB</B> </TD><TD class="dlfont">Blah</B> </TD>
<TD class="dlfont">PPD</B> </TD><TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><IMG border='0' src='/images/view.gif' height=10 width=19></TD>
</TR>
<TR bgcolor="#EEEEEE" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#EEEEEE';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">WHPSF</B> </TD>
<TD class="dlfont">Blah</B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><IMG border='0' src='/images/view.gif' height=10 width=19></TD>
</TR>
When I extract the rows using HTML::TableExtract, the extra characters </B> also appear at the end and form some kind of special character. How can I get rid of this?
I would keep two things in mind when using HTML::TableExtract with the badly formatted HTML in your question:
use keep_html=>1 in the HTML::TableExtract constructor
use a regex to remove the </B> , carefully
Here's some Perl code I wrote to prune the </B> out of the table cells, but note, this could change validly formatted HTML to badly formatted HTML if you blindly apply it in all cases.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
my ($f) = @ARGV;
open my $fh, '<', $f or die "Cannot open $f: $!";
my $html = join '', <$fh>;
close $fh;
### your html didn't include headers, so I added a first table row with td text, time a b c d e f, to help HTML::TableExtract find the table in file, $f
my $te = HTML::TableExtract->new(
keep_html=>1,
headers=>[qw/ time a b c d e f/]);
$te->parse($html);
for my $ts($te->tables)
{
print "Table(",join(',',$ts->coords),"):\n";
for my $row ($ts->rows)
{
for my $cell (@$row)
{
next unless $cell;
## maybe add $ at end of regex or other test here to make sure valid cases of <B>...</B> are not affected
$cell =~ s/<\/B> //i;
print $cell."\n";
}
}
}
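The comment in the loop suggests anchoring the regex so that valid `<B>...</B>` pairs are not touched. A minimal Python sketch of that safeguard follows; Python is used only for illustration, and `strip_stray_closing_b` is a hypothetical helper, not part of HTML::TableExtract.

```python
import re

# A closing </B> (any case) followed only by whitespace at the end of the cell.
STRAY_CLOSING_B = re.compile(r"</b>\s*$", re.IGNORECASE)
OPENING_B = re.compile(r"<b\b[^>]*>", re.IGNORECASE)


def strip_stray_closing_b(cell):
    """Remove an unmatched trailing </B>, leaving balanced <B>...</B> alone."""
    if OPENING_B.search(cell):
        # An opening <B> exists, so the closing tag is probably legitimate.
        return cell
    return STRAY_CLOSING_B.sub("", cell)
```

For example, `strip_stray_closing_b("DRB</B> ")` gives `"DRB"`, while a cell containing `<B>bold</B>` is returned unchanged.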
I tried to run the following Perl script on the HTML further below. My problem is how to define the correct hash reference, with attribs that specify attributes of interest within my HTML <table> tag itself.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;
my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 1, br_translate => 0 );
$table->parse($html);
foreach my $row ($table->rows)
sub cleanup {
for ( @_ ) {
s/\s+//;
s/[\xa0 ]+\z//;
s/\s+/ /g;
}
}
{ print join("\t", @$row), "\n"; }
I want to apply this code on the HTML-document you see further below.
My first approach is to do this with the columns method, but I am not able to figure out how to use it on the HTML file below. My intuition is that it should be something like the following (but my intuition is wrong):
foreach my $column ($table->columns) {
print join("\t", @$column), "\n";
}
The HTML::TableExtract documentation doesn't shed much light (for me anyway).
I can see in the code of the module that the columns method belongs to HTML::TableExtract::Table, but I can't figure out how to use it. I appreciate any help.
Background:
I am trying to extract the table from a very small document with the HTML::TableExtract module. I want to search for keywords in the HTML so that I can use them as attribs and print only the necessary data.
I looked on CPAN but could not find out how to search the parsed result for particular keywords. One way to do it would be HTML::TableExtract; the other would be to parse with HTML::TokeParser, with which I have very little experience.
Either way, I need to do this parsing. I want to write the parsed tables to a text file or, even better, store them in a database. The problem is that I can't find any way to search through the parsed table and get the necessary data.
The HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<link rel="stylesheet" href="jspsrc/css/bp_style.css" type="text/css">
<title>Weitere Schulinformationen</title>
</head>
<body class="bodyclass">
<div style="text-align:center;"><center>
<!-- <fieldset><legend> general information </legend>
-->
<br/>
<table border="1" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_result_tab_info'>
<!-- <table border="0" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_search_info'>
-->
<tr>
<td width="100%" colspan="2" class="ldstabTitel"><strong>data_one </strong></td>
</tr>
<tr>
<td width="27%"><strong>data_two</strong></td>
<td width="73%"> 116439
</td>
</tr>
<tr>
<td width="27%"><strong>official_description</strong></td>
<td width="73%">the name </td>
</tr>
<tr>
<td width="27%"><strong>name of the street</strong></td>
<td width="73%">champs elysee</td>
</tr>
<tr>
<td width="27%"><strong>number and town</strong></td>
<td width="73%"> 75000 paris </td>
</tr>
<tr>
<td width="27%"><strong>telefon</strong></td>
<td width="73%"> 000241 49321
</td>
</tr>
<tr>
<td width="27%"><strong>fax</strong></td>
<td width="73%"> 000241 4093287
</td>
</tr>
<tr>
<td width="27%"><strong>e-mail-adresse</strong></td>
<td width="73%"> <a href=mailto:1111116439@my_domain.org>1222216439@site.org</a>
</td>
</tr>
<tr>
<td width="27%"><strong>internet-site</strong></td>
<td width="73%"> <a href=http://www.thesite.org>http://www.thesite.org</td>
</tr>
<!--
<tr>
<td width="27%"> </td>
<td width="73%" align="right"><a href="schule_aeinfo.php?SNR=<? print $SCHULNR ?>" target="_blank">
[Schuldaten ändern] </a>
</tr>
</td> -->
<tr>
<td width="27%"> </td>
<td width="73%">the department</td>
</tr>
<tr>
<td width="100%" colspan=2><strong> </strong></td>
</tr>
<tr>
<td width="27%"><strong>number of indidviduals</strong></td>
<td width="73%"> 192</td>
<tr>
<td width="100%" colspan=2><strong> </strong></td>
</tr>
<!-- if (!fsp.isEmpty()){
ztext = " ";
int i = 0;
Iterator it = fsp.iterator();
while (it.hasNext()){
String[] zwert = new String[2];
zwert = (String[])it.next();
if (i==0){
if (zwert[1].equals("0")){
ztext = ztext+zwert[0];
}else{
ztext = ztext+zwert[0]+" mit "+zwert[1];
if (zwert[1].equals("1")){
ztext = ztext+" Schüler";
}else{
ztext = ztext+" Schülern";
}
}
i++;
}else{
if (zwert[1].equals("0")){
ztext = ztext+"<br> "+zwert[0];
}else{
ztext = ztext+"<br> "+zwert[0]+" mit "+zwert[1];
if (zwert[1].equals("1")){
ztext = ztext+" Schüler";
}else{
ztext = ztext+" Schülern";
}
}
}
}
-->
</table>
<!-- </fieldset> -->
<br>
</body>
</html>
Thanks for any and all help.
You need to provide something that uniquely identifies the table in question. This can be the content of its headers or the HTML attributes. In this case, there is only one table in the document, so you don't even need to do that. But, if I were to provide anything to the constructor, I would provide the class of the table.
Also, I do not think you want the columns of the table. The first column of this table consists of labels and the second column consists of values. To get the labels and values at the same time, you should process the table row-by-row.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;
my $te = HTML::TableExtract->new(
attribs => { class => 'bp_result_tab_info' },
);
$te->parse_file('t.html');
for my $table ( $te->tables ) {
print Dump $table->columns;
}
Output:
---
- 'data_one '
- data_two
- official_description
- name of the street
- number and town
- telefon
- fax
- e-mail-adresse
- internet-site
- á
- á
- number of indidviduals
- á
---
- ~
- "á116439\r\n "
- 'the name '
- champs elysee
- ' 75000 paris '
- "á000241 49321\r\n"
- "á000241 4093287\r\n"
- "á1222216439@site.org\r\n"
- áhttp://www.thesite.org
- the department
- ~
- á192
- ~
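The row-by-row processing recommended above can be sketched as follows. This is a Python illustration, not HTML::TableExtract code: `rows_to_record` is a hypothetical helper, each row is assumed to arrive as a `[label, value]` list with `None` for missing cells, and the `á` in the output above is assumed to be a non-breaking space (`\xa0`) as rendered by the console.

```python
def rows_to_record(rows):
    """Turn [label, value] rows into a dict, skipping incomplete rows."""
    record = {}
    for row in rows:
        # Normalise non-breaking spaces and trim whitespace in every cell.
        cells = [(cell or "").replace("\xa0", " ").strip() for cell in row]
        if len(cells) >= 2 and cells[0] and cells[1]:
            record[cells[0]] = cells[1]
    return record


# A few rows shaped like the table in the question.
rows = [
    ["data_one ", None],
    ["data_two", "\xa0116439\r\n "],
    ["name of the street", "champs elysee"],
    ["\xa0", "the department"],
]
record = rows_to_record(rows)
```

Keeping label and value together in one mapping makes it straightforward to pick out only the needed fields or to write them to a file or database later.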
Finally, a word of advice: It is clear that you do not have much of an understanding of Perl (or HTML for that matter). It would be better for you to try to learn some of the basics first. This way, all you are doing is incorrectly copying and pasting code from one answer into another and not learning anything.