Perl HTML::TableExtract: get stripped text

My tables' rows in HTML are as follows,
<TR bgcolor="#FFFFFF" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#FFFFFF';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">DRB</B> </TD><TD class="dlfont">Blah</B> </TD>
<TD class="dlfont">PPD</B> </TD><TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><IMG border='0' src='/images/view.gif' height=10 width=19></TD>
</TR>
<TR bgcolor="#EEEEEE" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#EEEEEE';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">WHPSF</B> </TD>
<TD class="dlfont">Blah</B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><IMG border='0' src='/images/view.gif' height=10 width=19></TD>
</TR>
When I extract the rows using HTML::TableExtract, the extra characters </B> also appear at the end and form some kind of special character. How can I get rid of this?

I would keep two things in mind when using HTML::TableExtract with the badly formed HTML in your question:
use keep_html => 1 in the HTML::TableExtract constructor
use a regex to remove the </B>, carefully
Here's some Perl code I wrote to prune the </B> out of the table cells. Note that it could turn valid HTML into invalid HTML if you apply it blindly in all cases.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
my ($f) = @ARGV;
# three-argument open with a lexical handle, and fail loudly if the file is missing
open my $fh, '<', $f or die "Cannot open $f: $!";
my $html = join '', <$fh>;
close $fh;
### your html didn't include headers, so I added a first table row with td text, time a b c d e f, to help HTML::TableExtract find the table in file, $f
my $te = HTML::TableExtract->new(
    keep_html => 1,
    headers   => [qw/ time a b c d e f /],
);
$te->parse($html);
for my $ts ($te->tables)
{
    print "Table(", join(',', $ts->coords), "):\n";
    for my $row ($ts->rows)
    {
        for my $cell (@$row)
        {
            next unless $cell;
            ## maybe add $ at end of regex or other test here to make sure valid cases of <B>...</B> are not affected
            $cell =~ s/<\/B> //i;
            print $cell."\n";
        }
    }
}
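The comment in the code above about protecting valid <B>...</B> pairs can be made concrete. What follows is only a minimal sketch, assuming the stray tag always sits at the very end of the cell and that any cell containing an opening <B> should be left alone. The name strip_stray_b is a hypothetical helper, not part of HTML::TableExtract.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a dangling </B> (plus surrounding whitespace)
# from the end of a cell, but only when the cell has no opening <B> tag.
sub strip_stray_b {
    my ($cell) = @_;
    return $cell unless defined $cell;
    return $cell if $cell =~ /<B\b[^>]*>/i;    # has an opener: leave untouched
    $cell =~ s{\s*</B>\s*\z}{}i;               # anchored at the end of the cell
    return $cell;
}

print strip_stray_b('07/01/2011 10:33 AM EDT</B> '), "\n";
print strip_stray_b('<B>bold</B>'), "\n";
```

Inside the loop it would replace the plain substitution, for example $cell = strip_stray_b($cell);.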

Related

How to extract a column of a table from html page using perl modules?

I have the following html code of a part of a webpage.
<h2 id="failed_process">Failed Process</h2>
<table border="1">
<thead>
<tr>
<th>
<b>pid</b>
</th>
<th>
<b>Priority</b>
</th>
</tr>
</thead>
<tbody>
<tr>
<td id="90">p_201211162334</td>
<td id="priority_90">NORMAL</td>
</tr>
<tr>
<td id="91">p_201211163423</td>
<td id="priority_91">NORMAL</td>
</tr>
<tr>
<td id="98">p_201211166543</td>
<td id="priority_98">NORMAL</td>
</tr>
</tbody>
</table>
<hr>
I need to extract the pid column. The output should look like
pid
p_201211162334
p_201211163423
p_201211166543
The table count for the "Failed Process" table is 4. But if I specify the table count as 4 and there are no failed tasks on the page, the extraction moves on to the next table and fetches that table's pids, which gives the wrong result.
I am using the code below to get the result.
#!/usr/bin/perl
use strict;
use warnings;
use lib qw(..);
use HTML::TableExtract;
my $content = get("URL");
my $te = HTML::TableExtract->new(
headers => [qw(pid)], attribs => { id => 'failed_process' },
);
$te->parse($content);
foreach my $col ($te->rows) {
print ("\t", @$col), "\n";
}
But I am getting the following error:
Can't call method "rows" on an undefined value
With my favourite DOM parser, Mojo::DOM from the Mojolicious suite, it would look like this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;
# instantiate with all DATA lines
my $dom = Mojo::DOM->new(do { local $/; <DATA> });
# extract all first column cells
$dom->find('table tr')->each(sub {
my $cell = shift->children->[0];
say $cell->all_text;
});
__DATA__
<h2 id="failed_process">Failed Process</h2>
<table border="1">
...
Output:
pid
p_201211162334
p_201211163423
p_201211166543
After $te->parse($html) you can add something like foreach my $table ($te->tables) { ... } and then get the rows with $table->rows. You can also use Data::Dumper to inspect $te.
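A minimal sketch of that suggestion, assuming the error in the question arises because no table matched the constructor arguments (the id failed_process sits on the <h2>, not on the <table>), so there is no first table to call rows on. The function name pid_column and the inline sample HTML are illustrative only.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Collect the matched column from every table whose header row contains
# "pid". Looping over $te->tables simply does nothing when no table matched.
sub pid_column {
    my ($html) = @_;
    my $te = HTML::TableExtract->new( headers => ['pid'] );
    $te->parse($html);
    my @pids;
    for my $table ( $te->tables ) {
        for my $row ( $table->rows ) {
            push @pids, $row->[0] if defined $row->[0];
        }
    }
    return @pids;
}

my $html = <<'HTML';
<table border="1">
<thead><tr><th><b>pid</b></th><th><b>Priority</b></th></tr></thead>
<tbody>
<tr><td id="90">p_201211162334</td><td id="priority_90">NORMAL</td></tr>
<tr><td id="91">p_201211163423</td><td id="priority_91">NORMAL</td></tr>
</tbody>
</table>
HTML

print "$_\n" for pid_column($html);
```

This only guards against the undefined-value error. On its own it does not tell the "Failed Process" table apart from later tables that also have a pid header.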

parsing with Mojo::DOM

I am using Mojo::UserAgent->new to fetch some XML which has the following format:
<row>
<td> content1 </td>
<td> content2 </td>
<td> content3 </td>
</row>
<row>
<td> content4 </td>
<td> content5 </td>
<td> content6 </td>
</row>
Is it possible to view the results like this:
content1,content2,content3
content4,content5,content6
Below is the query I am using, which gives different results:
$ua->get($url)->res->dom->at->(row)->children->each(sub {print "$_\t"})
Sure, that's absolutely possible and not hard with Mojo::Collection working behind the scenes.
Code
# replace this line by your existing $ua->get($url)->res->dom code
my $dom = Mojo::DOM->new(do { local $/ = undef; <DATA> });
# pretty-print rows
$dom->find('row')->each(sub {
my $row = shift;
say $row->children->pluck('text')->join(', ');
});
Data
__DATA__
<row>
<td> content1 </td>
<td> content2 </td>
<td> content3 </td>
</row>
<row>
<td> content4 </td>
<td> content5 </td>
<td> content6 </td>
</row>
Output
content1, content2, content3
content4, content5, content6
Some comments
each evaluates a code ref for each element of a collection (which is what find returns).
pluck returns a Mojo::Collection object with the return values of the given method name (text in this case). This is just a fancy way to map simple stuff.
text automagically trims the element content.
join joins all elements of the Mojo::Collection object together, all td elements of a row in this case.
Your code doesn't even compile, but using at won't work anyway because it returns just the first matching DOM element, not all. You want to iterate all rows.
HTH!

Add two lines every four lines between patterns - SED

I need some help with sed. I'm using it on Windows and Mac OS X. I need sed to add a
</tr>
<tr>
every 4 lines after the first <tr> found, and to stop doing so at </tr>.
I just can't find a way to do this.
Every file will have up to 20 tables, so I need to do it automatically.
Changing from this
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
to this
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
Is it possible with sed? If not, what tool should I use?
Thanks
I don't like the idea of using sed to handle HTML code. That said, try this:
Content of script.sed:
## For every line between '<tr>' and '</tr>' do ...
/<tr>/,/<\/tr>/ {
## Omit range edges.
/<\/\?tr>/ b;
## Append '<td>...</td>' to Hold Space (HS).
H;
## Get HS to Pattern Space (PS) to work with it.
x;
## If there are at least four newline characters, there are four
## '<td>' tags too, so add a '<tr>' before them and a '</tr>' after them,
## print, and delete them (already processed).
/\(\n[^\n]*\)\{4\}/ {
s/^\(\n\)/<tr>\1/;
s/$/\n<\/tr>/;
p
s/^.*$//;
}
## Save the '<td>'s to HS again and read next line.
x;
b;
}
## Print all lines out of the range.
p;
Assuming infile with the data posted in the question, run the script like:
sed -nf script.sed infile
That yields:
<div class="titulo"> TERMINAL CAPAO DA IMBUIA</div>
<div class="dataedia">
Válido a partir de: 30/07/2012 -
DIA ÚTIL</div>
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
try awk
awk '{print}; /<td>/ && ++i==4 {print "</tr>\n<tr>"; i=0}' file
print the line
if it's a <td> then increase i
if i is 4 print </tr><tr> and reset i
Testing with given input the desired output is returned,
with the only "problem" that an extra <tr></tr> appears at the end of the list.
This is fixable but I'm running out of time here.
When I get back I can look into it if you think it is needed.
... part of the end of the result file
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
<tr> <-- extra <tr></tr> here
</tr>
</table>
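One way to avoid the extra empty row is to emit the separator before the next cell rather than after the fourth one, so it only appears when another <td> actually follows. The sketch below shows the same counter idea in Perl rather than awk. It assumes one tag per line, as in the question's input, and split_rows is a hypothetical name.

```perl
use strict;
use warnings;

# Insert "</tr>" and "<tr>" before every (N+1)th <td> line inside a row,
# so a separator is only emitted when another cell actually follows.
sub split_rows {
    my ( $per_row, @lines ) = @_;
    my @out;
    my $count = 0;
    for my $line (@lines) {
        $count = 0 if $line =~ m{</?tr>};      # a row boundary resets the counter
        if ( $line =~ /<td>/ ) {
            if ( $count == $per_row ) {        # row is full and a cell follows
                push @out, '</tr>', '<tr>';
                $count = 0;
            }
            $count++;
        }
        push @out, $line;
    }
    return @out;
}

my @input = ( '<tr>', ( map { "<td>$_</td>" } 1 .. 8 ), '</tr>' );
print "$_\n" for split_rows( 4, @input );
```

For a real file the lines could be read with chomp(my @lines = <>) and passed in the same way.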
You can try regular expressions. You can test the following expression at:
http://gskinner.com/RegExr/
Catch expression:
(<td>.*?</td>.<td>.*?</td>.<td>.*?</td>.<td>.*?</td>)(?!.</tr>)
Replace expression:
$1\n</tr>\n<tr>
Flags checked:
global, ignorecase, dotall
Result:
<table>
<tr>
<td>05:50</td>
<td>05:58</td>
<td>06:04</td>
<td>06:08</td>
</tr>
<tr>
<td>06:12</td>
<td>06:15</td>
<td>06:17</td>
<td>06:20</td>
</tr>
<tr>
<td>06:22</td>
<td>06:25</td>
<td>06:27</td>
<td>06:30</td>
</tr>
<tr>
<td>06:32</td>
<td>06:35</td>
<td>06:37</td>
<td>06:39</td>
</tr>
<tr>
<td>06:42</td>
<td>06:44</td>
<td>06:47</td>
<td>06:49</td>
</tr>
<tr>
<td>06:52</td>
<td>06:54</td>
<td>06:57</td>
<td>06:59</td>
</tr>
<tr>
<td>07:01</td>
<td>07:04</td>
<td>07:06</td>
<td>07:09</td>
</tr>
<tr>
<td>07:11</td>
<td>07:14</td>
<td>07:16</td>
<td>07:18</td>
</tr>
<tr>
<td>07:21</td>
<td>07:23</td>
<td>07:26</td>
<td>07:28</td>
</tr>
<tr>
<td>07:31</td>
<td>07:33</td>
<td>07:36</td>
<td>07:38</td>
</tr>
</table>
</div>
You can use an editor like Notepad++ to batch-replace across many files at once (the syntax will be a little different).
sed '\!<td>!,\!</table!{N;N;N;i\
</tr>\
<tr>
}' input_file
Perl solution, still using regular expression instead of parsing HTML:
perl -pe '
undef $inside if m{</tr>};
if ($inside and ($. % 4) == $tr_line) {
print "</tr>\n<tr>\n";
}
$inside = 1 if defined $tr_line;
$tr_line = ($. + 1) % 4 if /<tr>/;
' file
Using xsh:
open :F html file ; # Open as html.
while //table/tr[count(td)>4] wrap :U position()=8 tr //table/tr/td ; # Wrap four td's into a tr.
xmove :r //table/tr/tr before .. ; # Unwrap the extra tr.
remove //table/tr[last()] ; # Remove the extra tr.

Extracting an HTML table with Perl

I have the following table :
<table cellpadding="4" cellspacing="0" border="0">
<tr>
<td>
<span class="label">Label1< /pan>
</td>
<td>
label1_value1
</td>
</tr>
<tr>
<td>
<span class="label">Label2</span>
</td>
<td>
Label2_value1 <br/>
Label2_value2 <br/>
</td>
</tr>
<tr valign="top">
<td>
<span class="label">Label3</span>
</td>
<td>
Result 1<br/>
Result 2<br/>
<span class="related"> -
Result 1 SP2<br/> </span>
</td>
<\tr>
</table>
I want to use HTML::TableExtract to extract this table.
I use the following code to extract it:
$te->parse($table_content);
foreach my $row ($te->rows) {
if (defined($row->[1])) {
$row->[1]=~s/^\s+//gm;
$row->[1]=~s/\s+$/;/gm;
print $row->[1],"\n";
}
}
I want the result in this format:
label1_value1,label1_value1;label1_value2,result1;result2-result3
But I get the wrong results. Could someone point out what is wrong with my code, or tell me whether it's possible to parse spans with HTML::TableExtract?
I get the following:
label1_value1
Label2_value1
Label2_value2;
result1
result2
-
reuslt1;
First, there are some errors in your HTML such as </pan> and <\tr> and an unclosed span tag. Once those are fixed, the code
#!/usr/bin/env perl
use warnings; use strict;
use HTML::TableExtract;
my $te = HTML::TableExtract->new(
attribs => {
cellpadding => '4', cellspacing => '0', border => '0'
}
);
my ($table) = $te->parse(do { local $/; <DATA>} );
for my $row ( $table->rows ) {
    for my $cell (@$row) {
        $cell =~ s/^\s+//;
        $cell =~ s/\s+\z/;/;
        $cell =~ s/\s+/ /g;
    }
    print join("|", @$row), "\n";
}
will give you:
Label1;|label1_value1;
Label2;|Label2_value1 Label2_value2;
Label3;|Result 1 Result 2 - Result 1 SP2;
Now, I do not know what logic would lead from that to your desired output of:
label1_value1,label1_value1;label1_value2,result1;result2-result3
Could you please provide more information on what you are trying to do?
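Until that is clarified, here is one guess at the intent: separate the values inside a single cell with semicolons. The sketch below assumes each value ends up on its own line in the cell text (for example because the <br/> tags were turned into newlines during extraction). The name cell_items is hypothetical.

```perl
use 5.014;    # for the /r substitution flag
use strict;
use warnings;

# Turn the text of one cell into "item;item;..." by treating each
# non-empty line as one item.
sub cell_items {
    my ($text) = @_;
    return '' unless defined $text;
    my @items = grep { length } map { s/^\s+|\s+$//gr } split /\n/, $text;
    return join ';', @items;
}

print cell_items("\nLabel2_value1 \nLabel2_value2 \n"), "\n";
```

The per-cell strings could then be joined with commas across rows, if that is what the desired format means.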

HTML::TableExtract: applying the right attribs to specify the attributes of interest

I tried to run the following Perl script on the HTML further below. My problem is how to define the correct hash reference, with attribs that specify attributes of interest within my HTML <table> tag itself.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;
my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 1, br_translate => 0 );
$table->parse($html);
foreach my $row ($table->rows)
sub cleanup {
for ( @_ ) {
s/\s+//;
s/[\xa0 ]+\z//;
s/\s+/ /g;
}
}
{ print join("\t", @$row), "\n"; }
I want to apply this code on the HTML-document you see further below.
My first approach was to do this with the columns method, but I am not able to figure out how to use the columns method on the HTML file below. My intuition is that it should be something like the following (but my intuition is wrong):
foreach my $column ($table->columns) {
print join("\t", @$column), "\n";
}
The HTML::TableExtract documentation doesn't shed much light (for me anyway).
I can see in the code of the module that the columns method belongs to HTML::TableExtract::Table, but I can't figure out how to use it. I appreciate any help.
Background:
I am trying to extract the table. I have a very small document of tables that I want to parse with the HTML::TableExtract module. I am trying to search for keywords in the HTML so that I can use them for the attribs; I need to print only the necessary data.
I looked on CPAN but could not really find out how to search through the table for particular keywords. One way to do it would be HTML::TableExtract; the other would be to parse with HTML::TokeParser, with which I have very little experience.
One way or the other, I need to do this parsing. I want to write the parsed tables to a text file or, even better, store them in a database. The problem is that I can't find any way to search through the resulting parsed table and get the necessary data.
The HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<link rel="stylesheet" href="jspsrc/css/bp_style.css" type="text/css">
<title>Weitere Schulinformationen</title>
</head>
<body class="bodyclass">
<div style="text-align:center;"><center>
<!-- <fieldset><legend> general information </legend>
-->
<br/>
<table border="1" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_result_tab_info'>
<!-- <table border="0" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_search_info'>
-->
<tr>
<td width="100%" colspan="2" class="ldstabTitel"><strong>data_one </strong></td>
</tr>
<tr>
<td width="27%"><strong>data_two</strong></td>
<td width="73%"> 116439
</td>
</tr>
<tr>
<td width="27%"><strong>official_description</strong></td>
<td width="73%">the name </td>
</tr>
<tr>
<td width="27%"><strong>name of the street</strong></td>
<td width="73%">champs elysee</td>
</tr>
<tr>
<td width="27%"><strong>number and town</strong></td>
<td width="73%"> 75000 paris </td>
</tr>
<tr>
<td width="27%"><strong>telefon</strong></td>
<td width="73%"> 000241 49321
</td>
</tr>
<tr>
<td width="27%"><strong>fax</strong></td>
<td width="73%"> 000241 4093287
</td>
</tr>
<tr>
<td width="27%"><strong>e-mail-adresse</strong></td>
<td width="73%"> <a href=mailto:1111116439@my_domain.org>1222216439@site.org</a>
</td>
</tr>
<tr>
<td width="27%"><strong>internet-site</strong></td>
<td width="73%"> <a href=http://www.thesite.org>http://www.thesite.org</td>
</tr>
<!--
<tr>
<td width="27%"> </td>
<td width="73%" align="right"><a href="schule_aeinfo.php?SNR=<? print $SCHULNR ?>" target="_blank">
[Schuldaten ändern] </a>
</tr>
</td> -->
<tr>
<td width="27%"> </td>
<td width="73%">the department</td>
</tr>
<tr>
<td width="100%" colspan=2><strong> </strong></td>
</tr>
<tr>
<td width="27%"><strong>number of indidviduals</strong></td>
<td width="73%"> 192</td>
<tr>
<td width="100%" colspan=2><strong> </strong></td>
</tr>
<!-- if (!fsp.isEmpty()){
ztext = " ";
int i = 0;
Iterator it = fsp.iterator();
while (it.hasNext()){
String[] zwert = new String[2];
zwert = (String[])it.next();
if (i==0){
if (zwert[1].equals("0")){
ztext = ztext+zwert[0];
}else{
ztext = ztext+zwert[0]+" mit "+zwert[1];
if (zwert[1].equals("1")){
ztext = ztext+" Schüler";
}else{
ztext = ztext+" Schülern";
}
}
i++;
}else{
if (zwert[1].equals("0")){
ztext = ztext+"<br> "+zwert[0];
}else{
ztext = ztext+"<br> "+zwert[0]+" mit "+zwert[1];
if (zwert[1].equals("1")){
ztext = ztext+" Schüler";
}else{
ztext = ztext+" Schülern";
}
}
}
}
-->
</table>
<!-- </fieldset> -->
<br>
</body>
</html>
Thanks for any and all help.
You need to provide something that uniquely identifies the table in question. This can be the content of its headers or the HTML attributes. In this case, there is only one table in the document, so you don't even need to do that. But, if I were to provide anything to the constructor, I would provide the class of the table.
Also, I do not think you want the columns of the table. The first column of this table consists of labels and the second column consists of values. To get the labels and values at the same time, you should process the table row-by-row.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;
my $te = HTML::TableExtract->new(
attribs => { class => 'bp_result_tab_info' },
);
$te->parse_file('t.html');
for my $table ( $te->tables ) {
print Dump $table->columns;
}
Output:
---
- 'data_one '
- data_two
- official_description
- name of the street
- number and town
- telefon
- fax
- e-mail-adresse
- internet-site
- á
- á
- number of indidviduals
- á
---
- ~
- "á116439\r\n "
- 'the name '
- champs elysee
- ' 75000 paris '
- "á000241 49321\r\n"
- "á000241 4093287\r\n"
- "á1222216439@site.org\r\n"
- áhttp://www.thesite.org
- the department
- ~
- á192
- ~
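The row-by-row processing recommended above could look roughly like the following. This is a sketch under assumptions: the rows are taken to be array references of the form [ label, value ], as $table->rows would return them; rows with an empty label or value are skipped; and clean_cell and label_value_pairs are hypothetical names. The stray characters in the output above look like non-breaking spaces (\xa0), so the cleanup strips those as well.

```perl
use strict;
use warnings;

# Collapse whitespace and non-breaking spaces (\xa0), then trim both ends.
sub clean_cell {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/[\s\xa0]+/ /g;    # runs of whitespace, \r\n and nbsp become one space
    $text =~ s/^ | $//g;         # drop the single leading/trailing space left over
    return $text;
}

# Pair up label and value from rows shaped like [ label, value ].
sub label_value_pairs {
    my (@rows) = @_;
    my @pairs;
    for my $row (@rows) {
        my ( $label, $value ) = map { clean_cell($_) } @{$row}[ 0, 1 ];
        next unless length($label) && length($value);    # skip spacer and title rows
        push @pairs, [ $label, $value ];
    }
    return @pairs;
}

# Sample rows standing in for what $table->rows would return.
my @rows = (
    [ 'data_one ', undef ],
    [ 'data_two',  "\xa0116439\r\n " ],
    [ "\xa0",      'the department' ],
    [ 'number of indidviduals', "\xa0192" ],
);
printf "%s: %s\n", @$_ for label_value_pairs(@rows);
```

With a real document, the call would be label_value_pairs( $table->rows ) inside the loop over $te->tables.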
Finally, a word of advice: It is clear that you do not have much of an understanding of Perl (or HTML for that matter). It would be better for you to try to learn some of the basics first. This way, all you are doing is incorrectly copying and pasting code from one answer into another and not learning anything.