parsing with Mojo::DOM - perl

I am using Mojo::UserAgent->new to fetch some XML which has the following format:
<row>
<td> content1 </td>
<td> content2 </td>
<td> content3 </td>
</row>
<row>
<td> content4 </td>
<td> content5 </td>
<td> content6 </td>
</row>
Is it possible to view the results like this:
content1,content2,content3
content4,content5,content6
Below is the query I am using, which gives different results:
$ua->get($url)->res->dom->at->(row)->children->each(sub {print "$_\t"})

Sure, that's absolutely possible and not hard with Mojo::Collection working behind the scenes.
Code
# replace this line by your existing $ua->get($url)->res->dom code
my $dom = Mojo::DOM->new(do { local $/ = undef; <DATA> });
# pretty-print rows
$dom->find('row')->each(sub {
    my $row = shift;
    say $row->children->pluck('text')->join(', ');
});
Data
__DATA__
<row>
<td> content1 </td>
<td> content2 </td>
<td> content3 </td>
</row>
<row>
<td> content4 </td>
<td> content5 </td>
<td> content6 </td>
</row>
Output
content1, content2, content3
content4, content5, content6
Some comments
each evaluates a code ref for each element of a collection (which is what find returns).
pluck returns a Mojo::Collection object with the return values of the given method name (text in this case). This is just a fancy way to map simple stuff.
text automagically trims the element content.
join joins all elements of the Mojo::Collection object together, all td elements of a row in this case.
Your code doesn't even compile, but using at won't work anyway because it returns just the first matching DOM element, not all. You want to iterate all rows.
HTH!
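A note for readers on a recent Mojolicious: as far as I know, pluck has since been replaced by map called with a method name, and text no longer trims whitespace. Under those assumptions the same idea can be sketched as below. The $xml string stands in for the body you fetch with $ua, and parsing in XML mode via xml(1) plus the explicit trim are my choices, not part of the original answer.

```perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Stand-in for the XML fetched with $ua->get($url) (assumption).
my $xml = <<'XML';
<row>
<td> content1 </td>
<td> content2 </td>
<td> content3 </td>
</row>
<row>
<td> content4 </td>
<td> content5 </td>
<td> content6 </td>
</row>
XML

# Parse in XML mode so the tag names are taken literally.
my $dom = Mojo::DOM->new->xml(1)->parse($xml);

# Build one comma-separated line per <row>.
my @lines;
$dom->find('row')->each(sub {
    my $row = shift;
    push @lines, $row->children('td')
        ->map('text')                      # text of every <td>
        ->map(sub { s/^\s+|\s+$//gr })     # trim explicitly
        ->join(',');
});
say for @lines;
```

If you prefer the ", " separator from the answer above, only the argument to join needs to change.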

Related

html2text command line breaking html

I'm trying to figure out why html2text is breaking my HTML:
<div><table> <tbody> <tr> <td> <span><strong><span>About</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>Contact</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>FAQ</span></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>
Processing it with:
cat "/home/spider/original-file.txt" | html2text -utf8 -nobs -style pretty
When I run that, I get:
Input recoding failed due to invalid input sequence. Unconverted part
of text follows. ▒Contact ▒Maths Games Order ▒FAQ
s Broadbent Maths Ltd 3 High Street, Welbourn, Lincoln, LN5 0NH
When I run Devel::Peek::Dump() (Perl), I see the string as:
SV = PV(0x564c0a72c860) at 0x564c09967c80
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK,UTF8)
PV = 0x564c0a58bc60 "\n<div><table> <tbody> <tr> <td> <span><strong><span>About</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>Contact</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>FAQ</span></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"\0 [UTF8 "\n<div><table> <tbody> <tr> <td> <span><strong><span>About</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>Contact</span></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><a><span>Maths Games Order</span></a></strong></span></td> <td> <span><strong><span>•</span></strong></span></td> <td> <span><strong><span>FAQ</span></strong></span></td> </tr> </tbody> </table>s<div> <span><strong>Broadbent Maths Ltd<br> 3 High Street, Welbourn, Lincoln, LN5 0NH </strong></span></div> </div>\n"]
CUR = 725
LEN = 736
COW_REFCNT = 1
If I remove the first bit:
<div><table>
It works fine! I don't get why it's breaking there though - all seems OK to me?
OK, I think I've worked it out. In this case, for some reason • was breaking it. I replaced that with "-", and it works now:
html2text -utf8 -nobs -o test-out.txt test.co.uk.txt
It's a bit weird that html2text breaks with HTML entities though?
UPDATE: The problem turned out to be an encoding mismatch: the page declared utf-8 in its meta tag, but the server was passing it along as iso-8859-1. So what I did was parse the charset out of the server header and check it before saving; if it was windows-1252, I would use this command instead:
html2text -ansi -nobs -o test-out.txt test.co.uk.txt
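The header check described in the update could be sketched roughly as follows. The helper names and the charset-to-flag mapping are assumptions for illustration; only the -utf8 and -ansi flags come from the commands above, and treating iso-8859-1 the same as windows-1252 is a guess on my part.

```perl
use strict;
use warnings;

# Pull the charset out of a Content-Type header value such as
# "text/html; charset=ISO-8859-1". Returns undef if none is present.
sub charset_from_header {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    return lc $1 if $content_type =~ /charset\s*=\s*"?([^\s";]+)/i;
    return undef;
}

# Map a charset name to an html2text encoding flag (hypothetical mapping).
sub html2text_flag {
    my ($charset) = @_;
    return undef unless defined $charset;
    return '-utf8' if $charset =~ /^utf-?8$/;
    return '-ansi' if $charset =~ /^(?:windows-1252|iso-8859-1)$/;
    return undef;    # unknown charset: let the caller decide
}

my $flag = html2text_flag(charset_from_header('text/html; charset=windows-1252'));
print "html2text $flag -nobs -o test-out.txt test.co.uk.txt\n";
```

The point is to trust the server header rather than the meta tag when the two disagree, which is what fixed it here.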

How to find text for multiple elements using find_elements

I'm fairly new to using Selenium Remote Driver and Perl. What I'd like to do is to have Selenium find all elements on a page using a partial match of text. Then store the full text of those elements into an array.
I've tried using:
@elements = $driver->find_elements("//tbody/tr[td[2]/div/span[2][contains(text(),'matching text')]]")->get_text;
However, this doesn't seem to work.
I've also tried:
@elements = $driver->find_elements("//tbody/tr[td[2]/div/span[2][contains(text(),'matching text')]]");
This does populate the array with webelements.
my @elements;
my @elementtext;
my $elementtext;
@elements = $driver->find_elements("//tbody/tr[td[2]/div/span[2][contains(text(),'matching text')]]");
foreach my $currentelement (@elements) {
    $elementtext = $driver->find_element($currentelement)->get_text();
    push @elementtext, $elementtext;
}
This causes Perl to raise an error because WebDriver can't find the element. Any ideas on what I'm doing wrong and how to fix it? I suspect the problem is that the contents of the @elements array are not actually XPath elements.
Here is an example of the html:
<td>
<div class='cellContent'>Atlanta</div>
</td>
<td>
<div class='cellContent'>City</div>
</td>
<td>
<div class='cellContent'>Georgia</div>
</td>
<td class='sort_column'>
<div class='cellContent'>USA</div>
</td>
</tr>
<tr>
<td>
<div class='cellContent'>Joe Passenger</div>
</td>
<td>
<div class='cellContent'> <span>NFL</span>
<span>matching text: Atlanta.Falcons.team</span>
</div>
</td>
I want to get 'matching text: Atlanta.Falcons.team' stored into the array.
You can point directly at the span tag using this XPath:
//tbody/tr/td/div/span[contains(text(),'matching text')]
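On the Perl side, find_elements already hands back element objects, so there is no need to pass them through find_element again; get_text can be called on each one directly. A small sketch, where texts_for is a hypothetical helper name and $driver is assumed to be a connected Selenium::Remote::Driver:

```perl
use strict;
use warnings;

# Collect the visible text of every element matching an XPath.
# $driver is assumed to be a Selenium::Remote::Driver instance.
sub texts_for {
    my ($driver, $xpath) = @_;
    # In list context find_elements returns the matching element objects.
    my @elements = $driver->find_elements($xpath, 'xpath');
    # Each element object has its own get_text method.
    return map { $_->get_text } @elements;
}
```

It would be called as my @elementtext = texts_for($driver, "//tbody/tr/td/div/span[contains(text(),'matching text')]");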

want to write an or condition in if loop of perl template tool kit

I want to write an if condition with several or conditions:
[% IF bug.product == 'CustomerCare' or bug.product =='Alerts' or bug.product =='Chatlog' %]
<tr><td colspan="2"> <h3 align="center">Have you verified the Checklist ?</h3></td></tr>
<tr>
<td>
<input type="checkbox" id="chck1" name="greet" value="1" [% FOREACH gre = chk_greet%] checked [% END%] />
</td>
<td>
<label for = "chck1"> Greet the customer ?</label>
</td>
</tr>
<tr>
<td> <input type="checkbox" id="chck2" name="issue_status" value="1" [% FOREACH iss = chk_issustat%] checked [% END%] />
</td>
<td> <label for = "chck2">Issue under concern and its status (whether resolved or not)</label> </td>
</tr>
<tr>
<td>
<input type="checkbox" id="chck3" name="done_fix" value="1" [% FOREACH don = chk_done%] checked [% END%] [% END %]/>
</td>
</tr>
What is the correct format to write this condition?
Read the fine manual. It includes examples for your very case.
[% IF (bug.product == 'CustomerCare') || (bug.product =='Alerts') ... %]
If your list of values starts to get a bit big, using a hashref is another way to simplify this logic, particularly if you are going to end up writing it over and over. It also makes the logic clearer and less verbose.
[%- # Do this once, near the top.
SET checklistable = { CustomerCare => 1, Alerts => 1, Chatlog => 1 }; -%]
[%- # then later on, as required;
IF checklistable.item(bug.product);
....
END; -%]
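To try the hash lookup outside the application, something like the following Perl harness should do. It assumes the Template module is installed; render_for, the sample product names, and the output strings are made up for the sketch.

```perl
use strict;
use warnings;
use Template;

# Template text: a lookup hash replaces the chain of "or" comparisons.
my $template = <<'TT';
[%- SET checklistable = { CustomerCare => 1, Alerts => 1, Chatlog => 1 } -%]
[%- IF checklistable.item(bug.product) -%]
checklist for [% bug.product %]
[%- ELSE -%]
no checklist
[%- END -%]
TT

my $tt = Template->new;

# Render the template for one product name and return the trimmed output.
sub render_for {
    my ($product) = @_;
    my $out = '';
    $tt->process(\$template, { bug => { product => $product } }, \$out)
        or die $tt->error;
    $out =~ s/^\s+|\s+$//g;    # ignore surrounding whitespace
    return $out;
}

print render_for($_), "\n" for qw(Alerts Billing);
```

A product that is not a key of the hash simply evaluates as false, so no extra comparison is needed when the list grows.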

perl HTML::TableExtract get stripped text

My tables' rows in HTML are as follows,
<TR bgcolor="#FFFFFF" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#FFFFFF';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">DRB</B> </TD><TD class="dlfont">Blah</B> </TD>
<TD class="dlfont">PPD</B> </TD><TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><IMG border='0' src='/images/view.gif' height=10 width=19></TD>
</TR>
<TR bgcolor="#EEEEEE" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#EEEEEE';">
<TD class="dlfont">07/01/2011 10:33 AM EDT</B> </TD>
<TD class="dlfont">WHPSF</B> </TD>
<TD class="dlfont">Blah</B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont"> </B> </TD>
<TD class="dlfont">07/01/2011</B> </TD>
<TD width=50 align=center><IMG border='0' src='/images/view.gif' height=10 width=19></TD>
</TR>
When I extract the rows using HTML::TableExtract, the extra characters </B> also appear at the end and form some kind of special character. How can I get rid of this?
I would keep in mind two things when using HTML::TableExtract with the badly formatted HTML in your question:
use keep_html=>1 in the HTML::TableExtract constructor
use a regex to remove the </B> , carefully
Here's some Perl code I wrote to prune the </B> out of the table cells, but note, this could change validly formatted HTML to badly formatted HTML if you blindly apply it in all cases.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
my ($f) = @ARGV;
open my $fh, '<', $f or die "Cannot open $f: $!";
my $html = join '', <$fh>;
close $fh;
### your html didn't include headers, so I added a first table row with td text, time a b c d e f, to help HTML::TableExtract find the table in file, $f
my $te = HTML::TableExtract->new(
    keep_html => 1,
    headers   => [qw/ time a b c d e f/]);
$te->parse($html);
for my $ts ($te->tables)
{
    print "Table(", join(',', $ts->coords), "):\n";
    for my $row ($ts->rows)
    {
        for my $cell (@$row)
        {
            next unless $cell;
            ## maybe add $ at end of regex or other test here to make sure valid cases of <B>...</B> are not affected
            $cell =~ s/<\/B> //i;
            print $cell."\n";
        }
    }
}
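The guard hinted at in the comment above could look something like this: strip a closing </B> only when it sits at the end of the cell and the cell contains no opening <B>. clean_cell is a hypothetical helper name, and the extra whitespace trimming is my addition rather than part of the answer.

```perl
use strict;
use warnings;

# Remove a stray trailing </B> from a cell kept as raw HTML (keep_html => 1).
sub clean_cell {
    my ($cell) = @_;
    return '' unless defined $cell;
    # Leave well-formed <B>...</B> pairs alone; <BR> does not count as <B>.
    $cell =~ s{</B>\s*\z}{}i unless $cell =~ /<B\b/i;
    $cell =~ s/^\s+|\s+\z//g;    # trim leading and trailing whitespace
    return $cell;
}

print clean_cell($_), "\n" for 'DRB</B> ', '<B>Blah</B>', '07/01/2011</B> ';
```

In the loop above it would replace the bare substitution, for example print clean_cell($cell)."\n";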

HTML::TableExtract: applying the right attribs to specify the attributes of interest

I tried to run the following Perl script on the HTML further below. My problem is how to define the correct hash reference, with attribs that specify attributes of interest within my HTML <table> tag itself.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;
my $table = HTML::TableExtract->new(keep_html=>0, depth => 1, count => 1, br_translate => 0 );
$table->parse($html);
foreach my $row ($table->rows)
sub cleanup {
for ( @_ ) {
s/\s+//;
s/[\xa0 ]+\z//;
s/\s+/ /g;
}
}
{ print join("\t", @$row), "\n"; }
I want to apply this code on the HTML-document you see further below.
My first approach is to do this with the columns method, but I am not able to figure out how to use the columns method on the HTML file below. My intuition makes me think it should be something like the following (but my intuition is wrong):
foreach my $column ($table->columns) {
print join("\t", @$column), "\n";
}
The HTML::TableExtract documentation doesn't shed much light (for me anyway).
I can see in the code of the module that the columns method belongs to HTML::TableExtract::Table, but I can't figure out how to use it. I appreciate any help.
Background:
I am trying to extract the table. I have a very small document of tables that I want to parse with the HTML::TableExtract module. I am trying to search for keywords in the HTML so that I can use them for the attribs and print only the necessary data.
I looked on CPAN but could not really find how to search through it for particular keywords. One way to do it would be HTML::TableExtract; the other would be to parse with HTML::TokeParser, with which I have very little experience.
One way or the other, I need to do this parsing: I want to output the result of the parsed tables into a text file or, even better, store it in a database. The problem is that I can't find any way to search through the resulting parsed table and get the necessary data.
The HTML
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
<meta name="GENERATOR" content="Microsoft FrontPage 3.0">
<link rel="stylesheet" href="jspsrc/css/bp_style.css" type="text/css">
<title>Weitere Schulinformationen</title>
</head>
<body class="bodyclass">
<div style="text-align:center;"><center>
<!-- <fieldset><legend> general information </legend>
-->
<br/>
<table border="1" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_result_tab_info'>
<!-- <table border="0" cellspacing="0" bordercolordark="white" bordercolorlight="black" width="80%" class='bp_search_info'>
-->
<tr>
<td width="100%" colspan="2" class="ldstabTitel"><strong>data_one </strong></td>
</tr>
<tr>
<td width="27%"><strong>data_two</strong></td>
<td width="73%"> 116439
</td>
</tr>
<tr>
<td width="27%"><strong>official_description</strong></td>
<td width="73%">the name </td>
</tr>
<tr>
<td width="27%"><strong>name of the street</strong></td>
<td width="73%">champs elysee</td>
</tr>
<tr>
<td width="27%"><strong>number and town</strong></td>
<td width="73%"> 75000 paris </td>
</tr>
<tr>
<td width="27%"><strong>telefon</strong></td>
<td width="73%"> 000241 49321
</td>
</tr>
<tr>
<td width="27%"><strong>fax</strong></td>
<td width="73%"> 000241 4093287
</td>
</tr>
<tr>
<td width="27%"><strong>e-mail-adresse</strong></td>
<td width="73%"> <a href=mailto:1111116439@my_domain.org>1222216439@site.org</a>
</td>
</tr>
<tr>
<td width="27%"><strong>internet-site</strong></td>
<td width="73%"> <a href=http://www.thesite.org>http://www.thesite.org</td>
</tr>
<!--
<tr>
<td width="27%"> </td>
<td width="73%" align="right"><a href="schule_aeinfo.php?SNR=<? print $SCHULNR ?>" target="_blank">
[Schuldaten ändern] </a>
</tr>
</td> -->
<tr>
<td width="27%"> </td>
<td width="73%">the department</td>
</tr>
<tr>
<td width="100%" colspan=2><strong> </strong></td>
</tr>
<tr>
<td width="27%"><strong>number of indidviduals</strong></td>
<td width="73%"> 192</td>
<tr>
<td width="100%" colspan=2><strong> </strong></td>
</tr>
<!-- if (!fsp.isEmpty()){
ztext = " ";
int i = 0;
Iterator it = fsp.iterator();
while (it.hasNext()){
String[] zwert = new String[2];
zwert = (String[])it.next();
if (i==0){
if (zwert[1].equals("0")){
ztext = ztext+zwert[0];
}else{
ztext = ztext+zwert[0]+" mit "+zwert[1];
if (zwert[1].equals("1")){
ztext = ztext+" Schüler";
}else{
ztext = ztext+" Schülern";
}
}
i++;
}else{
if (zwert[1].equals("0")){
ztext = ztext+"<br> "+zwert[0];
}else{
ztext = ztext+"<br> "+zwert[0]+" mit "+zwert[1];
if (zwert[1].equals("1")){
ztext = ztext+" Schüler";
}else{
ztext = ztext+" Schülern";
}
}
}
}
-->
</table>
<!-- </fieldset> -->
<br>
</body>
</html>
Thanks for any and all help.
You need to provide something that uniquely identifies the table in question. This can be the content of its headers or the HTML attributes. In this case, there is only one table in the document, so you don't even need to do that. But, if I were to provide anything to the constructor, I would provide the class of the table.
Also, I do not think you want the columns of the table. The first column of this table consists of labels and the second column consists of values. To get the labels and values at the same time, you should process the table row-by-row.
#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;
my $te = HTML::TableExtract->new(
attribs => { class => 'bp_result_tab_info' },
);
$te->parse_file('t.html');
for my $table ( $te->tables ) {
print Dump $table->columns;
}
Output:
---
- 'data_one '
- data_two
- official_description
- name of the street
- number and town
- telefon
- fax
- e-mail-adresse
- internet-site
- á
- á
- number of indidviduals
- á
---
- ~
- "á116439\r\n "
- 'the name '
- champs elysee
- ' 75000 paris '
- "á000241 49321\r\n"
- "á000241 4093287\r\n"
- "á1222216439@site.org\r\n"
- áhttp://www.thesite.org
- the department
- ~
- á192
- ~
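The row-by-row approach recommended above can be sketched like this, pairing each label with its value. The inline $html is a cut-down stand-in for t.html, and the trim helper and %info hash are illustrative names rather than anything provided by the module.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Cut-down stand-in for the document in the question (assumption).
my $html = <<'HTML';
<table class="bp_result_tab_info">
<tr><td colspan="2"><strong>data_one</strong></td></tr>
<tr><td><strong>telefon</strong></td><td>&nbsp;000241 49321</td></tr>
<tr><td><strong>fax</strong></td><td>&nbsp;000241 4093287</td></tr>
</table>
HTML

# Strip ordinary whitespace and non-breaking spaces from both ends.
sub trim {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/^[\s\xA0]+|[\s\xA0]+\z//g;
    return $text;
}

my $te = HTML::TableExtract->new(attribs => { class => 'bp_result_tab_info' });
$te->parse($html);
$te->eof;

# Walk the table row by row: first cell is the label, second the value.
my %info;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        my ($label, $value) = map { trim($_) } @$row;
        next unless length($label // '') && length($value // '');
        $info{$label} = $value;
    }
}
print "$_: $info{$_}\n" for sort keys %info;
```

Rows without both a label and a value (the title row and the spacer rows) are skipped, and the resulting hash is easy to write to a text file or a database.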
Finally, a word of advice: It is clear that you do not have much of an understanding of Perl (or HTML for that matter). It would be better for you to try to learn some of the basics first. This way, all you are doing is incorrectly copying and pasting code from one answer into another and not learning anything.