How can I remove an attribute from all DOM elements with Mojolicious? - perl

I want to remove the bgcolor attribute from all elements of a page I am scraping via Mojolicious.
My attempt has been the following:
$dom->all_contents->each(sub { $_->attr('bgcolor' => undef) });
but this seems not to work.
How do I do it right?

The following uses Mojo::DOM to delete the bgcolor attribute for every node:
use strict;
use warnings;
use Mojo::DOM;
my $dom = Mojo::DOM->new(do {local $/; <DATA>});
for my $node ($dom->find('*')->each) {
delete $node->{bgcolor};
}
print $dom;
__DATA__
<html>
<head>
<title>Hello background color</title>
</head>
<body bgcolor="white">
<h1>Hello world</h1>
<table>
<tr><td bgcolor="blue">blue</td></tr>
<tr><td bgcolor="green">green</td></tr>
</table>
</body>
</html>
Outputs:
<html>
<head>
<title>Hello background color</title>
</head>
<body>
<h1>Hello world</h1>
<table>
<tr><td>blue</td></tr>
<tr><td>green</td></tr>
</table>
</body>
</html>
Notes:
It's possible to use CSS Selectors to limit the returned nodes to only those containing the specific attribute:
for my $node ($dom->find('[bgcolor]')->each) {
One can also let Mojo handle the iteration like the following:
$dom->find('*')->each(sub {
delete $_->{bgcolor};
});
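The two notes above can be combined. The following is a minimal sketch, assuming a reasonably recent Mojolicious is installed; the sample markup in `$html` is made up for illustration and is not the page from the question.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Made-up sample markup for illustration only
my $html = <<'HTML';
<body bgcolor="white">
<table>
<tr><td bgcolor="blue">blue</td></tr>
<tr><td>plain</td></tr>
</table>
</body>
HTML

my $dom = Mojo::DOM->new($html);

# Select only the elements that carry the attribute, then delete it
# from each one; $_ is the current element inside the callback.
$dom->find('[bgcolor]')->each(sub { delete $_->{bgcolor} });

print $dom;
```

Restricting the selector to `[bgcolor]` means elements without the attribute are never visited, which may matter on large pages.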

As I understand it, the DOM attribute you're looking for isn't bgcolor but background-color, the css variety. bgcolor fell out of popularity a while ago, in favor of defining classes and using CSS to set the styling on an object (including its background color). Try background-color instead.

Related

Perl - geturls with WWW::Mechanize

I am trying to submit a form on http://bioinfo.noble.org/TrSSP/ and want to extract the result.
My query data looks like this
>ATCG00270
MTIALGKFTKDEKDLFDIMDDWLRRDRFVFVGWSGLLLFPCAYFALGGWFTGTTFVTSWYTHGLASSYLEGCNFLTAA VSTPANSLAHSLLLLWGPEAQGDFTRWCQLGGLWAFVALHGAFALIGFMLRQFELARSVQLRPYNAIAFSGPIAVFVSVFLIYPLGQSGWFFAPSFGVAAIFRFILFFQGFHNWTLNPFHMMGVAGVLGAALLCAIHGATVENTLFEDGDGANTFRAFNPTQAEETYSMVTANRFWSQIFGVAFSNKRWLHFFMLFVPVTGLWMSALGVVGLALNLRAYDFVSQEIRAAEDPEFETFYTKNILLNEGIRAWMAAQDQPHENLIFPEEVLPRGNAL
My script looks like this
use strict;
use warnings;
use File::Slurp;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
my $sequence = $ARGV[0];
$mech->get( 'http://bioinfo.noble.org/TrSSP' );
$mech->submit_form( fields => { 'query_file' => $sequence, }, );
print $mech->content;
#sleep (10);
open( OUT, ">out.txt" );
my @a = $mech->find_all_links();
print OUT "\n", $a[$_]->url for ( 0 .. $#a );
print $mech->content gives a result like this
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>The job is running, please wait...</title>
<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="interface/style.css" type="text/css">
</head>
<body>
<table width="90%" align="center" border="0" cellpadding="0" cellspacing="0" class="table1">
<tr align="center">
<td width="50"> </td>
<td></td>
<td> </td>
</tr>
<tr align="left" height="30" valign="middle">
<td width="30"> </td>
<td bgColor="#CCCCFF"> Your sequences have been submitted to backend pipeline, please wait for result:</td>
<td width="30"> </td>
</tr>
<tr align="left">
<td> </td>
<td>
<br><br><font color="#0000FF"><strong>
</strong></font>
<BR><BR><BR><BR><BR><BR><br><br><BR><br><br><hr>
If you don't want to wait online, please copy and keep the following link to retrieve your result later:<br>
<strong>http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763</strong>
<script language="JavaScript" type="text/JavaScript">
function doit()
{
window.location.href="/TrSSP/?sessionid=1492435151653763";
}
setTimeout("doit()",9000);
</script>
</td>
<td> </td>
</tr>
</table>
</body>
</html>
I want to extract this link
http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763
and download the result when the job is completed. But find_all_links() is recognizing /TrSSP/?sessionid=1492434554474809 as a link.
We don't know how long this backend process is going to take. If it's minutes, you could have your program wait. Even if it's hours, waiting is reasonable.
In a browser, the page is going to refresh on its own. There are two auto-refresh mechanisms implemented in the response you are showing.
<script language="JavaScript" type="text/JavaScript">
function doit()
{
window.location.href="/TrSSP/?sessionid=1492435151653763";
}
setTimeout("doit()",9000);
</script>
JavaScript's setTimeout takes its delay in milliseconds, so this redirect fires after 9 seconds.
There is also a meta tag that tells the browser to auto-refresh:
<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">
Here, the 4 in the content means 4 seconds. So this would be done earlier.
Of course we also don't know how long they keep the session around. It might be a safe approach to reload that page every ten seconds (or more often, if you want).
You can do that by building a simple while loop and checking if the refresh is still in the response.
# do the initial submit here
...
# assign this by grabbing it from the page
# in this case, regex on HTML is fine; die if the page does not look as expected
my ($url) = $mech->content =~ m{<strong>(\Qhttp://bioinfo.noble.org/TrSSP/?sessionid=\E\d+)</strong>}
or die "Could not find the result URL in the response\n";
print "Waiting for $url\n";
while (1) {
$mech->get($url);
last unless $mech->content =~ m/refresh/;
sleep 10; # or whatever number of seconds
}
# process the final response ...
We first submit the data. We then extract the URL that you're supposed to call until they are done processing. Since this is a pretty straightforward document, we can safely use a pattern match. The URL is always the same, and it's clearly marked with the <strong> tag. In general it's not a good idea to use regex to parse HTML, but we're not really parsing, we are just screen-scraping a single value. The \Q and \E do the same as quotemeta and make sure that we don't have to escape the . and ?, which is easier to read than having a bunch of backslashes \ in the pattern.
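To illustrate the \Q ... \E point, here is a small sketch using only core Perl. The variable names `$prefix` and `$html` are introduced for illustration; the HTML snippet is the relevant line from the response shown in the question.

```perl
use strict;
use warnings;

# Illustrative names; the snippet mirrors the <strong> line from the question
my $prefix = 'http://bioinfo.noble.org/TrSSP/?sessionid=';
my $html   = '<strong>http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763</strong>';

# Variant 1: \Q ... \E quotes the metacharacters inside the pattern
my ($url_q) = $html =~ m{<strong>(\Q$prefix\E\d+)</strong>};

# Variant 2: quotemeta does the same quoting ahead of time
my $quoted = quotemeta $prefix;
my ($url_m) = $html =~ m{<strong>($quoted\d+)</strong>};

print "Both variants agree\n" if $url_q eq $url_m;
print "$url_q\n";
```

Without the quoting, the ? after the slash would be read as a quantifier and the literal ? in the URL would never match.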
The script sleeps for ten seconds after every attempt before trying again. Once the response no longer contains the refresh, it breaks out of the endless loop, so you can put the processing of the actual response, the one with the data you wanted, after that loop.
It might make sense to add some output into the loop so you can see that it's still running.
Note that this needs to really keep running until it's done. Don't stop the process.
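The loop can also be given progress output and an upper bound on attempts, so it cannot run forever if the session disappears. The following is only a sketch: `poll_until_done` is a hypothetical helper (not part of WWW::Mechanize), and the `fetch` callback here serves simulated pages so the example runs without network access. In a real script the callback would call `$mech->get($url)` and return `$mech->content`.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $args{fetch} repeatedly until the returned
# page no longer contains the meta refresh, or gives up after max_attempts.
sub poll_until_done {
    my (%args) = @_;
    my $fetch        = $args{fetch};
    my $delay        = $args{delay}        // 10;
    my $max_attempts = $args{max_attempts} // 60;

    for my $attempt (1 .. $max_attempts) {
        my $content = $fetch->();
        return $content unless $content =~ /http-equiv="refresh"/i;
        print STDERR "Attempt $attempt: job still running\n";
        sleep $delay if $delay;
    }
    die "Gave up after $max_attempts attempts\n";
}

# Simulated responses: two "still running" pages, then the result
my @pages = ('<meta http-equiv="refresh" content="4">') x 2;
push @pages, '<html>result</html>';

my $result = poll_until_done(
    fetch => sub { shift @pages },   # stand-in for $mech->get + $mech->content
    delay => 0,                      # no waiting in this demonstration
);
print "$result\n";
```

Matching on `http-equiv="refresh"` rather than the bare word refresh is a design choice of this sketch; it reduces the chance that a result page mentioning "refresh" somewhere in its text keeps the loop going.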

Replacing XML nodes using perl and Mojo::DOM

I would like to exchange node in an XML file using Mojo::DOM.
I'm pretty sure it is possible but I didn't find a way yet.
Given the following XML:
my $xml = q~
<html>
<div>
<p>1</p>
<p>2</p>
<img />
</div>
</html>
~;
I would like to remove the div and instead insert a body tag, so that the result looks like this:
my $xml = q~
<html>
<body>
<p>1</p>
<p>2</p>
<img />
</body>
</html>
~;
I thought about replace, but I didn't find an example where the replacement is the $dom of the replaced tag.
It's very simple: just find the <div> element and use the tag method to change its tag.
This program demonstrates. The CSS selector html > div finds the (first) <div> element that is a child of an <html> element.
use strict;
use warnings;
use Mojo::DOM;
my $xml = q~
<html>
<div>
<p>1</p>
<p>2</p>
<img />
</div>
</html>
~;
my $dom = Mojo::DOM->new($xml);
$dom->at('html > div')->tag('body');
print $dom, "\n";
output
<html>
<body>
<p>1</p>
<p>2</p>
<img>
</body>
</html>
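Note that the output above contains <img> rather than <img />, because Mojo::DOM treated the input as HTML. If the document really is XML, a variant along these lines, which forces XML semantics with the xml method, should keep the self-closing form. This is a sketch assuming a reasonably recent Mojolicious.

```perl
use strict;
use warnings;
use Mojo::DOM;

my $xml = q~
<html>
<div>
<p>1</p>
<p>2</p>
<img />
</div>
</html>
~;

# xml(1) switches parsing and rendering to XML semantics
my $dom = Mojo::DOM->new->xml(1)->parse($xml);

# Same rename as before: the first <div> child of <html> becomes <body>
$dom->at('html > div')->tag('body');

print $dom, "\n";
```

In XML mode tag and attribute names are also treated as case-sensitive, so selectors have to match the case used in the document.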

Why does this code hide the whole div when I use the childNodes[x] method?

I want to hide only the P0 paragraph using childNodes[x]. I wonder how it works, because this code hides the whole div:
<html>
<body>
<div id="myDiv">
<p>P0</p>
<p>P1</p>
</div>
<button onclick="hideFn();">hide</button>
<script>
function hideFn()
{
document.childNodes[0].childNodes[1].childNodes[1].style.display = "none";
}
</script>
</body>
</html>
You easily could have found the reason yourself by simply doing the traversal step by step:
document
the document
.childNodes[0]
the documentElement, also known as the root node (<html>)
.childNodes[1]
the <body>
.childNodes[1]
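the <div>, which is why the whole div gets hidden. The P0 paragraph is one level further down: in this markup the div's childNodes[0] is the whitespace text node after the opening tag, so P0 is childNodes[1] of the div.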

Search and replace the content between a specific tag

#!/usr/bin/perl
use strict;
use warnings;
my $html = q|
<html>
<head>
<style>
.classname{
color: red;
}
</style>
</head>
<body>
classname will have a color property.
</body>
</html>
|;
$html=~s/classname/NEW/g;
print $html;
This replaces classname in both places. How can I limit the replacement only to content of <body>? I'd like to see it done using HTML::Parser or HTML::TreeBuilder.
I believe this does what you want: using HTML::TreeBuilder, it applies your substitution to every piece of text under the body element, recursing into child elements.
I added another dummy div to input to make sure it was being processed correctly.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;
my $html = q|
<html>
<head>
<style>
.classname{
color: red;
}
</style>
</head>
<body>
classname will have a color property.
<div>more text with classname in it</div>
</body>
</html>
|;
my $tree = HTML::TreeBuilder->new_from_content($html);
replace_text( $tree->find_by_tag_name("body") );
print $tree->as_HTML."\n";
sub replace_text {
my $html_element = shift;
for my $el ( $html_element->content_refs_list ){
if ( ref( $$el ) ){
replace_text( $$el );
next;
}
$$el =~ s/classname/NEW/g;
}
return $html_element;
}

In Tritium, how do I set a CSS class as a variable?

I'm using the Moovweb SDK, and writing Tritium to modify my HTML.
How do I save a CSS class as a variable?
I want to grab an existing class and apply it to other elements.
You can use Tritium's fetch function to get the value of the class attribute of the element you're looking for and store it in a variable.
So given the following html:
<html>
<head>
<title> Tritium Tester </title>
</head>
<body>
<div id="one" class="random"></div>
<div id="two"></div>
</body>
</html>
You could write the following Tritium:
html() {
$("/html/body") {
$class_name = fetch("./div[@id='one']/@class")
$("./div[@id='two']") {
add_class($class_name)
}
}
}
Here's a live example link: http://play.tritium.io/331dfa6d01a7dd52261a9eaf812bdc5c7fb8c293