HTML::TreeBuilder::XPath: identifying the XPath expression - perl

Hello, good evening, dear coders on Stack Overflow!
Finally back again!
I am currently working on a parser script. I have to parse all the detail pages of this site: [link text][1] Note: it is a very big and powerful Swiss site, a governmental server.
There are several ways to do it. I have to strip away a lot of markup and keep only the text data from the page. The page is very simple; take this example page:
Altes Schulhaus Ossingen
Guntibachstrasse 10
8475 Ossingen
sekretariat.psossingen#bluewin.ch
Tel:052 317 15 45
Fax:052 317 04 42
So I need a little Perl script to get these [B]six lines[/B] of text out of the HTML page. How do we do that? Personally I like HTML::TreeBuilder::XPath, which has to be installed from CPAN.
Here is how we would then extract the name from one of the files with it:
[B]Note:[/B] I am not sure about the arguments that I have to use! See my attempts below:
use strict;
use HTML::TreeBuilder::XPath;
my $tree = HTML::TreeBuilder::XPath->new;
#use real file name here
open(my $fh, "<", "file.html") or die $!;
$tree->parse_file($fh);
my ($name) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
print $name->as_text;
Note: you can see that I have some problems with the arguments.
As we can see, we simply use an XPath expression to identify the node we want.
[B]So how do we determine that?[/B] I tried a Firefox plugin called XPather, which lets us click on an HTML element and extract the corresponding XPath. So we load the file we want to parse in Firefox, click on the element we want, get the XPath, and use that in the Perl script.
I am not sure that I did the job with XPather very well. I tried to find the arguments for the detail page of a result, taken from the Swiss site linked above.
See [B]my attempts[/B] below: these are the expressions I found with XPather. Are they really the arguments that will help me parse the detail page?
/html/body/div[3]/text()
/html/body/div[4]/text()
/html/body/div[6]/text()
/html/body/div[7]/text()
/html/body/div[9]/a/text()
/html/body/div[10]/text()
/html/body/div[11]/text()[1]
/html/body/div[11]/text()[2]
/html/body/div[12]/text()[1]
/html/body/div[12]/text()[2]
/html/body/div[13]/text()
Here is the HTML code of the page:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd"><html><head><meta name="generator" content="DigiOnline GmbH - WebWeaver 3.4 CMS - http://www.webweaver.de"><title>educa.ch</title><meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"><link rel="stylesheet" href="101.htm"><script src="102.htm"></script><script language="JavaScript"><!--
var did='d79376';
var root=new Array('d200','d205','d73137','d1566','d79376','d');
var usefocus = 1;
function check() {
if ((self.focus) && (usefocus)) {
self.focus();
}
}
// --></script></head><body bgcolor="#FFFFFF" leftmargin="0" topmargin="0" marginwidth="0" marginheight="0" onload="check();"><table cellspacing="0" cellpadding="0" border="0" width="100%"><tr><td width="15" class="popuphead"><img src="/0.gif" alt="" width="15" height="16"></td><td width="99%" class="popuphead">Adresse - Schulen in der Schweiz</td><td width="20" class="popuphead" valign="middle"><img src="../pics/print16x13.gif" alt="Drucken" width="16" height="13"></td><td width="20" class="popuphead" valign="middle"><img src="../pics/close21x13.gif" alt="Schliessen" width="21" height="13"></td></tr>
<tr bgcolor="#B2B2B2"><td colspan="4"><img src="/0.gif" alt="" width="1" height="1"></td></tr></table><div class="leerzeile"> </div><div class="leerzeile"><img src="/0.gif" alt="" width="15"height="8">Altes Schulhaus Ossingen </div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">Guntibachstrasse 10</div><div><img src="/0.gif" alt="" width="15" height="8"></div><div><img src="/0.gif" alt="" width="15" height="8">8475 Ossingen</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8"></div><div><img src="/0.gif" alt="" width="15" height="8">sekretariat.psossingen#bluewin.ch</div><div class="leerzeile"> </div><div><img src="/0.gif" alt="" width="15" height="8">Tel:<img src="/0.gif" alt="" width="6" height="8">052 317 15 45 </div><div><img src="/0.gif" alt="" width="15" height="8">Fax:<img src="/0.gif" alt="" width="4" height="8">052 317 04 42 </div><div> </div></body></html>
So I hope someone would like to review my little Perl script and help me find the right arguments for it!
I would love to hear from you.
[B]BTW[/B]: the other tasks are important as well.
How should I fetch the pages: with LWP, Mechanize, or something like that?
How do I store the data in a MySQL database?
Greetings

Related

Perl - geturls with WWW::Mechanize

I am trying to submit a form on http://bioinfo.noble.org/TrSSP/ and want to extract the result.
My query data looks like this
>ATCG00270
MTIALGKFTKDEKDLFDIMDDWLRRDRFVFVGWSGLLLFPCAYFALGGWFTGTTFVTSWYTHGLASSYLEGCNFLTAA VSTPANSLAHSLLLLWGPEAQGDFTRWCQLGGLWAFVALHGAFALIGFMLRQFELARSVQLRPYNAIAFSGPIAVFVSVFLIYPLGQSGWFFAPSFGVAAIFRFILFFQGFHNWTLNPFHMMGVAGVLGAALLCAIHGATVENTLFEDGDGANTFRAFNPTQAEETYSMVTANRFWSQIFGVAFSNKRWLHFFMLFVPVTGLWMSALGVVGLALNLRAYDFVSQEIRAAEDPEFETFYTKNILLNEGIRAWMAAQDQPHENLIFPEEVLPRGNAL
My script looks like this
use strict;
use warnings;
use File::Slurp;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
my $sequence = $ARGV[0];
$mech->get( 'http://bioinfo.noble.org/TrSSP' );
$mech->submit_form( fields => { 'query_file' => $sequence, }, );
print $mech->content;
#sleep (10);
open( OUT, ">out.txt" );
my @a = $mech->find_all_links();
print OUT "\n", $a[$_]->url for ( 0 .. $#a );
print $mech->content gives a result like this
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>The job is running, please wait...</title>
<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<link rel="stylesheet" href="interface/style.css" type="text/css">
</head>
<body>
<table width="90%" align="center" border="0" cellpadding="0" cellspacing="0" class="table1">
<tr align="center">
<td width="50"> </td>
<td></td>
<td> </td>
</tr>
<tr align="left" height="30" valign="middle">
<td width="30"> </td>
<td bgColor="#CCCCFF"> Your sequences have been submitted to backend pipeline, please wait for result:</td>
<td width="30"> </td>
</tr>
<tr align="left">
<td> </td>
<td>
<br><br><font color="#0000FF"><strong>
</strong></font>
<BR><BR><BR><BR><BR><BR><br><br><BR><br><br><hr>
If you don't want to wait online, please copy and keep the following link to retrieve your result later:<br>
<strong>http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763</strong>
<script language="JavaScript" type="text/JavaScript">
function doit()
{
window.location.href="/TrSSP/?sessionid=1492435151653763";
}
setTimeout("doit()",9000);
</script>
</td>
<td> </td>
</tr>
</table>
</body>
</html>
I want to extract this link
http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763
and download the result when the job is completed. But find_all_links() is recognizing /TrSSP/?sessionid=1492434554474809 as a link.
We don't know how long this backend process is going to take. If it's minutes, you could have your program wait. Even if it's hours, waiting is reasonable.
In a browser, the page is going to refresh on its own. There are two auto-refresh mechanisms implemented in the response you are showing.
<script language="JavaScript" type="text/JavaScript">
function doit()
{
window.location.href="/TrSSP/?sessionid=1492435151653763";
}
setTimeout("doit()",9000);
</script>
The JavaScript setTimeout takes its delay in milliseconds, so this redirect fires after 9 seconds.
There is also a meta tag that tells the browser to auto-refresh:
<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">
Here, the 4 in the content means 4 seconds. So this would be done earlier.
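To make the meta tag concrete, here is a minimal sketch that pulls the delay and the target out of it with a pattern match. The helper name parse_meta_refresh is made up for this illustration, and the pattern assumes the attribute order and quoting shown in the response above; it is not a general HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the delay and target out of a meta refresh tag.
# Returns an empty list when the page carries no such tag.
sub parse_meta_refresh {
    my ($html) = @_;
    return
        unless $html =~ m{<meta\s+http-equiv="refresh"\s+content="(\d+);\s*url=([^"]+)"}i;
    return ( $1, $2 );
}

my $html = '<meta http-equiv="refresh" content="4;url=/TrSSP/?sessionid=1492435151653763">';
my ( $delay, $target ) = parse_meta_refresh($html);
print "reload $target after $delay seconds\n";
```

Note that the captured target is relative to the site root, which is the same form find_all_links() reports. As far as I know, WWW::Mechanize::Link objects also offer url_abs if you need the absolute form.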
Of course we also don't know how long they keep the session around. It might be a safe approach to reload that page every ten seconds (or more often, if you want).
You can do that by building a simple while loop and checking if the refresh is still in the response.
# do the initial submit here
...
# assign this by grabbing it from the page
# in this case, regex on HTML is fine
my ($url) = $mech->content =~ m{<strong>(\Qhttp://bioinfo.noble.org/TrSSP/?sessionid=\E\d+)</strong>}
    or die "No result URL found in the response\n";
print "Waiting for $url\n";
while (1) {
    $mech->get($url);
    last unless $mech->content =~ m/refresh/;    # no refresh left: the job is done
    sleep 10;                                    # or whatever number of seconds
}
# process the final response ...
We first submit the data. We then extract the URL that you're supposed to call until they are done processing. Since this is a pretty straightforward document, we can safely use a pattern match. The URL is always the same, and it's clearly marked with the <strong> tag. In general it's not a good idea to use regex to parse HTML, but we're not really parsing, we are just screen-scraping a single value. The \Q and \E do the same as quotemeta and make sure that we don't have to escape the . and ?, which is easier to read than having a bunch of backslashes \ in the pattern.
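To show what \Q and \E buy you, here is a small sketch with three spellings of the same pattern: one with \Q...\E, one with an explicit quotemeta call, and one escaped by hand. The variable names are only for illustration; the URL and the sample line are taken from the response above.

```perl
use strict;
use warnings;

my $prefix = 'http://bioinfo.noble.org/TrSSP/?sessionid=';

# Three spellings of the same pattern.
my $with_q    = qr{<strong>(\Q$prefix\E\d+)</strong>};
my $with_func = do { my $p = quotemeta $prefix; qr{<strong>($p\d+)</strong>} };
my $by_hand   = qr{<strong>(http://bioinfo\.noble\.org/TrSSP/\?sessionid=\d+)</strong>};

my $page = '<strong>http://bioinfo.noble.org/TrSSP/?sessionid=1492435151653763</strong>';
for my $re ( $with_q, $with_func, $by_hand ) {
    my ($url) = $page =~ $re;
    print "$url\n";
}
```

All three capture the same URL. Without any escaping, the ? would make the preceding slash optional instead of matching a literal question mark, and the pattern would not match this page.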
The script will sleep for ten seconds after every attempt before trying again. Once the response no longer contains the refresh, it breaks out of the endless loop, so you can put the processing of the actual response that has the data you wanted behind that loop.
It might make sense to add some output into the loop so you can see that it's still running.
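A minimal sketch of such a loop with progress output follows. The helper poll_until_done, its arguments, and the cap on attempts are my own additions for illustration; the fetch is passed in as a code reference so the sketch runs without a network, and in the real script it would wrap $mech->get($url) and $mech->content.

```perl
use strict;
use warnings;

# Hypothetical helper: call $fetch until the page no longer mentions "refresh",
# printing a progress line per attempt and giving up after $max_tries.
sub poll_until_done {
    my ( $fetch, $delay, $max_tries ) = @_;
    for my $try ( 1 .. $max_tries ) {
        my $content = $fetch->();
        return $content unless $content =~ m/refresh/;
        print STDERR "attempt $try: job still running, sleeping $delay s\n";
        sleep $delay;
    }
    die "gave up after $max_tries attempts\n";
}

# Stand-in for the real fetch, which would be:
#   sub { $mech->get($url); $mech->content }
my @fake_pages = (
    '<meta http-equiv="refresh">',
    '<meta http-equiv="refresh">',
    '<html>result</html>',
);
my $result = poll_until_done( sub { shift @fake_pages }, 0, 10 );
print "$result\n";
```

The cap is optional. If the job may legitimately run for hours, pick a generous limit or drop it and keep the endless loop.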
Note that this needs to really keep running until it's done. Don't stop the process.

Paysafe POSTMAN 404 for cardpayments

I am unable to run a proper transaction. I am not coding yet; I wanted to test it with POSTMAN first, but I seem to be running into problems.
I am using the following endpoint
https://api.test.paysafe.com/cardpayments/v1/accounts/89994061
Sample Request (Code)
{
  "merchantRefNum": "ORDER_ID:12312",
  "amount": 10098,
  "settleWithAuth": true,
  "card": {
    "cardNum": "4111111111111111",
    "cardExpiry": {
      "month": 2,
      "year": 2017
    },
    "cvv": 111
  },
  "billingDetails": {
    "zip": "M5H 2N2"
  }
}
Sample Response (Code)
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Draft//EN">
<HTML>
<HEAD>
<TITLE>Error 404--Not Found</TITLE>
</HEAD>
<BODY bgcolor="white">
<FONT FACE=Helvetica>
<BR CLEAR=all>
<TABLE border=0 cellspacing=5>
<TR>
<TD>
<BR CLEAR=all>
<FONT FACE="Helvetica" COLOR="black" SIZE="3">
<H2>Error 404--Not Found</H2>
</FONT>
</TD>
</TR>
</TABLE>
<TABLE border=0 width=100% cellpadding=10>
<TR>
<TD VALIGN=top WIDTH=100% BGCOLOR=white>
<FONT FACE="Courier New">
<FONT FACE="Helvetica" SIZE="3">
<H3>From RFC 2068
<i>Hypertext Transfer Protocol -- HTTP/1.1</i>:
</H3>
</FONT>
<FONT FACE="Helvetica" SIZE="3">
<H4>10.4.5 404 Not Found</H4>
</FONT>
<P>
<FONT FACE="Courier New">The server has not found anything matching the Request-URI. No indication is given of whether the condition is temporary or permanent.
</p>
<p>If the server does not wish to make this information available to the client, the status code 403 (Forbidden) can be used instead. The 410 (Gone) status code SHOULD be used if the server knows, through some internally configurable mechanism, that an old resource is permanently unavailable and has no forwarding address.
</FONT>
</P>
</FONT>
</TD>
</TR>
</TABLE>
</BODY>
</HTML>
Seems the endpoint is incorrect. I received this from your developer center. What is incorrect?
It would seem that your endpoint is incomplete. What you have is the base URL of the Card Payments API for your account. To process an actual payment, you need to append the resource you want to call to the end of it. See the examples below.
Authorizations
https://api.test.netbanx.com/cardpayments/v1/accounts/89994061/auths
Settlements
https://api.test.netbanx.com/cardpayments/v1/accounts/89994061/settlements
There are many more but you can find them on the Paysafe Developer Center.
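As a rough sketch of how the pieces fit together, the following builds the full URL from the account URL plus a resource name and prints it with part of the sample request body. The helper endpoint_for is made up for this illustration, nothing is actually sent, and authentication is left out entirely; check the Developer Center for the headers a real call needs.

```perl
use strict;
use warnings;
use JSON::PP ();

my $base    = 'https://api.test.netbanx.com/cardpayments/v1';
my $account = '89994061';

# Hypothetical helper: append the resource name to the account URL.
sub endpoint_for {
    my ($resource) = @_;
    return "$base/accounts/$account/$resource";
}

# Part of the sample request from the question, encoded with sorted keys.
my $body = JSON::PP->new->canonical->encode(
    {
        merchantRefNum => 'ORDER_ID:12312',
        amount         => 10098,
        settleWithAuth => JSON::PP::true,
    }
);

# Only print what would be posted; no request is made here.
print 'POST ', endpoint_for('auths'), "\n$body\n";
```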

Write on unblessed reference in perl

Why am I getting an error saying "Can't call method 'write' on unblessed reference at /usr/..."?
When I run this code on my Ubuntu machine there is no error at all, but when I run it on my Gentoo machine, this error pops up. I don't think the OS is the problem here, but what is it?
Here is my code :
#!/usr/bin/perl
#index.cgi
require 'foobar-lib.pl';
ui_print_header(undef, $module_info{'desc'}, "", undef, 1, 1);
ui_print_footer("/", $text{'index'});
use CGI;
use Config::Tiny;
use Data::Dumper;
use CGI::Carp qw(fatalsToBrowser);
#location/directory of configuration file
my $file = "/home/admin_config.conf";
my $Config = Config::Tiny->read($file);
#reads the section, key and the value of the configuration file.
my $status_in_file = $Config->{"offline_online_status"}->{"offline_online_status.offline_online_state"};
print "Content-type:text/html\n\n";
print qq~<html>
<link rel="stylesheet" type="text/css" href="style4.css">
<body>
<div id="content">
<div id="bar">
<span><p>Controller Settings</p></span>
</div>
<div id="tab-container">
<ul>
<li><span>Offline / Online State</span></li>
</ul>
</div>
<div id="main-container">
<table border="0" width="100%" height="80%">
<tr>
<td align="left" width="20%">
<div id="title"><span>Offline/Online Status :</span></div>
</td>
<td width="25%">
<table border="0" style=\"text-align:right;font-family:Arial, Helvetica, sans-serif;\" cellpadding="5">
<tr>
<td width="30%"><div id="data">Offline Online State:</div></td>
</tr>
<tr>
<td width="30%"><div id="data">Data Mode:</div></td>
</tr>
</table>
</td>
<td align="left" width="20%">
<table border="1" style=\"text-align:center;font-family:Arial, Helvetica, sans-serif;\" cellpadding="5">
<tr>
<td width="20%"><div id="data">$status_in_file</div></td>
</tr>
</table>
</td>
<td width="50%"></td>
</tr>
<tr>
<td colspan="4">
<div id="description"><p><b>Description :</b></p>
<p>This <i>indication</i> is sent ..</p>
</div>
</td>
</tr>
</table>
</div>
</div>
</body>
</html>
~;
Can anybody please help me?
Here is my foobar-lib.pl
=head1 foobar-lib.pl
foreign_require("foobar", "foobar-lib.pl");
@sites = foobar::list_foobar_websites()
=cut
BEGIN { push(@INC, ".."); };
use WebminCore;
init_config();
=head2 get_foobar_config()
=cut
sub get_foobar_config
{
    my $lref = &read_file_lines($config{'foobar_conf'});
    my @rv;
    my $lnum = 0;
    foreach my $line (@$lref) {
        my ($n, $v) = split(/\s+/, $line, 2);
        if ($n) {
            push(@rv, { 'name' => $n, 'value' => $v, 'line' => $lnum });
        }
        $lnum++;
    }
    return @rv;
}
I don't really understand this foobar-lib.pl either. Maybe it is what causes my problem when I run the code?
The code you've shown doesn't attempt to call a method called write on anything at all, let alone on an unblessed reference. So I assume the method call happens in some code you haven't shown. Perhaps in foobar-lib.pl?
Because I can't see the code causing the error, I can only hazard a guess based on the clue that the method is called write.
In Perl, it's kind of ambiguous whether filehandles are classed as "objects" (and can thus have methods called on them) or as unblessed references (and thus can't). The situation changed in Perl 5.12, and again in Perl 5.14. So if you've got different versions of Perl installed on each machine, then you might observe different behaviours when trying to do:
$fh->write($data, $length)
The Perl 5.14+ behaviour is probably what you want (as it's the most awesome), and luckily you can achieve that same behaviour on earlier versions of Perl by pre-loading a couple of modules. Add the following two lines to the top of your script:
use IO::Handle ();
use IO::File ();
Problem solved... perhaps???
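Here is a minimal sketch of that method call on a lexical filehandle with both modules loaded up front. It writes to an in-memory buffer, which is my own choice so that the example leaves no files behind; whether this matches the code that actually throws the error is a guess, since that code has not been shown.

```perl
use strict;
use warnings;
use IO::Handle ();
use IO::File   ();

# Write to an in-memory buffer so the sketch leaves no files behind.
open( my $fh, '>', \my $buffer ) or die "Cannot open in-memory handle: $!";

# IO::Handle's write takes the data and a length, much like the call above.
my $data = "hello world\n";
$fh->write( $data, 5 );    # writes only the first five characters
$fh->print("\n");
close($fh) or die "Cannot close handle: $!";

print $buffer;
```

With the modules loaded explicitly, the method calls resolve the same way regardless of which of the Perl versions mentioned above is installed.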
It could be because the modules are installed in different locations on the two servers. Run perl -V on both servers and check that the modules are located in identical locations.
Also check what you are passing to that method.
Also check the permissions: does your program have write access?

Parsing HTML String into iOS

I am successfully parsing HTML tags in iOS using TFHpple, but I have run into a small problem.
If my HTML is
<div align="center">
<b>
<a href="/?PageName='TeacherPage'&StaffID=194121">
<span class="sectionheader">
Jessica
Cortes
</span>
</a></b><BR>
<span class="subheader">Migrant Education</span>
<BR>
<img src="/images/Phone.gif" width="22" height="23">
912-367-8630
<img src="images/EmailIconSmall.gif" width="16" height="16" style="vertical-align:bottom" />
<a onclick="openme('z','/Common/Email/Email.asp?UserID=194121&SchoolID=786',417,320);return false;" href="#">Email</a>
<BR><BR>
View All Teachers
<BR><BR>
<table cellpadding="4" cellspacing="4" class="subnavtd">
I am parsing it in iOS using, for example: NSString *tutorialsXpathQueryString = @"//div/span[@class='subheader']";
Now, in one of the HTML pages, there is no tag around one value; it is just a number like 912-367-8630. What XPath should I put in NSString *tutorialsXpathQueryString to select it? The number is inside the tags given above.
Are you able to reform the HTML output and wrap that phone number in a tag that you can target? If not, you will probably have to grab the inner text value of a parent div and regex match for a phone number pattern in the string.
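The regex route could look roughly like this. The sketch is in Perl only to show the pattern; the inner text is typed in by hand from the HTML above rather than produced by TFHpple, and the same pattern should carry over to NSRegularExpression, though I have not verified that here.

```perl
use strict;
use warnings;

# Inner text of the parent div, roughly as a parser would hand it back.
my $inner_text = "Jessica Cortes Migrant Education 912-367-8630 Email View All Teachers";

# Three digits, three digits, four digits, separated by hyphens.
my ($phone) = $inner_text =~ /\b(\d{3}-\d{3}-\d{4})\b/;
print defined $phone ? "$phone\n" : "no phone number found\n";
```

If the pages use other phone formats (parentheses, dots, spaces), the pattern needs to be loosened accordingly.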

Form Hack / XSS / SQL Injection

I got a big problem with a Botnet...I think it is a botnet...
What happens?
The bot fills out the form and spams the database.
Here is the form:
<form method="POST" action="">
<textarea name="text2" style="width: 290px; margin-bottom: 10px;"></textarea>
<center>
<img id="captcha" alt="Captcha" src="http://www.mysite.de/php/captcha/Captcha_show.php?sid='2d7dd1256d06a724c34b9d703f3733e9">
<br>
<a onclick="document.getElementById('captcha').src = 'php/captcha/Captcha_show.php?' + Math.random(); return false" href="#">
<br>
<input id="mod" class="inputbox" type="text" alt="Bitte die Zeichen des Bildes eingeben." style="width: 280px" maxlength="15" name="captcha_code" value="">
<sub>Bitte die Zeichen des Bildes abschreiben</sub>
<br>
<br>
<input class="button" type="submit" value="Hinzufügen" name="submit">
</center>
</form>
Here is an array with words that can't be inserted:
$badWords = array("/delete/i","/deleted/i","/deletee/i", "/update/i", "/updateu/i", "/updateup/i","/union/i","/unionu/i","/unionun/i", "/insert/i","/inserti/i","/insertin/i","/drop/i","/dropd/i","/dropdr/i","/http/i","/httph/i","/httpht/i","/--/i", "/url/i", "/urlu/i", "/urlur/i", "/true/i", "/truet/i", "/truetr/i", "/false/i", "/falsef/i", "/falsefa/i","/!=/i","/==/i", "/insurance/i", "/eating/i", "/viagra/i");
$text3 = preg_replace($badWords, "a12", $text2);
if($text3 != $text2){
echo "<center><b>No valid data!</b></center> <meta http-equiv=\"refresh\" content=\"2; URL=http://www.mysite.de\">";
exit;
}
So normally the user should not be able to post any text with e.g. "viagra" in it.
I can't understand how someone or a bot could insert a text containing some of these bad words.
I am using PDO and functions like htmlspecialchars(), stripslashes() and strip_tags() to prevent the hack...
Any ideas?
Your script can be bypassed with HTML entities.
Example:
The input is displayed as "Hello", but in the code it is &#72;&#101;&#108;&#108;&#111;.
If you now run preg_match, it will not find anything:
var_dump(preg_match('/Hello/i','&#72;&#101;&#108;&#108;&#111;'));
// returns int 0
If you want to prevent SQL injection, use prepared statements.
If you do not want to be spammed, you also have to look for another way, since I could simply insert a valid string many times.
Note: I think you can prevent my hack by using html_entity_decode:
var_dump(preg_match('/Hello/i',html_entity_decode('&#72;&#101;&#108;&#108;&#111;')));
// returns int 1
// returns int 1