Why can't Perl's WWW::Mechanize find the form by field names? - perl

#!/usr/bin/perl
use WWW::Mechanize;
use Compress::Zlib;
my $mech = WWW::Mechanize->new();
my $username = ""; #fill in username here
my $keyword = ""; #fill in password here
my $mobile = $ARGV[0];
my $text = $ARGV[1];
$deb = 1;
print length($text)."\n" if($deb);
$text = $text."\n\n\n\n\n" if(length($text) < 135);
$mech->get("http://wwwl.way2sms.com/content/index.html");
unless($mech->success())
{
exit;
}
$dest = $mech->response->content;
print "Fetching...\n" if($deb);
if($mech->response->header("Content-Encoding") eq "gzip")
{
$dest = Compress::Zlib::memGunzip($dest);
$mech->update_html($dest);
}
$dest =~ s/<form name="loginForm"/<form action='..\/auth.cl' name="loginForm"/g;
$mech->update_html($dest);
$mech->form_with_fields(("username","password"));
$mech->field("username",$username);
$mech->field("password",$keyword);
print "Loggin...\n" if($deb);
$mech->submit_form();
$dest= $mech->response->content;
if($mech->response->header("Content-Encoding") eq "gzip")
{
$dest = Compress::Zlib::memGunzip($dest);
$mech->update_html($dest);
}
$mech->get("http://wwwl.way2sms.com//jsp/InstantSMS.jsp?val=0");
$dest= $mech->response->content;
if($mech->response->header("Content-Encoding") eq "gzip")
{
$dest = Compress::Zlib::memGunzip($dest);
$mech->update_html($dest);
}
print "Sending ... \n" if($deb);
$mech->form_with_fields(("MobNo","textArea"));
$mech->field("MobNo",$mobile);
$mech->field("textArea",$text);
$mech->submit_form();
if($mech->success())
{
print "Done \n" if($deb);
}
else
{
print "Failed \n" if($deb);
exit;
}
$dest = $mech->response->content;
if($mech->response->header("Content-Encoding") eq "gzip")
{
$dest = Compress::Zlib::memGunzip($dest);
#print $dest if($deb);
}
if($dest =~ m/successfully/sig)
{
print "Message sent successfully" if($deb);
}
exit;
When run this code halts with an error saying:
There is no form with the requested fields at ./sms.pl line 65
Can't call method "value" on an undefined value at /usr/share/perl5/vendor_perl/WWW/Mechanize.pm line 1348.

I would guess that there is no form with the fields "MobNo" & "textArea" in http://wwwl.way2sms.com//jsp/InstantSMS.jsp?val=0 which indeed there aren't since the page at that URL lacks even a <body> tag.

$dest =~ s/<form name="loginForm"/<form action='..\/auth.cl' name="loginForm"/g;
find the above line in the script and replace it with the following
$dest =~ s/<form name="loginForm"/<form action='..\/Login1.action' name="loginForm"/ig;
This is required since recently way2sms has restructured its home page and hence the auth.cl form was renamed to Login1.action

When I run into these sorts of problems, I print the entire HTML page so I can look at it. The form you are expecting probably isn't there. I suspect that you aren't getting the page that you think you are.
The first page does quite a bit of JavaScript processing to submit the form. Since WWW::Mechanize doesn't handle any of those bits for you, I'm guessing that your first form submission is somehow incomplete or invalid, so the next page you get is some sort of error page. That's a quite common problem with dynamic websites.
You can also compare what Mech does with what a JavaScript-enabled browser does. Use some sort of HTTP sniffing tool to watch the transactions. Is the interactive browser doing something extra that Mech isn't?

Related

HTML parsing with HTML::TokeParser::Simple

I am parsing an HTML file with HTML::TokeParser::Simple. The HTML file has the content shown far below. My problem is, I am trying to ignore the JavaScript from showing up as text content. Example:
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( 'test.html' );
while ( my $token = $p->get_token ) {
next unless $token->is_text;
print $token->as_is, "\n";
}
This prints the output as seen below:
Test HTML
<!--
var form_submitted = 0;
function submit_form() {
[..]
}
//-->
The actual HTML Document Content:
<html>
<span>Test HTML</span>
<script type="text/javascript">
<!--
var form_submitted = 0;
function submit_form() {
[..]
}
//-->
</script>
</html>
How do I ignore the JavaScript tag contents from showing.
I get the desired result. Comments are (correctly) not considered text by the version I have. Looks like you need to upgrade the modules you are using. (I used HTML::Parser 3.69 and HTML::TokeParser::Simple 3.15.)
>perl a.pl
Test HTML
>
You'll still have to process HTML entities and format the text usefully, the latter being quite difficult since you removed all formatting instruction. Your approach seems fatally flawed.
I believe you only need to use the as_text method.
my $tree = HTML::TreeBuilder->new();
$tree->parse( $html );
$tree->eof();
$tree->elementify(); # just for safety
my $text = $tree->as_text();
$tree->delete;
I adapted this from the WWW::Mechanize module (http://search.cpan.org/dist/WWW-Mechanize/) which has tons of convenience methods that can help you. It basically acts as a web browser in an object.
Scan through the token to ignore all open and close script tags. See below as used to resolved the issue.
my $ignore=0;
while ( my $token = $p->get_token ) {
if ( $token->is_start_tag('script') ) {
print $token->as_is, "\n";
$ignore = 1;
next;
}
if ( $token->is_end_tag('script') ) {
$ignore = 0;
print $token->as_is, "\n";
next;
}
if ($ignore) {
#Everything inside the script tag. Here you can ignore or print as is
print $token->as_is, "\n";
}
else
{
#Everything excluding scripts falls here handle as appropriate
next unless $token->is_text;
print $token->as_is, "\n";
}
}

ReCaptcha Implementation in Perl

To implement recaptcha in my website.
One option is google API . But for that i need to signup with domain name to get API key.
Is there any other way we can do it ?
You don't necessarily need a domain name to sign up, per se.
They have a concept of a "global key" where one single domain key would be used on several domains. When signing up, select the "Enable this key on all domains (global key)" option, and use a unique identifier (domainkey.abhilasha.com) and this will be fine, you can use the key from any domain in the end.
One way: add this code to your perl file that is called by an html form:
Simplified of course
my #field_names=qw(name branch email g-recaptcha-response);
foreach $field_name (#field_names)
{
if (defined param("$field_name"))
{
$FIELD{$field_name} = param("$field_name");
}
}
$captcha=$FIELD{'g-recaptcha-response'};
use LWP::Simple;
$secretKey = "put your key here";
$ip = remote_host;
#Remove # rem to test submitted variables are present
#print "secret= $secretKey";
#print " and response= $captcha";
#print " and remoteip= $ip";
$URL = "https://www.google.com/recaptcha/api/siteverify?secret=".$secretKey."&response=".$captcha."&remoteip=".$ip;
$contents = get $URL or die;
# contents variable takes the form of: "success": true, "challenge_ts": "2016-11-21T16:02:41Z", "hostname": "www.mydomain.org.uk"
use Data::Dumper qw(Dumper);
# Split contents variable by comma:
my ($success, $challenge_time, $hostname) = split /,/, $contents;
# Split success variable by colon:
my ($success_title, $success_value) = split /:/, $success;
#strip whitespace:
$success_value =~ s/^\s+//;
if ($success_value eq "true")
{
print "it worked";
}else{
print "it did not";
}
If you are just trying to block spam, I prefer the honeypot captcha approach: http://haacked.com/archive/2007/09/10/honeypot-captcha.aspx
Put an input field on your form that should be left blank, then hide it with CSS (preferably in an external CSS file). A robot will find it and will put spam in it but humans wont see it.
In your form validation script, check the length of the field, if it contains any characters, do not process the form submission.

Email::MIME can't parse message from Gmail

So I'm using PERL and Email::MIME to get an email from gmail. Here is my code:
use Net::IMAP::Simple::Gmail;
use Email::Mime;
# Creat the object that will read the emails
$server = 'imap.gmail.com';
$imap = Net::IMAP::Simple::Gmail->new($server);
# User and password
$user = 'username#gmail.com';
$password = 'passowrd';
$imap->login($user => $password);
# Select the INBOX and returns the number of messages
$numberOfMessages = $imap->select('INBOX');
# Now let's go through the messages from the top
for ($i = 1; $i <= $numberOfMessages; $i++)
{
$top = $imap->top($i);
print "top = $top\n";
$email = Email::MIME->new( join '', #{ $imap->top($i) } );
$body = $email->body_str;
print "Body = $body\n";
}#end for i
When I run it, I get the following error:
can't get body as a string for multipart/related; boundary="----=_Part_6796768_17893472.1369009276778"; type="text/html" at /Library/Perl/5.8.8/Email/Mime.pm line 341
Email::MIME::body_str('Email::MIME=HASH(0x87afb4)') called at readPhoneEmailFeed.pl line 37
If I replace
$body = $email->body_str;
with
$body = $email->body;
I get the output:
Body =
(i.e. empty string)
What's going on here? is there a way for me to get the raw body of the message (->body_raw doesn't work either)? I'm okay with parsing out the body using regex
Email::MIME is not the best documented package I have ever seen.
The body and body_str methods only work on a single mime part. Mostly that would be a simple text message. For anything more complex use the parts method to get each mime component which is itself an Email::MIME object. The body and body_str methods should work on that. An html formatted message will generally have two MIME parts: text/plain and text/html.
This isn't exactly what you want but should be enough to show you what is going on.
my #parts = $email->parts;
for my $part (#parts) {
print "type: ", $part->content_type, "\n";
print "body: ", $part->body, "\n";
}

Perl - How to get the email address from the FROM part of header?

I am trying to set up this script for my local bands newsletter.
Currently, someone sends an email with a request to be added, we manually add it to newsletter mailer I set up.
(Which works great thanks to help I found here!)
The intent now is to have my script below log into the email account I set up for the list on our server, grab the info to add the email automatically.
I know there are a bunch of apps that do this but, I want to learn myself.
I already have the "add to list" working when there is an email address returned from the header(from) below BUT, sometimes the header(from) is a name and not the email address (eg "persons name" is returned from persons name<email#address> but, not the <email#address>.)
Now, I am not set in stone on the below method but, it works famously... to a point.
I read all the docs on these modules and there was nothing I could find to get the darn email in there all the time.
Can someone help me here? Verbose examples are greatly appreciated since I am struggling learning Perl.
#!/usr/bin/perl -w
##########
use CGI;
use Net::IMAP::Simple;
use Email::Simple;
use IO::Socket::SSL; #optional i think if no ssl is needed
use strict;
use CGI::Carp qw(fatalsToBrowser warningsToBrowser);
######################################################
# fill in your details here
my $username = '#########';
my $password = '#############';
my $mailhost = '##############';
#######################################################
print CGI::header();
# Connect
my $imap = Net::IMAP::Simple->new($mailhost, port=> 143, use_ssl => 0, ) || die "Unable to connect to IMAP: $Net::IMAP::Simple::errstr\n";
# Log in
if ( !$imap->login( $username, $password ) ) {
print STDERR "Login failed: " . $imap->errstr . "\n";
exit(64);
}
# Look in the INBOX
my $nm = $imap->select('INBOX');
# How many messages are there?
my ($unseen, $recent, $num_messages) = $imap->status();
print "unseen: $unseen, <br />recent: $recent, <br />total: $num_messages<br />\n\n";
## Iterate through unseen messages
for ( my $i = 1 ; $i <= $nm ; $i++ ) {
if ( $imap->seen($i) ) {
my $es = Email::Simple->new( join '', #{ $imap->top($i) } );
printf( "[%03d] %s\n\t%s\n", $i, $es->header('From'), $es->header('Subject'));
print "<br />";
next;
}## in the long version these are pushed into different arrays for experimenting purposes
else {
my $es = Email::Simple->new( join '', #{ $imap->top($i) } );
printf( "[%03d] %s\n\t%s\n", $i, $es->header('From'), $es->header('Subject'));
print "<br />";
}
}
# Disconnect
$imap->quit;
exit;
use Email::Address;
my #addresses = Email::Address->parse('persons name <email#address>');
print $addresses[0]->address;
The parse method returns an array, so the above way works for me.
I'm making this a separate answer because even though this information is hidden in the comments of the accepted answer, it took me all day to figure that out.
First you need to get the From header using something like Email::Simple. THEN you need to extract the address portion with Email::Address.
use Email::Simple;
use Email::Address;
my $email = Email::Simple->new($input);
my $from = $email->header('From');
my #addrs = Email::Address->parse($from);
my $from_address = $addrs[0]->address; # finally, the naked From address.
Those 4 steps in that order.
The final step is made confusing by the fact that Email::Address uses some voodoo where if you print the parts that Email::Address->parse returns, they will look like simple strings, but they are actually objects. For example if you print the result of Email::Address->parse like so,
my #addrs = Email::Address->parse($from);
foreach my $addr (#addrs) { say $addr; }
You will get the complete address as output:
"Some Name" <address#example.com>
This was highly confusing when working on this. Granted, I caused the confusion by printing the results in the first place, but I do that out of habit when debugging.

How to post non-latin1 data to non-UTF8 site using perl?

I want to post russian text on a CP1251 site using LWP::UserAgent and get following results:
# $text="Русский текст"; obtained from command line
FIELD_NAME => $text # result: Г?в г'В?г'В?г'В?г?вєг?вёг?в? Г'В'Г?вчг?вєг'В?г'В'
$text=Encode::decode_utf8($text);
FIELD_NAME => $text # result: Р с?с?с?рєрёр? С'Рчрєс?с'
FIELD_NAME => Encode::encode("cp1251", $text) # result: Г?гіг+г+гЄгёгЏ ГІгҐгЄг+гІ
FIELD_NAME => URI::Escape::uri_escape_utf8($text) # result: D0%a0%d1%83%d1%81%d1%81%d0%ba%d0%b8%d0%b9%20%d1%82%d0%b5%d0%ba%d1%81%d1%82
How can I do this? Content-Type must be x-www-form-urlencoded. You can find similar form here, but there you can just escape any non-latin character using &#...; form, trying to escape it in FIELD_NAME results in 10561091108910891 10901077108210891 (every &, # and ; stripped out of the string) or 1056;усский текст (punctuation characters at the beginning of the string are stripped out) depending on what the FIELD_NAME actually is.
UPDATE: Anybody knows how to convert the following code so that it will use LWP::UserAgent::post function?
my $url=shift;
my $fields=shift;
my $request=HTTP::Request->new(POST => absURL($url));
$request->content_type('application/x-www-form-urlencoded');
$request->content_encoding("UTF-8");
$ua->prepare_request($request);
my $content="";
for my $k (keys %$fields) {
$content.="&" if($content ne "");
my $c=$fields->{$k};
eval {$c=Encode::decode_utf8($c)};
$c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
$content.="$k=".URI::Escape::uri_escape($c);
}
$request->content($content);
my $response=$ua->simple_request($request);
This code actually solves the problem, but I do not want to add the third request wrapper function (alongside with get and post).
One way around it appears to be (far from the best, I think) to use recode system command if you have it avialable. From http://const.deribin.com/files/SignChanger.pl.txt
my $boardEncoding="cp1251"; # encoding used by the board
$vals{'Post'} = `fortune $forunePath | recode utf8..$boardEncoding`;
$res = $ua->post($formURL,\%vals);
Another approach seems to be in http://mail2lj.nichego.net/lj.txt
my $formdata = $1 ;
my $hr = ljcomment_string2form($formdata) ;
my $req = new HTTP::Request('POST' => $ljcomment_action)
or die "new HTTP::Request(): $!\n" ;
$hr->{usertype} = 'user' ;
$hr->{encoding} = $mh->mime_attr('content-type.charset') ||
"cp1251" ;
$hr->{subject} = decode_mimewords($mh->get('Subject'));
$hr->{body} = $me->bodyhandle->as_string() ;
$req->content_type('application/x-www-form-urlencoded');
$req->content(href2string($hr)) ;
my $ljres = submit_request($req, "comment") ;
if ($ljres->{'success'} eq "OK") {
print STDERR "journal updated successfully\n" ;
} else {
print STDERR "error updating journal: $ljres->{errmsg}\n" ;
send_bounce($ljres->{errmsg}, $me, $mh->mime_attr("content-type.charset")) ;
}
Use WWW::Mechanize, it takes care of encoding (both character encoding and form encoding) automatically and does the right thing if a form element's accept-charset attribute is set appropriately. If it's missing, the form defaults to UTF-8 and thus needs correction. You seem to be in this situation. By the way, your example site's encoding is KOI8-R, not Windows-1251. Working example:
use utf8;
use WWW::Mechanize qw();
my $message = 'Русский текст';
my $mech = WWW::Mechanize->new(
cookie_jar => {},
agent => 'Mozilla/5.0 (X11; U; Linux x86_64; en-US) AppleWebKit/533.9 SUSE/6.0.401.0-2.1 (KHTML, like Gecko)',
);
$mech->get('http://zhurnal.lib.ru/cgi-bin/comment?COMMENT=/z/zyx/index_4-1');
$mech->current_form->accept_charset(scalar $mech->response->content_type_charset);
$mech->submit_form(with_fields => { TEXT => $message });
HTTP dump (essential parts only):
POST /cgi-bin/comment HTTP/1.1
Content-Length: 115
Content-Type: application/x-www-form-urlencoded
FILE=%2Fz%2Fzyx%2Findex_4-1&MSGID=&OPERATION=store_new&NAME=&EMAIL=&URL=&TEXT=%F2%D5%D3%D3%CB%C9%CA+%D4%C5%CB%D3%D
These functions solve the issue (first for posting application/x-www-form-urlencoded data and second for multipart/form-data):
#{{{2 postue
sub postue($$;$) {
my $url=shift;
my $fields=shift;
my $referer=shift;
if(defined $referer and $referer eq "" and defined $fields->{"DIR"}) {
$referer=absURL($url."?DIR=".$fields->{"DIR"}); }
else {
$referer=absURL($referer); }
my $request=HTTP::Request->new(POST => absURL($url));
$request->content_type('application/x-www-form-urlencoded');
$request->content_encoding("UTF-8");
$ua->prepare_request($request);
my $content="";
for my $k (keys %$fields) {
$content.="&" if($content ne "");
my $c=$fields->{$k};
if(not ref $c) {
$c=Encode::decode_utf8($c) unless Encode::is_utf8($c);
$c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
$c=URI::Escape::uri_escape($c);
}
elsif(ref $c eq "URI::URL") {
$c=$c->canonical();
$c=URI::Escape::uri_escape($c);
}
$content.="$k=$c";
}
$request->content($content);
$request->referer($referer) if(defined $referer);
my $i=0;
print STDERR "Doing POST request to url $url".
(($::o_verbose>2)?(" with fields:\n".
::YAML::dump($fields)):("\n"))
if($::o_verbose>1);
REQUEST:
my $response=$ua->simple_request($request);
$i++;
my $code=$response->code;
if($i<=$o_maxtries and 500<=$code and $code<600) {
print STDERR "Failed to request $url with code $code... retrying\n"
if($::o_verbose>2);
sleep $o_retryafter;
goto REQUEST;
}
return $response;
}
#{{{2 postfd
sub postfd($$;$) {
my $url=absURL(shift);
my $content=shift;
my $referer=shift;
$referer=absURL($referer) if(defined $referer);
my $i=0;
print STDERR "Doing POST request (form-data) to url $url".
(($::o_verbose>2)?(" with fields:\n".
::YAML::dump($content)):("\n"))
if($::o_verbose>1);
my $newcontent=[];
while(my ($f, $c)=splice #$content, 0, 2) {
if(not ref $c) {
$c=Encode::decode_utf8($c) unless Encode::is_utf8($c);
$c=Encode::encode("cp1251", $c, Encode::FB_HTMLCREF);
}
push #$newcontent, $f, $c;
}
POST:
my $response=$ua->post($url, $newcontent,
Content_type => "form-data",
((defined $referer)?(referer => $referer):()));
$i++;
my $code=$response->code;
if($i<=$o_maxtries and 500<=$code and $code<600) {
print STDERR "Failed to download $url with code $code... retrying\n"
if($::o_verbose>2);
sleep $o_retryafter;
goto POST;
}
return $response;
}