I'm new to Perl. In the past few days, I've made some simple scripts that save websites' source code to my computer via "get". They do what they're supposed to, but they fail to get the content of a website that is a forum. Non-forum websites work just fine.
Any idea what's going on? Here's the problem chunk:
my $url = 'http://www.computerforum.com/';
my $content = get $url || die "Unable to get content";
http://p3rl.org/LWP::Simple#get:
The get() function will fetch the document identified by the given URL and return it. It returns undef if it fails. […]
You will not be able to examine the response code or response headers (like 'Content-Type') when you are accessing the web using this function. If you need that information you should use the full OO interface (see LWP::UserAgent).
You really need better error reporting, so switch to the LWP::UserAgent library. I suspect the forum software blocks the default LWP user agent.
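A minimal sketch of that switch, assuming the block really is based on the User-Agent header (a guess, not something confirmed here). The helper names make_agent and fetch_page and the browser-like agent string are made up for illustration. Reporting status_line shows why a request failed instead of a bare undef.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: a user agent that identifies itself like a browser
# instead of the default "libwww-perl/x.y" string.
sub make_agent {
    return LWP::UserAgent->new(
        agent   => 'Mozilla/5.0 (compatible; my-fetcher)',    # assumed value
        timeout => 15,
    );
}

# Hypothetical helper: returns the page body, or dies with the HTTP status.
sub fetch_page {
    my ($ua, $url) = @_;
    my $response = $ua->get($url);
    die "Unable to get $url: " . $response->status_line . "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# Run as: perl fetch.pl http://www.computerforum.com/
if (@ARGV) {
    my $content = fetch_page(make_agent(), $ARGV[0]);
    print $content;
}
```

If the status line turns out to be something like a 403, the user-agent theory is plausible; any other status points elsewhere.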
I was using an open source tool called SimTT, which takes the URL of a table tennis league and calculates the probable results (e.g. ranking of teams and players). Unfortunately the league's web page moved to a different site.
I downloaded the source and repaired the parsing of the page, but currently I can only download the page manually and then read it from a file.
Below is an excerpt of my code to retrieve the page. It prints "Success", but the response is empty. I'm not very familiar with Perl or web techniques, but in Wireshark I could see that one of the last things sent was a new session key. I'm not sure whether the problem is related to cookies, SSL or something like that.
It would be very nice if someone could help me get access. I know there are some people out there who would like to use the tool.
So here's the code:
use LWP::UserAgent ();
use HTTP::Cookies;
my $ua = LWP::UserAgent->new(keep_alive=>1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});
my $request = new HTTP::Request('GET', 'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr');
my $response = $ua->request($request);
if ($response->is_success) {
    print "Success: ", $response->decoded_content;
}
else {
    die $response->status_line;
}
Either there is some rudimentary anti-bot protection at the server, or the server is misconfigured or otherwise broken. It looks like it expects an Accept-Encoding header in the request, which LWP does not send by default. The value of this header does not really seem to matter: the server will send the content compressed with gzip if the client claims to support it, but it will send uncompressed data if the client offers only a compression method unknown to the server.
With this knowledge one can change the code like this:
my $request = HTTP::Request->new('GET',
'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
[ 'Accept-Encoding' => 'foobar' ]
);
With this simple change the code currently works for me. Note that this might change at any time if the server setup changes, in which case other workarounds might be needed.
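For reference, a complete version of the script with this workaround might look like the sketch below. Rather than a dummy value, it advertises the encodings LWP can actually decode (via HTTP::Message::decodable), on the assumption, taken from the observation above, that the server only cares that the header is present. The helper name make_agent is made up for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Message;

# Hypothetical helper: same setup as in the question, plus a default
# Accept-Encoding header that is sent with every request.
sub make_agent {
    my $ua = LWP::UserAgent->new(keep_alive => 1);
    $ua->agent('Mozilla/5.0');
    $ua->cookie_jar({});
    # decoded_content later undoes whatever compression the server picks.
    $ua->default_header(
        'Accept-Encoding' => scalar HTTP::Message::decodable());
    return $ua;
}

# Run as: perl fetch.pl <url>
if (@ARGV) {
    my $response = make_agent()->get($ARGV[0]);
    die $response->status_line, "\n" unless $response->is_success;
    print "Success: ", $response->decoded_content;
}
```

Setting the header on the user agent rather than on a single request means every later request through the same object carries it too.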
I'm doing some development work that uses an embedded Linux for the OS and Boa for the web server. I have a web page that posts to a CGI script, handles the form data, and replies. My development environment was Ubuntu and everything worked fine, but when I ported my code over to the embedded Linux, the CGI module did not instantiate (or at least does not seem to instantiate). Here is a stripped down section of my code. The print statement complains about an uninitialized variable.
use CGI;
use strict;
use warnings;
my $cgiObj = CGI->new();
print $cgiObj->param('wlanPort');
Again, this works fine in my development environment, but fails in the embedded environment. The CGI.pm is installed and there are no errors generated on the CGI->new() command. I have also verified that the form data is being sent, but obviously can't guarantee that it is being received by the Perl script.
I have a feeling that it is a Boa configuration issue and that's what I'll be looking into next. I'm fairly new to Perl, so I'm not sure what else to do. Any ideas?
EDIT: Definitely not a Boa config issue. Still looking into it.
UPDATE:
I've simplified my code to the following:
#!/usr/bin/perl
use CGI qw(:standard);
$data = param('wlanPort') || '<i>(No Input)</i>';
print header;
print <<END;
<title>Echoing user input</title>
<p>wlanPort: $data</p>
END
As expected, it prints (No Input)
I should also point out that the form is enctype="multipart/form-data" because I have to have a file upload capability and I am using the "POST" method.
I used the HttpFox plugin to inspect the post data and checked on the wlanPort value:
-----------------------------132407047814270795471206851178
Content-Disposition: form-data; name="wlanPort"

eth1
So it is almost definitely being sent...
UPDATE 2: I installed the same versions of Perl and Boa used in the embedded system on my Ubuntu laptop. It works on the laptop but not on the device, which is the same result as before. I've told my employer that I've exhausted all possibilities other than the way Boa and (Micro) Perl are built on the device vs. in Ubuntu.
CGI is a beautifully simple protocol and while I cannot answer your question directly, I can suggest some techniques to help isolate the problem.
If your form is being submitted via POST, then the contents of that form will appear in the body of the HTTP request your script receives: as a URL-encoded string for the default enctype, or as MIME multipart sections for multipart/form-data. Without using the CGI module at all, you should be able to read the request from STDIN:
my $request = "";
while (<STDIN>) {
    $request .= $_;
}
if (open my $out, ">>/tmp/myapp.log") {
    print $out $request;
    close $out;
}
You can then examine /tmp/myapp.log to see if you are getting all the information from the request that you think you are.
For completeness, if your form submits via GET, then the arguments will be in the environment variable QUERY_STRING, which you can look at in Perl with $ENV{'QUERY_STRING'}.
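Both cases can be combined into one small diagnostic script. The sketch below is only an illustration: the helper name capture_cgi_request is made up, and it relies on nothing beyond the standard CGI environment variables (REQUEST_METHOD, CONTENT_TYPE, CONTENT_LENGTH, QUERY_STRING, GATEWAY_INTERFACE). It avoids CGI.pm entirely, so it shows what the web server actually hands to the script.

```perl
use strict;
use warnings;

# Hypothetical helper: gather the raw request exactly as the web server
# passed it to the script, without involving CGI.pm.
sub capture_cgi_request {
    my ($env, $in) = @_;
    my $method = defined $env->{REQUEST_METHOD} ? $env->{REQUEST_METHOD} : 'GET';
    my $length = 0;
    if (defined $env->{CONTENT_LENGTH} && $env->{CONTENT_LENGTH} =~ /^\d+$/) {
        $length = $env->{CONTENT_LENGTH};
    }
    my $body = '';
    if ($method eq 'POST') {
        binmode $in;
        # read() may return fewer bytes than requested, so loop until done.
        while (length($body) < $length) {
            my $got = read($in, $body, $length - length($body), length($body));
            last unless $got;    # EOF or read error
        }
    }
    return {
        method         => $method,
        content_type   => defined $env->{CONTENT_TYPE} ? $env->{CONTENT_TYPE} : '',
        content_length => $length,
        query_string   => defined $env->{QUERY_STRING} ? $env->{QUERY_STRING} : '',
        body           => $body,
    };
}

# Only runs under a web server, which sets GATEWAY_INTERFACE for CGI scripts.
if ($ENV{GATEWAY_INTERFACE}) {
    my $req = capture_cgi_request(\%ENV, \*STDIN);
    print "Content-Type: text/plain\r\n\r\n";
    print "$_: $req->{$_}\n" for sort keys %$req;
    printf "received %d of %d body bytes\n",
        length $req->{body}, $req->{content_length};
}
```

If the received byte count is lower than CONTENT_LENGTH, or CONTENT_TYPE is empty, the problem lies between the server and the script rather than in the form parsing.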
There should be no difference in the way CGI's object interface and functional interface parse the request. I am not familiar with Boa, but I can't imagine it breaking the basic CGI protocol.
A common problem is that you named the form parameter one thing and are looking for a different parameter name in the CGI script. That's always a laugh.
Good luck and hope this helps some.
I know this is a very old post and the OP may no longer be interested in new information, but the general question of how to debug CGI scripts is still relevant. I had similar issues with dev vs. prod environments, so I am posting my experience for those who stumble upon this thread. My simple answer is to use the Log::Log4perl and Data::Dumper modules to demystify this (assuming there is a way to access logs in your prod environment). With negligible overhead, you can turn on tracing when problems creep in (even if the code worked before and started failing due to changes over time). Log every relevant bit of information at the appropriate log level (trace, debug, info, warning, error, fatal) and configure which level is right for normal operations. Without these mechanisms, it is very difficult to get insight into production operations. Hope this helps.
use CGI;
use Log::Log4perl qw(:easy);
use Data::Dumper;
use strict;
use warnings;
# Log4perl must be initialized before anything is logged;
# easy_init sends messages at TRACE level and above to STDERR.
Log::Log4perl->easy_init($TRACE);
my $cgiObj = CGI->new();
my $log = Log::Log4perl::get_logger();
$log->trace("CGI Data: " . Dumper($cgiObj));
print $cgiObj->param('wlanPort');
I looked up articles about using LWP, but I am still lost. On this site we find a list of many schools; see the overview page, follow some of the links, and you get some result pages:
I want to fetch the pages using LWP::UserAgent and, for the parsing, use either HTML::TreeBuilder::XPath or HTML::TokeParser.
At the moment I am musing about choosing the right GET request.
I have some issues with LWP::UserAgent. The subpages of the overview can be reached via direct links, and each of them has content, e.g. the following URLs of the result pages mentioned above.
As a novice here I cannot post the full URLs, but here you can see how they end:
id=21&extern_eid=709
id=21&extern_eid=789
id=21&extern_eid=1297
id=21&extern_eid=761
There are many different URLs that differ only at the end. The question is: how do I run LWP::UserAgent? I want to fetch and parse all of the 1000 sites.
Question: does LWP do the job automatically, or do I have to set up LWP::UserAgent so that it looks up the different URLs automatically?
Solutions: perhaps we have to count up from zero to 10000 here:
extern_eid=709 (count from zero to 100000) here
www-db.sn.schule.de/index.php?id=21&extern_eid=709
BTW: Here the data for LWP User Agent;
REQUEST METHODS
The methods described in this section are used to dispatch requests via the user agent. The following request methods are provided:
$ua->get( $url )
$ua->get( $url , $field_name => $value, ... )
This method will dispatch a GET request on the given $url. Further arguments can be given to initialize the headers of the request. These are given as separate name/value pairs. The return value is a response object. See HTTP::Response for a description of the interface it provides. There will still be a response object returned when LWP can't connect to the server specified in the URL or when other failures in protocol handlers occur.
The question is: How to use LWP::UserAgent on the above mentioned site the right way - effectively!?
I look forward to any and all help!
If I understand your question correctly, you are trying to use LWP::UserAgent on same URLs with different query arguments, and you are wondering if LWP::UserAgent provides a way for you to loop through the query arguments?
I don't think LWP::UserAgent has a method for you to do that. However, you can have a loop constructing the URLs and use LWP::UserAgent repeatedly:
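use LWP::UserAgent;
my $ua  = LWP::UserAgent->new;
my $url = 'http://www-db.sn.schule.de/index.php';   # scheme assumed
for my $id (0 .. 100000)
{
    my $response = $ua->get($url . "?id=21&extern_eid=" . $id);
    # rest of the code
}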
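A slightly fuller sketch of such a loop is shown below, with some assumptions: the http:// scheme and the helper names build_url and fetch_range are made up for illustration, and the one-second pause is just a courtesy to the server, not something the site is known to require. URI's query_form takes care of building and escaping the query string.

```perl
use strict;
use warnings;
use URI;
use LWP::UserAgent;

# Base address taken from the question; the scheme is an assumption.
my $BASE = 'http://www-db.sn.schule.de/index.php';

# Hypothetical helper: build the result-page URL for one extern_eid.
sub build_url {
    my ($eid) = @_;
    my $uri = URI->new($BASE);
    $uri->query_form(id => 21, extern_eid => $eid);
    return $uri->as_string;
}

# Hypothetical helper: fetch a range of ids, keeping only successful pages.
sub fetch_range {
    my ($ua, $first, $last) = @_;
    my %pages;
    for my $eid ($first .. $last) {
        my $response = $ua->get(build_url($eid));
        if ($response->is_success) {
            $pages{$eid} = $response->decoded_content;
        }
        else {
            warn "extern_eid=$eid: ", $response->status_line, "\n";
        }
        sleep 1;    # be polite to the server
    }
    return \%pages;
}
```

A call such as fetch_range(LWP::UserAgent->new, 700, 710) would then return a hash reference keyed by extern_eid, ready to be handed to the parser of your choice.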
Alternatively, you can add a request_prepare handler that computes and adds the query arguments before the request is sent.
You describe following links for the purpose of web scraping. The LWP subclass WWW::Mechanize does this more easily than your current attempt.
I want to access a GWT service from a Python script, so I want to generate a x-gwt-rpc request manually. Can't seem to find any info on the format of a GWT RPC call, since everybody does it from Java (so the call is generated by the framework). Where can I find some detailed documentation about this format?
I don't think that is a trivial task, but because GWT is open source, I would say the source code is pretty good documentation of how it works, provided you know Java.
GWT source
I stumbled on the same problem as you and I think I solved it rather easily.
Though I haven't figured out how to catch the response properly, I managed to get the response and successfully send the request. Here is what I did:
import requests
url = 'your url'
header = {'Accept': '*/*',
          'Accept-Encoding': 'gzip, deflate',
          # etc...
          }
cookie = {}  # cookies if needed
# The request payload you can see in the browser's F12 tools;
# just copy and paste it as a string (UTF-8 chars).
data_g = 'this would be the request payload'
t = requests.post(url, headers=header, data=data_g, cookies=cookie)
# Print the names of all attributes of t
print(vars(t).keys())
print(t)
Also these are some good links you should check out:
https://github.com/GDSSecurity/GWT-Penetration-Testing-Toolset
https://docs.google.com/document/d/1eG0YocsYYbNAtivkLtcaiEE5IOF5u4LUol8-LL0TIKU/edit?hl=de&forcehl=1
I am using WWW::Mechanize and currently handling HTTP responses with the 'Content-Encoding: gzip' header in my code by first checking the response headers and then using IO::Uncompress::Gunzip to get the uncompressed content.
However I would like to do this transparently so that WWW::Mechanize methods like form(), links() etc work on and parse the uncompressed content. Since WWW::Mechanize is a sub-class of LWP::UserAgent, I would prefer to use the LWP::UA::handlers to do this.
While I have been partly successful (I can print the uncompressed content for example), I am unable to do this transparently in a way that I can call
$mech->forms();
In summary: How do I "replace" the content inside the $mech object so that from that point onwards, all WWW::Mechanize methods work as if the Content-Encoding never happened?
I would appreciate your attention and help.
Thanks
WWW::Mechanize::GZip, I think.
It looks to me like you can replace it by using the $res->content( $bytes ) method.
By the way, I found this stuff by looking at the source of LWP::UserAgent, then HTTP::Response, then HTTP::Message.
It is built into LWP::UserAgent and thus Mechanize. One MAJOR caveat to save you some hair:
To debug, make sure you check the error variable $@ after the call to decoded_content.
$html = $r->decoded_content;
die $@ if $@;
Better yet, look through the source of HTTP::Message and make sure all the support packages are there.
In my case, decoded_content returned undef while content was raw binary, and I went on a wild goose chase. UserAgent will set the error flag on failure to decode, but Mechanize will just ignore it (it doesn't check or log the incident as its own error or warning).
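The check can be tried without any network access. The sketch below builds a gzip-encoded HTTP::Response by hand and decodes it, dying with the contents of $@ if decoded_content returns undef. The helper names gzip_response and decode_or_die and the sample HTML are made up for illustration.

```perl
use strict;
use warnings;
use HTTP::Response;
use IO::Compress::Gzip qw(gzip $GzipError);

# Hypothetical helper: build a gzip-encoded response locally, so the
# decoding path can be exercised offline.
sub gzip_response {
    my ($html) = @_;
    my $compressed;
    gzip(\$html => \$compressed) or die "gzip failed: $GzipError";
    return HTTP::Response->new(
        200, 'OK',
        [ 'Content-Type'     => 'text/html; charset=UTF-8',
          'Content-Encoding' => 'gzip' ],
        $compressed,
    );
}

# Hypothetical helper: decode, and surface the reason if decoding fails.
sub decode_or_die {
    my ($res) = @_;
    my $html = $res->decoded_content;
    die "decoded_content failed: $@" unless defined $html;
    return $html;
}

my $page = '<html><body><form action="/login"></form></body></html>';
print decode_or_die(gzip_response($page)), "\n";
```

If a support module is missing on your system, the die message names it directly instead of leaving you with a silent undef.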
In my case $@ says: "Can't find IO/HTML.pm .." (it was eval'ed).
After having to dive into the source, I found that the built-in decoding process is long, meticulous, and arduous, covering just about every scenario and making tons of guesses (Thank you Gisle!).
If you are paranoid, explicitly set the default header to be used with every request at new():
$browser = new WWW::Mechanize('default_headers' => HTTP::Headers->new('Accept-Encoding'
=> scalar HTTP::Message::decodable()));