I want to print the redirected url in perl.
Input url : http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv
output url : http://www.snapdeal.com/product/vox-2-in-1-camcorder/1154987704?utm_source=aff_prog&utm_campaign=afts&offer_id=17&aff_id=1298&source=pricecheckindia
use LWP::UserAgent qw();
use CGI qw(:all);
print header();
my ($url) = "http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv";
my $ua = LWP::UserAgent->new;
my $req = new HTTP::Request(GET => $url);
my $res = $ua->request($req);
print $res->request;
How to get this done in perl?
You need to examine the HTTP response to find the URL. The documentation of HTTP::Response gives full details of how to do this, but to summarise, you should do the following:
use strict;
use warnings;
use feature ':5.10'; # enables "say"
use LWP::UserAgent;
my $url = "http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv";
my $ua = LWP::UserAgent->new;
my $req = new HTTP::Request(GET => $url);
my $res = $ua->request($req);
# you should add a check to ensure the response was actually successful:
if (! $res->is_success) {
say "GET failed! " . $res->status_line;
}
# show the base URI for the response:
say "Base URI: " . $res->base;
You can view redirects using HTTP::Response's redirects method:
if ($res->redirects) { # are there any redirects?
my @redirects = $res->redirects;
say join(", ", @redirects);
}
else {
say "No redirects.";
}
In this case, the base URI is the same as $url, and if you examine the contents of the page, you can see why.
# print out the contents of the response:
say $res->decoded_content;
Right near the bottom of the page, there is the following code:
$(window).load(function() {
window.setTimeout(function() {
window.location = "http://www.snapdeal.com/product/vox-2-in-1-camcorder/1154987704?utm_source=aff_prog&utm_campaign=afts&offer_id=17&aff_id=1298&source=pricecheckindia"
}, 300);
});
The redirect is handled by javascript, and so is not picked up by LWP::UserAgent. If you want to get this URL, you will need to extract it from the response contents (or use a different client that supports javascript).
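Extracting the URL from the response contents could look something like the following. This is only a minimal sketch: it assumes the page assigns a double-quoted string literal to window.location, as in the snippet above. The helper name extract_js_redirect is hypothetical, and the $html string is a stand-in for $res->decoded_content.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first URL assigned to window.location
# in a chunk of HTML/JavaScript, or undef if there is none.
sub extract_js_redirect {
    my ($html) = @_;
    return $html =~ /window\.location\s*=\s*"([^"]+)"/ ? $1 : undef;
}

# Stand-in for $res->decoded_content (copied from the page source above)
my $html = <<'HTML';
$(window).load(function() {
window.setTimeout(function() {
window.location = "http://www.snapdeal.com/product/vox-2-in-1-camcorder/1154987704?utm_source=aff_prog&utm_campaign=afts&offer_id=17&aff_id=1298&source=pricecheckindia"
}, 300);
});
HTML

my $target = extract_js_redirect($html);
print defined $target ? "$target\n" : "No JavaScript redirect found.\n";
```

A regex like this is brittle: it breaks as soon as the site changes how it writes the redirect, so treat it as a stopgap rather than a general solution.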
On a different note, your script starts off like this:
use LWP::UserAgent qw();
The list following the module name, qw(), names the subroutines to import into your script so that you can call them by their short names (instead of having to refer to the module name and the subroutine name). An empty list imports nothing, and LWP::UserAgent does not export anything anyway, so you can just omit it.
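To illustrate what an import list does, here is a small sketch using two core modules, List::Util and Scalar::Util. They are chosen only because they export functions; they are not part of the original script.

```perl
use strict;
use warnings;

use List::Util qw(max);   # import max() by name
use Scalar::Util ();      # load the module, import nothing

my $largest = max(3, 17, 5);                        # imported: short name works
my $is_num  = Scalar::Util::looks_like_number(42);  # not imported: full name needed

print "largest: $largest\n";
print "42 looks like a number\n" if $is_num;
```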
To have LWP::UserAgent follow redirects, just set the max_redirect option:
use strict;
use warnings;
use LWP::UserAgent qw();
my $url = "http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv";
my $ua = LWP::UserAgent->new( max_redirect => 5 );
my $res = $ua->get($url);
if ( $res->is_success ) {
print $res->decoded_content; # or whatever
} else {
die $res->status_line;
}
However, that website is using a JavaScript redirect.
$(window).load(function() {
window.setTimeout(function() {
window.location = "http://www.snapdeal.com/product/vox-2-in-1-camcorder/1154987704?utm_source=aff_prog&utm_campaign=afts&offer_id=17&aff_id=1298&source=pricecheckindia"
}, 300);
});
This will not work unless you use a framework that enables JavaScript, like WWW::Mechanize::Firefox.
The last line, print $res->request, will not print the page: request returns the HTTP::Request object (a blessed hash reference) that produced the response, not the response content. Print the content instead:
use LWP::UserAgent qw();
use CGI qw(:all);
print header();
my ($url) = "http://pricecheckindia.com/go/store/snapdeal/52517?ref=velusliv";
my $ua = LWP::UserAgent->new;
my $req = new HTTP::Request(GET => $url);
my $res = $ua->request($req);
print $res->content;
I'm writing my first perl script for the requirement
generate HTTP request against a particular web uri in succession using different URL scheme patterns
use HTTP::Request::Generator 'generate_requests';
use URI;
use HTTP::Request::Common;
use strict; # safety net
use warnings; # safety net
use Test::LWP::UserAgent 'send_request';
use LWP::UserAgent 'send_request';
use Test::More;
use URI;
use HTTP::Request::Common;
use LWP::UserAgent;
my $g = generate_requests(
method => 'POST',
host => ['example.com','www.example.com'],
pattern => 'https://example.com/{bar,foo,gallery}/[00..99].html',
wrap => sub {
my( $req ) = @_;
# Fix up some values
$req->{headers}->{'Content-Length'} = 666;
},
);
while( my $r = $g->()) {
send_request( $r );
};
I'm using atom editor and activeperl on windows 10, I get following error from running above code.
Undefined subroutine &main::send_request called at C:\Users\ADMINI~1\AppData\Local\Temp\atom_script_tempfiles\0ac821e0-0886-11eb-9588-291dbc37d883 line 57.
I have already installed all the necessary modules and libraries, but I think it is unable to resolve the send_request method. Please assist.
NOTE
I have replaced real values in variable for privacy reasons.
UPDATE
I plan to use following module
pattern => 'https://example.{com,org,net}/page_[00..99].html', from
https://metacpan.org/pod/HTTP::Request::Generator.
LWP::UserAgent is an object-oriented module. It doesn't export functions. You want to call send_request like this:
my $ua = 'LWP::UserAgent'->new;
while ( my $r = $g->() ) {
$ua->send_request( $r );
}
That said, send_request is an undocumented internal method. I think it is probably more intended for people who are subclassing LWP::UserAgent. You probably want the request method instead.
my $ua = 'LWP::UserAgent'->new;
while ( my $r = $g->() ) {
my $response = $ua->request( $r );
}
Full code:
use strict;
use warnings;
use HTTP::Request::Generator 'generate_requests';
use LWP::UserAgent;
my $ua = 'LWP::UserAgent'->new;
my $gen = generate_requests(
method => 'POST',
host => [ 'example.com', 'www.example.com' ],
pattern => 'https://example.com/{bar,foo,gallery}/[00..99].html',
wrap => sub {
my ( $req ) = @_;
# Fix up some values
$req->{'headers'}{'Content-Length'} = 666;
},
);
while ( my $req = $gen->() ) {
my $response = $ua->request( $req );
# Do something with $response here?
}
I have a doubt I've been trying to solve myself using CPAN modules documentation, but I'm a bit new and I'm confused with some terminology and sections within the different modules.
I'm trying to create the object in the code below, and get the absolute URL for relative links extracted from a website.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI;
my $url = $ARGV[0];
if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
exit(0);
}
my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );
my $response = $ua->get( $url );
my $content = $response->decoded_content();
my $links = URI->new($content);
my $abs = $links->abs('http:', $content);
my $abs_links = $links->abs($abs);
while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) {
$abs_links = $1;
print "$abs_links\n";
print "Digest for the above URL is " . md5_hex($abs_links) . "\n";
}
The problem is when I try to add that part outside the While loop (the 3-line block preceding the loop), it does not work, whereas if I add the same part in the While loop, it will work fine. This one just gets the relative URLs from a given website, but instead of printing "Http://..." it prints "//...".
The script that works fine for me is the following:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);
use URI::URL;
my $url = $ARGV[0]; ## Url passed in command
if ($url !~ m{^https?://[\w]+-?[\w]+\.com/?}i) {
exit(0); ## Program stops if not valid URL
}
my $ua = LWP::UserAgent->new;
$ua->timeout( 10 );
my $response = $ua->get( $url ); ## Get response, not content
my $content = $response->decoded_content(); ## Now let's get the content
while ($content =~ m{<a[^>]\s*href\s*=\s*"?([^"\s>]+)}gis) { ## All links
my $links = $1;
my $abs = new URI::URL "$links";
my $abs_url = $abs->abs('http:', $links);
print "$abs_url\n";
print "Digest for the above URL is " . md5_hex($abs_url) . "\n";
}
Any ideas? Much appreciated.
I don't understand your code. There are a few weird bits:
[^\W] is the same as \w
The regex allows an optional - before and an optional / after .com, i.e. http://bitwise.complement.biz matches but http://cool-beans.com doesn't.
URI->new($content) makes no sense: $content is random HTML, not a URI.
$links->abs('http:', $content) makes no sense: $content is simply ignored, and $links->abs('http:') tries to make $links an absolute URL relative to 'http:', but 'http:' is not a valid URL.
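The point about the regex can be checked directly. The sketch below runs the question's pattern against the two example hosts mentioned above; nothing else is assumed.

```perl
use strict;
use warnings;

my $re = qr{^https?://[^\W]+-?\.com/?}i;   # validation regex from the question

my @results;
for my $candidate ('http://bitwise.complement.biz', 'http://cool-beans.com') {
    my $verdict = $candidate =~ $re ? 'matches' : 'does not match';
    push @results, "$candidate $verdict";
    print "$results[-1]\n";
}
```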
Here's what I think you're trying to do:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::LinkExtor;
use Digest::MD5 qw(md5_hex);
@ARGV == 1 or die "Usage: $0 URL\n";
my $url = $ARGV[0];
my $ua = LWP::UserAgent->new(timeout => 10);
my $response = $ua->get($url);
$response->is_success or die "$0: " . $response->request->uri . ": " . $response->status_line . "\n";
my $content = $response->decoded_content;
my $base = $response->base;
my @links;
my $p = HTML::LinkExtor->new(
sub {
my ($tag, %attrs) = @_;
if ($tag eq 'a' && $attrs{href}) {
push @links, "$attrs{href}"; # stringify
}
},
$base,
);
$p->parse($content);
$p->eof;
for my $link (@links) {
print "$link\n";
print "Digest for the above URL is " . md5_hex($link) . "\n";
}
I don't try to validate the URL passed in $ARGV[0]. Leave it to LWP::UserAgent. (If you don't like this, just add the check back in.)
I make sure $ua->get($url) was successful before proceeding.
I get the base URL for absolutifying relative links from $response->base.
I use HTML::LinkExtor for parsing the content, extracting links, and making them absolute.
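To see what making a link absolute against a base URL means in isolation, here is a minimal sketch using URI->new_abs. The base URL and the sample href values are made up for illustration; in the script above the base would come from $response->base.

```perl
use strict;
use warnings;
use URI;

my $base = 'http://example.com/dir/page.html';   # hypothetical base URL

# Sample href values: protocol-relative, root-relative, relative, absolute
my @hrefs = ('//cdn.example.com/lib.js', '/about', 'other.html', 'https://example.org/');

my %absolute;
for my $href (@hrefs) {
    $absolute{$href} = URI->new_abs($href, $base)->as_string;
    print "$href -> $absolute{$href}\n";
}
```

The first sample is the "//..." case from the question: resolved against a real base URL, it picks up that base's scheme.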
I think your biggest mistake is trying to parse links out of HTML using a regular expression. You would be far better advised to use a CPAN module for this. I'd recommend WWW::Mechanize, which would make your code look something like this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use WWW::Mechanize;
use Digest::MD5 qw(md5_hex);
use URI;
my $url = $ARGV[0];
if ($url !~ m{^https?://[^\W]+-?\.com/?}i) {
exit(0);
}
my $ua = WWW::Mechanize->new;
$ua->timeout( 10 );
$ua->get( $url );
foreach ($ua->links) {
say $_->url;
say "Digest for the above URL is " . md5_hex($_->url) . "\n";
}
That looks a lot simpler to me.
The if statement is showing me that there is a response, but when I try to print the response I get nothing
use LWP::UserAgent;
use strict;
use warnings;
use HTTP::Request::Common;
# use this {"process": "mobileGps","phone": "9565551236"}
my $url = "the url goes here";
my $json = '{data :{"process" : "mobileGps", "phone" : "9565551236"}}';
my $req = HTTP::Request->new( POST => $url );
$req->header( 'Content-Type' => 'application/json' );
$req->content( $json );
my $ua = LWP::UserAgent->new;
my $res = $ua->request( $req );
if ( $res->is_success ) {
print "It worked";
print $res->decoded_content;
}
else {
print $res->code;
}
I do have the URL: I just took it out for the purpose of this example.
What am I missing?
Try debugging your script like this:
use strict;
use warnings;
use HTTP::Request::Common;
use LWP::ConsoleLogger::Easy qw( debug_ua );
use LWP::UserAgent;
# use this {"process": "mobileGps","phone": "9565551236"}
my $url = "the url goes here";
my $json = '{data :{"process" : "mobileGps", "phone" : "9565551236"}}';
my $req = HTTP::Request->new(POST => $url);
$req->header('Content-Type' =>'application/json');
$req->content($json);
my $ua = LWP::UserAgent->new;
debug_ua( $ua );
my $res = $ua->request($req);
if ($res->is_success) {
print "It worked";
print $res->decoded_content;
} else {
print $res->code;
}
That will (hopefully) give you a better idea of what's going on.
Can you not use the debugger, or add some print statements to see how your program is progressing?
If not, then this is going to be another case of on-line turn-by-turn debugging, which benefits no one except the OP, and the ultimate diagnosis is that they should have learned the language first.
The internet can make you wise, but it will produce many more pretenders than craftsmen.
Please don't ever expect to make a half-hearted attempt at a sketch, and then rope in the rest of the world to finish your job. It takes a huge amount of experience, aptitude, and understanding to get even a "What's your name" .. "Hello" program working, and things only get harder thereafter.
If you don't like being careful and thorough, and would rather ask for people to do your stuff for you than discover a solution by experimentation, then you are a manager, not a programmer. I hope you will never try to advance a software career by getting great at delegating, because that doesn't work with software.
Here. Use this as you will. The world is full of managers; it is good programmers that we need.
use strict;
use warnings 'all';
use feature 'say';
use constant URL => 'http://example.com/';
use LWP;
my $ua = LWP::UserAgent->new;
my $json = '{}';
my $req = HTTP::Request->new( POST => URL );
$req->header( content_type => 'application/json' );
$req->content( $json );
my $res = $ua->request( $req );
say $res->as_string;
The code is fine. If the request comes back with status code 200 but an empty body, the problem must be with the server handling the request. You should check at the server's end.
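One thing that may be worth ruling out, although nothing in this thread confirms it as the cause: the hand-written $json string leaves the key data unquoted, which a strict JSON parser rejects. Below is a minimal sketch of building the body with the core JSON::PP module instead. The nesting under data is copied from the question; whether the server actually expects that shape is an assumption.

```perl
use strict;
use warnings;
use JSON::PP;

# The string from the question, checked with a strict parser
my $handwritten = '{data :{"process" : "mobileGps", "phone" : "9565551236"}}';
my $parses = eval { decode_json($handwritten); 1 } ? 'yes' : 'no';
print "hand-written string parses: $parses\n";

# Build the same structure from Perl data instead
my %payload = ( data => { process => 'mobileGps', phone => '9565551236' } );

# canonical() sorts the keys so the output is stable between runs
my $json = JSON::PP->new->canonical->encode( \%payload );
print "$json\n";
```

The resulting $json can be passed to $req->content( $json ) exactly as before.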
I've created a perl script to use HTML::TableExtract to scrape data from tables on a site.
It works great to dump out table data for unsecured sites (i.e. HTTP site), but when I try HTTPS sites, it doesn't work (the tables_report line just prints blank.. it should print a bunch of table data).
However, if I take the content of that HTTPS page, and save it to an html file and then post it on an unsecured HTTP site (and change my content to point to this HTTP page), this script works as expected.
Anyone know how I can get this to work over HTTPS?
#!/usr/bin/perl
use lib qw( ..);
use HTML::TableExtract;
use LWP::Simple;
use Data::Dumper;
# DOESN'T work:
my $content = get("https://datatables.net/");
# DOES work:
# my $content = get("http://www.w3schools.com/html/html_tables.asp");
my $te = HTML::TableExtract->new();
$te->parse($content);
print $te->tables_report(show_content=>1);
print "\n";
print "End\n";
The sites mentioned above for $content are just examples.. these aren't really the sites I'm extracting, but they work just like the site I'm really trying to scrape.
One option I guess is for me to use perl to download the page locally first and extract from there, but I'd rather not, if there's an easier way to do this (anyone that helps, please don't spend any crazy amount of time coming up with a complicated solution!).
The problem is related to the user agent string that LWP::Simple sends, which that site blocks. Use LWP::UserAgent and set a user agent string the site accepts, like this:
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $url = 'https://datatables.net/';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success) {
# ok -> I simply print the content in this example, you should parse it
print $res->decoded_content;
}
else {
# ko
print "Error: ", $res->status_line, "\n";
}
This is because datatables.net is blocking LWP::Simple requests. You can confirm this with the code below:
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
print is_success(getprint("https://datatables.net/"));
Output:
$ perl test.pl
403 Forbidden <URL:https://datatables.net/>
You could try using LWP::RobotUA. The code below works fine for me.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::TableExtract;
my $ua = LWP::RobotUA->new( 'bot_chankey/1.1', 'chankeypathak@stackoverflow.com' );
$ua->delay(5/60); # 5 second delay between requests
my $response = $ua->get('https://datatables.net/');
if ( $response->is_success ) {
my $te = HTML::TableExtract->new();
$te->parse($response->content);
print $te->tables_report(show_content=>1);
}
else {
die $response->status_line;
}
In the end, a combination of Miguel and Chankey's responses provided my solution. Miguel's made up most of my code, so I selected that as the answer, but here is my "final" code (got a lot more to do, but this is all I couldn't figure out.. the rest should be no problem).
I couldn't quite get either mentioned by Miguel/Chankey to work, but they got me 99% of the way.. then I just had to figure out how to get around the error "certificate verify failed". I found that answer with Miguel's method right away, so in the end, I mostly used his code, but both responses were great!
#!/usr/bin/perl
use lib qw( ..);
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TableExtract;
use LWP::RobotUA;
use Data::Dumper;
my $ua = LWP::UserAgent->new(
ssl_opts => { SSL_verify_mode => 'SSL_VERIFY_PEER' },
);
my $url = 'https://WebsiteIUsedWasSomethingElse.com';
$ua->agent("Mozilla/5.0"); # set user agent
my $res = $ua->get($url); # send request
# check the outcome
if ($res->is_success)
{
my $te = HTML::TableExtract->new();
$te->parse($res->content);
print $te->tables_report(show_content=>1);
}
else {
# ko
print "Error: ", $res->status_line, "\n";
}
my $url = "https://ohsesfire01.summit.network/reports/slices";
my $user = 'xxxxxx';
my $pass = 'xxxxxx';
my $ua = new LWP::UserAgent;
my $request = new HTTP::Request GET=> $url;
# authenticate
$request->authorization_basic($user, $pass);
my $page = $ua->request($request);
I am attempting to request a token from https://launchpad.net, according to the docs all it wants is a POST to /+request-token with the form encoded values of oauth_consumer_key, oauth_signature, and oauth_signature_method. Providing those items via curl works as expected:
curl --data "oauth_consumer_key=test-app&oauth_signature=%26&oauth_signature_method=PLAINTEXT" https://launchpad.net/+request-token
However, when I attempt to do it through my Perl script, it gives me a 401 Unauthorized error.
#!/usr/bin/env perl
use strict;
use YAML qw(DumpFile);
use Log::Log4perl qw(:easy);
use LWP::UserAgent;
use Net::OAuth;
$Net::OAuth::PROTOCOL_VERSION = Net::OAuth::PROTOCOL_VERSION_1_0A;
use HTTP::Request::Common;
use Data::Dumper;
use Browser::Open qw(open_browser);
my $ua = LWP::UserAgent->new;
my ($home) = glob '~';
my $cfg = "$home/.lp-auth.yml";
my $access_token_url = q[https://launchpad.net/+access-token];
my $authorize_path = q[https://launchpad.net/+authorize-token];
sub consumer_key { 'lp-ua-browser' }
sub request_url {"https://launchpad.net/+request-token"}
my $request = Net::OAuth->request('consumer')->new(
consumer_key => consumer_key(),
consumer_secret => '',
request_url => request_url(),
request_method => 'POST',
signature_method => 'PLAINTEXT',
timestamp => time,
nonce => nonce(),
);
$request->sign;
print $request->to_url;
my $res = $ua->request(POST $request->to_url, Content $request->to_post_body);
my $token;
my $token_secret;
print Dumper($res);
if ($res->is_success) {
my $response =
Net::OAuth->response('request token')->from_post_body($res->content);
$token = $response->token;
$token_secret = $response->token_secret;
print "request token ", $token, "\n";
print "request token secret", $token_secret, "\n";
open_browser($authorize_path . "?oauth_token=" . $token);
}
else {
die "something broke ($!)";
}
I tried both with $request->sign and without it, as I don't think it is required during the request token phase. Any help with this would be appreciated.
Update: I switched to LWP::UserAgent and had to pass in both POST and Content:
my $res = $ua->request(POST $request->to_url, Content $request->to_post_body);
Thanks
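For comparison, the curl call above can be approximated without Net::OAuth by letting HTTP::Request::Common form-encode the three fields. This is a sketch only: it builds the request and prints it without sending anything, and the values are the test values from the curl command, not real credentials.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Passing an array reference makes POST build an
# application/x-www-form-urlencoded body from the key/value pairs.
my $req = POST 'https://launchpad.net/+request-token', [
    oauth_consumer_key     => 'test-app',
    oauth_signature        => '&',
    oauth_signature_method => 'PLAINTEXT',
];

print $req->method, ' ', $req->uri, "\n";
print $req->header('Content-Type'), "\n";
print $req->content, "\n";
```

To actually send it, hand $req to a user agent, for example $ua->request($req).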
Sorry, I'm not able to verify this from my tablet, but with recent versions of LWP (see the link below) you should install and use
use LWP::Protocol::https;
http://blogs.perl.org/users/brian_d_foy/2011/07/now-you-need-lwpprotocolhttps.html