I am trying to write a basic webscraping program in Perl. For some reason it is not working correctly and I don't have the slightest clue as to why.
Just the first part of my code where I am getting the content (just saving all of the HTML code from the webpage to a variable) does not work with certain websites.
I am testing it by just printing it out, and it does not print anything out with this specific website. It works with some other sites, but not all.
Is there another way of doing this that will work?
#use strict;
use LWP::Simple qw/get/;
use LWP::Simple qw/getstore/;
## Grab a Web page, and throw the content in a Perl variable.
my $content = get("https://jobscout.lhh.com/Portal/Page/ResumeProfile.aspx?Mode=View&ResumeId=53650");
print $content;
You have a badly-written web site there. The request times out with a 500 Internal Server Error.
I can't suggest how to get around it, but the site almost certainly uses JavaScript as well which LWP doesn't support, so I doubt if an answer would be much use to you.
Update
It looks like the site has been written so that it goes crazy if there is no Accept-Language header in the request.
The full LWP::UserAgent module is necessary to set it up, like this
use strict;
use warnings;
use LWP;
my $ua = LWP::UserAgent->new(timeout => 10);
my $url = 'https://jobscout.lhh.com/Portal/Page/ResumeProfile.aspx?Mode=View&ResumeId=53650';
my $resp = $ua->get($url, accept_language => 'en-gb,en', );
print $resp->status_line, "\n\n";
print $resp->decoded_content;
This returns with a status of 200 OK and some HTML.
To interact with a website that uses Javascript, I would advise that you use the following module:WWW::Mechanize::Firefox
use strict;
use warnings;
use WWW::Mechanize::Firefox;
my $url = "https://jobscout.lhh.com/Portal/Page/ResumeProfile.aspx?Mode=View&ResumeId=53650"
my $mech = WWW::Mechanize::Firefox->new();
$mech->get($url);
print $mech->status();
my $content = $mech->content();
Related
I have a question here:
For some reason when I call get request in perl script, I am getting only half page.
However this is happening only on certain machines, but on some machines it is returning whole page as expected.
The requested page size is only 64 kb, and perl is running on windows 7 (RAM 8GB, 64 bit OS).
I just wondering if anyone have seen this issue before? Is there any way to solve it?
Thank you!
Here is my code
use strict;
use warnings;
use File::Slurp;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
$ua->default_header( 'Authorization' => 'Basic [encoded credential]' );
$ua->max_size(2000000000) ;
my $resp = $ua->get( 'url', ':content_file'=>'test.txt' );
say $resp->status_line;
my $content = read_file('test.txt');
print $content;
I am a new hire in my company and first time I am working on Perl.
I get a task in which I find IP-Reputation from this link: https://www.talosintelligence.com/reputation_center/lookup?search=27.34.246.62
But in perl when we use:
#!/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
open FILE1, ">./Reports/Reputation.txt" or die "Cannot open Reputation.txt!";
my $mech = WWW::Mechanize->new( autocheck => 1 );
my $url="https://www.talosintelligence.com/reputation_center/lookup?search=27.34.246.62";
$mech->get($url);
print $mech->status();
my $content = $mech->content();
open FILE1, ">./Reports/Reputation.txt" or die "Cannot open Reputation.txt!";
print FILE1 ($content);
close FILE1;
print "\nIP Reputation Report Generated \n";
I don't get the whole content. What can I do to get this?
Contents are loading from JavaScript. So you can't crawl the content using simple methods.
There is two option for this kind of situation.
1) Some API contains the original data and JavaScript loads/formating the data in front end. If you want to parse the JavaScript loading content try to use the
WWW::Mechanize::Firefox
2) Try to figure out from where it is loading, for your IP following link has the corresponding data, which is JSON formated, so parse the content using JSON module. it is so simple compare to using RegEx.
https://www.talosintelligence.com/sb_api/query_lookup?query=%2Fapi%2Fv2%2Frelated_ips%2Fip%2F&query_entry=27.34.246.62
The problem I met is that I need to get one URL (I cannot be specific that link exactly, this link is doing request and looks like http://link.com/?name=name&password=password& and etc)
And I need to fetch this URL 100 times in a row. I can not do this manually using browser - this takes much time.
Is there any option to run (just run, like you put link in browser and press enter) this link 100 times in a row using Perl scripting?
I have not met before with the Perl and therefore asking the help directly. As I google before some information and make a little script, but seems like I missing something in my knowledge:
#!/usr/bin/perl -w
use LWP::Simple;
my $uri = 'http://my link here';
my $content = get $uri;
Could you please advise to me how I can finish this script?
Use a (simple) for loop.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $uri = 'http://my link here';
get $uri for 1 .. 100;
Update: Just read in a comment that you don't care about the returned data, so I've edited my answer to remove the unnecessary $content variable.
I am trying to write a Perl script which will automatically key in search variables on this LexisNexis search page and retrieve the search results.
I am using the WWW::Mechanize module but I am not sure how to figure out the field name of the search bar itself. This is the script I have so far ->
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $m = WWW::Mechanize->new();
my $url = "http://www.lexisnexis.com/hottopics/lnacademic/?verb=sr&csi=379740";
$m->get($url);
$m->form_name('f');
$m->field('q', 'Test');
my $response = $m->submit();
print $response->content();
However, I think the "Name" of the search box in this website is not "q". I am getting the following Error - "Can't call method "value" on an undefined value at site/lib/WWW/Mechanize.pm line 1442." Any help is much appreciated. Thank you !
If you disable the JavaScript in your browser then you will notice that the search form doesn't load which means it's being loaded by JavaScript, that's why you are unable to handle it with WWW::Mechanize. Have a look at WWW::Mechanize::Firefox, this might help you with your task. Check out the example scripts, cookbook and FAQs.
You can also do the same using Selenium, see Gabor's tutorial on Selenium.
I would like to automate this website with a Perl script
http://bioinfo.uni-plovdiv.bg/microinspector/
This is what I have so far and I am not sure how to get to the output page after this, I know it has something to do with POST, redirect_ok?, response(), but I am not sure. I read through the documentation but am confused about some things. Thanks.
use strict;
use warnings;
use WWW::Mechanize;
# create object for browser
my $browser = WWW::Mechanize->new();
my ($sequence, $results);
open (DRG, "<microRNA_target_cspg_drg_output.fa") || die "cannot open microRNA_target_cspg_drg_output.fa";
while (<DRG>) {
chomp;
$sequence=$_;
last; #for testing purposes
}
close (DRG);
$browser->get("http://bioinfo.uni-plovdiv.bg/microinspector/");
$browser->form_number(1);
$browser->field("target_sequence", $sequence);
$browser->field("Choose an organism : ", "Mus musculus");
$browser->click_button( number => 1);
You should start with WWW::Mechanize. It's page provides examples on submitting forms, and anything else you will need.
EDIT: as a reply to your update, if you want to get the content of the page, use the content method, like in this example:
my $content = $browser->content();
See this for more info.