How can I download link targets from a web site using Perl?

I just wrote a script that grabs links from a website and saves them to a text file.
Now I'm working on a regex to pick out, from that text file, the links whose URL contains php?dl=:
E.g.: www.example.com/site/admin/a_files.php?dl=33931
It's pretty much the address you see when you hover over the download button on the site, which you can click to download or right-click to save.
I'm wondering how to do this from the script: download the content at the given address, which is a *.txt file.

Make WWW::Mechanize your new best friend.
Here's why:
It can identify links on a webpage that match a specific regex (/php\?dl=/ in this case)
It can follow those links through the follow_link method
It can get the targets of those links and save them to file
All this without needing to save your wanted links in an intermediate file! Life's sweet when you have the right tool for the job...
Example
use strict;
use warnings;
use WWW::Mechanize;
my $url = 'http://www.example.com/';
my $mech = WWW::Mechanize->new();
$mech->get($url);
# Match on the link URL (url_regex), not on the link text
my @linksOfInterest = $mech->find_all_links( url_regex => qr/php\?dl=/ );
my $fileNumber = 1;
foreach my $link (@linksOfInterest) {
    # ':content_file' saves the response body straight to disk
    $mech->get( $link, ':content_file' => "file" . ($fileNumber++) . ".txt" );
    $mech->back();
}

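You can download the file with LWP::UserAgent:
use LWP::UserAgent;
my $ua = LWP::UserAgent->new();
my $response = $ua->get($url, ':content_file' => 'file.txt');
Or, if you need a filehandle on the content instead, fetch without ':content_file' (that option writes the body to disk and leaves the response content empty) and open the handle on the content reference:
my $response = $ua->get($url);
open my $fh, '<', $response->content_ref or die $!;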

Old question, but when I'm doing quick scripts, I often use "wget" or "curl" and pipe. This isn't cross-system portable, perhaps, but if I know my system has one or the other of these commands, it's generally good.
For example:
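#!/usr/bin/env perl
use strict;
use warnings;
# List-form pipe open runs curl without a shell; -s silences the progress meter
open my $fp, '-|', 'curl', '-s', 'http://www.example.com/' or die "Cannot run curl: $!";
while (<$fp>) {
    print;
}
close $fp or warn "curl exited with status $?\n";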

Related

Perl: Open a file from a URL

I wanted to know how to open a file from a URL rather than a local file and I found the following answer on another thread:
use IO::String;
my $handle = IO::String->new(get("google.com"));
my @lines = <$handle>;
close $handle;
This works perfectly... on my PC...
But when I transferred the code over to my hosted server it complains that it can't find the IO module. So is there another way to open a file from an URL, that doesn't require any external modules (or uses one that is pretty much installed on every server)...?

You can install PerlIO::http, which will give you an input layer for opening a filehandle from a URL via open. This thing is not included in the Perl core, but it will work with Perls as early as 5.8.9.
Once it's installed, all you need to do is pass the :http layer in the mode argument of open. No use statement is needed; the layer is loaded automatically.
open my $fh, '<:http', 'https://metacpan.org/recent';
You can then read from $fh like a regular file. Under the hood it will take care of getting the data over the wire.
while (my $line = <$fh>) { ... }

There is no way to "open a file from a URL" as you ask. Well, I suppose you could throw something together using the progress() callback from LWP::UserAgent, but even then I don't think it would work how you want it to.
But you can make something that looks like it's doing what you want pretty easily. Actually, what we're really doing is pulling all the data back from the URL and then opening a filehandle on a string that contains that data.
use LWP::Simple;
my $data = get('https://google.com');
open my $url_fh, '<', \$data or die $!;
# Now $url_fh is a filehandle wrapped around your data.
# Treat it like any other filehandle.
while (<$url_fh>) {
print;
}
Your problem was that IO::String wasn't installed. But there's no need to install it, as it's simple enough to do what it does with standard Perl features (simply open a filehandle on a reference to a string).
Update: IO::String is completely unnecessary here. Not only because you can do what it does very simply, by just opening a filehandle on a reference to your string, but also because all you want to do is to read a file from a web site into an array. And in that case, your code is simply:
use LWP::Simple;
my $url = 'something';
my @records = split /\n/, get($url);
You might even consider adding some error handling.
use LWP::Simple;
my $url = 'something';
my $data = get($url);
die "No data found\n" unless defined $data;
my @array = split /\n/, $data;

get the whole content from the web site using perl script

I am a new hire at my company and this is my first time working with Perl.
My task is to look up an IP reputation from this link: https://www.talosintelligence.com/reputation_center/lookup?search=27.34.246.62
But in Perl, when I use:
#!/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new( autocheck => 1 );
my $url="https://www.talosintelligence.com/reputation_center/lookup?search=27.34.246.62";
$mech->get($url);
print $mech->status();
my $content = $mech->content();
open FILE1, ">./Reports/Reputation.txt" or die "Cannot open Reputation.txt!";
print FILE1 ($content);
close FILE1;
print "\nIP Reputation Report Generated \n";
I don't get the whole content. What can I do to get this?

The content is loaded by JavaScript, so you can't fetch it with a plain HTTP request.
There are two options in this kind of situation.
1) An API supplies the original data and JavaScript loads and formats it in the front end. If you want to parse the JavaScript-rendered content, try WWW::Mechanize::Firefox.
2) Find out where the data is loaded from and request that directly. For your IP, the following link returns the corresponding data as JSON, so you can parse it with the JSON module, which is much simpler than using regexes.
https://www.talosintelligence.com/sb_api/query_lookup?query=%2Fapi%2Fv2%2Frelated_ips%2Fip%2F&query_entry=27.34.246.62
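For option 2, decoding the response could look roughly like the sketch below. It uses JSON::PP, which ships with Perl since 5.14 and provides the same decode_json function as the JSON module. The JSON text and its field names (ip, reputation, related) are made up for illustration; the structure the endpoint really returns has to be checked against its actual output.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical response body. The real field names are an assumption
# and must be checked against what the endpoint actually returns.
my $body = '{"ip":"27.34.246.62","reputation":"neutral",'
         . '"related":["27.34.246.1","27.34.246.2"]}';

# decode_json turns JSON text into nested Perl hashes and arrays.
my $data = decode_json($body);

print "IP: $data->{ip}\n";
print "Reputation: $data->{reputation}\n";
print "Related: $_\n" for @{ $data->{related} };
```

In a real script, $body would be the string returned by $mech->content() after fetching the API URL.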

Perl - Mechanize object size too big

I am trying to get an XML file from a database using WWW::Mechanize. I know that the file is quite big (bigger than my memory), and it crashes every time, whether I try to view it in the browser or store it in a file using get(). I am planning to use XML::Twig in the future, but I cannot even store the result in a file.
Does anyone know how to split the Mechanize object into small chunks, fetch them one after another, and append them to a file without running out of memory?
Here is the query API: ArrayExpress Programmatic Access.
Thank you.
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $base = 'http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments';
#Parameters
my $query ='?species="homo sapiens"' ;
my $url = $base . $query;
# Create a new mechanize object
my $mech = WWW::Mechanize->new(stack_depth=>0);
# Associate the mechanize object with a URL
$mech->get($url);
#store xml content
my $content = $mech->content;
#open output file for writing
unlink("ArrayExpress_Human_Final.txt");
open( my $fh, '>>:encoding(UTF-8)', 'ArrayExpress_Human_Final.txt' ) || die "Can't open file: $!\n";
print $fh $content;
close $fh;

Sounds like what you want to do is save the file directly to disk, rather than loading it into memory.
From the Mech FAQ question "How do I save an image? How do I save a large tarball?"
You can also save any content directly to disk using the :content_file flag to get(), which is part of LWP::UserAgent.
$mech->get( 'http://www.cpan.org/src/stable.tar.gz',
':content_file' => 'stable.tar.gz' );
Also note that if all you're doing is downloading the file, WWW::Mechanize may be more than you need; it may make more sense to use the underlying LWP::UserAgent directly.
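A minimal sketch of the LWP::UserAgent-only approach, assuming all you need is the file on disk. The helper name download_to_file and the output file name are made up for illustration; get() and the :content_file option are the same ones quoted from the FAQ above.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Stream the body of $url straight to $path instead of holding it in memory.
# Returns the HTTP::Response so the caller can inspect status and headers.
sub download_to_file {
    my ($url, $path) = @_;
    my $ua = LWP::UserAgent->new;
    my $response = $ua->get($url, ':content_file' => $path);
    die "Download of $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response;
}

# Example call (commented out so the sketch does not hit the network):
# download_to_file('http://www.cpan.org/src/stable.tar.gz', 'stable.tar.gz');
```

Because the body goes to disk as it arrives, memory use should stay small regardless of the file size. A later XML::Twig pass can then read the saved file.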

Getstore to Buffer, not using temporary files

I started Perl recently and have cobbled together quite a few things to get what I want.
My script fetches the content of a web page and writes it to a file.
Then I open a filehandle on that file, report.html, and parse it.
I write every line I encounter to a new file, except lines containing a specific color.
It works, but I'd like to try another way that doesn't require creating a temporary "report.html" file.
Furthermore, I'd like to write my result directly to a file without using a shell redirection '>'. That would mean my script has to be called by another .sh script, and I don't want that.
use strict;
use warnings;
use LWP::Simple;
my $report = "report.html";
# getstore returns the HTTP status code, so test it with is_success
is_success(getstore('http://test/report.php', $report))
    or die "Unable to get page\n";
open my $fh2, '<', $report or die "could not open report file: $!\n";
while (<$fh2>)
{
print if (!(/<td style="background-color:#71B53A;"/ .. //));
}
close($fh2);
Thanks for your help

If you have the HTML content in a variable, you can call open on a reference to that variable, like this:
my $var = "your html content\ncomes here\nstored into this variable";
open my $fh, '<', \$var or die $!;
# .. then read from $fh as you would from any file
To get the content into a variable, try the get function in the LWP::Simple module ;)
For your second question, open the output file for writing, e.g. open my $out, '>', $filepath, and print to that handle. See perldoc -f open for more info.
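Putting both parts together might look like the sketch below: read from a filehandle opened on the string, and write the kept lines to an output file. The helper name filter_to_file and the output file name are made up for illustration, and it simply drops lines containing the marker rather than reproducing the range operator from the question.

```perl
use strict;
use warnings;

# Copy every line of $html to $out_path except lines containing $skip.
# Returns the number of lines written.
sub filter_to_file {
    my ($html, $skip, $out_path) = @_;
    open my $in,  '<', \$html    or die "Cannot read string: $!\n";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!\n";
    my $written = 0;
    while (my $line = <$in>) {
        next if index($line, $skip) >= 0;   # drop lines with the marker
        print {$out} $line;
        $written++;
    }
    close $in;
    close $out or die "Cannot close $out_path: $!\n";
    return $written;
}

# Example call (commented out; needs LWP::Simple and network access):
# use LWP::Simple;
# my $html = get('http://test/report.php') // die "Unable to get page\n";
# filter_to_file($html, 'background-color:#71B53A;', 'filtered.html');
```

No temporary report.html is created, and no shell redirection is needed because the script opens its own output file.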

Perl script to automate a website for bioinformatics

I would like to automate this website with a Perl script
http://bioinfo.uni-plovdiv.bg/microinspector/
This is what I have so far. I am not sure how to get to the output page after this; I know it has something to do with POST, redirect_ok and response(), but I am not sure how. I read through the documentation but am confused about some things. Thanks.
use strict;
use warnings;
use WWW::Mechanize;
# create object for browser
my $browser = WWW::Mechanize->new();
my ($sequence, $results);
open (DRG, "<microRNA_target_cspg_drg_output.fa") || die "cannot open microRNA_target_cspg_drg_output.fa";
while (<DRG>) {
chomp;
$sequence=$_;
last; #for testing purposes
}
close (DRG);
$browser->get("http://bioinfo.uni-plovdiv.bg/microinspector/");
$browser->form_number(1);
$browser->field("target_sequence", $sequence);
$browser->field("Choose an organism : ", "Mus musculus");
$browser->click_button( number => 1);

You should start with WWW::Mechanize. Its documentation page provides examples of submitting forms and anything else you will need.
EDIT: as a reply to your update, if you want to get the content of the page, use the content method, like in this example:
my $content = $browser->content();
See the WWW::Mechanize documentation for more info.
