Mojo::DOM shortcut to get absolute url for a resource? - perl

When parsing a webpage with Mojo::DOM (or any other framework), it's fairly common to be pulling a resource address that could be either relative or absolute. Is there a shortcut method to translate such a resource address to an absolute URL?
The following mojo command pulls all the stylesheets on mojolicio.us:
$ mojo get http://mojolicio.us "link[rel=stylesheet]" attr href
/mojo/prettify/prettify-mojo-light.css
/css/index.css
And the following script does the same, but also uses URI to translate the resource into an absolute URL.
use strict;
use warnings;
use Mojo::UserAgent;
use URI;
my $url = 'http://mojolicio.us';
my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;
for my $csshref ($dom->find('link[rel=stylesheet]')->map(attr => 'href')->each) {
    my $cssurl = URI->new($csshref)->abs($url);
    print "$cssurl\n";
}
Outputs:
http://mojolicio.us/mojo/prettify/prettify-mojo-light.css
http://mojolicio.us/css/index.css
Obviously, a relative URL in this context should be made absolute using the URL from which the DOM was loaded. However, I don't know of a way to get a resource's absolute URL other than coding it myself.
Mojolicious does have the to_abs method of Mojo::URL. However, I don't know whether it integrates with Mojo::DOM in some way, and by itself it would take more code than URI.
My ideal solution would be if something like the following were possible from both a script and command line, but looking for any related insights into using Mojo for parsing:
mojo get http://mojolicio.us "link[rel=stylesheet]" attr href to_abs

I'm not sure why you think it would take more code to use Mojo::URL? In the following example I get the actual request URL from the transaction (there might have been redirects, which I've allowed) which I have called $base.
Then, since $base is an instance of Mojo::URL, I can create a new instance with $base->new. Of course, if that seems too magical, you can replace it with Mojo::URL->new.
use Mojo::Base -strict;
use Mojo::UserAgent;
my $url = 'http://mojolicio.us';
my $ua = Mojo::UserAgent->new->max_redirects(10);
my $tx = $ua->get($url);
my $base = $tx->req->url;
$tx->res
->dom
->find('link[rel=stylesheet]')
->map(sub{$base->new($_->{href})->to_abs($base)})
->each(sub{say});
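For experimenting without a network connection, the same resolution step can be sketched against a hard-coded HTML snippet. The markup and the example.com stylesheet below are made up for illustration; only Mojo::DOM, Mojo::URL and to_abs come from the answer above.

```perl
use Mojo::Base -strict;
use Mojo::DOM;
use Mojo::URL;

# Base URL the (made-up) document is assumed to have been loaded from
my $base = Mojo::URL->new('http://mojolicio.us/');

# Hard-coded markup standing in for a fetched page
my $dom = Mojo::DOM->new(<<'HTML');
<link rel="stylesheet" href="/css/index.css">
<link rel="stylesheet" href="http://example.com/other.css">
HTML

# Resolve every stylesheet href against $base; absolute hrefs pass through unchanged
my @abs = $dom->find('link[rel=stylesheet]')
  ->map(sub { Mojo::URL->new($_->attr('href'))->to_abs($base)->to_string })
  ->each;

say for @abs;
```

With a real page, $base would be the request URL taken from the transaction, as in the answer.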

Related

Mojolicious and directory traversal

I am new to Mojolicious and am trying to build a tiny web service with this framework.
I wrote the code below, which serves a file to remote clients:
use Mojolicious::Lite;
use strict;
use warnings;
app->static->paths->[0]='C:\results';
get '/result' => sub {
my $self = shift;
my $headers = $self->res->headers;
$headers->content_type('text/zip;charset=UTF-8');
$self->render_static('result.zip');
};
app->start;
But it seems that when I fetch the file using the following URL:
http://mydomain:3000/result/./../result
I still get the file.
Is there any option in Mojolicious to prevent such directory traversal?
That is, in the above case I want only
http://mydomain:3000/result
to serve the page. If someone enters this URL:
http://mydomain:3000/result/./../result
the page should not be served.
Is it possible to do this?
/$result^/ is a regular expression, and if you have not defined the scalar variable $result (which it does not appear you have), it resolves to /^/, which matches not just
http://mydomain:3000/result/./../result but also
http://mydomain:3000/john/jacob/jingleheimer/schmidt.
use strict and use warnings, even on tiny webservices.

LWP getstore usage

I'm pretty new to Perl. I just created a simple script that retrieves a file with
getstore($url, $file);
But how do I know whether the task completed correctly, or whether the connection was interrupted in the middle, or authentication failed, or something else went wrong? I searched the web and found a few things, like a list of response codes and some discussion of user agents, which I can't understand at all, especially the operator in $ua->.
What I would like is an explanation of that operator (I don't even know what -> is used for), the meaning of the RC code, and finally, how to use it.
It's a lot of stuff, so I appreciate any answer given, even a partial one. And thanks in advance to whoever is willing to help. =)
The LWP::Simple module is just that: quite simplistic. The documentation states that the getstore function returns the HTTP status code which we can save into a variable. There are also the is_success and is_error functions that tell us whether a certain return value is ok or not.
use LWP::Simple; # exports getstore, is_success and is_error
my $url = "http://www.example.com/";
my $filename = "some-file.html";
my $rc = getstore($url, $filename);
if (is_error($rc)) {
    die "getstore of <$url> failed with $rc";
}
Of course, this doesn't catch errors with the file system.
The die throws a fatal exception that terminates the execution of your script and displays itself on the terminal. If you don't want to abort execution use warn.
The LWP::Simple functions provide high-level controls for common tasks. If you need more control over the requests, you have to manually create an LWP::UserAgent. A user agent (abbreviated ua) is a browser-like object that can make requests to servers. We have very detailed control over these requests, and can even modify the exact header fields.
The -> operator is a general dereference operator, which you'll use a lot when you need complex data structures. It is also used for method calls in object-oriented programming:
$object->method(@args);
would call the method on $object with the @args. We can also call methods on class names. To create a new object, usually the new method is used on the class name:
my $object = The::Class->new();
Methods are just like functions, except that you leave it to the class of the object to figure out which function exactly will be called.
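Both uses of -> can be seen side by side in a small sketch. The Counter class and the sample data are hypothetical and exist only for illustration:

```perl
use strict;
use warnings;

# A tiny hypothetical class, just to have something to call methods on
package Counter;

sub new {
    my ($class, %args) = @_;
    return bless { count => $args{start} // 0 }, $class;
}

sub increment {
    my ($self) = @_;
    return ++$self->{count};   # -> dereferences the hash behind the object
}

package main;

# Dereferencing: walk from a hash reference into an array reference
my $data   = { names => ['alice', 'bob'] };
my $second = $data->{names}->[1];

# Method calls: first on the class name, then on the resulting object
my $counter = Counter->new(start => 5);
$counter->increment;
my $value = $counter->increment;

print "$second $value\n";   # prints "bob 7"
```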
The normal workflow with LWP::UserAgent looks like this:
use LWP::UserAgent; # load the class
my $ua = LWP::UserAgent->new();
We can also provide named arguments to the new method. Because these UA objects are robots, it is considered good manners to tell everybody who sent this Bot. We can do so with the from field:
my $ua = LWP::UserAgent->new(
    from => 'ss-tangerine@example.com',
);
We could also change the timeout from the default three minutes. These options can also be set after we constructed a new $ua, so we can do
$ua->timeout(30); # half a minute
The $ua has methods for all the HTTP requests like get and post. To duplicate the behaviour of getstore, we first have to get the URL we are interested in:
my $url = "http://www.example.com/";
my $response = $ua->get($url);
The $response is an object too, and we can ask it whether it is_success:
$response->is_success or die $response->status_line;
So if execution flows past this statement, everything went fine. We can now access the content of the response. NB: use the decoded_content method, as it takes care of content encodings and the character set for us:
my $content = $response->decoded_content;
We can now print that to a file:
use autodie; # automatic error handling
open my $fh, ">", "some-file.html";
print {$fh} $content;
(when handling binary files on Windows: binmode $fh after opening the file, or use the ">:raw" open mode)
Done!
To learn about LWP::UserAgent, read the documentation. To learn about objects, read perlootut. You can also visit the perl tag on SO for some book suggestions.
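As a variation on the steps above, LWP::UserAgent can also write the body straight to disk via the ':content_file' option of get, which avoids holding the whole document in memory. This is a minimal sketch; the helper name fetch_to_file and the command-line handling are assumptions made for illustration, not part of LWP:

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: save $url to $file, die with the status line on failure
sub fetch_to_file {
    my ($ua, $url, $file) = @_;

    # ':content_file' makes LWP write the response body directly to $file
    my $response = $ua->get($url, ':content_file' => $file);

    $response->is_success
        or die "GET <$url> failed: " . $response->status_line . "\n";
    return $response;
}

# Only runs when a URL and a file name are given on the command line
if (@ARGV == 2) {
    my $ua = LWP::UserAgent->new(timeout => 30);
    my $response = fetch_to_file($ua, @ARGV);
    print "Saved $ARGV[0] to $ARGV[1] (", $response->code, ")\n";
}
```

Note that the body saved this way is the raw content, so nothing like decoded_content is applied to it.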

Posting or passing perl variables to next web page?

How can I pass the variables from one perl webpage to the next, here is my example:
This is what I want passed from the first page, $data[0] and $data[2]
<a href="Month_entries.pl?month='$data[2]'&user='$data[0]'
style="text-decoration:none"
onclick="return popitup('Month_entries')">$busitotal2</a>
Once the link goes to Month_entries.pl, how do I access these variables in the new page (Month_entries)? What is this process called?
First, you should make sure that you are constructing the URI you actually want.
You probably don't want ' characters in the data.
You should probably be protecting against XSS and broken data with URI::Encode.
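One way to build such a link safely is sketched below. It uses the query_form method of the URI module rather than URI::Encode, and the sample values standing in for $data[0] and $data[2] are made up:

```perl
use strict;
use warnings;
use URI;

# Made-up sample values standing in for the question's @data
my @data = ('a&b', 'unused', 'March 2013');

# query_form escapes each key and value, so no manual quoting is needed
my $link = URI->new('Month_entries.pl');
$link->query_form(month => $data[2], user => $data[0]);

print "$link\n";   # Month_entries.pl?month=March+2013&user=a%26b
```

The resulting string still has to be HTML-escaped when it is written into the href attribute.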
Then it comes down to getting data from the query string.
How you do this depends on how your server and Perl are communicating.
If you are using Plack (which is generally a good idea for modern Perl), then see the code in the synopsis for Plack::Request:
my $app_or_middleware = sub {
my $env = shift;
my $req = Plack::Request->new($env);
my $path_info = $req->path_info;
# Change 'query' to whatever you called your key in the query string
my $query = $req->param('query');
my $res = $req->new_response(200);
$res->finalize;
};
If you are using a framework (such as Web::Simple, Catalyst or Dancer) then it will probably provide its own interface.
If you are using CGI, and using the CGI module, you would:
my $cgi = CGI->new();
my $query = $cgi->param('query');

Get files from a given URL on the basis of pattern passed using Perl on Unix

I have been told that a given URL contains several XML and text files, and I need to download all the XML files whose names start with AAA (that is, AAA*.xml) inside a given directory.
Credentials to access that URL are provided to me.
Please note that the XML files could be gigabytes in size.
I have used the code below to try to achieve this:
use strict;
use warnings;
use LWP;
my $browser = LWP::UserAgent->new;
my $username ='scott';
my $password='tiger';
# Create HTTP request object
my $req = HTTP::Request->new( GET => "https://url.com/");
# Authenticate the user
$req->authorization_basic( $username , $password);
my $res = $browser->request( $req , ':content_file' => '/fold/AAA1.xml');
print $res->status_line, "\n";
It prints 200 OK status but I am not able to get the file. Any suggestions?
If the server doesn't let you retrieve a directory listing (e.g. Apache without "Options +Indexes"), you cannot GET the collection of files in one request.
But once you have the list, you can filter it with a regular expression such as /AAA.*/, and with the LWP::Simple module it is easy to fetch each file.
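The filtering step might look roughly like this, assuming the server returns an HTML index page with one link per file. The helper name xml_links and the sample listing are hypothetical, and HTML::LinkExtor is used here instead of a bare regular expression over the HTML:

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical helper: return absolute URLs of links whose file name matches AAA*.xml
sub xml_links {
    my ($html, $base) = @_;

    # With no callback, the parser collects links; $base makes them absolute
    my $parser = HTML::LinkExtor->new(undef, $base);
    $parser->parse($html);
    $parser->eof;

    my @urls;
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        my $url = "$attr{href}";   # stringify the URI object
        push @urls, $url if $url =~ m{/AAA[^/]*\.xml\z};
    }
    return @urls;
}

# Made-up directory listing, as the server might return it
my $listing = <<'HTML';
<a href="AAA1.xml">AAA1.xml</a>
<a href="BBB1.xml">BBB1.xml</a>
<a href="AAA2.txt">AAA2.txt</a>
<a href="AAA3.xml">AAA3.xml</a>
HTML

print "$_\n" for xml_links($listing, 'https://url.com/');
```

Each URL returned can then be requested with the same credentials and the ':content_file' option already used in the question, so that large files are streamed to disk.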

What do I gain by filtering URLs through Perl's URI module?

Do I gain something when I transform my $url like this: $url = URI->new( $url )?
#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use URI;
use XML::LibXML;
my $url = 'http://stackoverflow.com/';
$url = URI->new( $url );
my $doc = XML::LibXML->load_html( location => $url, recover => 2 );
my @nodes = $doc->getElementsByTagName( 'a' );
say scalar @nodes;
The URI module constructor would clean up the URI for you - for example correctly escape the characters invalid for URI construction (see URI::Escape).
The URI module has several benefits:
It normalizes the URL for you
It can resolve relative URLs
It can detect invalid URLs (although you need to turn off the schemeless bits)
You can easily filter the URLs that you want to process.
The benefit that you get with the little bit of code that you show is minimal, but as you continue to work on the problem, perhaps spidering the site, URI becomes more handy as you select what to do next.
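A few of these benefits can be shown in a short sketch; the example.com URLs are placeholders chosen for illustration:

```perl
use strict;
use warnings;
use URI;

# Normalization: lowercase scheme and host, drop the default port
my $canonical = URI->new('HTTP://Example.COM:80/Path')->canonical;
print "$canonical\n";   # http://example.com/Path

# Resolving a relative URL against the page it was found on
my $absolute = URI->new_abs('../css/index.css', 'http://example.com/a/b/page.html');
print "$absolute\n";    # http://example.com/a/css/index.css

# A URL without a scheme is relative, which is one simple check when filtering
my $relative = URI->new('/css/index.css');
print defined $relative->scheme ? "absolute\n" : "relative\n";   # relative
```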
I'm surprised nobody has mentioned it yet, but $url = URI->new( $url ); doesn't clean up your $url and hand it back to you. It creates a new object of class URI (or, rather, of one of its subclasses), which can then be passed to other code that requires a URI object. That's not particularly important in this case, since XML::LibXML appears to be happy to accept locations as either strings or objects, but some other modules require you to give them a URI object and will reject URLs presented as plain strings.