Removing top-directory-only URLs from a list of URLs? - perl

I have a question that I'm having trouble researching, as I don't know how to ask it correctly on a search engine.
I have a list of URLs. I would like to have some automated way (Perl for preference) to go through the list and remove all URLs that are top directory only.
So for example I might have this list:
http://www.example.com/hello.html
http://www.foo.com/this/thingrighthere.html
In this case I would want to remove the example.com URL from my list, since it is either top-directory only or references a file in the top directory.
I'm trying to figure out how to do that. My first thought was, count forward slashes and if there's more than two, eliminate the URL from the list. But then you have trailing forward slashes, so that wouldn't work.
Any ideas or thoughts would be much appreciated.

Something like this:
use URI::Split qw( uri_split );
my $url = "http://www.foo.com/this/thingrighthere.html";
my ($scheme, $auth, $path, $query, $frag) = uri_split( $url );
if ( ($path =~ tr/\///) > 1 ) {
    print "I care about this $url";
}
http://metacpan.org/pod/URI::Split
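To apply that check to the whole list, here is a rough sketch (it assumes the URLs are in an array called @urls and that "top directory only" means a path containing at most one slash):
use URI::Split qw( uri_split );
# @urls holds the list of URLs to filter (assumed name)
my @keep = grep {
    my $path = ( uri_split($_) )[2];  # uri_split returns (scheme, auth, path, query, frag)
    ($path =~ tr/\///) > 1;           # keep only URLs whose path goes below the top directory
} @urls;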

You could do this with regexes, but it's much less work to let the URI library do it for you. You won't get caught out by funny schemes, escapes, and extra stuff before and after the path (query, anchor, authorization...). There's some trickiness around how paths are represented by path_segments(). See the comments below and the URI docs for details.
I have assumed that http://www.example.com/foo/ is considered a top directory. Adjust as necessary, but it's something you have to think about.
#!/usr/bin/env perl
use URI;
use File::Spec;
use strict;
use warnings;
use Test::More 'no_plan';
sub is_top_level_uri {
    my $uri = shift;

    # turn it into a URI object if it isn't already
    $uri = URI->new($uri) unless eval { $uri->isa("URI") };

    # normalize it
    $uri = $uri->canonical;

    # split the path part into pieces
    my @path_segments = $uri->path_segments;

    # For an absolute path, which most are, the absoluteness will be
    # represented by an empty string.  Also /foo/ will come out as two elements.
    # Strip that all out, it gets in our way for this purpose.
    @path_segments = grep { $_ ne '' } @path_segments;

    return @path_segments <= 1;
}
my @filtered_uris = (
    "http://www.example.com/hello.html",
    "http://www.example.com/",
    "http://www.example.com",
    "https://www.example.com/",
    "https://www.example.com/foo/#extra",
    "ftp://www.example.com/foo",
    "ftp://www.example.com/foo/",
    "https://www.example.com/foo/#extra",
    "https://www.example.com/foo/?extra",
    "http://www.example.com/hello.html#extra",
    "http://www.example.com/hello.html?extra",
    "file:///foo",
    "file:///foo/",
    "file:///foo.txt",
);
my @unfiltered_uris = (
    "http://www.foo.com/this/thingrighthere.html",
    "https://www.example.com/foo/bar",
    "ftp://www.example.com/foo/bar/",
    "file:///foo/bar",
    "file:///foo/bar.txt",
);
for my $uri (@filtered_uris) {
    ok is_top_level_uri($uri), $uri;
}

for my $uri (@unfiltered_uris) {
    ok !is_top_level_uri($uri), $uri;
}
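To actually prune the original list, as the question asks, something like this should work on top of the function above (a sketch; it assumes the URLs live in an array called @urls):
my @urls_with_deeper_paths = grep { !is_top_level_uri($_) } @urls;  # @urls is your original list (assumed)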

Use the URI module from CPAN. http://search.cpan.org/dist/URI
This is a solved problem. People have already written, tested and debugged code that handles this. Whenever you have a programming problem that others have probably had to deal with, look for existing code that does it for you.

Related

Method to parse request_uri header from decoded JSON in Perl

OK, so here is what we are doing. We are viewing a JSON request/response string.
Code snippet (assuming the relevant modules have been used):
if( open( my $json_file, $filename ))
{
    my $json = JSON->new;
    my $data = $json->decode(<$json_file>);
    close( $json_file );
    $request_uri = $data->{'input'}{'Headers'}{'REQUEST_URI'};
}
So $request_uri looks something like
/user/12345?param1=4&param2=9956
Whilst I could use a regex or whatever to extract the data out of there, I am sure this is a common situation and there should be a method to parse this request URI into its parts and then extract them. I do not see this in the REST manual, which seems to be more about constructing requests.
Use the URI module.
my $request_uri = URI->new( $data->{'input'}{'Headers'}{'REQUEST_URI'} );
my $path = $request_uri->path;
my $query = $request_uri->query;
# etc
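With the request URI from the question, that gives something like this (a quick sketch):
use URI;

my $request_uri = URI->new('/user/12345?param1=4&param2=9956');
my %query       = $request_uri->query_form;

print $request_uri->path, "\n";   # /user/12345
print "$query{param1}\n";         # 4
print "$query{param2}\n";         # 9956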

Perl mechanize Find all links array loop issue

I am currently attempting to create a Perl webspider using WWW::Mechanize.
What I am trying to do is create a webspider that will crawl the whole site of the URL (entered by the user) and extract all of the links from every page on the site.
But I have a problem with how to spider the whole site to get every link without duplicates.
What I have done so far (the part I'm having trouble with, anyway):
foreach (@nonduplicates) { # array contains URLs like www.tree.com/contact-us, www.tree.com/varieties...
    $mech->get($_);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/); # find all links on this page that start with http://www.tree.com
    # NOW THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT CAN'T GET IT WORKING
    # foreach (@list) {
    #     if $_ is already in @nonduplicates
    #         then do nothing because that link has already been found
    #     } else {
    #         append the link to the end of @nonduplicates so that if it has not been crawled for links already, it will be
How would I be able to do the above?
I am doing this to try and spider the whole site to get a comprehensive list of every URL on the site, without duplicates.
If you think this is not the best/easiest method of achieving the same result I'm open to ideas.
Your help is much appreciated, thanks.
Create a hash to track which links you've seen before and put any unseen ones onto @nonduplicates for processing:
$| = 1;

my $scanned = 0;

my @nonduplicates = ( $urlToSpider );               # Add the first link to the queue.
my %link_tracker  = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.

while (my $queued_link = pop @nonduplicates) {
    $mech->get($queued_link);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

    for my $new_link (@list) {
        # Add the link to the queue unless we already encountered it.
        # Increment so we don't add it again.
        push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
    }

    printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
}
use Data::Dumper;
print Dumper(\%link_tracker);
use List::MoreUtils qw/uniq/;
...
my @list = $mech->find_all_links(...);
my @unique_urls = uniq( map { $_->url } @list );
Now @unique_urls contains the unique URLs from @list.

URI::URL Not Getting Parameter

# The url I'm on: https://development.cherrylanekeepsakes.com/cgi-bin/employees.cgi?action=edit_timeclock_dashboard&id=80&start_date=2012-09-01&end_date=2012-09-16&print_view=1
use URI::URL;
use Data::Dumper;
my $url = URI::URL->new( '' . $cgi->new->url(-path_info => 1, -query => 1) );
warn Dumper($url->params('print_view'));
It gives me nothing. What am I doing wrong? This seems like a pretty simple task.
Is there a reason you're using URI::URL instead of URI? It's an obsolete module that only exists for backwards compatibility. It's not even documented, so I can't confirm that params is supposed to do what you think it does.
What follows is a solution using the module that replaced URI::URL. It's even part of the same distribution.
use URI qw( );
my $url = URI->new('https://...');
my %query_form = $url->query_form();
say $query_form{print_view};
Or better yet,
use URI qw( );
use URI::QueryParam qw( );
my $url = URI->new('https://...');
say $url->query_param('print_view');
Note: To assign one of the values of query_param to a scalar, you need to use parens as follows:
my ($print_view) = $url->query_param('print_view');
As per @ikegami's recommendation, I am now using URI instead of the obsolete URI::URL.
# https://development.cherrylanekeepsakes.com/cgi-bin/employees.cgi?action=edit_timeclock_dashboard&id=80&start_date=2012-09-01&end_date=2012-09-16&print_view=1
use URI;
use URI::QueryParam;
my $url = URI->new('' . $cgi->new->url(-path_info => 1, -query => 1));
warn $url->query_param('print_view'); # prints 1 as expected
The reason your code displays nothing is that your URL has no param fields - only a set of query fields.
A URL looks roughly like
scheme://host:port/path1/path2;param1;param2?query1=A&query2=B#fragment
and there are no semicolons in your URL
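For what it's worth, URI exposes each of those components through accessors; a small sketch with a made-up URL:
use URI;

my $u = URI->new('https://example.com:8443/path1/path2?query1=A&query2=B#fragment');

print $u->scheme, "\n";    # https
print $u->host, "\n";      # example.com
print $u->port, "\n";      # 8443
print $u->path, "\n";      # /path1/path2
print $u->query, "\n";     # query1=A&query2=B
print $u->fragment, "\n";  # fragment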

Perl -- 'Not a HASH reference' error when using JSON::RPC::Client

I'm a newbie in Perl.
I have a JSON-RPC server running at http://localhost:19000 and I need to call checkEmail() method.
use JSON::RPC::Client;
my $client = new JSON::RPC::Client;
my $url = 'http://localhost:19000';
my $callobj = {
    method => 'checkEmail',
    params => [ 'rprikhodchenko@gmail.com' ],
};

my $res = $client->call($url, $callobj);

if ($res) {
    if ($res->is_error) {
        print "Error : ", $res->error_message;
    }
    else {
        print $res->result;
    }
}
else {
    print $client->status_line;
}
When I try to launch it, it tells me the following:
perl ./check_ac.pl
Not a HASH reference at /usr/local/share/perl/5.10.1/JSON/RPC/Client.pm line 193.
UPD:
Full stack-trace:
perl -MCarp::Always ./check_ac.pl
Not a HASH reference at /usr/local/share/perl/5.10.1/JSON/RPC/Client.pm line 193
JSON::RPC::ReturnObject::new('JSON::RPC::ReturnObject', 'HTTP::Response=HASH(0x9938d48)', 'JSON=SCALAR(0x96f1518)') called at /usr/local/share/perl/5.10.1/JSON/RPC/Client.pm line 118
JSON::RPC::Client::call('JSON::RPC::Client=HASH(0x944a818)', 'http://localhost:19000', 'HASH(0x96f1578)') called at ./check_ac.pl line 11
This error means that your JSON-RPC server is not actually one, inasmuch as it does not satisfy requirement 7.3. The error is triggered when JSON::RPC::Client assumes the document returned by the JSON-RPC service is well-formed (i.e., a JSON Object), and this assumption turns out to have been in error. A bug report to the author of JSON::RPC::Client would be an appropriate way to request better error messaging.
I would attack this sort of problem by finding out what the server was returning that was causing JSON::RPC::Client to choke. Unfortunately, JRC fails to provide adequate hookpoints for finding this out, so you'll have to be a little bit tricky.
I don't like editing external libraries, so I recommend an extend-and-override approach to instrumenting traffic with the JSON-RPC server. Something like this (in check_ac.pl):
use Data::Dumper qw();

package JSON::RPC::InstrumentedClient;
use base 'JSON::RPC::Client';

# This would be better done with Module::Install, but I'm limiting dependencies today.
sub _get {
    my ($self, @args) = @_;
    return $self->_dump_response($self->SUPER::_get(@args));
}

sub _post {
    my ($self, @args) = @_;
    return $self->_dump_response($self->SUPER::_post(@args));
}

sub _dump_response {
    my ($self, $response) = @_;
    warn Data::Dumper->Dump([$response->decoded_content], [qw(content)]);
    return $response;
}

package main;

my $client = JSON::RPC::InstrumentedClient->new();
my $url = 'http://localhost:19000';

... # rest of check_ac.pl
This wraps the calls to _get and _post that JSON::RPC::Client makes internally in such a way as to let you examine what the web server actually said in response to the request we made. The above code dumps the text content of the page; this might not be the right thing in your case and will blow up if an error is encountered. It's a debugging aid only, to help you figure out from the client code side what is wrong with the server.
That's enough caveats for now, I think. Good luck.
It seems to be a bug in the new() method of JSON::RPC::ReturnObject.
sub new {
    my ($class, $obj, $json) = @_;
    my $content = ( $json || JSON->new->utf8 )->decode( $obj->content );

    #...
    # line 193
    $content->{error} ? $self->is_success(0) : $self->is_success(1);
    #...
}
$content's value will be something returned from a JSON::decode() call. But looking at the documentation, it seems that JSON->decode() returns a scalar which could be a number, a string, an array reference, or a hash reference.
Unfortunately, JSON::RPC::ReturnObject->new() doesn't check what sort of thing JSON->decode() returned before trying to access it as a hashref. Given your error, I'm going to go ahead and assume what it got in your case was not one. :-)
I don't know if there's a way to force a fix from your code. I'd recommend contacting the author and letting him know about the issue, and/or filing a bug.
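For illustration only, the kind of guard that constructor is missing might look roughly like this (a hypothetical patch to the new() shown above, not the module's actual code):
my $content = ( $json || JSON->new->utf8 )->decode( $obj->content );

# Hypothetical: a JSON-RPC response must be a JSON object, so refuse anything
# else instead of blindly dereferencing it as a hashref.
unless ( ref $content eq 'HASH' ) {
    die "JSON-RPC response is not a JSON object: " . $obj->content;
}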

How can I validate a website URL in Perl?

I need a regular expression or module for validating the website URL using Perl.
Regexp::Common::URI::http
I don't use regular expressions. I try to create a URI object and see what happens. If it works, I have a URI object that I can query to get the scheme (the other things get turned into "schemeless" URIs).
use URI;
while( <DATA> )
{
    chomp;
    my $uri = URI->new( $_, 'http' );
    if( $uri->scheme ) { print "$uri is a URL\n"; }
    else               { print "$uri is not a URL\n"; }
}
__END__
foo.html
http://www.example.com/index.html
abc
www.example.com
If I'm looking for a specific sort of URI, I can query the object to see if it satisfies whatever I need, such as a particular domain name. If I'm doing something with URLs, I'm probably going to make an object anyway, so I might as well start with it.
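For example, a rough sketch of that kind of check (the host name here is just an illustration):
use URI;

my $uri = URI->new( "http://www.example.com/index.html", 'http' );

# www.example.com is only an example host; substitute whatever domain you care about
if ( defined $uri->scheme
     and $uri->scheme =~ /\Ahttps?\z/
     and $uri->can('host')
     and $uri->host eq 'www.example.com' ) {
    print "$uri is a web URL on www.example.com\n";
}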
Since you are talking about "a website URL", I guess you are interested in HTTP and HTTPS URLs only.
For that, instead of using a regex, you can use Perl's Data::Validate::URI module.
For example, to validate HTTP and HTTPS URLs:
use Data::Validate::URI;
my $url = "http://google.com";
my $uriValidator = new Data::Validate::URI();
print "Valid web URL!" if $uriValidator->is_web_uri($url)
And, to validate HTTP URL only:
print "Valid HTTP URL!" if $uriValidator->is_http_uri($url)
Finally, to validate any well-formatted URI:
print "Valid URI!" if $uriValidator->is_uri($url)
If instead, for any reason, you actually want a regex, then you can use something like the following to check that a URL starts with an HTTP/HTTPS/FTP/SFTP scheme:
print "Valid URL!\n" if $url =~ /^(?:(?:https?|s?ftp))/i;
use Regexp::Common qw /URI/;
while (<>) {
    /($RE{URI}{HTTP})/ and print "$1 is an HTTP URI.\n";
}
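By default $RE{URI}{HTTP} matches only the http scheme; the Regexp::Common::URI::http docs describe a -scheme option that should let you accept https as well, something like this (a sketch, adjust as needed):
use Regexp::Common qw /URI/;

while (<>) {
    # -scheme widens the accepted scheme pattern to http or https
    /($RE{URI}{HTTP}{-scheme => 'https?'})/ and print "$1 is an HTTP(S) URI.\n";
}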