I need a regular expression or module for validating website URLs in Perl.
Regexp::Common::URI::http
I don't use regular expressions. I try to create a URI object and see what happens. If it works, I have a URI object that I can query to get the scheme (the other things get turned into "schemeless" URIs).
use URI;

while( <DATA> ) {
    chomp;
    my $uri = URI->new( $_, 'http' );

    if( $uri->scheme ) { print "$uri is a URL\n" }
    else               { print "$uri is not a URL\n" }
}
__END__
foo.html
http://www.example.com/index.html
abc
www.example.com
If I'm looking for a specific sort of URI, I can query the object to see if it satisfies whatever I need, such as a particular domain name. If I'm doing something with URLs, I'm probably going to make an object anyway, so I might as well start with it.
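For instance, here's a minimal sketch of querying the object once it parses; the host check and the example.com value are purely illustrative assumptions:

use URI;

my $uri = URI->new( 'http://www.example.com/index.html', 'http' );

# only URIs that parsed with a scheme have a host worth querying
if( $uri->scheme ) {
    print "host is ", $uri->host, "\n";
    print "that's the domain I wanted\n" if $uri->host =~ /(^|\.)example\.com$/;
}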
Since you are talking about "a website URL", I guess you are interested in HTTP and HTTPS URLs only.
For that, instead of using a regex, you can use Perl's Data::Validate::URI module.
For example, to validate HTTP and HTTPS URLs:
use Data::Validate::URI;

my $url          = "http://google.com";
my $uriValidator = Data::Validate::URI->new;

print "Valid web URL!" if $uriValidator->is_web_uri($url);
And, to validate HTTP URL only:
print "Valid HTTP URL!" if $uriValidator->is_http_uri($url)
Finally, to validate any well-formatted URI:
print "Valid URI!" if $uriValidator->is_uri($url)
If instead, for any reason, you actually want a regex, then you can use something like the following to check for HTTP/HTTPS/FTP/SFTP URLs. Note that this only verifies the scheme prefix, not the structure of the rest of the URL:
print "Valid URL!\n" if $url =~ m{^(?:https?|s?ftp)://}i;
use Regexp::Common qw /URI/;
while (<>) {
    /($RE{URI}{HTTP})/ and print "$1 is an HTTP URI.\n";
}
I was assigned a bug where https isn't allowed to be entered into one of our forms. After searching the code I noticed that we are using Regexp::Common qw /URI/.
I have tried
if ($params{URL} =~ /$RE{URI}{HTTP}{-keep}{-scheme}/)
{
    $form{URL} = $1;
}
else
{
    $error .= '<li>Website Address is invalid. The URL must be in this form: <b>http://example.com</b></li>';
}
and that allows http and https but only saves ://www.google.com into the database
if ($params{URL} =~ /$RE{URI}{HTTP}{-keep}/)
{
    $form{URL} = $1;
}
else
{
    $error .= '<li>Website Address is invalid. The URL must be in this form: <b>http://example.com</b></li>';
}
allows only http but saves the entire url into the database
if ($params{URL} =~ /$RE{URI}{HTTP}{-scheme}/)
{
    $form{URL} = $1;
}
else
{
    $error .= '<li>Website Address is invalid. The URL must be in this form: <b>http://example.com</b></li>';
}
allows http and https but doesn't save ANYTHING to the database
What I would like is to have both https and http valid AND have the complete URL saved in the database.
The -scheme flag for Regexp::Common::URI::http takes an argument, which is a regex that matches the allowed schemes. It defaults to matching just http, and leaving out the argument seems to mean the scheme is not included in the match at all. So to match both http and https you can pass it a regex of https?:
m/$RE{URI}{HTTP}{-scheme => qr<https?>}{-keep}/
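Putting that together with the validation code from the question (a sketch reusing the same %params, %form and $error variables):

use Regexp::Common qw /URI/;

if ($params{URL} =~ /$RE{URI}{HTTP}{-scheme => qr<https?>}{-keep}/)
{
    # with -keep, $1 holds the entire matched URI, scheme included
    $form{URL} = $1;
}
else
{
    $error .= '<li>Website Address is invalid. The URL must be in this form: <b>http://example.com</b></li>';
}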
I'm talking to what seems to be a broken HTTP daemon and I need to make a GET request that includes a pipe | character in the URL.
LWP::UserAgent escapes the pipe character before the request is sent.
For example, a URL passed in as:
https://hostname/url/doSomethingScript?ss=1234&activities=Lec1|01
is passed to the HTTP daemon as
https://hostname/url/doSomethingScript?ss=1234&activities=Lec1%7C01
This is correct, but doesn't work with this broken server.
How can I override or bypass the encoding that LWP and its friends are doing?
Note
I've seen and tried other answers here on StackOverflow addressing similar problems. The difference here seems to be that those answers are dealing with POST requests where the formfield parts of the URL can be passed as an array of key/value pairs or as a 'Content' => $content parameter. Those approaches aren't working for me with an LWP request.
I've also tried constructing an HTTP::Request object and passing that to LWP, and passing the full URL direct to LWP->get(). No dice with either approach.
In response to Borodin's request, this is a sanitised version of the code I'm using
#!/usr/local/bin/perl -w

use HTTP::Cookies;
use LWP;

my $debug = 1;

# make a 'browser' object
my $browser = LWP::UserAgent->new();

# cookie handling...
$browser->cookie_jar(HTTP::Cookies->new(
    'file'           => '.cookie_jar.txt',
    'autosave'       => 1,
    'ignore_discard' => 1,
));

# proxy, so we can watch...
if ($debug == 1) {
    $browser->proxy(['http', 'ftp', 'https'], 'http://localhost:8080/');
}

# user agent string (pretend to be Firefox)
my $agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.7.12) Gecko/20050919 Firefox/1.0.7';

# set the user agent
$browser->agent($agent);

# do some things here to log in to the web site, accept session cookies, etc.
# These are basic POSTs of filled forms. Works fine.
# [...]

my $baseURL = 'https://hostname/url/doSomethingScript?ss=1234&activities=VALUEA|VALUEB';
my @values  = ('Lec1', '01', 'Lec1', '02');

while (1) {
    if (scalar(@values) < 2) { last; }
    my $vala = shift(@values);
    my $valb = shift(@values);
    my $url  = $baseURL;
    $url =~ s/VALUEA/$vala/g;
    $url =~ s/VALUEB/$valb/g;

    # simplified. Would usually check request for '200' response, etc...
    my $content = $browser->get($url)->content();

    # do something here with the content
    # [...]

    # fails because the '|' character in the url is escaped after it's handed
    # to LWP
}

# end
As @bchgys mentions in his comment, this is (almost) answered in the linked thread. Here are two solutions:
The first and arguably cleanest one is to locally override the escape map in URI::Escape to not modify the pipe character:
use URI;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new();
my $res;

{
    # Violate RFC 2396 by forcing broken query string
    # local makes the override take effect only in the current code block
    local $URI::Escape::escapes{'|'} = '|';

    $res = $ua->get('http://server/script?q=a|b');
}

print $res->request->as_string, "\n";
Alternatively, you can simply undo the escaping by modifying the URI directly in the request after the request has been created:
use HTTP::Request;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => 'http://server/script?q=a|b');

# Violate RFC 2396 by forcing broken query string;
# a URI object is a blessed scalar reference, so we can edit the string it refers to
${$req->uri} =~ s/%7C/|/;

my $res = $ua->request($req);
print $res->request->as_string, "\n";
The first solution is almost certainly preferable because it at least relies on the %URI::Escape::escapes package variable, which is exported and documented, so that's probably as close as you're going to get to doing this with a supported API.
Note that in either case you are in violation of RFC 2396 but as mentioned you may have no choice when talking to a broken server that you have no control over.
I want to implement reCAPTCHA on my website.
One option is the Google API, but for that I need to sign up with a domain name to get an API key.
Is there any other way to do it?
You don't necessarily need a domain name to sign up, per se.
They have a concept of a "global key" where one single domain key would be used on several domains. When signing up, select the "Enable this key on all domains (global key)" option, and use a unique identifier (domainkey.abhilasha.com) and this will be fine, you can use the key from any domain in the end.
One way: add this code to your Perl file that is called by an HTML form. Simplified, of course:
use strict;
use warnings;
use CGI qw(param remote_host);
use LWP::Simple;

my %FIELD;
my @field_names = qw(name branch email g-recaptcha-response);

foreach my $field_name (@field_names)
{
    if (defined param($field_name))
    {
        $FIELD{$field_name} = param($field_name);
    }
}

my $captcha   = $FIELD{'g-recaptcha-response'};
my $secretKey = "put your key here";
my $ip        = remote_host();

# Remove the leading # to test that the submitted variables are present
#print "secret= $secretKey";
#print " and response= $captcha";
#print " and remoteip= $ip";

my $URL = "https://www.google.com/recaptcha/api/siteverify?secret=" . $secretKey . "&response=" . $captcha . "&remoteip=" . $ip;
my $contents = get $URL or die;

# $contents takes the form of: "success": true, "challenge_ts": "2016-11-21T16:02:41Z", "hostname": "www.mydomain.org.uk"
# Split contents variable by comma:
my ($success, $challenge_time, $hostname) = split /,/, $contents;

# Split success variable by colon:
my ($success_title, $success_value) = split /:/, $success;

# strip whitespace:
$success_value =~ s/^\s+//;

if ($success_value eq "true")
{
    print "it worked";
}
else
{
    print "it did not";
}
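Note that the comma/colon splitting above is fragile, because the response body is really JSON. A more robust sketch (assuming the core JSON::PP module) would be:

use JSON::PP qw(decode_json);

# decode the JSON body returned by the siteverify endpoint
my $result = decode_json($contents);

if ($result->{success})
{
    print "it worked";
}
else
{
    print "it did not";
}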
If you are just trying to block spam, I prefer the honeypot captcha approach: http://haacked.com/archive/2007/09/10/honeypot-captcha.aspx
Put an input field on your form that should be left blank, then hide it with CSS (preferably in an external CSS file). A robot will find it and put spam in it, but humans won't see it.
In your form validation script, check the length of the field, if it contains any characters, do not process the form submission.
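A minimal sketch of that server-side check (the field name website and the CGI-based handler are just assumptions for illustration):

use CGI qw(param);

# the form contains <input name="website"> hidden via external CSS;
# real users leave it empty, spam bots tend to fill it in
if (defined param('website') && length param('website'))
{
    # treat as spam: refuse to process the form submission
    print "Content-type: text/plain\n\n";
    print "Submission rejected.\n";
    exit;
}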
I have a question that I'm having trouble researching, as I don't know how to ask it correctly on a search engine.
I have a list of URLs. I would like to have some automated way (Perl for preference) to go through the list and remove all URLs that are top directory only.
So for example I might have this list:
http://www.example.com/hello.html
http://www.foo.com/this/thingrighthere.html
In this case I would want to remove the example.com entry from my list, as it is either top-directory only or references a file in the top directory.
I'm trying to figure out how to do that. My first thought was, count forward slashes and if there's more than two, eliminate the URL from the list. But then you have trailing forward slashes, so that wouldn't work.
Any ideas or thoughts would be much appreciated.
Something like this:
use URI::Split qw( uri_split );

my $url = "http://www.foo.com/this/thingrighthere.html";
my ($scheme, $auth, $path, $query, $frag) = uri_split( $url );

if (($path =~ tr/\///) > 1) {
    print "I care about this $url";
}
http://metacpan.org/pod/URI::Split
You could do this with regexes, but it's much less work to let the URI library do it for you. You won't get caught out by funny schemes, escapes, and extra stuff before and after the path (query, anchor, authorization...). There's some trickiness around how paths are represented by path_segments(). See the comments below and the URI docs for details.
I have assumed that http://www.example.com/foo/ is considered a top directory. Adjust as necessary, but it's something you have to think about.
#!/usr/bin/env perl

use URI;
use File::Spec;
use strict;
use warnings;

use Test::More 'no_plan';

sub is_top_level_uri {
    my $uri = shift;

    # turn it into a URI object if it isn't already
    $uri = URI->new($uri) unless eval { $uri->isa("URI") };

    # normalize it
    $uri = $uri->canonical;

    # split the path part into pieces
    my @path_segments = $uri->path_segments;

    # for an absolute path, which most are, the absoluteness will be
    # represented by an empty string. Also /foo/ will come out as two elements.
    # Strip that all out, it gets in our way for this purpose.
    @path_segments = grep { $_ ne '' } @path_segments;

    return @path_segments <= 1;
}

my @filtered_uris = (
    "http://www.example.com/hello.html",
    "http://www.example.com/",
    "http://www.example.com",
    "https://www.example.com/",
    "https://www.example.com/foo/#extra",
    "ftp://www.example.com/foo",
    "ftp://www.example.com/foo/",
    "https://www.example.com/foo/#extra",
    "https://www.example.com/foo/?extra",
    "http://www.example.com/hello.html#extra",
    "http://www.example.com/hello.html?extra",
    "file:///foo",
    "file:///foo/",
    "file:///foo.txt",
);

my @unfiltered_uris = (
    "http://www.foo.com/this/thingrighthere.html",
    "https://www.example.com/foo/bar",
    "ftp://www.example.com/foo/bar/",
    "file:///foo/bar",
    "file:///foo/bar.txt",
);

for my $uri (@filtered_uris) {
    ok is_top_level_uri($uri), $uri;
}

for my $uri (@unfiltered_uris) {
    ok !is_top_level_uri($uri), $uri;
}
Use the URI module from CPAN. http://search.cpan.org/dist/URI
This is a solved problem. People have already written, tested, and debugged code that handles it. Whenever you have a programming problem that others have probably had to deal with, look for existing code that does it for you.
I've used Perl a bit for small applications and test code, but I'm new to networking and CGI.
I get how to make the header of a request (using CGI.pm and printing the results of the header() function), but haven't been able to find any info on how to access the headers being sent to my CGI script. Could someone point me in the right direction?
This could be from a request like this:
curl http://127.0.0.1:80/cgi-bin/headers.cgi -H "HeaderAttribute: value"
The CGI module has an http() function you can use for that purpose:
#!/usr/bin/perl --
use strict;
use warnings;
use CGI;
my $q = CGI->new;
my %headers = map { $_ => $q->http($_) } $q->http();
print $q->header('text/plain');
print "Got the following headers:\n";
for my $header ( keys %headers ) {
    print "$header: $headers{$header}\n";
}
Try it out; the above gives me:
$ curl http://localhost/test.cgi -H "HeaderAttribute: value"
Got the following headers:
HTTP_HEADERATTRIBUTE: value
HTTP_ACCEPT: */*
HTTP_HOST: localhost
HTTP_USER_AGENT: curl/7.21.0 (i686-pc-linux-gnu) libcurl/7.21.0 OpenSSL/0.9.8o zlib/1.2.3.4 libidn/1.18
In addition to the CGI.pm http() method, you can get HTTP header information from environment variables.
So if you are using something like CGI::Minimal, which doesn't have the http method, you can do something like:
my $header = 'HTTP_X_REQUESTED_WITH';

if (exists $ENV{$header} && lc $ENV{$header} eq 'xmlhttprequest') {
    _do_some_ajaxian_stuff();
}
They're supplied as environment variables, such as
HTTP_HEADERATTRIBUTE=value
You may have to do something to configure your web server to supply such a variable, though.
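For example, a quick sketch of a CGI script that dumps every header-derived environment variable the server passes in:

#!/usr/bin/perl

use strict;
use warnings;

# under CGI, request headers arrive as HTTP_* entries in %ENV
print "Content-type: text/plain\n\n";

for my $key ( sort grep { /^HTTP_/ } keys %ENV ) {
    print "$key: $ENV{$key}\n";
}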