Perl LWP Find links in page that contain a specific word

Perl LWP Find links in page that contain a specific word - match

I am really stuck.
I am using LWP.
I desire to push specific links from an HTML document into an array.
But:
while ($edocument =~ m/href\s*=\s*"([^"\s]+)"/gi) {
#dostuff
}
Will process all the links.
I just want the links that have the word 'test' in the url.
I have tried all kinds of combinations like. (too many attempts to list)
while ($edocument =~ m/href\s*=\s*"([^"\s*test*]+)"/gi) {
I have been reading and reading and I really need a clue for this embarrassing situation.
Can someone help?
In addition, I only need ONE match of the word test per $edocument as well. Kind of like last I guess in a loop.
Also tried variations of
#links = $edocument =~ m/<a[^>]+href\s*=\s*["']?([^"'> ]+)/ig;
Then ran #links through a unique sub. But still, just need links with the word 'test'.

What about the following regexp:
while ($edocument =~ m/href\s*=\s*"([^"\s]+test[^"\s]+)"/gi) { #dostuff }
This regexp only matches urls with substring test in it.

Related

Validating a path in Perl

I am working on a lab for a class I am taking and I have a question about checking Perl strings for specific input.
Essentially what I want to be able to do is make sure that the input I am receiving from a user is something like:
/home/[anything is valid]/memo
The whole point of the lab is to prevent a pathname attack in a simple program that the instructor provided us with. So I want to check to make sure that the pathname provided by the user is within this format before doing anything with it.
I am currently using the abs_path() method in Perl to make get the absolute path of the string being passed in, but now I need to make sure that the absolute path contains what I have above.
Here is what I am trying to achieve:
my $input = "localhost:8080/cgi-bin/memo.cgi?memo=/home/megaboz/memo/new_CEO";
my $memo = '/home/megaboz/memo/new_CEO';
my $pathName = abs_path($memo);
if($pathName ne "/home/[anything works here]/memo/[anything works here]") {
#throw an error
}
else {
#process input
}
Any pointers?

Welcome to the wonderful world of regular expressions, which is something Perl is quite good with.
Let's walk through how to construct one of these. First, we usually use forward slashes to denote a regex, i.e.
/some-expression/
but since your paths have forward slashes in them, doing so would involve a messy bit of string escaping, so we'll use an alternate delimiter with m instead.
m(some-expression)
Now, we want to start with /home/ and end with /memo. You can read all about the different syntax in the link above, but in regular expressions we use ^ and $ (called anchors) to represent, respectively, the start and end of the string. So our regex is going to look like
m(^/home/SOMETHING/memo$)
Now for the piece in the middle. We want anything to pass. Your general purpose "anything" regex is a dot ., which matches any single character. And we can apply the Kleene star *, which says "zero or more of whatever comes before". So .* together says "zero or more of anything at all".
m(^/home/.*/memo$)
There's our regex. To apply it, we use =~ to ask "does it match", or !~ to ask "does it fail". The way your code is structured, we want to check for failure.
if ($pathName !~ m(^/home/.*/memo$)) {
...
} else {
...
}
Regular expressions are fairly ubiquitous and can be used in basically any programming language, so it's definitely a skill worth having (although Perl is particularly well-known for having strong regex support, so you're in the right tool for string matching capabilities).

There's a lot that's missing from your question, so I'll have to make some guesses. And, since Stackoverflow is mostly about other people having similar problems reading these answers, some of this may not apply to you. Futhermore, most of this is about web security and not particular to Perl. You'd want to go through the same things in any language.
First, you say "anything works here". Don't let that be true. Consider that .., the virtual parent directory, specifies movement around the directory structure:
/home/../../memo/../../../target.pl
You end up with a file that you didn't want to expose. Not only that, if they were, through other means, able to make a memo symlink in the right spots, they are able to use that to move around too. That is, you can't really tell what file you'll get just by looking at the path because symlink (or hard links, too, I guess) can completely change things. What if memo was a symlink to /?
Second, don't ever let remote CGI users tell you where a file is. That's just too much for them to decide for you. Instead, it looks like there are two things that you will allow them to supply. A directory in the second position and something at the end. Make them specify those two things in isolation:
https://localhost:8080/cgi-bin/memo.cgi?user=megaboz&thing=NewCEO
You still have to validate these two things, but it's much easier to do them separately than in the middle of a bunch of other things. And, since you are taking input from the user and mapping it onto the file system, you should use taint checking (perlsec), which helps you catch user input being used outside your program. To untaint a value, use a match and capture what you will allow. I suggest that you do not try to salvage any bad data here. If it doesn't match what you expect, return an error. Also, it's better to specify what you allow rather than come up with everything you will disallow:
#!perl -T
my( $user ) = however_you_get_CGI_params( 'user' ) =~ m/\A([a-z0-9]+)\z/i;
my( $thing ) = however_you_get_CGI_params( 'thing' ) =~ m/\A([a-z0-9]+)\z/i;
unless( defined $user and defined $thing ) { ... return some error ... }
Now, this doesn't mean that the values you now have in $user and $thing are true. They are merely valid values. Map those to whatever you need to fetch. Since you've constructed a path, checking that the path exists might be enough:
use File::Spec::Functions;
my $path = catfile( '/home', $user, 'memo', $thing );
unless( -e $path ) { ... return some error ... }

About searching recursively in Perl

I have a Perl script that I, well, mostly pieced together from questions on this site. I've read the documentation on some parts to better understand it. Anyway, here it is:
#!/usr/bin/perl
use File::Find;
my $dir = '/home/jdoe';
my $string = "hard-coded pattern to match";
find(\&printFile, $dir);
sub printFile
{
my $element = $_;
if(-f $element && $element =~ /\.txt$/)
{
open my $in, "<", $element or die $!;
while(<$in>)
{
if (/\Q$string\E/)
{
print "$File::Find::name\n";
last; # stops looking after match is found
}
}
}
}
This is a simple script that, similar to grep, will look down recursively through directories for a matching string. It will then print the location of the file that contains the string. It works, but only if the file is located in my home directory. If I change the hard-coded search to look in a different directory (that I have permissions in), for example /admin/programs, the script no longer seems to do anything: No output is displayed, even when I know it should be matching at least one file (tested by making a file in admin/programs with the hard-coded pattern. Why am I experiencing this behavior?
Also, might as well disclaim that this isn't a really useful script (heck, this would be so easy with grep or awk!), but understanding how to do this in Perl is important to me right now. Thanks
EDIT: Found the problem. A simple oversight in that the files in the directory I was looking for did not have .txt as extension. Thanks for helping me find that.

I was able to get the desired output using the code you pasted by making few changes like:
use strict;
use warnings;
You should always use them as they notify of various errors in your code which you may not get hold of.
Next I changed the line :
my $dir = './home/jdoe'; ##'./admin/programs'
The . signifies current directory. Also if you face problems still try using the absolute path(from source) instead of relative path. Do let me know if this solves your problem.

This script works fine without any issue. One thing hidden from this script to us is the pattern. you can share the pattern and let us know what you are expecting from that pattern, so that we can validate that.
You could also run your program in debug mode i.e.,
perl -d your_program.
That should take you to debug mode and there are lot of options available to inspect through the flow. type 'n' on the debug prompt to step in to the code flow to understand how your code flows. Typing 'n' will print the code execution point and its result

To match for a certain number

I have a file which has a lot of floating point numbers like this:
4.5268e-06 4.5268e-08 4.5678e-01 4.5689e-04...
I need to check if there is atleast one number with an expoenent -1. So, I wrote this short snippet with the regex. The regex works because I checked and it does. But what I am getting in the output is all 1s. I know I am missing something very basic. Please help.
#!usr/local/bin/perl
use strict;
use warnings;
my $i;
my #values;
open(WPR,"test.txt")||die "couldnt open $!";
while(<WPR>)
{
chomp();
push #values,(/\d\.\d\d\d\de+[+-][0][1]/);
}
foreach $i (#values){
print "$i\n";}
close(WPR);

The regular expression match operator m (which you have omitted) returns true if it matches. True in Perl is usually returned as 1. (Note that most stuff is true, though).
If you want to stick with the short syntax, do this:
push #values, $1 if /(\d\.\d\d\d\de+[+-][0][1])/;

If I move the parenthesis, it works fine:
push #values,/(\d\.\d\d\d\de+[+-][0][1])/;
If there's going to be more than one match on the line, I'd add a g at the end.
If you have capture groups, and a list context, then match returns a list of capture results.
If you want to take this to its insane conclusion then:
my #values = map { /(\d\.\d\d\d\de+[+-][0][1])/g } <WPR> ;
Yes, you can use <WPR> in a list context too.
BTW, while your regex works, it probably isn't exactly what you meant. For example e+ matches one or more es. A little simpler might be:
/\d\.\d{4}e[+-]01/ ;
Which is still going to have other issues like matching x.xxxxe+01 as well.

You could try with this one:
/\d+\.\d+e-01/

Perl: Frequency of words and a top ten list of the words

Im working on making a perl script work, and beware I'm rather new to this..
Here's is what im trying to achieve: making a script that takes a .txt file and counts each word in the file. And when it's counted the words makes a list if the top 10 words in the file displaying how many times each word i counted.
well here's what ive got so far, ive been able to make the script count the words and how many times they appear. Now i need to make the top ten list and I don't really know where and how to do it. This is a homework assignment so I don't want/expect you to solve it for me, just give me some pointers in where to begin.
Thank you for helping (in advance)
Updated 15 oct
ok, it's sorting everything great but..
As it is now it's just printing everything in one line. I need it to print it like this:
4 word
3 next word
2 next word
Well you get it..
I think i've got it...i think :P
......................................
#! /usr/bin/perl
use utf8;
print ("Vilken fil?\n");
my $filen = #ARGV ? shift(#ARGV) : <STDIN>;
chomp $filen;
my %freq;
open my $DATA, $filen or die "Hittade inte den filen!";
while(<$DATA>) {
s/[;:()".,!?]/ /gio;
foreach $word(split(' ', lc $_)) {
$freq{$word}++;
}
}
#listing = (sort { $freq{$b} <=> $freq{$a} } keys %freq)[0..9];
foreach my $word (#listing )
{ print $freq{$word}." $word\n"; };

Look at docs for the Perl sort function:
http://perldoc.perl.org/functions/sort.html
It has a form that lets you specify a block of code to define the ordering of elements. You can use this to order your list by frequency rather than by the word's alphabetical ordering.
The docs include this example:
# this sorts the %age hash by value instead of key
# using an in-line function
#eldest = sort { $age{$b} <=> $age{$a} } keys %age;
You should be able to adapt this pattern to your own problem.
Probably the most efficient way to get the top ten list is to keep track of the top ten as you go: each time you compute a count, check if it belongs in the top ten, and if so then insert it in the correct place, potentially knocking off the bottom item on the list. That way, you only need to track the ordering of ten words at a time regardless of how big the dictionary is. I don't know if you need this extra efficiency, though.
By the way, I have seen this kind of question in several job interviews, so it's a good thing to have a handle on.

Building on Nate's answer, you can extract the top 10 elements, by using a slice:
#eldest = ( sort { $age{$b} <=> $age{$a} } keys %age)[0..9];

Ha, by the time I finished reading your problem description I knew it was some kind of homework assignment! :)
For the next step, you have to scan through your %count hash and determine which words have the most occurrences.
The most naive way would be to scan through the list 10 times; each time, find the one with the highest count and store it in a top-ten list, then remove it from %count (or set it to 0 would also work).
If you want to get more ambitious, you could implement a sort function that sorts the %count entries, and then the 10 highest will be right together.
My Perl is rusty, but the Perl lib might even have some sort functions for you. In general, it's definitely worth your time to skim through a library reference to familiarize yourself on what's available.

How can i count the respective lines for each sub in my perl code?

I am refactoring a rather large body of code and a sort of esoteric question came to me while pondering where to go on with this. What this code needs in large parts is shortening of subs.
As such it would be very advantageous to point some sort of statistics collector at the directory, which would go through all the .pm, .cgi and .pl files, find all subs (i'm fine if it only gets the named ones) and gives me a table of all of them, along with their line count.
I gave PPI a cursory look, but could not find anything directly relevant, with some tools that might be appropiate, but rather complex to use.
Are there any easier modules that do something like this?
Failing that, how would you do this?
Edit:
Played around with PPI a bit and created a script that collects relevant statistics on a code base: http://gist.github.com/514512

my $document = PPI::Document->new($file);
# Strip out comments and documentation
$document->prune('PPI::Token::Pod');
$document->prune('PPI::Token::Comment');
# Find all the named subroutines
my $sub_nodes = $document->find(
sub { $_[1]->isa('PPI::Statement::Sub') and $_[1]->name } );
print map { sprintf "%s %s\n", $_->name, scalar split /\n/, $_->content } #$sub_nodes;

I'm dubious that simply identifying long functions is the best way to identify what needs to be refactored. Instead, I'd run the code through perlcritic at increasing levels of harshness and follow the suggestions.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Perl LWP Find links in page that contain a specific word - match

What about the following regexp: while ($edocument =~ m/href\s=\s"([^"\s]+test[^"\s]+)"/gi) { #dostuff } This regexp only matches urls with substring test in it.

Related

Validating a path in Perl

About searching recursively in Perl

To match for a certain number

Perl: Frequency of words and a top ten list of the words

How can i count the respective lines for each sub in my perl code?

Categories

Resources

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Perl LWP Find links in page that contain a specific word - match

What about the following regexp: while ($edocument =~ m/href\s*=\s*"([^"\s]+test[^"\s]+)"/gi) { #dostuff } This regexp only matches urls with substring test in it.

Related

Validating a path in Perl

About searching recursively in Perl

To match for a certain number

Perl: Frequency of words and a top ten list of the words

How can i count the respective lines for each sub in my perl code?

Categories

Resources

What about the following regexp: while ($edocument =~ m/href\s=\s"([^"\s]+test[^"\s]+)"/gi) { #dostuff } This regexp only matches urls with substring test in it.