Perl WWW::Mechanize Comparing the response header content length of different urls - perl

I have a question I'm hoping you could help with?
I have two text files containing the following:
FILE1.txt
http://www.dog.com/
http://www.cat.com/
http://www.antelope.com/
FILE2.txt
1
2
Barry
The output I correctly achieve is as follows:
http://www.dog.com/1
http://www.dog.com/2
http://www.dog.com/Barry
http://www.cat.com/1
http://www.cat.com/2
http://www.cat.com/Barry
http://www.antelope.com/1
http://www.antelope.com/2
http://www.antelope.com/Barry
Code to do the above
open my $animalUrls, '<', 'FILE1.txt' or die "Can't open: $!";
open my $directory, '<', 'FILE2.txt' or die "Can't open: $!";
my #directory = <$directory>; #each line of the file into an array
close $directory or die "Can't close: $!";
while (my $line = <$animalUrls>) {
chomp $line;
print $line.$_ foreach (#directory);
push (#newListOfUrls, $line.$_) foreach (#directory); #put each new url into array
}
Now the problem I am having:
I need to get the Content Length of the original urls (File1.txt) and compare the Content-Length of each of the new urls with the corresponding original one to see if they are the same or different, for example:
I need to compare the Content-Length of http://www.dog.com/1 with the Content-Length of original url http://www.dog.com/ to see if there is a difference.
Then I need to compare the Content-Length of http://www.dog.com/2 with the Content-Length of the original url http://www.dog.com/ to see if there is a difference.
Then I need to compare the Content-Length of http://www.dog.com/Barry with the Content-Length of the original url http://www.dog.com/ to see if there is a difference.
Then I need to compare the Content-Length of http://www.cat.com/1 with the Content-Length of the original url http://www.cat.com/ to see if there is a difference.
Then I need to compare the Content-Length of http://www.cat.com/2 with the Content-Length of the original url http://www.cat.com/ to see if there is a difference.
And so on........
Code to get the Content-Length:
print $mech->response->header('Content-Length'); #returns the content length
What I am having trouble with is how to compare each new url with the correct corresponding original one? (i.e not accidently comparing the Content-Length of http://www.cat.com/Barry with the Content-Length of http://www.dog.com/) Should I use a hash maybe and how would I go about that?
Your help with this would be much appreciated, Many Thanks

You should use a hash for this. I'd change your input code to make a more complex data structure, as this makes the task easier.
open my $animalUrls, '<', 'FILE1.txt' or die "Can't open: $!";
open my $directory, '<', 'FILE2.txt' or die "Can't open: $!";
my #directory = <$directory>; #each line of the file into an array
close $directory or die "Can't close: $!";
my $newURLs;
while ( my $baseURL = <$animalUrls> ) {
chomp $baseURL;
SUBDIR: foreach my $subdir (#directory) {
chomp $subdir;
next SUBDIR if $subdir eq "";
# put each new url into arrayref
push( #{ $newURLs->{$baseURL} }, $baseURL . $subdir );
}
}
We can now use this to our advantage. Assuming we have already set up Mechanize:
foreach my $url ( keys %{$newURLs} ) {
# first get the base URL and save its content length
$mech->get($url);
my $content_length = $mech->response->header('Content-Length');
# now iterate all the 'child' URLs
foreach my $child_url ( #{ $newURLs->{$url} } ) {
# get the content
$mech->get($child_url);
# compare
if ( $mech->response->header('Content-Length') != $content_length ) {
print "$child_url: different content length: $content_length vs "
. $mech->response->header('Content-Length') . "!\n";
}
}
}
You could even do it without the second set of foreach loops by putting the code where you build up your data structure.
If you are unfamiliar with these references, take a look at perlreftut. What we have done here is make a hash with a key for each of the base URLs, and put an array of all the generated child URLs into that. If you use Data::Dumper to output the final $newURLs, it will look something like this:
$VAR1 = {
'http://www.dog.com/' => [
'http://www.dog.com/1',
'http://www.dog.com/2',
],
'http://www.cat.com/' => [
'http://www.cat.com/1',
'http://www.cat.com/2',
],
};
EDIT: I updated the code. I used these files to test it:
URLS:
http://www.stackoverflow.com/
http://www.superuser.com/
Dirs:
faq
questions
/

This code seems to do what you need. It stores all the URLs in #urls and prints the content lengths as it fetches each URL. I don't know what you need the length data for afterwards, but I have stored the lengths of each response in the hash %lengths to associate them with the URLs.
use 5.010;
use warnings;
use LWP::UserAgent;
STDOUT->autoflush;
my #urls;
open my $fh, '<', 'FILE1.txt' or die $!;
while (my $base = <$fh>) {
chomp $base;
push #urls, $base;
open my $fh, '<', 'FILE2.txt' or die $!;
while (my $path = <$fh>) {
chomp $path;
push #urls, $base.$path;
}
}
my $ua = LWP::UserAgent->new;
my %lengths;
for my $url (#urls) {
my $resp = $ua->get($url);
my $length = $resp->header('Content-Length');
$lengths{$url} = $length;
printf "%s -- %s\n", $url, $length // 'undef';
}
output
http://www.dog.com/ -- undef
http://www.dog.com/1 -- 56244
http://www.dog.com/2 -- 56244
http://www.dog.com/Barry -- 56249
http://www.cat.com/ -- 156
http://www.cat.com/1 -- 11088
http://www.cat.com/2 -- 11088
http://www.cat.com/Barry -- 11088
http://www.antelope.com/ -- undef
http://www.antelope.com/1 -- undef
http://www.antelope.com/2 -- undef
http://www.antelope.com/Barry -- undef

Related

Including a perl file that is generated in current file

I am working on a perl script that successfully generates output files containing hashes. I want to use those hashes in my file. Is it possible to include a file that is generated in that file or will I have to create another file?
Technically, it might be cleaner to start a new .pl file that uses those hashes, but I would like to keep everything in a single script if possible. Is it even possible to do so?
Edit: I'm just unsure if I can "circle" it back around so I can use those hashes in my file because the hashes are generated on a weekly basis. I don't want my file to mistakenly reach out for last week's hashes instead of the newly generated ones. I have not yet wrote my script in a manner to classify each week's generated hashes.
In summary, here is what my file does. It extracts a table from another file. removes columns and rows that are not needed. Once left with the only two columns needed, it takes them and puts them into a hash. One column being the key and the other being the value. For this reason, I've found Data::Dumper to be the best option for my hashes. I'm intermediate in Perl and this is a script I'm putting together for an internship.
Here is an example how you can save a hash as JSON to a file and later read back the JSON to a perl hash. This example is using JSON::XS:
use strict;
use warnings;
use Data::Dumper;
use JSON::XS;
{
my %h = (a => 1, b => 2);
my $str = encode_json( \%h );
my $fn = 'test.json';
save_json( $fn, \%h );
my $h2 = read_json( $fn );
print Dumper( $h2 );
}
sub read_json {
my ( $fn ) = #_;
open ( my $fh, '<', $fn ) or die "Could not open file '$fn': $!";
my $str = do { local $/; <$fh> };
close $fh;
my $h = decode_json $str;
return $h;
}
sub save_json {
my ( $fn, $hash ) = #_;
my $str = encode_json( $hash );
open ( my $fh, '>', $fn ) or die "Could not open file '$fn': $!";
print $fh $str;
close $fh;
}
Output:
$VAR1 = {
'a' => 1,
'b' => 2
};
Some alternatives to JSON are YAML and Storable.

how to extract substrings by knowing the coordinates

I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...
I want to extract several substrings from a file whick contains string by using another file with the begin and the end of each substring that I want to extract.
The first file is like:
>scaffold30 24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................
(the string must be everything in the file except the first line), and the coordinates file is like:
44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
.....................
my script is this:
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;
while(<INFILE1>) {
if ($_ !~ m/^>/) {
local $/ = undef;
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my #cline = <INFILE>;
foreach my $cline (#cline) {
print "$cline\n";
my#data = split('\t', $cline);
my $start = $data[0];
my $end = $data[1];
my $offset = $end - $start;
$count++;
my $sub = substr ($var, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE;
}
}
when I run it, it looks like it does only one iteration and it prints me the start of the first file.
It seems like the foreach loop doesn't work.
also substr seems that doesn't work.
when I put an exit to print the cline to check the loop, it prints all the lines of the file with the coordinates.
I am sorry if I become annoying, but I must finish it and I am a little bit desperate...
Thank you again.
This line
local $/ = undef;
changes $/ for the entire enclosing block, which includes the section where you read in your second file. $/ is the input record separator, which essentially defines what a "line" is (it is a newline by default, see perldoc perlvar for details). When you read from a filehandle using <>, $/ is used to determine where to stop reading. For example, the following program relies on the default line-splitting behavior, and so only reads until the first newline:
my $foo = <DATA>;
say $foo;
# Output:
# 1
__DATA__
1
2
3
Whereas this program reads all the way to EOF:
local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3
__DATA__
1
2
3
This means your #cline array gets only one element, which is a string containing the text of your entire coordinates file. You can see this using Data::Dumper:
use Data::Dumper;
print Dumper(\#cline);
Which in your case will output something like:
$VAR1 = [
'44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
'
];
Notice how your array (technically an arrayref in this case), delineated by [ and ], contains only a single element, which is a string (delineated by single quotes) that contains newlines.
Let's walk through the relevant sections of your code:
while(<INFILE1>) {
if ($_ !~ m/^>/) {
# Enable localized slurp mode. Stays in effect until we leave the 'if'
local $/ = undef;
# Read the rest of INFILE1 into $var (from current line to EOF)
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
# In list context, return each block until the $/ character as a
# separate list element. Since $/ is still undef, this will read
# everything until EOF into our first list element, resulting in
# a one-element array
my #cline = <INFILE>;
# Since #cline only has one element, the loop only has one iteration
foreach my $cline (#cline) {
As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should probably use lexical filehandles anyway (and the three-argument form of open):
open my $chromosome_fh, "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;
Also, you do not need to nest your loops in this case, it just makes your code more convoluted. First read the relevant parts of your chromosome file into a variable (named something more meaningful than var):
# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
next if /^>/;
$chromosome .= $_;
}
Then read in your coordinates file:
my #cline = <$coordinates_fh>;
Or if you only need to use the contents of the coordinates file once, process each line as you go using a while loop:
while (<$coordinates_fh>) {
# Do something for each line here
}
As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;
<INFILE1>; # get rid of the first line - '>scaffold30 24194'
while(<INFILE1>) {
chomp;
$fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;
while(<INFILE>) {
my ($start, $end) = split;
# Or, should this be: my $offset = $end - ($start - 1);
# That would include the start fasta
my $offset = $end - $start;
$count++;
my $sub = substr ($fasta, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";

How to compare two text files and removing the matching contents and pass to output in perl?

I have two text files text1.txt and text2.txt like below
text1
ac
abc
abcd
abcde
text2
ab
abc
acd
abcd
output
ac
abcde
I need to compare the two files and remove the content from text1 when there is a match in the second file.
I want the code in Perl. Currently I am trying the below code.
#!usr/bin/perl
use strict;
use warnings;
open (GEN, "text1.txt") || die ("cannot open general.txt");
open (SEA, "text2.txt") || die ("cannot open search.txt");
open (OUT,">> output.txt") || die ("cannot open intflist.txt");
open (LOG, ">> logfile.txt");
undef $/;
foreach (<GEN>) {
my $gen = $_;
chomp ($gen);
print LOG $gen;
foreach (<SEA>) {
my $sea = $_;
chomp($sea);
print LOG $sea;
if($gen ne $sea) {
print OUT $gen;
}
}
}
In this I am getting all content from text1, not the unmatched content. Please help me out.
I think you should read the text2 in an array and then in the second foreach on that array use the array.
#b = <SEA>;
Or else in the second loop the file pointer would be at the end already
One way:
#!/usr/bin/perl
use strict;
use warnings;
$\="\n";
open my $fh1, '<', 'file1' or die $!;
open my $fh2, '<', 'file2' or die $!;
open my $out, '>', 'file3' or die $!;
chomp(my #arr1=<$fh1>);
chomp(my #arr2=<$fh2>);
foreach my $x (#arr1){
print $out $x if (!grep (/^\Q$x\E$/,#arr2));
}
close $fh1;
close $fh2;
close $out;
After executing the above, the file 'file3' contains:
$ cat file3
ac
abcde
This is my plan:
Read the contents of first file in a hash, with a counter of occurrences. For example, working with your data you get:
%lines = ( 'ac' => 1,
'abc' => 1,
'abcd' => 1,
'abcde' => 1);
Read the second file, deleting the previous hash %lines if key exists.
Print the keys %lines to the desired file.
Example:
use strict;
open my $fh1, '<', 'text1' or die $!;
open my $fh2, '<', 'text2' or die $!;
open my $out, '>', 'output' or die $!;
my %lines = ();
while( my $key = <$fh1> ) {
chomp $key;
$lines{$key} = 1;
}
while( my $key = <$fh2> ) {
chomp $key;
delete $lines{$key};
}
foreach my $key(keys %lines){
print $out $key, "\n";
}
close $fh1;
close $fh2;
close $out;
Your main problem is that you have undefined the input record separator $/. That means the whole file will be read as a single string, and all you can do is say that the two files are different.
Remove undef $/ and things will work a whole lot better. However the inner for loop will read and print all the lines in file2 that don't match the first line of file1. The second time this loop is encountered all the data has been read from the file so the body of the loop won't be executed at all. You must either open file2 inside the outer loop or read the file into an array and loop over that instead.
Then again, do you really want to print all lines from file2 that aren't equal to each line in file1?
Update
As I wrote in my comment, it sounds like you want to output the lines in text1 that don't appear anywhere in text2. That is easily achieved using a hash:
use strict;
use warnings;
my %exclude;
open my $fh, '<', 'text2.txt' or die $!;
while (<$fh>) {
chomp;
$exclude{$_}++;
}
open $fh, '<', 'text1.txt' or die $!;
while (<$fh>) {
chomp;
print "$_\n" unless $exclude{$_};
}
With the data you show in your question, that produces this output
ac
abcde
I would like to view your problem like this:
You have a set S of strings in file.txt.
You have a set F of forbidden strings in forbidden.txt.
You want the strings that are allowed, so S \ F (setminus).
There is a data structure in Perl that implements a set of strings: The hash. (It can also map to scalars, but that is secondary here).
So first we create the set of the lines we have. We let all the strings in that file map to undef, as we don't need that value:
open my $FILE, "<", "file.txt" or die "Can't open file.txt: $!";
my %Set = map {$_ => undef} <$FILE>;
We create the forbidden set the same way:
open my $FORBIDDEN, "<", "forbidden.txt" or die "Can't open forbidden.txt: $!";
my %Forbidden = map {$_ => undef} <$FORBIDDEN>;
The set minus works like either of these ways:
For each element x in S, x is in the result set R iff x isn't in F.
my %Result = map {$_ => $Set{$_}} grep {not exists $Forbidden{$_}} keys %Set;
The result set R initially is S. For each element in F, we delete that item from R:
my %Result = %Set; # make a copy
delete $Result{$_} for keys %Forbidden;
(the keys function accesses the elements in the set of strings)
We can then print out all the keys: print keys %Result.
But what if we want to preserve the order? Entries in a hash can also carry an associated value, so why not the line number? We create the set S like this:
open my $FILE, "<", "file.txt" or die "Can't open file.txt: $!";
my $line_no = 1;
my %Set = map {$_ => $line_no++} <$FILE>;
Now, this value is carried around with the string, and we can access it at the end. Specifically, we sort the keys in the hash after their line number:
my #sorted_keys = sort { $Result{$a} <=> $Result{$b} } keys %Result;
print #sorted_keys;
Note: All of this assumes that the files are terminated by newline. Else, you would have to chomp.

Count the occurrences of a string in a file

I wrote a perl script to count the occurrences of a character in a file.
So far this is what I have got,
#!/usr/bin/perl -w
use warnings;
no warnings ('uninitialized', 'substr');
my $lines_ref;
my #lines;
my $count;
sub countModule()
{
my $file = "/test";
open my $fh, "<",$file or die "could not open $file: $!";
my #contents = $fh;
my #filtered = grep (/\// ,#contents);
return \#filtered;
}
#lines = countModule();
##lines = $lines_ref;
$count = #lines;
print "###########\n $count \n###########\n";
My test file looks like this:
10.0.0.1/24
192.168.10.0/24
172.16.30.1/24
I am basically trying to count the number of instances of "/"
This is the output that I get:
###########
1
###########
I am getting 1 instead of 3, which is the number of occurrences.
Still learning perl, so any help will be appreciated..Thank you!!
Here are a few points about your code
You should always use strict at the top of your program, and only use no warnings for special reasons in a limited scope. There is no general reason why a working Perl program should need to disable warnings globally
Declare your variables close to their first point of use. The style of declaring everything at the top of the file is unnecessary and is a legacy of C
Never use prototypes in your code. They are available for very special purposes and shouldn't be used for the vast majority of Perl code. sub countModule() { ... } insists that countModule may never be called with any parameters and isn't necessary or useful. The definition should be just sub countModule { ... }
A big well done! for using a lexical file handle, the three-parameter form of open, and putting $! in your die string
my #contents = $fh will just set #contents to a single-element list containing just the filehandle. To read the whole file into the array you need my #contents = <$fh>
You can avoid escaping slashes in a regular expression if you use a different delimiter. To do that you need to use the m operator explicitly, like my #filtered = grep m|/|, #contents)
You return an array reference but assign the returned value to an array, so #lines = countModule() sets #lines to a single-element list containing just the array reference. You should either return a list with return #filtered or dereference the return value on assignment with #lines = #{ countModule }
If all you need to do is to print the number of lines in the file that contain a slash character then you could write something like this
use strict;
use warnings;
my $count;
sub countModule {
open my $fh, '<', '/test' or die "Could not open $file: $!";
return [ grep m|/|, <$fh> ];
}
my $lines = countModule;
$count = #$lines;
print "###########\n $count \n###########\n";
Close, but a few issues:
use strict;
use warnings;
sub countModule
{
my $file = "/test";
open my $fh, "<",$file or die "could not open $file: $!";
my #contents = <$fh>; # The <> brackets are used to read from $fh.
my #filtered = grep (/\// ,#contents);
return #filtered; # Remove the reference.
}
my #lines = countModule();
my $count = scalar #lines; # 'scalar' is not required, but lends clarity.
print "###########\n $count \n###########\n";
Each of the changes I made to your code are annotated with a #comment explaining what was done.
Now in list context your subroutine will return the filtered lines. In scalar context it will return a count of how many lines were filtered.
You did also mention find the occurrences of a character (despite everything in your script being line-oriented). Perhaps your counter sub would look like this:
sub file_tallies{
my $file = '/test';
open my $fh, '<', $file or die $!;
my $count;
my $lines;
while( <$fh> ) {
$lines++;
$count += $_ =~ tr[\/][\/];
}
return ( $lines, $count );
}
my( $line_count, $slash_count ) = file_tallies();
In list context,
return \#filtered;
returns a list with one element -- a reference to the named array #filtered. Maybe you wanted to return the list itself
return #filtered;
Here's some simpler code:
sub countMatches {
my ($file, $c) = #_; # Pass parameters
local $/;
undef $/; # Slurp input
open my $fh, "<",$file or die "could not open $file: $!";
my $s = <$fh>; # The <> brackets are used to read from $fh.
close $fh;
my $ptn = quotemeta($c); # So we can match strings like ".*" verbatim
my #hits = $s =~ m/($ptn)/g;
0 + #hits
}
print countMatches ("/test", '/') . "\n";
The code pushes Perl beyond the very basics, but not too much. Salient points:
By undeffing $/ you can read the input into one string. If you're counting
occurrences of a string in a file, and not occurrences of lines that contain
the string, this is usually easier to do.
m/(...)/g will find all the hits, but if you want to count strings like
"." you need to quote the meta characters in them.
Store the results in an array to evaluate m// in list context
Adding 0 to a list gives the number of items in it.

How can I check if a value is in a list in Perl?

I have a file in which every line is an integer which represents an id. What I want to do is just check whether some specific ids are in this list.
But the code didn't work. It never tells me it exists even if 123 is a line in that file. I don't know why? Help appreciated.
open (FILE, "list.txt") or die ("unable to open !");
my #data=<FILE>;
my %lookup =map {chop($_) => undef} #data;
my $element= '123';
if (exists $lookup{$element})
{
print "Exists";
}
Thanks in advance.
You want to ensure you make your hash correctly. The very outdated chop isn't what you want to use. Use chomp instead, and use it on the entire array at once and before you create the hash:
open my $fh, '<', 'list.txt' or die "unable to open list.txt: $!";
chomp( my #data = <$fh> );
my $hash = map { $_, 1 } #data;
With Perl 5.10 and up, you can also use the smart match operator:
my $id = get_id_to_check_for();
open my $fh, '<', 'list.txt' or die "unable to open list.txt: $!";
chomp( my #data = <$fh> );
print "Id found!" if $id ~~ #data;
perldoc -q contain
chop returns the character it chopped, not what was left behind. You perhaps want something like this:
my %lookup = map { substr($_,0,-1) => undef } #data;
However, generally, you should consider using chomp instead of chop to do a more intelligent CRLF removal, so you'd end up with a line like this:
my %lookup =map {chomp; $_ => undef } #data;
Your problem is that chop returns the character chopped, not the resulting string, so you're creating a hash with a single entry for newline. This would be obvious in debugging if you used Data::Dumper to output the resulting hash.
Try this instead:
my #data=<FILE>;
chomp #data;
my %lookup = map {$_ => undef} #data;
This should work... it uses first in List::Util to do the searching, and eliminates the initial map (this is assuming you don't need to store the values for something else immediately after). The chomp is done while searching for the value; see perldoc -f chomp.
use List::Util 'first';
open (my $fh, 'list.txt') or die 'unable to open list.txt!';
my #elements = <$fh>;
my $element = '123';
if (first { chomp; $_ eq $element } #elements)
{
print "Exists";
}
This one may not exactly match your specific problem,
but if your integer numbers need to be
counted, you might even use the good
old "canonical" perl approach:
open my $fh, '<', 'list.txt' or die "unable to open list.txt: $!";
my %lookup;
while( <$fh> ) { chomp; $lookup{$_}++ } # this will count occurences if ints
my $element = '123';
if( exists $lookup{$element} ) {
print "$element $lookup{$element} times there\n"
}
This might even be in some circumstances faster than
solutions with intermediate array.
Regards
rbo