I would like to count the total number of files whose modify_time is between $atime and $btime. Here is part of my code, but it doesn't return anything. What is wrong?
sub mtime_between {
my $mtime=0;
my $counts=0;
$mtime = (stat $File::Find::name)[9] if -f $File::Find::name;
if ($mtime > $atime and $mtime < $btime) {
return sub { print ++$counts,"$File::Find::name\n"};
}
When i call the subroutine, I get nothing.
find(\&mtime_between,"/usr");
You should not be returning a function.
Check File::Find documentation.
find() does a depth-first search over the given #directories in the order they are given. For each file or directory found, it calls the &wanted subroutine.
In the wanted function you should do the things you want to do directly. To return a function reference will not work and this is why you are having problems.
So you actually want something more like:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw{say};
use File::Find;
use Data::Dumper;
my ($atime, $btime) = (1461220840, 1561220844);
sub findFilesEditedBetweenTimestamps {
my ($atime, $btime, $path) = #_;
my $count = 0;
my #files = ();
my $mtime_between = sub {
my $mtime = 0;
$mtime = (stat $File::Find::name)[9] if -f $File::Find::name;
if ($mtime > $atime and $mtime < $btime) {
push #files, $File::Find::name;
$count++;
}
return;
};
find ($mtime_between, $path);
say "Found a total of $count files";
say "Files:";
print Dumper(#files);
}
findFilesEditedBetweenTimestamps($atime, $btime, "./");
I get:
Found a total of 2 files
Files:
$VAR1 = './test.txt';
$VAR2 = './test.pl';
As has been said, the value returned by the wanted subroutine is ignored. Returning a callback from a callback may be a step too far for some!
This may be of interest. I've used the File::stat module to make extraction of the modification time more readable, and Time::Piece, so that $atime and $btime can be expressed in readable strings instead of epoch values
There's no need to write a separate subroutine for the wanted function unless you prefer -- you can just use an anonymous subroutine in the find call. And it's easiest to simply return from the wanted subroutine if the node isn't a file
#!/usr/bin/env perl
use strict;
use warnings 'all';
use File::Find;
use File::stat;
use Time::Piece;
sub count_files_between_times {
my ($from, $to, $path) = #_;
my $count = 0;
find(sub {
my $st = stat($_) or die $!;
return unless -f $st;
my $mtime = $st->mtime;
++$count if $mtime >= $fromand $mtime <= $to;
}, $path);
print "Found a total of $count files\n";
}
my ($from, $to) = map {
Time::Piece->strptime($_, '%Y-%m-%dT%H:%M:%S')->epoch;
} '2016-04-19T00:00:00', '2019-04-22T00:00:00';
count_files_between_times($from, $to, '/usr');
Update
Some people prefer the File::Find::Rule module. Personally I dislike it intensely, and having looked at the source code I am very wary of it, but it certainly makes this process more concise
Note that File::Find::Rule is layered on top of File::Find, which does the heavy-lifting for it. So it is essentially a different way of writing the wanted subroutine
use File::Find::Rule ();
sub count_files_between_times {
my ($from, $to, $path) = #_;
my #files = File::Find::Rule->file->mtime(">= $from")->mtime("<= $to")->in($path);
printf "Found a total of %d files\n", scalar #files;
}
or if you prefer you can add the restrictions one statement at a time
use File::Find::Rule ();
sub count_files_between_times {
my ($from, $to, $path) = #_;
my $rule = File::Find::Rule->new;
$rule->file;
$rule->mtime(">= $from");
$rule->mtime("<= $to");
my #files = $rule->in($path);
printf "Found a total of %d files\n", scalar #files;
}
Both of these alternative subroutines produce identical results to that of the original above
Related
I am using the File::Grep module. I have following example:
#!/usr/bin/perl
use strict;
use warnings;
use File::Grep qw( fgrep fmap fdo );
my #matches = fgrep { 1.1.1 } glob "file.csv";
foreach my $str (#matches) {
print "$str\n";
}
But when I try to print $str value it gives me HEX value: GLOB(0xac2e78)
What's wrong with this code?
The documentation doesn't seem to be accurate, but judging from the source-code — http://cpansearch.perl.org/src/MNEYLON/File-Grep-0.02/Grep.pm — the list you get back from fgrep contains one element per file. Each element is a hash of the form
{
filename => $filename,
count => $num_matches_in_that_file,
matches => {
$line_number => $line,
...
}
}
I think it would be simpler to skip fgrep and its complicated return-value that has way more information than you want, in favor of fdo, which lets you just iterate over all lines of a file and do what you want:
fdo { my ( $file, $pos, $line ) = #_;
print $line if $line =~ m/1\.1\.1/;
} 'file.csv';
(Note that I removed the glob, by the way. There's not much point in writing glob "file.csv", since only one file can match that globstring.)
or even just dispense with this module and write:
{
open my $fh, '<', 'file.csv';
while (<$fh>) {
print if m/1\.1\.1/;
}
}
I assume you want to see all the lines in file.csv that contain 1.1.1?
The documentation for File::Grep isn't up to date, but this program will put into #lines all the matching lines from all the files (if there were more than one).
use strict;
use warnings;
use File::Grep qw/ fgrep /;
$File::Grep::SILENT = 0;
my #matches = fgrep { /1\.1\.1/ } 'file.csv';
my #lines = map {
my $matches = $_->{matches};
#{$matches}{ sort { $a <=> $b } keys %$matches};
} #matches;
print for #lines;
Update
The most Perlish way to do this is like so
use strict;
use warnings;
open my $fh, '<', 'file.csv' or die $!;
while (<$fh>) {
print if /1\.1\.1/;
}
I am trying to extract a DNA sequence from this FASTA file to a specified length of bases per line, say 40.
> sample dna (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
Using this Perl module (fasta.pm):
package fasta;
use strict;
sub read_fasta ($filename) {
my $filename = #_;
open (my $FH_IN, "<", $filename) or die "Can't open file: $filename $!";
my #lines = <$FH_IN>;
chomp #lines;
return #lines;
}
sub read_seq (\#lines) {
my $linesRef = #_;
my #lines = #{$linesRef};
my #seq;
foreach my $line (#lines) {
if ($line!~ /^>/) {
print "$line\n";
push (#seq, $line);
}
}
return #seq;
}
sub print_seq_40 (\#seq) {
my $linesRef = #_;
my #lines = #{$linesRef};
my $seq;
foreach my $line (#lines) {
$seq = $seq.$line;
}
my $i= 0;
my $seq_line;
while (($i+1)*40 < length ($seq)) {
my $seq_line = substr ($seq, $i*40, 40);
print "$seq_line\n";
$i++;
}
$seq_line = substr ($seq, $i*40);
print "$seq_line\n";
}
1;
And the main script is
use strict;
use warnings;
use fasta;
print "What is your filename: ";
my $filename = <STDIN>;
chomp $filename;
my #lines = read_fasta ($filename);
my #seq = read_seq (\#lines);
print_seq_40 (\#seq);
exit;
This is the error I get
Undefined subroutine &main::read_fasta called at q2.pl line 13, <STDIN> line 1.
Can anyone please enlighten me on which part I did wrong?
It looks like you're getting nowhere with this.
I think your choice to use a module and subroutines is a little strange, given that you call each subroutine only once and the correspond to very little code indeed.
Both your program and your module need to start with use strict and use warnings, and you cannot use prototypes like that in Perl subroutines. Including a number of other bugs, this is a lot closer to the code that you need.
package Fasta;
use strict;
use warnings;
use 5.010;
use autodie;
use base 'Exporter';
our #EXPORT = qw/ read_fasta read_seq print_seq_40 /;
sub read_fasta {
my ($filename) = #_;
open my $fh_in, '<', $filename;
chomp(my #lines = <$fh_in>);
#lines;
}
sub read_seq {
my ($lines_ref) = $_[0];
grep { not /^>/ } #$lines_ref;
}
sub print_seq_40 {
my ($lines_ref) = #_;
print "$_\n" for unpack '(A40)*', join '', #$lines_ref;
}
1;
q2.pl
use strict;
use warnings;
use Fasta qw/ read_fasta read_seq print_seq_40 /;
print "What is your filename: ";
my $filename = <STDIN>;
chomp $filename;
my #lines = read_fasta($filename);
my #seq = read_seq(\#lines);
print_seq_40(\#seq);
You need to either:
add to your module:
use Exporter;
our #EXPORT = qw ( read_fasta
read_seq ); #etc.
call the code in the remote module explicitly:
fasta::read_fasta();
explicitly import the module sub:
use fasta qw ( read_fasta );
Also: General convention on modules is to uppercase the first letter of the module name.
In Perl, if you use fasta;, this does not automatically export all its methods into the namespace of your program. Call fasta::read_fasta instead.
Or: use Exporter to automatically export methods or enable something like use Fasta qw/read_fasta/.
For example:
package Fasta;
require Exporter;
our #ISA = qw(Exporter);
our #EXPORT_OK = qw/read_fasta read_seq read_seq40/;
To use:
use Fasta qw/read_fasta read_seq read_seq40/;
You can also make Fasta export all methods automatically or define keywords to group methods, though the latter has caused me some problems in the past, and I would recommend it only if you are certain it is worth possible trouble.
If you want to make all methods available:
package Fasta;
use Exporter;
our #ISA = qw(Exporter);
our #EXPORT = qw/read_fasta read_seq read_seq40/;
Note #EXPORT is not #EXPORT_OK. The latter allows importing them later (as I did), the former automatically exports all. The documentation I linked to makes this clear.
I just noticed something else. You are flattening #_ into $filename in read_fasta. I am not sure this works. Try this:
sub read_fasta {
my $filename = $_[0]; # or ($filename) = #_; #_ is an array. $filename not.
}
To explain the problem: $filename = #_; means: store #_ ( an ARRAY ) into $filename (a SCALAR). Perl does this in this way: ARRAY length is stored in $filename. That is not what you want. You want the first element of the array. That would be $_[0].
Added #ISA which is probably needed OR use comment by Borodir.
My task is:
Read the directory, type of sorting, and order of sorting from command line.
Sort the file names and print them out with size and date.
Here is what I got so far.
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use Getopt::Long;
my $dir = "";
my $sortby = "";
my $order = "";
my $result;
$result = GetOptions (
'dir=s' => \$dir, # specify derictory
'sortby=s' => \$sortby, # 'name' or 'date'
'order=s' => \$order); # 'asc'- or 'des'-ending order of sorting
print "derictory = $dir, sortby = $sortby, order = $order \n\n";
opendir (DH, $dir)or die "couldn open dericroty: $!\n";
my #filenames = grep ! /^\./, readdir DH;
closedir (DH);
if ($sortby eq "name") {
if ($order eq "asc") {
foreach my $name (sort {lc $a cmp lc $b} #filenames) {
my #statinfo = stat("$dir/$name");
print "$name\tsize= " . $statinfo[7] . ",\t last modified=" .
scalar(localtime($statinfo[9])) . "\n";
}
}
elsif ($order eq "des") {
foreach my $name (sort {lc $b cmp lc $a} #filenames) {
my #statinfo = stat("$dir/$name");
print "$name\tsize= " . $statinfo[7] . ",\t last modified=" .
scalar(localtime($statinfo[9])) . "\n";
}
}
}
if ($sortby eq "date") {
if ($order eq "asc") {
#filenames = sort { -M "$dir/$a" <=> -M "$dir/$b" } (#filenames);
print join ("\n", #filenames);
}
elsif ($order eq "des") {
#filenames = sort { -M "$dir/$b" <=> -M "$dir/$a" } (#filenames);
print join ("\n", #filenames);
}
}
The problem is if I need to sort it by date modified, I don't know how to print out the list of the file names with the size and date. I guess I am supposed to use the stat function, but I can't loop through names, and get each stat.
All I have above is basically what I was able to google and put together.
Here's a different way to think about the problem. The essential points:
Write small functions that do simple things, and build your program
by assembling those functions together.
If you collect all of your information in a convenient data
structure (in this example, a list of hashes), the algorithmic/logical
aspects of the program become easy and natural.
For simplicity, this example ignore option-parsing and instead just accepts the params as regular command line arguments.
use strict;
use warnings;
main();
sub main {
my ($dir, $sortby, $order) = #ARGV;
my #contents = read_dir($dir);
my $sb = $sortby eq 'date' ? 'mtime' : 'path';
my #sorted = sort { $a->{$sb} cmp $b->{$sb} } #contents;
#sorted = reverse(#sorted) if $order eq 'des';
for my $fi (#sorted){
print $fi->{path}, ' : ', $fi->{mtime}, "\n";
}
}
sub read_dir {
# Takes a dir path.
# Returns a list of file_info() hash refs.
my $d = shift;
opendir(my $dh, $d) or die $!;
return map { file_info($_) } # Collect info.
map { "$d/$_" } # Attach dir path.
grep { ! /^\.\.?$/ } # No dot dirs.
readdir($dh);
}
sub file_info {
# Takes a path to a file/dir.
# Returns hash ref containing the path plus any stat() info you need.
my $f = shift;
my #s = stat($f);
return {
path => $f,
mtime => $s[9],
};
}
If you are going to sort by certain properties of your data, you may want to take a look at the Schwartzian Transform. This is a basic example of how you might use it to sort by modified time:
use strict;
use warnings;
use constant MTIME_STAT_INDEX => 9;
use constant FILENAME_INDEX => 0;
use constant MTIME_INDEX => 1;
# Grab a list of files in the current folder
my $some_dir = '.';
opendir(my $dh, $some_dir) || die "can't opendir $some_dir: $!";
my #fileNames = readdir $dh;
closedir $dh;
# Use a Schwartzian transform to generate a sorted list of <file_name, mtime> tuples
my #sortedByMtime =
map { $_ }
sort { $a->[MTIME_INDEX] cmp $b->[MTIME_INDEX] }
map { [$_, (stat($_))[MTIME_STAT_INDEX]] } #fileNames;
# Print the file name and mtime
for my $sortedRecord (#sortedByMtime) {
print $sortedRecord->[FILENAME_INDEX] . "\t" . $sortedRecord->[MTIME_INDEX] . "\n";
}
1;
It may help to read the transform outside-in (ie starting at the end and working toward the start). Starting with a list of file names, you use map to produce an array containing entries of the form <file_name, modified_time>. You then sort this list by modified time and can use the final map (ie the first one) to strip out any unwanted properties. In this example, I did not strip anything out, but I hope you get the idea that you could in theory have other properties in this built up structure, such as file size, for instance.
This is intended to just get you started as a proof of concept – I did not take much consideration for efficiency, error handling, or making the output pretty.
You should look at File::stat. This module (which comes with Subversion allows you to easily access all sorts of information about the file.
You should also look at Time::Piece. This module allows you to easily format your date and time.
I would also not worry about having four separate sorting routines. Instead, just sort what you need in an array standard ascending order. Then, before you print out, see if the user requested descending order. If the user did request descending order, you can use the reverse to reverse your sorted array.
I am using References. The array I am storing my file names contains not a string, but a reference to a hash. This way, each entry in my array contains four separate bits of information about my file.
I am also use Pod::Usage to print out messages based upon my POD documentation. POD is a rather simple format for storing documentation about your program. Users can use the perldoc command to display the pod:
$ perldoc prog.pl
Or, they can use commands such as pod2html to convert the documentation into HTML. These various Perldoc and POD commands come with your Perl distribution. I highly recommend that you learn POD and use it extensively. It keeps your program documentation in your program and allows you to produce all sorts of formats for your documentation. (Text, HTML, manpage, markdown, wiki, etc.).
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);
use autodie;
# All of these are standard Perl module and come with all distributions
# or Perl
use Time::Piece;
use File::stat;
use Getopt::Long;
use Pod::Usage;
use File::Basename;
my ( $directory, $sort_order, $sort_descending, $help );
#
# Using pod2usage to print out my messages
#
GetOptions (
"directory=s" => \$directory,
"sort=s" => \$sort_order,
"descending" => \$sort_descending,
"help" => \$help,
) or pod2usage;
if ( $help ) {
pod2usage ( -message => qq(Use command 'perldoc print_dir.pl' for complete documetation) );
}
if ( not ( defined $directory and defined $sort_order ) ) {
pod2usage ( -message => qq(Must use parameters "directory" and "sort") );
}
if ( $sort_order ne "name" and
$sort_order ne "ctime" and
$sort_order ne "size" and
$sort_order ne "mtime" ) {
die qq(Sort order must be "name", "size", "ctime", or "mtime"\n);
}
opendir ( my $dir_fh, $directory ); #Will autodie here if directory doesn't exist
my #files;
while ( my $file = readdir $dir_fh ) {
$file = "$directory/$file";
next if not -f $file;
#
# Note I'm using File::stat to get the info on the files
#
my $stat = stat $file or die qq(Couldn't stat file "$file"\n);
my %file;
$file{NAME} = basename $file;
$file{CTIME} = $stat->ctime;
$file{MTIME} = $stat->mtime;
$file{SIZE} = $stat->size;
#
# I'm storing this information in a hash and pushing a Hash Reference
#
push #files, \%file; #Pushing a reference to the hash
}
closedir $dir_fh;
my #sorted_files = sort file_sort #files;
#
# I am using the fact that my hash keys and my sort options
# are very similar. One routine sorts all which ways
#
sub file_sort {
my $sort_by = uc $sort_order;
if ( $sort_order eq "name" ) {
return $a->{$sort_by} cmp $b->{$sort_by};
} else {
return $a->{$sort_by} <=> $b->{$sort_by};
}
}
#
# If the user wants descending order, reverse the array
#
if ( $sort_descending ) {
#sorted_files = reverse #sorted_files;
}
#
# I'm using 'printf' to print out a nice report.
# My $format is the format of the report, and I
# can use it for the title or the body.
#
my $format = "%-20.20s %-10d %-11.11s %-11.11s\n";
( my $title_format = $format ) =~ s/d/s/;
printf $title_format, "Name", "Sixe", "Mod-Time", "C-Time";
say join " ", "=" x 20, "=" x 10, "=" x 11, "=" x 11;
for my $file ( #sorted_files ) {
#
# The "->" dereferences the hash
# Note how I use Time::Piece to format my time
#
my $mtime = Time::Piece->new ( $file->{MTIME} );
my $ctime = Time::Piece->new ( $file->{CTIME} );
printf $format, $file->{NAME}, $file->{SIZE}, $mtime->ymd, $ctime->ymd;
}
#
# Here be the Plain Old Documention (POD) This is the standard
# way to document Perl programs. You can use the "perldoc" program
# to print it out, and pod2usage to print out bits and pieces.
#
=pod
=head1 NAME
print_dir.pl
=head1 SYNOPSIS
print_dir.pl -sort [name|size|mtime|ctime] -directory $directory [ -descending ]
=head1 DESCRIPTION
This program does somee amazing wonderful stuff...
=head1 OPTIONS
=over 4
=item *
-sort
(Required) Sort order of directory parameters can be C<name>, C<size>, C<mtime>, C<ctime>
=item *
-directory
(Required) Name of the directory to print
=item *
-descending
(Optional) Sort in descending order instead of ascending order
=back
=cut
I have been trying to get rid of a weird bug for hours, with no success. I have a subroutine that sorts a file. here is the code:
sub sort_file {
$filename = #_;
print #_;
print $filename;
open(SRTINFILE,"<$filename");
#lines=<SRTINFILE>;
close(SRTINFILE);
open(SRTOUTFILE,">$filename");
#sorted = sort { #aa=split(/ /,$a); #bb=split(/ /,$b); return ($aa[1] <=> $bb[1]); } #lines;
print SRTOUTFILE #sorted;
close(SRTOUTFILE);
}
any time this function is run, perl creates a file, called "1". i have no idea why. I am a complete perl noob and am just using it for quick and dirty text file processing. anyone know what's wrong?
An array in scalar context evalutes to the number of elements in the array. If you pass one argument to the function, the following assigns 1 to $filename.
$filename = #_;
You want any of the following:
$filename = $_[0];
$filename = shift;
($filename) = #_;
Furthermore, you want to limit the scope of the variable to the function, so you want
my $filename = $_[0];
my $filename = shift;
my ($filename) = #_;
(my $filename) = #_; # Exact same as previous.
The other answers are sufficient to tell you why you were getting strange errors.
I would like to show you how a more experienced Perl programmer might write this subroutine.
use warnings;
use strict;
use autodie;
sub sort_file {
my( $filename ) = #_;
my #lines;
{
# 3 arg open
open my $in_fh, '<', $filename;
#lines = <$in_fh>;
close $in_fh;
}
# Schwartzian transform
my #sorted = map{
$_->[0]
} sort {
$a->[2] <=> $b->[2]
} map {
[ $_, split ' ', $_ ]
} #lines;
{
open my $out_fh, '>', $filename;
print {$out_fh} #sorted;
close $out_fh;
}
}
use strict;
prevents you from using a variable without declaring it (among other things).
use warnings;
Informs you of some potential errors.
use autodie;
Now you don't need to write open .... or die ....
{ open ...; #lines = <$fh>; close $fh }
Limits the scope of the FileHandle.
#sorted = map { ... } sort { ... } map { ... } #list
This is an examples of a Schwartzian transform, which reduces the number of times that the values are split. In this example, it may be overkill.
How confusing. Assigning $filename = #_ the way you are means that you are evaluating an array in scalar context, which means that $filename is assigned the number of elements in #_. Because you don't check to see if the first open call succeeds, reading the file 1 likely fails, but you continue anyway and open for writing a file named 1. The solution is to use $filename in an array context and begin your subroutine with ($filename) = #_ or $filename = shift.
Why aren't you using use strict by the way?
Always use:
use strict;
use warnings;
Then Perl will tell you when you're off the mark.
As you've observed, the notation:
$filename = #_;
means that an unscoped variable is assigned the number of elements in the argument list to the function, and since you pass one file, the name of the created file will be '1'.
You meant to write:
my($filename) = #_;
This provides list context for the array, and assigns $_[0] to $filename, ignoring any extra arguments to the function.
OK... nevermind. it just dawned on me. $filename = #_; makes no sense. should be $filename = #_[0]; . There goes 2 hours of my life. note to other perl noobs: beware.
I am currently working on a little parser.
i have had very good results with the first script! This was able to run great!
It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note 6142 records) - But note - the data are not separated, so the subequent work with the data is a bit difficult. Therefore i have a second script - see below!
Note - friends helped me with the both scripts. I need to introduce myself as a true novice who needs help in migration two in one. So, you see, my Perl-knowlgedge is not so elaborated that i am able to do the migration into one on my own! Any and all help would be great!
The first script: a spider and parser: it spits out the data like this:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326 Abensberg 09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
But i need to separate the data: with commas or someting like that!
And i have a second script. This part can do the CSV-formate. i want to ombine it with the spider-logic. But first lets have a look at the first script: with the great spider-logic.
see the code that is appropiate:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(#$row);
print OUTFILE "#$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( #_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
But as this-above script-unfortunatley does not take care for the separators i have had to take care for a method, that does look for separators. In order to get the data (output) separated.
So with the separation i am able to work with the data - and store it in a mysql-table.. or do something else...So here [below] are the bits - that work out the csv-formate Note - i want to put the code below into the code above - to combine the spider-logic of the above mentioned code with the logic of outputting the data in CSV-formate.
where to set in the code Question: can we identify this point to migrate the one into the other... !?
That would be amazing... I hope i could make clear what i have in mind...!? Are we able to use the benefits of the both parts (/scripts ) migrating them into one?
So the question is: where to set in with the CSV-Script into the script (above)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d; # strip carriage returns
$html =~ s/ / /g; # expand spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my #cols = qw(
rownum
number
name
phone
type
website
);
my #fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for #$row;
# load the fields into the hash using a "hash slice"
my %h;
#h{#cols} = #$row;
# derive some fields from base fields, again using a hash slice
#h{qw/name street postal town/} = split /\n+/, $h{name};
#h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for #h{qw/name street postal town/};
$csv->combine(#h{#fields});
print $csv->string, "\n";
}
}
The thing is that i have had very good results with the first script! It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note 6142 records) - But note - the data are not separated...!
And i have a second script. This part can do the CSV-formate. i want to combine it with the spider-logic.
where is the part to insert? I look forward to any and all help.
if i have to be more precice - just let me know...
Since you have entered a complete script, I'll assume you want critique of the whole thing.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
Since you only use $te in one block, why are you declaring and initializing it in this outer scope? The same question applies to most of your variables -- try to declare them in the innermost scope possible.
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
In general, english variable names will enable you to collaborate with far more people than german names. I understand german, so I understand the intent of your code, but most of SO doesn't.
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
Don't use & to call subs. Just call them with workDir;. It hasn't been necessary to use & since 1994, and it can lead to a nasty gotcha because &callMySub; is a special case which doesn't do what you might think, while callMySub; does the Right Thing.
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
Generally lexical filehandles are preferred these days: open my $outfile, ">file"; Also, you should check for errors from open or use autodie; to make open die on failure.
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(#$row);
print OUTFILE "#$row\n";
This is the line to change if you want to put commas in separating your data. Look at the join function, it can do what you want.
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
It's very strange to initialize $te at the end of the loop instead of the beginning. It's much more idiomatic to declare and initialize $te at the top of the loop.
}
sub cleanup() {
for ( #_ ) {
s/s+/ /g;
Did you mean s/\s+/ /g;?
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
I haven't commented on your second script; perhaps you should ask it as a separate question.