I have two directories, each containing pictures. The regional directory is updated every 5 minutes, the watch directory every 15.
What I am trying to do is find the newest file in each directory and use ImageMagick to combine those two files into a third image.
What I have works some of the time but is very inconsistent; for example, my code will sometimes miss the regional file when its timestamp matches the watch file's.
Other times it will merge two watch files, even though the watch files and regional files are in two separate directories.
I have no clue how to fix it.
Here is my code:
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use vars qw/%files_watch/;
use vars qw/%files_regional/;
sub findNewestFiles {
my $element = $File::Find::name;
return if ( !-f $element );
$files_watch{$element} = ( stat($element) )[10];
$files_regional{$element} = ( stat($element) )[10];
}
my $image_magick_exe = "composite.exe\"";
my $pic_dir = "C:\\eterra\\eterravision\\weather";
my $watch_dir = "C:\\eterra\\eterravision\\weather\\watch";
my $regional_dir = "C:\\eterra\\eterravision\\weather\\regional";
open( OUT, ">>names.txt" ) || die;
find( \&findNewestFiles, $watch_dir );
my $newestfile_watch;
my $time_watch = 0;
while ( my ( $t1, $t2 ) = each(%files_watch) ) {
if ( $t2 > $time_watch ) {
$newestfile_watch = $t1;
$time_watch = $t2;
}
}
$time_watch = localtime($time_watch);
find( \&findNewestFiles, $regional_dir );
my $newestfile_regional;
my $time_regional = 0;
while ( my ( $t3, $t4 ) = each(%files_regional) ) {
if ( $t4 > $time_regional ) {
$newestfile_regional = $t3;
$time_regional = $t4;
}
}
$time_regional = localtime($time_regional);
$newestfile_watch =~ s/\//\\/g;
$newestfile_regional =~ s/\//\\/g; #replacing the "/" in the file path to "\"
my @temp = split( /_/, $newestfile_regional );
my $type = $temp[0];
my $date = $temp[1];
my $time = $temp[2];
my $check = "$pic_dir/radarwatch\_$date\_$time"; #check if file was created
unless ( -e $check )
{
system("\"$image_magick_exe \"$newestfile_regional\" \"$newestfile_watch\" \"$pic_dir\\radarwatch\_$date\_$time\"");
print "file created\n";
}
I changed the [10] in the sub function to an [8], and a [9]. 8 is access time, 9 is modification time, and 10 is creation time (on Windows; on Unix it is inode change time). 10 has been the most successful.
I think the problem is with the sub function.
Is there a better way to search for the newest creation time? Something that is more reliable than what I have?
I think the crux of your problem is finding the most recent file in each directory, and then processing the pair. Leaving aside the details of the processing, here is a script that finds the most recent files; the ImageMagick work can all go into the process_latest subroutine. There is no need for File::Find, and File::stat lets us use names instead of trying to remember those stat-index numbers. The program also has a clearer structure.
use strict;
use warnings;
use File::stat;
my $watch_dir = "C:\\eterra\\eterravision\\weather\\watch";
my $regional_dir = "C:\\eterra\\eterravision\\weather\\regional";
# get the latest in each directory
my $latest_regional = get_latest_file($regional_dir);
my $latest_watch = get_latest_file($watch_dir);
# do whatever you want here...
process_latest ($latest_regional, $latest_watch);
# I exit 1 in Windows, exit 0 in Unix
exit 1;
#--------------------------------
# subroutines
#--------------------------------
sub get_latest_file {
my $dir = shift;
opendir my $DIR, $dir or die "$dir $!";
my $latest_time = -1;
my $latest_file = '';
FILE:
while (readdir($DIR)) {
my $file = "$dir\\$_";
next FILE unless -f $file;
my $file_time = stat($file)->mtime;
print "$file $file_time\n";
if ($file_time > $latest_time) {
$latest_time = $file_time;
$latest_file = $file;
}
}
closedir $DIR;
return $latest_file;
}
sub process_latest {
my $regional = shift;
my $watch = shift;
print "Latest Regional: $regional\n";
print "Latest Watch: $watch\n";
}
I've been using the CAM::PDF module to try to edit PDF docs at work, essentially just changing the date on docs automatically to show they have been reviewed recently.
Unfortunately, despite my code telling me that I am making changes to the PDF objects ($pdf->{changes}),
and despite giving the PDFs the doc is attempting to change maximum accessibility (anyone can access, read, write),
the output PDFs never materialise with these changes. I have also been grepping the object-node tmp files I dump en masse, and none of them show any sign of the old date after running the code; yet when I view the PDF afterwards, the old date is still on it. Has anyone encountered this before, or can anyone suggest anything?
Doing this manually isn't an option; I want a script I can run against multiple files at once (I have LOTS of these files to sort out at work), but other than changing the dates written on the doc, the doc has to keep looking much the same (by which I mean it would be OK if the files changed in size a little, but not OK if they completely changed in appearance).
I strictly followed the example changepdfstring.pl (https://metacpan.org/pod/distribution/CAM-PDF/bin/changepdfstring.pl) from the author of the CAM::PDF module, then tried different variations of it to get things to work, so I'm bemused that nothing has worked in the end.
#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;
use Data::Dumper;
my $pdf = CAM::PDF->new('Order fulfilment process flowchart.pdf');
if (!$pdf->canModify())
{
die "This PDF forbids modification\n";
}
my $olddate = "15.02.2019";
my $newdate = "22.02.2022";
foreach my $objectnumber (keys %{$pdf->{xref}}){
my $objectnode = $pdf->dereference($objectnumber);
$pdf->changeString($objectnode, {$olddate=>$newdate});
}
my $change = $pdf->{changes};
print Dumper($change);
my $count = 0;
foreach my $objectnumber (keys %{$pdf->{xref}}){
my $objectnode = $pdf->dereference($objectnumber);
$count++;
open (ONO, ">tmp.objectnode.$count");
print ONO Dumper($objectnode);
close (ONO);}
if (!scalar %{$pdf->{changes}})
{
die "no changes were made :(";
}
$pdf->preserveOrder();
$pdf->cleanoutput('pleasework.pdf');
Any help or advice would be greatly appreciated
A quick look at page 145 of the PDF specification [1] shows that there are two metadata fields that should allow a simple change to achieve what you are trying to do:
CreationDate
ModDate
Below you can find a quick script using CAM::PDF to set/modify the ModDate with the current date, thus giving the illusion of "modifying" the PDF.
The script can, if needed, be amended to use a specific date instead of the current time to set the modification date.
Please note that I'm not sure that CAM::PDF is the best option to get this task done.
The script is only a sample of what can be done within the limitations and simplicity of CAM::PDF.
[1] https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/pdf_reference_archives/PDFReference.pdf
#!/usr/bin/env perl
use strict;
use warnings;
use Time::Local;
use CAM::PDF;
use CAM::PDF::Node;
my $infile = shift || die 'syntax...';
my $outfile = shift || die 'syntax...';
my $pdf = CAM::PDF->new($infile) || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info});
if ($info) {
my @time = localtime(time);
my ($sec,$min,$hour,$mday,$mon,$year,$wday,$yday,$isdst) = @time;
$year += 1900;
$mon++;
my $gmt_offset_in_seconds = timegm(@time) - timelocal(@time);
my $gmt_offset_min = ($gmt_offset_in_seconds / 60) % 60;
my $gmt_offset_hour = abs(int($gmt_offset_in_seconds / (60*60)));
my $offset_char = "";
if ($gmt_offset_in_seconds < 0) {
$offset_char = "-";
} else {
$offset_char = "+";
}
my $date = sprintf("D:%04d%02d%02d%02d%02d%02d%s%02d'%02d'", $year, $mon, $mday, $hour, $min, $sec, $offset_char, $gmt_offset_hour, $gmt_offset_min);
my $objnum = undef;
for my $obj ('Creator', 'Producer', 'CreationDate') {
if (exists $info->{$obj} and exists $info->{$obj}->{objnum}) {
$objnum = $info->{$obj}->{objnum};
last;
}
}
die "Cannot find objnum, halting..." if not defined $objnum;
my $mod_date = $info->{ModDate};
if ($mod_date) {
$mod_date->{value} = $date;
} else {
my $mod_date = new CAM::PDF::Node('string',$date);
$mod_date->{gennum} = 0;
$mod_date->{objnum} = $objnum;
$info->{ModDate} = $mod_date;
}
$pdf->preserveOrder();
$pdf->cleanoutput($outfile);
} else {
print "Cannot find PDF info section, doing nothing!\n";
}
I'm the author of CAM::PDF. Without seeing the PDF, I can only guess but I'd wager that the problem is that $olddate simply doesn't match any text in the doc. Kerning can break strings into multiple parts, for example. Also, there are several different ways to encode strings that appear the same in the resulting doc. So the trick for you will be figuring out what is the pattern for the dates in your specific docs.
That said, I also like the clever idea that @Bruce Ramos offered in a separate answer. That approach won't change a date that's visible in the rendered PDF (if you print it, say), but it should show up as metadata in nearly any PDF viewer.
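If you want to see how the strings are actually encoded before guessing at patterns, a small sketch that dumps each page's raw content stream with CAM::PDF may help (numPages and getPageContent are standard CAM::PDF methods; the filename is a placeholder):

#!/usr/bin/perl
use strict;
use warnings;
use CAM::PDF;

# Dump the decompressed content stream of every page so you can see
# how the dates are split across Tj/TJ operators.
my $pdf = CAM::PDF->new('input.pdf') or die "$CAM::PDF::errstr\n";
for my $page (1 .. $pdf->numPages()) {
    print "--- page $page ---\n";
    print $pdf->getPageContent($page), "\n";
}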
I found that the line I was trying to edit was not actually a contiguous set of characters in the PDF; rather, it was inside a TJ operator in a BT block. I cannot see any provision in the CAM::PDF library for handling cases where the desired text is inside TJ arrays (although perhaps there is, @ChrisDolan?), hence it could not be operated on or swapped out by CAM::PDF. After decompressing all the streams (where applicable) I found this TJ line, which had the text I wished to operate on:
[(D)-20(a)24(t)62(e)-46(:)86( )-46(1)52(5)-37(.)70(0)-37(2)52(.)-20(2)52(0)-37(1)52(9)] TJ
I don't believe it would have been possible for CAM::PDF to act on TJ lines; perhaps it can only act on Tj lines.
For anyone looking for a quick answer to this same problem, this "dirty" script worked for me in this case:
#!/usr/bin/perl
use strict;
use Compress::Raw::Zlib;
use bytes;
open(OUT,'>', "newfromoldscript.pdf");
my $fname = 'Order fulfilment process flowchart.pdf';
open(FILE, '<:raw', $fname) || die("can't open($fname): $!");
$/ = undef;
my $file = <FILE>;
my $file_len = length($file);
my $i = 0;
my $offset = 0;
my $o;
do {
$o = doX(substr($file, $offset, $file_len), $i);
$offset+=$o;
$i++;
} while($o && $i< 100);
sub doX {
my $file = shift;
my $i = shift;
my $stream = index($file, "\nstream");
if ($stream < 0) {
print OUT $file;
return 0;
}
$stream++;
my $deflate = 1;
my $line_before = rindex(substr($file,0,$stream), "<<");
print OUT substr($file,0,$line_before);
my $x = substr($file, $line_before,$stream-$line_before);
if ($i == 22) {
print "";
}
my $stream_len;
if ($x =~ /FlateDecode\/Length (\d+)>>/) {
$stream_len = $1;
}
if ($x =~ /FlateDecode\/Length (\d+)\//) {
print "Warn Object $i has len/len what the even is this?\n";
$stream_len = $1;
}
if ($x =~ /XML\/Length (\d+)>>/) {
$deflate = 0;
$stream_len = $1;
}
if (!$stream_len) {
die("I fail with no stream len : $x");
}
print "-->$line_before,$i,$stream=$stream_len=$x<--\n";
my $bytes = substr($file, $stream+8,$stream_len);
my $orig_bytes = $bytes; # inflate seems to mangle bytes, so take a copy
my $o;
my $d=new Compress::Raw::Zlib::Inflate();
if ($deflate) {
$d->inflate($bytes,$o);
} else {
$o = $bytes;
}
my $orig_x = $x;
my $changes;
my %change = (
'-20(2)52(0)-37(.)52(.)' => '-20(2)52(0)-37(2)52(0)', #trialling different reg ex's here
'-37(1)52(9)'=>'-37(2)52(0)', #reg ex's
'Date: 15.02.2019'=>'Date: 12.02.2020',
'[(A)[\d-]+(p)[\d-]+(p)[\d-]+(r)[\d-]+(o)[\d-]+(ve)[\d-]+(d)[\d-]+( )[\d-]+(B[^\]]+\] TJ'=>'(Approved By: George W) Tj??G-TAG??' #scrap the whole TJ, replace for Tj
);
foreach my $re (keys %change) {
my $to = $change{$re};
$re =~ s/([\(\)])/\\\1/g; # escape round brackets
print $re;
open (GW, ">tmp.gw");
print GW $re;
close (GW);
if ($o=~/$re/m) {
$o =~ s/$re/$to/mg;
print $o;
$changes++;
}
}
if ($changes) {
print "\n MADE CHANGES\n";
#split, get rid of the ? mark tag
my @remains = split('\?\?G-TAG\?\?', $o);
my $firsthalf = $remains[0];
my $secondhalf = $remains[1];
#reverse the string
$firsthalf = scalar reverse ($firsthalf);
if ($firsthalf =~ m/fT 52\.8 2F/){print "FOUND THE REVERSE"}
$firsthalf =~ s/fT 52\.8 2F/fT 52\.8 0F/;
#reg ex to back track to the nearest and thus relevant Font/F and set it to F0
#put it back in correct orientation
$firsthalf = scalar reverse ($firsthalf);
$o = join("", $firsthalf, $secondhalf);
open (WEIRD, ">tmp.weird");
print WEIRD $firsthalf;
close (WEIRD);
$changes++;
my $d = new Compress::Raw::Zlib::Deflate();
my $obytes;
my $obytes2;
my $status = $d->deflate($o, $obytes);
$d->flush($obytes2);
$bytes = $obytes . $obytes2;
if (length($bytes) != $stream_len) {
my $l = length($bytes);
print "-->$x<--\n";
warn("what do we do here $l != $stream_len");
$orig_x =~ s/$stream_len/$l/;
}
print OUT $orig_x . "stream\r\n";
print OUT $bytes . "\r";
} else {
print OUT $orig_x . "stream\r\n";
print OUT $orig_bytes . "\r";
}
open(TMP,">out/tmp.$i.bytes");
print TMP $o;
close(TMP);
return $stream + 8 + $stream_len + 1;
}
Essentially I swap out the TJ for a Tj when changing someone else's name on the document to my name, which makes it simpler to insert my change (but is potentially messy). To get this to display with capitalised letters, I had to reverse the string and swap the font (F) it was under from F2 to F0.
For the TJ line relating to the date, I swapped out the TJ characters for the date I wished to change it to, which meant I had to abide by the unfriendly syntax that TJ operator lines follow.
I am trying to merge data from huge files to a combined file using Perl.
The files are held open, and large amounts of data are continuously added to them, around 50,000 lines per minute.
The files are stored in a network shared folder accessed by between 10 and 30 machines.
These are JTL files generated by JMeter.
This merge runs every minute for about 6 or 7 hours, and the time taken should not be more than 30 to 40 seconds.
The process is triggered every minute by a Web Application deployed in a Linux machine.
I have written a script which stores, in separate files, the last line that each individual file contributed to the combined file.
This works fine for up to 15 minutes, but after that the merge time increases steadily.
My script
#!/usr/bin/perl
use File::Basename;
use File::Path;
$consolidatedFile = $ARGV[0];
$testEndTimestamp = $ARGV[1];
@csvFiles = @ARGV[ 2 .. $#ARGV ];
$testInProcess = 0;
$newMerge = 0;
$lastLines = "_LASTLINES";
$lastLine = "_LASTLINE";
# time() gives current time timestamp
if ( time() <= $testEndTimestamp ) {
$testInProcess = 1;
}
# File exists, has a size of zero
if ( -z $consolidatedFile ) {
mkdir $consolidatedFile . $lastLines;
$newMerge = 1;
}
open( CONSOLIDATED, ">>" . $consolidatedFile );
foreach my $file ( @csvFiles ) {
open( INPUT, "<" . $file );
@linesArray = <INPUT>;
close INPUT;
if ( $newMerge ) {
print CONSOLIDATED @linesArray[ 0 .. $#linesArray - 1 ];
open my $fh, ">", $consolidatedFile . $lastLines . "/" . basename $file . $lastLine;
print $fh $linesArray[ $#linesArray - 1 ];
close $fh;
}
else {
open( AVAILABLEFILE, "<" . $consolidatedFile . $lastLines . "/" . basename $file . $lastLine );
@lineArray = <AVAILABLEFILE>;
close AVAILABLEFILE;
$availableLastLine = $lineArray[0];
open( FILE, "<" . $file );
while ( <FILE> ) {
if ( /$availableLastLine/ ) {
last;
}
}
@grabbed = <FILE>;
close( FILE );
if ( $testInProcess ) {
if ( $#grabbed > 0 ) {
pop @grabbed;
print CONSOLIDATED @grabbed;
open( AVAILABLEFILE, ">" . $consolidatedFile . $lastLines . "/" . basename $file . $lastLine );
print AVAILABLEFILE $grabbed[ $#grabbed - 1 ];
}
close AVAILABLEFILE;
}
else {
if ( $#grabbed >= 0 ) {
print CONSOLIDATED @grabbed;
}
}
}
}
close CONSOLIDATED;
if ( !$testInProcess ) {
rmtree $consolidatedFile . $lastLines;
}
I need to optimize the script in order to reduce the time.
Is it possible to store last line in a cache?
Can anyone suggest another way for this type of merging?
Here is another script, which stores the last line in a cache instead of a file.
Even this does not complete the merge within 1 minute.
#!/usr/bin/perl
use CHI;
use File::Basename;
use File::Path;
my $cache = CHI->new(
driver => 'File',
root_dir => '/path/to/root'
);
$consolidatedFile = $ARGV[0];
$testEndTimestamp = $ARGV[1];
@csvFiles = @ARGV[ 2 .. $#ARGV ];
$testInProcess = 0;
$newMerge = 0;
$lastLines = "_LASTLINES";
$lastLine = "_LASTLINE";
# time() gives current time timestamp
if ( time() <= $testEndTimestamp ) {
$testInProcess = 1;
}
# File exists, has a size of zero
if ( -z $consolidatedFile ) {
$newMerge = 1;
}
open( CONSOLIDATED, ">>" . $consolidatedFile );
foreach my $file (@csvFiles) {
$fileLastLineKey =
$consolidatedFile . $lastLines . "_" . basename $file . $lastLine;
open( INPUT, "<" . $file );
@linesArray = <INPUT>;
close INPUT;
if ($newMerge) {
print CONSOLIDATED @linesArray[ 0 .. $#linesArray - 1 ];
$fileLastLine = $linesArray[ $#linesArray - 1 ];
$cache->set( $fileLastLineKey, $fileLastLine );
}
else {
$availableLastLine = $cache->get($fileLastLineKey);
open( FILE, "<" . $file );
while (<FILE>) {
if (/$availableLastLine/) {
last;
}
}
@grabbed = <FILE>;
close(FILE);
if ($testInProcess) {
if ( $#grabbed > 0 ) {
pop @grabbed;
print CONSOLIDATED @grabbed;
$fileLastLine = $grabbed[ $#grabbed - 1 ];
$cache->set( $fileLastLineKey, $fileLastLine );
}
}
else {
if ( $#grabbed >= 0 ) {
print CONSOLIDATED @grabbed;
$cache->remove($fileLastLineKey);
}
}
}
}
close CONSOLIDATED;
I am thinking of reading each file from the last processed line up to the required line and copying those lines to the consolidated file.
Can anyone suggest anything about this approach?
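A minimal sketch of that idea, assuming one small state file per source file to remember the byte offset (all names and paths here are placeholders): record where you stopped with tell, and resume from there with seek instead of re-scanning for the last line.

# Sketch: append only the new, complete lines of $source to $out_fh,
# resuming from the byte offset stored in $offset_file.
sub append_new_lines {
    my ($source, $offset_file, $out_fh) = @_;

    my $offset = 0;
    if (open my $off_in, '<', $offset_file) {
        $offset = <$off_in> // 0;
        chomp $offset;
    }

    open my $in, '<', $source or die "$source: $!";
    seek $in, $offset, 0;

    my $pos = tell $in;
    while (my $line = <$in>) {
        last unless $line =~ /\n\z/;   # ignore a partially written last line
        print {$out_fh} $line;
        $pos = tell $in;               # only advance past complete lines
    }
    close $in;

    open my $off_out, '>', $offset_file or die "$offset_file: $!";
    print {$off_out} $pos;
    close $off_out;
}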
You may want to try opening the file in binmode and reading it blockwise in a loop; this usually offers significant performance improvements. The following function is an example: here I put at most $maxblocks blocks of a file, from block $offset onwards, into an array passed by reference. Note that the last block may not contain the full $block bytes when the file is not large enough.
sub file2binarray {
my $file=shift;
my $array=shift;
my $maxblocks=shift;
my $offset=shift;
my $block=2048;
$offset=0 if ((!defined($offset)) || ($offset !~/^\s*\d+\s*$/o));
$maxblocks="ALL"
if (!defined($maxblocks) || ($maxblocks!~/^\s*\d+\s*$/o));
my $size=(stat($file))[7];
my $mb=$size/$block;
$mb++ if ($mb*$block<$size);
$maxblocks=$mb-$offset if(($maxblocks eq "ALL")||
($maxblocks>$mb-$offset));
$offset*=$block;
open(IN,"$file") || die("Cannot open file <$file>\n");
binmode(IN);
seek(IN,$offset,0);
my ($blk,$bytes_read,$buffer)=(0,$block,"");
while (($bytes_read==$block)&& ($blk<$maxblocks)){
$bytes_read=sysread(IN,$buffer,$block);
push(@$array,$buffer);
$blk++;
}
close(IN);
}
To read the entire file at once, you can call it like this:
my @array;
my $filename="somefile";
file2binarray ($filename,\@array,"ALL",0);
but probably you'd rather call it in a loop with some bookkeeping over the offset, and parse the array in between subsequent calls.
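A sketch of that bookkeeping, assuming you advance the offset by the number of blocks you just read (the block count per pass and the processing step are placeholders):

my @chunk;
my $offset = 0;

# Read up to 4 blocks per pass, process them, then advance the offset.
while (1) {
    @chunk = ();
    file2binarray($filename, \@chunk, 4, $offset);
    last unless @chunk;          # nothing left to read
    # ... parse the blocks in @chunk here ...
    $offset += scalar @chunk;    # the offset is counted in blocks
}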
Hope this helps.
For this script, I am pulling a CSV file that includes what needs to be found and what the replacement is. Those values, $pattern1 and $replacement1, are then handed to a find & replace function. Ideally this will take the CSV key file and do an in-place replacement on the raw data file.
use English;
use strict;
use warnings;
sub inplace_sanitize {
my ( $datafile, $pattern1, $replacement1 ) = @_;
local @ARGV = ( $datafile ),
my $INPLACE_EDIT = '.back';
while ( <> ) {
s/\Q$pattern1/$replacement1/g;
#print;
}
}
sub main
{
# Select Key for Find & Replace
my $filename = 'stmre_fr_key.csv';
open(INPUT, $filename) or die "Cannot open $filename";
# Read the header line.
my $line = <INPUT>;
# Read the lines one by one.
while($line = <INPUT>)
{
chomp($line);
#Split & Assign
my ($replacement1, $pattern1) = split(',', $line);
# Select Data File
my $datafile = 'rawdata.csv';
#Find & Replace Data File
&inplace_sanitize( $datafile, $pattern1, $replacement1 );
}
}
close(INPUT);
main();
So this is not working; it doesn't perform the replacement. Without the inplace_sanitize call it prints out $replacement1 and $pattern1 correctly, and inplace_sanitize works by itself if you hard-code $replacement1 = 'replace'; and $pattern1 = 'find';. But together they don't work. Any ideas?
Samples:
$replacement1 = '7306e005';
$pattern1 = 'leighs_satcon011016001_00753b94';
stmre_fr_key.csv:
find,replace
leighs_satcon011016001_00753b94,7306e005
leighs_satcon011016001_00753b95,7306e006
.
.
.
Your use of my $INPLACE_EDIT is your problem. You want to affect the global variable:
local $INPLACE_EDIT = '.back';
The same way you're treating @ARGV.
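Putting it together, the corrected sub might look like this (note that the print inside the loop must not be commented out, or the in-place edit writes an empty file):

use English;

sub inplace_sanitize {
    my ( $datafile, $pattern1, $replacement1 ) = @_;
    local @ARGV = ( $datafile );
    local $INPLACE_EDIT = '.back';   # the English alias for $^I
    while ( <> ) {
        s/\Q$pattern1/$replacement1/g;
        print;                       # writes the (edited) line back
    }
}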
I have a list of words and I want to group them depending on whether they are verbs/adjectives/nouns/etc. So, basically I am looking for a Perl module which tells whether a word is a verb/noun/etc.
I googled but couldn't find what I was looking for. Thanks.
Lingua::EN::Tagger, Lingua::EN::Semtags::Engine, Lingua::EN::NamedEntity
See the Lingua::EN:: namespace in CPAN. Specifically, Link Grammar and perhaps Lingua::EN::Tagger can help you. Also WordNet provides that kind of information and you can query it using this perl module.
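For example, a minimal Lingua::EN::Tagger run (new and add_tags are the module's documented interface; note that isolated words without sentence context tend to default to noun):

use strict;
use warnings;
use Lingua::EN::Tagger;

my $tagger = Lingua::EN::Tagger->new;

# add_tags marks each word with an XML-style part-of-speech tag,
# producing something like "<det>the</det> <nn>dog</nn> <vbz>runs</vbz>"
print $tagger->add_tags('the dog runs quickly'), "\n";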
The following Perl code helps you compute all of these measures for the text files in a folder: give it the path of a directory and it will process all the files at once and save the result in a report.txt file.
#!/usr/local/bin/perl
# Recursively process each file in the directory, one call per file
sub fact
{
# Retrieve the arguments: current index, array ref of
# file names, and the number of files
my $x = $_[0];
my @names = @{$_[1]};
my $length = $_[2];
# keep going while there are still files left to process
if ($x < $length)
{
#print $names[$x],"\n";
use Lingua::EN::Fathom;
my $text = Lingua::EN::Fathom->new();
# Analyse contents of a text file
$dirlocation="./2015/";
$path =$dirlocation.$names[$x];
$text->analyse_file($path); # Analyse contents of a text file
$accumulate = 1;
# Analyse contents of a text string
$text->analyse_block($text_string,$accumulate);
# TO Do, remove repetition
$num_chars = $text->num_chars;
$num_words = $text->num_words;
$percent_complex_words = $text->percent_complex_words;
$num_sentences = $text->num_sentences;
$num_text_lines = $text->num_text_lines;
$num_blank_lines = $text->num_blank_lines;
$num_paragraphs = $text->num_paragraphs;
$syllables_per_word = $text->syllables_per_word;
$words_per_sentence = $text->words_per_sentence;
# unique words and their frequencies
%words = $text->unique_words;
foreach $word ( sort keys %words )
{
# print("$words{$word} :$word\n");
}
$fog = $text->fog;
$flesch = $text->flesch;
$kincaid = $text->kincaid;
use strict;
use warnings;
use 5.010;
my $filename = 'report.txt';
open(my $fh, '>>', $filename) or die "Could not open file '$filename' $!";
say $fh $text->report;
close $fh;
say 'done';
print($text->report);
$x = $x+1;
fact($x,\@names,$length);
}
# Once the index reaches the file count, fall through
# to the else branch and finish up
else
{
done();
}
}
# Driver Code
$a = 0;
@names = ("John Paul", "Lisa", "Kumar","touqeer");
opendir DIR1, "./2015" or die "cannot open dir: $!";
my #default_files= grep { ! /^\.\.?$/ } readdir DIR1;
$length = scalar #default_files;
print $length;
# Function call and printing result after return
fact($a,\@default_files,$length);
sub done
{
print "Done!";
}
I am currently working on a little parser.
I have had very good results with the first script! It runs great!
It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note: 6142 records). But the data are not separated, so the subsequent work with them is a bit difficult. Therefore I have a second script; see below!
Note: friends helped me with both scripts. I should introduce myself as a true novice who needs help merging the two into one. So, you see, my Perl knowledge is not elaborate enough to do the migration on my own! Any and all help would be great!
The first script, a spider and parser, spits out the data like this:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326 Abensberg 09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
But I need the data separated, with commas or something like that!
And I have a second script that can produce the CSV format. I want to combine it with the spider logic. But first let's have a look at the first script, with the great spider logic.
Here is the relevant code:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
But since the script above unfortunately does not take care of separators, I had to find a method that produces them, in order to get the output separated.
With the separators I am able to work with the data and store it in a MySQL table, or do something else. So here below are the bits that produce the CSV format. Note: I want to put the code below into the code above, to combine the spider logic of the first script with the CSV output of the second.
The question: can we identify the point where the one should be merged into the other?
That would be amazing. I hope I have made clear what I have in mind: can we use the benefits of both parts (/scripts), migrating them into one?
So the question is: where does the CSV script fit into the script above?
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d; # strip carriage returns
$html =~ s/&nbsp;/ /g; # expand non-breaking spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for @$row;
# load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
# derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /\n+/, $h{name};
@h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
The thing is that I have had very good results with the first script! It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note: 6142 records). But the data are not separated!
And I have a second script that can produce the CSV format; I want to combine it with the spider logic.
Where is the part to insert? I look forward to any and all help.
If I have to be more precise, just let me know.
Since you have entered a complete script, I'll assume you want critique of the whole thing.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
Since you only use $te in one block, why are you declaring and initializing it in this outer scope? The same question applies to most of your variables -- try to declare them in the innermost scope possible.
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
In general, English variable names will enable you to collaborate with far more people than German names. I understand German, so I understand the intent of your code, but most of SO doesn't.
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
Don't use & to call subs. Just call them as workDir;. It hasn't been necessary to use & since 1994, and it can lead to a nasty gotcha, because &callMySub; is a special case which doesn't do what you might think, while callMySub; does the Right Thing.
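A tiny demonstration of the gotcha (the sub names are made up): calling a sub as &name; with no parentheses silently passes the caller's own @_ along.

sub show_args { print "got: @_\n" }

sub outer {
    show_args();   # normal call: @_ is empty
    &show_args;    # & with no parens: reuses outer's own @_
}

outer('surprise'); # prints "got: " and then "got: surprise"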
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
Generally lexical filehandles are preferred these days: open my $outfile, ">file"; Also, you should check for errors from open or use autodie; to make open die on failure.
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
This is the line to change if you want commas separating your data. Look at the join function; it can do what you want.
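For example, something like this (though fields that themselves contain commas would need quoting, which the Text::CSV module from your second script handles properly):

print OUTFILE join(',', @$row), "\n";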
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
It's very strange to initialize $te at the end of the loop instead of the beginning. It's much more idiomatic to declare and initialize $te at the top of the loop.
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
Did you mean s/\s+/ /g;?
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
I haven't commented on your second script; perhaps you should ask it as a separate question.