I'm trying to combine an item from an array with a scalar to build $range. Here's how I'm trying to do it.
my $rowinc = 2;
my $colarray = @collet[0];
my $range = $colarray $rowinc;
chomp $range;
$sheet->Range($range)->{Value} = $ir;
shift @collet;
$sheet->Range($range)->{Value} = $sn;
shift @collet;
$sheet->Range($range)->{Value} = (join(", ", @parts));
shift @collet;
$sheet->Range($range)->{Value} = $ref;
$rowinc++;
unshift @collet, 'C';
unshift @collet, 'B';
unshift @collet, 'A';
I've tried multiple ways of doing this and to no avail. Here's the error I receive while running this particular snippet.
Scalar found where operator expected at gen1.pl line 87, near "$colarray $rowinc
"
(Missing operator before $rowinc?)
syntax error at gen1.pl line 87, near "$colarray $rowinc"
Execution of gen1.pl aborted due to compilation errors.
Press any key to continue . . .
I'm assuming that the array can't be used in that manner to denote the value for $range. The problem I'm running into is that I am using Win32::OLE to manage my Excel spreadsheets, because it gives me the ability to open an already existing spreadsheet. The drawback is that I cannot enter my cell ranges as integers ($row, $col); I've tried just incrementing $row and $col. I want to be able to manage this effectively instead of using a bunch of if/else statements and whatnot.
What I've tried to do is this: my @collet = ('A', 'B', 'C', 'D');
This tells me which column to print in. If I start at 0, it should start printing in column A, which is good, and then every time it prints in a column it shifts right, so now @collet[0] should be 'B'. I know this isn't the best method, but I've changed my original method to this in hopes of solving the issue. Any help would be awesome!
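The "Missing operator" message points at the two scalars sitting side by side: Perl joins strings with the `.` operator (or by interpolation), not by juxtaposition. Below is a minimal sketch of building A1-style cell addresses from the column letters and row number described above. `cell_address` is a hypothetical helper introduced only for illustration, and nothing here touches Win32::OLE.

```perl
use strict;
use warnings;

my @collet = ('A' .. 'D');   # column letters, as in the question
my $rowinc = 2;              # first data row

# Hypothetical helper: join a column letter and a row number into "A2" etc.
sub cell_address {
    my ($col_letter, $row) = @_;
    return $col_letter . $row;
}

# Walk the columns by index instead of shift/unshift-ing the array
for my $i (0 .. $#collet) {
    my $range = cell_address($collet[$i], $rowinc);
    print "$range\n";        # each $range could be passed to $sheet->Range($range)
}
```

Note that a single array element is written `$collet[0]`; `@collet[0]` is a one-element slice and draws a warning under `use warnings`.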
Here's my full script for some context.
#!C:\Perl\bin
#manage IR/SRN/Parts Used
#Written for Zebra
#Author Dalton Brady
#Location Bentonville AR
use strict;
use warnings;
use POSIX qw(strftime);
use Win32::OLE;
use Win32::OLE qw( in with);
use Win32::OLE::Const 'Microsoft Excel';
use Win32::OLE::Variant;
$Win32::OLE::Warn = 3; #Die on errors
#different types of units worked on, trying to name the worksheet
#after one depending on user input have yet to add that func.
my @uut = ('VC5090', 'MK2046', 'MK2250', 'MK4900', '#pos');
my $ref = strftime '%Y-%m-%d', localtime(); #create the datestamp
my $i = 0; #increment counter for while loop
my $n = 0; #increment for do until loop
my $xfile = 'X:\dataTest\data.xls'; #location of excel file
my $book; #place for the workbook to exist
my $sheet; #worksheet exists here
my $ex; #a place to hold the excel application
my $row = 2; #track row in which data will
#be placed with the do until loop
my $col = 1;
my @parts ; # store the different parts as a list within
# an area (to be written to the spreadsheet)
my @talk = ( 'IR:', 'SN:',
'#of Parts Used: ', 'PN:',
'Units Completed: ');
my @collet = "A" .. "Z"
# start an instance of excel
$ex = Win32::OLE->new('Excel.Application');
$ex->{DisplayAlerts} = 0; #turn off alerts/textboxes/saveboxes
#check to see if excel file exists if yet open it, if not create it
if (-e $xfile) {
$book = $ex->Workbooks->Open($xfile);
$sheet = $book->Worksheets("Test");
$sheet->Activate();
}
else {
$book = $ex->Workbooks->Add()
; #create new workbook to be used because we couldn't find one
#########SETUP YOUR EXCEL FILE#############
$sheet = $book->Worksheets("Sheet1");
$sheet->Activate();
$sheet->{Name} = "Test";
$sheet->Cells("a1")->{HorizontalAlignment} = x1HAlignCenter;
$sheet->Cells("b1")->{HorizontalAlignment} = x1HAlignCenter;
$sheet->Cells("c1")->{HorizontalAlignment} = x1HAlignCenter;
$sheet->Columns("a")->{ColumnWidth} = 20;
$sheet->Columns("b")->{ColumnWidth} = 20;
$sheet->Columns("c")->{ColumnWidth} = 30;
$sheet->Range("a1")->{Value} = "IR Number";
$sheet->Range("b1")->{Value} = "Serial Number";
$sheet->Range("c1")->{Value} = "Parts Used";
$sheet->Range("d1")->{Value} = "Date";
$book->SaveAs($xfile); #Save the file we just created
}
# ask for how many units user will be
# scanning or has completed to be scanned
print $talk[4] ;
#unit count tracker, determines how many times the do while loop runs
my $uc = <>;
do {
print $talk[0]; #ask for the IR number
my $ir = <>;
chomp $ir;
print $talk[1]; #ask for uut Serial Number
my $sn = <>;
chomp $sn;
print $talk[2];
# ask for the number of parts used, to regulate
# the parts list storage into the @parts array
my $pu = <>;
while ($i < $pu) {
print $talk[3];
my $scan = <>;
chomp $scan;
push @parts, $scan;
$i++;
}
my $rowinc = 2;
my $colarray = @collet[0];
my $range = $colarray $rowinc;
chomp $range;
$sheet->Range($range)->{Value} = $ir;
shift @collet;
$sheet->Range($range)->{Value} = $sn;
shift @collet;
$sheet->Range($range)->{Value} = (join(", ", @parts));
shift @collet;
$sheet->Range($range)->{Value} = $ref;
$rowinc++;
unshift @collet, 'C';
unshift @collet, 'B';
unshift @collet, 'A';
} until ($n == $uc);
# save and exit
$book->SaveAs($xfile);
$book = $ex->WorkBooks->Close();
undef $book;
undef $ex;
I am working on a program in Perl and my output is wrong and taking forever to process. The code is meant to take in a large DNA sequence file and read through it in 15-letter increments (k-mers), stepping forward 1 position at a time. I'm supposed to enter the k-mer sequences into a hash, with their value being the number of occurrences of that k-mer, meaning each key should be unique and, when a duplicate is found, the count for that particular k-mer should be increased. I know from my professor's expected output file that I have too many lines, so it is allowing duplicates and not counting correctly. It's also running for 5+ minutes, so I have to press Ctrl+C to escape. When I look at kmers.txt, the file is at least written and formatted correctly.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# countKmers.pl
# Open file /scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta
# Identify all k-mers of length 15, load them into a hash
# and count the number of occurences of each k-mer. Each
# unique k-mer and its' count will be written to file
# kmers.txt
#Create an empty hash
my %kMersHash = ();
#Open a filehandle for the output file kmers.txt
unless ( open ( KMERS, ">", "kmers.txt" ) ) {
die $!;
}
#Call subroutine to load Fly Chromosome 2L
my $sequenceRef = loadSequence("/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta");
my $kMer = 15; #Set the size of the sliding window
my $stepSize = 1; #Set the step size
for (
#The sliding window's start position is 0
my $windowStart = 0;
#Prevent going past end of the file
$windowStart <= ( length($$sequenceRef) - $kMer );
#Advance the window by the step size
$windowStart += $stepSize
)
{
#Get the substring from $windowStart for length $kMer
my $kMerSeq = substr( $$sequenceRef, $windowStart, $kMer );
#Call the subroutine to iterate through the kMers
processKMers($kMerSeq);
}
sub processKMers {
my ($kMerSeq) = @_;
#Initialize $kCount with at least 1 occurrence
my $kCount = 1;
#If the key already exists, the count is
#increased and changed in the hash
if ( not exists $kMersHash{$kMerSeq} ) {
#The hash key=>value is loaded: kMer=>count
$kMersHash{$kMerSeq} = $kCount;
}
else {
#Increment the count
$kCount ++;
#The hash is updated
$kMersHash{$kMerSeq} = $kCount;
}
#Print out the hash to filehandle KMERS
for (keys %kMersHash) {
print KMERS $_, "\t", $kMersHash{$_}, "\n";
}
}
sub loadSequence {
#Get my sequence file name from the parameter array
my ($sequenceFile) = @_;
#Initialize my sequence to the empty string
my $sequence = "";
#Open the sequence file
unless ( open( FASTA, "<", $sequenceFile ) ) {
die $!;
}
#Loop through the file line-by-line
while (<FASTA>) {
#Assign the line, which is in the default
#variable to a named variable for readability.
my $line = $_;
#Chomp to get rid of end-of-line characters
chomp($line);
#Check to see if this is a FASTA header line
if ( $line !~ /^>/ ) {
#If it's not a header line append it
#to my sequence
$sequence .= $line;
}
}
#Return a reference to the sequence
return \$sequence;
}
Here's how I would write your application. The processKMers subroutine boils down to just incrementing a hash element, so I've removed that. I've also altered the identifiers to match the snake_case that is more usual in Perl code, and I didn't see any point in load_sequence returning a reference to the sequence, so I've changed it to return the string itself.
use strict;
use warnings 'all';
use constant FASTA_FILE => '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
use constant KMER_SIZE => 15;
use constant STEP_SIZE => 1;
my $sequence = load_sequence( FASTA_FILE );
my %kmers;
for (my $offset = 0;
$offset + KMER_SIZE <= length $sequence;
$offset += STEP_SIZE ) {
my $kmer_seq = substr $sequence, $offset, KMER_SIZE;
++$kmers{$kmer_seq};
}
open my $out_fh, '>', 'kmers.txt' or die $!;
for ( keys %kmers ) {
printf $out_fh "%s\t%d\n", $_, $kmers{$_};
}
sub load_sequence {
my ( $sequence_file ) = @_;
my $sequence = "";
open my $fh, '<', $sequence_file or die $!;
while ( <$fh> ) {
next if /^>/;
chomp;
$sequence .= $_;
}
return $sequence;
}
Here's a more explicit way to increment a hash element, without using ++ on the hash element directly
my $n;
if ( exists $kMersHash{$kMerSeq} ) {
$n = $kMersHash{$kMerSeq};
}
else {
$n = 0;
}
++$n;
$kMersHash{$kMerSeq} = $n;
Everything looks fine in your code besides processKMers. The main issues:
$kCount is not persistent between calls to processKMers, so in your else statement, $kCount will always be 2
You are printing the whole hash every time you call processKMers, which is what is slowing you down and why kmers.txt has too many lines. Printing that often slows your process down significantly; wait until the end of your program and print once.
Keeping your code mostly the same:
sub processKMers {
my ($kMerSeq) = @_;
if ( not exists $kMersHash{$kMerSeq} ) {
$kMersHash{$kMerSeq} = 1;
}
else {
$kMersHash{$kMerSeq}++;
}
}
Then you want to move your print logic to immediately after your for-loop.
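As a rough illustration of "count first, print once", here is a self-contained sketch. It assumes a short in-memory toy sequence instead of the FASTA file, and `count_kmers` is a hypothetical name, not part of the original script.

```perl
use strict;
use warnings;

# Count every k-mer of length $k in $sequence, sliding one base at a time
sub count_kmers {
    my ($sequence, $k) = @_;
    my %count;
    for my $offset (0 .. length($sequence) - $k) {
        $count{ substr $sequence, $offset, $k }++;
    }
    return \%count;
}

my $counts = count_kmers('ACGTACGTAC', 4);   # toy sequence, not the real FASTA data

# Print once, after all counting is done
for my $kmer (sort keys %$counts) {
    print "$kmer\t$counts->{$kmer}\n";
}
```

Because the hash is only written out after the loop finishes, each k-mer appears on exactly one line of output.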
EDITED: I'm attempting to create a brief script that calls for an input fixed width file and a file with the start position and length of each attribute and then outputs the file as CSV instead of fixed width. I haven't messed with removing whitespace yet and am currently focusing on building the file reader portion.
Fixed:
My current issue is that this code returns data from the third row for $StartPosition and from the fourth row for $Length when they should both be first found on the first row of COMMA. I have no idea what is prompting this behavior.
Next issue: It only reads the first record in practice_data.txt I'm guessing it's something where I need to tell COMMA to go back to the beginning?
while (my $sourceLine = <SOURCE>) {
$StartPosition = 0;
$Length = 0;
$Output = "";
$NextRecord ="";
while (my $commaLine = <COMMA>) {
my $Comma = index($commaLine, ',');
print "Comma location found at $Comma \n";
$StartPosition = substr($commaLine, 0, $Comma);
print "Start position is $StartPosition \n";
$Comma = $Comma + 1;
$Length = substr($commaLine, $Comma);
print "Length is $Length \n";
$NextRecord = substr($sourceLine, $StartPosition, $Length);
$Output = "$Output . ',' . $NextRecord";
}
print OUTPUT "$Output \n";
}
practice_data.txt
1234512345John Doe 123 Mulberry Lane Columbus Ohio 43215Johnny Jane
5432154321Jason McKinny 423 Thursday Lane Columbus Ohio 43212Jase Jamie
4321543212Mike Jameson 289 Front Street Cleveland Ohio 43623James Sarah
Each record is 100 characters long.
Definitions.txt:
0,10
10,10
20,10
30,20
50,10
60,10
70,5
75,15
90,10
It always helps to provide enough information so that we can at least do some testing without having to read your code and imagine what the data must look like.
I suggest you use unpack, after first building a template from the file that holds the field specifications. Note that the A field specifier trims trailing spaces from the data.
It is all but essential to use the Text::CSV module to parse or generate well-formed CSV data. And I have used the autodie pragma to avoid having to explicitly check and report on the status of every I/O operation.
I have used this data
my_source_data.txt
12345678  ABCDE1234FGHIJK
my_field_spec.txt
0,8
10,5
15,4
19,6
And this program
use strict;
use warnings;
use 5.010;
use autodie;
use Text::CSV;
my @template;
open my $field_fh, '<', 'my_field_spec.txt';
while ( <$field_fh> ) {
my (@info) = /\d+/g;
die unless @info == 2;
push @template, sprintf '@%dA%d', @info;
}
my $template = "@template";
open my $source_fh, '<', 'my_source_data.txt';
my $csv = Text::CSV->new( { binary => 1, eol => $/ } );
while ( <$source_fh> ) {
my @fields = unpack $template;
$csv->print(\*STDOUT, \@fields);
}
}
output
12345678,ABCDE,1234,FGHIJK
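To see what the generated template looks like without any files or Text::CSV, here is a small sketch that builds it from hard-coded (offset, length) pairs and applies it to one in-memory record. The `@specs` array is an assumption standing in for my_field_spec.txt.

```perl
use strict;
use warnings;

# Field specs as (offset, length) pairs, matching my_field_spec.txt above
my @specs = ( [0, 8], [10, 5], [15, 4], [19, 6] );

# "@N" jumps to absolute offset N; "AM" reads M characters and strips trailing spaces
my $template = join ' ', map { sprintf '@%dA%d', @$_ } @specs;

my $record = '12345678  ABCDE1234FGHIJK';
my @fields = unpack $template, $record;

print join(',', @fields), "\n";
```

The plain join is enough for this data; Text::CSV remains the safer choice once fields can contain commas or quotes.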
It looks like you're slightly confused about how to read the contents of the COMMA filehandle. Each time you read <COMMA>, you're reading another line from that file. Instead, read a line into a scalar, like my $line = <FH>, and use that instead:
while (my $source_line = <SOURCE>) {
$StartPosition = 0;
$Length = 0;
$Output = "";
$Input = $source_line; #the record was read into $source_line, not $_
$NextRecord ="";
seek(COMMA, 0, 0); #rewind the definitions file for every source record
while (my $comma_line = <COMMA>) {
chomp $comma_line;
my $Comma = index($comma_line, ',');
print "Comma location found at $Comma \n";
$StartPosition = substr($comma_line, 0, $Comma);
print "Start position is $StartPosition \n";
$Length = substr($comma_line, $Comma + 1); #skip past the comma itself
print "Length is $Length \n";
$NextRecord = substr($Input, $StartPosition, $Length) . ','; #. concatenates, + would add numerically
$Output = "$Output$NextRecord";
}
print OUTPUT "$Output \n";
}
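An alternative to rereading the definitions file for every record is to parse it once into an array and reuse that array. The sketch below assumes in-memory stand-ins for both files, with made-up records and field widths; `record_to_csv` is a hypothetical helper.

```perl
use strict;
use warnings;

# Stand-ins for the two input files (the real script would read SOURCE and COMMA)
my @definition_lines = ("0,5\n", "5,5\n", "10,10\n");
my @source_lines     = ("1234554321JohnDoe123\n", "5432112345JaneRoe456\n");

# Parse the definitions once, up front, into [start, length] pairs
my @specs = map { my ($start, $len) = /(\d+),(\d+)/; [ $start, $len ] } @definition_lines;

# Hypothetical helper: cut one fixed-width record into comma-separated fields
sub record_to_csv {
    my ($record, $specs) = @_;
    return join ',', map { substr $record, $_->[0], $_->[1] } @$specs;
}

for my $line (@source_lines) {
    chomp $line;
    print record_to_csv($line, \@specs), "\n";
}
```

Since the specs live in memory, every source record sees all of them, which avoids the "only the first record is processed" symptom.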
I have a problem that bothers me a lot...
I have a file with two columns (thanks to your help in a previous question) like:
14430001 0.040
14430002 0.000
14430003 0.990
14430004 1.000
14430005 0.050
14430006 0.490
....................
The first column holds coordinates, the second probabilities.
I am trying to find the blocks with probability >=0.990 that are more than 100 in size.
As output I want to be like this:
14430001 14430250
14431100 14431328
18750003 18750345
.......................
where the first column has the coordinate of the start of each block and the second the end of it.
I wrote this script:
use strict;
#use warnings;
use POSIX;
my $scores_file = $ARGV[0];
#finds the highly conserved subsequences
open my $scores_info, $scores_file or die "Could not open $scores_file: $!";
#open(my $fh, '>', $coords_file) or die;
my $count = 0;
my $cons = "";
my $newcons = "";
while( my $sline = <$scores_info>) {
my @data = split('\t', $sline);
my $coord = $data[0];
my $prob = $data[1];
if ($data[1] >= 0.990) {
#$cons = "$cons + '\n' + $sline + '\n'";
$cons = join("\n", $cons, $sline);
# print $cons;
$count++;
if($count >= 100) {
$newcons = join("\n", $newcons, $cons);
my @array = split /'\n'/, $newcons;
print @array;
}
}
else {
$cons = "";
$count = 0;
}
}
It gives me the lines with probability >=0.990 (the first if works), but the coordinates are wrong. When I try to print to a file it gets stuck, so I have only one sample to check it against.
I'm terribly sorry if my explanations aren't helpful, but I am new to programming.
Please, I need your help...
Thank you very much in advance!!!
You seem to be using too many variables. Also, after splitting the array and assigning its parts to variables, use the new variables rather than the original array.
sub output {
my ($from, $to) = @_;
print "$from\t$to\n";
}
my $threshold = 0.990; # the question asks for >= 0.990
my $count = 0;
my ($start, $last);
while (my $sline = <$scores_info>) {
my ($coord, $prob) = split /\t/, $sline;
if ($prob >= $threshold) {
$count++;
defined $start or $start = $coord;
$last = $coord;
} else {
output($start, $last) if $count > 100;
undef $start;
$count = 0;
}
}
output($start, $last) if $count > 100;
(untested)
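The same idea can be exercised without an input file. The sketch below wraps the loop in a hypothetical `find_blocks` function and runs it on a handful of made-up rows with a small minimum size; the sentinel row is an assumption added so that a run reaching the end of the data is still reported.

```perl
use strict;
use warnings;

# Return [start, end] pairs for runs of consecutive rows whose probability
# is >= $threshold and that are longer than $min_size rows
sub find_blocks {
    my ($rows, $threshold, $min_size) = @_;
    my (@blocks, $start, $last);
    my $count = 0;
    for my $row (@$rows, [undef, -1]) {      # sentinel row closes a trailing run
        my ($coord, $prob) = @$row;
        if ($prob >= $threshold) {
            $start = $coord unless defined $start;
            $last  = $coord;
            $count++;
        }
        else {
            push @blocks, [$start, $last] if $count > $min_size;
            undef $start;
            $count = 0;
        }
    }
    return @blocks;
}

# Toy data: the real input has many more rows and a minimum size of 100
my @rows = ( [1, 0.995], [2, 0.999], [3, 1.000], [4, 0.100], [5, 0.992], [6, 0.300] );
print "$_->[0]\t$_->[1]\n" for find_blocks(\@rows, 0.990, 2);
```

With the real data, the call would use a minimum size of 100 and rows read from the scores file.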
I have an Excel sheet in which I need to empty some cells.
So far this is what it looks like:
I open the sheet and check for non-empty cells in column M.
I add those cells to my array mistakes,
and then I would like to blank out all those cells and save the file (this step is not working), as that file needs to be the input to another program.
Thanks!
$infile = $ARGV[0];
$columns = ReadData($infile) or die "cannot open excel table\n\n";
print "xls sheet contains $columns->[1]{maxrow} rows\n";
my $xlsstartrow;
if ( getExcel( A . 1 ) ne "text" ) {
$xlsstartrow = 2;
}
else
{
$xlsstartrow = 4;
}
check_templates();
print "done";
sub check_templates {
for ( $row = $xlsstartrow ; $row < ( $columns->[1]{maxrow} + 1 ) ; $row++ ) {
if (getExcel(M . $row) ne "" ){
$cell = "M" . $row ;
push(@mistakes,$cell);
}
}
rewritesheet(@mistakes);
}
sub rewritesheet {
my $FileName = $infile;
my $parser = Spreadsheet::ParseExcel::SaveParser->new();
my $template = $parser->Parse($FileName);
my $worksheet = $template->worksheet(0);
my $row = 0;
my $col = 0;
# Get the format from the cell
my $format = $template->{Worksheet}[$sheet]
->{Cells}[$row][$col]
->{FormatNo};
foreach (@mistakes){
$worksheet->AddCell( $_, "" );
}
$template->SaveAs($infile2);
Empty column values in an Excel sheet and save the result?
If the whole purpose of your program is to delete all column M values from a .xls file, then the following program (adapted from your program) will do exactly that:
use strict;
use warnings;
use Spreadsheet::ParseExcel;
use Spreadsheet::ParseExcel::SaveParser;
my $infile = $ARGV[0];
(my $infile2 = $infile) =~ s/(\.xls)$/_2$1/;
my $parser = Spreadsheet::ParseExcel::SaveParser->new();
my $workbook = $parser->Parse($infile);
my $sheet = $workbook->worksheet(0);
print "xls sheet contains rows \[0 .. $sheet->{MaxRow}\]\n";
my $first = $sheet->get_cell(0, 0); # cell object, or undef if A1 is empty
my $startrow = (defined $first && $first->value eq 'text') ? 4-1 : 2-1;
my $col_M = ord('M') - ord('A');
for my $row ($startrow .. $sheet->{MaxRow}) {
my $c = $sheet->get_cell($row, $col_M);
if(defined $c && length($c->value) > 0) { # why check?
$sheet->AddCell($row, $col_M, undef) # delete value
}
}
$workbook->SaveAs($infile2);
print "done";
But, if you really want to clear out column M only, why would you test for values? You could just overwrite them without testing. Maybe that's not all your program is required to do? I don't know.
Regards
rbo
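The `ord('M') - ord('A')` trick in the program above only covers single-letter columns. As a sketch, a hypothetical helper `column_index` (not part of Spreadsheet::ParseExcel) could handle names such as "AA" by treating the letters as base-26 digits:

```perl
use strict;
use warnings;

# Hypothetical helper: convert an Excel column name to a zero-based index
# ("A" -> 0, "M" -> 12, "AA" -> 26), treating the letters as base-26 digits
sub column_index {
    my ($name) = @_;
    my $index = 0;
    for my $letter (split //, uc $name) {
        $index = $index * 26 + (ord($letter) - ord('A') + 1);
    }
    return $index - 1;
}

print column_index('M'), "\n";
```

The result can be used wherever the program above uses $col_M.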
I am currently working on a little parser.
I have had very good results with the first script; it runs great!
It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note: 6142 records). But the data are not separated, so the subsequent work with the data is a bit difficult. Therefore I have a second script (see below).
Note: friends helped me with both scripts. I should introduce myself as a true novice who needs help merging the two into one. My Perl knowledge is not advanced enough for me to do the merge on my own. Any and all help would be great!
The first script: a spider and parser: it spits out the data like this:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183 Abenberg 09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg Güssübelstr. 2 91183 Abenberg 09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326 Abensberg 09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
But I need to separate the data, with commas or something like that!
And I have a second script. This part can produce the CSV format. I want to combine it with the spider logic. But first let's have a look at the first script, with the great spider logic.
Here is the relevant code:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
Unfortunately, the script above does not take care of the separators, so I had to find a method that does, in order to get the output separated.
With the separation I am able to work with the data and store it in a MySQL table, or do something else. So below are the bits that produce the CSV format. Note: I want to put the code below into the code above, to combine the spider logic of the first script with the logic for outputting the data in CSV format.
Where should it go in the code? Can we identify the point at which to merge the one into the other?
That would be amazing. I hope I could make clear what I have in mind. Can we use the benefits of both scripts by merging them into one?
So the question is: where in the script above should the CSV script be inserted?
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d; # strip carriage returns
$html =~ s/&nbsp;/ /g; # expand non-breaking spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for @$row;
# load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
# derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /\n+/, $h{name};
@h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
The thing is that I have had very good results with the first script! It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note: 6142 records). But the data are not separated!
And I have a second script. This part can produce the CSV format. I want to combine it with the spider logic.
Where is the part to insert? I look forward to any and all help.
If I have to be more precise, just let me know.
Since you have entered a complete script, I'll assume you want critique of the whole thing.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
Since you only use $te in one block, why are you declaring and initializing it in this outer scope? The same question applies to most of your variables -- try to declare them in the innermost scope possible.
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
In general, English variable names will enable you to collaborate with far more people than German names. I understand German, so I understand the intent of your code, but most of SO doesn't.
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
Don't use & to call subs. Just call them with workDir;. It hasn't been necessary to use & since 1994, and it can lead to a nasty gotcha because &callMySub; is a special case which doesn't do what you might think, while callMySub; does the Right Thing.
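A small sketch of that gotcha, using hypothetical sub names: calling `&show_args;` with no parentheses hands the caller's current @_ to the callee, whereas `show_args();` passes an empty list.

```perl
use strict;
use warnings;

# Reports how many arguments it received
sub show_args { return scalar @_ }

# "&show_args;" with no parentheses reuses this sub's own @_
sub call_with_ampersand { return &show_args; }

# A plain call with parentheses passes an empty argument list
sub call_plain { return show_args(); }

print call_with_ampersand(1, 2, 3), "\n";   # callee sees the three caller arguments
print call_plain(1, 2, 3), "\n";            # callee sees no arguments
```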
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
Generally lexical filehandles are preferred these days: open my $outfile, ">file"; Also, you should check for errors from open or use autodie; to make open die on failure.
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
This is the line to change if you want to separate your data with commas. Look at the join function; it can do what you want.
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
It's very strange to initialize $te at the end of the loop instead of the beginning. It's much more idiomatic to declare and initialize $te at the top of the loop.
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
Did you mean s/\s+/ /g;?
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
I haven't commented on your second script; perhaps you should ask it as a separate question.
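For the join suggestion made at the print OUTFILE line, a minimal sketch might look like this. The row contents are made up and merely stand in for one @$row returned by HTML::TableExtract.

```perl
use strict;
use warnings;

# A made-up row standing in for one @$row from HTML::TableExtract
my $row = [ '1', '0401', 'Realschulen', 'www.example.org' ];

# join puts the separator between the fields, not after the last one
my $line = join ',', @$row;
print "$line\n";
```

A plain join does no quoting, so a field that itself contains a comma would break the format; the Text::CSV module used in the second script handles that case.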