Nested Loop running very slowly - perl

I'm trying to run a program to check each line of one file against each line of a second file to see if some of the elements match. Each file is around 200k lines.
What I've got so far looks like this;
#!/usr/bin/perl
#gffgenefind.pl
use strict;
use warnings;

die "SNP gff\n" unless @ARGV == 4;

open( my $snp,  "<", $ARGV[0] ) or die "Can't open $ARGV[0]: $!";
open( my $gff,  "<", $ARGV[1] ) or die "Can't open $ARGV[1]: $!";
open( my $outg, ">", $ARGV[2] ) or die "Can't open $ARGV[2]: $!";
open( my $outs, ">", $ARGV[3] ) or die "Can't open $ARGV[3]: $!";

my $scaffold;
my $site;
my @snplines = <$snp>;
my @gfflines = <$gff>;

foreach my $snpline (@snplines) {
    my @arr = split( /\t/, $snpline );
    $scaffold = $arr[0];
    $site     = $arr[1];
    foreach my $line (@gfflines) {
        my @arr1 = split( /\t/, $line );
        if ( $arr1[3] <= $site and $site <= $arr1[4] and $arr1[0] eq $scaffold ) {
            print $outg "$line";
            print $outs "$snpline";
        }
    }
}
File 1 (snp) looks like this:
scaffold_100 10689 A C A 0 0 0 0 0 0
File 2 (gff) looks like this:
scaffold_1 phytozomev10 gene 750912 765975 . - . ID=Carubv10008059m.g.v1.0;Name=Carubv10008059m.g
Essentially, I'm looking to see if the first values match and if the second value from snp is within the range defined in the second file (in this case 750912 to 765975).
I've seen that nested loops are to be avoided, and was wondering if there's an alternative way for me to look through this data.
Thanks!

Firstly - lose the foreach loop. That reads your whole file into memory, when you probably don't need to.
Try instead:
while ( my $snpline = <$snp> ) {
because it reads line by line.
Generally - mixing array indices and named variables is also bad style.
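For example, instead of $arr[0] and $arr[1] you can pick the fields you need into named variables directly (a one-line sketch):
my ( $scaffold, $site ) = ( split /\t/, $snpline )[ 0, 1 ];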
The core problem, though, is most likely that for each line of your first file you're cycling through the whole of the second file - roughly 200k × 200k comparisons.
Edit: Note - because 'scaffold' isn't unique, the code below uses a two-level hash, keyed by scaffold and then by site.
This seems like a good place to use a hash. E.g.
my %sites;
while ( <$snp> ) {
    my ( $scaffold, $site ) = split( /\t/ );
    $sites{$scaffold}{$site}++;
}
while ( <$gff> ) {
    my ( $name, $tmp1, $tmp2, $range_start, $range_end ) = split( /\t/ );
    if ( $sites{$name} ) {
        foreach my $site ( keys %{ $sites{$name} } ) {
            if (    $site >= $range_start
                and $site <= $range_end )
            {
                #do stuff with it;
                print;
            }
        }
    }
}
Hopefully you get the gist, even if it isn't specifically what you're after?
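Putting the pieces together, a complete version of the original program built on that hash might look like this (an untested sketch; it assumes the same argument order and tab-separated input as the original script):

#!/usr/bin/perl
use strict;
use warnings;

die "usage: gffgenefind.pl <snp> <gff> <outg> <outs>\n" unless @ARGV == 4;

open( my $snp,  "<", $ARGV[0] ) or die "Can't open $ARGV[0]: $!";
open( my $gff,  "<", $ARGV[1] ) or die "Can't open $ARGV[1]: $!";
open( my $outg, ">", $ARGV[2] ) or die "Can't open $ARGV[2]: $!";
open( my $outs, ">", $ARGV[3] ) or die "Can't open $ARGV[3]: $!";

# Index the SNP file once: scaffold => { site => original line }
my %sites;
while ( my $snpline = <$snp> ) {
    my ( $scaffold, $site ) = split /\t/, $snpline;
    $sites{$scaffold}{$site} = $snpline;
}

# One pass over the GFF file; only sites on the same scaffold are checked
while ( my $line = <$gff> ) {
    my ( $name, undef, undef, $range_start, $range_end ) = split /\t/, $line;
    next unless $sites{$name};
    foreach my $site ( keys %{ $sites{$name} } ) {
        if ( $site >= $range_start and $site <= $range_end ) {
            print $outg $line;
            print $outs $sites{$name}{$site};
        }
    }
}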

Try this Python snippet:
#!/usr/bin/env python
# Note: contextlib.nested is Python 2 only; on Python 3 you would
# use a single with-statement with multiple context managers.
import sys
import contextlib

if len(sys.argv) != 5:
    raise Exception('SNP gff')

snp, gff, outg, outs = sys.argv[1:]

# Index the gff file by scaffold name
gff_dict = {}
with open(gff) as gff_handler:
    for line in gff_handler:
        fields = line.split()
        try:
            gff_dict[fields[0]].append(fields[1:])
        except KeyError:
            gff_dict[fields[0]] = [fields[1:]]

with contextlib.nested(open(snp),
                       open(outs, 'w'),
                       open(outg, 'w')) as (snp_handler,
                                            outs_handler,
                                            outg_handler):
    for line_snp in snp_handler:
        fields = line_snp.split()
        key = fields[0]
        if key in gff_dict:
            for ele in gff_dict[key]:
                # compare positions numerically, not as strings
                if int(ele[2]) <= int(fields[1]) <= int(ele[3]):
                    outs_handler.write(line_snp)
                    outg_handler.write("{0}\t{1}\n".format(key, "\t".join(ele)))

Related

Hash incorrectly tracking counts, runtime long

I am working on a program in Perl and my output is wrong and taking forever to process. The code is meant to take in a large DNA sequence file, read through it in 15-letter increments (kmers), stepping forward 1 position at a time. I'm supposed to enter the kmer sequences into a hash, with their value being the number of occurrences of that kmer, meaning each key should be unique, and when a duplicate is found, it should increase the count for that particular kmer. I know from my professor's expected output file that I have too many lines, so it is allowing duplicates and not counting correctly. It's also running 5+ minutes, so I have to Ctrl+C to escape. When I go look at kmers.txt, the file is at least written and formatted correctly.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# countKmers.pl
# Open file /scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta
# Identify all k-mers of length 15, load them into a hash
# and count the number of occurrences of each k-mer. Each
# unique k-mer and its count will be written to file
# kmers.txt

#Create an empty hash
my %kMersHash = ();

#Open a filehandle for the output file kmers.txt
unless ( open ( KMERS, ">", "kmers.txt" ) ) {
    die $!;
}

#Call subroutine to load Fly Chromosome 2L
my $sequenceRef = loadSequence("/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta");

my $kMer     = 15;    #Set the size of the sliding window
my $stepSize = 1;     #Set the step size

for (
    #The sliding window's start position is 0
    my $windowStart = 0;
    #Prevent going past end of the file
    $windowStart <= ( length($$sequenceRef) - $kMer );
    #Advance the window by the step size
    $windowStart += $stepSize
  )
{
    #Get the substring from $windowStart for length $kMer
    my $kMerSeq = substr( $$sequenceRef, $windowStart, $kMer );
    #Call the subroutine to iterate through the kMers
    processKMers($kMerSeq);
}

sub processKMers {
    my ($kMerSeq) = @_;
    #Initialize $kCount with at least 1 occurrence
    my $kCount = 1;
    #If the key already exists, the count is
    #increased and changed in the hash
    if ( not exists $kMersHash{$kMerSeq} ) {
        #The hash key=>value is loaded: kMer=>count
        $kMersHash{$kMerSeq} = $kCount;
    }
    else {
        #Increment the count
        $kCount ++;
        #The hash is updated
        $kMersHash{$kMerSeq} = $kCount;
    }
    #Print out the hash to filehandle KMERS
    for (keys %kMersHash) {
        print KMERS $_, "\t", $kMersHash{$_}, "\n";
    }
}

sub loadSequence {
    #Get my sequence file name from the parameter array
    my ($sequenceFile) = @_;
    #Initialize my sequence to the empty string
    my $sequence = "";
    #Open the sequence file
    unless ( open( FASTA, "<", $sequenceFile ) ) {
        die $!;
    }
    #Loop through the file line-by-line
    while (<FASTA>) {
        #Assign the line, which is in the default
        #variable to a named variable for readability.
        my $line = $_;
        #Chomp to get rid of end-of-line characters
        chomp($line);
        #Check to see if this is a FASTA header line
        if ( $line !~ /^>/ ) {
            #If it's not a header line append it
            #to my sequence
            $sequence .= $line;
        }
    }
    #Return a reference to the sequence
    return \$sequence;
}
Here's how I would write your application. The processKMers subroutine boils down to just incrementing a hash element, so I've removed it. I've also altered the identifiers to match the snake_case that is more usual in Perl code, and I didn't see any point in load_sequence returning a reference to the sequence, so I've changed it to return the string itself.
use strict;
use warnings 'all';

use constant FASTA_FILE => '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
use constant KMER_SIZE  => 15;
use constant STEP_SIZE  => 1;

my $sequence = load_sequence( FASTA_FILE );

my %kmers;

for ( my $offset = 0;
    $offset + KMER_SIZE <= length $sequence;
    $offset += STEP_SIZE )
{
    my $kmer_seq = substr $sequence, $offset, KMER_SIZE;
    ++$kmers{$kmer_seq};
}

open my $out_fh, '>', 'kmers.txt' or die $!;

for ( keys %kmers ) {
    printf $out_fh "%s\t%d\n", $_, $kmers{$_};
}

sub load_sequence {
    my ( $sequence_file ) = @_;

    my $sequence = "";
    open my $fh, '<', $sequence_file or die $!;

    while ( <$fh> ) {
        next if /^>/;
        chomp;
        $sequence .= $_;
    }

    return $sequence;
}
In case you'd rather not apply ++ directly to a hash element that may not exist yet, here's the longhand equivalent of that increment:
my $n;
if ( exists $kMersHash{$kMerSeq} ) {
    $n = $kMersHash{$kMerSeq};
}
else {
    $n = 0;
}
++$n;
$kMersHash{$kMerSeq} = $n;
Everything looks fine in your code besides processKMers. The main issues:
$kCount is not persistent between calls to processKMers, so in your else statement, $kCount will always be 2
You are printing the entire hash every time you call processKMers, which is what is slowing you down. Printing that frequently slows down your process significantly; you should wait until the end of your program and print once.
Keeping your code mostly the same:
sub processKMers {
    my ($kMerSeq) = @_;
    if ( not exists $kMersHash{$kMerSeq} ) {
        $kMersHash{$kMerSeq} = 1;
    }
    else {
        $kMersHash{$kMerSeq}++;
    }
}
Then you want to move your print logic to immediately after your for-loop.
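For example (a sketch reusing your existing KMERS filehandle):

for (keys %kMersHash) {
    print KMERS $_, "\t", $kMersHash{$_}, "\n";
}
close KMERS;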

Next line array if a condition is achieved

I have a text in a file F1, one sentence per line, and another file containing the part of speech (POS) of each word in the text. For example:
F1 contains:
he lives in paris\n
he jokes
F2 contains:
he pro\n
lives verb\n
in prep\n
paris adv_pl\n
he pro\n
jokes verb\n
I would like to parse each sentence of F1 and extract the POS of each word. I managed to extract the POS of the first sentence, but the program can't parse the second line. This is the code:
open( FILE, $filename ) || die "Problem opening the input file";
open( FILEOUT, ">$filenameout" ) || die "Problem opening the output file";
open( F,  "/home/ahmed/Bureau/test/corpus.txt" ) || die "Problem opening";
open( F2, "/home/ahmed/Bureau/test/corp.txt" )   || die "Problem opening";

my $z;
my $y = 0;
my $l;
my $li;
my $pos;

while ( $ligne = <F> ) {
    while ( $li = <F2> ) {    # F2 POS
        chomp($li);
        # prem contains the first word of each line of F2,
        # deux contains the POS of this word
        ( $prem, $deux ) = ( $li =~ m/^\W*(\w+)\W+(\w+)/ );
        print "premier: $prem\n";
        chomp($ligne);
        @val = split( / /, $ligne );    # text corpus
        $l = @val;
        while ( $y < $l ) {    # $l length of sentence
            $z = $val[$y];
            print "z : $z\n";
            if ( $z eq $prem ) {
                print "true\n";
                $pos .= "POSw" . $y . "=" . $deux . " ";
                ++$y;
            }
            else {
                last;
            }
        }
    }
    print FILEOUT "$pos\n";
    $pos = "";
}
The result I had in the terminal:
premier: he
z : he
true
premier: lives
z : lives
true
premier: in
z : in
true
premier: paris
z : paris
true
premier: he
premier: jokes
The first sentence has 4 words; once the program has passed those 4, it must go on to the next line of the text, but I can't manage to solve that.
There are some issues in your script.
You must always use strict; use warnings; to show the most common syntax and/or typing errors, unused variables, etc.
You should always use the three-parameter open and no global FILEHANDLEs (see opentut).
You should use some sensible names for your filehandles, not FH, FH1, etc. but $fh_sentences and $fh_grammar (or other meaningful names).
So far for the general part. Now let's get more specific:
Your outer loop (F) reads the sentences one by one. The inner loop (F2) reads the grammatical types, but it does so only once, for the first sentence. When the F2 file has been read, subsequent calls to <F2> will always return undef because the file was already exhausted. You have to reset the file pointer to the beginning of the file after each sentence or -- even better -- read the file F2 in advance and store its contents in a hash.
Iterating over an array of words in a sentence is easier with foreach my $word (@words). No need to do the housekeeping of index variables (like $y) yourself.
chomping and splitting the sentences should be moved outside the F2 loop, because $ligne doesn't change inside that loop and re-processing it only burns CPU cycles.
Putting this together I end up with this:
use strict;
use warnings;

# Read the grammar file, F2, into a hash:
my %grammar;
open( my $fh_grammar, '<', 'F2' ) or die "Can't open F2: $!\n";
while( my $ligne = <$fh_grammar> ) {
    my ($prem, $deux) = ( $ligne =~ m/^\W*(\w+)\W+(\w+)/ );
    $grammar{$prem} = $deux;
}
close($fh_grammar);

# The hash is now:
# %grammar = (
#     'he'    => 'pro',
#     'lives' => 'verb',
#     'in'    => 'prep',
#     'paris' => 'adv_pl',
#     'jokes' => 'verb',
# );

# Read the sentences from F1 and check the grammar:
open( my $fh_sentences, '<', 'F1' ) or die "Can't open F1: $!\n";
while( my $ligne = <$fh_sentences> ) {
    my @words = split(/\s+/, $ligne );
    foreach my $word (@words) {
        print "z: $word\n";
        if ( exists $grammar{$word} ) {
            print "true; $grammar{$word}\n";
        }
    }
    print "\n";
}
close($fh_sentences);
Output:
z: he
true; pro
z: lives
true; verb
z: in
true; prep
z: paris
true; adv_pl
z: he
true; pro
z: jokes
true; verb
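If you also want to rebuild the POSw0=pro style output line that your original script produced, the loop body could collect it the same way (a sketch; $fh_out is a hypothetical output filehandle you would open yourself):

my $i   = 0;
my $pos = '';
foreach my $word (@words) {
    if ( exists $grammar{$word} ) {
        $pos .= "POSw" . $i . "=" . $grammar{$word} . " ";
    }
    $i++;
}
print $fh_out "$pos\n";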
You can solve the above problem in a different way, like this:
First read the POS file and put it in a hash.
Code:
my $filename = "F2";
open FH2, $filename or die "Error";
my %POS_hash;
while(<FH2>)
{
    chomp($_);
    my @arr = split(/ /, $_); # you can change the split function
    $POS_hash{$arr[0]} = $arr[1];
}
Now read your text file and replace each word with its POS:
my $filename1 = "F1";
open FH1, $filename1 or die "Error";
while(<FH1>)
{
    chomp($_);
    my @arr = split(/ /, $_); # you can change the split function
    foreach my $val (@arr)
    {
        if(exists $POS_hash{$val})
        {
            print "$POS_hash{$val}\t";
        }
    }
    print "\n";
}
I believe this is a better approach for your problem. Hope this solves it.

compare two txt files

I'm new to Perl. I have two text files and I need to check for matching strings in both lists.
For example matching strings are:
line - file 1: fe/bla/blablabla/abcdefg
line - file 2: blablabla/abcdefg
There is a match!
In addition, the location (line number) of the matching strings is not the same on both files.
I tried putting the lists in arrays and comparing the arrays with a nested loop, but the running time of the program is huge (the lists contain thousands of lines), and I believe there is a less naïve, more productive way.
This is the way I put the data in the array:
my $list1 = '/path/to/the/file';
open (FILE , '<' , $list1) or die ("Could not open the file");
while ( my $line = <FILE> ) {
    chomp ($line);
    $list_1[$i] = $line;
    $i = $i + 1;
}
close FILE;
I did it to the other list as well.
And this is my nested loop.
for ( $k = 0 ; $k < @list_1 ; $k = $k + 1 ) {
    for ( $i = 0 ; $i < @list_2 ; $i = $i + 1 ) {
        if ( index( $list_1[$k], $list_2[$i] ) != -1 ) {
            splice( @list_2, $i, 1 );
            last;
        }
    }
}
As long as file2 isn't enormous, the simplest way is to build a regular expression pattern from its contents and check each line in file1 against the pattern.
You don't say what output you want, so I have printed all lines in file1 that have a match in file2.
use strict;
use warnings;
use 5.010;
use autodie;
my ($list1, $list2) = qw( /path/to/list1 /path/to/list2 );

open my $fh, '<', $list2;
my $re = join '|', map { chomp; quotemeta; } <$fh>;
$re = qr/$re/;

open $fh, '<', $list1;
while ( <$fh> ) {
    print if /$re/;
}
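For the sample lines above, the pattern built from file 2 is equivalent to qr/blablabla\/abcdefg/ (quotemeta escapes the slashes), so the file 1 line fe/bla/blablabla/abcdefg matches and is printed.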

Pass lines from 2 files to same subroutine

I'm in the process of learning how to use perl for genomics applications. I am trying to clean up paired end reads (1 forward, 1 reverse). These are stored in 2 files, but the lines match. What I'm having trouble doing is getting the relevant subroutines to read from the second file (the warnings I get are for uninitialized values).
These files are set up in 4-line blocks (fastq), where the first line is a run ID, the 2nd is a sequence, the 3rd is a "+", and the fourth holds quality values for the sequence in line 2.
I had no real trouble with this code when it was applied only for one file, but I think I'm misunderstanding how to handle multiple files.
Any guidance is much appreciated!
The warning I get in this scenario is: Use of uninitialized value $thisline in subtraction (-) at ./pairedendtrim.pl line 137, line 4.
#!/usr/bin/perl
#pairedendtrim.pl by AHU
use strict;
use warnings;

die "usage: readtrimmer.pl <file1> <file2> <nthreshold> " unless @ARGV == 3;
my $nthreshold = "$ARGV[2]";

open( my $fastq1, "<", "$ARGV[0]" );
open( my $fastq2, "<", "$ARGV[1]" );

my @forline;
my @revline;

while ( not eof $fastq2 and not eof $fastq1 ) {
    chomp $fastq1;
    chomp $fastq2;
    $forline[0] = <$fastq1>;
    $forline[1] = <$fastq1>;
    $forline[2] = <$fastq1>;
    $forline[3] = <$fastq1>;
    $revline[0] = <$fastq2>;
    $revline[1] = <$fastq2>;
    $revline[2] = <$fastq2>;
    $revline[3] = <$fastq2>;

    my $ncheckfor = removen( $forline[1] );
    my $ncheckrev = removen( $revline[1] );

    my $fortest = 0;
    if ( $ncheckfor =~ /ok/ ) { $fortest = 1 }
    my $revtest = 0;
    if ( $ncheckrev =~ /ok/ ) { $revtest = 1 }

    if ( $fortest == 1 and $revtest == 1 ) { print "READ 1 AND READ 2" }
    if ( $fortest == 1 and $revtest == 0 ) { print "Read 1 only" }
    if ( $fortest == 0 and $revtest == 1 ) { print "READ 2 only" }
}

sub removen {
    my ($thisline) = $_;
    my $ntotal = 0;
    for ( my $i = 0; $i < length($thisline) - 1; $i++ ) {
        my $pos = substr( $thisline, $i, 1 );
        #print "$pos\n";
        if ( $pos =~ /N/ ) { $ntotal++ }
    }
    my $nout;
    if ( $ntotal <= $nthreshold )    #threshold for N
    {
        $nout = "ok";
    }
    else {
        $nout = "bad";
    }
    return ($nout);
}
The parameters to a subroutine are in @_, not $_:
sub removen {
    my ($thisline) = @_;
I have a few other tips for you as well:
use autodie; anytime that you're doing file processing.
Assign the values in @ARGV to variables first thing. This quickly documents what they hold.
Do not chomp a file handle. This does not do anything. Instead apply chomp to the values returned from reading.
Do not use the strings ok and bad as boolean values.
tr can be used to count the number of times a character occurs in a string, as the sketch after this list shows.
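For instance (a minimal sketch; $seq is just an assumed example string):

my $seq    = 'ANNATNCG';
my $ntotal = ( $seq =~ tr/N// );    # counts the Ns without modifying $seq
print "$ntotal\n";                  # prints 3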
The following is a cleaned up version of your code:
#!/usr/bin/perl
#pairedendtrim.pl by AHU
use strict;
use warnings;
use autodie;

die "usage: readtrimmer.pl <file1> <file2> <nthreshold> " unless @ARGV == 3;
my ( $file1, $file2, $nthreshold ) = @ARGV;

open my $fh1, '<', $file1;
open my $fh2, '<', $file2;

while ( not eof $fh2 and not eof $fh1 ) {
    chomp( my @forline = map { scalar <$fh1> } ( 1 .. 4 ) );
    chomp( my @revline = map { scalar <$fh2> } ( 1 .. 4 ) );

    my $ncheckfor = removen( $forline[1] );
    my $ncheckrev = removen( $revline[1] );

    print "READ 1 AND READ 2" if $ncheckfor and $ncheckrev;
    print "Read 1 only"       if $ncheckfor and !$ncheckrev;
    print "READ 2 only"       if !$ncheckfor and $ncheckrev;
}

sub removen {
    my ($thisline) = @_;
    my $ntotal = $thisline =~ tr/N/N/;
    return $ntotal <= $nthreshold;    #threshold for N
}

How to make LWP and HTML::TableExtract spitting out CSV with Text::CSV

I am currently working on a little parser.
I have had very good results with the first script! It was able to run great!
It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note: 6142 records) - but the data are not separated, so the subsequent work with them is a bit difficult. Therefore I have a second script - see below!
Note - friends helped me with both scripts. I need to introduce myself as a true novice who needs help merging the two into one. So, you see, my Perl knowledge is not so elaborate that I could do the migration into one script on my own! Any and all help would be great!
The first script, a spider and parser, spits out the data like this:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183  Abenberg  09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183  Abenberg  09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg  Güssübelstr. 2 91183  Abenberg  09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326  Abensberg  09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
But I need to separate the data, with commas or something like that!
And I have a second script. That part can do the CSV format; I want to combine it with the spider logic. But first let's have a look at the first script, with the great spider logic.
Here is the relevant code:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);

my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;

&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";

sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
    while( <tempfile.html> ) {
        open( FH, "$_" ) or die;
        while( <FH> ) {
            if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
            }
        }
        close FH;
    }
    unlink 'tempfile.html';
}

sub processData() {
    while ( $range <= $total_records) {
        getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
        $te->parse_file('processing.html');
        my ($table) = $te->tables;
        for my $row ( $table->rows ) {
            cleanup(@$row);
            print OUTFILE "@$row\n";
        }
        $| = 1;
        print "Processed records $range to $counter";
        print "\r";
        $counter = $counter + 50;
        $range = $range + 50;
        $te = HTML::TableExtract->new;
    }
}

sub cleanup() {
    for ( @_ ) {
        s/s+/ /g;
    }
}

sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
        mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
    }
}
But as the script above unfortunately does not take care of separators, I had to find a method that does, in order to get the data (output) separated.
With the separation I am able to work with the data - store it in a MySQL table, or do something else. So here [below] are the bits that produce the CSV format. Note - I want to put the code below into the code above, to combine the spider logic of the first script with the CSV-output logic of the second.
Where in the code should it go? Question: can we identify the point where the one should be migrated into the other?
That would be amazing... I hope I could make clear what I have in mind!? Are we able to use the benefits of both parts (/scripts), migrating them into one?
So the question is: where does the CSV script fit into the script above?
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html = get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d;        # strip carriage returns
$html =~ s/&nbsp;/ /g;    # expand spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};

        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
}
Where is the part to insert? I look forward to any and all help.
If I have to be more precise - just let me know...
Since you have entered a complete script, I'll assume you want critique of the whole thing.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
Since you only use $te in one block, why are you declaring and initializing it in this outer scope? The same question applies to most of your variables -- try to declare them in the innermost scope possible.
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
In general, English variable names will enable you to collaborate with far more people than German names. I understand German, so I understand the intent of your code, but most of SO doesn't.
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
Don't use & to call subs. Just call them with workDir;. It hasn't been necessary to use & since 1994, and it can lead to a nasty gotcha because &callMySub; is a special case which doesn't do what you might think, while callMySub; does the Right Thing.
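Here's a small self-contained demo of that gotcha (hypothetical sub names):

sub callMySub { print "got: @_\n" }
sub demo {
    &callMySub;     # the & form with no parens silently re-uses demo's @_
    callMySub();    # passes an empty list, as you would expect
}
demo( 1, 2, 3 );    # prints "got: 1 2 3", then "got: "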
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
Generally lexical filehandles are preferred these days: open my $outfile, ">file"; Also, you should check for errors from open or use autodie; to make open die on failure.
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
    print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
    getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
    while( <tempfile.html> ) {
        open( FH, "$_" ) or die;
        while( <FH> ) {
            if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
                $total_records = $6;
                print "Total records to process is $total_records\n";
            }
        }
        close FH;
    }
    unlink 'tempfile.html';
}
sub processData() {
    while ( $range <= $total_records) {
        getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
        $te->parse_file('processing.html');
        my ($table) = $te->tables;
        for my $row ( $table->rows ) {
            cleanup(@$row);
            print OUTFILE "@$row\n";
This is the line to change if you want to put commas in separating your data. Look at the join function, it can do what you want.
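For example (a sketch; Text::CSV, which your second script already uses, would handle quoting more robustly than a plain join):
print OUTFILE join( ',', @$row ), "\n";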
        }
        $| = 1;
        print "Processed records $range to $counter";
        print "\r";
        $counter = $counter + 50;
        $range = $range + 50;
        $te = HTML::TableExtract->new;
    }
It's very strange to initialize $te at the end of the loop instead of the beginning. It's much more idiomatic to declare and initialize $te at the top of the loop.
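For example (sketch):
while ( $range <= $total_records ) {
    my $te = HTML::TableExtract->new;
    # ... fetch, parse and print as before ...
}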
}
sub cleanup() {
    for ( @_ ) {
        s/s+/ /g;
Did you mean s/\s+/ /g;?
    }
}

sub workDir() {
    # Use home directory to process data
    chdir or die "$!";
    if ( ! -d $processdir ) {
        mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
    }
}
I haven't commented on your second script; perhaps you should ask it as a separate question.