Degeneracy of characters when searching for a specific sub-string - perl

I have the following script which searches for specified substrings within an input string (a DNA sequence). I was wondering if anybody could help out with being able to specify degeneracy of specific characters. For example, instead of searching for GATC (or anything consisting solely of G's, T's, A's and C's), I could instead search for GRTNA where R = A or G and where N = A, G, C or T. I would need to be able to specify quite a few of these in a long list within the script. Many thanks for any help or tips!
use warnings;
use strict;
#User Input
my $usage = "Usage (OSX Terminal): perl <$0> <FASTA File> <Results Directory + Filename>\n";
#Reading formatted FASTA/FA files
sub read_fasta {
my ($in) = #_;
my $sequence = "";
while(<$in>) {
my $line = $_;
chomp($line);
if($line =~ /^>/){ next }
else { $sequence .= $line }
}
return(\$sequence);
}
#Scanning for restriction sites and length-output
open(my $in, "<", shift);
open(my $out, ">", shift);
my $DNA = read_fasta($in);
print "DNA is: \n $$DNA \n";
my $len = length($$DNA);
print "\n DNA Length is: $len \n";
my #pats=qw( GTTAAC );
for (#pats) {
my $m = () = $$DNA =~ /$_/gi;
print "\n Total DNA matches to $_ are: $m \n";
}
my $pat=join("|",#pats);
my #cutarr = split(/$pat/, $$DNA);
for (#cutarr) {
my $len = length($_);
print $out "$len \n";
}
close($out);
close($in);

GRTNA would correspond to the pattern G[AG]T[AGCT]A.
It looks like you could do this by writing
for (#pats) {
s/R/[AG]/g;
s/N/[AGCT]/g;
}
before
my $pat = join '|', #pats;
my #cutarr = split /$pat/, $$DNA;
but I'm not sure I can help you with the requirement that "I would need to be able to specify quite a few of these in a long list within the script". I think it would be best to put your sequences in a separate text file rather than embed the list directly into the program.
By the way, wouldn't it be simpler just to
return $sequence
from your read_fasta subroutine? Returning a reference just means you have to dereference it everywhere with $$DNA. I suggest that it should look like this
sub read_fasta {
my ($fh) = #_;
my $sequence;
while (<$fh>) {
unless (/^>/) {
chomp;
$sequence .= $_;
}
}
return $sequence;
}

Related

How can I calculate the geometric center of a protein in Perl?

I have a PDB file which contains information about a specific protein. One of the information it holds is the positions of the different atoms (xyz coordinates).
The file is the following https://files.rcsb.org/view/6U9D.pdb . With this file I want to calculate the geometric center of the atoms. In theory I know what I need to do, but the script I wrote does not seem to work.
The first part of the program, before the for loop, is another part of the assignment which requires me to read the sequence and convert it from the 3 letter nomenclature to the 1 letter one. The part that interests me is the for loop until the end. I tried to pattern match in order to isolate the XYZ coordinates. Then I used a counter that I had set up in the beginning which is the $k variable. When I check the output on cygwin the only output I get is the sequence 0 0 0 instead of the sum of each dimension divided by $k. Any clue what has gone wrong?
$k=0;
open (IN, '6U9D.pdb.txt');
%amino_acid_conversion = (
ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C',
GLY=>'G',ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',
GLU=>'E',HIS=>'H',TRP=>'W',LYS=>'K',PHE=>'F',
PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'
);
while (<IN>) {
if ($_=~m/HEADER\s+(.*)/){
print ">$1\n"; }
if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){
$seq.=$1;
$seq=~s/ //g;
}
}
for ($i=0;$i<=length $seq; $i+=3) {
print "$amino_acid_conversion{substr($seq,$i,3)}";
if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) {
$x+=$1; $y+=$2; $z+=$3; $k++;
}
}
print "\n";
#print $k;
$xgk=($x/$k); $ygk=($y/$k); $zgk=($z/$k);
print "$xgk $ygk $zgk \n";
I do not know bioinformatics but it seems like you should do something like this:
use feature qw(say);
use strict;
use warnings;
my $fn = '6U9D.pdb';
open ( my $IN, '<', $fn ) or die "Could not open file '$fn': $!";
my $seq = '';
my $x = 0;
my $y = 0;
my $z = 0;
my $k = 0;
while (<$IN>) {
if ($_ =~ m/HEADER\s+(.*)/) {
say ">$1";
}
if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){
$seq .= $1;
}
if ($_ =~ m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) {
$x+=$1; $y+=$2; $z+=$3; $k++;
}
}
close $IN;
$seq =~ s/ //g;
my %amino_acid_conversion = (
ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C',
GLY=>'G',ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',
GLU=>'E',HIS=>'H',TRP=>'W',LYS=>'K',PHE=>'F',
PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'
);
my %unknown_keys;
my $conversion = '';
say "Sequence length: ", length $seq;
for (my $i=0; $i < length $seq; $i += 3) {
my $key = substr $seq, $i, 3;
if (exists $amino_acid_conversion{$key}) {
$conversion.= $amino_acid_conversion{$key};
}
else {
$unknown_keys{$key}++;
}
}
say "Conversion: $conversion";
say "Unknown keys: ", join ",", keys %unknown_keys;
say "Number of atoms: ", $k;
my $xgk=($x/$k);
my $ygk=($y/$k);
my $zgk=($z/$k);
say "Geometric center: $xgk $ygk $zgk";
This gives me the following output:
[...]
Number of atoms: 76015
Geometric center: 290.744642162734 69.196842162731 136.395842938893

subset fasta file using id file

I have been stuck with some script!
Well i made this script in 2008 and now i am using with some modifications and i get error!
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
sub getSequences ($) {
my $file = $_[0];
open (INFILE, "<$file") || die "Error1 in opening in file: $file. $!\n";
my #lines = <INFILE>;
my $header; my %header2seq = ();
foreach my $line (#lines) {
chomp $line;
if ($line =~ /^(>.+)$/o) {
$header = $1;
}
else {$header2seq {$header} .= $line; }
}
#print %header2seq;
close (INFILE);
return (\%header2seq);
}
sub MakeSpList ($) {
my $sp_list = $_[0]; my %sp_names = ();
open (INFILE2, "<$sp_list") || die "Error2 in opening in file: $sp_list. $!\n";
my #sps = <INFILE2>;
foreach my $line (#sps) { chomp $line; $sp_names {$line} = 1; }
close (INFILE2);
#print Dumper (%sp_names);
return (\%sp_names);
}
sub CompareSpList2Sequences ($$$) {
my $ref_header2seq = $_[0] ; my $ref_sp_names = $_[1]; my $file = $_[2];
open (OUTFILE, ">$file.subdata") || die ("Error3 in opening out file: $file.subdata. $!\n");
foreach my $key (keys %$ref_header2seq) {
$key =~ m/^>([A-Z]+[0-9]+[A-Z+]).+$/o;
#print "$1\n";
my $header_sub = $1;
#print $header_sub, "\n";
#print $ref_sp_names, "\n";
if (exists $ref_sp_names -> {$header_sub}) {
my $seq = $ref_header2seq -> {$key};
print OUTFILE ">$key\n$seq\n";
}
}
close (OUTFILE);
return "42";
}
my $fasta_seqs = $ARGV[0]; my $sp_list = $ARGV[1];
my $ref_header2seq = getSequences ($fasta_seqs);
my $ref_sp_names = MakeSpList ($sp_list);
CompareSpList2Sequences ($ref_header2seq , $ref_sp_names, $fasta_seqs);
exit;
What i want to do is:
i have a fasta file with sequences:
YAL004W YAL004W SGDID:S000002136, Chr I from 140760-141407, Genome Release 64-2-1, Dubious ORF, "Dubious open reading frame; unlikely to encode a functional protein, based on available experimental and comparative sequence data; completely overlaps verified gene SSA1/YAL005C" ATGGGTGTCACCAGCGGTGGCCTTAACTTCAAAGATACCGTCTTCAATGGACAACAAAGAGACATCGAAAGTACCACCACCCAAGTCGAAAATCAAGACGTGTTCTTCCTTACCCTTCTTGTCCAAACCGTAAGCAATGGCAGCGGCGGTAGGTTCGTTAATAATACGCAAGACATTCAAACCAGCAATGGTACCAGCATCCTTGGTAGCTTGTCTTTGAGAATCGTTGAA
YAL005C SSA1 SGDID:S000000004, Chr I from 141431-139503, Genome Release 64-2-1, reverse complement, Verified ORF, "ATPase involved in protein folding and NLS-directed nuclear transport; member of HSP70 family; forms chaperone complex with Ydj1p; localized to nucleus, cytoplasm, and cell wall; 98% identical with paralog Ssa2p, but subtle differences between the two proteins provide functional specificity with respect to propagation of yeast [URE3] prions and vacuolar-mediated degradations of gluconeogenesis enzymes; general targeting factor of Hsp104p to prion fibrils" ATGTCAAAAGCTGTCGGTATTGATTTAGGTACAACATACTCGTGTGTTGCTCACTTTGCTAATGATCGTGTGGACATTATTGCCAACGATCAAGGTAACAGAACCACTCCATCTTTTGTCGCTTTCACTGACACTGAAAGATTGATTGGTGATGCTGCTAAGAATCAAGCTGCTATGAATCCTTCGAATACCGTTTTCGACGCTAAGCGTTTGATCGGTAGAAACTTCAAC
and i have another file with ID's:
YAL005C
YAL012W
I want to retrieve the sequences and the all header when match with ID's file.
but i get this error: don´t print anything!
Please can you help me?
Thanks in advance.
i already searched for other methods (and i can´t get the results either) but i really want to know about this error!
no bioperl please!
OK, so - line 45 is:
if (exists $ref_sp_names -> {$header_sub}) {
Your error is telling you that $header_sub is undefined. It's set by:
my $header_sub = $1;
Which follows:
$key =~ m/^(>[A-Z])\s.+$/o;
So - this means the regex isn't matching. I don't see any > in your sample data, so it can't match it. When the match fails, $1 is undefined, hence your error. What do you get out of your print $key statements?
I would also note - .+$ is most likely redundant. Likewise - the o flag - you probably don't want that either. http://perldoc.perl.org/perlre.html#Modifiers
have you tried using Bioperl? Here's some code to get you started.
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
my $fasta = shift; #this will just push whatever in cli in.
my $seqio_obj = Bio::SeqIO->(-file => $fasta, -format => 'fasta');
while ( my $seq = $seqio_obj->next_seq){
print $seq->id . ' = ' . $seq->seq() . "\n";
#in here you can do your fasta handling with the seq obj
}

perl hash mapping/retrieval issues with split and select columns

Perl find and replace multiple(huge) strings in one shot
P.S.This question is related to the answer for above question.
When I try to replace this code:
Snippet-1
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/ } <$map_fh>;
close $map_fh;
with this code:
Snippet-2
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
even though the key exists (as in the dumper), exists statement doesn't return the value for the key.
For same input file, it works perfectly with just split alone (Snippet-1) whereas not returning anything when i select specific columns after split(Snippet-2).
Is there some integer/string datatype mess-up happening here?
Input Mapping File
483329,Buffalo
483330,Buffalo
483337,Buffalo
Script Output
$VAR1 = {
'483329' => 'Buffalo',
'46546' => 'Chicago_CW',
'745679' => 'W. Washington',
};
1 search is ENB
2 search is 483329 **expected Buffalo here**
3 search is 483330
4 search is 483337
Perl Code
open my $map_fh, '<', $MarketMapFile or die $!;
if ($MapSelection =~ /eNodeBID/i) {
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
use Data::Dumper;
print Dumper(\%replace);
}
close $map_fh;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => $/,quote_space => 0 });
my $tmpCSVFile = $CSVFile."tmp";
open my $in_fh, '<', $CSVFile or die $!;
open my $out_fh, '>', $tmpCSVFile or die $!;
my $cnt=1;
while (my $row = $csv->getline($in_fh)) {
my $search = $row->[5];
$search =~ s/[^[:print:]]+//g;
if ($MapSelection =~ /eNodeBID/i) {
$search =~ s/(...)-(...)-//g;
$search =~ s/\(M\)//g;
}
my $match = (exists $replace{$search}) ? $replace{$search} : undef;
print "\n$cnt search is $search ";
if (defined($match)) {
$match =~ s/[^[:print:]]+//g;
print "and match is $match";
}
push #$row, $match;
#print " match is $match";
$csv->print($out_fh, $row);
$cnt++;
}
# untie %replace;
close $in_fh;
close $out_fh;
You have a problem of scope. Your code:
if ($MapSelection =~ /eNodeBID/i) {
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
use Data::Dumper;
print Dumper(\%replace);
}
declares %replace within the if block. Move it outside so that it can also be seen by later code:
my %replace;
if ($MapSelection =~ /eNodeBID/i) {
%replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
use Data::Dumper;
print Dumper(\%replace);
}
Putting use strict and use warnings at the top of your code helps you find these kinds of issues.
Also, you can just use my $match = $replace{$search} since it's equivalent to your ?: operation.
Always include use strict; and use warnings; at the top of EVERY perl script. If you had done that and been maintaining good coding practice with declaring your variables, you would've gotten error:
Global symbol "%replace" requires explicit package name at
That would've let you know there was a scoping issue with your code. One way to avoid that is to use a ternary in your initialization of %replace
my %replace = ($MapSelection =~ /eNodeBID/i)
? map { chomp; (split /,/)[0,1] } <$map_fh>
: ();

Generate random pairs from a list of numbers making sure that the generated random pairs are not already present

Given a set of genes and existing pair of genes, I want to generate new pairs of genes which are not already existing.
The genes file has the following format :
123
134
23455
3242
3423
...
...
The genes pairs file has the following format :
12,345
134,23455
23455,343
3242,464452
3423,7655
...
...
But I still get few common elements between known_interactions and new_pairs. I'm not sure where the error is.
For the arguments,
perl generate_random_pairs.pl entrez_genes_file known_interactions_file 250000
I got a common elements of 15880. The number 250000 is to tell how many random pairs I want the program to generate.
#! usr/bin/perl
use strict;
use warnings;
if (#ARGV != 3) {
die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = #ARGV;
open (IN, $e_file) or die "Error!! Cannot open $e_file\n";
open (IN2, $k_file) or die "Error!! Cannot open $k_file\n";
my #e_file = <IN>; s/\s+\z// for #e_file;
my #k_file = <IN2>; s/\s+\z// for #k_file;
my (%known_interactions);
my %entrez_genes;
$entrez_genes{$_}++ foreach #e_file;
foreach my $line (#k_file) {
my #array = split (/,/, $line);
$known_interactions{$array[0]} = $array[1];
}
my $count = 0;
foreach my $key1 (keys %entrez_genes) {
foreach my $key2 (keys %entrez_genes) {
if ($key1 != $key2) {
if (exists $known_interactions{$key1} && ($known_interactions{$key1} == $key2)) {next;}
if (exists $known_interactions{$key2} && ($known_interactions{$key2} == $key1)) {next;}
if ($key1 < $key2) { print "$key1,$key2\n"; $count++; }
else { print "$key2,$key1\n"; $count++; }
}
if ($count == $interactions) {
die "$count\n";
}
}
}
I can see nothing wrong with your code. I wonder if you have some whitespace in your data - either after the comma or at the end of the line? It would be safer to extract just the digit fields with, for instance
my #e_file = map /\d+/g, <IN>;
Also, you would be better off keeping both elements of the pair as the hash key, so that you can just check the existence of the element. And if you make sure the lower number is always first you don't need to do two lookups.
This example should work for you. It doesn't address the random selection part of your requirement, but that wasn't in your own code and wasn't your immediate problem
use strict;
use warnings;
#ARGV = qw/ entrez_genes.txt known_interactions.txt 9 /;
if (#ARGV != 3) {
die "Usage: generate_random_pairs.pl <entrez_genes> <known_interactions> <number_of_interactions>\n";
}
my ($e_file, $k_file, $interactions) = #ARGV;
open my $fh, '<', $e_file or die "Error!! Cannot open $e_file: $!";
my #e_file = sort { $a <=> $b } map /\d+/g, <$fh>;
open $fh, '<', $k_file or die "Error!! Cannot open $k_file: $!";
my %known_interactions;
while (<$fh>) {
my $pair = join ',', sort { $a <=> $b } /\d+/g;
$known_interactions{$pair}++;
}
close $fh;
my $count = 0;
PAIR:
for my $i (0 .. $#e_file-1) {
for my $j ($i+1 .. $#e_file) {
my $pair = join ',', #e_file[$i, $j];
unless ($known_interactions{$pair}) {
print $pair, "\n";
last PAIR if ++$count >= $interactions;
}
}
}
print "\nTotal of $count interactions\n";
first of all, you are not chomping (removing newlines) from your file of known interactions. That means that given a file like:
1111,2222
you will build this hash:
$known_interactions{1111} = "2222\n";
That is probably why you are getting duplicate entries. My guess is (can't be sure without your actual input files) that these loops should work ok:
map{
chomp;
$entrez_genes{$_}++ ;
}#e_file;
and
map {
chomp;
my #array = sort(split (/,/));
$known_interactions{$array[0]} = $array[1];
}#k_file;
Also, as a general rule, I find my life is easier if I sort the interacting pair (the joys of bioinformatics :) ). That way I know that 111,222 and 222,111 will be treated in the same way and I can avoid multiple if statements like you have in your code.
Your next loop would then be (which IMHO is more readable):
my #genes=keys(%entrez_genes);
for (my $i=0; $i<=$#genes;$i++) {
for (my $k=$n; $k<=$#genes;$k++) {
next if $genes[$n] == $genes[$k];
my #pp=sort($genes[$n],$genes[$k]);
next unless exists $known_interactions{$pp[0]};
next if $known_interactions{$pp[0]} == $pp[1];
print "$pp[0], $pp[1]\n";
$count++;
die "$count\n" if $count == $interactions;
}
}

How to make LWP and HTML::TableExtract spitting out CSV with Text::CSV

I am currently working on a little parser.
i have had very good results with the first script! This was able to run great!
It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note 6142 records) - But note - the data are not separated, so the subequent work with the data is a bit difficult. Therefore i have a second script - see below!
Note - friends helped me with the both scripts. I need to introduce myself as a true novice who needs help in migration two in one. So, you see, my Perl-knowlgedge is not so elaborated that i am able to do the migration into one on my own! Any and all help would be great!
The first script: a spider and parser: it spits out the data like this:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183  Abenberg  09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183  Abenberg  09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg  Güssübelstr. 2 91183  Abenberg  09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326  Abensberg  09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
But i need to separate the data: with commas or someting like that!
And i have a second script. This part can do the CSV-formate. i want to ombine it with the spider-logic. But first lets have a look at the first script: with the great spider-logic.
see the code that is appropiate:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(#$row);
print OUTFILE "#$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( #_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
But as this-above script-unfortunatley does not take care for the separators i have had to take care for a method, that does look for separators. In order to get the data (output) separated.
So with the separation i am able to work with the data - and store it in a mysql-table.. or do something else...So here [below] are the bits - that work out the csv-formate Note - i want to put the code below into the code above - to combine the spider-logic of the above mentioned code with the logic of outputting the data in CSV-formate.
where to set in the code Question: can we identify this point to migrate the one into the other... !?
That would be amazing... I hope i could make clear what i have in mind...!? Are we able to use the benefits of the both parts (/scripts ) migrating them into one?
So the question is: where to set in with the CSV-Script into the script (above)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d; # strip carriage returns
$html =~ s/ / /g; # expand spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my #cols = qw(
rownum
number
name
phone
type
website
);
my #fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for #$row;
# load the fields into the hash using a "hash slice"
my %h;
#h{#cols} = #$row;
# derive some fields from base fields, again using a hash slice
#h{qw/name street postal town/} = split /\n+/, $h{name};
#h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for #h{qw/name street postal town/};
$csv->combine(#h{#fields});
print $csv->string, "\n";
}
}
The thing is that i have had very good results with the first script! It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note 6142 records) - But note - the data are not separated...!
And i have a second script. This part can do the CSV-formate. i want to combine it with the spider-logic.
where is the part to insert? I look forward to any and all help.
if i have to be more precice - just let me know...
Since you have entered a complete script, I'll assume you want critique of the whole thing.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
Since you only use $te in one block, why are you declaring and initializing it in this outer scope? The same question applies to most of your variables -- try to declare them in the innermost scope possible.
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
In general, english variable names will enable you to collaborate with far more people than german names. I understand german, so I understand the intent of your code, but most of SO doesn't.
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
Don't use & to call subs. Just call them with workDir;. It hasn't been necessary to use & since 1994, and it can lead to a nasty gotcha because &callMySub; is a special case which doesn't do what you might think, while callMySub; does the Right Thing.
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
Generally lexical filehandles are preferred these days: open my $outfile, ">file"; Also, you should check for errors from open or use autodie; to make open die on failure.
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(#$row);
print OUTFILE "#$row\n";
This is the line to change if you want to put commas in separating your data. Look at the join function, it can do what you want.
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
It's very strange to initialize $te at the end of the loop instead of the beginning. It's much more idiomatic to declare and initialize $te at the top of the loop.
}
sub cleanup() {
for ( #_ ) {
s/s+/ /g;
Did you mean s/\s+/ /g;?
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
I haven't commented on your second script; perhaps you should ask it as a separate question.