subset fasta file using id file

subset fasta file using id file - perl

I have been stuck with some script!
Well i made this script in 2008 and now i am using with some modifications and i get error!
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
sub getSequences ($) {
my $file = $_[0];
open (INFILE, "<$file") || die "Error1 in opening in file: $file. $!\n";
my #lines = <INFILE>;
my $header; my %header2seq = ();
foreach my $line (#lines) {
chomp $line;
if ($line =~ /^(>.+)$/o) {
$header = $1;
}
else {$header2seq {$header} .= $line; }
}
#print %header2seq;
close (INFILE);
return (\%header2seq);
}
sub MakeSpList ($) {
my $sp_list = $_[0]; my %sp_names = ();
open (INFILE2, "<$sp_list") || die "Error2 in opening in file: $sp_list. $!\n";
my #sps = <INFILE2>;
foreach my $line (#sps) { chomp $line; $sp_names {$line} = 1; }
close (INFILE2);
#print Dumper (%sp_names);
return (\%sp_names);
}
sub CompareSpList2Sequences ($$$) {
my $ref_header2seq = $_[0] ; my $ref_sp_names = $_[1]; my $file = $_[2];
open (OUTFILE, ">$file.subdata") || die ("Error3 in opening out file: $file.subdata. $!\n");
foreach my $key (keys %$ref_header2seq) {
$key =~ m/^>([A-Z]+[0-9]+[A-Z+]).+$/o;
#print "$1\n";
my $header_sub = $1;
#print $header_sub, "\n";
#print $ref_sp_names, "\n";
if (exists $ref_sp_names -> {$header_sub}) {
my $seq = $ref_header2seq -> {$key};
print OUTFILE ">$key\n$seq\n";
}
}
close (OUTFILE);
return "42";
}
my $fasta_seqs = $ARGV[0]; my $sp_list = $ARGV[1];
my $ref_header2seq = getSequences ($fasta_seqs);
my $ref_sp_names = MakeSpList ($sp_list);
CompareSpList2Sequences ($ref_header2seq , $ref_sp_names, $fasta_seqs);
exit;
What i want to do is:
i have a fasta file with sequences:
YAL004W YAL004W SGDID:S000002136, Chr I from 140760-141407, Genome Release 64-2-1, Dubious ORF, "Dubious open reading frame; unlikely to encode a functional protein, based on available experimental and comparative sequence data; completely overlaps verified gene SSA1/YAL005C" ATGGGTGTCACCAGCGGTGGCCTTAACTTCAAAGATACCGTCTTCAATGGACAACAAAGAGACATCGAAAGTACCACCACCCAAGTCGAAAATCAAGACGTGTTCTTCCTTACCCTTCTTGTCCAAACCGTAAGCAATGGCAGCGGCGGTAGGTTCGTTAATAATACGCAAGACATTCAAACCAGCAATGGTACCAGCATCCTTGGTAGCTTGTCTTTGAGAATCGTTGAA
YAL005C SSA1 SGDID:S000000004, Chr I from 141431-139503, Genome Release 64-2-1, reverse complement, Verified ORF, "ATPase involved in protein folding and NLS-directed nuclear transport; member of HSP70 family; forms chaperone complex with Ydj1p; localized to nucleus, cytoplasm, and cell wall; 98% identical with paralog Ssa2p, but subtle differences between the two proteins provide functional specificity with respect to propagation of yeast [URE3] prions and vacuolar-mediated degradations of gluconeogenesis enzymes; general targeting factor of Hsp104p to prion fibrils" ATGTCAAAAGCTGTCGGTATTGATTTAGGTACAACATACTCGTGTGTTGCTCACTTTGCTAATGATCGTGTGGACATTATTGCCAACGATCAAGGTAACAGAACCACTCCATCTTTTGTCGCTTTCACTGACACTGAAAGATTGATTGGTGATGCTGCTAAGAATCAAGCTGCTATGAATCCTTCGAATACCGTTTTCGACGCTAAGCGTTTGATCGGTAGAAACTTCAAC
and i have another file with ID's:
YAL005C
YAL012W
I want to retrieve the sequences and the all header when match with ID's file.
but i get this error: don´t print anything!
Please can you help me?
Thanks in advance.
i already searched for other methods (and i can´t get the results either) but i really want to know about this error!
no bioperl please!

OK, so - line 45 is:
if (exists $ref_sp_names -> {$header_sub}) {
Your error is telling you that $header_sub is undefined. It's set by:
my $header_sub = $1;
Which follows:
$key =~ m/^(>[A-Z])\s.+$/o;
So - this means the regex isn't matching. I don't see any > in your sample data, so it can't match it. When the match fails, $1 is undefined, hence your error. What do you get out of your print $key statements?
I would also note - .+$ is most likely redundant. Likewise - the o flag - you probably don't want that either. http://perldoc.perl.org/perlre.html#Modifiers

have you tried using Bioperl? Here's some code to get you started.
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
my $fasta = shift; #this will just push whatever in cli in.
my $seqio_obj = Bio::SeqIO->(-file => $fasta, -format => 'fasta');
while ( my $seq = $seqio_obj->next_seq){
print $seq->id . ' = ' . $seq->seq() . "\n";
#in here you can do your fasta handling with the seq obj
}

Related

Degeneracy of characters when searching for a specific sub-string

I have the following script which searches for specified substrings within an input string (a DNA sequence). I was wondering if anybody could help out with being able to specify degeneracy of specific characters. For example, instead of searching for GATC (or anything consisting solely of G's, T's, A's and C's), I could instead search for GRTNA where R = A or G and where N = A, G, C or T. I would need to be able to specify quite a few of these in a long list within the script. Many thanks for any help or tips!
use warnings;
use strict;
#User Input
my $usage = "Usage (OSX Terminal): perl <$0> <FASTA File> <Results Directory + Filename>\n";
#Reading formatted FASTA/FA files
sub read_fasta {
my ($in) = #_;
my $sequence = "";
while(<$in>) {
my $line = $_;
chomp($line);
if($line =~ /^>/){ next }
else { $sequence .= $line }
}
return(\$sequence);
}
#Scanning for restriction sites and length-output
open(my $in, "<", shift);
open(my $out, ">", shift);
my $DNA = read_fasta($in);
print "DNA is: \n $$DNA \n";
my $len = length($$DNA);
print "\n DNA Length is: $len \n";
my #pats=qw( GTTAAC );
for (#pats) {
my $m = () = $$DNA =~ /$_/gi;
print "\n Total DNA matches to $_ are: $m \n";
}
my $pat=join("|",#pats);
my #cutarr = split(/$pat/, $$DNA);
for (#cutarr) {
my $len = length($_);
print $out "$len \n";
}
close($out);
close($in);

GRTNA would correspond to the pattern G[AG]T[AGCT]A.
It looks like you could do this by writing
for (#pats) {
s/R/[AG]/g;
s/N/[AGCT]/g;
}
before
my $pat = join '|', #pats;
my #cutarr = split /$pat/, $$DNA;
but I'm not sure I can help you with the requirement that "I would need to be able to specify quite a few of these in a long list within the script". I think it would be best to put your sequences in a separate text file rather than embed the list directly into the program.
By the way, wouldn't it be simpler just to
return $sequence
from your read_fasta subroutine? Returning a reference just means you have to dereference it everywhere with $$DNA. I suggest that it should look like this
sub read_fasta {
my ($fh) = #_;
my $sequence;
while (<$fh>) {
unless (/^>/) {
chomp;
$sequence .= $_;
}
}
return $sequence;
}

perl hash mapping/retrieval issues with split and select columns

Perl find and replace multiple(huge) strings in one shot
P.S.This question is related to the answer for above question.
When I try to replace this code:
Snippet-1
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/ } <$map_fh>;
close $map_fh;
with this code:
Snippet-2
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
even though the key exists (as in the dumper), exists statement doesn't return the value for the key.
For same input file, it works perfectly with just split alone (Snippet-1) whereas not returning anything when i select specific columns after split(Snippet-2).
Is there some integer/string datatype mess-up happening here?
Input Mapping File
483329,Buffalo
483330,Buffalo
483337,Buffalo
Script Output
$VAR1 = {
'483329' => 'Buffalo',
'46546' => 'Chicago_CW',
'745679' => 'W. Washington',
};
1 search is ENB
2 search is 483329 **expected Buffalo here**
3 search is 483330
4 search is 483337
Perl Code
open my $map_fh, '<', $MarketMapFile or die $!;
if ($MapSelection =~ /eNodeBID/i) {
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
use Data::Dumper;
print Dumper(\%replace);
}
close $map_fh;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => $/,quote_space => 0 });
my $tmpCSVFile = $CSVFile."tmp";
open my $in_fh, '<', $CSVFile or die $!;
open my $out_fh, '>', $tmpCSVFile or die $!;
my $cnt=1;
while (my $row = $csv->getline($in_fh)) {
my $search = $row->[5];
$search =~ s/[^[:print:]]+//g;
if ($MapSelection =~ /eNodeBID/i) {
$search =~ s/(...)-(...)-//g;
$search =~ s/\(M\)//g;
}
my $match = (exists $replace{$search}) ? $replace{$search} : undef;
print "\n$cnt search is $search ";
if (defined($match)) {
$match =~ s/[^[:print:]]+//g;
print "and match is $match";
}
push #$row, $match;
#print " match is $match";
$csv->print($out_fh, $row);
$cnt++;
}
# untie %replace;
close $in_fh;
close $out_fh;

You have a problem of scope. Your code:
if ($MapSelection =~ /eNodeBID/i) {
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
use Data::Dumper;
print Dumper(\%replace);
}
declares %replace within the if block. Move it outside so that it can also be seen by later code:
my %replace;
if ($MapSelection =~ /eNodeBID/i) {
%replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
use Data::Dumper;
print Dumper(\%replace);
}
Putting use strict and use warnings at the top of your code helps you find these kinds of issues.
Also, you can just use my $match = $replace{$search} since it's equivalent to your ?: operation.

Always include use strict; and use warnings; at the top of EVERY perl script. If you had done that and been maintaining good coding practice with declaring your variables, you would've gotten error:
Global symbol "%replace" requires explicit package name at
That would've let you know there was a scoping issue with your code. One way to avoid that is to use a ternary in your initialization of %replace
my %replace = ($MapSelection =~ /eNodeBID/i)
? map { chomp; (split /,/)[0,1] } <$map_fh>
: ();

Argument is not numeric error while comparing hash values based on keys

#!/usr/bin/perl
use strict;
use Data::Dumper;
use warnings;
my #mdsum;
open (IN1,"$ARGV[0]") || die "counldn't open";
open (MYFILE, '>>md5sum-problem.txt');
open (IN2, "mdsumfile.txt");
my %knomexl=();
my %knomemdsum = ();
my #arrfile ;
my $tempkey ;
my $tempval ;
my #values ;
my $val;
my $i;
my #newarra;
my $testxl ;
my $testmdsum;
while(<IN1>){
next if /barcode/;
#arrfile = split('\t', $_);
$knomexl{$arrfile[0]} = $arrfile[2];
}
while(<IN2>){
chomp $_;
#newarra = split(/ {1,}/, $_);
$tempval = $newarra[0];
$tempkey = $newarra[1];
$tempkey=~ s/\t*$//g;
$tempval=~ s/\s*$//g;
$tempkey=~s/.tar.gz//g;
$knomemdsum{$tempkey} = $tempval;
}
#values = keys %knomexl;
foreach $i(#values){
$testxl = $knomexl{$values[$i]};
print $testxl."\n";
$testmdsum = $knomemdsum{$values[$i]};
print $testmdsum."\n";
if ( $testxl ne $testmdsum ) {
if ($testxl ne ""){
print MYFILE "Files hasving md5sum issue $i\n";
}
}
}
close (MYFILE);
I have two files one both having File name and Mdsum values and I need to check that which all file's md5sum values are not matching so I understand that in some case where Value and corresponding values will not be their and I want those cases only. Any work around on this code ? Please. This code is pretty simple but don't know why it's not working!! :( :(

#values = keys %knomexl;
foreach $i(#values){
#print Dumper $knomexl{$values[$i]};
$testxl = $knomexl{$i};
print $testxl."\n";
$testmdsum = $knomemdsum{$i};
print $testmdsum."\n";
$i is an element of #values because of the foreach, not an index, so you shouldn't use $values[$i].

Skip the problematic DAT and proceed with next DATs and out put the error or missed DATs list in separate text file in Perl

I have another question here, i have several dats and want to merge them. But the script first checks for header of all the DATs and if not matching it will throw error and stop the script. Now i want to run the script skipping the problematic dat and output the error in separate text file with list of errored DAts and reason. Could anyone please help on this. Here is what i have so far:
use strict;
my $rootdir = $ARGV[0];
die "usage: perl mergetxtfiles.pl <folder>" if ($#ARGV != 0);
#$rootdir =~ s/\\/\\\\/g;
print "\nFolder = $rootdir\n\n";
opendir(DIR, $rootdir)
or die "failed opening the directory $rootdir";
open(OF,">:utf8",'combined_'.time.'.dat')
or die "failed opening the file";
my $icr = 0;
my $cnt = 0;
my $header = '';
my $header_flag = 0;
while(my $fname = readdir(DIR)) {
# add extensions if needed
if ($fname =~ m/(\.txt)|(\.dat)|(\.csv)$/i) {
$icr++;
my $fnamepath = $rootdir.'\\'.$fname;
print "\($icr\) $fname\n";
open(IF, "<:utf8", $fnamepath)
or die "ERROR: cannot open the file\n$fnamepath ";
my $sep_icr = 0;
while(<IF>) {
my $line = $_;
chomp $line;
next if (/^$/);
$sep_icr++;
$cnt++;
my #ar = split(/\t/,$line);
if ($cnt == 1) {
$header_flag = 1;
$header = $line;
}
if ($sep_icr == 1 and $header_flag == 1) {
#print "$line \n $header\n";
if ($line ne $header) {
die "Headers are not same\n";
}
elsif (($line eq $header) and ($cnt >1)) {
print "INFO\: ignoring the same header for $fname \n";
$cnt--;
next;
}
}
print OF $line."\n";
}
print "\--Line count= $sep_icr\n\n";
close IF;
#print OF "\n";
}
}
print "\-\-\> Total line count= $cnt\n";

Named Loops
In your loop, we have to change your if-clause and the outer loop a bit:
FILE:
while(my $fname = readdir(DIR)) {
...;
if ($line ne $header) {
logger($fname, "Headers not matching");
next FILE;
}
...;
}
In Perl, loops can be labeled, so we can specify which loop we do next, instead of setting and checking flags. I used an example logging function loggeras given below, but you can substitute it with an appropriate print statement.
Logging
This is probably a bit more than asked, but here is a little logging function for flexibility. Arguments are a filename, a reason, and an optional severity. You can remove the severity code if it isn't needed. The severity is optional anyway and defaults to debug.
open my $logfile, ">>", "FILENAME" or die "..."; # open for append
sub logger {
my ($file, $reason, $severity) = (#_, 'debug');
$severity = {
debug => '',
info => 'INFO',
warn => '!WARN!',
fatal => '!!!ERROR!!!',
}->{$severity} // $severity; # transform the severity if it is a name we know
$severity .= ' ' if length $severity; # append space if we have a severity
print {$logfile} $severity . qq{$reason while processing "$file"\n};
}
If called with logger("./foo/bar", "Headers not matching", 'warn') it will output:
!WARN! Headers not matching while processing "./foo/bar"
Change the printed error message to something more machine-readable if needed.
Style tips and tricks:
If find these lines more elegant:
die "usage: ...\n" unless #ARGV;
my ($rootdir) = #ARGV;
note the newline at the end (supresses the "at line 3" etc). In scalar context, an array returns the array length. In the second line we can avoid array subscripting by assigning in list context. Surplus elements are ignored.
Instead
if ($fname =~ m/(\.txt)|(\.dat)|(\.csv)$/i) { ...; }
we can say
next unless $fname =~ m/(?: \.txt | \.dat | \.csv )$/xi;
and avoid unneccessary intendation, therefore improving readability.
I modified the regex so that all suffixes must come at the end, not only the .csv suffix, and added the /x modifier so that I can use non-semantic whitespace inside the regex.
Windows, and pretty much any OS, understand forward slashes in path names. So instead
my $fnamepath = $rootdir.'\\'.$fname;
we can write
my $fnamepath = "$rootdir/$fname";
I find that easier to write and understand.
The
while(<IF>) {
my $line = $_;
construct can be simplified to
while(my $line = <IF>) {...}
Last but not least, consider starting a habit of using filehandles with my. Often, global filehandles are not needed and can cause some bugs.

How to make LWP and HTML::TableExtract spitting out CSV with Text::CSV

I am currently working on a little parser.
i have had very good results with the first script! This was able to run great!
It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note 6142 records) - But note - the data are not separated, so the subequent work with the data is a bit difficult. Therefore i have a second script - see below!
Note - friends helped me with the both scripts. I need to introduce myself as a true novice who needs help in migration two in one. So, you see, my Perl-knowlgedge is not so elaborated that i am able to do the migration into one on my own! Any and all help would be great!
The first script: a spider and parser: it spits out the data like this:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183  Abenberg  09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183  Abenberg  09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg  Güssübelstr. 2 91183  Abenberg  09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326  Abensberg  09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
But i need to separate the data: with commas or someting like that!
And i have a second script. This part can do the CSV-formate. i want to ombine it with the spider-logic. But first lets have a look at the first script: with the great spider-logic.
see the code that is appropiate:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(#$row);
print OUTFILE "#$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( #_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
But as this-above script-unfortunatley does not take care for the separators i have had to take care for a method, that does look for separators. In order to get the data (output) separated.
So with the separation i am able to work with the data - and store it in a mysql-table.. or do something else...So here [below] are the bits - that work out the csv-formate Note - i want to put the code below into the code above - to combine the spider-logic of the above mentioned code with the logic of outputting the data in CSV-formate.
where to set in the code Question: can we identify this point to migrate the one into the other... !?
That would be amazing... I hope i could make clear what i have in mind...!? Are we able to use the benefits of the both parts (/scripts ) migrating them into one?
So the question is: where to set in with the CSV-Script into the script (above)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d; # strip carriage returns
$html =~ s/ / /g; # expand spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my #cols = qw(
rownum
number
name
phone
type
website
);
my #fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for #$row;
# load the fields into the hash using a "hash slice"
my %h;
#h{#cols} = #$row;
# derive some fields from base fields, again using a hash slice
#h{qw/name street postal town/} = split /\n+/, $h{name};
#h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for #h{qw/name street postal town/};
$csv->combine(#h{#fields});
print $csv->string, "\n";
}
}
The thing is that i have had very good results with the first script! It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note 6142 records) - But note - the data are not separated...!
And i have a second script. This part can do the CSV-formate. i want to combine it with the spider-logic.
where is the part to insert? I look forward to any and all help.
if i have to be more precice - just let me know...

Since you have entered a complete script, I'll assume you want critique of the whole thing.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
Since you only use $te in one block, why are you declaring and initializing it in this outer scope? The same question applies to most of your variables -- try to declare them in the innermost scope possible.
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
In general, english variable names will enable you to collaborate with far more people than german names. I understand german, so I understand the intent of your code, but most of SO doesn't.
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
Don't use & to call subs. Just call them with workDir;. It hasn't been necessary to use & since 1994, and it can lead to a nasty gotcha because &callMySub; is a special case which doesn't do what you might think, while callMySub; does the Right Thing.
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
Generally lexical filehandles are preferred these days: open my $outfile, ">file"; Also, you should check for errors from open or use autodie; to make open die on failure.
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(#$row);
print OUTFILE "#$row\n";
This is the line to change if you want to put commas in separating your data. Look at the join function, it can do what you want.
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
It's very strange to initialize $te at the end of the loop instead of the beginning. It's much more idiomatic to declare and initialize $te at the top of the loop.
}
sub cleanup() {
for ( #_ ) {
s/s+/ /g;
Did you mean s/\s+/ /g;?
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
I haven't commented on your second script; perhaps you should ask it as a separate question.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

subset fasta file using id file - perl

Related

Degeneracy of characters when searching for a specific sub-string

perl hash mapping/retrieval issues with split and select columns

Argument is not numeric error while comparing hash values based on keys

Skip the problematic DAT and proceed with next DATs and out put the error or missed DATs list in separate text file in Perl

How to make LWP and HTML::TableExtract spitting out CSV with Text::CSV

Categories

Resources