PERL Fixed Width to CSV based on Input Files


EDITED: I'm attempting to write a brief script that takes a fixed-width input file, plus a file giving the start position and length of each attribute, and outputs the data as CSV instead of fixed width. I haven't dealt with removing whitespace yet and am currently focusing on the file-reading portion.
Fixed: My earlier issue was that this code returned data from the third row for $StartPosition and from the fourth row for $Length, when both should first come from the first row of COMMA. I had no idea what was prompting this behavior.
Next issue: It only reads the first record in practice_data.txt. I'm guessing I need to tell COMMA to go back to the beginning?
while (my $sourceLine = <SOURCE>) {
$StartPosition = 0;
$Length = 0;
$Output = "";
$NextRecord ="";
while (my $commaLine = <COMMA>) {
my $Comma = index($commaLine, ',');
print "Comma location found at $Comma \n";
$StartPosition = substr($commaLine, 0, $Comma);
print "Start position is $StartPosition \n";
$Comma = $Comma + 1
$Length = substr($commaLine, $Comma);
print "Length is $Length \n";
$NextRecord = substr($sourceLine, $StartPosition, $Length);
$Output = "$Output . ',' . $NextRecord";
}
print OUTPUT "$Output \n";
}
practice_data.txt
1234512345John      Doe       123 Mulberry Lane   Columbus  Ohio      43215Johnny         Jane      
5432154321Jason     McKinny   423 Thursday Lane   Columbus  Ohio      43212Jase           Jamie     
4321543212Mike      Jameson   289 Front Street    Cleveland Ohio      43623James          Sarah     
Each record is 100 characters long.
Definitions.txt:
0,10
10,10
20,10
30,20
50,10
60,10
70,5
75,15
90,10

It always helps to provide enough information so that we can at least do some testing without having to read your code and imagine what the data must look like.
I suggest you use unpack, after first building a template from the file that holds the field specifications. Note that the A field specifier trims trailing spaces from the data.
It is all but essential to use the Text::CSV module to parse or generate well-formed CSV data. And I have used the autodie pragma to avoid having to explicitly check and report on the status of every I/O operation.
I have used this data
my_source_data.txt
12345678  ABCDE1234FGHIJK
my_field_spec.txt
0,8
10,5
15,4
19,6
And this program
use strict;
use warnings;
use 5.010;
use autodie;
use Text::CSV;
my @template;
open my $field_fh, '<', 'my_field_spec.txt';
while ( <$field_fh> ) {
    my (@info) = /\d+/g;
    die unless @info == 2;
    push @template, sprintf '@%dA%d', @info;
}
my $template = "@template";
open my $source_fh, '<', 'my_source_data.txt';
my $csv = Text::CSV->new( { binary => 1, eol => $/ } );
while ( <$source_fh> ) {
    my @fields = unpack $template;
    $csv->print(\*STDOUT, \@fields);
}
output
12345678,ABCDE,1234,FGHIJK

It looks like you're slightly confused about how to read the contents of the COMMA filehandle. Each time you evaluate <COMMA>, you read another line from that file. Instead, read a line into a scalar once, like my $line = <FH>, and use that scalar from then on. Also note that COMMA is at end-of-file once the first record has been processed, so it has to be rewound before the next record:
while (my $source_line = <SOURCE>) {
    chomp $source_line;
    $Output = "";
    seek(COMMA, 0, 0);    # rewind the definitions file for every record
    while (my $comma_line = <COMMA>) {
        chomp $comma_line;
        my $Comma = index($comma_line, ',');
        print "Comma location found at $Comma \n";
        $StartPosition = substr($comma_line, 0, $Comma);
        print "Start position is $StartPosition \n";
        $Length = substr($comma_line, $Comma + 1);    # skip past the comma itself
        print "Length is $Length \n";
        $NextRecord = substr($source_line, $StartPosition, $Length) . ',';    # . concatenates; + would add numerically
        $Output = "$Output$NextRecord";
    }
    print OUTPUT "$Output \n";
}
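Since the asker's follow-up problem is that COMMA is exhausted after the first record, an alternative to rewinding is to read the field definitions once and reuse them for every record. The following is only a sketch under assumptions: the helper names parse_specs and record_to_csv are made up for illustration, the sample record is built in the script rather than read from practice_data.txt, and fields are joined with a plain comma, so values that themselves contain commas or quotes would still need Text::CSV.

```perl
use strict;
use warnings;

# Parse "start,length" lines into a list of [start, length] pairs.
# (parse_specs is a hypothetical helper name.)
sub parse_specs {
    my @lines = @_;
    my @specs;
    for my $line (@lines) {
        next unless $line =~ /^\s*(\d+)\s*,\s*(\d+)\s*$/;
        push @specs, [ $1, $2 ];
    }
    return @specs;
}

# Cut one fixed-width record into fields and join them with commas.
# Trailing spaces are stripped; no CSV quoting is attempted here.
sub record_to_csv {
    my ($record, @specs) = @_;
    my @fields;
    for my $spec (@specs) {
        my ($start, $length) = @$spec;
        # Guard against records shorter than the definition expects.
        my $field = $start < length($record) ? substr($record, $start, $length) : '';
        $field =~ s/\s+\z//;
        push @fields, $field;
    }
    return join ',', @fields;
}

# The definitions are parsed once, outside the per-record loop.
my @specs  = parse_specs("0,10\n", "10,10\n", "20,10\n");
my $record = sprintf '%-10s%-10s%-10s', '1234512345', 'John', 'Doe';
print record_to_csv($record, @specs), "\n";    # prints 1234512345,John,Doe
```

In a real script the lines passed to parse_specs would come from reading Definitions.txt once before the loop over the source file.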

Related

Using Perl's SEEK to jump to a line in a file and continuing to read the file

My goal is to open a file containing a single column of fixed length (1 character = 2 bytes on my Mac), and then to read the lines of the file into an array, beginning and ending at specified points. The file is very long, so I am using the seek command to jump to the appropriate starting line of the file. The file is a chromosomal sequence, arranged as a single column. I am successfully jumping to the appropriate point in the file, but I am having trouble reading the sequence into the array.
my @seq = (); # to contain the stretch of sequence I am seeking to retrieve from file.
my $from_bytes = 2*$from - 2; # specifies the "start point" in terms of bytes.
seek( SEQUENCE, $from_bytes, 0 );
my $from_base = <SEQUENCE>;
push ( @seq, $from_base ); # script is going to the correct line and retrieving correct base.
my $count = $from + 1; # here I am trying to continue the read into @seq
while ( <SEQUENCE> ) {
    if ( $count = $to ) { # $to specifies the line at which to stop
        last;
    }
    else {
        push( @seq, $_ );
        $count++;
        next;
    }
}
print "seq is: @seq\n\n"; # script prints only the first base

It seems you are reading fixed width records, consisting of $to lines, and each line has 2 bytes (1 char + 1 newline). As such you can simply read each chromosome sequence with a single read. A short example:
use strict;
use warnings;
use autodie;
my $record_number = $ARGV[0];
my $lines_per_record = 4; # change to the correct value
my $record_length = $lines_per_record * 2;
my $offset = $record_length * $record_number;
my $fasta_test = "fasta_test.txt";
if (open my $SEQUENCE, '<', $fasta_test) {
my $sequence_string;
seek $SEQUENCE, $offset, 0;
my $chars_read = read($SEQUENCE, $sequence_string, $record_length);
if ($chars_read) {
my @seq = split /\n/, $sequence_string; # if you want it as an array
$sequence_string =~ s/\n//g; # if you want the chromosome sequence as a single string without newlines
print $sequence_string, "\n";
} else {
print STDERR "Failed to read record $record_number!\n";
}
close $SEQUENCE;
}
With more information one could probably present a better solution.
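As a side note on the question's own loop: if ( $count = $to ) assigns rather than compares, so the test is true whenever $to is non-zero and the loop exits on its first pass, which matches the symptom of only the first base being printed. A comparison needs ==. The sketch below reads lines $from through $to with one seek and one read, assuming, as the answer above does, that every line is one character plus a newline. The helper name read_bases is hypothetical, and an in-memory filehandle stands in for the real sequence file.

```perl
use strict;
use warnings;

# Return lines $from..$to (1-based, inclusive) from a file in which every
# line is exactly one character plus a newline, i.e. 2 bytes.
# (read_bases is a hypothetical helper name.)
sub read_bases {
    my ($fh, $from, $to) = @_;
    my $bytes_per_line = 2;
    seek $fh, ($from - 1) * $bytes_per_line, 0 or die "seek failed: $!";
    my $want = ($to - $from + 1) * $bytes_per_line;
    my $got  = read $fh, my $buffer, $want;
    die "read failed: $!" unless defined $got;
    return split /\n/, $buffer;
}

# An in-memory filehandle stands in for the real sequence file.
my $data = join '', map { "$_\n" } qw(A C G T A C G T);
open my $fh, '<', \$data or die "open failed: $!";
my @seq = read_bases($fh, 3, 6);
print "seq is: @seq\n";    # prints seq is: G T A C
```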

Parse only date from a line in text file

I have a text file with delimiters as spaces at the start of the lines.
Lines with no initial spaces should go in the first column of the CSV file; those with two spaces should go in the second column of the CSV; and those with four spaces should go in the third column.
This is all working fine as required.
In lines starting with two spaces I want that only the date should go in the second column, discarding the other data of the line. The rest should all remain as it is.
I have denoted spaces at the start of the line as # for clarity.
Text file:
Component1
##(111) Amar Sen <amar.sen@gmail.com> <No comment> 2013/04/01
####/Com/src/folder1/folder2/newfile.txt
##(1199) Prashant Singh <psinsgh@gmail.com> <No comment> 2013/04/24
####/Com/src/folder1/folder2/testfile24
####/Com/src/folder1/folder2/testfile25
####/Com/src/folder1/folder2/testfile26
##(1204) Anthony Li <anthon.li@gmail.com> <No comment> 2013/04/25
####/Com/src2
Component2(added)
Component3
Output format:
Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt
2013/04/24,/Com/src/folder1/folder2/testfile24
/Com/src/folder1/folder2/testfile25
/Com/src/folder1/folder2/testfile26
2013/04/25,/Com/src2
Component2(added)
Component3
Here's the code. It's working fine except for the change described above.
use strict;
use warnings;
my $previous_count = "-1"; #beginning, we will think, that no spaces.
my $current_count = "0"; #current default value
my $maximum_count = 3;
my $to_written = "";
my $delimiter_between_columns = ",";
my $newline_separator = ";";
my $file = 'C:\\textfile.txt';
open (my $fh, '<:encoding(UTF-8)', $file) or die "Could not open file '$file' $!";
while (my $row = <$fh>) {
# ok, read.
chomp($row);
# print "row is : $row\n";
if ($row =~ m/^(\s*)/) {
#print length($1);
$current_count = length($1) / 2; #take number of spaces divided by 2
$row =~ s/^\s+//;
if ($previous_count >= $current_count || $previous_count == $maximum_count) {
#output here
print "$to_written" . $newline_separator . "\n";
$previous_count = 0;
$to_written = "";
}
$previous_count = 0 if ($previous_count == -1);
$to_written .= $delimiter_between_columns x ($current_count - $previous_count) . "$row";
$previous_count = $current_count;
#print"\n";
}
}
print "$to_written" . $newline_separator . "\n";

You seem to have got yourself tied up in knots a little with your solution.
This program seems to do what you need. I have added some commas to your "output format" as your example has no placeholders for initial empty fields.
I have kept the hash characters for this purpose. Obviously it is trivial to change them for spaces, replacing s/^(#*)// with s/^(\s*)//.
use strict;
use warnings;
my @row;
while (<DATA>) {
    chomp;
    s/^(#*)//;
    my $i = length($1) / 2;
    if ($i == 1 and m<(\d{4}/\d{2}/\d{2})>) {
        $row[$i] = $1;
    }
    else {
        $row[$i] = $_;
    }
    if ($i == 2) {
        print join(',', @row), ";\n";
        @row = ('') x 3;
    }
}
__DATA__
Component1
##(111) Amar Sen <amar.sen@gmail.com> <No comment> 2013/04/01
####/Com/src/folder1/folder2/newfile.txt
##(1199) Prashant Singh <psinsgh@gmail.com> <No comment> 2013/04/24
####/Com/src/folder1/folder2/testfile24
####/Com/src/folder1/folder2/testfile25
####/Com/src/folder1/folder2/testfile26
##(1204) Anthony Li <anthon.li@gmail.com> <No comment> 2013/04/25
####/Com/src2
output
Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt;
,2013/04/24,/Com/src/folder1/folder2/testfile24;
,,/Com/src/folder1/folder2/testfile25;
,,/Com/src/folder1/folder2/testfile26;
,2013/04/25,/Com/src2;
Update
It makes more sense to cascade values from columns one and two into subsequent rows where they are not supplied. If you remove the line @row = ('') x 3 from my program it will do just that, with this output:
Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile26;
Component1,2013/04/25,/Com/src2;

Unable to remove newlines from Ensembl FASTA

I'm trying to find protein motifs from an Ensembl FASTA file. I've gotten the bulk of the script done, such as retrieving the sequence ID and the sequence itself, but I am receiving some funny results.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $motif1 = qr/(HE(\D)(\D)H(\D{18})E)/x;
my $motif2 = qr/(AMEN)/x;
my $input;
my $output;
my $count_total = 0;
my $count_processed = 0;
my $total_run = 0;
my $id;
my $seq;
my $motif1_count = 0;
my $motif2_count = 0;
my $motifboth_count = 0;
############################################################################################################################
# FILEHANDLING - INPUT/OUTPUT
# User input prompting and handling
print "**********************************************************\n";
print "Question 3\n";
print "**********************************************************\n";
#opens the user input file previously assigned to varible to new variable or kills script.
open my $fh, '<', "chr2.txt" || die "Error! Cannot open file:$!\n";
#Opens and creates output file previously assigned to variable to new variable or kills script
#open(RESULTS, '>', $output)||die "Error! Cannot create output file:$!\n";
# FILE and DATA PROCESSING
############################################################################################################################
while (<$fh>) {
if (/^>(\S+)/) {
$count_total = ++$count_total; # Plus one to count
find_motifs($id, $seq) if $seq; # Passing to subroutine
$id = substr($1, 0, 15); # Taking only the first 16 characters for the id
$seq = '';
}
else {
chomp;
$seq .= $_;
}
}
print "Total proteins: $count_total \n";
print "Proteins with both motifs: $motifboth_count \n";
print "Proteins with motif 1: $motif1_count \n";
print "Proteins with motif 2: $motif2_count \n";
exit;
######################################################################################################################################
# SUBROUTINES
#
# Takes passed variables from special array
# Finds the position of motif within seq
# Checks for motif 1 presence and if found, checks for motif 2. If not found, prints motif 1 results
# If no motif 1, checks for motif 2
sub find_motifs {
my ($id, $seq) = @_;
if ($seq =~ $motif1) {
my $motif_position = index $seq, $1;
my $motif = $1;
if ($seq =~ $motif2) {
$motif1_count = ++$motif1_count;
$motif2_count = ++$motif2_count;
$motifboth_count = ++$motifboth_count;
print "$id, $motif_position, \n$motif \n";
}
else {
$motif1_count = ++$motif1_count;
print "$id, $motif_position,\n $motif\n\n";
}
}
elsif ($seq =~ $motif2) {
$motif2_count = ++$motif2_count;
}
}
What is happening is that if the motif is found at the end of one line of data and the beginning of the next one, it will return the motif with the newline in the data. This method of slurping in data has worked well before.
Sample Results:
ENSG00000119013, 6, HEHGHHKMELPDYRQWKIEGTPLE (CORRECT!)
ENSG00000142327, 123, HEVAHSWFGNAVTNATWEEMWLSE (CORRECT!)
ENSG00000151694, 410, **AECAPNEFGAEHDPDGL**
This is the problem. The motif matches but returns the first half, the newline, then prints the second half on the same line as well (which is a symptom of the larger problem - Getting rid of the newline!)
Total proteins: 13653
Proteins with both motifs: 1
Proteins with motif 1: 12
Proteins with motif 2: 22
I've tried different methods, such as @seq =~ s/\r//g or s/\n//g, at different places within the script.

It's not clear from your description, but "prints the second half on the same line as well" sounds like your output is overlaid on itself because it has a carriage-return character at the end.
This happens if you are running on a Linux system and you just chomp a line that has come from Windows.
You should replace chomp with s/\s+\z// which will remove all trailing whitespace. And because both carriage return and linefeed count as "whitespace" it will remove all possible termination characters.
By the way, you are misunderstanding the purpose of the ++ operator. It modifies the contents of the variable it is applied to, so all you need is ++$motif1_count etc. Your code works as it is because the operator also returns the value of the incremented variable, so $motif1_count = ++$motif1_count first increments the variable and then assigns it to itself.
Also, you use \D in your regex. Are you aware that this matches any non-digit character? It seems a very vague classification to be useful.
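To make the two points above concrete, here is a small sketch (the sample string and the helper name strip_eol are made up for illustration). It shows that chomp with the default $/ removes only the linefeed from a Windows-style line, while s/\s+\z// removes the carriage return as well, and that ++ increments in place without any assignment.

```perl
use strict;
use warnings;

# Strip all trailing whitespace, including "\r\n", from a line.
# (strip_eol is a hypothetical helper name.)
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    return $line;
}

my $windows_line = "HEHGH\r\n";

my $chomped = $windows_line;
chomp $chomped;                              # removes "\n" only; "\r" survives
my $stripped = strip_eol($windows_line);     # removes "\r\n"

print "after chomp:      ", length($chomped),  " characters\n";    # 6
print "after s/\\s+\\z//:  ", length($stripped), " characters\n";  # 5

my $count = 0;
++$count;                                    # increments in place
print "count is $count\n";                   # count is 1
```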

How to parse through tab-delimited file in perl?

I'm new to Perl, and I've hit a mental roadblock. I need to extract information from a tab delimited file as shown below.
#name years risk total
adam 5 100 200
adam 5 50 100
adam 10 20 300
bill 20 5 100
bill 30 10 800
In this example, the tab delimited file shows length of investment, amount of money risked, and total at the end of investment.
I want to parse through this file, and for each name (e.g. adam), calculate sum of years invested 5+5, and calculate sum of earnings (200-100) + (100-50) + (300-20). I also would like to save the totals for each name (200, 100, 300).
Here's what I have tried so far:
my $filename;
my $seq_fh;
open $seq_fh, $frhitoutput
or die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my @line = split(/\s+/, $line);
my $yeartotal = 0;
my $earning = 0;
#$line[0] = name
#$line[1] = years
#$line[2] = start
#$line[3] = end
while (@line[0]){
$yeartotal += $line[1];
$earning += ($line[3]-$line[2]);
}
}
Any ideas of where I went wrong?

The Text::CSV module can be used to read tab-delimited data. Often much nicer than trying to manually hack yourself something up with split and so on when it comes to things like quoting, escaping, etc..

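You're wrong here: while (@line[0]) { ... } never terminates, because nothing inside the loop changes @line, so the condition stays true.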
I'd do:
my $seq_fh;
my %result;
open($seq_fh, $frhitoutput) || die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my @line = split(/\s+/, $line);
$result{$line[0]}{yeartotal} += $line[1];
$result{$line[0]}{earning} += $line[3] - $line[2];
}

You should use hash, something like this:
my %hash;
while (my $line = <>) {
next if $line =~ /^#/;
my ($name, $years, $risk, $total) = split /\s+/, $line;
next unless defined $name and defined $years
and defined $risk and defined $total;
$hash{$name}{years} += $years;
$hash{$name}{risk} += $risk;
$hash{$name}{total} += $total;
$hash{$name}{earnings} += $total - $risk;
}
foreach my $name (sort keys %hash) {
print "$name earned $hash{$name}{earnings} in $hash{$name}{years}\n";
}

Nice opportunity to explore Perl's powerful command line options! :)
Code
Note: this code could be a command-line one-liner, but it's a little easier to read this way. When writing it in a proper script file, you really should enable strict and warnings and use better names. This version won't compile under strict; you would have to declare our %d.
#!/usr/bin/perl -nal
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
# print summary
END { print "$_:\tyears: $d{$_}{y},\tearnings: $d{$_}{e}" for sort keys %d }
Output
adam: years: 20, earnings: 430
bill: years: 50, earnings: 885
Explanation
I make use of the -n switch here, which basically lets your code iterate over the input records (-l chomps each line and adds a newline to every print). The -a switch lets perl split the lines into the array @F. Simplified version:
while (defined($_ = <STDIN>)) {
chomp $_;
our(@F) = split(' ', $_, 0);
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
}
%d is a hash with the names as keys and hashrefs as values, which contain years (y) and earnings (e).
The END block is executed after finishing the input line processing and outputs %d.
Use O's Deparse to view the code which is actually executed:
book:/tmp memowe$ perl -MO=Deparse tsv.pl
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
our(@F) = split(' ', $_, 0);
$d{$F[0]}{'y'} += $F[1];
$d{$F[0]}{'e'} += $F[3] - $F[2];
sub END {
print "${_}:\tyears: $d{$_}{'y'},\tearnings: $d{$_}{'e'}" foreach (sort keys %d);
}
;
}
tsv.pl syntax OK

It seems like a fixed-width file, I would use unpack for that
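If the file really were fixed-width rather than tab-delimited, the unpack suggestion could look like the sketch below. The column layout (four 6-character columns) is purely an assumption for illustration, as is the helper name parse_fixed; the question describes the data as tab-delimited, in which case split is the more natural tool.

```perl
use strict;
use warnings;

# Hypothetical layout, not taken from the question: four 6-character columns.
my $template = 'A6 A6 A6 A6';

# Split one fixed-width line into name, years, risk and total.
# The A specifier strips trailing spaces from each field.
sub parse_fixed {
    my ($line) = @_;
    return unpack $template, $line;
}

# Build sample lines in the assumed layout from a few of the question's values.
my @rows  = ( [ 'adam', 5, 100, 200 ], [ 'adam', 10, 20, 300 ], [ 'bill', 20, 5, 100 ] );
my @lines = map { sprintf('%-6s%-6s%-6s%-6s', @$_) } @rows;

my %totals;
for my $line (@lines) {
    my ($name, $years, $risk, $total) = parse_fixed($line);
    $totals{$name}{years}    += $years;
    $totals{$name}{earnings} += $total - $risk;
}

print "$_: years $totals{$_}{years}, earnings $totals{$_}{earnings}\n"
    for sort keys %totals;
```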

How to make LWP and HTML::TableExtract spitting out CSV with Text::CSV

I am currently working on a little parser.
I have had very good results with the first script; it runs great.
It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note: 6142 records). However, the data are not separated, so subsequent work with the data is a bit difficult. Therefore I have a second script; see below.
Note: friends helped me with both scripts. I am a true novice and need help merging the two into one; my Perl knowledge is not advanced enough to do the merge on my own. Any and all help would be great!
The first script is a spider and parser. It spits out the data like this:
lfd. Nr. Schul- nummer Schulname Straße PLZ Ort Telefon Fax Schulart Webseite
1 0401 Mädchenrealschule Marienburg, Abenberg, der Diözese Eichstätt Marienburg 1 91183  Abenberg  09178/509210 Realschulen mrs-marienburg.homepage.t-online.de
2 6581 Volksschule Abenberg (Grundschule) Güssübelstr. 2 91183  Abenberg  09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
3 6913 Mittelschule Abenberg  Güssübelstr. 2 91183  Abenberg  09178/215 09178/905060 Volksschulen home.t-online.de/home/vs-abenberg
4 0402 Johann-Turmair-Realschule Staatliche Realschule Abensberg Stadionstraße 46 93326  Abensberg  09443/9143-0,12,13 09443/914330 Realschulen www.rs-abensberg.de
But I need to separate the data with commas or something like that.
I also have a second script, which can produce the CSV format. I want to combine it with the spider logic. But first let's have a look at the first script, with the spider logic.
Here is the relevant code:
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
Unfortunately, the script above does not take care of separators, so I had to find a method that produces them in order to get separated output.
With separated data I am able to work with it further, for example storing it in a MySQL table. Below are the bits that produce the CSV format. I want to put the code below into the code above, to combine the spider logic of the first script with the CSV output logic of the second.
So the question is: at which point in the first script should the CSV code be inserted? Can we use the benefits of both scripts by merging them into one?
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=a&a=20';
$html =~ tr/\r//d; # strip carriage returns
$html =~ s/&nbsp;/ /g; # expand spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for @$row;
# load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
# derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /\n+/, $h{name};
@h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
To summarize: I have had very good results with the first script. It fetches the data from the page: http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=20
(note: 6142 records), but the data are not separated.
The second script can produce the CSV format, and I want to combine it with the spider logic.
Where is the right place to insert it? I look forward to any and all help.
If I need to be more precise, just let me know.

Since you have entered a complete script, I'll assume you want critique of the whole thing.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use LWP::Simple;
use Cwd;
use POSIX qw(strftime);
my $te = HTML::TableExtract->new;
Since you only use $te in one block, why are you declaring and initializing it in this outer scope? The same question applies to most of your variables -- try to declare them in the innermost scope possible.
my $total_records = 0;
my $suchbegriffe = "e";
my $treffer = 50;
In general, English variable names will enable you to collaborate with far more people than German names. I understand German, so I understand the intent of your code, but most of SO doesn't.
my $range = 0;
my $url_to_process = "http://192.68.214.70/km/asps/schulsuche.asp?q=";
my $processdir = "processing";
my $counter = 50;
my $displaydate = "";
my $percent = 0;
&workDir();
Don't use & to call subs. Just call them with workDir;. It hasn't been necessary to use & since 1994, and it can lead to a nasty gotcha because &callMySub; is a special case which doesn't do what you might think, while callMySub; does the Right Thing.
chdir $processdir;
&processURL();
print "\nPress <enter> to continue\n";
<>;
$displaydate = strftime('%Y%m%d%H%M%S', localtime);
open OUTFILE, ">webdata_for_$suchbegriffe\_$displaydate.txt";
Generally lexical filehandles are preferred these days: open my $outfile, ">file"; Also, you should check for errors from open or use autodie; to make open die on failure.
&processData();
close OUTFILE;
print "Finished processing $total_records records...\n";
print "Processed data saved to $ENV{HOME}/$processdir/webdata_for_$suchbegriffe\_$displaydate.txt\n";
unlink 'processing.html';
die "\n";
sub processURL() {
print "\nProcessing $url_to_process$suchbegriffe&a=$treffer&s=$range\n";
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'tempfile.html') or die 'Unable to get page';
while( <tempfile.html> ) {
open( FH, "$_" ) or die;
while( <FH> ) {
if( $_ =~ /^.*?(Treffer <b>)(d+)( - )(d+)(</b> w+ w+ <b>)(d+).*/ ) {
$total_records = $6;
print "Total records to process is $total_records\n";
}
}
close FH;
}
unlink 'tempfile.html';
}
sub processData() {
while ( $range <= $total_records) {
getstore("$url_to_process$suchbegriffe&a=$treffer&s=$range", 'processing.html') or die 'Unable to get page';
$te->parse_file('processing.html');
my ($table) = $te->tables;
for my $row ( $table->rows ) {
cleanup(@$row);
print OUTFILE "@$row\n";
This is the line to change if you want to put commas in separating your data. Look at the join function, it can do what you want.
}
$| = 1;
print "Processed records $range to $counter";
print "\r";
$counter = $counter + 50;
$range = $range + 50;
$te = HTML::TableExtract->new;
}
It's very strange to initialize $te at the end of the loop instead of the beginning. It's much more idiomatic to declare and initialize $te at the top of the loop.
}
sub cleanup() {
for ( @_ ) {
s/s+/ /g;
Did you mean s/\s+/ /g;?
}
}
sub workDir() {
# Use home directory to process data
chdir or die "$!";
if ( ! -d $processdir ) {
mkdir ("$ENV{HOME}/$processdir", 0755) or die "Cannot make directory $processdir: $!";
}
}
I haven't commented on your second script; perhaps you should ask it as a separate question.
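The gotcha with & mentioned in the critique can be seen in a few lines. This is only an illustrative sketch with made-up sub names: calling &name; with no parentheses hands the callee the caller's current @_, whereas name() passes an empty argument list.

```perl
use strict;
use warnings;

# Report how many arguments were received.
sub count_args { return scalar @_ }

sub with_ampersand    { return &count_args; }     # reuses the caller's @_
sub without_ampersand { return count_args(); }    # passes an empty list

print with_ampersand(1, 2, 3), "\n";       # prints 3
print without_ampersand(1, 2, 3), "\n";    # prints 0
```

This is why a bare workDir(); or processURL(); is the safer habit: the callee only ever sees the arguments that were written at the call site.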
