Extract data from Excel files and structure them into hashes - Perl

I have a three-column Excel file, which has the following pattern:
12 A P1
23 B P5
24 C P2
15 D P1
06 E P5
The structure underlying this data set is as follows:
P1 contains A and D; A corresponds to 12 and D corresponds to 15
P2 contains C; C corresponds to 24
P5 contains B and E; B corresponds to 23 and E corresponds to 06
I want to represent this kind of structure as a nested hash, i.e., use P1 as a key to point to a hash, and A as the key for this second-level hash. Is there a way to implement this in Perl?

Spreadsheet::ParseExcel can be used to parse .xls files. Below is a sample program that builds the desired data structure.
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Spreadsheet::ParseExcel;

my $parser    = Spreadsheet::ParseExcel->new;
my $workbook  = $parser->parse( shift or die "Please provide a file\n" );
my $worksheet = $workbook->worksheet(0);

my %data;
for my $row ( 0 .. $worksheet->row_range ) {
    my $value     = $worksheet->get_cell( $row, 0 )->value;
    my $key       = $worksheet->get_cell( $row, 1 )->value;
    my $super_key = $worksheet->get_cell( $row, 2 )->value;
    $data{$super_key}->{$key} = $value;
}
print Dumper \%data;
Output
$VAR1 = {
    'P5' => {
        'E' => '06',
        'B' => '23'
    },
    'P2' => {
        'C' => '24'
    },
    'P1' => {
        'A' => '12',
        'D' => '15'
    }
};
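Individual values can then be pulled out with two subscripts, or the inner hashes iterated over; for example:
print $data{P1}{A}, "\n";    # prints 12
for my $key ( sort keys %{ $data{P5} } ) {
    print "$key => $data{P5}{$key}\n";
}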

I have had to process data in spreadsheets in the past. If you are dealing with a small number of Excel files, export them manually to CSV files using spreadsheet software such as Excel. Then parse the CSV file and store the cell values in a hash of hashes in Perl:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper::Simple;

my $file   = "";
my @row    = ();
my $rowidx = 1;
my %hh     = ();    # hash of hashes

open( INFILE, "input.csv" ) or die("Can not open input file: $!");
while ( $file = <INFILE> ) {
    @row = parse($file);
    chomp(@row);
    $hh{ $row[2] }{ $row[1] } = $row[0];
    #warn Dumper %hh; # debug
    $rowidx++;
}
close(INFILE);
warn Dumper %hh;
exit;

sub parse {
    my @newrow  = ();
    my $columns = shift;    # read next row
    # parse and store columns to array
    push( @newrow, $+ ) while $columns =~ m{"([^\"\\]*(?:\\.[^\"\\]*)*)",?|([^,]+),?|,}gx;
    push( @newrow, undef ) if substr( $columns, -1, 1 ) eq ',';
    return @newrow;
}
Running this gives
$ more input.csv
12,A,P1
23,B,P5
24,C,P2
15,D,P1
06,E,P5
$ ./ReadCSV.pl input.csv
%hh = (
    'P5' => {
        'E' => '06',
        'B' => '23'
    },
    'P2' => {
        'C' => '24'
    },
    'P1' => {
        'A' => '12',
        'D' => '15'
    }
);
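If installing another module is an option, Text::CSV could replace the hand-rolled parse() above and handle quoting edge cases for you. A minimal sketch of the same loop (treat this as an untested alternative, not part of the original script):
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
use Text::CSV;

my $csv = Text::CSV->new( { binary => 1 } );
open( my $in, '<', "input.csv" ) or die("Can not open input file: $!");
my %hh;    # hash of hashes
while ( my $row = $csv->getline($in) ) {
    # same column order as above: value, key, super key
    $hh{ $row->[2] }{ $row->[1] } = $row->[0];
}
close($in);
warn Dumper \%hh;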

There's the Spreadsheet::ParseExcel module, which does a pretty good job of parsing a regular *.xls spreadsheet.
Fortunately, there's an extension called Spreadsheet::XLSX that works with Spreadsheet::ParseExcel to read *.xlsx spreadsheets as well. The methods used with Spreadsheet::ParseExcel work with both *.xls and *.xlsx files if you also have Spreadsheet::XLSX installed.
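For example, here is a minimal sketch of reading the same three-column layout from an .xlsx file, following the Spreadsheet::XLSX synopsis (the file name is a placeholder):
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
use Spreadsheet::XLSX;

my $excel = Spreadsheet::XLSX->new('input.xlsx');    # placeholder name
my $sheet = $excel->{Worksheet}[0];
my %data;
for my $row ( $sheet->{MinRow} .. $sheet->{MaxRow} ) {
    # column 0 = value, column 1 = key, column 2 = super key
    my ( $value, $key, $super_key ) =
        map { $sheet->{Cells}[$row][$_]{Val} } 0 .. 2;
    $data{$super_key}{$key} = $value;
}
print Dumper \%data;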

What version of Excel are the files formatted in?
I have had a very good experience reading from (and writing to) .xls files using the module Spreadsheet::ParseExcel (and Spreadsheet::WriteExcel for output).
Unfortunately, I did this 4 years ago, when the .xlsx format was not as prevalent, so I can't speak for that format.

Related

Using a variable as a hash key in Perl 5.32

I am reading user IDs from a CSV file and trying to check whether each user ID exists in my hash. However, I have found that my check via if(exists($myUserIDs{"$userID"})) never returns true, despite testing with multiple keys that are in my hash.
open $file, "<", "test.csv";
while (<$file>)
{
    chomp;
    @fields = split /,/, $_;
    $userID = $fields[1];
    if ( exists( $myUserIDs{"$userID"} ) )
    {
        # do something
    }
}
Turns out I wrote my test CSV file with spaces after each comma,
like 1, 2, 3 instead of 1,2,3, so my keys weren't actually matching. There goes an hour of my time, haha.
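For what it's worth, splitting on the comma plus any surrounding whitespace would have tolerated both layouts; a small sketch:
# "1, 2, 3" and "1,2,3" now produce identical keys
my @fields = split /\s*,\s*/, $_;
my $userID = $fields[1];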
See if this works for you.
use strict; use warnings;
use Data::Dumper;

my ( %myUserIDs, @fields );
%myUserIDs = (
    '1' => 'A',
    '4' => 'D'
);
print Dumper( \%myUserIDs );

while (<DATA>)
{
    chomp;
    @fields = split /,/, $_;
    my $userID = $fields[1];
    if ( exists( $myUserIDs{$userID} ) ) {
        print "ID:$userID exists in Hash\n";
    }
    else {
        print "ID:$userID does not exist in Hash\n";
    }
}
__DATA__
A,1
B,2
C,3
Output:
$VAR1 = {
    '4' => 'D',
    '1' => 'A'
};
ID:1 exists in Hash
ID:2 does not exist in Hash
ID:3 does not exist in Hash

Hash to CSV/TSV conversion in Perl

I'm trying to convert a simple hash to CSV/TSV in Perl. Now, the tricky part is that I'm unable to use Text::CSV::Slurp for some reason, and I'm left with Text::CSV_XS and Text::CSV.
Problem Description:
I am able to create a CSV file from the hash that I have, but the values aren't displayed the way I would like.
Example:
This is what my hash looks like:
$VAR1 = {
    '2015-12-09 10:49:00' => '750 mW',
    '2015-12-09 10:49:02' => '751 mW'
};
I would want keys to be under one tab and values to be under another tab. Instead, I get a CSV which has everything comma-separated on one line.
Desired Output:
key1 value1
key2 value2
Actual Output:
key1 value1 key2 value2
This is what my code looks like as of now:
open( DATA, "+>file.csv" ) || die "Couldn't open file file.csv, $!";
my $csv = Text::CSV_XS->new();
if ( $input == 19 ) {
    my $status = $csv->print( \*DATA, \@{ [%hash1] } );
}
elsif ( $input == 11 ) {
    my $status = $csv->print( \*DATA, \@{ [%hash2] } );
}
close(DATA) || die "Couldn't close file properly";
I have gone through numerous questions on Stack Overflow and Perl Monks, but I somehow haven't been able to figure out a solution that doesn't use Text::CSV::Slurp.
Please help.
P.S.: %hash1 and %hash2 are simple hashes with basic key-value pairs, not hashes of hashes, as of now. However, as the code develops, I may have to implement the logic on HoH as well.
If I'm reading you right, something like this is what you're after:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;

my $VAR1 = {
    '2015-12-09 10:49:00' => '750 mW',
    '2015-12-09 10:49:02' => '751 mW'
};

my $csv = Text::CSV->new( { sep_char => "\t", eol => "\n", binary => 1 } );
foreach my $key ( sort keys %{$VAR1} ) {
    $csv->print( \*STDOUT, [ $key, $VAR1->{$key} ] );
}
(Or if you're doing it with a hash, not a hash ref):
foreach my $key ( sort keys %hash ) {
    $csv->print( \*STDOUT, [ $key, $hash{$key} ] );
}
Note - this is explicitly sorting, because hashes are unordered. You look to be using a sortable date format, so this should be OK, but otherwise you may need to parse each date to an epoch and sort on that.
Output
"2015-12-09 10:49:00" "750 mW"
"2015-12-09 10:49:02" "751 mW"
Note - TSV embeds quotes because the fields contain spaces. You can remove those by:
my $csv = Text::CSV->new(
    {   sep_char   => "\t",
        eol        => "\n",
        binary     => 1,
        quote_char => undef
    }
);
I would strongly suggest not using DATA as your output filehandle, as it already has a special meaning in Perl. In fact, I would suggest using lexical filehandles with the 3-arg open:
open ( my $output, '>', 'file.csv' ) or die $!;
# ...
$csv -> print ( $output, ### everything else

Parsing unsorted data from large fixed width text

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.
I have a large fixed-width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the data parameters are sorted by time, which makes them difficult (at least for me) to parse into Matlab. So, seeing how Matlab is not that good at parsing text, I thought I would try Perl. I wrote the following code, which works ... at least on my small test file. However, it is painfully slow when I tried it on an actual large data file. It was pieced together from lots of examples for various tasks from the web / Perl documentation.
Here is a small sample of the data file. Note: the real file has about 2000 parameters and is 1-2 GB. Parameters can be text, doubles, or unsigned integers.
Param 1 filter = ALL_VALUES
Param 2 filter = ALL_VALUES
Param 3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1.1 Param 1 UI 5
2.23 Param 3 TXT Some Text 1
3.2 Param 1 UI 10
4.5 Param 2 D 2.1234
5.3 Param 1 UI 15
6.121 Param 2 D 3.1234
7.56 Param 3 TXT Some Text 2
The basic logic of my script is to:
Read until the ---- line to build a list of parameters to extract (their lines always contain "filter =").
Use the ---- line to determine field widths. It is broken by spaces.
For each parameter, build time and data arrays (a while loop over the data nested inside a foreach over parameters).
In a continue block, write time and data to the binary file, then record name, type, and offsets in a text table-of-contents file (used later to read the binary file into Matlab).
Here is my script:
#!/usr/bin/perl
$lineArg1 = @ARGV[0];
open( INFILE, $lineArg1 );
open BINOUT, '>:raw', $lineArg1 . ".bin";
open TOCOUT, '>', $lineArg1 . ".toc";
my $line;
my $data_start_pos;
my @param_name;
my @template;
while ( $line = <INFILE> ) {
    chomp $line;
    if ( $line =~ s/\s+filter = ALL_VALUES// ) {
        $line = =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @param_name, $line;
    }
    elsif ( $line =~ /^------/ ) {
        @template = map { 'A' . length } $line =~ /(\S+\s*)/g;
        $template[-1] = 'A*';
        $data_start_pos = tell INFILE;
        last;    # Reached start of data, exit loop
    }
}
}
my $template = "@template";
my @lineData;
my @param_data;
my @param_time;
my $data_type;
foreach $current_param (@param_name) {
    @param_time = ();
    @param_data = ();
    seek( INFILE, $data_start_pos, 0 );    # Jump to data start
    while ( $line = <INFILE> ) {
        if ( $line =~ /$current_param/ ) {
            chomp($line);
            @lineData = unpack $template, $line;
            push @param_time, @lineData[0];
            push @param_data, @lineData[3];
        }
    }    # END WHILE <INFILE>
}    # END FOR EACH NAME
continue {
    $data_type = @lineData[2];
    print TOCOUT $current_param . "," . $data_type . "," . tell(BINOUT) . ",";    # Write name,type,offset to start time
    print BINOUT pack( 'd*', @param_time );    # Write TimeStamps
    print TOCOUT tell(BINOUT) . ",";           # offset to end of time/data start
    if ( $data_type eq "TXT" ) {
        print BINOUT pack 'A*', join( "\n", @param_data );
    }
    elsif ( $data_type eq "D" ) {
        print BINOUT pack( 'd*', @param_data );
    }
    elsif ( $data_type eq "UI" ) {
        print BINOUT pack( 'L*', @param_data );
    }
    print TOCOUT tell(BINOUT) . "," . "\n";    # Write memory loc to end data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);
So my questions to you good people of the web are as follows:
What am I obviously screwing up? Syntax, declaring variables when I don't need to, etc.
This is probably slow (guessing) because of the nested loops and searching line by line over and over again. Is there a better way to restructure the loops to extract multiple lines at once?
Any other speed improvement tips you can give?
Edit: I modified the example text file to illustrate non-integer time stamps and that Param Names may contain spaces.
First, you should always have the 'use strict;' and 'use warnings;' pragmas in your script.
It seems like you need a simple array (@param_name) for reference, so loading those values would be straightforward as you have it. (Again, adding the above pragmas would start showing you errors, including the $line = =~ s/^\s+//; line!)
I suggest you read this, to understand how you can load your data file into a
Hash of Hashes. Once you've designed the hash, you simply read and load the file data contents, and then iterate through the contents of the hash.
For example, using time as the key for the hash
%HoH = (
    1 => {
        name  => "Param1",
        ty    => "UI",
        value => "5",
    },
    2 => {
        name  => "Param3",
        ty    => "TXT",
        value => "Some Text 1",
    },
    3 => {
        name  => "Param1",
        ty    => "UI",
        value => "10",
    },
);
Make sure you close the INFILE after reading in the contents, before you start processing.
So in the end, you iterate over the hash, and reference the array (instead of the file contents) for your output writes - I would imagine it would be much faster to do this.
Let me know if you need more info.
Note: if you go this route, include Data::Dumper - a significant help for printing and understanding the data in your hash!
It seems to me that embedded spaces can only occur in the last field. That makes using split ' ' feasible for this problem.
I am assuming you are not interested in the header. In addition, I am assuming you want a vector for each parameter and are not interested in timestamps.
To use data file names specified on the command line or piped through standard input, replace <DATA> with <>.
#!/usr/bin/env perl
use strict; use warnings;

my %data;

$_ = <DATA> until /^-+/;    # skip header

while ( my $line = <DATA> ) {
    $line =~ s/\s+\z//;
    last unless $line =~ /\S/;
    my ( undef, $param, undef, $value ) = split ' ', $line, 4;
    push @{ $data{$param} }, $value;
}

use Data::Dumper;
print Dumper \%data;
__DATA__
Param1 filter = ALL_VALUES
Param2 filter = ALL_VALUES
Param3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1 Param1 UI 5
2 Param3 TXT Some Text 1
3 Param1 UI 10
4 Param2 D 2.1234
5 Param1 UI 15
6 Param2 D 3.1234
7 Param3 TXT Some Text 2
Output:
$VAR1 = {
    'Param2' => [
        '2.1234',
        '3.1234'
    ],
    'Param1' => [
        '5',
        '10',
        '15'
    ],
    'Param3' => [
        'Some Text 1',
        'Some Text 2'
    ]
};
First off, this piece of code causes the input file to be read once for every param, which is quite inefficient.
foreach $current_param (@param_name) {
    ...
    seek( INFILE, $data_start_pos, 0 );    # Jump to data start
    while ( $line = <INFILE> ) { ... }
    ...
}
Also, there is very rarely a reason to use a continue block. This is more a style / readability point than a real problem.
Now on to making it more performant.
I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.
This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp 'tempfile';
use File::Copy 'copy';
use autodie qw':default copy';
use 5.10.1;

my $input_filename = shift @ARGV;
open my $input, '<', $input_filename;

my @param_names;
my $template = '';    # stop uninitialized warning
my @field_names;
my $field_name_line;

while (<$input>) {
    chomp;
    next if /^\s*$/;
    if ( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ) {
        push @param_names, $param;
    }
    elsif (/^[\s-]+$/) {
        my @fields = split /(\s+)/;
        my $pos    = 0;
        for my $field (@fields) {
            my $length = length $field;
            if ( substr( $field, 0, 1 ) eq '-' ) {
                $template .= "\@${pos}A$length ";
            }
            $pos += $length;
        }
        last;
    }
    else {
        $field_name_line = $_;
    }
}

@field_names = unpack $template, $field_name_line;
for (@field_names) {
    s(^\s+){};
    $_ = lc $_;
    $_ = 'type' if substr( 'type', 0, length $_ ) eq $_;
}

my %temp_files;
for my $param (@param_names) {
    for (qw'time data') {
        my $fh = tempfile 'temp_XXXX', UNLINK => 1;
        binmode $fh, ':raw';
        $temp_files{$param}{$_} = $fh;
    }
}

my %convert = (
    TXT => sub { pack 'A*', join "\n", @_ },
    D   => sub { pack 'd*', @_ },
    UI  => sub { pack 'L*', @_ },
);

sub print_time {
    my ( $param, $time ) = @_;
    my $fh = $temp_files{$param}{time};
    print {$fh} $convert{D}->($time);
}

sub print_data {
    my ( $param, $format, $data ) = @_;
    my $fh = $temp_files{$param}{data};
    print {$fh} $convert{$format}->($data);
}

my %data_type;
while ( my $line = <$input> ) {
    next if $line =~ /^\s*$/;
    my %fields;
    @fields{@field_names} = unpack $template, $line;
    print_time( @fields{qw'name time'} );
    print_data( @fields{qw'name type value'} );
    $data_type{ $fields{name} } //= $fields{type};
}
close $input;

open my $bin, '>:raw', $input_filename . ".bin";
open my $toc, '>', $input_filename . ".toc";
for my $param (@param_names) {
    my $data_fh = $temp_files{$param}{data};
    my $time_fh = $temp_files{$param}{time};
    seek $data_fh, 0, 0;
    seek $time_fh, 0, 0;
    my @toc_line = ( $param, $data_type{$param}, 0 + sysseek( $bin, 0, 1 ) );
    copy( $time_fh, $bin, 8 * 1024 );
    close $time_fh;
    push @toc_line, sysseek( $bin, 0, 1 );
    copy( $data_fh, $bin, 8 * 1024 );
    close $data_fh;
    push @toc_line, sysseek( $bin, 0, 1 );
    say {$toc} join ',', @toc_line, '';
}
close $bin;
close $toc;
I modified my code to build a hash as suggested. I have not incorporated the binary output yet due to time limitations; plus, I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be too difficult ... hopefully.
On an actual data file (~350 MB & 2.0 million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on 1 of my cores (nil on the other 3) and Perl memory usage topped out at around 325 MB ... until it dumped millions of lines to the prompt. However, the print Dumper will be replaced with a binary pack.
Please let me know if I am making any rookie mistakes.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $lineArg1 = $ARGV[0];
open( INFILE, $lineArg1 );

my $line;
my @param_names;
my @template;

while ( $line = <INFILE> ) {
    chomp $line;    # Remove New Line
    if ( $line =~ s/\s+filter = ALL_VALUES// ) {    # Find parameters and build a list
        push @param_names, trim($line);
    }
    elsif ( $line =~ /^----/ ) {
        @template = map { 'A' . length } $line =~ /(\S+\s*)/g;    # Make template for unpack
        $template[-1] = 'A*';
        my $data_start_pos = tell INFILE;
        last;    # Reached start of data, exit loop
    }
}

my $size     = $#param_names + 1;
my @getType  = ( (1) x $size );
my $template = "@template";
my @lineData;
my %dataHash;
my $lineCount = 0;

while ( $line = <INFILE> ) {
    if ( $lineCount % 100000 == 0 ) {
        print "On Line: " . $lineCount . "\n";
    }
    if ( $line =~ /^\d/ ) {
        chomp($line);
        @lineData = unpack $template, $line;
        my ( $inHeader, $headerIndex ) = findStr( $lineData[1], @param_names );
        if ($inHeader) {
            push @{ $dataHash{ $lineData[1] }{time} }, $lineData[0];
            push @{ $dataHash{ $lineData[1] }{data} }, $lineData[3];
            if ( $getType[$headerIndex] ) {    # Things that only need written once
                $dataHash{ $lineData[1] }{type} = $lineData[2];
                $getType[$headerIndex] = 0;
            }
        }
    }
    $lineCount++;
}    # END WHILE <INFILE>
close(INFILE);
print Dumper \%dataHash;

# WRITE BINARY FILE and TOC FILE
my %convert = (
    TXT => sub { pack 'A*', join "\n", @_ },
    D   => sub { pack 'd*', @_ },
    UI  => sub { pack 'L*', @_ },
);
open my $binfile, '>:raw', $lineArg1 . '.bin';
open my $tocfile, '>',     $lineArg1 . '.toc';
for my $param (@param_names) {
    my $data = $dataHash{$param};
    my @toc_line = ( $param, $data->{type}, tell $binfile );
    print {$binfile} $convert{D}->( @{ $data->{time} } );
    push @toc_line, tell $binfile;
    print {$binfile} $convert{ $data->{type} }->( @{ $data->{data} } );
    push @toc_line, tell $binfile;
    print {$tocfile} join( ',', @toc_line, '' ), "\n";
}

sub trim {    # Trim leading and trailing white space
    my (@strings) = @_;
    foreach my $string (@strings) {
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
        chomp($string);
    }
    return wantarray ? @strings : $strings[0];
}    # END SUB

sub findStr {    # Return TRUE if string is contained in array.
    my $searchStr = shift;
    my $i         = 0;
    foreach (@_) {
        if ( $_ eq $searchStr ) {
            return ( 1, $i );
        }
        $i++;
    }
    return ( 0, -1 );
}    # END SUB
The output is as follows:
$VAR1 = {
    'Param 1' => {
        'time' => [
            '1.1',
            '3.2',
            '5.3'
        ],
        'type' => 'UI',
        'data' => [
            '5',
            '10',
            '15'
        ]
    },
    'Param 2' => {
        'time' => [
            '4.5',
            '6.121'
        ],
        'type' => 'D',
        'data' => [
            '2.1234',
            '3.1234'
        ]
    },
    'Param 3' => {
        'time' => [
            '2.23',
            '7.56'
        ],
        'type' => 'TXT',
        'data' => [
            'Some Text 1',
            'Some Text 2'
        ]
    }
};
Here is the output TOC File:
Param 1,UI,0,24,36,
Param 2,D,36,52,68,
Param 3,TXT,68,84,107,
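To sanity-check those offsets before moving to Matlab, the TOC can be read back in Perl. A quick sketch, assuming times are packed as doubles ('d') as in the code above (file names are placeholders):
open my $bin, '<:raw', 'data.txt.bin' or die $!;    # placeholder names
open my $toc, '<', 'data.txt.toc' or die $!;
while (<$toc>) {
    chomp;
    my ( $param, $type, $t_start, $d_start, $d_end ) = split /,/;
    seek $bin, $t_start, 0;
    read $bin, my $buf, $d_start - $t_start;
    my @times = unpack 'd*', $buf;
    print "$param ($type): ", scalar @times, " time stamps\n";
}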
Thanks, everyone, for your help so far! This is an excellent resource!
EDIT: Added Binary & TOC file writing code.

Perl - Generic subroutine which can aggregate the records and print

We receive a data file from our legacy system and we process it and load it in a database. The input file (say input.txt) can be bifurcated column-wise into two parts – the first being the data columns and the second being the numbers columns. The processing that we do on this file is to drop some of the data columns and aggregate the numbers for the columns remaining (so that each record is a unique record).
The tab-delimited input file input.txt is shown below (column0 to column4 are the data columns and column5 to column7 are the numbers columns):
a b c h n 1.99 2.99 9
a b c k q 100 100 10
a b c m s 9.99 8.99 11
a b d i o 0.01 0.01 12
a b d j p -12.19 11.11 13
a b e l r 9 9 14
The tab delimited output file output.txt is shown below:
a b c 111.98 111.98
a b d -12.18 11.12
a b e 9 9
The following Perl script aggregates the numbers, keeping column0, column1 and column2. The script is working fine.
use strict;
my $INPUT_FILE  = shift @ARGV || die "You must supply the input as the first argument!!!\n";
my $OUTPUT_FILE = shift @ARGV || die "You must supply the output file as the second argument!!!\n";

open( my $out, ">", $OUTPUT_FILE ) or die "Cannot open $OUTPUT_FILE for writing!\n";
open( my $in,  "<", $INPUT_FILE )  or die "Cannot open $INPUT_FILE for processing!\n";

my $data;
while (<$in>)
{
    s/\r?\n$//;
    my @cols = split(/\t/);
    $data->{ $cols[0] }->{ $cols[1] }->{ $cols[2] }->[0] += $cols[5];
    $data->{ $cols[0] }->{ $cols[1] }->{ $cols[2] }->[1] += $cols[6];
}
close $in;

foreach my $lev1 ( sort keys %{$data} )
{
    foreach my $lev2 ( sort keys %{ $data->{$lev1} } )
    {
        foreach my $lev3 ( sort keys %{ $data->{$lev1}->{$lev2} } )
        {
            my $dataVal  = $data->{$lev1}->{$lev2}->{$lev3}->[0];
            my $dataVal2 = $data->{$lev1}->{$lev2}->{$lev3}->[1];
            print $out "$lev1\t$lev2\t$lev3\t$dataVal\t$dataVal2\n";
        }
    }
}
close $out;
Question: We apply the same logic in many different Perl scripts. I want to create a generic subroutine which can be sourced into all those different scripts using a "require" statement. The subroutine should aggregate and print the output. It should accept arguments specifying which columns I need for aggregation (currently column0 to column2) and which columns the numbers should be aggregated from (currently column5 and column6). Please advise.
One way to approach the problem is to begin by consolidating all of your parameters. Rather than scattering constants like 0, 5, 6, and "\t" throughout your program, bundle them up.
my %opt = (
    input_file  => 'input.dat',
    output_file => 'output.dat',
    keep_cols   => [0,1,2],
    agg_cols    => [5,6],
    join_char   => "\t",
);
Then you might think about how you would make your current script more modular -- something along these lines:
use strict;
use warnings;    # Don't forget this.

run(@ARGV);

sub run {
    my %opt = get_args(@_);
    $opt{data} = read_input_file(%opt);
    write_output_file(%opt);
}

sub get_args {
}

sub read_input_file {
}

sub write_output_file {
}
Finally, I would suggest that you flatten your data structure. Rather than using a multi-level hash, which can be a bit awkward to type and read, simply join your various hash keys into a composite string, using any safe delimiter. Inside read_input_file(), you might have some code like this:
my @cols = split $opt{join_char}, $line;
my $i = 0;
my $k = join $opt{join_char}, @cols[ @{ $opt{keep_cols} } ];
$data{$k}[ $i++ ] += $_ for @cols[ @{ $opt{agg_cols} } ];
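Pulled together, the generic routine the question asks for might look like this (a sketch under the %opt above; the subroutine name is a placeholder). Any script can then require the file that defines it and call it with its own %opt:
sub aggregate_and_print {    # placeholder name
    my %opt = @_;
    open my $in,  '<', $opt{input_file}  or die "Cannot open $opt{input_file}: $!\n";
    open my $out, '>', $opt{output_file} or die "Cannot open $opt{output_file}: $!\n";
    my %data;
    while ( my $line = <$in> ) {
        $line =~ s/\r?\n$//;
        my @cols = split $opt{join_char}, $line;
        my $i = 0;
        my $k = join $opt{join_char}, @cols[ @{ $opt{keep_cols} } ];
        $data{$k}[ $i++ ] += $_ for @cols[ @{ $opt{agg_cols} } ];
    }
    close $in;

    # the composite key already holds the kept columns, joined
    print {$out} join( $opt{join_char}, $_, @{ $data{$_} } ), "\n"
        for sort keys %data;
    close $out;
}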
My attempt at it using DBD::CSV. I wrapped it in a Moose class, as that is what I wanted to try.
package MyDataParser;
use Moose;
use MooseX::Types::Path::Class;
use DBI;

has _dbd => ( is => 'ro', isa => 'Object', lazy_build => 1 );
has data_file => ( is => 'rw', isa => 'Path::Class::File', required => 1, coerce => 1 );
has label_columns => (
    traits   => ['Array'],
    is       => 'rw',
    isa      => 'ArrayRef[Int]',
    required => 1,
    handles  => {
        list_label_columns => 'elements',
        add_label_column   => 'push',
    }
);
has data_columns => (
    traits   => ['Array'],
    is       => 'rw',
    isa      => 'ArrayRef[Int]',
    required => 1,
    handles  => {
        list_data_columns => 'elements',
        add_data_column   => 'push',
    }
);
has _sql_query => ( is => 'rw', isa => 'Str', lazy_build => 1 );

sub get_totals {
    my $self = shift;
    my $ar = $self->_dbd->selectall_arrayref( $self->_sql_query );
    die $DBI::errstr if $DBI::err;
    foreach my $row (@$ar) {
        print "@$row\n";
    }
}

sub _build__dbd {
    my $self = shift;
    my $dbh = DBI->connect("dbi:CSV:");
    $dbh->{csv_tables}{data} = {
        sep_char  => "\t",
        file      => $self->data_file,
        col_names => [ 'column1' .. 'column8' ],
    };
    return $dbh;
}

sub _build__sql_query {
    my $self = shift;
    my @label_column_names = map { 'column' . $_ } $self->list_label_columns;
    my @data_columns       = map { "SUM(column$_)" } $self->list_data_columns;
    my $labels_str         = join ', ', @label_column_names;
    my $data_columns_str   = join ', ', @data_columns;
    my $query = qq/SELECT $labels_str, $data_columns_str FROM data GROUP BY $labels_str/;
    return $query;
}

package main;
use strict;
use warnings;

my $df = MyDataParser->new(
    data_file     => 'data.txt',
    label_columns => [ 1, 2, 3 ],
    data_columns  => [ 6, 7, 8 ]
);
$df->get_totals;
You're right, your current solution can be generalized. The first issue is to identify the hard-coded pieces of your program that may well be required to vary in future projects.
Only you know for sure what you want to generalize, but FM's hash of options offers you a very good guess. Let me focus on two of these options,
key_cols => [0,1,2],
agg_cols => [5,6],
where I've changed keep_cols to key_cols, since we're going to use them as keys in our data hash.
Think of your current statements
# version 1, key cols and agg cols hardcoded
$data->{$cols[0]}->{$cols[1]}->{$cols[2]}->[0] += $cols[5];
$data->{$cols[0]}->{$cols[1]}->{$cols[2]}->[1] += $cols[6];
as loops over the arrays referenced by these two options. Looping over the agg_cols is the easy part:
# version 2, generic agg cols, but key cols still hardcoded
my @agg_cols = @{ $opt{agg_cols} };
for my $i ( 0 .. $#agg_cols ) {
    $data->{$cols[0]}->{$cols[1]}->{$cols[2]}->[$i] += $cols[ $agg_cols[$i] ];
}
Now to loop over the key_cols, just make a temporary copy of your $data ref and index it more deeply on each pass. One caveat: assigning $current_ref->{...} to a plain copy breaks autovivification, so each level has to be created explicitly (||=) as you descend:
# version 3, generic agg cols and key cols
my @agg_cols = @{ $opt{agg_cols} };
my @key_cols = @{ $opt{key_cols} };
my $current_ref = $data ||= {};
for my $key_col ( @key_cols[ 0 .. $#key_cols - 1 ] ) {
    $current_ref = $current_ref->{ $cols[$key_col] } ||= {};
}
# the last key level holds the array of running sums
$current_ref = $current_ref->{ $cols[ $key_cols[-1] ] } ||= [];
for my $i ( 0 .. $#agg_cols ) {
    $current_ref->[$i] += $cols[ $agg_cols[$i] ];
}
This code belongs inside your while <$in> loop, except that you will want to refactor by reading your agg_cols and key_cols options just once at the top.
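Hoisted that way, the surrounding read loop might look like (a sketch):
my @key_cols = @{ $opt{key_cols} };    # read the options once, at the top
my @agg_cols = @{ $opt{agg_cols} };
my $data = {};
while (<$in>)
{
    s/\r?\n$//;
    my @cols = split(/\t/);
    # ... version 3 body from above goes here ...
}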

How can I convert these strings to a hash in Perl?

I wish to convert a single string with multiple delimiters into a key=>value hash structure. Is there a simple way to accomplish this? My current implementation is:
sub readConfigFile() {
    my %CONFIG;
    my $index = 0;
    open( CON_FILE, "config" );
    my @lines = <CON_FILE>;
    close(CON_FILE);
    my @array = split( />/, $lines[0] );
    my $total = @array;
    while ( $index < $total ) {
        my @arr = split( /=/, $array[$index] );
        chomp( $arr[1] );
        $CONFIG{ $arr[0] } = $arr[1];
        $index = $index + 1;
    }
    while ( ( $k, $v ) = each %CONFIG ) {
        print "$k => $v\n";
    }
    return;
}
where 'config' contains:
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
The last digits on each line also need to be removed and kept in a separate key=>value pair whose key can be 'ip'. (I have not been able to accomplish this without making the code too lengthy and complicated.)
What is your configuration data structure supposed to look like? So far the solutions only record the last line because they are stomping on the same hash keys every time they add a record.
Here's something that might get you closer, but you still need to figure out what the data structure should be.
I pass in the file handle as an argument so my subroutine isn't tied to a particular way of getting the data. It can be from a file, a string, a socket, or even the stuff below DATA in this case.
Instead of fixing things up after I parse the string, I fix the string to have the "ip" element before I parse it. Once I do that, the "ip" element isn't a special case and it's just a matter of a double split. This is a very important technique to save a lot of work and code.
I create a hash reference inside the subroutine and return that hash reference when I'm done. I don't need a global variable. :)
use warnings;
use strict;
use Data::Dumper;

sub readConfigFile
{
    my ($fh) = shift;
    my $hash = {};
    while (<$fh>)
    {
        chomp;
        s/\s+(\d*\.\d+)$/>ip=$1/;
        $hash->{$.} = { map { split /=/ } split />/ };
    }
    return $hash;
}

my $hash = readConfigFile( \*DATA );
print Dumper($hash);

__DATA__
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
This gives you a data structure where each line is a separate record. I chose the line number of the record ($.) as the top-level key, but you can use anything that you like.
$VAR1 = {
    '1' => {
        'ip'   => '6.00',
        'rec'  => '0',
        'adv'  => '1234 123 4.5',
        'pub'  => '3',
        'size' => '3'
    },
    '2' => {
        'ip'   => '.76',
        'rec'  => '1',
        'adv'  => '111 22 3456',
        'pub'  => '1',
        'size' => '2'
    }
};
If that's not the structure you want, show us what you'd like to end up with and we can adjust our answers.
I am assuming that you want to read and parse more than 1 line, so I chose to store the values in an AoH (array of hashes).
#!/usr/bin/perl
use strict;
use warnings;

my @config;
while (<DATA>) {
    chomp;
    push @config, { split /[=>]/ };
}

for my $href (@config) {
    while ( my ( $k, $v ) = each %$href ) {
        print "$k => $v\n";
    }
}

__DATA__
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
This results in the printout below. (The while loop above reads from DATA.)
rec => 0
adv => 1234 123 4.5 6.00
pub => 3
size => 3
rec => 1
adv => 111 22 3456 .76
pub => 1
size => 2
Chris
The below assumes the delimiter is guaranteed to be a >, and there is no chance of that appearing in the data.
I simply split each line based on '>'. The last value will contain a key=value pair, then a space, then the IP, so split it once, at the final space, and you get the k=v and the IP. Save the IP to the hash and keep the k=v pair in the array, then go through the array and split each k=v on '='.
Fill in the hashref and push it to your higher-scoped array. This will then contain your hashrefs when finished.
(Having loaded the config into an array)
my @hashes;
for my $line (@config) {
    my $hash;    # config line will end up here
    my @pairs = split />/, $line;

    # Do the ip first. Split the last element of @pairs on its final space and
    # put the second half into the hash, overwriting the element with the first
    # half at the same time.
    # This means we don't have to do anything special with the for loop below.
    ( $pairs[-1], $hash->{ip} ) = split / (?=\S+$)/, $pairs[-1];

    for (@pairs) {
        my ( $k, $v ) = split /=/;
        $hash->{$k} = $v;
    }
    push @hashes, $hash;
}
The config file format is sub-optimal, shall we say. That is, there are easier formats to parse and understand. [Added: but the format is already defined by another program. Perl is flexible enough to deal with that.]
Your code slurps the file when there is no real need.
Your code only pays attention to the last line of data in the file (as Chris Charley noted while I was typing this up).
You also have not allowed for comment lines or blank lines - both are a good idea in any config file and they are easy to support. [Added: again, with the pre-defined format, this is barely relevant, but when you design your own files, do remember it.]
Here's an adaptation of your function into somewhat more idiomatic Perl.
#!/bin/perl -w
use strict;
use constant debug => 0;

sub readConfigFile()
{
    my %CONFIG;
    open( CON_FILE, "config" ) or die "failed to open file ($!)\n";
    while ( my $line = <CON_FILE> )
    {
        chomp $line;
        $line =~ s/#.*//;            # Remove comments
        next if $line =~ /^\s*$/;    # Ignore blank lines
        foreach my $field ( split( />/, $line ) )
        {
            my @arr = split( /=/, $field );
            $CONFIG{ $arr[0] } = $arr[1];
            print ":: $arr[0] => $arr[1]\n" if debug;
        }
    }
    close(CON_FILE);
    while ( my ( $k, $v ) = each %CONFIG )
    {
        print "$k => $v\n";
    }
    return %CONFIG;
}

readConfigFile;    # Ignores returned hash
Now, you need to explain more clearly what the structure of the last field is, and why you have an 'ip' field without the key=value notation. Consistency makes life easier for everybody. You also need to think about how multiple lines are supposed to be handled. And I'd explore using a more orthodox notation, such as:
pub=3;rec=0;size=3;adv=(1234,123,4.5);ip=6.00
Colon or semi-colon as delimiters are fairly conventional; parentheses around comma separated items in a list are not an outrageous convention. Consistency is paramount. Emerson said "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines", but consistency in Computer Science is a great benefit to everyone.
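With a format like that, the parse would reduce to two splits; a sketch, assuming $line holds one line in the semi-colon variant shown above:
my %config;
foreach my $field ( split( /;/, $line ) )
{
    my ( $key, $value ) = split( /=/, $field, 2 );
    $value =~ s/^\((.*)\)$/$1/;    # strip parentheses from list values
    $config{$key} = $value;
}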
Here's one way.
foreach (@lines) {
    chomp;
    my %CONFIG;

    # Extract the last digits first and replace them with an end-of-pair delimiter.
    s/\s*([\d\.]+)\s*$/>/;
    $CONFIG{ip} = $1;

    while (/([^=]*)=([^>]*)>/g) {
        $CONFIG{$1} = $2;
    }
    print Dumper( \%CONFIG );
}