I have a large .csv file (2 - 3 million records). I need to concatenate the first three fields (with underscores) and prepend that to each record, then I need to sort the file based on that new field and three other fields. I am able to do that (I'm testing it with a 4-record file for now) - but I'm not sure how to write it back to the file in the same .csv form - instead of the way Data::Dumper formats each line as a separate variable. Here is the code I have so far - I have a couple of print (to screen) lines to see what it's doing -
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $filename = '/testpath/test.csv';
#$filename = 'test.csv';
open my $FH, $filename
    or die "Could not read from $filename <$!>, program halting.";
# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
#print "Field Names:\n", Dumper(@fields), $/;
print Dumper(@fields), $/;
my @data;
# Read the lines one by one.
while ($line = <$FH>) {
    # split the fields, concatenate the first three fields,
    # and add it to the beginning of each line in the file
    chomp($line);
    my @fields = split(/,/, $line);
    unshift @fields, join '_', @fields[0..2];
    push @data, \@fields;
}
close $FH;
print "Unsorted:\n", Dumper(@data); #, $/;
@data = sort {
    $a->[0] cmp $b->[0] ||
    $a->[20] cmp $b->[20] ||
    $a->[23] cmp $b->[23] ||
    $a->[26] cmp $b->[26]
} @data;
open my $OFH, '>', '/testpath/parsedTest.csv';
print $OFH Dumper(@data);
close $OFH;
exit;
I'm assuming it's in the "print $OFH Dumper(@data);" line that I need to re-format it back to its original form.
And please be kind as I am a novice.
__________EDIT__________________________________
Here are the four lines from the test .csv file - first line is the header record:
STORE_NBR,CONTROL_NBR,LINE_NBR,SALES_NBR,QTY_MISTINT,REASON_CODE,MISTINT_COMM,SZ_CDE,TINTER_MODEL,TINTER_SERL_NBR,SPECTRO_MODEL,SPECTRO_SERL_NBR,EMP_NBR,TRAN_DATE,TRAN_TIME,CDS_ADL_FLD,PROD_NBR,PALETTE,COLOR_ID,INIT_TRAN_DATE,GALLONS_MISTINTED,UPDATE_EMP_NBR,UPDATE_TRAN_DATE,GALLONS,FORM_SOURCE,UPDATE_TRAN_TIME,SOURCE_IND,CANCEL_DATE,COLOR_TYPE,CANCEL_EMP_NBR,NEED_EXTRACTED,MISTINT_MQ_XTR,DATA_SOURCE,GUID,QUEUE_NAME,BROKER_NAME,MESSAGE_ID,PUT_TIME,CREATED_TS
1334,53927,1,100551589,1,6,Bad Shercolor Match,16,IFC 8112NP,01DX8005513,,,77,10/23/2015,95816,,OV0020001,,MANUAL,10/21/2015,1,0,,1,MAN,,CUST,,CUSTOM MATCH,0,TRUE,TRUE,O,5394A0E67FFF4D01A0D9AD16FA29ABB1,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F533133333464EB2956052C0020,10/23/2015 10:45,10/23/2015 10:45
2525,67087,1,650462328,1,4,Tinted Wrong Product,14,IFC 8012NP,Standalone-5,,,11,10/23/2015,104314,,A91W00353,,,10/20/2015,0.25,0,,0.25,,,COMP,,CUSTOM MATCH,0,TRUE,TRUE,O,1AC5D8742D47435EA05343D57372AD32,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F533235323531C2295605350020,10/23/2015 10:46,10/23/2015 10:47
1350,163689,1,650462302,1,3,Tinted Wrong Color,14,IFC 8012NP,06DX8006805,,,1,10/23/2015,104907,,A91W00351,COLOR,6233,10/23/2015,0.25,0,,0.5,ENG,,SW,,PALETTE,0,TRUE,TRUE,O,F1A072BCC548412FA22052698B5B0C28,POS.MISTINT.V0000.UP.Q,PROD_SMISC_BK,414D512050524F445F504F53313335307BC12956053C0020,10/23/2015 10:52,10/23/2015 10:52
Hope that's not too convoluted to read.
Data::Dumper outputs a format that is valid Perl and is good for debugging, but not for writing a CSV file.
You could write the CSV by hand:
foreach my $row (@data) {
    print $OFH join(',', @$row), "\n";
}
but you really should use a specialized module, in this case Text::CSV, both for reading and writing the CSV – it will handle all the border cases (such as fields with embedded commas).
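To see how it handles an embedded comma, here is a minimal in-memory sketch (assuming Text::CSV is installed; the sample line is made up for illustration):

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 })
    or die "Cannot use CSV: " . Text::CSV->error_diag;

# Round-trip one record in memory: parse the raw line into fields,
# prepend the combined key, and rebuild a valid CSV string.
# Fields containing the separator are quoted automatically.
$csv->parse('1334,53927,1,"Bad, match"') or die $csv->error_diag;
my @fields = $csv->fields;
unshift @fields, join '_', @fields[0..2];
$csv->combine(@fields) or die $csv->error_diag;
print $csv->string, "\n";   # 1334_53927_1,1334,53927,1,"Bad, match"
```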
The synopsis contains a good example of both reading and writing; I won't repeat that here.
You don't have to rebuild the line if you just store it in @data too!
my @data;
while (my $line = <$FH>) {
    chomp($line);
    my @fields = split(/,/, $line);
    push @data, [ "$line\n", join('_', @fields[0..2]), @fields[19, 22, 25] ];
}
@data = sort {
    $a->[1] cmp $b->[1] ||
    $a->[2] cmp $b->[2] ||
    $a->[3] cmp $b->[3] ||
    $a->[4] cmp $b->[4]
} @data;
print($OFH $_->[0]) for @data;
If your input didn't contain NULs, you could even use the following faster approach:
print $OFH
    map { /[^\0]*\z/g }
    sort
    map {
        chomp;
        my @fields = split /,/;
        join("\0", join('_', @fields[0..2]), @fields[19, 22, 25], "$_\n")
    }
    <$FH>;
But yeah, you should probably use a legit CSV parser.
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
my @data;
while (my $row = $csv->getline($FH)) {
    push @data, [ join('_', @$row[0..2]), $row ];
}
@data = sort {
    $a->[0] cmp $b->[0] ||
    $a->[1][19] cmp $b->[1][19] ||
    $a->[1][22] cmp $b->[1][22] ||
    $a->[1][25] cmp $b->[1][25]
} @data;
$csv->say($OFH, $_->[1]) for @data;
The following is the fast approach using a CSV parser:
use Text::CSV_XS qw( );
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 2 });
print $OFH
    map { /[^\0]*\z/g }
    sort
    map {
        $csv->parse($_);
        my @fields = $csv->fields();
        join("\0", join('_', @fields[0..2]), @fields[19, 22, 25], $_)
    }
    <$FH>;
I was unable to use Text::CSV_XS because it was not available on our server, unfortunately - but I did find that adding this single "print" line worked -
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedTest.csv';
print $OFH join(',', @$_), $/ for @data;
close $OFH;
Tested out fine with the small file, now to test on the actual file!
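One caveat with the plain join(',', ...) approach: it silently corrupts any field that itself contains a comma, quote, or newline. If Text::CSV_XS really isn't installable, a minimal RFC-4180-style quoting helper (an illustrative sketch, not a full CSV writer) avoids that:

```perl
use strict;
use warnings;

# Quote a field only when it contains a comma, quote, or newline,
# doubling any embedded quotes (RFC 4180 style).
sub csv_field {
    my ($f) = @_;
    return $f unless $f =~ /[",\n]/;
    $f =~ s/"/""/g;
    return qq{"$f"};
}

my @row = ('1334', 'Bad match, 16oz', 'OV0020001');
print join(',', map { csv_field($_) } @row), "\n";
# 1334,"Bad match, 16oz",OV0020001
```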
I have data for each student, e.g.:
Student Name Score
Jack 89
Jill 70
Sandy 40
Now I'm trying to plot these in a bar chart using GD::Graph::bars, but since I'm pretty new to Perl and modules, I see that I can manually declare all the X and Y values for the chart to be plotted.
But since I don't know the names and scores of each student (pulled from a text file),
I want to be able to fill in the values automatically.
I was thinking hash keys and values were a good approach, so I placed everything in a hash table: $hash{$student_name} = $score.
Can anyone help me plot this as a bar chart or guide me? Or would you recommend a different approach?
Thanks
Update
This is the part where I can plot the graph manually by entering the student names.
my $graph = GD::Graph::bars->new(800, 800);
@data = (
    ["Jack","Jill"],
    ['30','50'],
);
$graph->set(
    x_label => 'Students',
    y_label => 'Scores',
    title   => 'Student Vs. Scores',
    y_max_value   => 60,
    y_tick_number => 8,
    y_label_skip  => 2
) or die $graph->error;
my $gd = $graph->plot(\@data) or die $graph->error;
open(IMG, '>file.png') or die $!;
binmode IMG;
print IMG $gd->png;
Assuming your data file is as follows, using tab delimiters.
Student Name Score
Jack 89
Jill 70
Sandy 40
You could do something like this, pushing your x axis and y axis values from your data file to arrays.
use strict;
use warnings;
use CGI qw( :standard );
use GD::Graph::bars;
open my $fh, '<', 'data.txt' or die $!;
my (@x, @y);
while (<$fh>) {
    next if $. == 1;            # skip header line
    push @x, (split /\t/)[0];   # push 'Student Names' into @x array
    push @y, (split /\t/)[1];   # push 'Score' into @y array
}
close $fh;
my $graph = GD::Graph::bars->new(800, 800);
$graph->set(
    x_label => 'Students',
    y_label => 'Scores',
    title   => 'Student Vs. Scores',
) or warn $graph->error;
my @data = (\@x, \@y);
$graph->plot(\@data) or die $graph->error();
print header(-type=>'image/jpeg'), $graph->gd->jpeg;
Giving you for example:
If you want to use multiple y-axis values, assuming you have another tab-delimited column, for example Score2, you could easily do something like this.
my (@x, @y, @y2);
while (<$fh>) {
    next if $. == 1;
    push @x, (split /\t/)[0];
    push @y, (split /\t/)[1];
    push @y2, (split /\t/)[2];
}
And change your @data array to:
my @data = (\@x, \@y, \@y2);
And your result would be:
According to the documentation, you need to pass an array of arrays to the plot method of GD::Graph::bars. It sounds like you already have a hash so you need to convert it to an array of arrays. There are a number of ways to do this, but here's an example:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %hash = (
    Larry => 15,
    Curly => 16,
    Moe   => 20
);
my (@names, @scores);
while (my ($name, $score) = each %hash) {
    push @names, $name;
    push @scores, $score;
}
my @data = (\@names, \@scores);
print Dumper(\@data);
# $VAR1 = [
# [
# 'Moe',
# 'Curly',
# 'Larry'
# ],
# [
# 20,
# 16,
# 15
# ]
# ];
However you do it, make sure you preserve the order in the inner arrays.
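For instance (a small sketch): since each() returns pairs in an unpredictable order, building both arrays from a single sorted key list keeps names and scores aligned and reproducible:

```perl
use strict;
use warnings;

my %hash = (Larry => 15, Curly => 16, Moe => 20);

# Sort the keys once and derive both parallel arrays from that one
# ordering, so index $i always refers to the same student in both.
my @names  = sort keys %hash;
my @scores = map { $hash{$_} } @names;

print "@names\n";    # Curly Larry Moe
print "@scores\n";   # 16 15 20
```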
I adapted the code from the samples directory in GD::Graph:
use warnings;
use strict;
use GD::Graph::bars;
use GD::Graph::Data;
my %students = (
    Jack  => 89,
    Jill  => 70,
    Sandy => 40,
);
my @scores;
my @names;
for (keys %students) {
    push @names, $_;
    push @scores, $students{$_};
}
my $data = GD::Graph::Data->new([
    [@names],
    [@scores],
]) or die GD::Graph::Data->error;
my $my_graph = GD::Graph::bars->new();
$my_graph->set(
    x_label => 'Name',
    y_label => 'Score',
    title   => 'A Simple Bar Chart',
) or warn $my_graph->error;
$my_graph->plot($data) or die $my_graph->error();
save_chart($my_graph, 'graph');

sub save_chart {
    my $chart = shift or die "Need a chart!";
    my $name  = shift or die "Need a name!";
    local(*OUT);
    my $ext = $chart->export_format;
    open(OUT, ">$name.$ext") or
        die "Cannot open $name.$ext for write: $!";
    binmode OUT;
    print OUT $chart->gd->$ext();
    close OUT;
}
data.txt
Name:xyz
ID:1
Value: 1 2 3 4 5 6 7 8 9 ...
ID:2
Value: 9 8 7 6 5 4 3 2 1..
ID:3
Value: 90 89 88....
Name:abc
ID:11
value:...
Initial file.txt
## Header
..
data
data
data
..
Final expected file.txt
## Header xyz_1,xyz_2,xyz_3,abc_11,...
..
data 1 9 90
data 2 8 89
data 3 7 88
data 4 6
..
Current output file.txt
## Header xyz_1,xyz_2,xyz_3,abc_11,...
...
data, 1 2 3 4 5 6 7 8 9 ..,9 8 7 6 5 4 3 2 1 ..,90 89 88
data
data
...
Code
#!/usr/local/bin/perl
use diagnostics;
use strict;
use warnings;
use Tie::File;
my @name_id;
my %test;
#local $/ = '';
open my $fh, '<', 'data.txt' or die "failed: $!";
my %var;
while (<$fh>) {
    chomp;
    if (m/^([A-Z:]+):\s*(.*)/) {
        $var{$1} = $2;
        if (exists($var{Name}) && exists($var{ID}) && exists($var{value}) && $1 eq 'value') {
            my $var_name = "$var{Name}_$var{ID}";
            push @name_id, $var_name;
            $test{$var_name} = $var{value};
        }
    }
}
# print join "\n\t", @test{@name_id};
my $match = "## Header";
tie my @lines, 'Tie::File', 'file.txt' or die "failed : $!";
for my $line (@lines) {
    if ($line =~ /^($match.*)/) {
        $line = $1 . "," . join ',', @name_id;
    }
}
untie @lines;
my $match = "data";
tie my @lines, 'Tie::File', 'file.txt' or die "failed : $!";
my $i = 0;
for my $line (@lines) {
    if ($line =~ /^($match.*)/) {
        $line = $1 . "," . join(',', map { $test{$_}->[$i] } @name_id);
        $i++;
    }
}
untie @lines;
I have a problem with this line: $line = $1 . "," . join(',', map { $test{$_}->[$i] } @name_id); it throws the error
Can't use string ("1 2 3 4 5 6 7 8 9 .."...) as an ARRAY ref while "strict refs" in use at test.pl line 46, line 80. at test.pl line 46
I think the hash (%test) value I had is a string and I can't index it as an array. Please let me know how to convert it to an array. I tried doing $test{$var_name} = [qw($var{value})]; it didn't work.
You may be interested in this refactoring of your code that seems to do what you want.
#!/usr/local/bin/perl
use strict;
use warnings;
use Tie::File;
open my $fh, '<', 'data.txt' or die "failed: $!";
my @name_id;
my %test;
my %var;
while (<$fh>) {
    chomp;
    if (my ($key, $val) = /^(\w+):\s*(.*)/) {
        $var{$key} = $val;
        if ($key eq 'value') {
            my $var_name = "$var{Name}_$var{ID}";
            push @name_id, $var_name;
            $test{$var_name} = [ split ' ', $var{value} ];
        }
    }
}
tie my @lines, 'Tie::File', 'file.txt' or die "failed : $!";
my $count = 0;
for my $line (@lines) {
    if ($line =~ /^## Header/) {
        $line .= ' ' . join ',', @name_id;
    }
    elsif ($line =~ /^data/) {
        $line .= ' ' . join ' ', map { $test{$_}[$count] // '' } @name_id;
        $count++;
    }
}
untie @lines;
output (file.txt)
## Header xyz_1,xyz_2,xyz_3
data 1 9 90
data 2 8 89
data 3 7 88
data 4 6
This is surely not right:
$test{$_}->[$i]
Because $test{$_} can only contain a string of some sort.
If you have a string and want to split into an arrayref so the above works, do this:
$test{$var_name} = [split /\s+/, $var{value}];
I have no idea what the code is supposed to accomplish, which means that it may run, but I can't tell if it does what it is meant to. (The odd variable names, like $test and $var_name, didn't help me to understand the purpose.)
I'm not too sure I followed your code, but I thought I'd post how to transpose the numbers (unless your code already does that :-) ).
#!/usr/bin/perl
use strict;
use warnings;
my (%data, $name);
while (<DATA>) {
    if (/^Name:(.+)/) {
        $name = $1;
    }
    elsif (/^Value/) {
        # transpose
        my $r = 0;
        push @{ $data{$name}[$r++] }, $_ for /\d+/g;
    }
}
use Data::Dumper; print Dumper \%data;
__DATA__
Name:xyz
ID:1
Value: 1 2 3 4 5 6 7 8 9
ID:2
Value: 9 8 7 6 5 4 3 2 1
ID:3
Value: 90 89 88 87 86 85 84 83 82
Name:abc
ID:11
The dumped results are:
$VAR1 = {
'xyz' => [
[
'1',
'9',
'90'
],
[
'2',
'8',
'89'
],
[
'3',
'7',
'88'
],
[
'4',
'6',
'87'
],
[
'5',
'5',
'86'
],
[
'6',
'4',
'85'
],
[
'7',
'3',
'84'
],
[
'8',
'2',
'83'
],
[
'9',
'1',
'82'
]
]
};
I'm new to Perl. This is a sample CSV entry similar to mine.
I would like to parse this; I briefly tried Text::CSV, but no luck. The issue here is newlines and
commas inside fields. How could I parse this file in Perl? Thanks for the help.
1,A,"Length of x, where x is y"
2,B,"Set A to “10”, an invalid state"
3,C,"Solve
A+B and
B+A
"
4,D, Set C to B
This code (taken directly from Text::CSV documentation):
#!/usr/bin/perl
use strict;
use Text::CSV;
use Data::Dumper;
my $rows;
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
    or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<", "test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
    push @{$rows}, $row;
}
$csv->eof or $csv->error_diag();
close $fh;
# This gets rid of spaces at start and end of string
# as well as newlines within the fields.
for ( 0 .. scalar @{$rows}-1 ) {
    $rows->[$_][2] =~ s/^\s*//;
    $rows->[$_][2] =~ s/\n/ /gms;
}
print Dumper($rows);
Produces the following output:
$VAR1 = [
[
'1',
'A',
'Length of x, where x is y'
],
[
'2',
'B',
'Set A to “10”, an invalid state'
],
[
'3',
'C',
'Solve A+B and B+A '
],
[
'4',
'D',
'Set C to B'
]
];
Which (I'm guessing) is what you want to achieve.
Thanks everyone who commented, I figured it out. The thing I didn't do was
{ binary => 1, eol => $/ }
Here is the working code:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;
use Text::CSV;
open(my $Fh, "<", 'datalane_csr.csv');
my $csv = Text::CSV->new ({ binary => 1, eol => $/ });
while (my $row = $csv->getline ($Fh)) {
    say @$row[2];
}
close($Fh);
Thanks once again. And sorry for the post.
But I have a small issue: the '"' is shown as weird characters when I print it.
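Those "weird characters" are most likely the curly quotes (“ ”) in the sample data: they are multi-byte UTF-8 sequences, and printing them without an encoding layer produces mojibake. A minimal sketch (assuming the file is UTF-8-encoded):

```perl
use strict;
use warnings;
use utf8;                              # this source file contains UTF-8 literals
use open ':std', ':encoding(UTF-8)';   # encode program output as UTF-8

# With the encoding layers in place, the curly quotes decode to
# single characters instead of showing up as raw bytes.
my $text = 'Set A to “10”, an invalid state';
print length($text), "\n";   # 31 characters (it is 35 bytes in UTF-8)
print $text, "\n";
```

Opening the data file as open my $Fh, '<:encoding(UTF-8)', 'datalane_csr.csv' applies the same fix to the getline loop above.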
I am mostly a Matlab user and a Perl n00b. This is my first Perl script.
I have a large fixed-width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the data parameters are sorted by time, which makes it difficult (at least for me) to parse into Matlab. Seeing how Matlab is not that good at parsing text, I thought I would try Perl. I wrote the following code, which works ... at least on my small test file. However, it is painfully slow when I tried it on an actual large data file. It was pieced together with lots of examples for various tasks from the web / Perl documentation.
Here is a small sample of the data file. Note: Real file has about 2000 parameter and is 1-2GB. Parameters can be text, doubles, or unsigned integers.
Param 1 filter = ALL_VALUES
Param 2 filter = ALL_VALUES
Param 3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1.1 Param 1 UI 5
2.23 Param 3 TXT Some Text 1
3.2 Param 1 UI 10
4.5 Param 2 D 2.1234
5.3 Param 1 UI 15
6.121 Param 2 D 3.1234
7.56 Param 3 TXT Some Text 2
The basic logic of my script is to:
Read until the ---- line to build a list of parameters to extract (those lines always contain "filter =").
Use the ---- line to determine field widths. It is broken by spaces.
For each parameter, build time and data arrays (while loop nested inside of foreach param).
In the continue block, write time and data to a binary file. Then record name, type, and offsets in a text table-of-contents file (used later to read the file into Matlab).
Here is my script:
#!/usr/bin/perl
$lineArg1 = @ARGV[0];
open(INFILE, $lineArg1);
open BINOUT, '>:raw', $lineArg1.".bin";
open TOCOUT, '>', $lineArg1.".toc";
my $line;
my $data_start_pos;
my @param_name;
my @template;
while ($line = <INFILE>) {
    chomp $line;
    if ($line =~ s/\s+filter = ALL_VALUES//) {
        $line = =~ s/^\s+//;
        $line =~ s/\s+$//;
        push @param_name, $line;
    }
    elsif ($line =~ /^------/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g;
        $template[-1] = 'A*';
        $data_start_pos = tell INFILE;
        last; # Reached start of data, exit loop
    }
}
my $template = "@template";
my @lineData;
my @param_data;
my @param_time;
my $data_type;
foreach $current_param (@param_name) {
    @param_time = ();
    @param_data = ();
    seek(INFILE, $data_start_pos, 0); # Jump to data start
    while ($line = <INFILE>) {
        if ($line =~ /$current_param/) {
            chomp($line);
            @lineData = unpack $template, $line;
            push @param_time, $lineData[0];
            push @param_data, $lineData[3];
        }
    } # END WHILE <INFILE>
} # END FOREACH NAME
continue {
    $data_type = $lineData[2];
    print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; # Write name,type,offset to start time
    print BINOUT pack('d*', @param_time); # Write TimeStamps
    print TOCOUT tell(BINOUT).","; # offset to end of time/data start
    if ($data_type eq "TXT") {
        print BINOUT pack 'A*', join("\n", @param_data);
    }
    elsif ($data_type eq "D") {
        print BINOUT pack('d*', @param_data);
    }
    elsif ($data_type eq "UI") {
        print BINOUT pack('L*', @param_data);
    }
    print TOCOUT tell(BINOUT).","."\n"; # Write memory loc to end data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);
So my questions to you good people of the web are as follows:
What am I obviously screwing up? Syntax, declaring variables when I don't need to, etc.
This is probably slow (guessing) because of the nested loops and searching line by line over and over again. Is there a better way to restructure the loops to extract multiple lines at once?
Any other speed improvement tips you can give?
Edit: I modified the example text file to illustrate non-integer time stamps and Param Names may contain spaces.
First, you should always have the 'use strict;' and 'use warnings;' pragmas in your script.
It seems like you need a simple array (@param_name) for reference, so loading those values would be straightforward as you have it. (Again, adding the above pragmas would start showing you errors, including the $line = =~ s/^\s+//; line!)
I suggest you read this to understand how you can load your data file into a
Hash of Hashes. Once you've designed the hash, you simply read and load the file data contents, and then iterate through the contents of the hash.
For example, using time as the key for the hash
%HoH = (
    1 => {
        name  => "Param1",
        ty    => "UI",
        value => "5",
    },
    2 => {
        name  => "Param3",
        ty    => "TXT",
        value => "Some Text 1",
    },
    3 => {
        name  => "Param1",
        ty    => "UI",
        value => "10",
    },
);
Make sure you close the INFILE after reading in the contents, before you start processing.
So in the end, you iterate over the hash, and reference the array (instead of the file contents) for your output writes - I would imagine it would be much faster to do this.
Let me know if you need more info.
Note: if you go this route, include Data::Dumper - a significant help for printing and understanding the data in your hash!
It seems to me that embedded spaces can only occur in the last field. That makes using split ' ' feasible for this problem.
I am assuming you are not interested in the header. In addition, I am assuming you want a vector for each parameter and are not interested in timestamps.
To use data file names specified on the command line or piped through standard input, replace <DATA> with <>.
#!/usr/bin/env perl
use strict; use warnings;
my %data;
$_ = <DATA> until /^-+/; # skip header
while (my $line = <DATA>) {
    $line =~ s/\s+\z//;
    last unless $line =~ /\S/;
    my (undef, $param, undef, $value) = split ' ', $line, 4;
    push @{ $data{ $param } }, $value;
}
use Data::Dumper;
print Dumper \%data;
__DATA__
Param1 filter = ALL_VALUES
Param2 filter = ALL_VALUES
Param3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1 Param1 UI 5
2 Param3 TXT Some Text 1
3 Param1 UI 10
4 Param2 D 2.1234
5 Param1 UI 15
6 Param2 D 3.1234
7 Param3 TXT Some Text 2
Output:
$VAR1 = {
'Param2' => [
'2.1234',
'3.1234'
],
'Param1' => [
'5',
'10',
'15'
],
'Param3' => [
'Some Text 1',
'Some Text 2'
]
};
First off, this piece of code causes the input file to be read once for every param, which is quite inefficient.
foreach $current_param (@param_name) {
    ...
    seek(INFILE, $data_start_pos, 0); # Jump to data start
    while ($line = <INFILE>) { ... }
    ...
}
Also, there is very rarely a reason to use a continue block. This is more about style / readability than a real problem.
Now on to make it more performant.
I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.
This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp 'tempfile';
use File::Copy 'copy';
use autodie qw':default copy';
use 5.10.1;
my $input_filename = shift @ARGV;
open my $input, '<', $input_filename;
my @param_names;
my $template = ''; # stop uninitialized warning
my @field_names;
my $field_name_line;
while( <$input> ){
    chomp;
    next if /^\s*$/;
    if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
        push @param_names, $param;
    }elsif( /^[\s-]+$/ ){
        my @fields = split /(\s+)/;
        my $pos = 0;
        for my $field (@fields){
            my $length = length $field;
            if( substr($field, 0, 1) eq '-' ){
                $template .= "\@${pos}A$length ";
            }
            $pos += $length;
        }
        last;
    }else{
        $field_name_line = $_;
    }
}
@field_names = unpack $template, $field_name_line;
for( @field_names ){
    s(^\s+){};
    $_ = lc $_;
    $_ = 'type' if substr('type', 0, length $_) eq $_;
}
my %temp_files;
for my $param ( @param_names ){
    for(qw'time data'){
        my $fh = tempfile 'temp_XXXX', UNLINK => 1;
        binmode $fh, ':raw';
        $temp_files{$param}{$_} = $fh;
    }
}
my %convert = (
    TXT => sub{ pack 'A*', join "\n", @_ },
    D   => sub{ pack 'd*', @_ },
    UI  => sub{ pack 'L*', @_ },
);
sub print_time{
    my($param,$time) = @_;
    my $fh = $temp_files{$param}{time};
    print {$fh} $convert{D}->($time);
}
sub print_data{
    my($param,$format,$data) = @_;
    my $fh = $temp_files{$param}{data};
    print {$fh} $convert{$format}->($data);
}
my %data_type;
while( my $line = <$input> ){
    next if $line =~ /^\s*$/;
    my %fields;
    @fields{@field_names} = unpack $template, $line;
    print_time( @fields{(qw'name time')} );
    print_data( @fields{(qw'name type value')} );
    $data_type{$fields{name}} //= $fields{type};
}
close $input;
open my $bin, '>:raw', $input_filename.".bin";
open my $toc, '>', $input_filename.".toc";
for my $param( @param_names ){
    my $data_fh = $temp_files{$param}{data};
    my $time_fh = $temp_files{$param}{time};
    seek $data_fh, 0, 0;
    seek $time_fh, 0, 0;
    my @toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );
    copy( $time_fh, $bin, 8*1024 );
    close $time_fh;
    push @toc_line, sysseek($bin, 0, 1);
    copy( $data_fh, $bin, 8*1024 );
    close $data_fh;
    push @toc_line, sysseek($bin, 0, 1);
    say {$toc} join ',', @toc_line, '';
}
close $bin;
close $toc;
I modified my code to build a hash as suggested. I have not incorporated the output to binary yet due to time limitations. Plus I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be too difficult ... hopefully.
On an actual data file (~350MB & 2.0 million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on 1 of my cores (nil on the other 3) and Perl memory usage topped out at around 325MB ... until it dumped millions of lines to the prompt. However, the print Dumper will be replaced with a binary pack.
Please let me know if I am making any rookie mistakes.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $lineArg1 = $ARGV[0];
open(INFILE, $lineArg1);
my $line;
my @param_names;
my @template;
while ($line = <INFILE>) {
    chomp $line; # Remove newline
    if ($line =~ s/\s+filter = ALL_VALUES//) { # Find parameters and build a list
        push @param_names, trim($line);
    }
    elsif ($line =~ /^----/) {
        @template = map {'A'.length} $line =~ /(\S+\s*)/g; # Make template for unpack
        $template[-1] = 'A*';
        my $data_start_pos = tell INFILE;
        last; # Reached start of data, exit loop
    }
}
my $size = $#param_names+1;
my @getType = ((1) x $size);
my $template = "@template";
my @lineData;
my %dataHash;
my $lineCount = 0;
while ($line = <INFILE>) {
    if ($lineCount % 100000 == 0){
        print "On Line: ".$lineCount."\n";
    }
    if ($line =~ /^\d/) {
        chomp($line);
        @lineData = unpack $template, $line;
        my ($inHeader, $headerIndex) = findStr($lineData[1], @param_names);
        if ($inHeader) {
            push @{ $dataHash{$lineData[1]}{time} }, $lineData[0];
            push @{ $dataHash{$lineData[1]}{data} }, $lineData[3];
            if ($getType[$headerIndex]) { # Things that only need written once
                $dataHash{$lineData[1]}{type} = $lineData[2];
                $getType[$headerIndex] = 0;
            }
        }
    }
    $lineCount++;
} # END WHILE <INFILE>
close(INFILE);
print Dumper \%dataHash;
# WRITE BINARY FILE and TOC FILE
my %convert = (TXT=>sub{pack 'A*', join "\n", @_}, D=>sub{pack 'd*', @_}, UI=>sub{pack 'L*', @_});
open my $binfile, '>:raw', $lineArg1.'.bin';
open my $tocfile, '>', $lineArg1.'.toc';
for my $param (@param_names){
    my $data = $dataHash{$param};
    my @toc_line = ($param, $data->{type}, tell $binfile );
    print {$binfile} $convert{D}->(@{$data->{time}});
    push @toc_line, tell $binfile;
    print {$binfile} $convert{$data->{type}}->(@{$data->{data}});
    push @toc_line, tell $binfile;
    print {$tocfile} join(',',@toc_line,''),"\n";
}

sub trim { # Trim leading and trailing whitespace
    my (@strings) = @_;
    foreach my $string (@strings) {
        $string =~ s/^\s+//;
        $string =~ s/\s+$//;
        chomp ($string);
    }
    return wantarray ? @strings : $strings[0];
} # END SUB

sub findStr { # Return TRUE if string is contained in array.
    my $searchStr = shift;
    my $i = 0;
    foreach ( @_ ) {
        if ($_ eq $searchStr){
            return (1,$i);
        }
        $i++;
    }
    return (0,-1);
} # END SUB
The output is as follows:
$VAR1 = {
'Param 1' => {
'time' => [
'1.1',
'3.2',
'5.3'
],
'type' => 'UI',
'data' => [
'5',
'10',
'15'
]
},
'Param 2' => {
'time' => [
'4.5',
'6.121'
],
'type' => 'D',
'data' => [
'2.1234',
'3.1234'
]
},
'Param 3' => {
'time' => [
'2.23',
'7.56'
],
'type' => 'TXT',
'data' => [
'Some Text 1',
'Some Text 2'
]
}
};
Here is the output TOC File:
Param 1,UI,0,24,36,
Param 2,D,36,52,68,
Param 3,TXT,68,84,107,
Thanks everyone for their help so far! This is an excellent resource!
EDIT: Added Binary & TOC file writing code.