Best way to compare multi-dimension hash in perl - perl

I want to write a script that can compare two multi-dimension hash to see whether they match or not. This two hash has same value of key, the comparison will done in a pair mean starting with $j=0 first compare $line[0] $line[1] then follow by $line[2] $line[3]
$hash{"key"}{$key}{"$row $col"}= "$line[$j] $line[$j+1]";
$hash1{"key"}{$key1}{"$row1 $col1"}="$line1[$j] $line1[$j+1]";
my $line and $line1 is take from a huge file and will process the huge file line by line. For example:
#line = 1 2 3 4 5 6 7 8 #fst line from file1
#line1 = 1 2 3 3 4 5 6 7 7 #fst line from file2
when $key and $key1 match the scripts will continue to compare the $row $col then lastly compare $line[$j] $line[$j+1]. I am trying to use one of the module Test::More that mention at previous similar question but it can only compare the 1st not matching and the output is in default format.
not ok 1 - data structures should be the same
# Failed test 'data structures should be the same'
# at hash_check1.pl line 80.
# Structures begin differing at:
# $got->{key}{100}{2 3} = '62 19'
# $expected->{key}{100}{2 3} = '12 24'
# Tests were run but no plan was declared and done_testing() was not seen.
Any best way to compare this type of multi-dimension hash?
8/22 Edited
If Test::More module is able to compare all the $key match for two hash i can take that as well but if i can output the matching and non matching in my own prefer format will be great. For example i would like to output like
$key at $row $col no match with value $line[$j] $line[$j+1] ( expected value $line1[$j] $line1[$j+1] )
8/22 Edited
Below is part of my code
use strict;
use warnings;
use Test::More;
open ( FILE1 , '<', "file.txt" ) or die $!;
open ( FILE2 , '<' , "file1.txt" ) or die $!;
chomp (my #file1 = <FILE1>);
chomp (my #file2 = <FILE2>);
my %hash=();
my %hash1=();
for ( $i =0 ; $i<=511 ; $i++ ) {
my #line = split(" ",$file1[$i]);
my #line1 = split(" ",$file2[$i]);
my $key = ($i+1)*10;
my $key1 = ($i+2)*10;
for ( $j=0; $j<=15 ; $j+=2){
my $col = hex($j);
my $col1 = hex($j+1);
$hash{"key"}{$key}{"$row $col1"}= "$line[$j] $line[$j+1]";
$hash1{"key"}{$key1}{"$row1 $col1"}= "$line1[$j] $line1[$j+1]";
}
}
##############comparison part start here###################
is_deeply(\%hash, \%hash1, 'data structures should be the same'); #can only print one mismatch
################Any better way?###########

Here is an example that will compare nested hashes with string values:
#! /usr/bin/env perl
use feature qw(say);
use warnings;
use strict;
my %hash;
my %hash1;
$hash{"key"}{A}{"2 3"}= "1 2";
$hash1{"key"}{A}{"2 3"}="2 3";
$hash{"key"}{B}{"2 3"}= "1 2";
$hash1{"key"}{C}{"2 3"}="2 3";
compare(\%hash, \%hash1);
sub compare {
return _compare( #_, "" );
}
sub get_key_str { return $_[0] . '{' . $_[1] . '}' }
sub _process_key {
my ( $h1, $h2, $info, $key, $str1, $str2 ) = #_;
if ( exists $h2->{$key} ) {
my $val1 = $h1->{$key};
my $val2 = $h2->{$key};
if ( ref $val1 eq "HASH" and ref $val2 eq "HASH" ) {
_compare( $val1, $val2, get_key_str( $info, $key ) );
}
else {
die "Expected string value" if ref $val1 or ref $val2;
if ( $val1 ne $val2 ) {
say "Value '$val1' in $str1 hash for key " . get_key_str( $info, $key )
. " does not match value '$val2' in $str2 hash";
}
}
}
else {
my $cur_key = get_key_str( $info, $key );
say "Got key $cur_key in $str1 hash, but missing in $str2 hash";
}
}
sub _compare {
my ( $h1, $h2, $info ) = #_;
my %processed_keys;
for (keys %$h1) {
_process_key( $h1, $h2, $info, $_, "first", "second" );
$processed_keys{$_}++;
}
for (keys %$h2) {
next if exists $processed_keys{$_};
_process_key( $h2, $h1, $info, $_, "second", "first" );
}
}
Output:
Got key {key}{B} in first hash, but missing in second hash
Value '1 2' in first hash for key {key}{A}{2 3} does not match value '2 3' in second hash
Got key {key}{C} in second hash, but missing in first hash

Related

perl match part of string in two files

I'm using a perl script to look for matches between columns in two tab-delimited files. However for one column I only want to look for a partial match between two strings in two columns.
It concerns $row[4] of $table2 and $row{d} of $table1.
The values in $row[4] of $table2 look like this:
'xxxx'.
The values in $row{d} of $table1 look like this:
'xxxx.aaa'.
If the part before the '.' is the same, there is a match. If not, there is no match. I'm not sure how to implement this in my script. This is what I have so far. I only looks for complete matches between different columns. '...' denotes code that is not important for this question
#! /usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
use Getopt::Long qw(GetOptions);
...
...
chomp( my #header_table2 = split /\t/, <$table2> );
my %lookup;
while(<$table2>){
chomp;
my #row = split(/\t/);
$lookup{ $row[0] }{ $row[1] }{ $row[4] }{ $row[5] }{ $row[6] }{ $row[7] }{ $row[8] } = [ $row[9], $row[10] ];
}
my #header = do {
my $header = <$table1>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
print $table3 join ("\t", #header, qw/ name1 name2 /), "\n";
{
no warnings 'uninitialized';
while(<$table1>){
s/\t?\n\z//;
my %row;
#row{#header} = split /\t/;
print $table3 join ( "\t", #row{#header},
#{ $lookup{ $row{a} }{ $row{b} }{ $row{c} }{ $row{d} }{ $row{e} }{ $row{f} }{ $row{g} }
// [ "", "" ] }), "\n";
}
}
This is looking like a job for a database
The solution below isn't going to work, because you are building your %lookup hash with nine levels of keys ($row[0] .. $row[8]) , and accessing it with only seven levels ($row{a} .. $row{g}), so you're going to have to edit in the real situation
I see no reason to next your hashes so deeply. A single key formed by using join on the relevant fields will work fine and probably a little faster. I also see no reason to extract table2 fields into an array and table1 fields into a hash. An array seems fine in both cases
I've solved your immediate problem by copying each #row from table1 into array #key, and removing the last dot and anything following from the fourth element before building the $key string
In view of your history of adding a spare tab character before the newline at the end of each record, I've also added four die statements that verify the size of the header row and columns rows before continuing. You will probably need to tweak those values according to your real data
use strict;
use warnings 'all';
use Data::Dumper;
local $Data::Dumper::Useqq = 1;
use Getopt::Long qw(GetOptions);
use constant TABLE1_COLUMNS => 9;
use constant TABLE2_COLUMNS => 11;
open my $table2, '<', 'table2.txt' or die $!;
my #header_table2 = do {
my $header = <$table2>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
die "Incorrect table 2 header count " . scalar #header_table2
unless #header_table2 == TABLE2_COLUMNS;
my %lookup;
while ( <$table2> ) {
chomp;
my #row = split /\t/;
die "Incorrect table 2 column count " . scalar #row
unless #row == TABLE2_COLUMNS;
my $key = do {
local $" = "\n";
"#row[0..8]";
};
$lookup{ $key } = [ #row[9,10] ];
}
open my $table1, '<', 'table1.txt' or die $!;
my #header = do {
my $header = <$table1>;
$header =~ s/\t?\n\z//;
split /\t/, $header;
};
die "Incorrect table 1 header count " . scalar #header
unless #header == TABLE1_COLUMNS;
open my $table3, '>', 'table3.txt' or die $!;
print $table3 join ("\t", #header, qw/ name1 name2 /), "\n";
while ( <$table1> ) {
s/\t?\n\z//;
my #row = split /\t/;
die "Incorrect table 1 column count " . scalar #row
unless #row == TABLE1_COLUMNS;
my $key = do {
my #key = #row;
$key[3] =~ s/\.[^.]*\z//;
local $" = "\n";
"#key";
};
my $lookup = $lookup{ $key } // [ "", "" ];
print $table3 join("\t", #row, #$lookup), "\n";
}
You're going to have a scoping problem because your array #row and your hash %row both exist in completely different scopes.
But if you have variables (say, $foo and $bar) and you want to know if $foo starts with the contents of $bar followed by a dot, then you can do that using a regular expression check like this:
if ($foo =~ /^$bar\./) {
# match
} else {
# no match
}

Perl: Compare Two CSV Files and Print out matches (modifying this code)

I am very new at perl and had discovered the solution at:
Perl: Compare Two CSV Files and Print out differences
I have gone through dozens of other solutions and this comes closest, except that instead of finding the differences between 2 CSV files, I want to find where the second CSV file matches the first one in column and row. How could I modify the following script to find the matches in column/row instead of the differences. I am hoping to dissect this code and learn arrays from there, but wanted to find out the solution to this application. Much thanks.
use strict;
my #arr1;
my #arr2;
my $a;
open(FIL,"a.txt") or die("$!");
while (<FIL>)
{chomp; $a=$_; $a =~ s/[\t;, ]*//g; push #arr1, $a if ($a ne '');};
close(FIL);
open(FIL,"b.txt") or die("$!");
while (<FIL>)
{chomp; $a=$_; $a =~ s/[\t;, ]*//g; push #arr2, $a if ($a ne '');};
close(FIL);
my %arr1hash;
my %arr2hash;
my #diffarr;
foreach(#arr1) {$arr1hash{$_} = 1; }
foreach(#arr2) {$arr2hash{$_} = 1; }
foreach $a(#arr1)
{
if (not defined($arr2hash{$a}))
{
push #diffarr, $a;
}
}
foreach $a(#arr2)
{
if (not defined($arr1hash{$a}))
{
push #diffarr, $a;
}
}
print "Diff:\n";
foreach $a(#diffarr)
{
print "$a\n";
}
# You can print to a file instead, by: print FIL "$a\n";
ok, I realize that this was more what I was looking for:
use strict;
use warnings;
use feature qw(say);
use autodie;
use constant {
FILE_1 => "file1.txt",
FILE_2 => "file2.txt",
};
#
# Load Hash #1 with value from File #1
#
my %hash1;
open my $file1_fh, "<", FILE_1;
while ( my $value = <$file1_fh> ) {
chomp $value;
$hash1{$value} = 1;
}
close $file1_fh;
#
# Load Hash #2 with value from File #2
#
my %hash2;
open my $file2_fh, "<", FILE_2;
while ( my $value = <$file2_fh> ) {
chomp $value;
$hash2{$value} = 1;
}
close $file2_fh;
Now I want to search file2's hash to check if there are ANY matches from file1's hash. That is where I am stuck
With new code suggestion, code now looks like this
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use constant {
FILE_1 => "masterlist.csv",
FILE_2 => "pastebin.csv",
};
#
# Load Hash #1 with value from File #1
#
my %hash1;
open my $file1_fh, "<", FILE_1;
while ( my $value = <$file1_fh> ) {
chomp $value;
$hash1{$value} = 1;
}
close $file1_fh;
my %hash2;
open my $file2_fh, "<", FILE_2;
while ( my $value = <$file2_fh> ) {
chomp $value;
if ( $hash1{$value} ) {
print "Match found $value\n";
$hash2{$value}++;
}
}
close $file2_fh;
print "Matches found:\n";
foreach my $key ( keys %hash2 ) {
print "$key found $hash2{$key} times\n";
}
I updated one part with split() and it seems to work, but have to test more to confirm if it fits the solution I'm looking for or I have more work to do one it
#
# Load Hash #1 with value from File #1
#
my %hash1;
open my $file1_fh, "<", FILE_1;
while ( my $value = <$file1_fh> ) {
chomp $value;
$hash1{$value} = ( %hash1, (split(/,/, $_))[1,2] );
}
close $file1_fh;
So, with your code there - you've read in 'file1' to a hash.
Why not instead of reading file 2 into a hash, do instead:
my %hash2;
open my $file2_fh, "<", FILE_2;
while ( my $value = <$file2_fh> ) {
chomp $value;
if ( $hash1{$value} ) {
print "Match found $value\n";
$hash2{$value}++;
}
}
close $file2_fh;
print "Matches found:\n";
foreach my $key ( keys %hash2 ) {
print "$key found $hash2{$key} times\n";
}
I think this code identifies every place that a data field in file A matches a data field in file B (at least it does on my limited test data):
use strict;
use warnings;
my #arr1;
my #arr2;
# a.txt -> #arr1
my $file_a_name = "poster_a.txt";
open(FIL,$file_a_name) or die("$!");
my $a_line_counter = 0;
while (my $a_line = <FIL>)
{
$a_line_counter = $a_line_counter + 1;
chomp($a_line);
my #fields = (split /,/,$a_line);
my $num_fields = scalar(#fields);
s{^\s+|\s+$}{}g foreach #fields;
push #arr1, \#fields if ( $num_fields ne 0);
};;
close(FIL);
my $file_b_name = "poster_b.txt";
open(FIL,$file_b_name) or die("$!");
while (my $b_line = <FIL>)
{
chomp($b_line);
my #fields = (split /,/,$b_line);
my $num_fields = scalar(#fields);
s{^\s+|\s+$}{}g foreach #fields;
push #arr2, \#fields if ( $num_fields ne 0)
};
close(FIL);
# b.txt -> #arr2
#print "\n",#arr2, "\n";
my #match_array;
my $file_a_line_ctr = 1;
foreach my $file_a_line_fields (#arr1)
{
my $file_a_column_ctr = 1;
foreach my $file_a_line_field (#{$file_a_line_fields})
{
my $file_b_line_ctr = 1;
foreach my $file_b_line_fields(#arr2)
{
my $file_b_column_ctr = 1;
foreach my $file_b_field (#{$file_b_line_fields})
{
if ( $file_b_field eq $file_a_line_field )
{
my $match_info =
"$file_a_name line $file_a_line_ctr column $file_a_column_ctr" .
" (${file_a_line_field}) matches: " .
"$file_b_name line $file_b_line_ctr column $file_b_column_ctr ";
push(#match_array, $match_info);
print "$match_info \n";
}
$file_b_column_ctr = $file_b_column_ctr + 1;
}
$file_b_line_ctr = $file_b_line_ctr + 1;
}
$file_a_column_ctr = $file_a_column_ctr + 1;
}
$file_a_line_ctr = $file_a_line_ctr + 1;
}
print "there were ", scalar(#match_array)," matches\n";

rearrange data from one column to a row

I have the below data and I need to make the second column as the header. Any help is appreciated.
Data:
IBM,Voltality,7,73894756.93897434897
IBM,Market,100,983874.34324
GOOG,Sanity,15,8932748
GOOG,Rate Jump,25,873476378.234234
MBLY,Market,340,23423423432.6783
Output:
PRODUCT|Market|Rate Jump|Sanity|Voltality
IBM|100,983874.34324|||7,73894756.93897434897
GOOG||25,873476378.234234|15,8932748|||
MBLY|340,23423423432.6783|||
Code (incomplete / not sure hot to get to the end):
#!/usr/bin/perl
use strict;
use Getopt::Long;
use warnings;
use Data::Dumper;
my $valsep = ',';
my ( %type, %keys, %ccy, %cnt, %avg );
while (<>) {
chomp;
my ( $product, $reason, $count, $lat ) = split /,/;
my $key = "$product,$reason";
if ( not exists( $type{$reason} ) ) {
$type{$reason} = $reason;
}
$ccy{$key} = $product;
$cnt{$key} = $count;
$avg{$key} = $lat;
}
close(INPUT);
print Dumper ( \%ccy );
print Dumper ( \%type );
my ( %pair, %details );
foreach my $rows ( sort keys %ccy ) {
print "the key is : $rows and $ccy{$rows}\n";
foreach my $res ( sort keys %type ) {
print "The type is : $res and $type{$res}\n";
}
}
You just need to keep track of your columns and row data when parsing the data structure.
The following demonstrates:
#!/usr/bin/perl
use strict;
use warnings;
my $fh = \*DATA;
my %columns;
my %rows;
while (<$fh>) {
chomp;
my ( $company, $col, $vals ) = split ',', $_, 3;
# Track Columns for later labeling
$columns{$col}++;
$rows{$company}{$col} = $vals;
}
my #columns = sort keys %columns;
# Header
print join( '|', 'PRODUCT', #columns ), "\n";
for my $company ( sort keys %rows ) {
print join( '|', $company, map { $_ // '' } #{ $rows{$company} }{#columns} ), "\n";
}
__DATA__
IBM,Voltality,7,73894756.93897434897
IBM,Market,100,983874.34324
GOOG,Sanity,15,8932748
GOOG,Rate Jump,25,873476378.234234
MBLY,Market,340,23423423432.6783
Outputs:
PRODUCT|Market|Rate Jump|Sanity|Voltality
GOOG||25,873476378.234234|15,8932748|
IBM|100,983874.34324|||7,73894756.93897434897
MBLY|340,23423423432.6783|||
The following code will do the job; rather than using several hashes, I've put all the data in a hash of hashes. I've put comments in the script to explain what is happening in case you are not sure. You can, of course, delete them in your script.
#!/usr/bin/perl
use warnings;
use strict;
my %market;
while (<DATA>) {
next unless /\w/;
# remove line endings
chomp;
# split line by commas -- only split into three parts
my #col = split ",", $_, 3;
# save the data as $market{col0}{col1} = col2
$market{$col[0]}{$col[1]} = $col[2];
}
# create an output file
my $outfile = 'output.txt';
open( my $fh, ">", $outfile ) or die "Could not open $outfile: $!";
my #headers = ('Market','Rate Jump','Sanity','Volatility');
# print out the header line, joined by |
print { $fh } join('|', 'PRODUCT', #headers) . "\n";
# for each product in the market data
for my $p (sort keys %market) {
# print the product name
print { $fh } join('|', $p,
# go through the headers using map (map acts like a "for" loop)
# if the relevant property exists in the market data, print it;
# if not, print nothing
map { $market{$p}{$_} // '' } #headers) . "\n";
}
# this is the input data. You might be reading yours in from a file
__DATA__
IBM,Voltality,7,73894756.93897434897
IBM,Market,100,983874.34324
GOOG,Sanity,15,8932748
GOOG,Rate Jump,25,873476378.234234
MBLY,Market,340,23423423432.6783
Output:
PRODUCT|Market|Rate Jump|Sanity|Volatility
GOOG||25,873476378.234234|15,8932748|
IBM|100,983874.34324|||7,73894756.93897434897
MBLY|340,23423423432.6783|||

Perl compare individual elements of two arrays

I have two files with two columns each:
FILE1
A B
1 #
2 #
3 !
4 %
5 %
FILE 2
A B
3 #
4 !
2 &
1 %
5 ^
The Perl script must compare column A in both both files, and only if they are equal, column B of FIlE 2 must be printed
So far I have the following code but all I get is an infinite loop with # from column B
use strict;
use warnings;
use 5.010;
print "enter site:"."\n";
chomp(my $s = <>);
print "enter protein:"."\n";
chomp(my $p = <>);
open( FILE, "< $s" ) or die;
open( OUT, "> PSP.txt" ) or die;
open( FILE2, "< $p" ) or die;
my #firstcol;
my #secondcol;
my #thirdcol;
while ( <FILE> )
{
next if $. <2;
chomp;
my #cols = split;
push #firstcol, $cols[0];
push #secondcol, $cols[1]."\t"."\t".$cols[3]."\t"."\t"."\t"."N\/A"."\n";
}
my #firstcol2;
my #secondcol2;
my #thirdcol2;
while ( <FILE2> )
{
next if $. <2;
my #cols2 = split(/\t/, $_);
push #firstcol2, $cols2[0];
push #secondcol2, $cols2[4]."\n";
}
my $size = #firstcol;
my $size2 = #firstcol2;
for (my $i = 0; $i <= #firstcol ; $i++) {
for (my $j = 0; $j <= #firstcol2; $j++) {
if ( $firstcol[$i] eq $firstcol2[$j] )
{
print $secondcol2[$i];
}
}
}
my (#first, #second);
while(<first>){
chomp;
my $foo = split / /, $_;
push #first , $foo;
}
while(<second>){
chomp;
my $bar = split / / , $_;
push #second, $bar;
}
my %first = #first;
my %second = #second;
Build a hash of the first file as %first and second file as %second with first column as key and second column as value.
for(keys %first)
{
print $second{$_} if exists $second{$_}
}
I couldn't check it as I am on mobile. hope that gives you an idea.
I assume that column A is ordered and that you actually want to compare the first entry in File 1 to the first entry in File 2, and so on.
If that's true, you have nested loop that you don't need. Simplify your last while as such:
for my $i (0..$#firstcol) {
if ( $firstcol[$i] eq $firstcol2[$i] )
{
print $secondcol2[$i];
}
}
Also, if you're at all concerned about the files being of different length, then you can adjust the loop:
use List::Util qw(min);
for my $i (0..min($#firstcol, $#firstcol2)) {
Additional Note: You aren't chomping your data in the second file loop while ( <FILE2> ). That might introduce a bug later.
If your files are called file1.txt and file2.txt the next:
use Modern::Perl;
use Path::Class;
my $files;
#{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } file("file$_.txt")->slurp for (1..2);
for my $line1 (#{$files->{1}}) {
my $line2 = shift #{$files->{2}};
say $line2->[1] if ($line1->[0] eq $line2->[0]);
}
prints:
B
^
equals in column1 only the lines A and 5
without the CPAN modules - produces the same result
use strict;
use warnings;
my $files;
#{$files->{$_}} = map { [split /\s+/] } grep { !/^\s*$/ } do { local(#ARGV)="file$_.txt";<> } for (1..2);
for my $line1 (#{$files->{1}}) {
my $line2 = shift #{$files->{2}};
print $line2->[1],"\n" if ($line1->[0] eq $line2->[0]);
}

Parsing unsorted data from large fixed width text

I am mostly a Matlab user and a Perl n00b. This is my first Perl script.
I have a large fixed width data file that I would like to process into a binary file with a table of contents. My issue is that the data files are pretty large and the data parameters are sorted by time. Which makes it difficult (at least for me) to parse into Matlab. So seeing how Matlab is not that good at parsing text I thought I would try Perl. I wrote the following code which works ... at least on my small test file. However it is painfully slow when I tried it on an actual large data file. It was pieced together which lots of examples for various tasks from the web / Perl documentation.
Here is a small sample of the data file. Note: Real file has about 2000 parameter and is 1-2GB. Parameters can be text, doubles, or unsigned integers.
Param 1 filter = ALL_VALUES
Param 2 filter = ALL_VALUES
Param 3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1.1 Param 1 UI 5
2.23 Param 3 TXT Some Text 1
3.2 Param 1 UI 10
4.5 Param 2 D 2.1234
5.3 Param 1 UI 15
6.121 Param 2 D 3.1234
7.56 Param 3 TXT Some Text 2
The basic logic of my script is to:
Read until the ---- line to build list of parameters to extract (always has "filter =").
Use the --- line to determine field widths. It is broken by spaces.
For each parameter build time and data array (while nested inside of foreach param)
In continue block write time and data to binary file. Then record name, type, and offsets in text table of contents file (used to read the file later into Matlab).
Here is my script:
#!/usr/bin/perl
$lineArg1 = #ARGV[0];
open(INFILE, $lineArg1);
open BINOUT, '>:raw', $lineArg1.".bin";
open TOCOUT, '>', $lineArg1.".toc";
my $line;
my $data_start_pos;
my #param_name;
my #template;
while ($line = <INFILE>) {
chomp $line;
if ($line =~ s/\s+filter = ALL_VALUES//) {
$line = =~ s/^\s+//;
$line =~ s/\s+$//;
push #param_name, $line;
}
elsif ($line =~ /^------/) {
#template = map {'A'.length} $line =~ /(\S+\s*)/g;
$template[-1] = 'A*';
$data_start_pos = tell INFILE;
last; #Reached start of data exit loop
}
}
my $template = "#template";
my #lineData;
my #param_data;
my #param_time;
my $data_type;
foreach $current_param (#param_name) {
#param_time = ();
#param_data = ();
seek(INFILE,$data_start_pos,0); #Jump to data start
while ($line = <INFILE>) {
if($line =~ /$current_param/) {
chomp($line);
#lineData = unpack $template, $line;
push #param_time, #lineData[0];
push #param_data, #lineData[3];
}
} # END WHILE <INFILE>
} #END FOR EACH NAME
continue {
$data_type = #lineData[2];
print TOCOUT $current_param.",".$data_type.",".tell(BINOUT).","; #Write name,type,offset to start time
print BINOUT pack('d*', #param_time); #Write TimeStamps
print TOCOUT tell(BINOUT).","; #offset to end of time/data start
if ($data_type eq "TXT") {
print BINOUT pack 'A*', join("\n",#param_data);
}
elsif ($data_type eq "D") {
print BINOUT pack('d*', #param_data);
}
elsif ($data_type eq "UI") {
print BINOUT pack('L*', #param_data);
}
print TOCOUT tell(BINOUT).","."\n"; #Write memory loc to end data
}
close(INFILE);
close(BINOUT);
close(TOCOUT);
So my questions to you good people of the web are as follows:
What am I obviously screwing up? Syntax, declaring variables when I don't need to, etc.
This is probably slow (guessing) because of the nested loops and searching the line by line over and over again. Is there a better way to restructure the loops to extract multiple lines at once?
Any other speed improvement tips you can give?
Edit: I modified the example text file to illustrate non-integer time stamps and Param Names may contain spaces.
First, you should always have 'use strict;' and 'use warnings;' pragmas in your script.
It seems like you need a simple array (#param_name) for reference, so loading those values would be straight forward as you have it. (again, adding the above pragmas would start showing you errors, including the $line = =~ s/^\s+//; line!)
I suggest you read this, to understand how you can load your data file into a
Hash of Hashes. Once you've designed the hash, you simply read and load the file data contents, and then iterate through the contents of the hash.
For example, using time as the key for the hash
%HoH = (
1 => {
name => "Param1",
ty => "UI",
value => "5",
},
2 => {
name => "Param3",
ty => "TXT",
value => "Some Text 1",
},
3 => {
name => "Param1",
ty => "UI",
value => "10",
},
);
Make sure you close the INFILE after reading in the contents, before you start processing.
So in the end, you iterate over the hash, and reference the array (instead of the file contents) for your output writes - I would imagine it would be much faster to do this.
Let me know if you need more info.
Note: if you go this route, include Data:Dumper - a significant help to printing and understanding the data in your hash!
It seems to me that embedded spaces can only occur in the last field. That makes using split ' ' feasible for this problem.
I am assuming you are not interested in the header. In addition, I am assuming you want a vector for each parameter and are not interested in timestamps.
To use data file names specified on the command line or piped through standard input, replace <DATA> with <>.
#!/usr/bin/env perl
use strict; use warnings;
my %data;
$_ = <DATA> until /^-+/; # skip header
while (my $line = <DATA>) {
$line =~ s/\s+\z//;
last unless $line =~ /\S/;
my (undef, $param, undef, $value) = split ' ', $line, 4;
push #{ $data{ $param } }, $value;
}
use Data::Dumper;
print Dumper \%data;
__DATA__
Param1 filter = ALL_VALUES
Param2 filter = ALL_VALUES
Param3 filter = ALL_VALUES
Time Name Ty Value
---------- ---------------------- --- ------------
1 Param1 UI 5
2 Param3 TXT Some Text 1
3 Param1 UI 10
4 Param2 D 2.1234
5 Param1 UI 15
6 Param2 D 3.1234
7 Param3 TXT Some Text 2
Output:
$VAR1 = {
'Param2' => [
'2.1234',
'3.1234'
],
'Param1' => [
'5',
'10',
'15'
],
'Param3' => [
'Some Text 1',
'Some Text 2'
]
};
First off, this piece of code causes the input file to be read once for every param. Which is quite in-efficient.
foreach $current_param (#param_name) {
...
seek(INFILE,$data_start_pos,0); #Jump to data start
while ($line = <INFILE>) { ... }
...
}
Also there is very rarely a reason to use a continue block. This is more style / readability, then a real problem.
Now on to make it more performant.
I packed the sections individually, so that I could process a line exactly once. To prevent it from using up tons of RAM, I used File::Temp to store the data until I was ready for it. Then I used File::Copy to append those sections into the binary file.
This is a quick implementation. If I were to add much more to it, I would split it up more than it is now.
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp 'tempfile';
use File::Copy 'copy';
use autodie qw':default copy';
use 5.10.1;
my $input_filename = shift #ARGV;
open my $input, '<', $input_filename;
my #param_names;
my $template = ''; # stop uninitialized warning
my #field_names;
my $field_name_line;
while( <$input> ){
chomp;
next if /^\s*$/;
if( my ($param) = /^\s*(.+?)\s+filter = ALL_VALUES\s*$/ ){
push #param_names, $param;
}elsif( /^[\s-]+$/ ){
my #fields = split /(\s+)/;
my $pos = 0;
for my $field (#fields){
my $length = length $field;
if( substr($field, 0, 1) eq '-' ){
$template .= "\#${pos}A$length ";
}
$pos += $length;
}
last;
}else{
$field_name_line = $_;
}
}
#field_names = unpack $template, $field_name_line;
for( #field_names ){
s(^\s+){};
$_ = lc $_;
$_ = 'type' if substr('type', 0, length $_) eq $_;
}
my %temp_files;
for my $param ( #param_names ){
for(qw'time data'){
my $fh = tempfile 'temp_XXXX', UNLINK => 1;
binmode $fh, ':raw';
$temp_files{$param}{$_} = $fh;
}
}
my %convert = (
TXT => sub{ pack 'A*', join "\n", #_ },
D => sub{ pack 'd*', #_ },
UI => sub{ pack 'L*', #_ },
);
sub print_time{
my($param,$time) = #_;
my $fh = $temp_files{$param}{time};
print {$fh} $convert{D}->($time);
}
sub print_data{
my($param,$format,$data) = #_;
my $fh = $temp_files{$param}{data};
print {$fh} $convert{$format}->($data);
}
my %data_type;
while( my $line = <$input> ){
next if $line =~ /^\s*$/;
my %fields;
#fields{#field_names} = unpack $template, $line;
print_time( #fields{(qw'name time')} );
print_data( #fields{(qw'name type value')} );
$data_type{$fields{name}} //= $fields{type};
}
close $input;
open my $bin, '>:raw', $input_filename.".bin";
open my $toc, '>', $input_filename.".toc";
for my $param( #param_names ){
my $data_fh = $temp_files{$param}{data};
my $time_fh = $temp_files{$param}{time};
seek $data_fh, 0, 0;
seek $time_fh, 0, 0;
my #toc_line = ( $param, $data_type{$param}, 0+sysseek($bin, 0, 1) );
copy( $time_fh, $bin, 8*1024 );
close $time_fh;
push #toc_line, sysseek($bin, 0, 1);
copy( $data_fh, $bin, 8*1024 );
close $data_fh;
push #toc_line, sysseek($bin, 0, 1);
say {$toc} join ',', #toc_line, '';
}
close $bin;
close $toc;
I modified my code to build a Hash as suggested. I have not incorporate the output to binary yet due to time limitations. Plus I need to figure out how to reference the hash to get the data out and pack it into binary. I don't think that part should be to difficult ... hopefully
On an actual data file (~350MB & 2.0 Million lines) the following code takes approximately 3 minutes to build the hash. CPU usage was 100% on 1 of my cores (nill on the other 3) and Perl memory usage topped out at around 325MB ... until it dumped millions of lines to the prompt. However the print Dump will be replaced with a binary pack.
Please let me know if I am making any rookie mistakes.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $lineArg1 = $ARGV[0];
open(INFILE, $lineArg1);
my $line;
my #param_names;
my #template;
while ($line = <INFILE>) {
chomp $line; #Remove New Line
if ($line =~ s/\s+filter = ALL_VALUES//) { #Find parameters and build a list
push #param_names, trim($line);
}
elsif ($line =~ /^----/) {
#template = map {'A'.length} $line =~ /(\S+\s*)/g; #Make template for unpack
$template[-1] = 'A*';
my $data_start_pos = tell INFILE;
last; #Reached start of data exit loop
}
}
my $size = $#param_names+1;
my #getType = ((1) x $size);
my $template = "#template";
my #lineData;
my %dataHash;
my $lineCount = 0;
while ($line = <INFILE>) {
if ($lineCount % 100000 == 0){
print "On Line: ".$lineCount."\n";
}
if ($line =~ /^\d/) {
chomp($line);
#lineData = unpack $template, $line;
my ($inHeader, $headerIndex) = findStr($lineData[1], #param_names);
if ($inHeader) {
push #{$dataHash{$lineData[1]}{time} }, $lineData[0];
push #{$dataHash{$lineData[1]}{data} }, $lineData[3];
if ($getType[$headerIndex]){ # Things that only need written once
$dataHash{$lineData[1]}{type} = $lineData[2];
$getType[$headerIndex] = 0;
}
}
}
$lineCount ++;
} # END WHILE <INFILE>
close(INFILE);
print Dumper \%dataHash;
#WRITE BINARY FILE and TOC FILE
my %convert = (TXT=>sub{pack 'A*', join "\n", #_}, D=>sub{pack 'd*', #_}, UI=>sub{pack 'L*', #_});
open my $binfile, '>:raw', $lineArg1.'.bin';
open my $tocfile, '>', $lineArg1.'.toc';
for my $param (#param_names){
my $data = $dataHash{$param};
my #toc_line = ($param, $data->{type}, tell $binfile );
print {$binfile} $convert{D}->(#{$data->{time}});
push #toc_line, tell $binfile;
print {$binfile} $convert{$data->{type}}->(#{$data->{data}});
push #toc_line, tell $binfile;
print {$tocfile} join(',',#toc_line,''),"\n";
}
sub trim { #Trim leading and trailing white space
my (#strings) = #_;
foreach my $string (#strings) {
$string =~ s/^\s+//;
$string =~ s/\s+$//;
chomp ($string);
}
return wantarray ? #strings : $strings[0];
} # END SUB
sub findStr { #Return TRUE if string is contained in array.
my $searchStr = shift;
my $i = 0;
foreach ( #_ ) {
if ($_ eq $searchStr){
return (1,$i);
}
$i ++;
}
return (0,-1);
} # END SUB
The output is as follows:
$VAR1 = {
'Param 1' => {
'time' => [
'1.1',
'3.2',
'5.3'
],
'type' => 'UI',
'data' => [
'5',
'10',
'15'
]
},
'Param 2' => {
'time' => [
'4.5',
'6.121'
],
'type' => 'D',
'data' => [
'2.1234',
'3.1234'
]
},
'Param 3' => {
'time' => [
'2.23',
'7.56'
],
'type' => 'TXT',
'data' => [
'Some Text 1',
'Some Text 2'
]
}
};
Here is the output TOC File:
Param 1,UI,0,24,36,
Param 2,D,36,52,68,
Param 3,TXT,68,84,107,
Thanks everyone for their help so far! This is an excellent resource!
EDIT: Added Binary & TOC file writing code.