What's the best way to summarize data from a file that has around 2 million records in Perl?
For example, a file like this:
ABC|XYZ|DEF|EGH|100
ABC|XYZ|DEF|FGH|200
SDF|GHT|WWW|RTY|1000
SDF|GHT|WWW|TYU|2000
It needs to be summarized on the first three columns, like this:
ABC|XYZ|DEF|300
SDF|GHT|WWW|3000
Chris
Assuming there are always five columns, the fifth of which is numeric, and you always want the first three columns to be the key...
use warnings;
use strict;
my %totals_hash;
while (<>)
{
chomp;
my @cols = split /\|/;
my $key = join '|', @cols[0..2];
$totals_hash{$key} += $cols[4];
}
foreach (sort keys %totals_hash)
{
print $_, '|', $totals_hash{$_}, "\n";
}
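The answer above assumes every line has exactly five columns with a numeric fifth column. A minimal sketch of the same hash-based approach that skips lines breaking that assumption could look like this. The helper name `summarize_lines` and the sample data are hypothetical, and for a real 2-million-record file you would read from a filehandle line by line rather than pass a list.

```perl
use strict;
use warnings;

# Hypothetical helper: sums column 5 grouped by columns 1-3.
# Lines without exactly five fields, or with a non-integer fifth field, are skipped.
sub summarize_lines {
    my @lines = @_;
    my %totals;
    for my $line (@lines) {
        chomp $line;
        my @cols = split /\|/, $line, -1;    # -1 keeps trailing empty fields
        next unless @cols == 5 && $cols[4] =~ /^-?\d+$/;
        $totals{ join '|', @cols[0 .. 2] } += $cols[4];
    }
    return \%totals;
}

my $totals = summarize_lines(
    "ABC|XYZ|DEF|EGH|100\n",
    "ABC|XYZ|DEF|FGH|200\n",
    "SDF|GHT|WWW|RTY|1000\n",
    "SDF|GHT|WWW|TYU|2000\n",
    "bad line\n",
);
print "$_|$totals->{$_}\n" for sort keys %$totals;
```

Only one hash entry is kept per distinct key, so memory use grows with the number of groups, not with the number of records.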
You can use a hash as:
my %hash;
while (<DATA>) {
chomp;
my @tmp = split/\|/; # split each line on |
my $value = pop @tmp; # last ele is the value
pop @tmp; # pop unwanted entry
my $key = join '|',@tmp; # join the remaining ele to form key
$hash{$key} += $value; # add value for this key
}
# print hash key-values.
for(sort keys %hash) {
print $_ . '|'.$hash{$_}."\n";
}
Presuming your input file has its records in separate lines.
perl -n -e 'chomp;@a=split/\|/;$h{join"|",splice@a,0,3}+=pop@a;END{print map{"$_: $h{$_}\n"}keys%h}' < inputfile
1-2-3-4 I declare A CODE-GOLF WAR!!! (Okay, a reasonably readable code-golf dust-up.)
my %sums;
m/([^|]+\|[^|]+\|[^|]+).*?\|(\d+)/ and $sums{ $1 } += $2 while <>;
print join( "\n", ( map { "$_|$sums{$_}" } sort keys %sums ), '' );
Sort to put all records with the same first 3 triplets next to each other. Iterate through and kick out a subtotal when a different set of triplets appears.
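$prevKey="";
$subtotal=0;
open(INPUTFILE, "<", $inFile) or die "Cannot open $inFile: $!";
@lines=<INPUTFILE>;
close (INPUTFILE);
open(OUTFILE, ">", $outFile) or die "Cannot open $outFile: $!";
@sorted=sort(@lines);
foreach $line(@sorted){ # iterate over the sorted records, not the raw ones
chomp($line);
@parts=split(/\|/, $line);
$value=pop(@parts);
$value-=0; #coerce string to a number
$key=$parts[0]."|".$parts[1]."|".$parts[2];
if($key ne $prevKey){
print OUTFILE "$prevKey|$subtotal\n" if $prevKey ne ""; # nothing to report before the first group
$prevKey=$key;
$subtotal=0;
}
$subtotal+=$value;
}
print OUTFILE "$prevKey|$subtotal\n" if $prevKey ne ""; # flush the last group
close(OUTFILE);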
If sorting 2 million chokes your box then you may have to put each record into a file based on the group and then do the subtotal for each file.
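If the input is already sorted (for example by an external sort utility, which is an assumption here and not something the answer above does), the subtotal pass itself only needs to hold one group in memory. A minimal sketch, where `subtotal_sorted` is a hypothetical helper and in-memory filehandles stand in for real files:

```perl
use strict;
use warnings;

# Hypothetical helper: reads lines already sorted by key from $in and
# writes one "key|subtotal" line per group to $out.
sub subtotal_sorted {
    my ($in, $out) = @_;
    my ($prev_key, $subtotal);
    while (my $line = <$in>) {
        chomp $line;
        my @parts = split /\|/, $line;
        my $value = pop @parts;
        my $key   = join '|', @parts[0 .. 2];
        if (defined $prev_key && $key ne $prev_key) {
            print {$out} "$prev_key|$subtotal\n";    # key changed: emit finished group
            $subtotal = 0;
        }
        $prev_key = $key;
        $subtotal += $value;
    }
    print {$out} "$prev_key|$subtotal\n" if defined $prev_key;    # last group
    return;
}

# Demo data, sorted here only so the example is self-contained.
my $sorted = join '', sort(
    "SDF|GHT|WWW|RTY|1000\n", "ABC|XYZ|DEF|EGH|100\n",
    "SDF|GHT|WWW|TYU|2000\n", "ABC|XYZ|DEF|FGH|200\n",
);
open my $in,  '<', \$sorted    or die "Cannot open in-memory input: $!";
open my $out, '>', \my $report or die "Cannot open in-memory output: $!";
subtotal_sorted($in, $out);
close $out;
print $report;
```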
If I have a colon-delimited file named FILE and I do:
cat FILE|perl -F: -lane 'my %hash = (); $hash{@F[0]} = @F[2]'
to assign the first and 3rd tokens as the key => value pairs for the hash..
1) Is that a sane way to assign key value pairs to a hash?
2) What is the simplest way to now find all keys with shared values and list them?
Assume FILE looks like:
Mike:34:Apple:Male
Don:23:Corn:Male
Jared:12:Apple:Male
Beth:56:Maize:Female
Sam:34:Apple:Male
David:34:Apple:Male
Desired Output: Keys with value "Apple": Mike,Jared,David,Sam
Your example won't work as you want because the -n option puts a while loop around your one-line program, so the hash you declare is created and destroyed for every record in the file. You could get around that by not declaring the hash, and so making it a persistent package variable which will retain all values stored in it.
You can then write push @{ $hash{$F[2]} }, $F[0], but notice that it should be $F[0] etc. and not @F[0]. I have used push to create a list of column 1 values for each column 3 value, instead of a one-to-one mapping from each column 1 value to its column 3 value.
To clarify, your method produces a hash looking like this, which has to be searched to produce the display that you want.
(
Beth => "Maize",
David => "Apple",
Don => "Corn",
Jared => "Apple",
Mike => "Apple",
Sam => "Apple",
)
while mine creates this, which as you can see is pretty much already in the form you want.
(
Apple => ["Mike", "Jared", "Sam", "David"],
Corn => ["Don"],
Maize => ["Beth"],
)
But I think this problem is a bit too big to be solved with a one-line Perl program. The solution below expects the path to the input file as a command-line parameter, like this
> perl prog.pl colons.csv
but it will default to myfile.csv if no file is specified.
use strict;
use warnings;
our @ARGV = 'myfile.csv' unless @ARGV;
my %data;
while (<>) {
my @fields = split /:/;
push @{ $data{$fields[2]} }, $fields[0];
}
while (my ($k, $v) = each %data) {
next unless @$v > 1;
printf qq{Keys with value "%s": %s\n}, $k, join ', ', @$v;
}
output
Keys with value "Apple": Mike, Jared, Sam, David
use strict;
use warnings;
open my $in, '<', 'in.txt' or die "Cannot open in.txt: $!";
my %data;
while(<$in>){
chomp;
my @split = split/:/;
$data{$split[0]} = $split[2];
}
my $query = 'Apple';
print "Keys with value $query = ";
foreach my $name (keys %data){
print "$name " if $data{$name} eq $query;
}
print "\n";
Arrays are used to hold lists of values, so use an array.
perl -F: -lane'
push @{ $h{$F[2]} }, $F[0];
END {
for my $fruit (keys %h) {
next if @{ $h{$fruit} } < 2;
print "$fruit: ", join(",", @{ $h{$fruit} });
}
}
' FILE
The END block is executed on exit. In it, we iterate over the keys of the hash. If the value of the current hash element is an array with only one element, it's skipped. Otherwise, we print the key followed by the contents of the array referenced by the hash element.
Here is another way:
perl -F: -lane'
push @{ $h{$F[2]} }, $F[0];
}{
print "$_: ", join(",", @{ $h{$_} }) for grep { @{$h{$_}} > 1 } keys %h;
' file
We read each line and build a hash of arrays, using the third column as the key and collecting the first-column values for each key. The }{ closes the implicit while loop added by -n, so the code after it runs once at the end, like an END block. There we use grep to keep only the keys whose arrays hold more than one element, and print each key followed by its array elements.
It doesn't have to be a one liner,
Good. It's not going to be...
Is that a sane way to assign key value pairs to a hash?
You're simply assigning the key value pairs as:
$hash{"key"} = "value";
Which is about as simple as it gets. There might be a way of doing it via map. However, the main issue I see is what should happen if you have duplicate keys.
Let's say your file looks like this:
Mike:34:Apple:Male
Don:23:Corn:Male
Jared:12:Apple:Male
Beth:56:Maize:Female
Sam:34:Apple:Male
David:34:Apple:Male # Note this entry is here twice!
David:35:Wheat:Male # Note this entry is here twice!
Let's do a simple assignment loop:
my %hash;
while ( my $line = <$fh> ) {
chomp $line;
my ($name, $age, $category, $sex) = split /:/, $line;
$hash{$name} = $category;
}
When you get to $hash{David}, it will first be set to Apple, but then you change the value to Wheat. There are four ways you can handle this:
Use whatever the last value is. No change in the loop.
Use the first value and ignore subsequent values. Simple enough to do.
If that happens, it's an error. Abort the program and report the error.
Keep all values.
This last one is the most interesting because it involves a reference to an array as the values for your hash:
my %hash;
while ( my $line = <$fh> ) {
chomp $line;
my ($name, $age, $category, $sex) = split /:/, $line;
$hash{$name} = [] if not exists $hash{$name}; # I'm making this an array reference
push @{ $hash{$name} }, $category;
}
Now, each value in my hash is a reference to an array:
my @values = @{ $hash{David} }; # The values of David...
print "David is in categories " . join ( ", ", @values ) . "\n";
This will print out David is in categories Apple, Wheat
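The second and third options from the list above (keep the first value, or treat a duplicate as an error) can be sketched as follows. The helper `load_categories` and its `$on_duplicate` parameter are hypothetical names used only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: builds name => category from "name:age:category:sex" lines.
# $on_duplicate is 'first' (keep the first value) or 'die' (treat it as an error).
sub load_categories {
    my ($on_duplicate, @lines) = @_;
    my %hash;
    for my $line (@lines) {
        chomp $line;
        my ($name, $age, $category, $sex) = split /:/, $line;
        if (exists $hash{$name}) {
            die "Duplicate entry for $name\n" if $on_duplicate eq 'die';
            next;    # 'first': ignore later values
        }
        $hash{$name} = $category;
    }
    return \%hash;
}

my @lines = ("David:34:Apple:Male\n", "David:35:Wheat:Male\n");

my $first = load_categories('first', @lines);
print "David => $first->{David}\n";    # the first value, Apple, is kept

my $ok = eval { load_categories('die', @lines); 1 };
print "Aborted: $@" unless $ok;
```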
What is the simplest way to now find all keys with shared values and list them?
The easiest way is to create a second hash that's keyed by your value. In this hash, you will need to use an array reference. Let's assume no duplicate names for now:
my %hash;
my %indexed_hash;
while ( my $line = <$fh> ) {
chomp $line;
my ($name, $age, $category, $sex) = split /:/, $line;
$hash{$name} = $category;
$indexed_hash{$category} = [] if not exists $indexed_hash{$category};
push @{ $indexed_hash{$category} }, $name;
}
Now, if I want to find all the duplicates of Apple:
my @names = @{ $indexed_hash{Apple} };
print "The following are in 'Apple': " . join ( ", ", @names ) . "\n";
Since we're getting into references, we could take things a step further and store all of your values of your file in your hash. Again, for simplicity, I am assuming that you will have one and only one entry per name:
my %hash;
while ( my $line = <$fh> ) {
chomp $line;
my ($name, $age, $category, $sex) = split /:/, $line;
$hash{$name}->{AGE} = $age;
$hash{$name}->{CATEGORY} = $category;
$hash{$name}->{SEX} = $sex;
}
for my $name ( sort keys %hash ) {
print "$name Information:\n";
print " Age: " . $hash{$name}->{AGE} . "\n";
printf "Category: %s\n", $hash{$name}->{CATEGORY};
print " Sex: @{[$hash{$name}->{SEX}]}\n\n";
}
The last two statements show easier ways of interpolating complex data structures into a string. The printf is fairly clear. The second, @{[...]}, is a neat little trick.
What have you tried?
If you reverse the hash into a list of value => key pairs and then use List::Util's pairs() against the list, you can transform the hash into a hash of value => key arrayrefs, i.e. ( foo => [ 'bar', 'baz' ] ). Then grep {@{$hash{$_}} > 1} keys %hash and print the results.
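A minimal sketch of that idea, assuming the one-to-one name => value hash from the question has already been built (the literal hash below stands in for it) and that the installed List::Util is 1.29 or newer, which is where pairs() appears:

```perl
use strict;
use warnings;
use List::Util qw(pairs);    # pairs() needs List::Util 1.29 or newer

# Stand-in for the hash built from FILE in the question.
my %hash = (
    Mike => 'Apple', Don => 'Corn',  Jared => 'Apple',
    Beth => 'Maize', Sam => 'Apple', David => 'Apple',
);

# reverse %hash flattens to (value, key, value, key, ...);
# pairs() then yields one [value, key] array reference per entry.
my %by_value;
for my $pair (pairs reverse %hash) {
    my ($value, $key) = @$pair;
    push @{ $by_value{$value} }, $key;
}

# Report only the values shared by more than one key.
for my $value (grep { @{ $by_value{$_} } > 1 } sort keys %by_value) {
    printf qq{Keys with value "%s": %s\n}, $value,
        join ',', sort @{ $by_value{$value} };
}
```

The names are sorted before printing because hash order is not stable between runs.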
I have a .txt file with 3 columns. I want to compare the first column with the second, and if a value from the first column appears in the second column, I want to delete that entry in the second and third columns (the first column should not be modified). The result should be stored in a new file.
Example input:
Col 1 Col 2 Col 3
VIBHAR_02293_1 VIBHAR_00819_2 tatatattattata
VIBHAR_00819_2 VIBHAR_00819_4 tattavgaggagag
VIBHAR_00705_3 VIBHAR_00705_7 attaggaccaggat
VIBHAR_00819_4 VIBHAR_02153_9 ccagggattattat
Example output:
VIBHAR_02293_1 VIBHAR_00705_7 attaggaccaggat
VIBHAR_00819_2 VIBHAR_02153_9 ccagggattattat
VIBHAR_00705_3
VIBHAR_00819_4
I tried using following code but it did not work:
while($line=(<File>))
{
chomp($line);
@F=split('\t',$line);
$hash{$F[1]}=$F[2];
if ($F[0] eq $F[1])
{
# print "$line\n";
delete($hash{keys});
}
}
If the format of the columns which I posted above is not good then, only my question is enough I guess.
#!/usr/bin/perl
use warnings;
use strict;
my %H;
while (<>) {
chomp;
my @F = split /\t/;
$H{$F[0]} = [$., $F[1], $F[2]];
}
my @col1;
my @col23;
for my $col1 (sort { $H{$a}[0] <=> $H{$b}[0] } keys %H) {
push @col1, $col1;
next if exists $H{ $H{$col1}[1] };
push @col23, [@{ $H{$col1} }[1,2]];
}
for my $i (0 .. $#col1) {
print $col1[$i];
print "\t", join "\t", @{ $col23[$i] } if $i < @col23;
print "\n";
}
Do you really want to "move up" the values in columns 2 and 3?
# my code as follows
use strict;
use FileHandle;
my @LISTS = ('incoming');
my $WORK ="c:\";
my $OUT ="c:\";
foreach my $list (@LISTS) {
my $INFILE = $WORK."test.dat";
my $OUTFILE = $OUT."TEST.dat";
while (<$input>) {
chomp;
my($f1,$f2,$f3,$f4,$f5,$f6,$f7) = split(/\|/);
push @sum, $f4,$f7;
}
}
while (@sum) {
my ($key,$value)= {shift@sum, shift@sum};
$hash{$key}=0;
$hash{$key} += $value;
}
while my $key (@sum) {
print $output2 sprintf("$key1\n");
# print $output2 sprintf("$key ===> $hash{$key}\n");
}
close($input);
close($output);
I get an "Uninitialized" error at addition (+) if I use the 2nd print.
I get HASH(0x19a69451) values if I use the 1st print.
Please correct me.
My output should be
unique Id ===> Total Revenue ($f4==>$f7)
This is wrong:
"c:\";
Perl reads that as a string starting with c:";\n.... In other words, it is a runaway string. You need to write the last character as \\ to escape the \ and prevent it from escaping the subsequent " character.
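A small sketch of the escaped form. The variable names here are illustrative and are not the ones from the question:

```perl
use strict;
use warnings;

my $work_double = "c:\\";    # inside double quotes, \\ yields one backslash
my $work_single = 'c:\\';    # single quotes also need \\ before the closing quote
my $infile      = $work_double . "test.dat";

print "$infile\n";           # prints c:\test.dat
```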
You probably want to use parens instead of braces:
my ($key, $value) = (shift @sum, shift @sum);
You would get that "Uninitialized" warning at addition (+) if the @sum array has an odd number of elements.
See also perltidy.
Your program will never enter the second while loop:
while my $key (@sum) {
because the previous one left the array @sum empty.
You could change to:
while (<$input>) {
chomp;
my @tmp = split(/\|/);
$hash{$tmp[3]} += $tmp[6];
}
print Dumper \%hash; # needs "use Data::Dumper;" at the top of the script
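Putting the pieces of this answer together, a self-contained sketch might look like this. The helper name `revenue_by_id` and the sample records are made up for illustration; only the field positions (4 for the id, 7 for the revenue) come from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: totals field 7 (revenue) per field 4 (unique id)
# for pipe-delimited lines read from the filehandle $in.
sub revenue_by_id {
    my ($in) = @_;
    my %hash;
    while (my $line = <$in>) {
        chomp $line;
        my @tmp = split /\|/, $line;
        next unless @tmp >= 7;          # skip short records
        $hash{ $tmp[3] } += $tmp[6];    # += accumulates; no reset to 0 per record
    }
    return \%hash;
}

# Made-up sample data; an in-memory filehandle stands in for test.dat.
my $sample = "a|b|c|ID1|e|f|10\n" . "a|b|c|ID2|e|f|5\n" . "a|b|c|ID1|e|f|2.5\n";
open my $in, '<', \$sample or die "Cannot open in-memory input: $!";
my $totals = revenue_by_id($in);
close $in;

print "$_ ===> $totals->{$_}\n" for sort keys %$totals;
```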
I have a csv file with following sample data.
o-option(alphabetical)
v-value(numerical)
number1,o1,v1,o2,v2,o3,v3,o4,v4,o5,v5,o6,v6
number2,o1,v11,o2,v22,o3,v33,o44,v44,o5,v55,o6,v66
and so on....
Required output.
NUM,o1,o2,o3,o4,o44,o5,o6
number1,v1,v2,v3,v4,,v5,v6
number2,v11,v22,v33,,v44,v55,v66
and so on...
In this data, all the options are same i.e. o1,o2,etc through out the file but option 4 value is changing, i.e. o4,o44, etc. In total there are about 9 different option values at o4 field. Can anyone please help me with the perl code to get the required output.
I have written the below code but still not getting the required output.
my @values;
my @options;
my %hash;
while (<STDIN>) {
chomp;
my ($srn,$o1,$v1,$o2,$v2,$o3,$v3,$o4,$v4,$o5,$v5,$o6,$v6) = split /[,\n]/, $_;
push @values, [$srn,$v1,$v2,$v3,$v4,$v5,$v6];
push @options, $o1,$o2,$o3,$o4,$o5,$o6;
}
#printing the header values
my @out = grep(!$hash{$_}++,@options);
print 'ID,', join(',', sort @out), "\n";
#printing the values.
for my $i ( 0 .. $#values) {
print @{$values[$i]}, "\n";
}
Output:
ID,o1,o2,o3,o4,o44,o5,o6
number1,v1,v2,v3,v4,v5,v6
number2,v1,v2,v3,v44,v5,v6
As from the above output, when the value 44 comes, it comes under option4 and hence the other values are shifting to left. The values are not mapping with the options. Please suggest.
If you want to line the numeric values up in columns based on the value of the preceding options values, store your data rows as hashes, using the options as the keys to the hash.
use strict;
use warnings;
my (@data, %all_opts);
while (<DATA>) {
chomp;
my %h = ('NUM', split /,/, $_);
push @data, \%h;
@all_opts{keys %h} = 1;
}
my @header = sort keys %all_opts;
print join(",", @header), "\n";
for my $d (@data){
my @vals = map { defined $d->{$_} ? $d->{$_} : '' } @header;
print join(",", @vals), "\n";
}
__DATA__
number1,o1,v1,o2,v2,o3,v3,o4,v4,o5,v5,o6,v6
number2,o1,v11,o2,v22,o3,v33,o44,v44,o5,v55,o6,v66
Is this what you're after?
use strict;
use warnings;
use 5.010;
my %header;
my @store;
while (<DATA>) {
chomp;
my ($srn, %f) = split /,/;
@header{ keys %f } = 1;
push @store, [ $srn, { %f } ];
}
# header
my @cols = sort keys %header;
say join q{,} => 'NUM', @cols;
# rows
for my $row (@store) {
say join q{,} => $row->[0],
map { $row->[1]->{ $_ } || q{} } @cols;
}
__DATA__
number1,o1,v1,o2,v2,o3,v3,o4,v4,o5,v5,o6,v6
number2,o1,v11,o2,v22,o3,v33,o44,v44,o5,v55,o6,v66
Which outputs:
NUM,o1,o2,o3,o4,o44,o5,o6
number1,v1,v2,v3,v4,,v5,v6
number2,v11,v22,v33,,v44,v55,v66
Make one pass through the file identifying all the different option values, build an array of those values.
Make second pass through the file:
for each record
initialise an associative array from your list of option value
parse the assigning values for the options you have
use your list of option values to iterate the associative array printing the values
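The two passes described above can be sketched as follows. An in-memory array stands in for reading the file twice (with a real file you would reopen it or seek back to the start), and the variable names are illustrative.

```perl
use strict;
use warnings;

my @records = (
    "number1,o1,v1,o2,v2,o3,v3,o4,v4,o5,v5,o6,v6",
    "number2,o1,v11,o2,v22,o3,v33,o44,v44,o5,v55,o6,v66",
);

# Pass 1: collect every option name that appears anywhere.
my %seen;
for my $record (@records) {
    my ($num, %opts) = split /,/, $record;
    $seen{$_} = 1 for keys %opts;
}
my @options = sort keys %seen;

# Pass 2: start each row with every option empty, then fill in
# the values this record actually has.
my @output = (join ',', 'NUM', @options);
for my $record (@records) {
    my ($num, %opts) = split /,/, $record;
    my %row = map { $_ => '' } @options;
    @row{ keys %opts } = values %opts;
    push @output, join ',', $num, @row{@options};
}
print "$_\n" for @output;
```

This assumes every record has an id followed by complete option/value pairs, as in the sample data.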
You might look at the CPAN module DBD::AnyData. One of the neater modules out there. It allows you to manipulate a CSV file like it was a database. And much more.
I have a pipe-delimited text file containing, among other things, a date and a number indicating the line's sequence elsewhere in the program. What I'm hoping to do is create a hash from that file, using the year as the key and the maximum sequence for that year as the value (I essentially need to implement an auto-incremented key per year), e.g. from
2000|1
2003|9
2000|5
2000|21
2003|4
I would end with a hash like:
%hash = {
2000 => 21,
2003 => 9
}
I've managed to split the file into the year and sequence parts (not very well I think) like so:
my @dates = map {
my @temp = split /\|/;
join "|", (split /\//, $temp[1])[-1], $temp[4] || 0; #0 because some records
#mightn't have a sequence
} @info
Is there something succinct I could do to create a hash using that data?
Thanks
If I understand you, you were almost there. All you needed to do was return the key and value from map and sort by sequence instead of joining them.
my %hash =
map @$_,
sort { $a->[1] <=> $b->[1] }
map {
my @temp = split /\|/;
my $date = (split /\//, $temp[1])[-1];
my $seq = $temp[4] || 0; #0 because some records mightn't have a sequence
[ $date, $seq ]
} @info;
But just iterating through with for and setting hash only if the current sequence
is higher than the previous maximum for that date is probably a better idea.
Be careful with those {}; where you said
%hash = {
2000 => 21,
2003 => 9
}
you meant () instead (or to be assigning to a reference $hash), since the {} there create an anonymous hash and return a reference to it.
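The difference between the two forms, as a short sketch using the values from the question:

```perl
use strict;
use warnings;

# Parentheses build a list, which initialises a real hash.
my %hash = (
    2000 => 21,
    2003 => 9,
);

# Braces build an anonymous hash and return a reference to it,
# so the result belongs in a scalar.
my $hash_ref = {
    2000 => 21,
    2003 => 9,
};

print "$hash{2000}\n";          # element of %hash
print "$hash_ref->{2003}\n";    # element reached through the reference
```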
Here's how you could write that .. not too sure why you want/need to use map (please explain)
#!/usr/bin/perl -w
use strict;
use warnings;
my %hash;
while(<DATA>) {
chomp();
my ($year,$sequence)=split('\|');
$sequence = 0 unless (defined ($sequence));
next if (exists $hash{$year} and $sequence < $hash{$year});
$hash{$year}=$sequence;
}
__DATA__
2000|1
2003|9
2000|5
2000|21
2003|4
I added the $sequence = 0 unless defined ($sequence); because of that comment in your snippet. I believe I might understand your intent there.. (either the input format is valid/consistent, or it is not ..)
map operates on each item in a list and builds a list of results to pass on. So, you can't really do the sort of checks you want (keep the maximum sequence value) as you go unless you build a scratch hash that winds up containing exactly the data you are trying to build as the return value of the map.
my %results = map {
my( $y, $s ) = split '[|]', $_;
seq_is_gt_year_seq( $y, $s )
? ( $y, $s )
: ();
} @year_pipe_seq;
To implement seq_is_gt_year_seq() we wind up having to build a temporary hash that stores each year and its max sequence value for lookup.
You should use an approach that builds the lookup incrementally, like a for or while loop.
map { BLOCK } LIST usually (unless BLOCK sometimes evaluates to an empty list) returns a list that is at least as large as LIST, and may not be the way to go if you do not want to simply overwrite duplicate keys with the latest data. Something like:
my %hash;
for (@info) {
my @temp = split /\|/;
my $key = (split /\//, $temp[1])[-1]; # last piece of the date, i.e. the year
my $value = $temp[4] || 0;
$hash{$key} = $value unless defined $hash{$key} && $hash{$key}>=$value;
}
will work. The last line conditionally updates the hash table, which is something you can't do (or at least can't do very conveniently) inside a map statement.
If there's any chance you can perform this processing as the file is read, then I'd do it. Something like this:
my %year_count;
while (my $line = <$fh>){
chomp $line;
my ($year, $num) = split /\|/, $line;
if (!defined $year_count{$year} || $num > $year_count{$year}) {
$year_count{$year} = $num;
}
}
If you want to use an array, map isn't really the best choice (since you're not transforming the list, you're processing it down to something different). To be honest the most sensible array-processing would probably be the same as the above, but in a foreach instead:
my %year_count;
foreach my $line (@info){
my ($year, $num) = split /\|/, $line;
if (!defined $year_count{$year} || $num > $year_count{$year}) {
$year_count{$year} = $num;
}
}