Selectively counting delimited field values and creating a hash using map - perl

I have a pipe delimited text file containing, among other things, a date and a number indicating the lines sequence elsewhere in the program. What I'm hoping to do is from that file create a hash using the year as the key and the value being the maximum sequence for that year (I essentially need to implement an auto-incremented key per year) e.g from
2000|1
2003|9
2000|5
2000|21
2003|4
I would end with a hash like:
%hash = {
2000 => 21,
2003 => 9
}
I've managed to split the file into the year and sequence parts (not very well I think) like so:
my #dates = map {
my #temp = split /\|/;
join "|", (split /\//, $temp[1])[-1], $temp[4] || 0; #0 because some records
#mightn't have a sequence
} #info
Is there something succint I could do to create a hash using that data?
Thanks

If I understand you, you were almost there. All you needed to do was return the key and value from map and sort by sequence instead of joining them.
my %hash =
map #$_,
sort { $a->[1] <=> $b->[1] }
map {
my #temp = split /\|/;
my $date = (split /\//, $temp[1])[-1];
my $seq = $temp[4] || 0; #0 because some records mightn't have a sequence
[ $date, $seq ]
} #info;
But just iterating through with for and setting hash only if the current sequence
is higher than the previous maximum for that date is probably a better idea.
Be careful with those {}; where you said
%hash = {
2000 => 21,
2003 => 9
}
you meant () instead (or to be assigning to a reference $hash), since the {} there create an anonymous hash and return a reference to it.

Here's how you could write that .. not too sure why you want/need to use map (please explain)
#!/usr/bin/perl -w
use strict;
use warnings;
my %hash;
while(<DATA>) {
chomp();
my ($year,$sequence)=split('\|');
$sequence = 0 unless (defined ($sequence));
next if (exists $hash{$year} and $sequence < $hash{$year});
$hash{$year}=$sequence;
}
__DATA__
2000|1
2003|9
2000|5
2000|21
2003|4
I added the $sequence = 0 unless defined ($sequence); because of that comment in your snippet. I believe I might understand your intent there.. (either the input format is valid/consistent, or it is not ..)

map operates on each item in a list and builds a list of results to pass on. So, you can't really do the sort of checks you want (keep the maximum sequence value) as you go unless you build a scratch hash that winds up containing exactly the data you are trying to build as the return value of the `map.
my %results = map {
my( $y, $s ) = split '[|]', $_;
seq_is_gt_year_seq( $y, $s )
? ( $y, $s )
: ();
} #year_pipe_seq;
To implement seq_is_gt_year_seq() we wind up having to build a temporary hash that stores each year and its max sequence value for lookup.
You should use an approach that builds the lookup incrementally, like a for or while loop.

map { BLOCK } LIST always usually (unless BLOCK sometimes evaluates to an empty list) returns a list that is least as large as LIST, and may not be the way to go if you do want to simply overwrite duplicate keys with the latest data. Something like:
my %hash;
for (#info) {
my #temp = split /\|/;
my $key = (split /\//, $temp[1]);
my $value = $temp[4] || 0;
$hash{$key} = $value unless defined $hash{$key} && $hash{$key}>=$value;
}
will work. The last line conditionally updates the hash table, which is something you can't do (or at least can't do very conveniently) inside a map statement.

If there's any chance you can perform this processing as the file is read, then I'd do it. Something like this:
my %year_count;
while (my $line = <$fh>){
chomp $line;
my ($year, $num) = split /\|/, $line;
if ($num > $year_count{$year} || !defined $year_count{$year})
$year_count{$year} = $num;
}
}
if you want to use an array, map isn't really the best choice (since you're not transforming the list, you're processing it down to something different). To be honest the most sensible array-processing would probably be the same as the above, but in a foreach instead:
my %year_count;
foreach my $line (#info){
my ($year, $num) = split /\|/, $line;
if ($num > $year_count{$year} || !defined $year_count{$year})
$year_count{$year} = $num;
}
}

Related

Parse report in blocks to CSV

I have lots of data dumps in a pretty huge amount of data structured as follow
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Which I would like to transform to something like:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
I mean:
Generate a collection of all the keys
Generate a header line with all the Keys
Map all the values to their correct "columns" (notice that in this example I have no "Key4", and Key3/Key5 interchanged)
Possibly in Perl, since it would be easier to use in various environments.
But I am not sure if this format is unusual, or if there is a tool that already does this.
This is fairly easy using hashes and the Text::CSV_XS module:
use strict;
use warnings;
use Text::CSV_XS;
my #rows;
my %headers;
{
local $/ = "";
while (<DATA>) {
chomp;
my %record;
for my $line (split(/\n/)) {
next unless $line =~ /^([^:]+):\.+\s(.+)/;
$record{$1} = $2;
$headers{$1} = $1;
}
push(#rows, \%record);
}
}
unshift(#rows, \%headers);
my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 1, eol => $/});
$csv->column_names(sort(keys(%headers)));
for my $row_ref (#rows) {
$csv->print_hr(*STDOUT, $row_ref);
}
__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Output:
Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
If your CSV format is 'complicated' - e.g. it contains commas, etc. - then use one of the Text::CSV modules. But if it isn't - and this is often the case - I tend to just work with split and join.
What's useful in your scenario, is that you can map key-values within a record quite easily using a regex. Then use a hash slice to output:
#!/usr/bin/env perl
use strict;
use warnings;
#set paragraph mode - records are blank line separated.
local $/ = "";
my #rows;
my %seen_header;
#read STDIN or files on command line, just like sed/grep
while ( <> ) {
#multi - line pattern, that matches all the key-value pairs,
#and then inserts them into a hash.
my %this_row = m/^(\w+):\.+ (.*)$/gm;
push ( #rows, \%this_row );
#add the keys we've seen to a hash, so we 'know' what we've seen.
$seen_header{$_}++ for keys %this_row;
}
#extract the keys, make them unique and ordered.
#could set this by hand if you prefer.
my #header = sort keys %seen_header;
#print the header row
print join ",", #header, "\n";
#iterate the rows
foreach my $row ( #rows ) {
#use a hash slice to select the values matching #header.
#the map is so any undefined values (missing keys) don't report errors, they
#just return blank fields.
print join ",", map { $_ // '' } #{$row}{#header},"\n";
}
This for you sample input, produces:
Key1,Key2,Key3,Key5,
Value,Other value,Maybe another value yet,,
Different value,,Invaluable,Has no value at all,
If you want to be really clever, then most of that initial building of the loop can be done with:
my #rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;
The problem then is - you would need to build up the 'headers' array still, and that means a bit more complicated:
$seen_header{$_}++ for map { keys %$_ } #rows;
It works, but I don't think it's as clear about what's happening.
However the core of your problem may be the file size - that's where you have a bit of a problem, because you need to read the file twice - first time to figure out which headings exist throughout the file, and then second time to iterate and print:
#!/usr/bin/env perl
use strict;
use warnings;
open ( my $input, '<', 'your_file.txt') or die $!;
local $/ = "";
my %seen_header;
while ( <$input> ) {
$seen_header{$_}++ for m/^(\w+):/gm;
}
my #header = sort keys %seen_header;
#return to the start of file:
seek ( $input, 0, 0 );
while ( <$input> ) {
my %this_row = m/^(\w+):\.+ (.*)$/gm;
print join ",", map { $_ // '' } #{$this_row}{#header},"\n";
}
This will be slightly slower, as it'll have to read the file twice. But it won't use nearly as much memory footprint, because it isn't holding the whole file in memory.
Unless you know all your keys in advance, and you can just define them, you'll have to read the file twice.
This seems to work with the data you've given
use strict;
use warnings 'all';
my %data;
while ( <> ) {
next unless /^(\w+):\W*(.*\S)/;
push #{ $data{$1} }, $2;
}
use Data::Dump;
dd \%data;
output
{
Key1 => ["Value", "Different value"],
Key2 => ["Other value"],
Key3 => ["Maybe another value yet", "Invaluable"],
Key5 => ["Has no value at all"],
}

Perl list all keys in hash with identical values

If I have a colon-delimited file name FILE and I do:
cat FILE|perl -F: -lane 'my %hash = (); $hash{#F[0]} = #F[2]'
to assign the first and 3rd tokens as the key => value pairs for the hash..
1) Is that a sane way to assign key value pairs to a hash?
2) What is the simplest way to now find all keys with shared values and list them?
Assume FILE looks like:
Mike:34:Apple:Male
Don:23:Corn:Male
Jared:12:Apple:Male
Beth:56:Maize:Female
Sam:34:Apple:Male
David:34:Apple:Male
Desired Output: Keys with value "Apple": Mike,Jared,David,Sam
Your example won't work as you want because the -n option puts a while loop around your one-line program, so the hash you declare is created and destoyed for every record in the file. You could get around that by not declaring the hash, and so making it a persistent package variable which will retain all values stored in it.
You can then write push #{ $hash{$F[2]} }, $F[0] but notice that it should be $F[0] etc. and not #F[0], and I have used push to create a list of column 1 values for each column 3 value instead of just a list of one-to-one values relating each column 1 value with its column 3 value.
To clarify, your method produces a hash looking like this, which has to be searched to produce the display that you want.
(
Beth => "Maize",
David => "Apple",
Don => "Corn",
Jared => "Apple",
Mike => "Apple",
Sam => "Apple",
)
while mine creates this, which as you can see is pretty much already in the form you want.
(
Apple => ["Mike", "Jared", "Sam", "David"],
Corn => ["Don"],
Maize => ["Beth"],
)
But I think this problem is a bit too big to be solved with a one-line Perl program. The solution below expects the path to the input file as a command-line parameter, like this
> perl prog.pl colons.csv
but it will default to myfile.csv if no file is specified.
use strict;
use warnings;
our #ARGV = 'myfile.csv' unless #ARGV;
my %data;
while (<>) {
my #fields = split /:/;
push #{ $data{$fields[2]} }, $fields[0];
}
while (my ($k, $v) = each %data) {
next unless #$v > 1;
printf qq{Keys with value "%s": %s\n}, $k, join ', ', #$v;
}
output
Keys with value "Apple": Mike, Jared, Sam, David
use strict;
use warnings;
open my $in, '<', 'in.txt';
my %data;
while(<$in>){
chomp;
my #split = split/:/;
$data{$split[0]} = $split[2];
}
my $query = 'Apple';
print "Keys with value $query = ";
foreach my $name (keys %data){
print "$name " if $data{$name} eq $query;
}
print "\n";
Arrays are used to hold list of values, so use an array.
perl -F: -lane'
push #{ $h{$F[2]} }, $F[0];
END {
for my $fruit (keys %h) {
next if #{ $h{$fruit} } < 2;
print "$fruit: ", join(",", #{ $h{$fruit} });
}
}
' FILE
The END block is executed on exit. In it, we iterate over the keys of the hash. If the value of the current hash element is an array with only one element, it's skipped. Otherwise, we prints the key followed by contents of the array referenced by the hash element.
Here is another way:
perl -F: -lane'
push #{ $h{$F[2]} }, $F[0];
}{
print "$_: ", join(",", #{ $h{$_} }) for grep { #{$h{$_}} > 1 } keys %h;
' file
We read each line and create hash of arrays using third column as key and first column as list of values for matching key. In the END block we iterate over our hash using grep and filter keys whose array count greater than 1 and print the key followed by array elements.
It doesn't have to be a one liner,
Good. It's not going to be...
Is that a sane way to assign key value pairs to a hash?
You're simply assigning the key value pairs as:
$hash{"key"} = "value";
Which is about as simple as it gets. There might be a way of doing it via map. However, the main issue I see is what should happen if you have duplicate keys.
Let's say your file looks like this:
Mike:34:Apple:Male
Don:23:Corn:Male
Jared:12:Apple:Male
Beth:56:Maize:Female
Sam:34:Apple:Male
David:34:Apple:Male # Note this entry is here twice!
David:35:Wheat:Male # Note this entry is here twice!
Let's do a simple assignment loop:
my %hash;
while my $line ( <$fh> ) {
chomp $line;
my ($name, $age, $category, $sex) = split /:/, $line;
$hash{$name} = $category;
}
When you get to $hash{David}, it will first be set to Apple, but then you change the value to Wheat. There are four ways you can handle this:
Use whatever the last value is. No change in the loop.
Use the first value and ignore subsequent values. Simple enough to do.
If that happens, it's an error. Abort the program and report the error.
Keep all values.
This last one is the most interesting because it involves a reference to an array as the values for your hash:
my %hash;
while my $line ( <$fh> ) {
chomp $line;
my ($name, $age, $category, $sex) = split /:/, $line;
$hash{$name} = [] if not exists $hash{$name}; # I'm making this an array reference
push #{ $hash{$name} }, $category;
}
Now, each value in my hash is a reference to an array:
my #values = #{ $hash{David} ); # The values of David...
print "David is in categories " . join ( ", ", #values ) . "\n";
This will print out David is in categories Wheat, Apple
What is the simplest way to now find all keys with shared values and list them?
The easiest way is to create a second hash that's keyed by your value. In this hash, you will need to use an array reference. Let's assume no duplicate names for now:
my %hash;
my %indexed_hash;
while my $line ( <$fh> ) {
chomp $line;
my ($name, $age, $category, $sex) = split /:/, $line;
$hash{$name} = $category;
my $indexed_hash{$category} = [] if not exist $indexed_hash{$category};
push #{ $indexed_hash{$category} }, $name;
}
Now, if I want to find all the duplicates of Apple:
my #names = #{ $indexed_hash{Apple} };
print "The following are in 'Apple': " . join ( ", " #names ) . "\n";
Since we're getting into references, we could take things a step further and store all of your values of your file in your hash. Again, for simplicity, I am assuming that you will have one and only one entry per name:
my %hash;
while my $line ( <$fh> ) {
chomp $line;
my ($name, $age, $category, $sex) = split /:/, $line;
$hash{$name}->{AGE} = $age;
$hash{$name}->{CATEGORY} = $category;
$hash{$name}->{SEX} = $sex;
}
for my $name ( sort keys %hash ) {
print "$name Information:\n";
print " Age: " . $hash{$name}->{AGE} . "\n";
printf "Category: %s\n", $hash{$name}->{CATEGORY};
print " Sex: #{[$hash{$name}->{SEX}]}\n\n";
}
That last two statements are easier ways of interpolating complex data structures into a string. The printf is fairly clear. The second #{[...]} is a neat little trick.
What have you tried?
If you reverse the hash into a list of value => key pairs then use List::Util's pairs() against the list, you can transform the hash into a hash of values => key arrayrefs. i.e. ( foo => [ 'bar', 'baz' ] ), grep {#{$hash{$_}} > 1} keys %hash, and print the results.

Selecting highest count of element except when...

So i have been working on this perl script that will analyze and count the same letters in different line spaces. I have implemented the count to a hash but am having trouble excluding a " - " character from the output results of this hash. I tried using delete command or next if, but am not getting rid of the - count in the output.
So with this input:
#extract = ------------------------------------------------------------------MGG-------------------------------------------------------------------------------------
And following code:
#Count selected amino acids.
my %counter = ();
foreach my $extract(#extract) {
#next if $_ =~ /\-/; #This line code does not function correctly.
$counter{$_}++;
}
sub largest_value_mem (\%) {
my $counter = shift;
my ($key, #keys) = keys %$counter;
my ($big, #vals) = values %$counter;
for (0 .. $#keys) {
if ($vals[$_] > $big) {
$big = $vals[$_];
$key = $keys[$_];
}
}
$key
}
I expect the most common element to be G, same as the output. If there is a tie in the elements, say G = M, if there is a way to display both in that would be great but not necessary. Any tips on how to delete or remove the '-' is much appreciated. I am slowly learning perl language.
Please let me know if what I am asking is not clear or if more information is needed, thanks again kindly for all the comments.
Your data doesn't entirely make sense, since it's not actually working perl code. I'm guessing that it's a string divided into characters. After that it sounds like you just want to be able to find the highest frequency character, which is essentially just a sort by descending count.
Therefore the following demonstrates how to count your characters and then sort the results:
use strict;
use warnings;
my $str = '------------------------------------------------------------------MGG-------------------------------------------------------------------------------------';
my #chars = split '', $str;
#Count Characteres
my %count;
$count{$_}++ for #chars;
delete $count{'-'}; # Don't count -
# Sort keys by count descending
my #keys = sort {$count{$b} <=> $count{$a}} keys %count;
for my $key (#keys) {
print "$key $count{$key}\n";
}
Outputs:
G 2
M 1
foreach my $extract(#extract) {
#next if $_ =~ /\-/
$_ setting is suppressed by $extract here.
(In this case, $_ keeps value from above, e.g. routine argument list, previous match, etc.)
Also, you can use character class for better readability:
next if $extract=~/[-]/;

How to get the top keys from a hash by value

I have a hash that I sorted by values greatest to least. How would I go about getting the top 5? There was a post on here that talked about getting only one value.
What is the easiest way to get a key with the highest value from a hash in Perl?
I understand that so would lets say getting those values add them to an array and delete the element in the hash and then do the process again?
Seems like there should be an easier way to do this then that though.
My hash is called %words.
Edited Took out code as the question answered without really needing it.
Your question is how to get the five highest values from your hash. You have this code:
my #keys = sort {
$words{$b} <=> $words{$a}
or
"\L$a" cmp "\L$b"
} keys %words;
Where you have your sorted hash keys. Take the five top keys from there?
my #highest = splice #keys, 0, 5; # also deletes the keys from the array
my #highest = #keys[0..4]; # non-destructive solution
Also some comments on your code:
open( my $filehandle0, '<', $file0 ) || die "Could not open $file0\n";
It is a good idea to include the error message $! in your die statement to get valuable information for why the open failed.
for (#words) {
s/[\,|\.|\!|\?|\:|\;|\"]//g;
}
Like I said in the comment, you do not need to escape characters or use alternations in a character class bracket. Use either:
s/[,.!?:;"]//g for #words; #or
tr/,.!?:;"//d for #words;
This next part is a bit odd.
my #stopwords;
while ( my $line = <$filehandle1> ) {
chomp $line;
my #linearray = split( " ", $line );
push( #stopwords, #linearray );
}
for my $w ( my #stopwords ) {
s/\b\Q$w\E\B//ig;
}
You read in the stopwords from a file... and then you delete the stopwords from $_? Are you even using $_ at this point? Moreover, you are redeclaring the #stopwords array in the loop header, which will effectively mean your new array will be empty, and your loop will never run. This error is silent, it seems, so you might never notice.
my %words = %words_count;
Here you make a copy of %words_count, which seems to be redundant, since you never use it again. If you have a big hash, this can decrease performance.
my $key_count = 0;
$key_count = keys %words;
This can be done in one line: my $key_count = keys %words. More readable, in my opinion.
$value_count = $words{$key} + $value_count;
Can also be abbreviated with the += operator: $value_cont += $words{$key}
It is very good that you use strict and warnings.
If performance isn't a big deal
(sort {$words{$a} <=> $words{$b}} keys %words)[0..4])
if you absolutely need killer speed, a selection sort which terminates after 5 iterations is probably the best thing for you.
my #results;
for (0..4) {
my $maxkey;
my $max = 0;
for my $key (keys %words){
if ($max < $words{$key}){
$maxkey = $key;
$max = $words{$key};
}
}
push #results, $maxkey;
delete $words{$maxkey};
}
say join(","=>#results);
There's CPAN module for that, Sort::Key::Top.
It has a straight-forward interface and an efficient XS implementation:
use Sort::Key::Top qw(rnkeytop);
my #results = rnkeytop { $words{$_} } 5 => keys %words;

Perl - Summarize Data in File

Whats the best way to summarize data from a file that has around 2 million records in Perl?
For eg: A file like this,
ABC|XYZ|DEF|EGH|100
ABC|XYZ|DEF|FGH|200
SDF|GHT|WWW|RTY|1000
SDF|GHT|WWW|TYU|2000
Needs to be summarized on the first 3 columns like this,
ABC|XYZ|DEF|300
SDF|GHT|WWW|3000
Chris
Assuming there are always five columns, the fifth of which is numeric, and you always want the first three columns to be the key...
use warnings;
use strict;
my %totals_hash;
while (<>)
{
chomp;
my #cols = split /\|/;
my $key = join '|', #cols[0..2];
$totals_hash{$key} += $cols[4];
}
foreach (sort keys %totals_hash)
{
print $_, '|', $totals_hash{$_}, "\n";
}
You can use a hash as:
my %hash;
while (<DATA>) {
chomp;
my #tmp = split/\|/; # split each line on |
my $value = pop #tmp; # last ele is the value
pop #tmp; # pop unwanted entry
my $key = join '|',#tmp; # join the remaining ele to form key
$hash{$key} += $value; # add value for this key
}
# print hash key-values.
for(sort keys %hash) {
print $_ . '|'.$hash{$_}."\n";
}
Ideone link
Presuming your input file has its records in separate lines.
perl -n -e 'chomp;#a=split/\|/;$h{join"|",splice#a,0,3}+=pop#a;END{print map{"$_: $h{$_}\n"}keys%h}' < inputfile
1-2-3-4 I declare A CODE-GOLF WAR!!! (Okay, a reasonably readable code-golf dust-up.)
my %sums;
m/([^|]+\|[^|]+\|[^|]+).*?\|(\d+)/ and $sums{ $1 } += $2 while <>;
print join( "\n", ( map { "$_|$sums{$_}" } sort keys %sums ), '' );
Sort to put all records with the same first 3 triplets next to each other. Iterate through and kick out a subtotal when a different set of triplets appears.
$prevKey="";
$subtotal=0;
open(INPUTFILE, "<$inFile");
#lines=<INPUTFILE>;
close (INPUTFILE);
open(OUTFILE, ">$outFile");
#sorted=sort(#lines);
foreach $line(#lines){
#parts=split(/\|/g, $line);
$value=pop(#parts);
$value-=0; #coerce string to a number
$key=$parts[0]."|".$parts[1]."|".$parts[2];
if($key ne $prevKey){
print OUTFILE "$prevKey|$subtotal\n";
$prevKey=$key;
$subtotal=0;
}
$subtotal+=$value;
}
close(OUTFILE);
If sorting 2 million chokes your box then you may have to put each record into a file based on the group and then do the subtotal for each file.