I'm reading two tab separated files into two hashes, files looks like this:
apple fruit
pear fruit
carrot vegetable
potato vegetable
peach fruit
and
apple 23
pear 34
carrot 12
potato 45
peach 12
I want to pick up only vegetable and get their numbers. Is there any smarter way than through the for cycle to do this?
And if I want to create two new hashes %fruits and %vegetable, do I really have to do it like:
foreach (keys %kinds_hash) {
if ($kinds_hash{$_} =~ "vegetable") {
$vegetable{$_} = $numbers_hash{$_};
} elsif ($kinds_hash{$_} =~ "fruit") {
$fruit{$_} = $numbers_hash{$_};
}
}
There's nothing wrong with iterating on all the values.
However, if you're going to be doing it often, then perhaps it would be useful to create a new data structure that contains an array of names based off type.
use strict;
use warnings;
# Data in Paragraph mode
local $/ = '';
my %counts = split ' ', <DATA>;
my %types = split ' ', <DATA>;
# Create a structure that puts each type into an array
my %group_by_type;
while (my ($name, $type) = each %types) {
push #{$group_by_type{$type}}, $name
}
# Show all Veges
for my $fruit (#{$group_by_type{vegetable}}) {
print "$fruit $counts{$fruit}\n";
}
__DATA__
apple 23
pear 34
carrot 12
potato 45
peach 12
apple fruit
pear fruit
carrot vegetable
potato vegetable
peach fruit
Outputs:
carrot 12
potato 45
To learn more about Hashes of Arrays and other data structures, check out perldsc - Perl Data Structures Cookbook
You should structure your data so that all the ways you want to access it are made as simple as possible.
You want to access all the items in the vegetable category, and the numbers for all of those items. To make that simple I would build two hashes - one relating the names of the items to their number and category, and another relating the categories to all the names in each category.
This code does just that and uses Data::Dump to show you what has been built.
use strict;
use warnings;
use autodie;
my %items;
my %categories;
open my $fh, '<', 'numbers.tabsep';
while (<$fh>) {
next unless /\S/;
chomp;
my ($name, $number) = split /\t/;
$items{$name}[0] = $number;
}
open $fh, '<', 'categories.tabsep';
while (<$fh>) {
next unless /\S/;
chomp;
my ($name, $cat) = split /\t/;
$items{$name}[1] = $cat;
push #{ $categories{$cat} }, $name;
}
use Data::Dump;
dd \%items;
dd \%categories;
output
{
apple => [23, "fruit"],
carrot => [12, "vegetable"],
peach => [12, "fruit"],
pear => [34, "fruit"],
potato => [45, "vegetable"],
}
{
fruit => ["apple", "pear", "peach"],
vegetable => ["carrot", "potato"],
}
Now, to answer the question "I want to pick up only vegetables and get their numbers" we just loop over the vegetable element of the %categories hash, and use the %items hash to determine their numbers. Like this
for my $item (#{ $categories{vegetable} }) {
printf "%s %d\n", $item, $items{$item}[0];
}
output
carrot 12
potato 45
Tool completed successfully
You can create hash of hashes, just one nested data structure where the inner key will be your category and the value will be another hash whose key will be type and value be the number.
Following program does that:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %data;
open my $fh_one, '<', 'file1';
while(<$fh_one>) {
next unless /\S+/;
chomp;
my ($type, $category) = split /\t/;
$data{$category}{$type} = undef;
}
close($fh_one);
open my $fh_two, '<', 'file2';
OUTER: while(<$fh_two>) {
next unless /\S+/;
chomp;
my ($type, $number) = split /\t/;
for my $category (keys %data) {
for my $item (keys %{ $data{$category} }) {
$data{$category}{$item} = $number and next OUTER if $item eq $type;
}
}
}
close($fh_two);
#print Dumper \%data;
while (my ($type, $number) = each $data{'vegetable'}) {
print "$type $number\n";
}
If you uncomment the print Dumper \%data; you will see the nested data structure. It will look like the following:
$VAR1 = {
'fruit' => {
'peach' => '12',
'apple' => '23',
'pear' => '34'
},
'vegetable' => {
'carrot' => '12',
'potato' => '45'
}
};
The output of the above program is:
carrot 12
potato 45
Related
I am reading user IDs from a csv file and trying to check if that user ID exists in my hash, however I have found that my checking through if(exists($myUserIDs{"$userID"})) is never returning true, despite testing with multiple keys that are in my hash.
open $file, "<", "test.csv";
while(<$file>)
{
chomp;
#fields = split /,/, $_;
$userID = #fields[1];
if(exists($myUserIDs{"$userID"}))
{
#do something
}
}
Turns out I wrote my test csv file with spaces after each comma.
Like 1, 2, 3 instead of 1,2,3 so my keys weren't actually matching. There goes an hour of my time haha.
See if this works for you.
use strict; use warnings;
use Data::Dumper;
my (%myUserIDs, #fields);
%myUserIDs = (
'1' => 'A',
'4' => 'D'
);
print Dumper(\%myUserIDs);
while(<DATA>)
{
chomp;
#fields = split /,/, $_;
my $userID = $fields[1];
if(exists($myUserIDs{$userID})){
print "ID:$userID exists in Hash\n";
} else {
print "ID:$userID not exists in Hash\n";
}
}
__DATA__
A,1
B,2
C,3
Output:
$VAR1 = {
'4' => 'D',
'1' => 'A'
};
ID:1 exists in Hash
ID:2 not exists in Hash
ID:3 not exists in Hash
I'm struggling to understand how to read a simple text file into two Perl hashes.
I have a text file like:
George Washington
John Adams
Abraham Lincoln
and I want to create two hashes, one that holds the first names and the other that holds the last names.
I'm looking at doing something like:
my %first;
my %last;
open(my $FH, '<', $file) or die$!;
my $count = 1;
while (<$FH>)
{
chomp;
if count is odd, add to %first
elsif count is even, add to %last
}
close($FH);
but I'm honestly lost. Does anyone have any ideas?
Well you can get desired result with following code.
use strict;
use warnings;
use feature 'say';
use Data::Dumper;
my $count = 0;
my %first;
my %last;
while(<DATA>) {
chomp;
my($f,$l) = split;
$first{$f} = $count;
$last{$l} = $count;
$count++;
}
say Dumper(\%first);
say Dumper(\%last);
__DATA__
George Washington
John Adams
Abraham Lincoln
Output
$VAR1 = {
'George' => 0,
'Abraham' => 2,
'John' => 1
};
$VAR1 = {
'Adams' => 1,
'Lincoln' => 2,
'Washington' => 0
};
I have a table with users the gender of their kids in seprate lines.
lilly boy
lilly boy
jane girl
lilly girl
jane boy
I wrote a script to put parse the lines and give me a total at the end
lilly boys=2 girls1
jane boys=1 girls=1
I tried this with a hash, but I dont know how to approach it
foreach $lines (#all_lines){
if ($lines =~ /(.+?)/s(.+)/){
$person = $1;
if ($2 =~ /boy/){
$boycount=1;
$girlcount=0;
}
if ($2 =~ /girl/){
$boycount=0;
$girlcount=1;
}
the next part is, if the person doesn't already exist inside the hash, add the person and then start a count for boy and girl. (i think this is the correct way, not sure)
if (!$hash{$person}){
%hash = (
'$person' => [
{'boy' => "0+$boycount", 'girl' => "0+$girlcount"}
],
);
Now, I dont know how to keep updating the values inside the hash, if the person already exists in the hash.
%hash = (
'$person' => [
{'boys' => $boyscount, 'girls' => $girlscount}
],
);
I am not sure how to keep updating the hash.
You just need to study the Perl Data Structures Cookbook
use strict;
use warnings;
my %person;
while (<DATA>) {
chomp;
my ($parent, $gender) = split;
$person{$parent}{$gender}++;
}
use Data::Dump;
dd \%person;
__DATA__
lilly boy
lilly boy
jane girl
lilly girl
jane boy
use strict;
use warnings;
my %hash;
open my $fh, '<', 'table.txt' or die "Unable to open table: $!";
# Aggregate stats:
while ( my $line = <$fh> ) { # Loop over record by record
chomp $line; # Remove trailing newlines
# split is a better tool than regexes to get the necessary data
my ( $parent, $kid_gender ) = split /\s+/, $line;
$hash{$parent}{$kid_gender}++; # Increment by one
# Take advantage of auto-vivification
}
# Print stats:
for my $parent ( keys %hash ) {
printf "%s boys=%d girls = %d\n",
$parent, $hash{$parent}{boy}, $hash{$parent}{girl};
}
I find myself doing this pattern a lot in perl
sub fun {
my $line = $_[0];
my ( $this, $that, $the_other_thing ) = split /\t/, $line;
return { 'this' => $this, 'that' => $that, 'the_other_thing' => $the_other_thing};
}
Obviously I can simplify this pattern by returning the output of a function which transforms a given array of variables into a map, where the keys are the same names as the variables eg
sub fun {
my $line = $_[0];
my ( $this, $that, $the_other_thing ) = split /\t/, $line;
return &to_hash( $this, $that, $the_other_thing );
}
It helps as the quantity of elements get larger. How do I do this? It looks like I could combine PadWalker & closures, but I would like a way to do this using only the core language.
EDIT: thb provided a clever solution to this problem, but I've not checked it because it bypasses a lot of the hard parts(tm). How would you do it if you wanted to rely on the core language's destructuring semantics and drive your reflection off the actual variables?
EDIT2: Here's the solution I hinted at using PadWalker & closures:
use PadWalker qw( var_name );
# Given two arrays, we build a hash by treating the first set as keys and
# the second as values
sub to_hash {
my $keys = $_[0];
my $vals = $_[1];
my %hash;
#hash{#$keys} = #$vals;
return \%hash;
}
# Given a list of variables, and a callback function, retrieves the
# symbols for the variables in the list. It calls the function with
# the generated syms, followed by the original variables, and returns
# that output.
# Input is: Function, var1, var2, var3, etc....
sub with_syms {
my $fun = shift #_;
my #syms = map substr( var_name(1, \$_), 1 ), #_;
$fun->(\#syms, \#_);
}
sub fun {
my $line = $_[0];
my ( $this, $that, $other) = split /\t/, $line;
return &with_syms(\&to_hash, $this, $that, $other);
}
You could use PadWalker to try to get the name of the variables, but that's really not something you should do. It's fragile and/or limiting.
Instead, you could use a hash slice:
sub fun {
my ($line) = #_;
my %hash;
#hash{qw( this that the_other_thing )} = split /\t/, $line;
return \%hash;
}
You can hide the slice in a function to_hash if that's what you desire.
sub to_hash {
my $var_names = shift;
return { map { $_ => shift } #$var_names };
}
sub fun_long {
my ($line) = #_;
my #fields = split /\t/, $line;
return to_hash [qw( this that the_other_thing )] #fields;
}
sub fun_short {
my ($line) = #_;
return to_hash [qw( this that the_other_thing )], split /\t/, $line;
}
But if you insist, here's the PadWalker version:
use Carp qw( croak );
use PadWalker qw( var_name );
sub to_hash {
my %hash;
for (0..$#_) {
my $var_name = var_name(1, \$_[$_])
or croak("Can't determine name of \$_[$_]");
$hash{ substr($var_name, 1) } = $_[$_];
}
return \%hash;
}
sub fun {
my ($line) = #_;
my ($this, $that, $the_other_thing) = split /\t/, $line;
return to_hash($this, $that, $the_other_thing);
}
This does it:
my #part_label = qw( part1 part2 part3 );
sub fun {
my $line = $_[0];
my #part = split /\t/, $line;
my $no_part = $#part_label <= $#part ? $#part_label : $#part;
return map { $part_label[$_] => $part[$_] } (0 .. $no_part);
}
Of course, your code must name the parts somewhere. The code above does it by qw(), but you can have your code autogenerate the names if you like.
[If you anticipate a very large list of *part_labels,* then you should probably avoid the *(0 .. $no_part)* idiom, but for lists of moderate size it works fine.]
Update in response to OP's comment below: You pose an interesting challenge. I like it. How close does the following get to what you want?
sub to_hash ($$) {
my #var_name = #{shift()};
my #value = #{shift()};
$#var_name == $#value or die "$0: wrong number of elements in to_hash()\n";
return map { $var_name[$_] => $value[$_] } (0 .. $#var_name);
}
sub fun {
my $line = $_[0];
return to_hash [qw( this that the_other_thing )], [split /\t/, $line];
}
If I understand you properly you want to build a hash by assigning a given sequence of keys to values split from a data record.
This code seems to do the trick. Please explain if I have misunderstood you.
use strict;
use warnings;
use Data::Dumper;
$Data::Dumper::Terse++;
my $line = "1111 2222 3333 4444 5555 6666 7777 8888 9999\n";
print Dumper to_hash($line, qw/ class division grade group kind level rank section tier /);
sub to_hash {
my #fields = split ' ', shift;
my %fields = map {$_ => shift #fields} #_;
return \%fields;
}
output
{
'division' => '2222',
'grade' => '3333',
'section' => '8888',
'tier' => '9999',
'group' => '4444',
'kind' => '5555',
'level' => '6666',
'class' => '1111',
'rank' => '7777'
}
For a more general solution which will build a hash from any two lists, I suggest the zip_by function from List::UtilsBy
use strict;
use warnings;
use List::UtilsBy qw/zip_by/;
use Data::Dumper;
$Data::Dumper::Terse++;
my $line = "1111 2222 3333 4444 5555 6666 7777 8888 9999\n";
my %fields = zip_by { $_[0] => $_[1] }
[qw/ class division grade group kind level rank section tier /],
[split ' ', $line];
print Dumper \%fields;
The output is identical to that of my initial solution.
See also the pairwise function from List::MoreUtils which takes a pair of arrays instead of a list of array references.
Aside from parsing the Perl code yourself, a to_hash function isn't feasible using just the core language. The function being called doesn't know whether those args are variables, return values from other functions, string literals, or what have you...much less what their names are. And it doesn't, and shouldn't, care.
I wish to convert a single string with multiple delimiters into a key=>value hash structure. Is there a simple way to accomplish this? My current implementation is:
sub readConfigFile() {
my %CONFIG;
my $index = 0;
open(CON_FILE, "config");
my #lines = <CON_FILE>;
close(CON_FILE);
my #array = split(/>/, $lines[0]);
my $total = #array;
while($index < $total) {
my #arr = split(/=/, $array[$index]);
chomp($arr[1]);
$CONFIG{$arr[0]} = $arr[1];
$index = $index + 1;
}
while ( ($k,$v) = each %CONFIG ) {
print "$k => $v\n";
}
return;
}
where 'config' contains:
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
The last digits need to be also removed, and kept in a separate key=>value pair whose name can be 'ip'. (I have not been able to accomplish this without making the code too lengthy and complicated).
What is your configuration data structure supposed to look like? So far the solutions only record the last line because they are stomping on the same hash keys every time they add a record.
Here's something that might get you closer, but you still need to figure out what the data structure should be.
I pass in the file handle as an argument so my subroutine isn't tied to a particular way of getting the data. It can be from a file, a string, a socket, or even the stuff below DATA in this case.
Instead of fixing things up after I parse the string, I fix the string to have the "ip" element before I parse it. Once I do that, the "ip" element isn't a special case and it's just a matter of a double split. This is a very important technique to save a lot of work and code.
I create a hash reference inside the subroutine and return that hash reference when I'm done. I don't need a global variable. :)
use warnings;
use strict;
use Data::Dumper;
readConfigFile( \*DATA );
sub readConfigFile
{
my( $fh ) = shift;
my $hash = {};
while( <$fh> )
{
chomp;
s/\s+(\d*\.\d+)$/>ip=$1/;
$hash->{ $. } = { map { split /=/ } split />/ };
}
return $hash;
}
my $hash = readConfigFile( \*DATA );
print Dumper( $hash );
__DATA__
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
This gives you a data structure where each line is a separate record. I choose the line number of the record ($.) as the top-level key, but you can use anything that you like.
$VAR1 = {
'1' => {
'ip' => '6.00',
'rec' => '0',
'adv' => '1234 123 4.5',
'pub' => '3',
'size' => '3'
},
'2' => {
'ip' => '.76',
'rec' => '1',
'adv' => '111 22 3456',
'pub' => '1',
'size' => '2'
}
};
If that's not the structure you want, show us what you'd like to end up with and we can adjust our answers.
I am assuming that you want to read and parse more than 1 line. So, I chose to store the values in an AoH.
#!/usr/bin/perl
use strict;
use warnings;
my #config;
while (<DATA>) {
chomp;
push #config, { split /[=>]/ };
}
for my $href (#config) {
while (my ($k, $v) = each %$href) {
print "$k => $v\n";
}
}
__DATA__
pub=3>rec=0>size=3>adv=1234 123 4.5 6.00
pub=1>rec=1>size=2>adv=111 22 3456 .76
This results in the printout below. (The while loop above reads from DATA.)
rec => 0
adv => 1234 123 4.5 6.00
pub => 3
size => 3
rec => 1
adv => 111 22 3456 .76
pub => 1
size => 2
Chris
The below assumes the delimiter is guaranteed to be a >, and there is no chance of that appearing in the data.
I simply split each line based on '>'. The last value will contain a key=value pair, then a space, then the IP, so split this on / / exactly once (limit 2) and you get the k=v and the IP. Save the IP to the hash and keep the k=v pair in the array, then go through the array and split k=v on '='.
Fill in the hashref and push it to your higher-scoped array. This will then contain your hashrefs when finished.
(Having loaded the config into an array)
my #hashes;
for my $line (#config) {
my $hash; # config line will end up here
my #pairs = split />/, $line;
# Do the ip first. Split the last element of #pairs and put the second half into the
# hash, overwriting the element with the first half at the same time.
# This means we don't have to do anything special with the for loop below.
($pairs[-1], $hash->{ip}) = (split / /, $pairs[-1], 2);
for (#pairs) {
my ($k, $v) = split /=/;
$hash->{$k} = $v;
}
push #hashes, $hash;
}
The config file format is sub-optimal, shall we say. That is, there are easier formats to parse and understand. [Added: but the format is already defined by another program. Perl is flexible enough to deal with that.]
Your code slurps the file when there is no real need.
Your code only pays attention to the last line of data in the file (as Chris Charley noted while I was typing this up).
You also have not allowed for comment lines or blank lines - both are a good idea in any config file and they are easy to support. [Added: again, with the pre-defined format, this is barely relevant, but when you design your own files, do remember it.]
Here's an adaptation of your function into somewhat more idiomatic Perl.
#!/bin/perl -w
use strict;
use constant debug => 0;
sub readConfigFile()
{
my %CONFIG;
open(CON_FILE, "config") or die "failed to open file ($!)\n";
while (my $line = <CON_FILE>)
{
chomp $line;
$line =~ s/#.*//; # Remove comments
next if $line =~ /^\s*$/; # Ignore blank lines
foreach my $field (split(/>/, $line))
{
my #arr = split(/=/, $field);
$CONFIG{$arr[0]} = $arr[1];
print ":: $arr[0] => $arr[1]\n" if debug;
}
}
close(CON_FILE);
while (my($k,$v) = each %CONFIG)
{
print "$k => $v\n";
}
return %CONFIG;
}
readConfigFile; # Ignores returned hash
Now, you need to explain more clearly what the structure of the last field is, and why you have an 'ip' field without the key=value notation. Consistency makes life easier for everybody. You also need to think about how multiple lines are supposed to be handled. And I'd explore using a more orthodox notation, such as:
pub=3;rec=0;size=3;adv=(1234,123,4.5);ip=6.00
Colon or semi-colon as delimiters are fairly conventional; parentheses around comma separated items in a list are not an outrageous convention. Consistency is paramount. Emerson said "A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines", but consistency in Computer Science is a great benefit to everyone.
Here's one way.
foreach ( #lines ) {
chomp;
my %CONFIG;
# Extract the last digit first and replace it with an end of
# pair delimiter.
s/\s*([\d\.]+)\s*$/>/;
$CONFIG{ip} = $1;
while ( /([^=]*)=([^>]*)>/g ) {
$CONFIG{$1} = $2;
}
print Dumper ( \%CONFIG );
}