How to change an array into a hashtable? - perl

I'm trying to make a program where I read in a file with a bunch of text in it. I then take punctuation out and then I read in a file that has stop words in it. Both get read in and put into arrays. I'm trying to put the array of the general text file and put it in a hash. I'm not really sure what I'm doing wrong, but I'm trying. I want to do this so I can generate stats on how many words are repeated and what not, but I have to take out stop words and such.
Anyway, here is what I have so far. I put a comment #WORKING ON MERGING ARRAY INTO HASH where I'm working. I don't think the way I'm trying to put the array into the hash is right, but I looked online and %hash{array} = "value"; doesn't compile, so I'm not sure how else to do it.
Thanks, if you have any questions for me I will respond back quickly.
#!/usr/bin/perl
use strict;
use warnings;
#Reading in the text file
my $file0="data.txt";
open(my $filehandle0,'<', $file0) || die "Could not open $file0\n";
my @words;
while (my $line = <$filehandle0>){
    chomp $line;
    my @word = split(/\s+/, $line);
    push(@words, @word);
}
for (@words) {
    s/[\,|\.|\!|\?|\:|\;]//g;
}
my %words_count; #The code I was told to add in this post.
$words_count{$_}++ for @words;
Next I read in the stop words I have in another array.
#Reading in the stopwords file
my $file1 = "stoplist.txt";
open(my $filehandle1, '<',$file1) or die "Could not open $file1\n";
my @stopwords;
while(my $line = <$filehandle1>){
    chomp $line;
    my @linearray = split(" ", $line);
    push(@stopwords, @linearray);
}
for my $w (my @stopwords) {
    s/\b\Q$w\E\B//ig;
}

Some notes about hashes in Perl... Problem description:
Anyway, here is what I have so far. I put a comment #WORKING ON MERGING ARRAY INTO HASH where I'm working. I don't think the way I'm trying to put the array into the hash is right, but I looked online and %hash{array} = "value"; doesn't compile, so I'm not sure how else to do it.
First, ask yourself why you want to "put the array into the hash". An array represents a list of values, while a hash represents a set of key-value pairs. So you have to define what the keys and values should be, not only for us but for yourself. It often helps to explain even simple things to get a better understanding.
In this case, you may want to count how often a given word $word occurred in your @words array. This can be done by iterating over all the words and increasing $count{$word} by one each time. This is what @raina77ow did in his answer. The important point is that you're accessing single hash values, which are written with the scalar sigil $ in Perl. So if you have a hash named %count, you can increase the value for the key 'foo' with
$count{foo}++;
Your result of "online looking" above (%hash{array} = "value") doesn't make sense. There are three valid ways to store values in a hash:
set all key-value pairs by assigning an even-sized list to the whole hash:
%count = (hello => 42, world => 17);
set a single value for a given key by assigning a single value for a defined key (this is what we did before):
$count{hello} = 42;
set a list of values for a given list of keys using a so-called hash slice:
@count{qw(hello world)} = (42, 17);
Note the use of sigils here: % for a hash, treated as an even-sized list of mixed keys and values, $ for single (scalar) values and @ for lists of values. In your example you're using %, but define an array in the key braces {...} and assign a single scalar value.
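The three forms of assignment described above can be tried side by side. This is only a minimal sketch; the keys perl, foo and bar are made up for illustration.

```perl
use strict;
use warnings;

my %count;

# 1. Assign an even-sized list to the whole hash (replaces any old contents).
%count = (hello => 42, world => 17);

# 2. Assign one scalar value to one key.
$count{perl} = 1;

# 3. Hash slice: assign a list of values to a list of keys.
@count{qw(foo bar)} = (3, 4);

# Print the pairs in a stable order.
print "$_ => $count{$_}\n" for sort keys %count;
```

Note that the first form overwrites the whole hash, so it has to come before the other two if their values are to survive.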

Well, if you have a list of words in the @words array and want a hash where each key is a specific word and each value is the number of times that word appears in the source array, it's as simple as...
my %words_count;
$words_count{$_}++ for @words;
In other words (no pun intended), you iterate over the @words array, increasing by 1 the corresponding element of the %words_count hash for each member or, when that element is not yet defined, essentially creating it with value 1 (so-called auto-vivification).
As a sidenote, calling keys function on arrays is close to meaningless: in 5.12+ it'll give you the list of indexes used instead, and before that, throw a syntax error at you.
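Since the question also wants stop words left out of the counts, here is one possible sketch that combines the counting idiom with a lookup hash. The sample word lists and the name %is_stopword are assumptions standing in for the contents of data.txt and stoplist.txt.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for data.txt and stoplist.txt.
my @words     = qw(the cat sat on the mat the end);
my @stopwords = qw(the on);

# Build a lookup hash so each stop-word test is a single hash access.
my %is_stopword = map { lc($_) => 1 } @stopwords;

my %words_count;
for my $word (@words) {
    next if $is_stopword{ lc $word };   # skip stop words
    $words_count{$word}++;              # autovivifies to 1 on first sight
}

print "$_: $words_count{$_}\n" for sort keys %words_count;
```

Filtering while counting avoids rewriting the @words array with one regex substitution per stop word.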

Related

Filter duplicates explanation

I'm a perl beginner. I try to filter duplicate entries. I found something online that magically works for me but I don't quite understand it. Could someone please explain it in detail for me?
my %seen;
grep !$seen{$_}++, @_;
It might help to see this written out in more detail.
# Hash to keep track of what you've seen
my %seen;
# Array to store the first occurrence of each value
my @values;
foreach my $x (@_) {
    # If we haven't seen this value already
    if (!$seen{$x}) {
        # Push this value onto @values
        push @values, $x;
    }
    # Increment the value in %seen to say we've seen this value
    $seen{$x}++;
}
# At the end, the unique values are in @values
%seen is a hash variable.
@_ is an array that holds all the input parameters of the current subroutine.
The first time the loop sees an element, that element has no key in %seen, so the test passes and the element is kept. The next time the loop sees that same element, its key exists in the hash and its value is true, so the loop skips the element and goes on to the next one.
You can find more details in: What are the differences between $, @, % in Perl variable declaration?
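To see the idiom at work, it can be wrapped in a small subroutine. This is just a sketch; the name uniq and the sample list are chosen here for illustration.

```perl
use strict;
use warnings;

# Return the arguments with later duplicates removed, first-seen order kept.
sub uniq {
    my %seen;
    # $seen{$_}++ returns the old count (false the first time), then increments.
    return grep { !$seen{$_}++ } @_;
}

my @unique = uniq(qw(b a b c a));
print "@unique\n";   # prints "b a c"
```

The post-increment is what makes the one-liner work: the test uses the value from before the increment, so only the first occurrence passes.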

Dereferencing hash of hashes in Perl

Sorry for this long post, the code should be easy to understand for veterans of Perl. I'm new to Perl and I'm trying to figure out this bit of code:
my %regression;
print "Reading regression dir: $opt_dir\n";
foreach my $f ( glob("$opt_dir/*.regress") ) {
    my $name = ( fileparse( $f, '\.regress' ) )[0];
    $regression{$name}{file} = $f;
    say "file $regression{$name}{file}";
    say "regression name $regression{$name}";
    say "regression name ${regression}{$name}";
    &read_regress_file( $f, $regression{$name} );
}
sub read_regress_file {
    say "args @_";
    my $file = shift;
    my $href = shift;
    say "href $href";
    open FILE, $file or die "Cannot open $file: $!\n";
    while ( <FILE> ) {
        next if /^\s*\#/ or /^\s*$/;
        chomp;
        my @tokens = split "=";
        my $key = shift @tokens;
        $$href{$key} = join( "=", @tokens );
    }
    close FILE;
}
The say lines are things I added to debug.
My confusion is the last part of the subroutine read_regress_file. It looks like href is a reference from the line my $href = shift;. However, I'm trying to figure out how the hash that was passed got referenced in the first place.
%regression is a hash with keys of $name. The .regress files the code reads are simple files containing variables and their values in the form of:
var1=value
var2=value
...
So it looks like the line
my $name = (fileparse($f,'\.regress'))[0];
is creating the keys as scalars and the line
$regression{$name}{file} = $f;
actually makes $name into a hash.
In my debugging lines
say "regression name $regression{$name}";
prints the reference, for instance
regression name HASH(0x7cd198)
but
say "regression name ${regression}{$name}";
prints a name, like
regression name {filename}
with the file name inside the braces.
However, using
say "regression name $$regression{$name}";
prints nothing.
From my understanding, it looks like regression is an actual hash, but the references are the nested hashes, name.
Why does my dereference test line using braces work, but the other form of dereferencing ($$) not work?
Also, why is the name still surrounded by braces when it prints? Shouldn't I be dereferencing $name instead?
I'm sorry if this is difficult to read. I'm confused about which hash is actually referenced, and how to dereference them if the reference is the nested hash.
This is a tough one. You've found some very awkward code that displays what may well be a bug in Perl, and you're getting confused over dereferencing Perl data structures. Standard Perl installations include the full set of documentation, and I suggest you take a look at perldoc perlreftut, which is also available online at perldoc.com.
The most obvious thing is that you are writing very old-fashioned Perl. Using an ampersand & to call a Perl subroutine hasn't been considered good practice since v5.8 was released fourteen years ago.
I don't think there's much need to go beyond your clearly experimental lines at the start of the first for loop. Once you have understood this the rest should follow.
say "file $regression{$name}{file}";
say "regression name $regression{$name}";
say "regression name ${regression}{$name}";
First of all, expanding data structure references within a string is unreliable. Perl tries to do what you mean, but it's very easy to write something ambiguous without realising it. It is often much better to use printf so that you can specify the embedded value separately. For instance
printf "file %s\n", $regression{$name}{file};
That said, you have a problem. $regression{$name} accesses the element of hash %regression whose key is equal to $name. That value is a reference to another hash, so the line
say "regression name $regression{$name}";
prints something like
regression name HASH(0x29348b0)
which you really don't want to see.
Your first try, $regression{$name}{file}, accesses the element of the secondary hash that has the key file. That works fine.
But ${regression}{$name} should be the same as $regression{$name}. Outside a string it is, but inside a string it's as if ${regression} and {$name} are treated separately.
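The difference inside a string can be shown with a short sketch. Here a scalar named $regression is declared purely for illustration (it is not in the original code), so that the effect is visible and the script still compiles under use strict.

```perl
use strict;
use warnings;

my %regression = ( alpha => { file => 'alpha.regress' } );
my $regression = 'SCALAR';   # hypothetical scalar, unrelated to the hash
my $name       = 'alpha';

# Inside a string, the braces end the variable name: this interpolates the
# scalar $regression, then the literal text "{", the value of $name and "}".
my $in_string = "${regression}{$name}";

# An explicit chain of subscripts interpolates as intended.
my $file = "$regression{$name}{file}";

print "$in_string\n";   # SCALAR{alpha}
print "$file\n";        # alpha.regress
```

In the question no such scalar holds a value, which would explain why only the name in braces was printed.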
There are really too many issues here for me to start guessing where you're stuck, especially without being able to talk about specifics. But it may help if I rewrite the initial code like this
my %regression;
print "Reading regression dir: $opt_dir\n";
foreach my $f ( glob("$opt_dir/*.regress") ) {
    my ($name, $path, $suffix) = fileparse($f, '\.regress');
    $regression{$name}{file} = $f;
    my $file_details = $regression{$name};
    say "file $file_details->{file}";
    read_regress_file($f, $file_details);
}
I've copied the hash reference to $file_details and passed it to the subroutine like that. Can you see that each element of %regression is keyed by the name of the file, and that each value is a reference to another hash that contains the values filled in by read_regress_file?
I hope this helps. This isn't really a forum for teaching language basics so I don't think I can do much better
What I understand is that this:
$regression{$name}
represents a hashref, which looks like this:
{ file => '...something...'}
So, in order to dereference the hashref returned by $regression{$name}, you have to do something like:
%{ $regression{$name} }
In order to get the full hash.
In order to get the file property of the hash, do this:
$regression{$name}->{file}
Hope this helps.
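Putting the pieces together, the following sketch shows how the inner hash comes into existence and how a subroutine can fill it through the reference. The sub fill_in and the key var1 are hypothetical stand-ins for read_regress_file and the contents of a .regress file.

```perl
use strict;
use warnings;

my %regression;
$regression{alpha}{file} = 'alpha.regress';   # autovivifies the inner hash

# A sub that fills in the hash it is given by reference, in the manner of
# read_regress_file (fill_in and its key are hypothetical).
sub fill_in {
    my ($href) = @_;
    $href->{var1} = 'value';     # same as $$href{var1} = 'value'
}

fill_in( $regression{alpha} );   # pass the reference to the inner hash

my %inner = %{ $regression{alpha} };          # copy of the whole inner hash
print join( ',', sort keys %inner ), "\n";    # file,var1
print "$regression{alpha}->{file}\n";         # alpha.regress
```

Assigning to $regression{$name}{file} is what creates the nested hash, so $regression{$name} already holds a hash reference by the time it is passed to the subroutine.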

While loop and diamond operator in Perl

I am trying to input a text file to Perl program and reverse its order of lines i.e. last line will become first, second last will become second etc. I am using following code
#!C:\Perl64\bin
$k = 0;
while (<>){
    print "the value of i is $i";
    @array[k] = $_;
    ++$k;
}
print "the array is @array";
But for some reason, my array is only printing the last line of the text file.
Any suggestions?
Typically, rather than keep a separate array index, perl programs use the push operator to push a string onto an array. One way to do this in your program:
push @array, $_;
If you really want to do it by array index, then you need to use the following syntax:
$array[$k] = $_;
Notice the $ rather than @ in front. This tells perl that you're dealing with a single element from the array, not multiple elements. @array gives you the entire array, while $array[$k] gives you a single element. (There is a more advanced topic called "slices," but let's not get into that here. I will say that @array[$k] gives you a slice, and that isn't what you want here.)
If you really just want to slurp the entire file into an array, you can do that in one step:
@array = ( <> );
That will read the entire file into @array in one step.
You might have noticed I omitted/ignored your print statement. I'm not sure what it's doing printing out a variable named $i, since it didn't seem connected at all to the rest of the code. I reasoned it was debug code you had added, and not really relevant to the task at hand.
Anyway, that should get your input into @array. Now reversing the array... There are many ways you could do this in perl, but I'll let you discover those yourself.
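For reference, one of those ways is the built-in reverse function. A minimal sketch, using an in-memory filehandle with made-up data in place of the real input file:

```perl
use strict;
use warnings;

# An in-memory filehandle stands in for the input file (hypothetical data).
my $text = "first\nsecond\nthird\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @lines = <$fh>;     # list context reads every remaining line
close $fh;

my @reversed = reverse @lines;
print @reversed;       # third, second, first (one per line)
```

Each line keeps its trailing newline, so the reversed list can be printed directly.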
Instead of:
@array[k] = $_;
you want:
$array[$k] = $_;
To reference the scalar variable $k, you need the $ on the front. Without that it is interpreted as the literal string 'k', which when used as an array index would be interpreted as 0 (since a non-numeric string will be interpreted as 0 in a numeric context).
So, each time around the loop you are setting the first element to the line read in (overwriting the value set in the previous iteration).
A few other tips:
@array[ ] is actually the syntax for an array slice rather than a single element. It works in this case because you are assigning to a slice of 1. The usual syntax for accessing a single element would be $array[ ].
I recommend placing 'use strict;' at the top of your script; you would have gotten an error pointing out the incorrect reference to $k.
Instead of using an index variable, you could push the values onto the end of the array, eg:
while (<>) {
    push @array, $_;
}
To accept input until it finds the word "end":
Solution 1
#!/usr/bin/perl
while (<>) {
    last if $_ =~ /end/i;
    push @array, $_;
}
for (my $i = scalar(@array); $i > 0; $i--) {
    print pop @array;
}
Solution 2
while (<>) {
    last if $_ =~ /end/i;
    push @array, $_;
}
print reverse(@array);

Create a Perl hash with an array as the key

How can I put an array (like the tuple in the following example) into a hash in Perl?
%h=();
@a=(1,1);
$h{@a}=1 or $h{\@a}=1??
I tried with an array reference, but it does not work. How do I make it work? I want to essentially de-duplicate by doing the hashing (among other things with this).
Regular hashes can only have string keys, so you'd need to create some kind of hashing function for your arrays. A simple way would be to simply join your array elements, e.g.
$h{join('-', @a)} = \@a; # A nice readable separator
$h{join($;, @a)} = \@a; # A less likely, configurable separator ("\034")
But that approach (using a sentinel value) requires that you pick a character that won't be found in the keys. The following doesn't suffer from that problem:
$h{pack('(j/a*)*', @a)} = \@a;
Alternatively, check out Hash::MultiKey which can take a more complex key.
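The separator problem is easy to reproduce. In this sketch the two arrays are made-up data chosen so that their joined keys collide, while the length-prefixed pack keys stay distinct.

```perl
use strict;
use warnings;

# Two different arrays (hypothetical data) that look the same once joined.
my @a = ( 'x-y', 'z' );
my @b = ( 'x', 'y-z' );

my $joined_a = join '-', @a;              # "x-y-z"
my $joined_b = join '-', @b;              # "x-y-z", a false match
my $packed_a = pack '(j/a*)*', @a;        # each element is length-prefixed
my $packed_b = pack '(j/a*)*', @b;

print "join keys collide\n"  if $joined_a eq $joined_b;
print "pack keys differ\n"   if $packed_a ne $packed_b;
```

The packed keys are binary strings, so they work well as hash keys but are not meant to be printed.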
I tried with array reference, but it does not work
Funny that, page 361 of the (new) Camel book has a paragraph title:
References Don't Work As Hash Keys
So yes, you proved the Camel book right. It then goes on to tell you how to fix it, using Tie::RefHash.
I guess you should buy the book.
(By the way, (1,1) might be called a tuple in Python, but it is called a list in Perl).
To remove duplicates in the array using hashes:
my %hash;
@hash{@array} = @array;
my @unique = keys %hash;
Alternatively, you can use map to create the hash:
my %hash = map {$_ => 1} @array;

A Perl script to process a CSV file, aggregating properties spread over multiple records

Sorry for the vague question, I'm struggling to think how to better word it!
I have a CSV file that looks a little like this, only a lot bigger:
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
The values in the first column are ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin to produce strings like that for all the ID numbers? My ideal output would be a new CSV file which looks something like:
550672,1;2;3;4
656372,1;2
766153,1;4
etc.
I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution, I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to really search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!
I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience
It is best to use Text::CSV whenever you are processing CSV data as all the debugging has already been done for you
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
    $csv->parse($line) or die "Invalid data line";
    my ($key, $val) = $csv->fields;
    push @{ $data{$key} }, $val;
}
for my $id (sort keys %data) {
    printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}
output
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Firstly, props for seeking an approach rather than a solution.
As you've probably already found with perl, There Is More Than One Way To Do It.
The approach I would take would be:
use strict; # will save you big time in the long run
my %ids # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
split line into ID and property variable # google the split function
append new property to existing properties for this id in the hash table # If it doesn't exist already, it will be created
}
foreach my $key (keys %ids) {
deduplicate properties
print/display/do whatever you need to do with the result
}
This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem.
A more sophisticated approach would be to use a hashtable of hashtables to do the deduplication in the initial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.
Check out this question for a discussion on how to do the deduplication.
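As a rough illustration of the hashtable-of-hashtables idea, the sketch below deduplicates while reading. The sample lines are made up (one record is repeated on purpose), and a plain split stands in for real CSV parsing.

```perl
use strict;
use warnings;

# Hypothetical sample lines, including one repeated record.
my @lines = ( "550672,1", "656372,1", "550672,2", "550672,1" );

my %ids;
for my $line (@lines) {
    my ( $id, $property ) = split /,/, $line;
    $ids{$id}{$property} = 1;    # the inner hash keeps each property once
}

for my $id ( sort keys %ids ) {
    my @properties = sort { $a <=> $b } keys %{ $ids{$id} };
    print "$id," . join( ';', @properties ) . "\n";
}
```

Because hash keys are unique, a repeated ID and property pair simply overwrites the same entry, so no separate deduplication pass is needed.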
Well, open the file as stdin in perl, assume each row is of two columns, then iterate over all lines using left column as hash identifier, and gathering right column into an array pointed by a hash key. At the end of input file you'll get a hash of arrays, so iterate over it, printing a hash key and assigned array elements separated by ";" or any other sign you wish.
and here you go
dtpwmbp:~ pwadas$ cat input.txt
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl
#!/opt/local/bin/perl
my %hash;
while (<>)
{
    chomp;
    my($key, $value) = split /,/;
    push @{$hash{$key}} , $value ;
}
foreach my $key (sort keys %hash)
{
    print $key . "," . join(";", @{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
Another (not perl) way which incidentally is shorter and more elegant:
#!/opt/local/bin/gawk -f
BEGIN {FS=OFS=",";}
NF > 0 { IDs[$1]=IDs[$1] ";" $2; }
END { for (i in IDs) print i, substr(IDs[i], 2); }
The first line (after specifying the interpreter) sets the input FIELD SEPARATOR and the OUTPUT FIELD SEPARATOR to the comma. The second line checks whether we have more than zero fields and, if so, makes the ID number ($1) the key and appends $2 to its value. This is done for all lines.
The END statement will print these pairs out in an unspecified order. If you want to sort them, you have the option of using the asorti GNU awk function or piping the output of this snippet to sort -t, -k1n,1n.