Filter duplicates explanation - perl

I'm a Perl beginner trying to filter duplicate entries. I found something online that magically works for me, but I don't quite understand it. Could someone please explain it in detail?
my %seen;
grep !$seen{$_}++, @_;

It might help to see this written out in more detail.
# Hash to keep track of what you've seen
my %seen;
# Array to store the first occurrence of each value
my @values;
foreach my $x (@_) {
    # If we haven't seen this value already
    if (!$seen{$x}) {
        # Push this value onto @values
        push @values, $x;
    }
    # Increment the value in %seen to say we've seen this value
    $seen{$x}++;
}
# At the end, the unique values are in @values

%seen is a hash variable.
@_ is an array that holds all the parameters passed to the current subroutine.
The first time the loop sees an element, that element has no key in %seen, so the test passes and the element is kept. The next time the loop sees the same element, its key exists in the hash and its value is true, so the element is skipped and the loop moves on to the next one.
You can find more details in: What are the differences between $, @, % in Perl variable declaration?
and here
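The one-liner from the question can be tried out in a small subroutine. This is only a minimal sketch: the name unique_values and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper name; returns its arguments with duplicates removed,
# keeping the first occurrence of each value in its original order.
sub unique_values {
    my %seen;
    # $seen{$_}++ yields the old count (undef, i.e. false, the first time)
    # and then increments it, so !$seen{$_}++ is true only on first sighting.
    return grep { !$seen{$_}++ } @_;
}

my @unique = unique_values(qw(apple pear apple fig pear));
print "@unique\n";    # prints "apple pear fig"
```

Because grep preserves input order, the result keeps the values in the order they first appeared.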

Related

Printing hash of array as values with Dumper leads to infinite recursion

The following code leads to infinite recursion. If printed without Dumper, it works as expected. Why does Dumper cause the recursion?
use Data::Dumper;
my %hash1 = (
'key1' => ['val1']
);
while (my ($key, $value) = each %hash1) {
print Dumper \%hash1;
}
each is to blame here. You're using it in your loop, and Data::Dumper uses it to iterate over the contents of hashes.
From the documentation of each (Emphasis added):
The iterator used by each is attached to the hash or array, and is shared between all iteration operations applied to the same hash or array. Thus all uses of each on a single hash or array advance the same iterator location. All uses of each are also subject to having the iterator reset by any use of keys or values on the same hash or array, or by the hash (but not array) being referenced in list context. This makes each-based loops quite fragile: it is easy to arrive at such a loop with the iterator already part way through the object, or to accidentally clobber the iterator state during execution of the loop body. It's easy enough to explicitly reset the iterator before starting a loop, but there is no way to insulate the iterator state used by a loop from the iterator state used by anything else that might execute during the loop body. To avoid these problems, use a foreach loop rather than while-each.
Like so:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
my %hash1 = (
'key1' => ['val1']
);
foreach my $key (keys %hash1) {
print Dumper \%hash1;
}
Basically, don't use each unless you have full control over what's done with the hash in the body of the loop over its results.
The exact cause of an infinite loop is explained with this bit of documentation:
After each has returned all entries from the hash or array, the next call to each returns the empty list in list context and undef in scalar context; the next call following that one restarts iteration.
The each loop in the Dumper() code first resets the internal iterator, and then repeats until it returns an empty list; then each is called again in the test of your while loop and starts afresh with the first (only) element of the hash, and repeats forever.
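The shared-iterator behaviour can be reproduced without Data::Dumper. In the sketch below, walk_hash is a hypothetical stand-in for what Dumper is described as doing above (reset the iterator, then call each until it is exhausted), and the iteration limit exists only so that the demonstration terminates.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the inner consumer: reset the hash's shared
# iterator, then call each until it reports the end of the hash.
sub walk_hash {
    my ($href) = @_;
    keys %$href;    # keys in void context resets the iterator
    my $count = 0;
    while (defined(my $key = each %$href)) {
        $count++;
    }
    return $count;
}

# Outer while-each loop over the same hash, with a guard ($limit) added
# purely so this example stops.
sub count_outer_iterations {
    my ($href, $limit) = @_;
    keys %$href;    # start from a known iterator state
    my $iterations = 0;
    while (my ($key, $value) = each %$href) {
        walk_hash($href);    # exhausts the iterator; the next each restarts
        last if ++$iterations >= $limit;
    }
    return $iterations;
}

my %hash1 = (key1 => ['val1']);
my $n = count_outer_iterations(\%hash1, 5);
print "$n\n";    # prints 5: only the guard ends the loop
```

Replacing the outer loop with foreach my $key (keys %hash1) builds the key list once up front, so the inner walk can no longer disturb it.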

Perl Array Reference with hash reference

my $var1=[{'a'=>'1','b'=>'2'},1];
print @$var1[0]->{a};
It will print 1.
But if I print like below:
print @$var1->{a};
it gives an error like below:
Can't use an undefined value as a HASH reference;
Can anyone explain the difference between the two print statements?
@$var1[0]->{a}
is usually written as
$var1->[0]{a}
The second syntax, though, is different.
@$var1->{a}
is equivalent to
@{$var1}->{a};
You can't dereference an array (@{$var1}) as a hash. Another question is why undef is reported, to which I don't know the answer.
In the first statement you print the value of key 'a' of the first element of the array that $var1 refers to.
In the second statement you try to print the value of key 'a' of the array itself, and get an error because an array doesn't have keys.
Hope this helps.
my $var1=[{'a'=>'1','b'=>'2'},1];
$var1 is an array reference that contains a hash reference at index 0 and a scalar at index 1.
To dereference $var1 as an array, we have to use @$var1 (which gives the 2-element array).
To access a single element, we have to use $$var1[0] or $var1->[0].
Again, $var1->[0] is a hash reference.
To dereference it, we have to use $var1->[0]{'a'}.
But the statement "@$var1->{'a'}" is invalid, since the hash reference is at index 0 of the array "@$var1".
All references are scalars; an array cannot be dereferenced as a hash reference.
For more information, please refer to:
Perl Data Structures Cookbook
Bless my Referents
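The equivalent spellings discussed in the answers can be checked side by side. This is just an illustrative sketch using the data structure from the question; the variable names other than $var1 are made up.

```perl
use strict;
use warnings;

my $var1 = [ { a => '1', b => '2' }, 1 ];

my $arrow    = $var1->[0]{a};       # arrow syntax, the usual spelling
my $explicit = ${$var1}[0]->{a};    # explicit block dereference
my $short    = $$var1[0]{a};        # the same thing without the braces

my @elements = @$var1;                        # whole array: (hashref, 1)
my @hash_keys = sort keys %{ $var1->[0] };    # keys of the inner hash

print "$arrow $explicit $short\n";    # prints "1 1 1"
print scalar(@elements), "\n";        # prints 2
print "@hash_keys\n";                 # prints "a b"
```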

While loop and diamond operator in Perl

I am trying to input a text file to a Perl program and reverse the order of its lines, i.e. the last line becomes the first, the second-to-last becomes the second, etc. I am using the following code:
#!C:\Perl64\bin
$k = 0;
while (<>){
    print "the value of i is $i";
    @array[k] = $_;
    ++$k;
}
print "the array is @array";
But for some reason, my array is only printing the last line of the text file.
Any suggestions?
Typically, rather than keep a separate array index, perl programs use the push operator to push a string onto an array. One way to do this in your program:
push @array, $_;
If you really want to do it by array index, then you need to use the following syntax:
$array[$k] = $_;
Notice the $ rather than @ in front. This tells perl that you're dealing with a single element from the array, not multiple elements. @array gives you the entire array, while $array[$k] gives you a single element. (There is a more advanced topic called "slices," but let's not get into that here. I will say that @array[$k] gives you a slice, and that isn't what you want here.)
If you really just want to slurp the entire file into an array, you can do that in one step:
@array = ( <> );
That will read the entire file into @array in one step.
You might have noticed I omitted/ignored your print statement. I'm not sure what it's doing printing out a variable named $i, since it didn't seem connected at all to the rest of the code. I reasoned it was debug code you had added, and not really relevant to the task at hand.
Anyway, that should get your input into @array. Now reversing the array... There are many ways you could do this in perl, but I'll let you discover those yourself.
Instead of:
@array[k] = $_;
you want:
$array[$k] = $_;
To reference the scalar variable $k, you need the $ on the front. Without that it is interpreted as the literal string 'k', which when used as an array index would be interpreted as 0 (since a non-numeric string will be interpreted as 0 in a numeric context).
So, each time around the loop you are setting the first element to the line read in (overwriting the value set in the previous iteration).
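The overwriting effect can be seen in isolation. The sketch below uses made-up input lines and the string 'k' as the index to mimic the bareword from the question; the numeric warning is silenced only to keep the output clean.

```perl
use strict;
use warnings;

my @input = ("first\n", "second\n", "third\n");

# Mimic the original bug: a non-numeric string used as an array index is
# treated as 0, so every assignment lands in element 0.
my @buggy;
{
    no warnings 'numeric';    # silence the "isn't numeric" warning
    $buggy['k'] = $_ for @input;
}

# The corrected loop uses the scalar $k as the index.
my @fixed;
my $k = 0;
for my $line (@input) {
    $fixed[$k] = $line;
    ++$k;
}

print scalar(@buggy), " element(s) in the buggy array\n";    # prints 1
print scalar(@fixed), " element(s) in the fixed array\n";    # prints 3
```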
A few other tips:
@array[ ] is actually the syntax for an array slice rather than a single element. It works in this case because you are assigning to a slice of 1. The usual syntax for accessing a single element would be $array[ ].
I recommend placing 'use strict;' at the top of your script - you would have gotten an error pointing out the incorrect reference to $k.
Instead of using an index variable, you could push the values onto the end of the array, e.g.:
while (<>) {
    push @array, $_;
}
Accept input until it finds the word end
Solution 1
#!/usr/bin/perl
while(<>) {
    last if $_=~/end/i;
    push @array,$_;
}
# Pop once per stored line; stopping at $i>0 avoids an extra pop on an empty array
for (my $i=scalar(@array);$i>0;$i--){
    print pop @array;
}
Solution 2
while(<>){
    last if $_=~/end/i;
    push @array,$_;
}
print reverse(@array);
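To try the read-then-reverse idea without typing input by hand, the same logic can be wrapped in a subroutine and fed from an in-memory filehandle. This is a sketch only: the helper name and the sample text are invented, and the stop condition is the same loose /end/i match used in the solutions above.

```perl
use strict;
use warnings;

# Hypothetical helper: read lines from a filehandle until a line
# containing "end" (any case), and return them in reverse order.
sub reversed_lines_until_end {
    my ($fh) = @_;
    my @lines;
    while (my $line = <$fh>) {
        last if $line =~ /end/i;
        push @lines, $line;
    }
    return reverse @lines;
}

# An in-memory filehandle stands in for STDIN or a real file here.
my $text = "one\ntwo\nthree\nEND\nignored\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @reversed = reversed_lines_until_end($fh);
close $fh;

print @reversed;    # prints three, two, one (one per line)
```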

How to change an array into a hashtable?

I'm trying to make a program where I read in a file with a bunch of text in it. I then take punctuation out and then I read in a file that has stop words in it. Both get read in and put into arrays. I'm trying to put the array of the general text file and put it in a hash. I'm not really sure what I'm doing wrong, but I'm trying. I want to do this so I can generate stats on how many words are repeated and what not, but I have to take out stop words and such.
Anyway here is what I have so far I put a comment #WORKING ON MERGING ARRAY INTO HASH that is where I'm working at. I don't think the way I'm trying to put the array into the hash is right, but I looked online and the %hash{array} = "value"; doesn't compile. so not sure how else to do it.
Thanks, if you have any questions for me I will respond back quickly.
#!/usr/bin/perl
use strict;
use warnings;
#Reading in the text file
my $file0="data.txt";
open(my $filehandle0,'<', $file0) || die "Could not open $file0\n";
my @words;
while (my $line = <$filehandle0>){
    chomp $line;
    my @word = split(/\s+/, $line);
    push(@words, @word);
}
for (@words) {
    s/[\,|\.|\!|\?|\:|\;]//g;
}
my %words_count; #The code I was told to add in this post.
$words_count{$_}++ for @words;
Next I read in the stop words I have in another array.
#Reading in the stopwords file
my $file1 = "stoplist.txt";
open(my $filehandle1, '<',$file1) or die "Could not open $file1\n";
my @stopwords;
while(my $line = <$filehandle1>){
    chomp $line;
    my @linearray = split(" ", $line);
    push(@stopwords, @linearray);
}
for my $w (my @stopwords) {
    s/\b\Q$w\E\B//ig;
}
Some notes about hashes in Perl... Problem description:
Anyway here is what I have so far I put a comment #WORKING ON MERGING ARRAY INTO HASH that is where I'm working at. I don't think the way I'm trying to put the array into the hash is right, but I looked online and the %hash{array} = "value"; doesn't compile. so not sure how else to do it.
At first, ask yourself why you want to "put the array into the hash". An array represents a list of values while a hash represents a set of key-value pairs. So you have to define what keys and values should be. Not only for us, but for you. It often helps to explain even simple things to get a better understanding.
In this case, you may want to count how often a given word $word occurred in your @words array. This could be done by iterating over all words and increasing $count{$word} by one each time. This is what @raina77ow did in his answer. What matters here is that you're accessing single hash values, which are written with the scalar sigil $ in Perl. So if you have a hash named %count, you can increase the value for the key 'foo' by
$count{foo}++;
Your result of "online looking" above (%hash{array} = "value") doesn't make sense. There are three valid ways to store values in a hash:
set all key-value pairs by assigning an even-sized list to the whole hash:
%count = (hello => 42, world => 17);
set a single value for a given key by assigning a scalar to that key (this is what we did before):
$count{hello} = 42;
set a list of values for a given list of keys using a so-called hash slice:
@count{qw(hello world)} = (42, 17);
Note the use of sigils here: % for a hashy even-sized list of keys and values mixed, $ for single (scalar) values and @ for lists of values. In your example you're using %, but put an array in the key braces {...} and assign a single scalar value.
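The three forms listed above can be combined in one short script. This is only a sketch; the keys perl, foo and bar are made-up sample data added to show that the later forms extend the hash rather than replace it.

```perl
use strict;
use warnings;

my %count;

# 1. Assign an even-sized list to the whole hash (replaces all contents).
%count = (hello => 42, world => 17);

# 2. Assign a single value: scalar sigil $ with {...}.
$count{perl} = 5;

# 3. Hash slice: @ sigil with {...} sets several keys at once.
@count{qw(foo bar)} = (1, 2);

# Print the pairs in sorted key order.
for my $key (sort keys %count) {
    print "$key => $count{$key}\n";
}
```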
Well, if you have a list of words in the @words array and want a hash where each key is a specific word and each value is the number of times that word appears in the source array, it's as simple as...
my %words_count;
$words_count{$_}++ for @words;
In other words (no pun intended), you iterate over the @words array, increasing the corresponding element of the %words_count hash by 1 for each member OR, when that element is not yet defined, essentially creating it with value 1 (so-called auto-vivification).
As a side note, calling the keys function on arrays is close to meaningless: in 5.12+ it'll give you the list of indexes used instead, and before that, it throws a syntax error at you.
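Putting the counting idiom together with the stop-word list from the question might look like the sketch below. It is not the asker's program: the helper name count_words, the lower-casing, and the sample data are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count words, skipping any that appear in the stop list.
sub count_words {
    my ($words, $stopwords) = @_;

    # Build a lookup hash from the stop list with a hash slice.
    my %is_stopword;
    @is_stopword{ map { lc $_ } @$stopwords } = ();

    my %words_count;
    for my $word (@$words) {
        my $clean = lc $word;
        $clean =~ s/[,.!?:;]//g;    # strip the punctuation the question removes
        next if $clean eq '' or exists $is_stopword{$clean};
        $words_count{$clean}++;     # auto-vivifies the entry on first use
    }
    return \%words_count;
}

my $counts = count_words([qw(The cat saw the other cat.)], [qw(the a)]);
print "$_: $counts->{$_}\n" for sort keys %$counts;
```

Using a hash for the stop list makes each membership check a single lookup instead of a scan over the whole stop-word array.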

A Perl script to process a CSV file, aggregating properties spread over multiple records

Sorry for the vague question, I'm struggling to think how to better word it!
I have a CSV file that looks a little like this, only a lot bigger:
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
The values in the first column are ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin producing strings like that for all the ID numbers? My ideal output would be a new CSV file which looks something like:
550672,1;2;3;4
656372,1;2
766153,1;4
etc.
I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution, I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to really search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!
I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience.
It is best to use Text::CSV whenever you are processing CSV data, as all the debugging has already been done for you.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
    $csv->parse($line) or die "Invalid data line";
    my ($key, $val) = $csv->fields;
    push @{ $data{$key} }, $val;
}
for my $id (sort keys %data) {
    printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}
output
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Firstly props for seeking an approach not a solution.
As you've probably already found with perl, There Is More Than One Way To Do It.
The approach I would take would be:
use strict; # will save you big time in the long run
my %ids # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
split line into ID and property variable # google the split function
append new property to existing properties for this id in the hash table # If it doesn't exist already, it will be created
}
foreach my $key (keys %ids) {
deduplicate properties
print/display/do whatever you need to do with the result
}
This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem.
A more sophisticated approach would be to use a hash table of hash tables to do the deduplication in the initial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.
Check out this question for a discussion on how to do the deduplication.
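The hash-of-hashes variant mentioned above could be sketched roughly as follows. The helper name and sample rows are made up, and the plain split on commas assumes simple unquoted fields; for real CSV data a module such as Text::CSV is the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: takes "id,property" lines and returns "id,p1;p2;..." rows.
sub aggregate_properties {
    my @lines = @_;
    my %ids;    # id => { property => 1 }, so repeated pairs collapse
    for my $line (@lines) {
        chomp $line;
        next unless length $line;
        my ($id, $property) = split /,/, $line;
        $ids{$id}{$property} = 1;
    }
    my @rows;
    # Sort IDs and properties numerically for a stable output order.
    for my $id (sort { $a <=> $b } keys %ids) {
        my @properties = sort { $a <=> $b } keys %{ $ids{$id} };
        push @rows, join(',', $id, join(';', @properties));
    }
    return @rows;
}

my @rows = aggregate_properties("550672,1\n", "656372,1\n", "550672,2\n", "550672,1\n");
print "$_\n" for @rows;    # prints "550672,1;2" then "656372,1"
```

Because the properties are hash keys, the duplicate "550672,1" line is absorbed while reading and no separate deduplication pass is needed.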
Well, open the file as stdin in perl, assume each row is of two columns, then iterate over all lines using left column as hash identifier, and gathering right column into an array pointed by a hash key. At the end of input file you'll get a hash of arrays, so iterate over it, printing a hash key and assigned array elements separated by ";" or any other sign you wish.
and here you go
dtpwmbp:~ pwadas$ cat input.txt
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl
#!/opt/local/bin/perl
my %hash;
while (<>)
{
    chomp;
    my($key, $value) = split /,/;
    push @{$hash{$key}} , $value ;
}
foreach my $key (sort keys %hash)
{
    print $key . "," . join(";", @{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
Another (not perl) way which incidentally is shorter and more elegant:
#!/opt/local/bin/gawk -f
BEGIN {FS=OFS=",";}
NF > 0 { IDs[$1]=IDs[$1] ";" $2; }
END { for (i in IDs) print i, substr(IDs[i], 2); }
The first line (after specifying the interpreter) sets the input FIELD SEPARATOR and the OUTPUT FIELD SEPARATOR to the comma. The second line checks whether the record has more than zero fields and, if so, uses the ID number ($1) as the key and appends $2 to that key's value. This is done for all lines.
The END statement will print these pairs out in an unspecified order. If you want to sort them, you have the option of the asorti GNU awk function or of piping the output of this snippet to sort -t, -k1n,1n.