remove duplicate values for a key in hash - perl

I have the following code
chdir("c:/perl/normalized");
$docid=0;
my %hash = ();
@files = <*>;
foreach $file (@files)
{
$docid++;
open (input, $file);
while (<input>)
{
open (output,'>>c:/perl/tokens/total');
chomp;
(@words) = split(" ");
foreach $word (@words)
{
push @{ $hash{$word} }, $docid;
}
}
}
foreach $key (sort keys %hash) {
print output"$key : @{ $hash{$key} }\n";
}
close (input);
close (output);
This is a sample output in a file
of : 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 3 3 4 4 4 4 5 6 6 7 7 7 7 7 7 7 7 7
This is correct, since the term "of", for example, occurs 10 times in the first document (hence the ten ones).
However, is there a way to remove the repeated values, i.e. instead of ten ones I want just one?
Thank you for your help

To avoid adding the dups in the first place, change
foreach $word (@words)
to
foreach $word (uniq @words)
Note that this only removes repeats within a single line, because the loop runs once per line; a word that occurs on several lines of the same file is still pushed once per line.
If you want to leave the dups in the data structure, instead change
print output"$key : @{ $hash{$key} }\n";
to
print output "$key : ", join(" ", uniq @{ $hash{$key} }), "\n";
uniq is provided by List::MoreUtils.
use List::MoreUtils qw( uniq );
Or you can define it yourself:
sub uniq { my %seen; grep !$seen{$_}++, @_ }
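Because the loop in the question runs once per line, another option is to record each document id in a per-word set and only flatten it to a list at the end. The following is a minimal sketch of that idea, not the asker's code: the sub name build_index and the in-memory documents are hypothetical stand-ins for the files read in the question.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of documents (each an array ref of lines)
# and returns a hash ref mapping word => sorted list of document ids.
sub build_index {
    my @docs = @_;
    my %seen;            # $seen{$word}{$docid} = 1, i.e. a set of ids per word
    my $docid = 0;
    for my $lines (@docs) {
        $docid++;
        for my $line (@$lines) {
            $seen{$_}{$docid} = 1 for split ' ', $line;
        }
    }
    # Flatten each inner set into a numerically sorted list of ids.
    my %index;
    for my $word (keys %seen) {
        $index{$word} = [ sort { $a <=> $b } keys %{ $seen{$word} } ];
    }
    return \%index;
}

my $index = build_index(
    [ 'the end of the start of it', 'of course' ],   # document 1 (two lines)
    [ 'out of time' ],                               # document 2
);
print "$_ : @{ $index->{$_} }\n" for sort keys %$index;
```

With this sample data the entry for "of" is printed as "of : 1 2", even though the word occurs three times in the first document.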

Related

For Every Line in File Sum Values

I am trying to write a Perl script that loops through a file, uses a regex to extract the numbers surrounded by parentheses, and adds them up for every line.
#file
Awaiting_Parts_Bin(2),Inspection_Bin(1),Picked-1-3888(1),Picked-1-4364(2)
Picked-1-3890(1)
Picked-1-4364(1)
Picked-1-3888(4),Picked-1-3890(2),Picked-1-4364(1),Picked-1-7202(1)
Awaiting_Parts_Bin(1)
Desired Output
#new_file
6
1
1
8
1
Perl script
#!/usr/bin/perl
use strict;
use warnings;
my $file = '/Users/.....csv';
my $new_file = '/Users/.....csv';
open(my $fh, '<', $file)
or die "cannot open file";
open(my $new_fh, '>', $new_file)
or die "cannot open file";
my $sum = 0;
while (my $line = <$fh>){
my @arr = ( $line =~ /\(([0-9]+)\)/g);
foreach my $val ( @arr ) {
$sum += $val;
print $sum, "\n";
# this makes sense that it is resetting to zero while looping. This is just one variation I tried. I tried putting the sum=0 outside the loop and it made it a running total
$sum = 0;
}
}
No matter what I try, I can't get it right. This code doesn't add up all of the values; it only takes the last one on each line. So the output looks like this:
#output now
2
1
1
1
1
Or if I remove the $sum=0 in the for loop then it makes it a running total.
You need to reset $sum to zero at the start of the outer (per line) while loop, not in the inner (per element) foreach loop.
This can be achieved by putting the declaration and initial assignment of $sum inside the while loop:
while (my $line = <$fh>) {
my $sum = 0;
my @arr = ( $line =~ /\(([0-9]+)\)/g);
foreach my $val ( @arr ) {
$sum += $val;
}
print $sum, "\n";
}
Like this?
use strict;
use warnings 'all';
use List::Util 'sum';
while ( <> ) {
my $sum = sum /\((\d+)\)/g;
print "$sum\n" if defined $sum;
}
output
6
1
1
8
1

Perl script grep

The script prints every input line; I want it to print only the input lines that are also present in another file.
#!/usr/bin/perl -w
open("file", "text.txt");
@todd = <file>;
close "file";
while(<>){
if( grep( /^$_$/, @todd)){
#if( grep @todd, /^$_$/){
print $_;
}
print "\n";
}
if for example file contains
1
3
4
5
7
and the input file that will be read from contains
1
2
3
4
5
6
7
8
9
I would want it to print 1,3,4,5 and 7
but 1-9 are being printed instead
UPDATE:
This is my code now, and I am getting this error:
readline() on closed filehandle todd at ./may6test.pl line 3.
#!/usr/bin/perl -w
open("todd", "<text.txt");
@files = <todd>; #file looking into
close "todd";
while( my $line = <> ){
chomp $line;
if ( grep( /^$line$/, @files) ) {
print $_;
}
print "\n";
}
which makes no sense to me because I have this other script that is basically doing the same thing
#!/usr/bin/perl -w
open("file", "<text2.txt"); #
@file = <file>; #file looking into
close "file"; #
while(<>){
$temp = $_;
$temp =~ tr/|/\t/; #puts tab between name and id
my ($name, $number1, $number2) = split("\t", $temp);
if ( grep( /^$number1$/, @file) ) {
print $_;
}
}
print "\n";
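OK, the problem here is that grep sets $_ too. Inside the grep, $_ is each element of @todd rather than the line you just read, so /^$_$/ matches every element against itself, and grep { /^$_$/ } @array will almost always give you every element in the array.
At a basic level, you need to read each line into its own variable: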
while ( my $line = <> ) {
chomp $line;
if ( grep { /^$line$/ } @todd ) {
#do something
}
}
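To see the effect described above in isolation, here is a small sketch. The helper name count_exact is hypothetical, and the \Q...\E quoting is an addition that is not in the answer; it keeps regex metacharacters in the input line from being interpreted as part of the pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: counts the entries of @$list that are exactly equal to $line.
sub count_exact {
    my ($line, $list) = @_;
    # \Q...\E quotes regex metacharacters in $line; \z anchors at the true end of string.
    return scalar grep { /^\Q$line\E\z/ } @$list;
}

my @todd = ('1', '3', '4');

# Inside grep, $_ is each element of @todd, so every element matches itself.
my $self = grep { /^$_$/ } @todd;
print "self-match count: $self\n";          # 3, whatever the input line was

# Using a named variable compares against the line we actually care about.
print "$_: ", count_exact($_, \@todd), "\n" for '1', '2', '.';
```

The last line reports one match for '1' and none for '2' or '.'; without the quoting, a line consisting of '.' would match every single-character entry.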
But I'd suggest instead that you might want to consider building a hash of your lines instead:
open( my $input, '<', "text.txt" ) or die $!;
my %in_todd = map { $_ => 1 } <$input>;
close $input;
while (<>) {
print if $in_todd{$_};
}
Note - you might want to watch for trailing linefeeds.
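One way to deal with the trailing linefeeds is to strip the line ending on both sides before comparing, so that a final line without a newline, or a file with CRLF endings, still matches. This is only a sketch under that assumption: the sub name lines_in_both and the sample data are made up for illustration, and \R needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines of @$input that also occur in @$wanted.
# Line endings are stripped from both lists so "7" and "7\n" compare equal.
sub lines_in_both {
    my ($wanted, $input) = @_;
    my %in_wanted;
    for my $line (@$wanted) {
        (my $copy = $line) =~ s/\R\z//;   # strip one trailing line ending (LF or CRLF)
        $in_wanted{$copy} = 1;
    }
    my @found;
    for my $line (@$input) {
        (my $copy = $line) =~ s/\R\z//;
        push @found, $copy if $in_wanted{$copy};
    }
    return @found;
}

my @todd  = ("1\n", "3\n", "4\n", "5\n", "7");   # last line has no newline
my @input = map "$_\n", 1 .. 9;
print "$_\n" for lines_in_both(\@todd, \@input);  # prints 1, 3, 4, 5 and 7, one per line
```

The hash lookup also avoids rescanning the whole list for every input line, which is what the grep version does.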

How to only count words containing only A-Z and a-z?

Okay, first of all, here is my code:
#!/usr/bin/perl
use open qw(:utf8 :std);
use utf8;
print "Which file do you want to search?\n";
$file = <>;
if ($file =~ /^\s*$/) {
$file = "test.txt";
}
open (FILE, $file) or die("Could not open file.");
%hash;
while (<FILE>) {
$hash{$_}++ for split /\W+/;
}
$count = 0;
for (sort {
$hash{$b} <=> $hash{$a}
||
lc($a) cmp lc($b)
||
$a cmp $b
} keys %hash )
{
next unless /\w/;
printf "%-20s %5d\n", $_, $hash{$_} if ($count <= 9);
$count++;
}
I only want to count words containing only A-Z and a-z but this code also counts numbers. What must I do?
This is an example of the output:
Car 18
5 11
Test 11
Task 10
Perl 7
School 6
Hi 5
Tired 5
Word 4
bye 3
As you can see, the number 5 is listed which isn't supposed to happen.
Thanks!
++$hash{$_} for grep /^[a-zA-Z]+\z/, split /\W+/;
Of course, you probably meant words that only contain letters.
++$hash{$_} for grep /^\pL+\z/, split /\W+/;
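The difference between the two patterns only shows up with non-ASCII letters. The sketch below wraps the counting in a hypothetical sub count_words and runs it on a made-up sample string, once with each pattern.

```perl
use strict;
use warnings;
use utf8;                                   # the source below contains a non-ASCII literal
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: counts the words in $text that match the pattern $re.
sub count_words {
    my ($text, $re) = @_;
    my %count;
    ++$count{$_} for grep { $_ =~ $re } split /\W+/, $text;
    return \%count;
}

my $text  = "Car 5 Car café 42 Test";
my $ascii = count_words($text, qr/^[a-zA-Z]+\z/);   # ASCII letters only
my $any   = count_words($text, qr/^\pL+\z/);        # any Unicode letter

print join(' ', sort keys %$ascii), "\n";   # Car Test
print join(' ', sort keys %$any),   "\n";   # Car Test café
```

In both cases the purely numeric tokens 5 and 42 are dropped, which is what the question asks for.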

List::Util - reduce - length - encoding - question

Why do I get a wrong result with the first reduce example?
test.txt
__BE
bb bbbbbbbbbbbbbbb
aaaaaa
test.pl
#!/usr/bin/env perl
use warnings; use 5.012;
use open ':encoding(UTF-8)';
use List::Util qw(reduce);
use Encode;
my( @list, $longest, $len );
open my $fh, '<', 'test.txt' or die $!;
while( my $line = readline( $fh ) ) {
chomp $line;
push @list, split( /\s+/, $line );
}
close $fh;
$longest = reduce{ length($a) > length($b) ? $a : $b } @list;
$len = length $longest;
say $longest; # aaaaaa
say $len; # 6
$longest = reduce{ length(Encode::encode_utf8($a)) > length(Encode::encode_utf8($b)) ? $a : $b } @list;
$len = length(Encode::encode_utf8($longest));
say $longest; # bbbbbbbbbbbbbbb
say $len; # 15
$longest = $list[0];
$len = length $longest;
for my $str (@list) {
if ( length($str) > $len ) {
$longest = $str;
$len = length($str);
}
}
say $longest; # bbbbbbbbbbbbbbb
say $len; # 15
AFAICS, it might even be a bug in Perl...it certainly isn't obvious that it is behaving correctly. I modified the first reduce to print diagnostics as it goes:
#!/usr/bin/env perl
use warnings; use 5.012;
use open ':encoding(UTF-8)';
use List::Util qw(reduce);
use Encode;
my( @list, $longest, $len );
open my $fh, '<', 'test.txt' or die $!;
while( my $line = readline( $fh ) ) {
chomp $line;
push @list, split( /\s+/, $line );
}
close $fh;
$longest = reduce { say "<<$a>>/<<$b>> : ", length($a), " : ", length($b);
length($a) > length($b) ? $a : $b } @list;
$len = length $longest;
say $longest; # aaaaaa
say $len; # 6
$longest = reduce { length(Encode::encode_utf8($a)) > length(Encode::encode_utf8($b)) ? $a : $b } @list;
$len = length(Encode::encode_utf8($longest));
say $longest; # bbbbbbbbbbbbbbb
say $len; # 15
$longest = $list[0];
$len = length $longest;
for my $str (@list) {
if ( length($str) > $len ) {
$longest = $str;
$len = length($str);
}
}
say $longest; # bbbbbbbbbbbbbbb
say $len; # 15
When run on MacOS X (10.6.5) using Perl 5.13.4, the output I get is:
<<>>/<<__BE>> : 0 : 4
<<__BE>>/<<>> : 0 : 0
<<>>/<<bb>> : 0 : 2
<<bb>>/<<bbbbbbbbbbbbbbb>> : 0 : 15
<<bbbbbbbbbbbbbbb>>/<<>> : 0 : 0
<<>>/<<aaaaaa>> : 0 : 6
aaaaaa
6
bbbbbbbbbbbbbbb
15
bbbbbbbbbbbbbbb
15
To all appearances, the first argument to the first reduce is always a zero length string, even on those odd occasions when it contains some data.
If the 'use open ':encoding(UTF-8)';' line is removed, then it behaves sanely.
<<>>/<<__BE>> : 0 : 4
<<__BE>>/<<>> : 4 : 0
<<__BE>>/<<bb>> : 4 : 2
<<__BE>>/<<bbbbbbbbbbbbbbb>> : 4 : 15
<<bbbbbbbbbbbbbbb>>/<<>> : 15 : 0
<<bbbbbbbbbbbbbbb>>/<<aaaaaa>> : 15 : 6
bbbbbbbbbbbbbbb
15
bbbbbbbbbbbbbbb
15
bbbbbbbbbbbbbbb
15
That might suggest that the bug is somewhere in the interaction of file I/O, UTF-8 encoding and List::Util. On the other hand, it could be somewhere more obscure. But my impression is that you have a test case that is reproducible and could be reported as a possible bug somewhere in Perl and its core modules.
I've reported this as a bug in List::Util after trying to modify this program.

How can I iterate through nested arrays?

I have created an array as follows
while (defined ($line = <STDIN>))
{
chomp ($line);
push @stack,($line);
}
each line has two numbers.
15 6
2 8
How do I iterate over each item in each line?
I.e. I want to print
15
6
2
8
I understand it's something like
foreach (@{stack}) (@stack){
print "?????
}
This is where I am stuck.
See the perldsc documentation. That's the Perl Data Structures Cookbook, which has examples for dealing with arrays of arrays. From what you're doing though, it doesn't look like you need an array of arrays.
For your problem of taking two numbers per line and outputting one number per line, just turn the whitespace into newlines:
while( <> ) {
s/\s+/\n/g; # turn all whitespace runs into newlines
print; # it's ready to print
}
Since Perl 5.10, you can use the \h character class, which matches only horizontal whitespace:
while( <> ) {
s/\h+/\n/g; # turn all horizontal whitespace runs into newlines
print; # it's ready to print
}
As a Perl one-liner, that's just:
% perl -pe 's/\h+/\n/g' file.txt
#!/usr/bin/perl
use strict;
use warnings;
while ( my $data = <DATA> ) {
my @values = split ' ', $data;
print $_, "\n" for @values;
}
__DATA__
15 6
2 8
Output:
C:\Temp> h
15
6
2
8
Alternatively, if you want to store each line in @stack and print out later:
my @stack = map { [ split ] } grep { chomp; length } <DATA>;
The line above slurps everything coming from the DATA filehandle into a list of lines (because <DATA> happens in list context). The grep chomps each line and filters by length after chomping (to avoid getting any trailing empty lines in the data file -- you can avoid it if there are none). The map then splits each line along spaces, and then creates an anonymous array reference for each line. Finally, such array references are stored in each element of @stack. You might want to use Data::Dumper to look at @stack to understand what's going on.
print join("\n", @$_), "\n" for @stack;
Now, we loop over each entry in @stack, dereferencing each array in turn, then joining the elements of each array with newlines to print one element per line.
Output:
C:\Temp> h
15
6
2
8
The long way of writing essentially the same thing (with less memory consumption) would be:
my @stack;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
my @values = split ' ', $line;
push @stack, \@values;
}
for my $ref ( @stack ) {
print join("\n", @$ref), "\n";
}
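The array of array references built above can also be walked with two explicit nested loops, which is closest to what the question asks for. A minimal sketch follows; the sub name parse_lines and the inline sample lines are hypothetical and stand in for reading from a filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper: splits each line on whitespace and returns a list of
# array references, one inner array per non-blank line.
sub parse_lines {
    my @stack;
    for my $line (@_) {
        next unless $line =~ /\S/;          # skip blank lines
        push @stack, [ split ' ', $line ];
    }
    return @stack;
}

my @stack = parse_lines("15 6\n", "2 8\n");

# Outer loop: one array reference per input line.
for my $row (@stack) {
    # Inner loop: dereference the row and visit each item.
    for my $item (@$row) {
        print "$item\n";
    }
}
```

This prints 15, 6, 2 and 8 on separate lines, and the inner loop body is the place to do anything else with each item.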
Finally, if you wanted to do something other than printing all the values, say, sum all the numbers, you should store one value per element of @stack:
use List::Util qw( sum );
my @stack;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
my @values = split ' ', $line;
push @stack, @values;
}
printf "The sum is %d\n", sum @stack;
#!/usr/bin/perl
while ($line = <STDIN>) {
chomp ($line);
push @stack, $line;
}
# prints each line
foreach $line (@stack) {
print "$line\n";
}
# splits each line into items using ' ' as separator
# and prints the items
foreach $line (@stack) {
@items = split / /, $line;
foreach $item (@items) {
print $item . "\n";
}
}
I use 'for' for "C" style loops, and 'foreach' for iterating over lists.
#!/usr/bin/perl
use strict;
use warnings;
open IN, "< read.txt" or
die "Can't read in 'read.txt'!";
my $content = join '', <IN>;
while ($content =~ m`(\d+)`g) {
print "$1\n";
}