Push to array not working inside loop - perl

AIM:
I am trying to count a value in "column" 20 in a text file, then print the number of occurrences with the values from the line in the text file. Some of the lines will be identical, with the exception of "column" 0 (first column). I am trying to use hashes (though I have limited understanding of how to use hashes).
PROBLEM:
While doing push in a sub function (inside a foreach loop) the value is not being pushed to an array outside the loop, and hence the output will not be saved to file. Printing inside of the loop works (print $dummy) and all the data is being displayed.
INPUT:
Filename1 Value1a Value2a Value3a ... Column20a ... ColumnENDa
Filename2 Value1b Value2b Value3b ... Column20b ... ColumnENDb
Filename3 Value1c Value2c Value3c ... Column20a ... ColumnENDc
...
OUTPUT (using print $dummy inside loop):
2 Column20a Filename1, Filename3
1 Column20b Filename2
...
CODE:
use strict;
use warnings;
use Cwd;
use File::Find::Rule;
use File::Spec;
use File::Basename;
use Text::Template;
use File::Slurp;
use List::MoreUtils qw(uniq);
my $current_dir = cwd;
my #test_file = read_file ("test_file.txt");
my %count = ();
my %name = ();
my #test = "Counts\tName\tFile_names";
foreach (#test_file) {
chomp $_;
our(#F) = split('\t', $_, 0);
++$count{"$F[20] "};
$name{"$F[20] "} .= "$F[0]," if $F[20];
sub END {
foreach $_ (keys %name) {
$name{$_} =~ s/,$//;
my $dummy = "$count{$_}\t $_\t $name{$_}\n";
#print $dummy;
push (#test, $dummy);
}
};
}
print "#test";
write_file( 'test.txt', #test);
Why is the push function not working outside the sub (foreach loop)?

You're not actually calling your sub.
If you meant it to be the END block, it shouldn't be a sub - and you should not use END blocks unless there's a technical reason to do so.
If you mean it to be a sub, name it something else and actually call it (the name isn't an error, just looks bad - END has special meaning).
The end of your code would be (without fixing/improving it):
foreach (#test_file) {
chomp $_;
our(#F) = split('\t', $_, 0);
++$count{$F[20]};
$name{$F[20]} .= "$F[0]," if $F[20];
}
process_test();
print "#test";
write_file( 'test.txt', #test);
##########################
sub process_test {
foreach $_ (keys %name) {
$name{$_} =~ s/,$//;
my $dummy = "$count{$_}\t $_\t $name{$_}\n";
push (#test, $dummy);
}
}
As an alternative, don't even have a sub (it's not necessary for a couple of lines of code :)
foreach (#test_file) {
chomp $_;
our(#F) = split('\t', $_, 0);
++$count{$F[20]};
$name{$F[20]} .= "$F[0]," if $F[20];
}
foreach $_ (keys %name) {
$name{$_} =~ s/,$//;
my $dummy = "$count{$_}\t $_\t $name{$_}\n";
push (#test, $dummy);
}
print "#test";
write_file('test.txt', #test);
I tested this on my own version of your code, and got the following in the output file using your test input:
Counts Name File_names
2 Column20a Filename1,Filename3
1 Column20b Filename2

Why do you code the subroutine in the foreach-loop? So for every iteration through your loop you create a new one.
There is also a problem with the name of you subroutine. Actually you don't call it. And you can't call it, because perl uses END for the END block. Let me show you this with an example:
use warnings;
use strict;
END('hello world');
sub END{
my $string = shift;
print $string;
}
The purpose of the END block is to do everything between the brackets when the program ends, therefore the name.
use warnings;
use strict;
END('hello world');
sub END{
my $string = shift;
print $string;
}
Either you omit the subroutine or you declare it in a global context e.g. at the end of the program.

Related

Perl: Printing out the file where a word occurs

I am trying to write a small program that takes from command line file(s) and prints out the number of occurrence of a word from all files and in which file it occurs. The first part, finding the number of occurrence of a word, seems to work well.
However, I am struggling with the second part, namely, finding in which file (i.e. file name) the word occurs. I am thinking of using an array that stores the word but don’t know if this is the best way, or what is the best way.
This is the code I have so far and seems to work well for the part that counts the number of times a word occurs in given file(s):
use strict;
use warnings;
my %count;
while (<>) {
my $casefoldstr = lc $_;
foreach my $str ($casefoldstr =~ /\w+/g) {
$count{$str}++;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}:\n";
}
The filename is accessible through $ARGV.
You can use this to build a nested hash with the filename and word as keys:
use strict;
use warnings;
use List::Util 'sum';
while (<>) {
$count{$word}{$ARGV}++ for map +lc, /\w+/g;
}
foreach my $word ( keys %count ) {
my #files = keys %$word; # All files containing lc $word
print "Total word count for '$word': ", sum( #{ $count{$word} }{#files} ), "\n";
for my $file ( #files ) {
print "$count{$word}{$file} counts of '$word' detected in '$file'\n";
}
}
Using an array seems reasonable, if you don't visit any file more than once - then you can always just check the last value stored in the array. Otherwise, use a hash.
#!/usr/bin/perl
use warnings;
use strict;
my %count;
my %in_file;
while (<>) {
my $casefoldstr = lc;
for my $str ($casefoldstr =~ /\w+/g) {
++$count{$str};
push #{ $in_file{$str} }, $ARGV
unless ref $in_file{$str} && $in_file{$str}[-1] eq $ARGV;
}
}
foreach my $str (sort keys %count) {
printf "$str $count{$str}: #{ $in_file{$str} }\n";
}

Unexpected results for high order function

I have a higher order function that maps even position values in an array:
sub map_even(&#) {
my $block = shift;
my #res;
for $i (0..$#_) {
push #res, $i%2 ? $_[$i] : &$block($_[$i]);
}
#res;
}
print map_even {$_*$_} 1,2,3,4;
I am expecting the output to be 14316, but the actual output is
0204
Why does this happen and how can I fix this? And is there any improvement can be done to the code?
In your anonymous function you have to access first input argument via $_[0] (hint: #_ array).
use strict;
use warnings;
sub map_even(&#) {
my $block = shift;
my #res;
for my $i (0..$#_) {
push #res, $i%2 ? $block->($_[$i]) : $_[$i];
}
#res;
}
print join ",", map_even {$_[0]*$_[0]} 1,2,3,4;
output
1,4,3,16
Using $_,
sub map_even(&#) {
my $block = shift;
my #res;
for my $i (0..$#_) {
push #res, $i%2 ? $block->() : $_ for $_[$i];
# or
# local $_ = $_[$i];
# push #res, $i%2 ? $block->() : $_;
}
#res;
}
print join ",", map_even {$_*$_} 1,2,3,4;
In your map_even block, you use the special $_ variable. However, you have to set it inside your loop:
local $_ = $_[$i];
... $block->();
The $_ is a global variable and can be temporarily overridden with the local operator. The $_ has nothing to do with subroutine arguments.
About aliasing: Perls for, map and grep mostly alias $_ to the current element as a performance hack, not because this behavior would be particularly desirable. In order to perform an alias, you should localize the whole *_ typeglob which contains the $_ variable and then assign a scalar reference of the alias target to the glob:
local *_ = \$_[$i];
I would solve this one of two ways.
First, by using List::Utils's pairmap:
use strict;
use warnings;
use List::Util qw(pairmap);
my #x = (1 .. 4);
my #result = pairmap {$a, $b**2} #x;
print "#result\n";
Or more simply, by just using the indexes:
use strict;
use warnings;
my #x = (1 .. 4);
my #result = map {$_ % 2 ? $x[$_] ** 2 : $x[$_]} (0..$#x);
print "#result\n";
However, if you really wanted a new sub, I'd just setup a flip-flop:
use strict;
use warnings;
sub map_even(&#) {
my $block = shift;
my $even = 1;
map {($even ^= 1) ? $block->() : $_} #_;
}
print join " ", map_even {$_*$_} 1,2,3,4;
All output:
1 4 3 16

When trying to print an array from sub only the first element prints

I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example take out column A and compare it to another column in another file
A B C
A B C
A B C
So far I have:
sub routine1
{
( $_ = <FILE> )
{
next if $. < 2; # to skip header of file
my #array1 = split(/\t/, $_);
my $file1 = $array1[#_];
return $file1;
}
}
I have most of it done. The only problem is that when I call to print the subroutine it only prints the first element in the array (i.e. it will only print one A).
I am sure that what you actually have is this
sub routine1
{
while ( $_ = <FILE> )
{
next if $. < 2; # to skip header of file
my #array1 = split(/\t/, $_);
my $file1 = $array1[#_];
return $file1;
}
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your #array1 with #_. What that does is take the number of elements in #_ (usually one) and use that to index #array1. You will therefore always get the second element of the array.
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array #retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;
open my $fh, '<', 'myfile.txt' or die $!;
my #column2 = routine1($fh, 1);
print "#column2\n";
sub routine1 {
my ($fh, $index) = #_;
my #retval;
while ($_ = <$fh>) {
next if $. < 2; # to skip header of file
my #fields = split /\t/;
my $field = $fields[$index];
push #retval, $field;
}
return #retval;
}
output
B B
Try replacing most of your sub with something like this:
my #aColumn = ();
while (<FILE>)
{
chomp;
($Acol, $Bcol, $Ccol) = split("\t");
push(#aColumn, $Acol);
}
return #aColumn
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;
my $file = 'blah.txt';
open my $fh, '<', $file;
my #firstcol;
while (<$fh>) {
chomp;
my #cols = split;
push #firstcol, $cols[0];
}
use Data::Dump;
dd \#firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here's are a few items for you to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT. This can significantly speed up the process.
Optionally, initialize a local copy of Perl's #ARGV with the file name, and let Perl handle the file i/o.
Borodin's solution to create a subroutine that takes both the file name column number is excellent, so it's implemented below, too:
use strict;
use warnings;
my #colVals = getFileCol( 'File.txt', 0 );
print "#colVals\n";
sub getFileCol {
local #ARGV = (shift);
my ( $col, #arr ) = shift;
<>; # skip file header
while (<>) {
my $val = ( split ' ', $_, $col + 2 )[$col] or next;
push #arr, $val;
}
return #arr;
}
Output on your dataset:
A A
Hope this helps!

Reading ARGV files one at a time

I want each (small) file specified with ARGV read in its own array. If I don't test $ARGV, <> will slurp all files in a single table. Is there a better/shorter/simpler way of doing it?
# invocation: ./prog.pl *.txt
#table = ();
$current = "";
while (<>)
{
if ($ARGV ne $current)
{
#ar = ();
$current = $ARGV;
if ($current)
{
push #table, \#ar;
}
}
push #ar;
}
The eof function can be used to detect the end of each file:
#!/usr/bin/env perl
use strict;
use warnings;
my #files;
my $file_ctr = 0;
while (<>) {
chomp;
push #{ $files[$file_ctr] }, $_;
}
continue { $file_ctr++ if eof }
Relevant documentation:
In a while (<>) loop, eof or eof(ARGV) can be used to detect the
end of each file, whereas eof() will detect the end of the very last
file only.
Please always use strict and use warnings at the top of your programs, and declare variables close to their first point of use using my.
It is simplest to test end of file on the ARGV filehandle to determine when a new file is about to be opened.
This code uses a state variable $eof to record whether the previous file has been completely read to avoid unnecessarily adding a new element to the #table array when the end of the #ARGV list is reached.
use strict;
use warnings;
my #table;
my $eof = 1;
while (<>) {
chomp;
push #table, [] if $eof;
push #{$table[-1]}, $_;
$eof = eof;
}
#Alan Haggai Alavi's idea of incrementing an index at end of file instead of setting a flag is far better as it avoids the need to explicitly create an empty array at the start of each file.
Here is my take on his solution, but it is completely dependent on Alan's post and he should gete the credit for it.
use strict;
use warnings;
my #table;
my $index = 0;
while (<>) {
chomp;
push #{$table[$index]}, $_;
$index++ if eof;
}
You can leverage File::Slurp to avoid opening and closing the files yourself.
use strict;
use warnings;
use File::Slurp;
my #table = ();
foreach my $arg ( #ARGV ) {
push #table, read_file( $arg, array_ref => 1 );
}
A hash for array refs of files:
my %files;
while (<>) {
push #{$files{$ARGV}}, $_;
}

How can I print the lines in STDIN in random order in Perl?

I want to do the inverse of sort(1) : randomize every line of stdin to stdout in Perl.
I bet real Perl hackers will tear this apart, but here it goes nonetheless.
use strict;
use warnings;
use List::Util 'shuffle';
my #lines = ();
my $bufsize = 512;
while(<STDIN>) {
push #lines, $_;
if (#lines == $bufsize) {
print shuffle(#lines);
undef #lines;
}
}
print shuffle(#lines);
Difference between this and the other solution:
Will not consume all the input and then randomize it (memory hog), but will randomize every $bufsize lines (not truly random and slow as a dog compared to the other option).
Uses a module which returns a new list instead of a in place editing Fisher - Yates implementation. They are interchangeable (except that you would have to separate the print from the shuffle). For more information type perldoc -q rand on your shell.
This perl snippet does the trick :
#! /usr/bin/perl
# randomize cat
# fisher_yates_shuffle code copied from Perl Cookbook
# (By Tom Christiansen & Nathan Torkington; ISBN 1-56592-243-3)
use strict;
my #lines = <>;
fisher_yates_shuffle( \#lines ); # permutes #array in place
foreach my $line (#lines) {
print $line;
}
# fisher_yates_shuffle( \#array ) : generate a random permutation
# of #array in place
sub fisher_yates_shuffle {
my $array = shift;
my $i;
for ($i = #$array; --$i; ) {
my $j = int rand ($i+1);
next if $i == $j;
#$array[$i,$j] = #$array[$j,$i];
}
}
__END__
use List::Util 'shuffle';
print shuffle <>
Or if you worry about last lines lacking \n,
chomp(my #lines = <>);
print "$_\n" for shuffle #lines;