Reading ARGV files one at a time - perl

I want each (small) file specified in @ARGV read into its own array. If I don't test $ARGV, <> will slurp all the files into a single array. Is there a better/shorter/simpler way of doing it?
# invocation: ./prog.pl *.txt
@table = ();
$current = "";
while (<>)
{
    if ($ARGV ne $current)
    {
        @ar = ();
        $current = $ARGV;
        if ($current)
        {
            push @table, \@ar;
        }
    }
    push @ar, $_;
}

The eof function can be used to detect the end of each file:
#!/usr/bin/env perl
use strict;
use warnings;

my @files;
my $file_ctr = 0;

while (<>) {
    chomp;
    push @{ $files[$file_ctr] }, $_;
}
continue {
    $file_ctr++ if eof;
}
Relevant documentation:
In a while (<>) loop, eof or eof(ARGV) can be used to detect the
end of each file, whereas eof() will detect the end of the very last
file only.
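As an aside, the related idiom from perldoc -f eof uses the parenthesis-less eof to reset $. (the line counter) at each file boundary:
while (<>) {
    print "$ARGV, line $.: $_";
    close ARGV if eof;   # eof (without parentheses) is true at the end of each file
}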

Please always use strict and use warnings at the top of your programs, and declare variables close to their first point of use using my.
It is simplest to test end of file on the ARGV filehandle to determine when a new file is about to be opened.
This code uses a state variable $eof to record whether the previous file has been completely read, to avoid unnecessarily adding a new element to the @table array when the end of the @ARGV list is reached.
use strict;
use warnings;

my @table;
my $eof = 1;

while (<>) {
    chomp;
    push @table, [] if $eof;
    push @{ $table[-1] }, $_;
    $eof = eof;
}
@Alan Haggai Alavi's idea of incrementing an index at end of file, instead of setting a flag, is far better as it avoids the need to explicitly create an empty array at the start of each file.
Here is my take on his solution, but it is completely dependent on Alan's post and he should get the credit for it.
use strict;
use warnings;

my @table;
my $index = 0;

while (<>) {
    chomp;
    push @{ $table[$index] }, $_;
    $index++ if eof;
}
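For a quick check, the resulting structure - one array ref per input file - can be dumped after the loop:
use Data::Dumper;
print Dumper \@table;   # each element holds the lines of one input file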

You can leverage File::Slurp to avoid opening and closing the files yourself.
use strict;
use warnings;
use File::Slurp;

my @table;
foreach my $arg (@ARGV) {
    push @table, read_file( $arg, array_ref => 1 );
}

A hash of array refs, one per file:
my %files;
while (<>) {
    push @{ $files{$ARGV} }, $_;
}
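One caveat with this approach: <> shifts names off @ARGV as it opens them, so if you need to visit the files in command-line order afterwards, copy the list first. A minimal sketch:
my @names = @ARGV;   # <> empties @ARGV as it reads, so save the order first
my %files;
while (<>) {
    push @{ $files{$ARGV} }, $_;
}
for my $name (@names) {
    printf "%s: %d lines\n", $name, scalar @{ $files{$name} };
}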

Related

Opening, splitting and sorting into an array in Perl

I am a beginner programmer who has been given a weeklong assignment to build a complex program, but I am having a difficult time starting off. I have been given a set of data, and the goal is to separate it into two separate arrays by the second column, based on whether the letter is M or F.
This is the code I have thus far:
#!/usr/local/bin/perl
open (FILE, "ssbn1898.txt");
$x=<FILE>;
split/[,]/$x;
@array1=$y;
if @array1[2]="M";
print @array2;
else;
print @array3;
close (FILE);
How do I fix this? Please try to use the simplest terms possible; I started coding last week!
Thank you!
First off - you split on comma, so I'm going to assume your data looks something like this:
one,M
two,F
three,M
four,M
five,F
six,M
There are a few problems with your code:
Turn on strict and warnings. They warn you about possible problems with your code.
open is better written as open ( my $input, "<", $filename ) or die $!;
You only actually read one line from <FILE>, because assigning <FILE> to a scalar $x reads just one line (see the snippet below this list).
You don't actually insert your values into either array.
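To illustrate that context difference (a small sketch reusing the question's file name):
open my $fh, '<', 'ssbn1898.txt' or die $!;
my $one_line  = <$fh>;   # scalar context: reads a single line
my @all_lines = <$fh>;   # list context: reads all remaining lines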
So to do what you're basically trying to do:
#!/usr/local/bin/perl
use strict;
use warnings;

# define your arrays.
my @M_array;
my @F_array;

# open your file.
open (my $input, "<", 'ssbn1898.txt') or die $!;

# read the file one line at a time - this sets the implicit variable $_
# on each loop, which is what we use for the split.
while ( <$input> ) {
    # remove linefeeds
    chomp;
    # capture values from either side of the comma.
    my ( $name, $id ) = split ( /,/ );
    # test if id is M. We _assume_ that if it's not, it must be F.
    if ( $id eq "M" ) {
        # insert it into our list.
        push ( @M_array, $name );
    }
    else {
        push ( @F_array, $name );
    }
}
close ( $input );

# print the results
print "M: @M_array\n";
print "F: @F_array\n";
You could probably do this more concisely - I'd suggest perhaps looking at hashes next, because then you can associate key-value pairs.
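A minimal sketch of that hash idea, grouping names under each id letter (assuming the same ssbn1898.txt format as above):
use strict;
use warnings;

open my $input, '<', 'ssbn1898.txt' or die $!;

my %by_id;
while ( my $line = <$input> ) {
    chomp $line;
    my ( $name, $id ) = split /,/, $line;
    push @{ $by_id{$id} }, $name;   # one list of names per id letter
}
close $input;

print "M: @{ $by_id{M} || [] }\n";
print "F: @{ $by_id{F} || [] }\n";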
There's a part function in List::MoreUtils that does exactly what you want.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use List::MoreUtils 'part';

my ($f, $m) = part { (split /,/)[1] eq 'M' } <DATA>;

say "M: @$m";
say "F: @$f";

__END__
one,M,foo
two,F,bar
three,M,baz
four,M,foo
five,F,bar
six,M,baz
The output is:
M: one,M,foo
three,M,baz
four,M,foo
six,M,baz
F: two,F,bar
five,F,bar
#!/usr/bin/perl -w
use strict;
use Data::Dumper;

my @boys=();
my @girls=();
my $fname="ssbn1898.txt"; # I keep stuff like this in a scalar

open (FIN,"< $fname")
    or die "$fname:$!";

while ( my $line=<FIN> ) {
    chomp $line;
    my @f=split(",",$line);
    push @boys,$f[0]  if $f[1]=~ m/[mM]/;
    push @girls,$f[0] if $f[1]=~ m/[fF]/;
}

print Dumper(\@boys);
print Dumper(\@girls);

exit 0;
# Caveats:
# Code is not tested but should work and definitely shows the concepts
#
In fact, the same thing...
#!/usr/bin/perl
use strict;
my (@m,@f);
while(<>){
    push (@m,$1) if(/(.*),M/);
    push (@f,$1) if(/(.*),F/);
}
print "M=@m\nF=@f\n";
Or a "perl -n" (=for all lines do) variant:
#!/usr/bin/perl -n
push (#m,$1) if(/(.*),M/);
push (#f,$1) if(/(.*),F/);
END { print "M=#m\nF=#f\n";}

Perl Merge file

I have 3 or more files I need to merge; the data looks like this:
file 1
0334.45656
0334.45678
0335.67899
file 2
0334.89765
0335.12346
0335.56789
file 3
0334.12345
0335.45678
0335.98764
Expected output in file 4:
0334.89765
0334.89765
0334.89765
0334.12345
0335.67899
0335.12346
0335.56789
0335.45678
0335.98764
So far I have tried the following, but the data in the 4th file does not come out in sorted order:
#!/usr/bin/perl
my %hash;
my $outFile = "outFile.txt";
foreach $file(@ARGV)
{
    print "$file\n";
    open (IN, "$file") || die "cannot open file $!";
    open (OUT,">>$outFile") || die "cannot open file $!";
    while ( <IN> )
    {
        chomp $_;
        ($timestamp,$data) = split (/\./,$_);
        $hash{$timeStamp}{'data'}=$data;
        if (defined $hash{$timeStamp})
        {
            print "$_\n";
            print OUT "$_\n";
        }
    }
}
close (IN);
close (OUT);
I wouldn't normally suggest this, but Unix utilities should be able to handle this just fine:
cat the 3 files together.
Use sort to sort the merged file.
However, using Perl, you could just do the following:
#!/usr/bin/perl
use strict;
use warnings;

my @data;
push @data, $_ while (<>);

# Because the numbers are all equal length, alpha sort will work here
print for sort @data;
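If the values were not fixed-width, the alphabetic sort above would misorder them; a numeric comparator is the safe general form:
print for sort { $a <=> $b } @data;   # numeric compare, independent of field width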
However, as we've discussed, it's possible that the files will be extremely large. Therefore it will be more efficient, in both memory and speed, if you can take advantage of the fact that all the files are already sorted.
The following solution therefore streams the files, pulling out the next line in order on each pass of the while loop:
#!/usr/bin/perl
# Could name this catsort.pl
use strict;
use warnings;
use autodie;

# Initialize file handles
my @fhs = map {open my $fh, '<', $_; $fh} @ARGV;

# First line of each file
my @data = map {scalar <$_>} @fhs;

# Loop while a next line exists
while (@data) {
    # Pull out the next entry.
    my $index = (sort {$data[$a] cmp $data[$b]} (0..$#data))[0];
    print $data[$index];

    # Fill in next data at index.
    if (! defined($data[$index] = readline $fhs[$index])) {
        # End of that file
        splice @fhs, $index, 1;
        splice @data, $index, 1;
    }
}
Using Miller's idea in a more reusable way:
use strict;
use warnings;

sub get_sort_iterator {
    my @fhs = map {open my $fh, '<', $_ or die $!; $fh} @_;
    my @d;

    return sub {
        for my $i (0 .. $#fhs) {
            # skip to the next file handle if it is gone or we already have a value in $d[$i]
            next if !$fhs[$i] or defined $d[$i];

            # was reading from the $fhs[$i] file handle a success?
            if ( defined($d[$i] = readline($fhs[$i])) ) { chomp($d[$i]) }
            # file handle at EOF, not needed any more
            else { undef $fhs[$i] }
        }
        # compare as numbers, return undef if no more data
        my ($index) = sort {$d[$a] <=> $d[$b]} grep { defined $d[$_] } 0..$#d
            or return;

        # return the value from $d[$index], and set it to undef
        return delete $d[$index];
    };
}

my $iter = get_sort_iterator(@ARGV);
while (defined(my $x = $iter->())) {
    print "$x\n";
}
output
0334.12345
0334.45656
0334.45678
0334.89765
0335.12346
0335.45678
0335.56789
0335.67899
0335.98764
Supposing every input file is already in ascending order and has at least one line, this script merges them in ascending order:
#!/usr/bin/perl
use warnings;
use strict;
use List::Util 'reduce';

sub min_index {
    reduce { $_[$a] < $_[$b] ? $a : $b } 0 .. $#_;
}

my @fhs = map { open my $fh, '<', $_ or die $!; $fh } @ARGV;
my @data = map { scalar <$_> } @fhs;

while (@data) {
    my $idx = min_index(@data);
    print "$data[$idx]";
    if (! defined($data[$idx] = readline $fhs[$idx])) {
        splice @data, $idx, 1;
        splice @fhs, $idx, 1;
    }
}
Note: this is basically the same as the second script offered by @Miller, but a bit clearer and more concise.
I suggest this solution, which uses a sorted array of hashes - each hash corresponding to an input file and containing a file handle fh, the last line read line, and the timestamp extracted from that line timestamp.
The hash at the end of the array always corresponds to the input that has the smallest value for the timestamp, so all that is necessary is to repeatedly pop the next value from the array, print its data, read the next line and (if it hasn't reached eof) insert it back into the array in sorted order.
This could produce an appreciable increase in speed over the repeated sorting of all the data for each output line that the other answers use.
Note that the program expects the list of input files as parameters on the command line, and sends its merged output to STDOUT. It also assumes that the input files are already sorted.
use strict;
use warnings;
use autodie;

my @data;

for my $file (@ARGV) {
    my $item;
    open $item->{fh}, '<', $file;
    insert_item($item, \@data);
}

while (@data) {
    my $item = pop @data;
    print $item->{line};
    insert_item($item, \@data);
}

sub insert_item {
    my ($item, $array) = @_;
    return if eof $item->{fh};
    $item->{line} = readline $item->{fh};
    ($item->{timestamp}) = $item->{line} =~ /^(\d+)/;
    my $i = 0;
    ++$i while $i < @$array and $item->{timestamp} < $array->[$i]{timestamp};
    splice @$array, $i, 0, $item;
}
output
0334.45656
0334.89765
0334.12345
0334.45678
0335.12346
0335.45678
0335.67899
0335.56789
0335.98764

When trying to print an array from a sub, only the first element prints

I'm writing a Perl script that requires me to pull out a whole column from a file and manipulate it. For example, take out column A and compare it to another column in another file:
A B C
A B C
A B C
So far I have:
sub routine1
{
    ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
I have most of it done. The only problem is that when I call the subroutine and print the result, it only prints the first element in the array (i.e. it will only print one A).
I am sure that what you actually have is this:
sub routine1
{
    while ( $_ = <FILE> )
    {
        next if $. < 2; # to skip header of file
        my @array1 = split(/\t/, $_);
        my $file1 = $array1[@_];
        return $file1;
    }
}
which does compile, and reads the file one line at a time in a loop.
There are two problems here. First of all, as soon as your loop has read the first line of the file (after the header) the return statement exits the subroutine, returning the only field it has read. That is why you get only a single value.
Secondly, you have indexed your @array1 with @_. What that does is take the number of elements in @_ (usually one) and use that to index @array1. You will therefore always get the second element of the array.
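A tiny self-contained demonstration of that indexing mistake:
my @array1 = ( 'A', 'B', 'C' );

sub demo {
    return $array1[@_];   # @_ in scalar context is its element count
}

print demo('one argument'), "\n";   # one argument, so this prints $array1[1], i.e. 'B'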
I'm not clear what you expect as a result, but you should write something like this. It accumulates all the values from the specified column into the array @retval, and passes the file handle into the subroutine instead of just using a global, which is poor programming practice.
use strict;
use warnings;

open my $fh, '<', 'myfile.txt' or die $!;

my @column2 = routine1($fh, 1);
print "@column2\n";

sub routine1 {
    my ($fh, $index) = @_;
    my @retval;
    while ($_ = <$fh>) {
        next if $. < 2; # to skip header of file
        my @fields = split /\t/;
        my $field = $fields[$index];
        push @retval, $field;
    }
    return @retval;
}
output
B B
Try replacing most of your sub with something like this:
my @aColumn;
while (<FILE>)
{
    chomp;
    my ($Acol, $Bcol, $Ccol) = split("\t");
    push(@aColumn, $Acol);
}
return @aColumn;
Jumping to the end, the following will pull out the first column in your file blah.txt and put it in an array for you to manipulate later:
use strict;
use warnings;
use autodie;

my $file = 'blah.txt';
open my $fh, '<', $file;

my @firstcol;
while (<$fh>) {
    chomp;
    my @cols = split;
    push @firstcol, $cols[0];
}

use Data::Dump;
dd \@firstcol;
What you have right now isn't actually looping on the contents of the file, so you aren't going to be building an array.
Here are a few items to consider when crafting a subroutine solution for obtaining an array of column values from a file:
Skip the file header before entering the while loop to avoid a line-number comparison for each file line.
split only the number of columns you need by using split's LIMIT. This can significantly speed up the process.
Optionally, initialize a local copy of Perl's @ARGV with the file name, and let Perl handle the file i/o.
Borodin's solution of creating a subroutine that takes both the file name and the column number is excellent, so it's implemented below, too:
use strict;
use warnings;

my @colVals = getFileCol( 'File.txt', 0 );

print "@colVals\n";

sub getFileCol {
    local @ARGV = (shift);
    my ( $col, @arr ) = shift;

    <>; # skip file header

    while (<>) {
        my $val = ( split ' ', $_, $col + 2 )[$col] or next;
        push @arr, $val;
    }

    return @arr;
}
Output on your dataset:
A A
Hope this helps!

Push to array not working inside loop

AIM:
I am trying to count a value in "column" 20 in a text file, then print the number of occurrences together with the values from the line in the text file. Some of the lines will be identical, with the exception of "column" 0 (the first column). I am trying to use hashes (though I have a limited understanding of how to use them).
PROBLEM:
While doing push in a sub (inside a foreach loop), the value is not being pushed to the array outside the loop, and hence the output is not saved to the file. Printing inside the loop works (print $dummy) and all the data is displayed.
INPUT:
Filename1 Value1a Value2a Value3a ... Column20a ... ColumnENDa
Filename2 Value1b Value2b Value3b ... Column20b ... ColumnENDb
Filename3 Value1c Value2c Value3c ... Column20a ... ColumnENDc
...
OUTPUT (using print $dummy inside loop):
2 Column20a Filename1, Filename3
1 Column20b Filename2
...
CODE:
use strict;
use warnings;
use Cwd;
use File::Find::Rule;
use File::Spec;
use File::Basename;
use Text::Template;
use File::Slurp;
use List::MoreUtils qw(uniq);

my $current_dir = cwd;
my @test_file = read_file ("test_file.txt");
my %count = ();
my %name = ();
my @test = "Counts\tName\tFile_names";

foreach (@test_file) {
    chomp $_;
    our(@F) = split('\t', $_, 0);
    ++$count{"$F[20] "};
    $name{"$F[20] "} .= "$F[0]," if $F[20];
    sub END {
        foreach $_ (keys %name) {
            $name{$_} =~ s/,$//;
            my $dummy = "$count{$_}\t $_\t $name{$_}\n";
            #print $dummy;
            push (@test, $dummy);
        }
    };
}
print "@test";
write_file( 'test.txt', @test);
Why is the push function not working outside the sub (foreach loop)?
You're not actually calling your sub.
If you meant it to be the END block, it shouldn't be a sub - and you should not use END blocks unless there's a technical reason to do so.
If you mean it to be a sub, name it something else and actually call it (the name isn't a syntax error, it just looks bad - END has a special meaning).
The end of your code would be (without fixing/improving it):
foreach (@test_file) {
    chomp $_;
    our(@F) = split('\t', $_, 0);
    ++$count{$F[20]};
    $name{$F[20]} .= "$F[0]," if $F[20];
}

process_test();

print "@test";
write_file( 'test.txt', @test);

##########################

sub process_test {
    foreach $_ (keys %name) {
        $name{$_} =~ s/,$//;
        my $dummy = "$count{$_}\t $_\t $name{$_}\n";
        push (@test, $dummy);
    }
}
As an alternative, don't even have a sub (it's not necessary for a couple of lines of code :)
foreach (@test_file) {
    chomp $_;
    our(@F) = split('\t', $_, 0);
    ++$count{$F[20]};
    $name{$F[20]} .= "$F[0]," if $F[20];
}

foreach $_ (keys %name) {
    $name{$_} =~ s/,$//;
    my $dummy = "$count{$_}\t $_\t $name{$_}\n";
    push (@test, $dummy);
}

print "@test";
write_file('test.txt', @test);
I tested this on my own version of your code, and got the following in the output file using your test input:
Counts Name File_names
2 Column20a Filename1,Filename3
1 Column20b Filename2
Why do you define the subroutine inside the foreach loop? Named subroutines in Perl are global and are compiled only once, so nesting the definition in a loop does not create a fresh one per iteration - it only obscures the code.
There is also a problem with the name of your subroutine: you never actually call it, and you can't call it in the normal way, because Perl reserves END for the END block. Let me show you this with an example:
use warnings;
use strict;
END('hello world');
sub END{
my $string = shift;
print $string;
}
The purpose of an END block is to execute everything between the braces when the program ends - hence the name.
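A minimal sketch of an END block used as intended:
use strict;
use warnings;

print "main program runs first\n";

END { print "this runs last, as the program exits\n"; }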
Either omit the subroutine altogether, or declare it in a global context, e.g. at the end of the program.

How to get last duplicate line from file using perl

I'd like to get the last entry of each duplicated line from a file.
The basis for duplicate checking is the first field of the CSV.
The duplicates may or may not be adjacent.
Input file:
971~11
972~12
973~11
974~11
972~11
Expected output:
971~11
973~11
974~11
972~11
I'm not looking for a perl one-liner as I intend to write this as
a subroutine.
Thanks!
PS:
I have modified this code from somewhere, but it just removes the duplicates:
#!/usr/bin/perl -w
while (<STDIN>) { push (@lines, $_); }
print "-\n";
foreach my $i (@lines)
{
    @newline = split(/\||~/, $i);
    if (scalar(grep{ /$newline[0]/ } @lines) == 1)
    {
        print $i;
    }
}
If the output order doesn't matter, the easiest way to do this is to use a hash to do the duplicate removal. Something like the following:
#!/usr/bin/perl -w
use strict;

sub printlast(@) {
    my %dedup;
    foreach my $line (@_) {
        my $key = (split(/\||~/, $line))[0];
        $dedup{$key} = $line;
    }
    print $dedup{$_} for keys %dedup; # or sort keys %dedup for prettier output
}

my @lines;
while (<STDIN>) { push (@lines, $_); }
print "-\n";
printlast(@lines);
When looking to dedup, it's almost always best to use a hash.
Here's something similar to the accepted answer (since @Mat beat me to it):
#!/usr/bin/env perl -lw
use Data::Dumper; $Data::Dumper::Indent = 1;

my %seen;
while (<DATA>) {
    chomp;
    my @fields = split('~');
    $seen{$fields[0]} = $fields[1];
}

my @output;
while (my ($k,$v) = each %seen) {
    push @output, join('~', $k, $v);
}
print Dumper \@output;

__DATA__
971~11
972~12
973~11
974~11
972~11
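Note that with a hash alone the output order is effectively random (keys/each order). If you need the order shown in the question's expected output - each key in the order of its last occurrence - here is a minimal sketch, assuming the same key~value format:
use strict;
use warnings;

my %last;      # key => full line (last occurrence wins)
my @order;     # keys, ordered by their most recent occurrence

while (my $line = <STDIN>) {
    my ($key) = split /\||~/, $line;
    # if we saw this key before, move it to the back of the order
    @order = grep { $_ ne $key } @order if exists $last{$key};
    push @order, $key;
    $last{$key} = $line;
}

print $last{$_} for @order;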