Perl grep not returning expected value - perl

I have the following code:
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;
my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;
while (<PIVOTFILE>) { # loop through each line in file
next if ($. == 1); # skip first line (contains business segment code)
next if ($. == 2); # skip second line (contains transaction amount text)
my @fields = split('\t',$_); # split fields for line into an array
print scalar(grep $_, @fields), "\n";
}
Given that the data in the text file is this:
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
I would expect the output from the Perl script to be 2 3 3 4, given the number of defined elements in each line. The file is a tab-delimited text file with 8 columns.
Instead I get 3 4 3 4 and I have no idea why!
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.

I suspect you have spaces mixed with the tabs in some places, and your grep test will consider " " true.
What does:
use Data::Dumper;
$Data::Dumper::Useqq=1;
print Dumper [<PIVOTFILE>];
show?

The problem should be in this line:
my @fields = split('\t',$_); # split fields for line into an array
The tab character doesn't get interpolated. And your file doesn't seem to be tab-only separated, at least here on SO. I changed the split regex to match arbitrary whitespace, ran the code on my machine and got the "right" result:
my @fields = split(/\s+/,$_); # split fields for line into an array
Result:
2
3
3
4

As a side note:
For background, I am using Counting array elements in Perl as the basis for my development, as I am trying to count the number of elements in the line to know if I need to skip that line or not.
Now I understand why you use grep to count array elements. That's important when your array contains undefined values like here:
my @a;
$a[1] = 42; # @a contains the list (undef, 42)
say scalar @a; # 2
or when you manually deleted entries:
my @a = split /,/ => 'foo,bar'; # @a contains the list ('foo', 'bar')
delete $a[0]; # @a contains the list (undef, 'bar')
say scalar @a; # 2
But in many cases, especially when you're using arrays just to store lists without operating on single array elements, scalar @a works perfectly fine.
my @a = (1 .. 17, 1 .. 25); # (1, 2, ..., 17, 1, 2, ..., 25)
say scalar @a; # 42
It's important to understand what grep does! In your case
print scalar(grep $_, @fields), "\n";
grep returns the list of true values of @fields and then you print how many you have. But sometimes this isn't what you want/expect:
my @things = (17, 42, 'foo', '', 0); # even '' and 0 are things
say scalar grep $_ => @things; # 3!
Because the empty string and the number 0 are false values in Perl, they won't get counted with that idiom. So if you want to know how long an array is, just use
say scalar @array; # number of array entries
If you want to count true values, use this
say scalar grep $_ => @array; # number of true values
But if you want to count defined values, use this
say scalar grep defined($_) => @array; # number of defined values
I'm pretty sure you already know this from the other answers on the linked page. In hashes, the situation is a little bit more complex because setting something to undef is not the same as deleting it:
my %h = (a => 0, b => 42, c => 17, d => 666);
$h{c} = undef; # still there, but undefined
delete $h{d}; # BAM! $h{d} is gone!
What happens when we try to count values?
say scalar grep $_ => values %h; # 1
because 42 is the only true value in %h.
say scalar grep defined $_ => values %h; # 2
because 0 is defined although it's false.
say scalar grep exists $h{$_} => qw(a b c d); # 3
because undefined values can exist. Conclusion:
know what you're doing instead of copy'n'pasting code snippets :)

The file contains not only tabs but spaces as well.
Splitting on whitespace works.
See below:
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;
while (<DATA>) { # loop through each line in file
next if ($. == 1); # skip first line (contains business segment code)
next if ($. == 2); # skip second line (contains transaction amount text)
my @fields = split(" ",$_); # split fields on whitespace
print scalar(@fields), "\n";
}
__DATA__
4 G I M N U X
Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount Transaction Amount
0000-13-I21 600
0001-8V-034BLA 2,172 2,172
0001-8V-191GYG 13,125 4,375
0001-9W-GH5B2A -2,967.09 2,967.09 25.00
Output
2
3
3
4

Your code works for me. The problem may be that the input file contains some "hidden" whitespace fields (e.g. whitespace other than tabs). For instance:
A<tab><space><CR> gives two fields, A and <space><CR>
A<tab>B<tab><CR> gives three, A, B, <CR> (remember, the end of line is part of the input!)
I suggest you chomp every line you read; beyond that, you will have to filter whitespace-only fields out of the array. E.g.
scalar(grep /\S/, @fields)
should do it.
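To make the effect of the trailing newline concrete, here is a minimal sketch. The helper name count_true_fields and the sample line are made up for illustration (the real file's exact bytes are not known), so treat it as a demonstration of the mechanism rather than a reproduction of the asker's data.

```perl
use strict;
use warnings;

# Hypothetical helper: count the fields Perl considers true after
# splitting a line on tabs, optionally chomping the line first.
sub count_true_fields {
    my ($line, $do_chomp) = @_;
    chomp $line if $do_chomp;
    my @fields = split /\t/, $line;
    return scalar grep { $_ } @fields;
}

# Made-up sample: a value, three empty columns, a value, a trailing tab.
my $line = "0000-13-I21\t\t\t\t600\t\n";

print count_true_fields($line, 0), "\n";  # the lone "\n" field counts as true
print count_true_fields($line, 1), "\n";  # only the two real values remain
```

Without chomp the last field is the string "\n", which is non-empty and therefore true; after chomp it becomes an empty trailing field, which split drops.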

A lot of great help on this question, and quickly too!
After a long, drawn-out learning process, this is what I came up with that worked quite well, with intended results.
#!/usr/bin/perl
# splits.pl
use strict;
use warnings;
use diagnostics;
my $pivotfile = "myPath/Internal_Splits_Pivot.txt";
open PIVOTFILE, $pivotfile or die $!;
while (<PIVOTFILE>) { # loop through each line in file
next if ($. == 1); # skip first line (contains business segment code)
next if ($. == 2); # skip second line (contains transaction amount text)
chomp $_; # remove the trailing newline
my @fields = split(/\t/,$_); # split fields for line into an array
print scalar(grep $_, @fields), "\n";
}

Related

Replace single space with multiple spaces in perl

I have a requirement of replacing a single space with multiple spaces so that the second field always starts at a particular position (here 36 is the position of second field always).
I have a perl script written for this:
while(<INP>)
{
my $md=35-index($_," ");
my $str;
$str.=" " for(1..$md);
$_=~s/ +/$str/;
print "$_" ;
}
Is there any better approach with just using the regex in =~s/// so that I can use it on CLI directly instead of script.
Assuming that the fields in your data are demarcated by spaces
while (<$fh>) {
my ($first, @rest) = split;
printf "%-35s @rest\n", $first;
}
The first field is padded to 35 characters, aligned left due to the - in the printf format, and followed by a single space, so it occupies 36 columns in total. See sprintf for the many details. The rest is printed with single spaces between the original space-separated fields, but can instead be done as desired (tab separated, fixed width...).
Or you can leave the "rest" after the first field untouched by splitting the line into two parts
while (<$fh>) {
my ($first, $rest) = /(\S+)\s+(.*)/;
printf "%-35s %s\n", $first, $rest; # pass $rest as an argument so a literal % in the data is not read as a format
}
(or use split ' ', $_, 2 instead of regex)
Please give more detail if there are other requirements.
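As a minimal sketch of the two-part split mentioned above, the following wraps the idea in a helper. The name pad_first_field is hypothetical, and the width of 35 followed by one space simply mirrors the printf format used in this answer.

```perl
use strict;
use warnings;

# Hypothetical helper: pad the first whitespace-delimited field to
# 35 characters, add one space, and leave the rest of the line untouched.
sub pad_first_field {
    my ($line) = @_;
    my ($first, $rest) = split ' ', $line, 2;
    return $line unless defined $rest;   # fewer than two fields: nothing to align
    return sprintf "%-35s %s", $first, $rest;
}

print pad_first_field("ABCD TEST EFGH"), "\n";
```

Since the question asks for something usable directly on the command line, the same idea can probably be written as a one-liner along the lines of perl -pe 's/^(\S+) +/sprintf "%-35s ", $1/e' file, using the /e modifier to evaluate the replacement as code.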
One approach is to use plain ol' Perl formats:
#!/usr/bin/perl
use warnings;
use strict;
my($first, $second, $remainder);
format STDOUT =
#<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< #<<<<<< #<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
$first, $second,$remainder
.
while (<DATA>) {
($first, $second, $remainder) = split(/\s+/, $_, 3);
write;
}
exit 0;
__DATA__
ABCD TEST EFGH don't touch
FOO BAR FUD don't touch
Test output. I probably miscounted the columns, but you should get the idea:
$ perl dummy.pl
ABCD TEST EFGH don't touch
FOO BAR FUD don't touch
Other option would be Text::Table

In Perl, how can I tell split not to strip empty trailing fields?

Was trying to count the number of lines in a string of text (including empty lines). A little surprised by the behavior of split. Had expected the following to output 2 but it printed 1 on my perl 5.14.2.
$str = "hello\nworld\n\n";
@a = split(/\n/, $str);
print $#a, "\n";
It seems that split() is insensitive to consecutive \n (adding more \n's at the end of the string does not increase the printout). The only way I can get close to the number of lines is
$str = "hello\nworld\n\n";
@a = split(/(\n)/, $str);
printf "%d\n", ($#a + 1)/2, "\n";
But it looks more like a workaround than a straight solution. Any ideas?
perldoc -f split:
If LIMIT is negative, it is treated as if it were instead
arbitrarily large; as many fields as possible are produced.
If LIMIT is omitted (or, equivalently, zero), then it is usually
treated as if it were instead negative but with the exception that
trailing empty fields are stripped (empty leading fields are
always preserved); if all fields are empty, then all fields are
considered to be trailing (and are thus stripped in this case).
$ perl -E 'my $x = "1\n2\n\n"; my @x = split /\n/, $x, -1; say $#x'
3
Perhaps the problem is that you are using $#a when scalar @a is what you are actually looking for?
I apologize if you are already aware of this or if this is not the issue, but $#a returns the index of the last element of @a and (scalar @a) returns the number of elements that @a contains. Since array indexing starts at 0, $#a is one less than scalar @a.
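The two points above (the LIMIT argument, and $#a versus scalar @a) can be seen side by side in a small sketch. The helper name field_count is made up for illustration; it relies on the documented behaviour that a LIMIT of 0 is equivalent to omitting it.

```perl
use strict;
use warnings;

# Hypothetical helper: number of fields split returns for a given LIMIT.
# A LIMIT of 0 behaves like an omitted LIMIT (trailing empty fields stripped).
sub field_count {
    my ($str, $limit) = @_;
    my @fields = split /\n/, $str, $limit // 0;
    return scalar @fields;
}

my $str = "hello\nworld\n\n";
print field_count($str), "\n";      # trailing empty fields are stripped
print field_count($str, -1), "\n";  # negative LIMIT keeps them

my @a = split /\n/, $str, -1;
print "last index: $#a, element count: ", scalar(@a), "\n";
```

With a negative LIMIT, the empty field after the final newline is also kept, so the field count is one more than the number of newline characters in the string.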

Finding equal lines in file with Perl

I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match as they contain the same data in a different order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparison for your fields, but after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;
my $cache; # A hash of hashes. Will be autovivified later.
while (<DATA>) {
chomp;
my @fields = split;
# create the normalized representation by lowercasing and sorting the fields
my @normalized_fields = sort map lc, @fields;
# find or create the path in the lookup
my $pointer = \$cache;
$pointer = \${$pointer}->{$_} for @normalized_fields;
# if this is an unknown value, make it known, and output the line
unless (defined $$pointer) {
$$pointer = 1; # set some defined value
print "$_\n"; # emit the unique line
}
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as value of the lookup data structure, but in more complex scenarios the original fields or the line number could be stored here. For the sake of the example, I used space-separated values here, but you could replace the split with a call to Text::CSV or something.
This hash-of-hashes approach has sublinear space complexity, and worst case linear space complexity. The lookup time only depends on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for @normalized_fields;
unless (defined $$pointer->[1]) {
$$pointer->[1] = 1; ...
}
or introduce a default value for nonexistent fields (e.g. the separator of the original file). Here is an example with the NUL character:
my $fields = 3;
...;
die "record too long" if @fields > $fields;
...; # make normalized fields
push @normalized_fields, ("\x00") x ($fields - @normalized_fields);
...; # do the lookup
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;
my %data;
while (<DATA>) {
chomp;
my $key = join ',', sort map lc, split /,/;
push @{$data{$key}}, $.;
}
foreach my $list (values %data) {
next unless @$list > 1;
print "Lines ", join(', ', @$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
output
Lines 1, 3 are equivalent
1. Make two hash tables A and B.
2. Stream through your input one line at a time.
3. For the first line pair x and y, use each as key and the other as value for both hash tables (e.g., $A->{x} = y; $B->{y} = x;).
4. For the second and subsequent line pairs, test if the second field's value exists as a key for either A or B. If it does, you have a reverse match; if not, repeat the addition process from step 3 to add it to the hash tables.
To do a version of amon's answer without a hash table, if your data are numerical, you could:
1. Stream through input line by line, sorting fields one and two by numerical ordering.
2. Pipe the result to UNIX sort on the first and second fields.
3. Stream through the sorted output line by line, checking if the current line matches the previous line (reporting a reverse match, if true).
This has the advantage of using less memory than hash tables, but may take more time to process.
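The sort-based idea can be sketched roughly as follows. This is an in-memory illustration only: the helper name find_equivalent_lines is hypothetical, Perl's built-in sort stands in for the external UNIX sort, and string ordering is used instead of numerical ordering so that it also runs on the x1,y1 sample data from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return pairs of lines that hold the same fields
# in a different order, found by sorting rather than by hash lookup.
sub find_equivalent_lines {
    my @lines = @_;

    # Step 1: normalize each line by ordering its fields (string order here).
    my @keyed;
    for my $line (@lines) {
        my @parts = split /,/, $line;
        push @keyed, [ join(',', sort @parts), $line ];
    }

    # Step 2: sort by normalized key (stand-in for the external sort).
    my @sorted = sort { $a->[0] cmp $b->[0] } @keyed;

    # Step 3: neighbours with equal keys are equivalent lines.
    my @matches;
    for my $i (1 .. $#sorted) {
        push @matches, [ $sorted[$i-1][1], $sorted[$i][1] ]
            if $sorted[$i][0] eq $sorted[$i-1][0];
    }
    return @matches;
}

my @found = find_equivalent_lines('x1,y1', 'x2,y2', 'y1,x1', 'x3,y3');
print "$_->[0] matches $_->[1]\n" for @found;
```

For a file too large to hold in memory, the first and third steps would stream line by line and the second step would be delegated to an external sort, as the steps above describe.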
amon already provided the answer I would've provided, so please enjoy this bad answer:
#! /usr/bin/perl
use common::sense;
my $re = qr/(?!)/; # always fails
while (<DATA>) {
warn "Found duplicate: $_" if $_ =~ $re;
next unless /^(.*),(.*)$/;
die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
$re = qr/^\Q$2,$1\E$|$re/
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3

Perl file manipulation using Tie::File

I am parsing an HTML file that contains data that is associated in a grid like manner and am close to being done. I had previously thought that removing all blank lines would be needed but I failed to notice that some fields in the grid are blank. I am now trying to use the Tie::File module to store the file in an array, iterate over it, and if there are three continuous blank lines, I want to insert a dummy value I can manipulate later so that the blank line stripping does not alter the structure of my data.
What I have tried so far (the file is ~2 MB):
my @lines;
my $num = 0;
tie @lines, 'Tie::File', 'results.txt';
(tied @lines)->defer;
foreach (@lines)
{
chomp $lines[$num];
$num++;
if ($lines[$num-1] =~ /^$/ && $lines[$num+1] =~ /^$/)
{
$lines[$num] = "null";
}
}
(tied @lines)->flush;
untie @lines;
Edit: How do I go about iterating over the array and insert the value so there is only one space between each line so I can later get rid of all the blank lines?
If I understand your problem correctly (replace three consecutive empty lines with the word "null" and an empty line on either side), perhaps the following regex operating on your file's contents will help:
use Modern::Perl;
my $htmlFile = do { local $/; <DATA> };
$htmlFile =~ s/(?<!\S)\n{3}/\nnull\n\n/g;
say $htmlFile;
__DATA__
A
B
C
D
E
F
Output:
null
A
B
null
null
C
D
null
E
F

How do I calculate the difference between each element in two arrays?

I have a text file with numbers which I have grouped as follows, separated by a blank line:
42.034 41.630 40.158 26.823 26.366 25.289 23.949
34.712 35.133 35.185 35.577 28.463 28.412 30.831
33.490 33.839 32.059 32.072 33.425 33.349 34.709
12.596 13.332 12.810 13.329 13.329 13.569 11.418
Note: the groups are always of equal length and can be arranged in more than one line long, if the group is large, say 500 numbers long.
I was thinking of putting the groups in arrays and iterate along the length of the file.
My first question is: how should I subtract the first element of array 2 from array 1, array 3 from array 2, similarly for the second element and so on till the end of the group?
i.e.:
34.712-42.034,35.133-41.630,35.185-40.158 ...till the end of each group
33.490-34.712,33.839-35.133 ..................
and then save the differences of the first element in one group (second question: how ?) till the end
i.e.:
34.712-42.034 ; 33.490-34.712 ; and so on in one group
35.133-41.630 ; 33.839-35.133 ; ........
I am a beginner so any suggestions would be helpful.
Assuming you have your file opened, the following is a quick sketch
use List::MoreUtils qw<pairwise>;
...
my @list1 = split ' ', <$file_handle>;
my @list2 = split ' ', <$file_handle>;
my @diff = pairwise { $a - $b } @list1, @list2;
pairwise is the simplest way.
Otherwise there is the old standby:
# construct a list using each index in @list1 ( 0..$#list1 )
# consisting of the difference at each slot.
my @diff = map { $list1[$_] - $list2[$_] } 0..$#list1;
Here's the rest of the infrastructure to make Axeman's code work:
#!/usr/bin/perl
use strict;
use warnings;
use List::MoreUtils qw<pairwise>;
my (@prev_line, @this_line, @diff);
while (<>) {
next if /^\s+$/; # skip leading blank lines, if any
@prev_line = split;
last;
}
# get the rest of the lines, calculating and printing the difference,
# then saving this line's values in the previous line's values for the next
# set of differences
while (<>) {
next if /^\s+$/; # skip embedded blank lines
@this_line = split;
@diff = pairwise { $a - $b } @this_line, @prev_line;
print join(" ; ", @diff), "\n\n";
@prev_line = @this_line;
}
So given the input:
1 1 1
3 2 1
2 2 2
You'll get:
2 ; 1 ; 0
-1 ; 0 ; 1
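The question's second part (collecting the differences for each position into one group) is not covered by the code above. One possible sketch, assuming the rows have already been read into array references of equal length, is shown below; the helper name column_differences is made up for illustration and uses plain index loops rather than pairwise.

```perl
use strict;
use warnings;

# Hypothetical helper: given rows of numbers (array refs of equal length),
# return one array ref per column holding the row-to-row differences.
sub column_differences {
    my @rows = @_;
    my @by_column;
    for my $r (1 .. $#rows) {
        for my $c (0 .. $#{ $rows[$r] }) {
            # current row minus previous row, filed under its column
            push @{ $by_column[$c] }, $rows[$r][$c] - $rows[$r-1][$c];
        }
    }
    return @by_column;
}

# Same sample input as above: one output line per column.
my @cols = column_differences([1, 1, 1], [3, 2, 1], [2, 2, 2]);
print join(" ; ", @$_), "\n" for @cols;
```

This is effectively a transpose of the per-row output: each printed line holds all the differences for one position in the groups.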