Delete lines containing repeated values, keeping the first occurrence - Perl

I have a large tab-separated file like this:
input.txt
a b c
s t e
a b c
f q y
r e x
To delete the repeated lines (rows) in this file, I use:
my %seen;
my @lines;

while (<>) {
    my @cols = split /\s+/;
    unless ($seen{$cols[0]}++) {
        push @lines, $_;
    }
}

print @lines;
The output here is:
a b c
s t e
f q y
r e x
Now I also want to delete lines that contain repeated values, meaning values that have already appeared anywhere in an earlier row or column (here "e"), keeping only the uppermost line containing each value. Please suggest the preferred approach, keeping in mind that my input file is very large, with many columns and rows.
The output that I want for the above input.txt would be:
a b c
s t e
f q y
Thank you

You also need to iterate over @cols and examine every item instead of just the first one, $cols[0].
You need something like
unless ($seen{$cols[0]}++ || $seen{$cols[1]}++ || $seen{$cols[2]}++ ...) {
    push @lines, $_;
}
Of course that would be bad style and impossible if you don't know the number of columns in advance.
I would do it with grep:
my %seen;
my @lines;

while (<DATA>) {
    my @cols = split /\s+/;
    unless ( grep { $seen{$_}++ } @cols ) {
        push @lines, $_;
    }
}

print @lines;
__DATA__
a b c
s t e
a b c
f q y
r e x
Output:
a b c
s t e
f q y
grep processes the code between the curlies { $seen{$_}++ } for each element in the list @cols and returns (in scalar context) the number of items that evaluated to true.
It's not the fastest approach because it always iterates over the whole array (even if the first evaluation would be sufficient for your particular test). But give it a try; perhaps it's fast enough for you.
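A minimal illustration of that scalar-context behaviour (the data here is made up):

my %seen = ( b => 1 );
my @cols = qw( a b c );
my $dups = grep { $seen{$_}++ } @cols;   # $dups == 1, because only 'b' was already seen
# Side effect: %seen now holds a => 1, b => 2, c => 1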

As I wrote in my comments, split /\s+/ is very rarely correct.
The solution you have also mishandles lines with duplicate fields.
It's also more efficient to replace grep with any from the core List::Util module.
I suggest that you store the fields of each line in a hash %cols, like this:
use strict;
use warnings 'all';
use List::Util 'any';
my ( @lines, %seen );

while ( <DATA> ) {
    my %cols = map { $_ => 1 } split;
    push @lines, $_ unless any { $seen{$_}++ } keys %cols;
}

print for @lines;
__DATA__
a b c
p p p
p q r
s t e
a b c
f q y
r e x
output
a b c
p p p
s t e
Even this may not be what you want, as the line f q y is omitted because q has already been "seen" in the omitted line p q r. You will have to clarify the required behaviour in this situation.
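If, instead, values on omitted lines should not count as "seen" (so that f q y would be kept in the example above), one possible variant (a sketch of just one interpretation) records a line's values only when the line itself is kept:

use strict;
use warnings 'all';
use List::Util 'any';

my ( @lines, %seen );

while ( <DATA> ) {
    my @cols = split;
    # Mark values as seen only for lines that we actually keep
    unless ( any { $seen{$_} } @cols ) {
        ++$seen{$_} for @cols;
        push @lines, $_;
    }
}

print for @lines;

For the same data this keeps a b c, p p p, s t e, and f q y.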


Split pipe-separated data from a file - Perl

I have a file with pipe-separated data, where the values are 0, 1, or some text. I have to read the file line by line, split each line on the pipe, and whenever the value at any index is 1, increment that column's count in a hash, so that at the end I have a hash that tells how many true values there are in the file.
Below is an example of the data and the name of each column; if a value is 1, the corresponding hash key's count gets incremented by 1.
A B C D E F G H I
===========================
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|0|0|0|ABC|15:44:07|0
0|1|0|0|0|1|ABC|15:44:07|0
So the final output will be like below:
$finalHash->{D} = 2;
$finalHash->{F} = 1;
$finalHash->{B} = 1;
I am able to read the file line by line:
my $final_hash;
for my $line ( <$ifh> ) {
    my @el = split( /\|/, $line );
    # Now here I have to set the value for the hash key where we get the value 1
}
Don't use for to iterate file lines, use while. The former might exhaust memory if the file was very large, as it needs to create a list of all the lines; the latter reads one line at a time.
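Schematically, the difference is (with $ifh being your open file handle):

# Builds a complete list of all lines in memory first:
for my $line ( <$ifh> ) { ... }

# Reads and processes one line at a time:
while ( my $line = <$ifh> ) { ... }

A complete program: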
#! /usr/bin/perl
use warnings;
use strict;

my @column_names = qw( A B C D E F );
my %counts;

while (<DATA>) {
    my @columns = split /\|/;
    for my $i (0 .. $#column_names) {
        ++$counts{ $column_names[$i] } if $columns[$i];
    }
}

use Data::Dumper; print Dumper \%counts;
__DATA__
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|1|0|0|ABC|15:44:07|0
0|0|0|0|0|0|ABC|15:44:07|0
0|1|0|0|0|1|ABC|15:44:07|0
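For the data shown, this prints something like the following (hash key order may vary):

$VAR1 = {
          'D' => 2,
          'B' => 1,
          'F' => 1
        };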

How to get only uniq column values from a line using Linux or Perl?

I have a file like below
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
I want the output to look like this
1 B C D
2 K B D
1 B C D
2 K B D
1 B C D
2 K B D
Sort commands don't work, so I tried writing a Perl program -
use strict;
use Data::Dumper;
my $file1 = <$ARGV[0]>;
open (IF2, "$file1") || die "Cannot open the file\n";
open (OUT, ">$file1.out") || die "Cannot open the out file\n";
my $k = 0; my $i = 0;
my @line;
my $m;
my @line2;
while ( chomp($m = <IF2>) ) {
    my $count = 0;
    @line2 = split(/\t/,$m);#<stdin>;
    my $l = length @line2;print $l;<stdin>;
    for (my $x = 0; $x < $l;$x++) {
        my $k = 0;
        for (my $y = 0;$y < $l; $y) {
            $i++;
            #
            if ($count == 0)
            {
                print OUT "\t$line2[$x]";
                $count++;
            }
            if ($count != 0 && $x != $y)
            {
                if ($line2[$x] eq $line2[$y])
                {
                    $k++;
                }
            }
        }
        if ($k == 0)
        {
            print OUT "\t$line2[$x]";
        }
    }
    print OUT "\n";
}
print $i;
close IF2;
close OUT;
But it didn't work.
Can some one help?
Note: The input and output examples in the question were edited; now the data is consistent with the title, asking to remove all duplicates. (See the original question for how it looked.) I am leaving the answer to the original question as well, until we hear from the poster.
I would like to make a general remark first.
Your posted code is an honest attempt to write a C-style program in Perl. I recommend resisting that and learning how to use Perl instead. It takes a little investment of time and effort, but it goes quickly and pays off quickly. As you can see below, it makes many things incomparably easier.
Even so, there are issues with the posted code but I can't write a code review now.
Edited question
We need to remove all duplicates from each line. One way: split the line by whitespace then remove duplicates from the list, which is a standard task with ready solutions.
use warnings;
use strict;
use List::MoreUtils qw(uniq);

my $file = '...';
my $fileout = '...';

open my $fh, '<', $file or die "Can't open $file: $!";
open my $fh_out, '>', $fileout or die "Can't open $fileout: $!";

while (<$fh>)
{
    my @unique = uniq split;
    print $fh_out "@unique\n";
}
close $fh;
close $fh_out;
The line with spaces is restored by printing the quoted array, whereby spaces (or whatever is in $") are inserted between elements. The other option is to join the resulting list:
my $res = join ' ', uniq split;
print $fh_out $res, "\n";
or just print $fh_out join(' ', uniq split), "\n";.
This uses uniq from the List::MoreUtils module. Note from its documentation that
The order of elements in the returned list is the same as in LIST.
Once List::MoreUtils got mentioned, take note of the core module List::Util as well.
A one-liner version
perl -MList::MoreUtils=uniq -lne'print join " ", uniq split' input > output
or
perl -MList::MoreUtils=uniq -lane'print join " ", uniq #F' input > output
See Command switches in perlrun
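Note that sufficiently recent versions of the core List::Util module (1.45 and later) provide uniq as well, so under that assumption the same one-liner needs no extra installation:

perl -MList::Util=uniq -lane'print join " ", uniq @F' input > output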
Original question (see it in the edit history)
I was answering the question with the belief that this was the input
1 B B C D B 2 K B D D K 1 B B C D B 2 K B D D K 1 B B C D B 2 K B D D K
and this was the desired output
1 B C D 2 K B D 1 B C D 2 K B D 1 B C D 2 K B D
By your desired output you want to remove only adjacent duplicates (not get "uniq" as the title says).
For that you can use the ability of regex to match repeated patterns, by using backreferences. First we need to strip all spaces, and we'll put them back in the end. For example
use warnings;
use strict;

my $file = '...';
my $fileout = '...';

open my $fh, '<', $file or die "Can't open $file: $!";
open my $fh_out, '>', $fileout or die "Can't open $fileout: $!";

while (my $line = <$fh>)
{
    $line =~ s/\s*//g;         # remove all whitespace, including the newline
    $line =~ s/(.)\1+/$1/g;    # remove adjacent duplicates
    $line =~ s/(.)/$1 /g;      # restore spaces
    print $fh_out "$line\n";   # add back the newline stripped above
}
close $fh;
close $fh_out;
The . matches any character, replace it with something more restrictive if needed (for example \w, for a "word" character). See perlretut. Note that we cannot restore space along with replacement (like s/(.)\1+/$1 /g) since non-repeating chars aren't matched and don't get the space back.
This can be done in more concise ways.
A one-liner version
perl -lpe's/\s*//g; s/(.)\1+/$1/g; s/(.)/$1 /g' input > output
See Command switches in perlrun
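As a sketch of one such more concise way (assuming fields are whitespace-separated tokens), a single substitution with a backreference and a lookahead collapses adjacent repeated tokens, with no need to strip and restore the spaces:

perl -lpe's/(\S+)(?:\s+\1(?=\s|$))+/$1/g' input > output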
I suggest something like this. It finds all occurrences of whitespace followed by non-whitespace, and checks whether the non-whitespace has been seen before on the current line. The matched substring is deleted if the non-whitespace has been seen before; otherwise it is left untouched.
use strict;
use warnings 'all';
while ( <DATA> ) {
    my %seen;
    s/(\s+(\S+))/ $seen{$2}++ ? '' : $1 /eg;
    print;
}
__DATA__
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
1 B B C D B
2 K B D D K
output
1 B C D
2 K B D
1 B C D
2 K B D
1 B C D
2 K B D
This may be done in a one-liner like this
perl -pe 'my %s; s/(\s+(\S+))/ $s{$2}++ ? "" : $1 /eg' myfile

Two inconsistent hashes from same input text file in Perl

I've created a hash that lists each unique word from a text file and the number of times each word appears. The resulting output from this hash indicates it has read the entire file correctly.
However, later efforts to parse text from the same file appear to only capture some of the contents.
For illustrative purposes, a second hash designed to capture each word from the infile as a hash value and that word's relative ordering in the document as the hash key captures only a small fraction of all the words in the document.
Any insights as to the cause of this phenomenon?
#!/usr/bin/perl
use strict;
use warnings;

my $path = "U:/Perl";
chdir($path) or die "Cant chdir to $path $!";

# Starting off computing a simple word count for each word in the 10-K.
my %hash;
my $word;
my %words;

while (<>) {
    my @words = split;
    # Here creating an index of each word in the 10-K and the number of times
    # it appears. This hash works correctly.
    foreach my $i (0 .. $#words) {
        my $word = $words[$i];
        ++$hash{$word};
        # Here trying to create a hash where each word from the infile is a value,
        # and the word's order in the doc is the key.
        $words{$i} = $word;
    }
}

# The code below simply sorts each hash and prints it to an external file.
my @keys = sort { "\L$a" <=> "\L$b" } keys %words;

open(my $fh2, '>', 'report2.txt');
foreach (@keys) {
    print $fh2 "$_ \t $words{$_}\n ";
}
close $fh2;

@keys = sort {
    "\L$hash{$a}" <=> "\L$hash{$b}" or
    "\L$a" cmp "\L$b"
} keys %hash;

open(my $fh3, '>', 'report3.txt');
foreach (@keys) {
    print $fh3 "$_ \t $hash{$_}\n ";
}
close $fh3;
I presume you're talking about the %words hash? You are keying that hash by the position of each word on each line, so it will only ever have as many entries as the longest line in the file has words.
When processing this data
a b c d e f
g h i j k
l m n o
p q r
s t
u
your program will build %hash with 21 elements, with the keys a to u and all the values equal to 1 as they are all different.
But the hash %words will have six elements -- the number of words in the longest line -- and the values will be overwritten by the last word at that position in the line. So your hash will look like
{ 0 => 'u', 1 => 't', 2 => 'r', 3 => 'o', 4 => 'k', 5 => 'f' }
as the last word in the first position on the line is u, the last one in the second position is t, etc.
Surely you don't want a hash indexing the words across the whole file? That would be an array!
Update
Thank you for explaining your intention. I don't think putting all the words from the file into a hash one by one is going to help you to count all the unique six-word sequences.
In any case, using numeric keys from 1 to N, where N is the number of words in the file, is misusing a hash, and as I intimated above what you really want is an array that is meant to be indexed by integers.
I think you should keep an array that holds the current six-word sequence. If you add each word to the end of the array and drop them from the beginning then it will always hold the most recent six words from the file.
Something like this, perhaps
use strict;
use warnings;

my @sequence;
my %sequences;

while (<DATA>) {
    for (split) {
        push @sequence, $_;
        if (@sequence >= 6) {
            shift @sequence while @sequence > 6;
            ++$sequences{"@sequence"};
        }
    }
}

use Data::Dump;
dd \%sequences;
__DATA__
a b c d e f
g h i j k
l m n o
p q r
s t
u
output
{
  "a b c d e f" => 1,
  "b c d e f g" => 1,
  "c d e f g h" => 1,
  "d e f g h i" => 1,
  "e f g h i j" => 1,
  "f g h i j k" => 1,
  "g h i j k l" => 1,
  "h i j k l m" => 1,
  "i j k l m n" => 1,
  "j k l m n o" => 1,
  "k l m n o p" => 1,
  "l m n o p q" => 1,
  "m n o p q r" => 1,
  "n o p q r s" => 1,
  "o p q r s t" => 1,
  "p q r s t u" => 1,
}

Why can't my Perl code implement the reverse function?

Here is my code named reverse.pl
#!usr/bin/perl -w
use 5.016;
use strict;

while (my $line = <>)
{
    my @array = ();
    push (@array, $line);
    @array = reverse @array;
    say @array;
}
Test file named a.txt
A B C D
E F G H
I J K L
M N O P
Q R S T
My command is perl reverse.pl a.txt
Why doesn't it reverse the contents of each line?
I want the result to be:
D C B A
H G F E
and so on.
In scalar context, reverse concatenates its arguments into a string and reverses that string.
In list context, reverse reverses the order of the elements of the list, but not each scalar within the list.
You explicitly turn your scalar $line into a list with one item and then reverse the order of the items.
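A minimal illustration of the two contexts (made-up data):

my @list = ('abc', 'def');
my @r = reverse @list;      # ('def', 'abc'): element order reversed
my $s = reverse 'hello';    # 'olleh': string reversed in scalar context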
Try this:
#!/usr/bin/perl
use 5.016;
use strict;

while (my $line = <>) {
    chomp($line);
    say scalar reverse $line;
}
If you have an array and want to reverse each element (but not the order of the elements), use map:
my @array = qw(Alpha Beta Gamma);
@array = map { scalar reverse $_ } @array;
print "@array\n";
If you want to do both (reverse each element and the elements themselves), do:
@array = map { scalar reverse $_ } reverse @array;
or:
@array = reverse map { scalar reverse $_ } @array;
When you say:
push @array, $line;
You're creating an array of one value that's equal to the line.
$array[0] = "A B C D";
When you say:
@array = reverse @array;
You are reversing that single-member array. The first element becomes the last, and the last element becomes the first, etc. However, you only have one element, so there's nothing to reverse.
What you want to do is create an array with your line:
my @array = split /\s+/, $line;
This will create an array with each whitespace-separated field as a separate element of the array. For example, your first line:
$array[0] = "A";
$array[1] = "B";
$array[2] = "C";
$array[3] = "D";
Now, if you use reverse on this array, you'll get:
$array[0] = "D";
$array[1] = "C";
$array[2] = "B";
$array[3] = "A";
Here's the program:
use strict;
use warnings;
use feature qw(say);

while ( my $line = <> ) {
    chomp $line;
    my @array = split /\s+/, $line;
    say join " ", reverse @array;
}
The join function takes a separator and a list, and joins the list's elements into a single string, thus rebuilding your line.
By the way, I could have done this:
@array = reverse @array;
say "@array"; # Quotes are important!
This is because Perl will automatically join an array with whatever character is in $". This is a Perl variable that is used for joining arrays when that array is placed in quotation marks, and the default value is a single space.
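For example, here is a small self-contained sketch with a hypothetical alternative separator:

use strict;
use warnings;

my @array = qw(D C B A);
{
    local $" = ', ';      # temporarily change the list separator
    print "@array\n";     # prints: D, C, B, A
}
print "@array\n";         # prints: D C B A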
Personally, I rather prefer say join " ", reverse @array;. It's more obvious what is going on, and it doesn't depend on the value of rarely used variables.

In Perl, how do I sort by frequency of a value?

I am trying to create a program to count the different values that occur in a column of a data file. So, if the possible values of the column are A, B, and C, the output would be something like
A 456
B 234
C 344
I have been able to get the running counts of A, B, and C easily by doing something like this:
my %count;
for my $f (@ffile) {
    open F, $f or die "Cannot open $f: $!";
    while (<F>) {
        chomp;
        my @U = split / /;
        $count{$U[2]}++;
    }
}
foreach my $w (sort keys %count) {
    print "$w\t$count{$w}\n";
}
For instance here I am counting the second column of the file in the path given.
How do I sort the output by the counts rather than by the keys (the values A, B, C) to get the following?
A 456
C 344
B 234
This is a FAQ:
perldoc -q sort
use warnings;
use strict;
my %count = (
    A => 456,
    B => 234,
    C => 344,
);

for my $w (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$w\t$count{$w}\n";
}
__END__
A 456
C 344
B 234
for my $w (sort { $count{$b} <=> $count{$a} } keys %count) {
    print "$w\t$count{$w}\n";
}
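When two keys have the same count, their relative order is unspecified; a common refinement (shown here as a sketch) breaks ties alphabetically:

for my $w (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$w\t$count{$w}\n";
}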
Some additional comments:
The output is something like...by doing something like this
You help us help you if you paste your actual code, abbreviated where possible.
When people recreate their actual code, they often obscure or omit the very source of their problem.
chomp;
my @U = split / /;
This splits on space characters and looks for the count after the second space; it's often easier to do:
my @U = split ' ';
split used with a literal space instead of a regex splits on any sequence of whitespace, like split /\s+/ except that it also ignores leading whitespace. This is a common enough operation that there is special syntax for it. Note that the chomp becomes unnecessary.
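A small demonstration of the difference (made-up input):

my $line = "  A  B  C\n";
my @with_regex = split / /, $line;   # ('', '', 'A', '', 'B', '', "C\n"): empty leading fields, newline kept
my @with_space = split ' ', $line;   # ('A', 'B', 'C'): leading whitespace and the newline ignored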