Remove duplicates from list of files in Perl

I know this should be pretty simple and the shell version is something like:
$ sort example.txt | uniq -u
in order to remove duplicate lines from a file. How would I go about doing this in Perl?

The interesting spin on this question is the uniq -u! I don't think the other answers I've seen tackle this; they deal with sort -u example.txt or (somewhat wastefully) sort example.txt | uniq.
The difference is that the -u option eliminates all occurrences of duplicated lines, so the output is of lines that appear only once.
To tackle this, you need to know how many times each name appears, and then you need to print the names that appear just once. Assuming the list is to be read from standard input, then this code does the trick:
my %counts;
while (<>)
{
chomp;
$counts{$_}++;
}
foreach my $name (sort keys %counts)
{
print "$name\n" if $counts{$name} == 1;
}
Or, using grep:
my %counts;
while (<>)
{
chomp;
$counts{$_}++;
}
{
local $, = "\n";
print grep { $counts{$_} == 1 } sort keys %counts;
}
Or, if you don't need to remove the newlines (because you're only going to print the names):
my %counts;
$counts{$_}++ for (<>);
print grep { $counts{$_} == 1 } sort keys %counts;
If you do in fact want every name that appears in the input to appear in the output (but only once), then any of the other solutions will do the trick, perhaps with minimal adaptation. In fact, since the input lines will end with a newline, you can generate the answer in just two lines:
my %counts = map { $_, 1 } <>;
print sort keys %counts;
No, you can't do it in one by simply replacing %counts in the print line with the map in the first line:
print sort keys map { $_, 1 } <>;
You get the error:
Type of arg 1 to keys must be hash or array (not map iterator) at ...
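For what it's worth, a workaround (a sketch, and arguably no clearer than the two-line version) is to build an anonymous hash and dereference it in place:
print sort keys %{ +{ map { $_, 1 } <> } };   # the +{ ... } forces an anonymous hash, which keys can then dereference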

Or use the uniq sub from the List::MoreUtils module after reading all the file into a list (although it's not a good solution).
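A minimal sketch of that approach; note that uniq keeps one copy of every line, so this removes repeats rather than mimicking uniq -u, which keeps only lines that occur exactly once:
use List::MoreUtils qw(uniq);
my @lines = <>;                 # slurp all input lines
chomp @lines;
print "$_\n" for uniq @lines;   # first occurrence of each line, in input order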

Are you wanting to update a list of files to remove duplicate lines?
Or process a list of files, ignoring duplicate lines?
Or remove duplicate filenames from a list?
Assuming the latter:
my %seen;
@filenames = grep !$seen{$_}++, @filenames;
or other solutions from perldoc -q duplicate

First of all, sort -u xxx.txt would have been smarter than sort | uniq -u.
Second, perl -ne 'print unless $seen{$_}++' is prone to integer overflow, so a more sophisticated variant, perl -ne 'if(!$seen{$_}){print;$seen{$_}=1}', seems preferable.


How to sort file based on line length including whitespace Perl [duplicate]

This question already has answers here:
Sort a text file by line length including spaces
(13 answers)
Closed 4 years ago.
I'm trying to sort a file based on line length:
my name is tim
I like pineapples alot
hi
into something like this
hi
my name is tim
i like pineapples alot
I have to include whitespace in there too, so what I have tried doing is putting each line into an array, one line = one string in the array. Then I tried to sort the array, but that didn't work out too well.
This (from one of the existing "duplicate" answers):
perl -ne 'push @a, $_; END{ print sort { length $a <=> length $b } @a }' file
And this:
#!/usr/bin/env perl
use warnings;
use strict;
my @arr;
sub compareLength
{
length $::a <=> length $::b;
}
while (<>)
{
push @arr, "$_";
}
print (sort {compareLength} @arr);
Are basically the exact same thing. -n creates the while(<>), END puts stuff after it, and the comparison function is anonymous (vs named) and inline. (But it still exists and is still a subroutine).
Oh, and I renamed @a to @arr, because I don't like two different variables both named a (@a and $a)
I hope this helps.

Perl: Find a match, remove the same lines, and to get the last field

Being a Perl newbie, please pardon me for asking this basic question.
I have a text file @server1 that shows a bunch of sentences (white space is the field separator) on many lines in the file.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my #allmatchedlines;
open(output1, "ssh user1@server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
@allmatchedlines = $_ if /mysearch/;
}
close(output1);
my @uniqmatchedline = split(/ /, @allmatchedlines);
my $lastfield = "$uniqmatchedline[-1]\n";
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!
my @uniqmatchedline = split(/ /, @allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
You need to split on each individual line. Something like this:
my @uniqmatchedline = map { split(/ /, $_) } @allmatchedlines;
There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my @allmatchedlines;
open(output1, "ssh user1@server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
push @allmatchedlines, $_ if /mysearch/;
}
close(output1);
my %existing;
my @uniqmatchedline = grep !$existing{$_}++, @allmatchedlines; # this will return the unique lines
my @lastfields = map { ((split / /, $_)[-1]) . "\n" } @uniqmatchedline; # this maps the last field in each line into an array
print for @lastfields;
Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq);
my $file = '/tmp/myfile.txt';
my $cmd = "ssh user1\@server1 cat $file";
open my $fh, '-|', $cmd // die "Error opening $cmd: $!"; # /
my @allmatchedlines;
while (<$fh>) {
chomp;
push @allmatchedlines, $_ if /mysearch/;
}
close $fh;
my @unique_matched_lines = uniq @allmatchedlines;
my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open, with error checking. Recall that open for a process involves a fork and returns the pid, so an "error" doesn't at all relate to what happened with the command itself. See open. (The # / merely turns off wrong syntax highlighting.) Also note that @ under "..." indicates an array and thus needs to be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
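A small illustration of the difference, with hypothetical data:
my $line = "a  b   c";            # runs of spaces between fields
my @single = split / /, $line;    # ('a', '', 'b', '', '', 'c') -- empty fields for the extra spaces
my @any_ws = split ' ', $line;    # ('a', 'b', 'c') -- splits on any whitespace run, skipping leading whitespace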
For more comments please see the original post below.
The statement @allmatchedlines = $_ if /mysearch/; on every iteration assigns to the array, overwriting whatever has been in it. So you end up with only the last line that matched mysearch. You want push @allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of @allmatchedlines – which is 1 as explained above. You should have
my @words_in_matched_lines = map { split } @allmatchedlines;
When all this is straightened out, you'll have words in the array @uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my @unique_elems = uniq @whole_array;

Sort 2nd Field Descending from text file perl

I have a text file, tab delimited that looks like this in the format, name and age:
chris 19
bobby 29
doofus 67
I wanted to pull in the text file, and then sort via the second field. I can pull in the text file, and format the data, but I can't sort it right and as such have removed the sort code I had...
Here is the simple file pull; how could I modify it?
open (FILEHERE, 'ages.txt');
while (<FILEHERE>) {
chomp;
my($n, $s) = split("\t");
print "$a\t $s";
}
close (FILEHERE);
A Schwartzian transform (ST) can help here:
use strict;
use warnings;
my $data = <<END;
chris 19
doofus 67
bobby 29
END
open my $fh, '<', \$data or die $!;
print map $_->[0],
sort { $a->[1] <=> $b->[1] }
map { [ $_, /(\d+)$/ ] }
<$fh>;
close $fh;
Output:
chris 19
bobby 29
doofus 67
Read from the bottom of the ST up. The routine takes a file line, and then within map places that line as the first element of an anonymous array. The second element is the captured numeric value, from the second column. The sort takes an anonymous subroutine to sort on the anonymous array's second element (thus, the dereferencing arrow operator $a->[1]). The results are passed to map to access the sorted lines and those are finally printed.
Hope this helps!
You could read the file into an array of array references and then sort based on each array's second field:
my @lines;
open (FILEHERE, 'ages.txt');
while(<FILEHERE>) {
push @lines, [split /\t/];
}
my @sorted = sort { $b->[1] <=> $a->[1] } @lines;
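To display the result, a minimal follow-up might be (note that the second field of each row still carries its original newline, since the loop never chomps):
print "$_->[0]\t$_->[1]" for @sorted;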
Or, what might be easier is to write your Perl script assuming that your data is sorted properly, and just read from stdin: sort -grk2 ages.txt | perl yourscript.pl
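A hypothetical yourscript.pl along those lines could be as simple as this sketch; it relies entirely on the shell sort having done the ordering:
while (<STDIN>) {
    chomp;
    my ($name, $age) = split /\t/;
    print "$name\t$age\n";   # rows already arrive sorted by age, descending
}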
This one liner from How to sort an array or table by column in perl? should work:
perl -anE 'push @t,[@F]}{ say "@$_" for sort {$a->[1] <=> $b->[1]} @t' names.txt
As with @reo katoa it uses an array of arrays - but leverages -a to autosplit the lines into @F first. See perlrun for details on autosplit.
You can also call sort -k 2,2 in perl to sort the file on 2nd field, of course use -n if they are numbers and -r to do a reverse sort.
I use the following one-liner to see squid access logs, it shows the longest sessions at top
sort -rn -k 2,2 access.log | perl -lpe 's/^([0-9]{10})(.\d{3})/scalar localtime$1/e'

How can I determine if an element exists in an array (perl)

I'm looping through an array, and I want to test if an element is found in another array.
In pseudo-code, what I'm trying to do is this:
foreach $term (@array1) {
if ($term is found in @array2) {
#do something here
}
}
I've got the "foreach" and the "do something here" parts down-pat ... but everything I've tried for the "if term is found in array" test does NOT work ...
I've tried grep:
if grep {/$term/} @array2 { #do something }
# this test always succeeds for values of $term that ARE NOT in @array2
if (grep(/$term/, @array2)) { #do something }
# this test likewise succeeds for values NOT IN the array
I've tried a couple different flavors of "converting the array to a hash" which many previous posts have indicated are so simple and easy ... and none of them have worked.
I am a long-time low-level user of perl, I understand just the basics of perl, do not understand all the fancy obfuscated code that comprises 99% of the solutions I read on the interwebs ... I would really, truly, honestly appreciate any answers that are explicit in the code and provide a step-by-step explanation of what the code is doing ...
... I seriously don't grok $_ and any other kind or type of hidden, understood, or implied value, variable, or function. I would really appreciate it if any examples or samples have all variables and functions named with clear terms ($term as opposed to $_) ... and describe with comments what the code is doing so I, in all my mentally deficient glory, may hope to possibly understand it some day. Please. :-)
...
I have an existing script which uses 'grep' somewhat succesfully:
$rc=grep(/$term/, @array);
if ($rc eq 0) { #something happens here }
but I applied that EXACT same code to my new script and it simply does NOT succeed properly ... i.e., it "succeeds" (rc = zero) when it tests a value of $term that I know is NOT present in the array being tested. I just don't get it.
The ONLY difference in my 'grep' approach between 'old' script and 'new' script is how I built the array ... in old script, I built array by reading in from a file:
@array=`cat file`;
whereas in new script I put the array inside the script itself (coz it's small) ... like this:
#array=("element1","element2","element3","element4");
How can that result in different output of the grep function? They're both bog-standard arrays! I don't get it!!!! :-(
########################################################################
addendum ... some clarifications or examples of my actual code:
########################################################################
The term I'm trying to match/find/grep is a word element, for example "word123".
This exercise was just intended to be a quick-n-dirty script to find some important info from a file full of junk, so I skip all the niceties (use strict, warnings, modules, subroutines) by choice ... this doesn't have to be elegant, just simple.
The term I'm searching for is stored in a variable which is instantiated via split:
foreach $line(@array1) {
chomp($line); # habit
# every line has multiple elements that I want to capture
($term1,$term2,$term3,$term4)=split(/\t/,$line);
# if a particular one of those terms is found in my other array 'array2'
if (grep(/$term2/, @array2) {
# then I'm storing a different element from the line into a 3rd array which eventually will be outputted
push(@known, $term1) unless $seen{$term1}++;
}
}
see that grep up there? It ain't workin right ... it is succeeding for all values of $term2 even if it is definitely NOT in array2 ... array1 is a file of a couple thousand lines. The element I'm calling $term2 here is a discrete term that may be in multiple lines, but is never repeated (or part of a larger string) within any given line. Array2 is about a couple dozen elements that I need to "filter in" for my output.
...
I just tried one of the below suggestions:
if (grep $_ eq $term2, @array2)
And this grep failed for all values of $term2 ... I'm getting an all or nothing response from grep ... so I guess I need to stop using grep. Try one of those hash solutions ... but I really could use more explanation and clarification on those.
This is in perlfaq. A quick way to do it is
my %seen;
$seen{$_}++ for @array1;
for my $item (@array2) {
if ($seen{$item}) {
# $item is in both arrays, do something
}
}
If letter case is not important, you can set the keys with $seen{ lc($_) } and check with if ($seen{ lc($item) }).
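For example, a case-insensitive version of the same lookup might look like this (a sketch):
my %seen;
$seen{ lc($_) }++ for @array1;      # store lower-cased keys
for my $item (@array2) {
    if ( $seen{ lc($item) } ) {     # compare lower-cased
        # $item is in both arrays, ignoring case
    }
}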
ETA:
With the changed question: If the task is to match single words in @array2 against whole lines in @array1, the task is more complicated. Trying to split the lines and match against hash keys will likely be unsafe, because of punctuation and other such things. So, a regex solution will likely be the safest.
Unless @array2 is very large, you might do something like this:
my $rx = join "|", #array2;
for my $line (#array1) {
if ($line =~ /\b$rx\b/) { # use word boundary to avoid partial matches
# do something
}
}
If @array2 contains meta characters, such as *?+|, you have to make sure they are escaped, in which case you'd do something like:
my $rx = join "|", map quotemeta, @array2;
# etc
You could use the (infamous) "smart match" operator, provided you are on 5.10 or later:
#!/usr/bin/perl
use strict;
use warnings;
my @array1 = qw/a b c d e f g h/;
my @array2 = qw/a c e g z/;
print "a in \@array1\n" if 'a' ~~ @array1;
print "z in \@array1\n" if 'z' ~~ @array1;
print "z in \@array2\n" if 'z' ~~ @array2;
The example is very simple, but you can use an RE if you need to as well.
I should add that not everyone likes ~~ because there are some ambiguities and, um, "undocumented features". Should be OK for this though.
This should work.
#!/usr/bin/perl
use strict;
use warnings;
my @array1 = qw/a b c d e f g h/;
my @array2 = qw/a c e g z/;
for my $term (@array1) {
if (grep $_ eq $term, @array2) {
print "$term found.\n";
}
}
Output:
a found.
c found.
e found.
g found.
#!/usr/bin/perl
@ar = ( '1','2','3','4','5','6','10' );
@arr = ( '1','2','3','4','5','6','7','8','9' ) ;
foreach $var ( @arr ){
print "$var not found\n " if ( ! ( grep /$var/, @ar )) ;
}
Pattern matching is the most efficient way of matching elements. This would do the trick. Cheers!
print "$element found in the array\n" if ("#array" =~ m/$element/);
Your 'actual code' shouldn't even compile:
if (grep(/$term2/, @array2) {
should be:
if (grep (/$term2/, @array2)) {
You have unbalanced parentheses in your code. You may also find it easier to use grep with a callback (code reference) that operates on its arguments (the array.) It helps keep the parenthesis from blurring together. This is optional, though. It would be:
if (grep {/$term2/} @array2) {
You may want to use strict; and use warnings; to catch issues like this.
The example below might be helpful; it tries to see if any element in @array_sp is present in @my_array:
#! /usr/bin/perl -w
@my_array = qw(20001 20003);
@array_sp = qw(20001 20002 20004);
print "@array_sp\n";
foreach $case(@my_array){
if("@array_sp" =~ m/$case/){
print "My God!\n";
}
}
Using pattern matching can solve this. Hope it helps.
-QC
1. grep with eq, then:
if (grep {$_ eq $term2} @array2) {
print "$term2 exists in the array";
}
2. grep with regex, then:
if (grep {/$term2/} @array2) {
print "element with pattern $term2 exists in the array";
}

How do I print unique elements in Perl array?

I'm pushing elements into an array during a while statement. Each element is a teacher's name. There ends up being duplicate teacher names in the array when the loop finishes. Sometimes they are not right next to each other in the array, sometimes they are.
How can I print only the unique values in that array after it's finished getting values pushed into it? Without having to parse the entire array each time I want to print an element.
Here's the code after everything has been pushed into the array:
$faculty_len = @faculty;
$i=0;
while ($i != $faculty_len)
{
printf $fh '"'.$faculty[$i].'"';
$i++;
}
use List::MoreUtils qw/ uniq /;
my @unique = uniq @faculty;
foreach ( @unique ) {
print $_, "\n";
}
Your best bet would be to use a (basically) built-in tool, like uniq (as described by innaM).
If you don't have the ability to use uniq and want to preserve order, you can use grep to simulate that.
my %seen;
my @unique = grep { ! $seen{$_}++ } @faculty;
# printing, etc.
This iterates over each element, using the %seen hash to count how many times each one has appeared, and grep keeps only the first occurrence of each entry. (Updated with comments by brian d foy)
I suggest pushing it into a hash, like this:
my %faculty_hash = ();
foreach my $facs (@faculty) {
$faculty_hash{$facs} = 1;
}
my @faculty_unique = keys(%faculty_hash);
#array1 = ("abc", "def", "abc", "def", "abc", "def", "abc", "def", "xyz");
#array1 = grep { ! $seen{ $_ }++ } #array1;
print "#array1\n";
This question is answered with multiple solutions in perldoc. Just type at command line:
perldoc -q duplicate
Please note: Some of the answers containing a hash will change the ordering of the array. Hashes don't have any kind of order, so getting the keys or values will make a list with an undefined ordering.
This doesn't apply to grep { ! $seen{$_}++ } @faculty.
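A quick illustration with made-up data: the grep form keeps input order, while going through hash keys does not guarantee any order:
my @faculty = qw(smith jones smith adams jones);
my %seen;
my @in_order = grep { ! $seen{$_}++ } @faculty;   # qw(smith jones adams) -- input order preserved
my %h;
$h{$_} = 1 for @faculty;
my @any_order = keys %h;                          # same three names, but in unpredictable order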
This is a one liner command to print unique lines in order it appears.
perl -ne '$seen{$_}++ || print $_' fileWithDuplicateValues
I just found this hackneyed 3-liner, enjoy:
my %uniq;
undef @uniq{@non_uniq_array};
my @uniq_array = keys %uniq;
Just another way to do it, useful only if you don't care about order:
my %hash;
@hash{@faculty}=1;
my @unique=keys %hash;
If you want to avoid declaring a new variable, you can use the somewhat under-documented global variable %_
@_{@faculty}=1;
my @unique=keys %_;
If you need to process the faculty list in any way, a map over the array converted to a hash for key coalescing and then sorting keys is another good way:
my @deduped = sort keys %{{ map { /.*/? ($_,1):() } @faculty }};
print join("\n", @deduped)."\n";
You process the list by changing the /.*/ regex for selecting or parsing and capturing accordingly, and you can output one or more mutated, non-unique keys per pass by making ($_,1):() arbitrarily complex.
If you need to modify the data in-flight with a substitution regex, say to remove dots from the names (s/\.//g), then a substitution according to the above pattern will mutate the original @faculty array due to $_ aliasing. You can get around $_ aliasing by making an anonymous copy of the @faculty array (see the so-called "baby cart" operator):
my @deduped = sort keys %{{ map {/.*/? do{s/\.//g; ($_,1)}:()} @{[ @faculty ]} }};
print join("\n", @deduped)."\n";
print "Unmolested array:\n".join("\n", @faculty)."\n";
In more recent versions of Perl, you can pass keys a hashref, and you can use the non-destructive substitution:
my @deduped = sort keys { map { /.*/? (s/\.//gr,1):() } @faculty };
Otherwise, the grep or $seen{$_}++ solutions elsewhere may be preferable.