Sort 2nd field descending from a text file in Perl

I have a tab-delimited text file that looks like this, in the format name and age:
chris 19
bobby 29
doofus 67
I want to pull in the text file and then sort by the second field. I can pull in the file and format the data, but I can't sort it right, and as such I have removed the sort code I had...
Here is the simple file pull; how could I modify it?
open (FILEHERE, 'ages.txt');
while (<FILEHERE>) {
chomp;
my($n, $s) = split("\t");
print "$a\t $s";
}
close (FILEHERE);

A Schwartzian transform (ST) can help here:
use strict;
use warnings;
my $data = <<END;
chris 19
doofus 67
bobby 29
END
open my $fh, '<', \$data or die $!;
print map  $_->[0],
      sort { $a->[1] <=> $b->[1] }
      map  { [ $_, /(\d+)$/ ] }
      <$fh>;
close $fh;
Output:
chris 19
bobby 29
doofus 67
Read the ST from the bottom up. The inner map takes each file line and places it as the first element of an anonymous array; the second element is the numeric value captured from the second column. sort then uses an anonymous subroutine to compare the anonymous arrays' second elements (hence the dereferencing arrow operator, $a->[1]). Finally, the outer map pulls the original lines back out of the sorted arrays and they are printed.
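Since the question asks for a descending sort on the second field, a minimal variation (a sketch, assuming the same tab-delimited ages.txt named in the question) just swaps $a and $b in the comparison:
use strict;
use warnings;
open my $fh, '<', 'ages.txt' or die "Cannot open ages.txt: $!";
print map  $_->[0],
      sort { $b->[1] <=> $a->[1] }   # descending on the captured number
      map  { [ $_, /(\d+)$/ ] }
      <$fh>;
close $fh;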
Hope this helps!

You could read the file into an array of array references and then sort based on each array's second field:
my @lines;
open (FILEHERE, 'ages.txt');
while (<FILEHERE>) {
    push @lines, [ split /\t/ ];
}
my @sorted = sort { $b->[1] <=> $a->[1] } @lines;
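To print the sorted rows back out, a short loop like this is enough (a sketch; it assumes the rows were split without chomp, so the last field still carries its newline):
for my $row (@sorted) {
    print join "\t", @$row;
}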
Or, what might be easier is to write your Perl script assuming that your data is sorted properly, and just read from stdin: sort -grk2 ages.txt | perl yourscript.pl

This one liner from How to sort an array or table by column in perl? should work:
perl -anE 'push @t,[@F]}{ say "@$_" for sort {$a->[1] <=> $b->[1]} @t' names.txt
As with @reo katoa's answer, it uses an array of arrays, but it leverages -a to autosplit each line into @F first. See perlrun for details on autosplit.
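Expanded into a full script, the one-liner corresponds roughly to this sketch (names.txt and its whitespace-separated columns are assumptions carried over from the one-liner):
use strict;
use warnings;
use feature 'say';
my @t;
while (<>) {
    my @F = split ' ';   # what -a (autosplit) does by default
    push @t, [ @F ];
}
say "@$_" for sort { $a->[1] <=> $b->[1] } @t;
Run it as perl sort_by_second_field.pl names.txt (the script name here is made up).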

You can also call sort -k 2,2 from Perl to sort the file on the 2nd field; use -n if the values are numbers and -r for a reverse sort.
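A sketch of doing that from inside Perl (assuming the tab-delimited ages.txt from the first question) opens a pipe to the external sort and reads the sorted lines back:
open my $sorted, '-|', 'sort', '-t', "\t", '-k', '2,2', '-n', '-r', 'ages.txt'
    or die "Cannot run sort: $!";
print while <$sorted>;   # each line arrives already sorted, descending and numeric
close $sorted;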
I use the following one-liner to view squid access logs; it shows the longest sessions at the top:
sort -rn -k 2,2 access.log | perl -lpe 's/^([0-9]{10})(.\d{3})/scalar localtime$1/e'

Related

Sorting hashes on value length whilst preserving order

I'm currently writing a Perl script to sort lines from stdin and print the lines in order of line length whilst preserving order for the ones that are equal.
My sorting code consists of the following:
while (my $curr_line = <STDIN>) {
chomp($curr_line);
$lines{$curr_line} = length $curr_line;
}
for my $line (sort{ $lines{$a} <=> $lines{$b} } keys %lines){
print $line, "\n";
}
For example my stdin consists of the following:
tiny line
medium line
big line
huge line
rand line
megahugegigantic line
I'd get the following output:
big line
rand line
tiny line
huge line
medium line
megahugegigantic line
Is there any way I can preserve the order for lines of equal length, such that tiny would come before huge, which comes before rand? Also, the order seems to change every time I run the script.
Thanks in advance
One possible solution
You can save the position of the line in the input file handle as well as the length. The $. magic variable (input line number) provides this. You can then sort on both values.
use strict;
use warnings;
my %lines;
while ( my $curr_line = <DATA> ) {
chomp($curr_line);
$lines{$curr_line} = [ length $curr_line, $. ];
}
for my $line (
sort {
$lines{$a}->[0] <=> $lines{$b}->[0]
|| $lines{$a}->[1] <=> $lines{$b}->[1]
} keys %lines
) {
print $line, "\n";
}
__DATA__
tiny lin1
medium line
big line
huge lin2
rand lin3
megahugegigantic line
This will always output
big line
tiny lin1
huge lin2
rand lin3
medium line
megahugegigantic line
You can of course use a hash to make the code more readable, too.
$lines{$curr_line} = {
length => length $curr_line,
position => $.,
};
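If you store the hash form shown above instead of the array ref, the sort comparison just references the named keys rather than the indices (a sketch under that assumption):
for my $line (
    sort {
        $lines{$a}{length}   <=> $lines{$b}{length}
        || $lines{$a}{position} <=> $lines{$b}{position}
    } keys %lines
) {
    print $line, "\n";
}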
Explanation of your implementation
Your results changed their order every time because of random hash ordering. The way keys returns the list of keys is random, because of the way Perl implements hashes. This is by design, and a security feature. Since there are several keys that have the same value, the sort will sometimes return different results, based on which of the equal value keys came out first.
You could mitigate this by sticking another sort in front of your keys call. That would sort the keys by name, at least making the order of the undesired result be consistent.
# vvvv
for my $line (sort{ $lines{$a} <=> $lines{$b} } sort keys %lines) { ... }
Note that you don't have to chomp the input if you put the \n back when you print; the newline adds the same length to every line anyway. If you do chomp, you should print a $/ (the input record separator that chomp removed) when you output, or you falsify your data.
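As a sketch of the no-chomp variant (the same __DATA__ section as above is assumed): every line keeps its trailing newline, so all lengths shift by the same amount and print needs no extra newline.
my %lines;
while ( my $curr_line = <DATA> ) {
    $lines{$curr_line} = [ length $curr_line, $. ];
}
print for sort {
    $lines{$a}[0] <=> $lines{$b}[0] || $lines{$a}[1] <=> $lines{$b}[1]
} keys %lines;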
Your problem is not with sort. Since v5.8, Perl's sort uses a merge sort, which is a stable sort: inputs that compare equal keep the same relative order on output as on input.
Your problem is that you are storing the lines in a hash. A hash is an unordered collection of key/value pairs, so adding the lines to the hash and then printing them out again without the sort will give you the lines in a random order.
You need to read all the lines into an array and then sort them on length, the quickest way being a Schwartzian Transformation, shown below.
my @lines = <STDIN>;
chomp(@lines);
my @sorted =                     # This is the clever bit; read it from the last map up
    map  { $_->[0] }             # Get the lines back out
    sort { $a->[1] <=> $b->[1] } # Sort on length
    map  { [ $_, length $_ ] }   # Create a list of array refs containing
                                 # the line and the length of the line
    @lines;
print join "\n", @sorted;        # Print out the sorted lines
Nowhere do you store the original order, so you can't possibly sort by it. The easiest fix is to store the lines in an array, and ensure that Perl is using a stable sort.
use feature 'say';
use sort 'stable';
my @lines = <>;
chomp(@lines);
for my $line ( sort { length($a) <=> length($b) } @lines ) {
    say $line;
}
[ ST is overkill for this. It's such overkill that it probably even slows things down! ]
As has been explained, the randomness comes from your use of hash keys to store the strings. There is no need for this, or anything more elaborate like a Schwartzian Transform, to make this work.
All Perl versions since v5.8 have used a stable sort, which will keep values that sort equally in the same order. But you can insist that the sort operator you get is a stable one using the sort pragma with
use sort 'stable'
Here's how I would write your program. It stops reading input at end of file, or when it sees a blank line, in case you want to enter the data from the keyboard.
use strict;
use warnings 'all';
use feature 'say';
use sort 'stable';
my @list;
while ( <> ) {
    last unless /\S/;
    chomp;
    push @list, $_;
}
say for sort { length $a <=> length $b } @list;
Using the same input as in the question, this produces the following output:
big line
tiny line
huge line
rand line
medium line
megahugegigantic line

How can I extract specific columns in Perl?

chr1 1 10 el1
chr1 13 20 el2
chr1 50 55 el3
I have this tab-delimited file and I want to extract the second and third columns using Perl. How can I do that?
I tried reading the file using a file handle and storing it in a string, then converting the string to an array, but it didn't get me anywhere.
My attempt is:
while (defined($line=<FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/ , $line);
}
Simply autosplit on tab
# ↓ index starts on 0
$ perl -F'\t' -lane'print join ",", @F[1,2]' inputfile
Output:
1,10
13,20
50,55
See perlrun.
use strict;
my $input=shift or die "must provide <input_file> as an argument\n";
open(my $in,"<",$input) or die "Cannot open $input for reading: $!";
while(<$in>)
{
my @tf1=split(/\t/,$_);
print "$tf1[1]|$tf1[2]\n"; # $tf1[1] is the second column and $tf1[2] is the third column
}
close($in);
What problem are you having? Your code already does all the hard parts.
while (defined($line=<FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/ , $line);
}
You have all three columns in your @tf1 array (by the way, your variable naming needs serious work!). All you need to do now is print the second and third elements from the array (but remember that Perl array elements are numbered from zero).
print "$tf1[1] / $tf1[2]\n";
It's possible to simplify your code quite a lot by taking advantage of Perl's default behaviours.
while (<FILE_HANDLE>) { # Store record in $_
my @tf1 = split(/\t/); # Declare and initialise on one line
# split() works on $_ by default
print "$tf1[1] / $tf1[2]\n";
}
Even more pithily than @daxim's, as a one-liner:
perl -aE 'say "@F[1,2]"' file
See also: How to sort an array or table by column in perl?

Variable with multiple lines: delete first two lines in Perl

I have the result of an SQL query. It returns some 10 rows, like below.
If I do the following in my Perl script:
print $result;
it gives me the output:
key value
----------- ------------------------------
1428116300 0003000
560779655 0003001
173413463 0003002
315642 0003003
1164414857 0003004
429589116 0003005
I just want the first two lines to be deleted, and to store each remaining line in an array.
Could anybody please tell me how I can achieve this?
With something like:
my #lines = split /\n/, $result;
splice #lines,0,2;
Explanations:
split /\n/, $result cuts your variable into an array of lines.
splice @lines, 0, 2 then removes the first two elements (the header and the separator line). Alternatively, grep /^[\s\d]+$/ would filter the array and keep only the lines made up of spaces and digits, which also removes the first two lines.
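For completeness, a sketch of that grep-based filtering (it assumes the data lines contain only digits and whitespace, as in the sample output above):
my @lines = grep { /^[\s\d]+$/ } split /\n/, $result;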
A data-independent, slightly roundabout way: if you print $result out to a file, you can
use Tie::File;
tie my @lines, 'Tie::File', $file or die "can't update $file: $!";
delete $lines[0];
delete $lines[1];
(untested)
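A splice-based variant (also a sketch, untested here) removes both lines in one call instead of deleting elements one at a time:
use Tie::File;
tie my @lines, 'Tie::File', $file or die "can't update $file: $!";
splice @lines, 0, 2;   # drop the first two lines in place
untie @lines;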

Remove duplicates from a list of files in Perl

I know this should be pretty simple and the shell version is something like:
$ sort example.txt | uniq -u
in order to remove duplicate lines from a file. How would I go about doing this in Perl?
The interesting spin on this question is the uniq -u! I don't think the other answers I've seen tackle this; they deal with sort -u example.txt or (somewhat wastefully) sort example.txt | uniq.
The difference is that the -u option eliminates all occurrences of duplicated lines, so the output is of lines that appear only once.
To tackle this, you need to know how many times each name appears, and then you need to print the names that appear just once. Assuming the list is to be read from standard input, then this code does the trick:
my %counts;
while (<>)
{
chomp;
$counts{$_}++;
}
foreach my $name (sort keys %counts)
{
print "$name\n" if $counts{$name} == 1;
}
Or, using grep:
my %counts;
while (<>)
{
chomp;
$counts{$_}++;
}
{
local $, = "\n";
print grep { $counts{$_} == 1 } sort keys %counts;
}
Or, if you don't need to remove the newlines (because you're only going to print the names):
my %counts;
$counts{$_}++ for (<>);
print grep { $counts{$_} == 1 } sort keys %counts;
If you do in fact want every name that appears in the input to appear in the output (but only once), then any of the other solutions will do the trick, or will with minimal adaptation. In fact, since the input lines will end with a newline, you can generate the answer in just two lines:
my %counts = map { $_, 1 } <>;
print sort keys %counts;
No, you can't do it in one by simply replacing %counts in the print line with the map in the first line:
print sort keys map { $_, 1 } <>;
You get the error:
Type of arg 1 to keys must be hash or array (not map iterator) at ...
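One way to get it down to a single expression anyway (a sketch, not part of the original answer) is to build an anonymous hash and dereference it before calling keys:
print sort keys %{{ map { $_ => 1 } <> }};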
Or use the uniq sub from the List::MoreUtils module after reading the whole file into a list (although it's not a good solution).
Are you wanting to update a list of files to remove duplicate lines?
Or process a list of files, ignoring duplicate lines?
Or remove duplicate filenames from a list?
Assuming the latter:
my %seen;
@filenames = grep !$seen{$_}++, @filenames;
or other solutions from perldoc -q duplicate
First of all, sort -u xxx.txt would have been smarter than sort | uniq -u.
Second, perl -ne 'print unless $seen{$_}++' is prone to integer overflow, so a more sophisticated way of perl -ne 'if(!$seen{$_}){print;$seen{$_}=1}' seems preferable.

How do I print unique elements in Perl array?

I'm pushing elements into an array during a while statement. Each element is a teacher's name. There ends up being duplicate teacher names in the array when the loop finishes. Sometimes they are not right next to each other in the array, sometimes they are.
How can I print only the unique values in that array after it's finished getting values pushed into it, without having to parse the entire array each time I want to print an element?
Here's the code after everything has been pushed into the array:
$faculty_len = @faculty;
$i=0;
while ($i != $faculty_len)
{
printf $fh '"'.$faculty[$i].'"';
$i++;
}
use List::MoreUtils qw/ uniq /;
my @unique = uniq @faculty;
foreach ( @unique ) {
print $_, "\n";
}
Your best bet would be to use a (basically) built-in tool, like uniq (as described by innaM).
If you don't have the ability to use uniq and want to preserve order, you can use grep to simulate that.
my %seen;
my @unique = grep { ! $seen{$_}++ } @faculty;
# printing, etc.
The %seen hash counts how many times each entry has been encountered; grep lets an element through only the first time it is seen (when its count is still zero), so order is preserved and each name appears once. (Updated with comments by brian d foy)
I suggest pushing it into a hash, like this:
my %faculty_hash = ();
foreach my $facs (@faculty) {
    $faculty_hash{$facs} = 1;
}
my @faculty_unique = keys(%faculty_hash);
#array1 = ("abc", "def", "abc", "def", "abc", "def", "abc", "def", "xyz");
#array1 = grep { ! $seen{ $_ }++ } #array1;
print "#array1\n";
This question is answered with multiple solutions in perldoc. Just type at command line:
perldoc -q duplicate
Please note: some of the answers containing a hash will change the ordering of the array. Hashes don't have any kind of order, so getting the keys or values will make a list with an undefined ordering.
This doesn't apply to grep { ! $seen{$_}++ } @faculty
This is a one-liner command to print unique lines in the order they appear.
perl -ne '$seen{$_}++ || print $_' fileWithDuplicateValues
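Expanded into a short script, that one-liner is roughly (a sketch):
my %seen;
while (<>) {
    print $_ unless $seen{$_}++;
}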
I just found this hackneyed 3-liner, enjoy:
my %uniq;
undef @uniq{@non_uniq_array};
my @uniq_array = keys %uniq;
Just another way to do it, useful only if you don't care about order:
my %hash;
@hash{@faculty}=1;
my @unique=keys %hash;
If you want to avoid declaring a new variable, you can use the somewhat under-documented global variable %_:
@_{@faculty}=1;
my @unique=keys %_;
If you need to process the faculty list in any way, a map over the array converted to a hash for key coalescing and then sorting keys is another good way:
my @deduped = sort keys %{{ map { /.*/? ($_,1):() } @faculty }};
print join("\n", @deduped)."\n";
You process the list by changing the /.*/ regex for selecting or parsing and capturing accordingly, and you can output one or more mutated, non-unique keys per pass by making ($_,1):() arbitrarily complex.
If you need to modify the data in-flight with a substitution regex, say to remove dots from the names (s/\.//g), then a substitution according to the above pattern will mutate the original @faculty array due to $_ aliasing. You can get around $_ aliasing by making an anonymous copy of the @faculty array (see the so-called "baby cart" operator):
my @deduped = sort keys %{{ map {/.*/? do{s/\.//g; ($_,1)}:()} @{[ @faculty ]} }};
print join("\n", @deduped)."\n";
print "Unmolested array:\n".join("\n", @faculty)."\n";
In more recent versions of Perl, you can pass keys a hashref, and you can use the non-destructive substitution:
my @deduped = sort keys { map { /.*/? (s/\.//gr,1):() } @faculty };
Otherwise, the grep or $seen{$_}++ solutions elsewhere may be preferable.