Sort CSV based on a certain column? - perl

I'm sure I've done this in the past and there is something small I'm forgetting, but how can I sort a CSV file on a certain column? I'm interested in answers with and without 3rd party Perl modules. Mainly methods without, since I don't always have access to install additional modules.
Example data:
name,25,female
name,24,male
name,27,female
name,21,male
desired end result after sorting on the 2nd numeric column:
name,21,male
name,24,male
name,25,female
name,27,female

As CSV is a pretty complex format, it is better to use a module that does the work for us.
Following is an example using the Text::CSV module:
#!/usr/bin/env perl
use strict;
use warnings;
use constant AGE => 1;
use Text::CSV;
my $csv = Text::CSV->new();
my @rows;
while ( my $row_ref = $csv->getline( \*DATA ) ) {
    push @rows, $row_ref;
}

@rows = sort { $a->[AGE] <=> $b->[AGE] } @rows;

for my $row_ref (@rows) {
    $csv->combine(@$row_ref);
    print $csv->string(), "\n";
}
__DATA__
name,25,female
name,24,male
name,27,female
name,21,male

In the spirit of there always being another way to do it, bear in mind that plain old GNU sort might be enough.
$ sort -t, -k2 -n unsorted.txt
name,21,male
name,24,male
name,25,female
name,27,female
Where the command line args are:
-t,   # use comma as the field separator
-k2   # sort on the second field of the line
-n    # sort using numeric comparison (like using <=> instead of cmp in Perl)
If you want a Perl solution, wrap it in qx() ;-)
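For example, a minimal sketch of that (assuming a Unix-like environment with GNU sort on the PATH, and the data saved as unsorted.txt):
# Hedged sketch: shell out to GNU sort and capture its output in Perl.
my @sorted = qx(sort -t, -k2 -n unsorted.txt);
print @sorted;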

There is also DBD::CSV:
#!/usr/bin/perl
use strict; use warnings;
use DBI;
my $dbh = DBI->connect('dbi:CSV:', undef, undef, {
    RaiseError => 1,
    f_ext      => '.csv',
    csv_tables => { test => { col_names => [qw' name age sex '] } },
});

my $sth = $dbh->prepare(q{
    SELECT name, age, sex FROM test ORDER BY age
});

$sth->execute;

while ( my @row = $sth->fetchrow_array ) {
    print join(',' => @row), "\n";
}

$sth->finish;
$dbh->disconnect;
Output:
name,21,male
name,24,male
name,25,female
name,27,female

The original poster asked for no third-party modules (which I take to mean nothing from CPAN). Whilst this is a restriction that will horribly limit your ability to write good modern Perl code, in this instance it's possible using the (core) Text::ParseWords module in place of the (non-core) Text::CSV. So, borrowing heavily from Alan's example, we get:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::ParseWords;
my @rows;
while (<DATA>) {
    push @rows, [ parse_line(',', 0, $_) ];
}

@rows = sort { $a->[1] <=> $b->[1] } @rows;

foreach (@rows) {
    print join ',', @$_;
}
__DATA__
name,25,female
name,24,male
name,27,female
name,21,male

When you provide your own comparison code, you can sort on anything. Just extract the desired element with a regex, or probably a split in this case, and then compare on that. If you have a lot of elements, I would parse the data into a list of lists and then the comparison code can access it without parsing. That would eliminate parsing the same row over and over as it's compared with other rows.
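A minimal sketch of that list-of-lists approach, assuming plain comma-separated data (no quoted fields) read from a __DATA__ section like the other examples here: split each line once into an array reference, sort on the parsed age column, then re-join for output.
# Parse once, sort on element 1 (the age), then print as CSV again.
my @parsed = map { chomp; [ split /,/ ] } <DATA>;
my @sorted = sort { $a->[1] <=> $b->[1] } @parsed;
print join(',', @$_), "\n" for @sorted;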

using Raku (née Perl6)
This is a fairly quick-and-dirty solution, mainly intended for "hand-rolled" CSV. The code works as long as there's only one (1) age per row: read lines into $a, comb for 1-to-3 <digit> surrounded by commas and assign to @b, derive sorting index $c, use $c to reorder the lines in $a:
~$ raku -e 'my $a=lines(); my @b=$a.comb(/ \, <(\d**1..3)> \, /).pairs; my $c=@b.sort(*.values)>>.keys.flat; $a[$c.flat]>>.put;' sort_age.txt
name,21,male
name,24,male
name,25,female
name,27,female
I prepended a few dummy lines to the OP's input file to see how the code above reacts to 1) a blank age field, 2) a blank "" string for age, 3) a bogus "9999" for age, and 4) a bogus "NA" for age. The code above fails catastrophically. To fix this you have to write a ternary that inserts a numeric placeholder value (e.g. zero) whenever the regex fails to match a line.
Below is a longer but more robust solution. Note--I use a placeholder value of 999 to move lines with blank/invalid ages to the bottom:
~$ raku -e 'my @a=lines(); my @b = do for @a {if $_ ~~ m/ \, <(\d**1..3)> \, / -> { +$/ } else { 999 }; }; my $c=@b.pairs.sort(*.values)>>.keys.flat; @a[$c.flat]>>.put;' sort_age.txt
name,21,male
name,24,male
name,25,female
name,27,female
name,,male
name,"",female
name,9999,male
name,NA,male
To sort in reverse, add .reverse to the end of the method chain that creates $c. Again, change the else placeholder argument to move lines absent a valid age to the top or to the bottom. Also, creation of @b above can be written using the ternary operator: my @b = do for @a {(m/ \, <(\d**1..3)> \, /) ?? +$/ !! 999 };, as an alternative.
Here's the unsorted input file for posterity:
$ cat sort_age.txt
name,,male
name,"",female
name,9999,male
name,NA,male
name,25,female
name,24,male
name,27,female
name,21,male
HTH.
https://raku.org/

I would do something like this:
#!/usr/bin/perl
use warnings;
use strict;
my @rows = map { chomp; [split /[,\s]+/, $_] } <DATA>;  # read each row into an array
my @sorted = sort { $a->[1] <=> $b->[1] } @rows;        # sort the rows (numerically) by second column
for (@sorted) {
    print join(', ', @$_) . "\n";                       # print them out as CSV
}
__DATA__
name,25,female
name,24,male
name,27,female
name,21,male

Related

how to remove last single line available in file using perl

I have my data like below.
"A",1,-2,-1,-4,
"B",3,-5,-2.-5,
how to remove the last line... I am summing all the numbers but receiving a null value at the end.
Tried using chomp but did not work.
Here is the code currently being used:
while (<data>) {
    chomp(my @row = (split ',', $_, -1));
    say sum @row[1 .. $#row];
}
Try this (shell one-liner):
perl -lne '!eof() and print' file
or as part of a script :
while (defined($_ = readline ARGV)) {
    print $_ unless eof();
}
You should be using Text::CSV or Text::CSV_XS for handling comma separated value files. Those modules are available on CPAN. That type of solution would look like this:
use Text::CSV;
use List::Util qw(sum);

my $csv = Text::CSV->new({binary => 1})
    or die "Cannot use CSV: " . Text::CSV->error_diag;

while (my $row = $csv->getline($fh)) {
    next unless ($row->[0] || '') =~ m/\w/;  # Reject rows that don't start with an identifier.
    my $sum = sum(@$row[1..$#$row]);
    print "$sum\n";
}
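Note that the snippet above assumes $fh is already an open filehandle on the CSV input; a minimal sketch of supplying one (hypothetical filename) would be:
# Hypothetical filename: open the CSV input before the getline() loop above.
open my $fh, '<', 'input.csv' or die "Cannot open input.csv: $!";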
If you are stuck with a solution that doesn't use a proper CSV parser, then at least you'll need to add this to your existing while loop, immediately after your chomp:
next unless scalar(@row) && length $row[0]; # Skip empty rows.
The point of this line is to detect when a row is empty -- that is, it has no elements, or its elements were empty after the chomp.
I suspect this is an X/Y question. You think you want to avoid processing the final (empty?) line in your input when actually you should be ensuring that all of your input data is in the format you expect.
There are a number of things you can do to check the validity of your data.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'sum';
use Scalar::Util 'looks_like_number';
while (<DATA>) {
    # Chomp the input before splitting it.
    chomp;

    # Remove the -1 from your call to split().
    # This automatically removes any empty trailing fields.
    my @row = split /,/;

    # Skip lines that are empty.
    # 1/ Ensure there is data in @row.
    # 2/ Ensure at least one element in @row contains
    #    non-whitespace data.
    next unless @row and grep { /\S/ } @row;

    # Ensure that all of the data you pass to sum()
    # looks like numbers.
    say sum grep { looks_like_number $_ } @row[1 .. $#row];
}
__DATA__
"A",1.2,-1.5,4.2,1.4,
"B",2.6,-.50,-1.6,0.3,-1.3,

Parse report in blocks to CSV

I have lots of data dumps, with a pretty huge amount of data, structured as follows:
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Which I would like to transform to something like:
Key1,Key2,Key3,Key5
Value,Other value,Maybe another value yet,
Different value,,Invaluable,Has no value at all
I mean:
Generate a collection of all the keys
Generate a header line with all the Keys
Map all the values to their correct "columns" (notice that in this example I have no "Key4", and Key3/Key5 interchanged)
Possibly in Perl, since it would be easier to use in various environments.
But I am not sure if this format is unusual, or if there is a tool that already does this.
This is fairly easy using hashes and the Text::CSV_XS module:
use strict;
use warnings;
use Text::CSV_XS;
my @rows;
my %headers;

{
    local $/ = "";

    while (<DATA>) {
        chomp;
        my %record;

        for my $line (split(/\n/)) {
            next unless $line =~ /^([^:]+):\.+\s(.+)/;
            $record{$1} = $2;
            $headers{$1} = $1;
        }

        push(@rows, \%record);
    }
}

unshift(@rows, \%headers);

my $csv = Text::CSV_XS->new({binary => 1, auto_diag => 1, eol => $/});
$csv->column_names(sort(keys(%headers)));

for my $row_ref (@rows) {
    $csv->print_hr(*STDOUT, $row_ref);
}
__DATA__
Key1:.............. Value
Key2:.............. Other value
Key3:.............. Maybe another value yet
Key1:.............. Different value
Key3:.............. Invaluable
Key5:.............. Has no value at all
Output:
Key1,Key2,Key3,Key5
Value,"Other value","Maybe another value yet",
"Different value",,Invaluable,"Has no value at all"
If your CSV format is 'complicated' - e.g. it contains commas, etc. - then use one of the Text::CSV modules. But if it isn't - and this is often the case - I tend to just work with split and join.
What's useful in your scenario is that you can map key-value pairs within a record quite easily using a regex, then use a hash slice to output:
#!/usr/bin/env perl
use strict;
use warnings;
#set paragraph mode - records are blank line separated.
local $/ = "";
my @rows;
my %seen_header;

#read STDIN or files on command line, just like sed/grep
while ( <> ) {
    #multi-line pattern, that matches all the key-value pairs,
    #and then inserts them into a hash.
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    push ( @rows, \%this_row );
    #add the keys we've seen to a hash, so we 'know' what we've seen.
    $seen_header{$_}++ for keys %this_row;
}

#extract the keys, make them unique and ordered.
#could set this by hand if you prefer.
my @header = sort keys %seen_header;

#print the header row
print join ",", @header, "\n";

#iterate the rows
foreach my $row ( @rows ) {
    #use a hash slice to select the values matching @header.
    #the map is so any undefined values (missing keys) don't report errors, they
    #just return blank fields.
    print join ",", map { $_ // '' } @{$row}{@header}, "\n";
}
This, for your sample input, produces:
Key1,Key2,Key3,Key5,
Value,Other value,Maybe another value yet,,
Different value,,Invaluable,Has no value at all,
If you want to be really clever, then most of that initial building of the loop can be done with:
my @rows = map { { m/^(\w+):\.+ (.*)$/gm } } <>;
The problem then is that you would still need to build up the 'headers' array, and that means something a bit more complicated:
$seen_header{$_}++ for map { keys %$_ } @rows;
It works, but I don't think it's as clear about what's happening.
However, the core of your problem may be the file size - that's where you have a bit of a problem, because you need to read the file twice - the first time to figure out which headings exist throughout the file, and the second time to iterate and print:
#!/usr/bin/env perl
use strict;
use warnings;
open ( my $input, '<', 'your_file.txt' ) or die $!;

local $/ = "";
my %seen_header;
while ( <$input> ) {
    $seen_header{$_}++ for m/^(\w+):/gm;
}
my @header = sort keys %seen_header;

#return to the start of file:
seek ( $input, 0, 0 );

while ( <$input> ) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join ",", map { $_ // '' } @this_row{@header}, "\n";
}
This will be slightly slower, as it'll have to read the file twice. But it will have a much smaller memory footprint, because it isn't holding the whole file in memory.
Unless you know all your keys in advance, and you can just define them, you'll have to read the file twice.
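If you do know the keys in advance, a hedged sketch of that single-pass variant (key names taken from the sample data above) might look like:
#!/usr/bin/env perl
use strict;
use warnings;

#keys fixed in advance, so a single pass over the input is enough.
my @header = qw( Key1 Key2 Key3 Key5 );

#paragraph mode, as above.
local $/ = "";

print join( ',', @header ), "\n";
while ( <> ) {
    my %this_row = m/^(\w+):\.+ (.*)$/gm;
    print join( ',', map { $_ // '' } @this_row{@header} ), "\n";
}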
This seems to work with the data you've given
use strict;
use warnings 'all';
my %data;

while ( <> ) {
    next unless /^(\w+):\W*(.*\S)/;
    push @{ $data{$1} }, $2;
}

use Data::Dump;
dd \%data;
output
{
  Key1 => ["Value", "Different value"],
  Key2 => ["Other value"],
  Key3 => ["Maybe another value yet", "Invaluable"],
  Key5 => ["Has no value at all"],
}

Sorting names of files in Perl?

I'm writing a script in Perl that I want to run on all the .csv files in a given directory. The names of the files are of the type: CCCC0.csv, CCCC1.csv, ..., CCCC198.csv. However, I want Perl to first run the script on file CCCC0.csv, then on CCCC1.csv, etc. So, basically, according to the increasing value of the number at the end of the file name.
If I write:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $file;
my @files = <*.csv>;
my @orderedfiles = sort @files;
for $file (@orderedfiles) {
    ... do stuff
}
it first runs on CCCC100.csv rather than CCCC11.csv, while if I write
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $file;
my @files = <*.csv>;
my @orderedfiles = sort { substr($a, 4) <=> substr($b, 4) } @files;
for $file (@orderedfiles) {
    ... do stuff
}
it gives me a warning telling me that the arguments aren't numeric (I assume it doesn't understand that what comes after the first 4 characters is a number rather than more characters).
I have looked at the countless questions on Stack Overflow or perlmonks that deal with sorting, but I haven't been able to find an answer to my question.
EDIT: I'm using a windows machine.
You were almost there... the '.csv' is still there. You'd be better served using a regex to read just the numeric characters.
my @sorted = sort { ($a =~ /(\d+)/)[0] <=> ($b =~ /(\d+)/)[0] } @files;
There is an idiom called the Schwartzian Transform that can also do this, though it takes a CS major to understand :D
my @sorted = map  { $_->[0] }                # return the sorted file names
             sort { $a->[1] <=> $b->[1] }    # sort on the numeric portion
             map  { [ $_, /(\d+)/ ] }        # wrap the file names in a temporary
             @files;                         # array with their numeric portions.
             # ^^ read from bottom to top ^^
You could give Sort::Key::Natural a spin. From the synopsis:
use Sort::Key::Natural qw(natsort);
my @data = qw(foo1 foo23 foo6 bar12 bar1
              foo bar2 bar-45 foomatic b-a-r-45);

my @sorted = natsort @data;

print "@sorted\n";
# prints:
# b-a-r-45 bar1 bar2 bar12 bar-45 foo foo1 foo6 foo23 foomatic
I believe that substr($a, 4) is returning "100.csv" in your example, so you still need to trim the .csv suffix off it.
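A hedged sketch of that fix, extracting just the leading digits from whatever follows the 4-character prefix:
my @orderedfiles = sort {
    my ($x) = substr($a, 4) =~ /^(\d+)/;   # digits after the "CCCC" prefix
    my ($y) = substr($b, 4) =~ /^(\d+)/;
    $x <=> $y;
} @files;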

Read from input and store comma separated values in Hash

I have a Perl question like this:
Write a Perl program that will read a series of last names and phone numbers from the given input. The names and numbers should be separated by a comma. Then print the names and numbers alphabetically according to last name. Use hashes.
Any idea how to solve this?
There's more than one way to do it :)
my %phonebook;

while (<>) {
    chomp;
    my ($name, $phone) = split /,/;
    $phonebook{$name} = $phone;
}

print "$_ => $phonebook{$_}\n" for sort keys %phonebook;
Something like the following perhaps.
my %hash;

foreach (<>) {                     #reads your args from command line or input file
    my @arr = split(/\,/);         #split at comma, every line
    $hash{$arr[0]} = $arr[1];      #assign to hash
}

#print hash here
foreach my $key (sort keys %hash)  #sort and iterate
{
    print "Name: " . $key . " Number: " . $hash{$key} . "\n";
}
Tasks like this are the strength of Perl's command line switches. See perldoc perlrun for more info!
Command line input
$ perl -naF',\s*' -lE'$d{$F[0]}=$F[1];END{say"$_: $d{$_}"for sort keys%d}'
Moe, 12345
Pi, 31416
Homer, 54321
Output
Homer: 54321
Moe: 12345
Pi: 31416
Assuming that we split on commas (you should use Text::CSV generally), we can actually create this hash with a simple application of the map function and the diamond operator (<>).
#!/usr/bin/env perl
use strict;
use warnings;
my %phonebook = map { chomp; split /,/ } <>;
use Data::Dumper;
print Dumper \%phonebook;
The last two lines are just to visualize the result, and the first three lines should be in all scripts anyway. The meat of the work is all done in the one line.

Using Perl to create another Perl file

I have an input file that looks like
*firsttitle
nameA
nameB
nameC
*secondtitle
xnameA
xnameB
xnameC
I want to create a Perl script that takes this file and basically will create another perl script that looks like
#!/usr/bin/perl
use strict;
use warnings;
my %tags = (
"firsttitle" => [ qw (nameA nameB nameC) ],
"secondtitle" => [ qw (xnameA xnameB xnameC) ]);
my $rx = join '|', keys %tags;
while (<>) {
s/^\s*($rx):\s*(\d+)/$1: $tags{$1}[$2]/;
print;
}
My thought process is that I have to first print out the regular Perl code (#!, use, etc.). Then add "my %tags = (". Then take the input file and look for the *; that's the lookup key for the hash, and start parsing everything after it until the next * or end of file. If it's another *, then do it again. If it's EOF, then add ");" and end. And then finish by printing the last bit of Perl code. Help/ideas would be appreciated. If you're going to post code snippets, could you go through and explain what each part is doing? Thanks!
Very simple script. First just parse through the input file. Lines that start with * will be titles, and all the following lines up until the next *-line will be values. We put this into a hash of arrays.
The map statement gives us a list of the hash key (the title) and its values joined together with a space. We put this in an array for printing. The printing itself is done with printf, which can be a bit difficult to use, since meta characters will mess us up. Any % that is to be literal must be written as %%. I also changed single quotes from the original to double quotes. I use single quotes on the printf pattern to avoid accidental interpolation of variables.
An alternative - possibly a better one - is to not use printf at all, and simply concatenate the string in a normal fashion.
use strict;
use warnings;
my ($title, %hash);

while (<DATA>) {
    chomp;
    if (/^\*(.+)$/) {
        $title = $1;
    } else {
        push @{ $hash{$title} }, $_;
    }
}

my @args = ( map { $_, join(' ', @{$hash{$_}}) } keys %hash );
printf '#!/usr/bin/perl
use strict;
use warnings;
my %%tags = (
"%s" => [ qw ( %s ) ],
"%s" => [ qw ( %s ) ]);
my $rx = join "|", keys %%tags;
while (<>) {
s/^\s*($rx):\s*(\d+)/$1: $tags{$1}[$2]/;
print;
}', @args;
__DATA__
*firsttitle
nameA
nameB
nameC
*secondtitle
xnameA
xnameB
xnameC
Update:
This will use a different method of printing, which will be more stable.
my @args = ( map { " '$_' => [ qw ( @{$hash{$_}} ) ],\n" } keys %hash );
print '#!/usr/bin/perl
use strict;
use warnings;
my %tags = (
', @args, '
);
my $rx = join "|", keys %tags;
while (<>) {
s/^\s*($rx):\s*(\d+)/$1: $tags{$1}[$2]/;
print;
}';