Sorting CSV file column value based on column headers

Sorting CSV file column value based on column headers - perl

Hi, I am a newbie to Perl scripting. I need help implementing logic to reorder the columns of a CSV file based on their headers.
Example:
S.NO,NAME,S2,S5,S3,S4,S1
1,aaaa,88,99,77,55,66
2,bbbb,66,77,88,99,55
3,cccc,55,44,77,88,66
4,dddd,77,55,66,88,99
Now I want to sort this file as follows.
Starting from a header order such as s.no,s2,s4,s5,s1,s0,name, I want the columns in an order I define, such as s.no,name,s1,s2,s3,s4,s5, and each column's values should move together with its header. How can I do this in Perl?
The required output looks like this:
S.NO,NAME,S1,S2,S3,S4,S5
1,aaaaaaa,66,88,77,55,99
2,bbbbbbb,55,66,88,77,99
3,ccccccc,66,55,77,88,44
4,ddddddd,99,77,66,88,55
Or any other header order I choose, for example:
S.NO,NAME,S5,S4,S3,S2,S1
That is, I need to reorder the column headers as required, and each column's values along with them.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $file = 'a1.csv';
my $size = 3;
my @files;
my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, sep_char => ';' });
open my $in, "<:encoding(utf8)", $file or die "$file: $!";
while (my $row = $csv->getline($in)) {
    if (not @files) {
        my $file_counter = int @$row / $size;
        $file_counter++ if @$row % $size;
        for my $i (1 .. $file_counter) {
            my $outfile = "output$i.csv";
            open my $out, ">:encoding(utf8)", $outfile or die "$outfile: $!";
            push @files, $out;
        }
    }
    my @fields = @$row;
    foreach my $i (0 .. $#files) {
        my $from = $i*$size;
        my $to = $i*$size+$size-1;
        $to = $to <= $#fields ? $to : $#fields;
        my @data = @fields[$from .. $to];
        $csv->print($files[$i], \@data);
        print {$files[$i]} "\n";
    }
}

#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Text::CSV qw();
my @headers = qw(s.no name s1 s2 s3 s4 s5);
my $csv_in = Text::CSV->new({binary => 1, auto_diag => 1});
my $csv_out = Text::CSV->new({binary => 1, auto_diag => 1});
open my $in, '<:encoding(UTF-8)', 'a1.csv';
open my $out, '>:encoding(UTF-8)', 'output1.csv';
# header() reads the first line and, by default, lowercases the column names
$csv_in->header($in);
$csv_out->say($out, [@headers]);
while (my $row = $csv_in->getline_hr($in)) {
    # postfix hash slice (Perl 5.24 or later): values in the order of @headers
    $csv_out->say($out, [$row->@{@headers}]);
}
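
The reordering step in the script above is a hash slice: each value is stored under its header name and then read back in the wanted order. A minimal sketch of just that step, using core Perl on one already-parsed row; reorder_row is a hypothetical helper, not a Text::CSV method, and the sample row is taken from the question:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::CSV): reorder the fields of one row.
# $header is the current column order, $wanted the desired order,
# $row the field values in the current order.
sub reorder_row {
    my ($header, $wanted, $row) = @_;
    my %by_name;
    @by_name{@$header} = @$row;      # hash slice: header name => value
    return [ @by_name{@$wanted} ];   # read the values back in the new order
}

my @header = qw(S.NO NAME S2 S5 S3 S4 S1);
my @wanted = qw(S.NO NAME S1 S2 S3 S4 S5);
my @row    = (1, 'aaaa', 88, 99, 77, 55, 66);

print join(',', @{ reorder_row(\@header, \@wanted, \@row) }), "\n";
# prints: 1,aaaa,66,88,77,55,99
```

For real files, parsing and writing should still go through Text::CSV so that quoted fields are handled correctly.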

The handy Text::AutoCSV module lets you rearrange the column order as a one-liner:
$ perl -MText::AutoCSV -e 'Text::AutoCSV->new(in_file=>"in.csv",out_file=>"out.csv",out_fields=>["SNO","NAME","S1","S2","S3","S5"])->write()'
$ cat out.csv
s.no,name,s1,s2,s3,s5
1,aaaa,66,55,77,99
2,bbbb,55,99,88,77
3,cccc,66,88,77,44
4,dddd,99,88,66,55
I'm not sure what your actual desired order of fields is because you have two and both of them include columns that aren't in the sample input file (It has two s2 columns; is one of them supposed to be s4?), but you should get the idea. Column names have to be all caps with special characters like . removed, but the actual names are used for the output.

my $eix = "001";
$csv_in->header($in, { munge_column_names => sub { s/^$/"E".$eix++/er } });

Related

Remove/Extract rows based on Unique/duplicate Id from a CSV file

Depending on how you look at it I need to remove rows based on if the Id is unique or extract rows if the Id has duplicates (keeping all duplicates).
And I'm unsure how to do this; I don't have enough knowledge of Perl to accomplish it. I've found similar topics but didn't have much success. These are the examples I'm using: example 1, example 2 and example 3. In a previous problem someone showed me a solution with the List::MoreUtils module, so I could merge values with a common Id. That is not the case now; this one is about removing rows if the Id is unique. I know I can probably do this with the List::MoreUtils module, but I want to do it without. This is my dummy data (example data copied from the other question, since the data doesn't matter); here you can see what I'm after. Order is not important.
Before:
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla
101;Fruits;50010;Grape;500;Red;1
101;Fruits;50020;Strawberry;500;Red;1
201;Vegetables;60010;Carrot;500;White;1
101;Fruits;50060;Apple;1000;Red;1
101;Fruits;50030;Banana;1000;Green;1
101;Fruits;50060;Apple;500;Green;1
101;Fruits;50020;Strawberry;1000;Red;1
201;Vegetables;60010;Carrot;100;Purple;1
101;Fruits;50020;Strawberry;200;Red;1
After:
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla
101;Fruits;50020;Strawberry;500;Red;1
201;Vegetables;60010;Carrot;500;White;1
101;Fruits;50060;Apple;1000;Red;1
101;Fruits;50060;Apple;500;Green;1
101;Fruits;50020;Strawberry;1000;Red;1
201;Vegetables;60010;Carrot;100;Purple;1
101;Fruits;50020;Strawberry;200;Red;1
You can see that the Grape and Banana rows, with Ids 50010 and 50030, have been removed because there is only one entry for each.
This is my script. I'm struggling with the part where I select the non-unique values from the hash and output them (taking the Text::CSV_XS module into account). Can someone show me how to do this?
#!/usr/bin/perl -w
use strict;
use warnings;
use Text::CSV_XS;
my $inputfile = shift || die "Give input and output names!\n";
my $outputfile = shift || die "Give output name!\n";
open (my $infile, '<:encoding(iso-8859-1)', $inputfile) or die "Sourcefile in use / not found :$!\n";
open (my $outfile, '>:encoding(UTF-8)', $outputfile) or die "Outputfile in use :$!\n";
my $csv_in = Text::CSV_XS->new({binary => 1,sep_char => ";",auto_diag => 1,always_quote => 1,eol => $/});
my $csv_out = Text::CSV_XS->new({binary => 1,sep_char => "|",auto_diag => 1,always_quote => 1,eol => $/});
my $header = $csv_in->getline($infile);
$csv_out->print($outfile, $header);
my %data;
while (my $elements = $csv_in->getline($infile)){
    my @columns = @{ $elements };
    my $id = $columns[2];
    push @{ $data{$id} }, \@columns;
}
for my $id ( sort keys %data ){ # Sort not important
    if @{ $data{$id} } > 1 # Here I have no idea anymore..
    $csv_out->print($outfile, \@columns); #
}

Rather than loading a hash with the entire dataset, I think I'd go ahead and read the file twice, loading a hash with just your ID values. This will definitely take longer, but as your file grows, there may be disadvantages of having all of that data in memory.
That said, I did not use Text::CSV_XS but this is a notional idea of what I had in mind.
my %count;
open (my $infile, '<:encoding(iso-8859-1)', $inputfile) or die;
open (my $outfile, '>:encoding(UTF-8)', $outputfile) or die;
while (<$infile>) {
next if $. == 1;
my ($id) = (split /;/, $_, 4)[2];
$count{$id}++;
}
seek $infile, 0, 0;
$. = 0; # seek does not reset the line counter
while (<$infile>) {
    my @fields = split /;/;
    print $outfile join '|', @fields if $count{$fields[2]} > 1 or $. == 1;
}
close $infile;
close $outfile;
The $. == 1 at the end is so you don't lose your header row.
-- EDIT --
#!/usr/bin/perl -w
use strict;
use warnings;
use Text::CSV_XS;
my $inputfile = shift || die "Give input and output names!\n";
my $outputfile = shift || die "Give output name!\n";
open (my $infile, '<:encoding(iso-8859-1)', $inputfile) or die;
open (my $outfile, '>:encoding(UTF-8)', $outputfile) or die;
my $csv_in = Text::CSV_XS->new({binary => 1,sep_char => ";",
auto_diag => 1,always_quote => 1,eol => $/});
my $csv_out = Text::CSV_XS->new({binary => 1,sep_char => "|",
auto_diag => 1,always_quote => 1,eol => $/});
my ($count, %count) = (1);
while (my $elements = $csv_in->getline($infile)){
$count{$$elements[2]}++;
}
seek $infile, 0, 0;
while (my $elements = $csv_in->getline($infile)){
$csv_out->print($outfile, $elements)
if $count{$$elements[2]} > 1 or $count++ == 1;
}
close $infile;
close $outfile;
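
The count-then-filter idea behind both versions can be sketched on rows that are already parsed. This is only an illustration in core Perl: keep_duplicates is a hypothetical helper, and unlike the two-pass file read above it holds all rows in memory, so it suits small inputs or testing the logic.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only rows whose value in column $col occurs
# more than once. $rows is an array ref of array refs (parsed rows,
# without the header line).
sub keep_duplicates {
    my ($rows, $col) = @_;
    my %count;
    $count{ $_->[$col] }++ for @$rows;                     # pass 1: count each Id
    return [ grep { $count{ $_->[$col] } > 1 } @$rows ];   # pass 2: filter
}

my @rows = (
    [101, 'Fruits',     50010, 'Grape'],
    [101, 'Fruits',     50020, 'Strawberry'],
    [201, 'Vegetables', 60010, 'Carrot'],
    [101, 'Fruits',     50020, 'Strawberry'],
    [201, 'Vegetables', 60010, 'Carrot'],
);

# Grape (Id 50010) occurs once and is dropped; input order is preserved.
print join(';', @$_), "\n" for @{ keep_duplicates(\@rows, 2) };
```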

How to take only latest uniq record based on any column

I am writing a script in Perl but got stuck on one part. Below is a sample of my CSV file.
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"
"MUM","918652624178","20150806125005","","prepaid","","2G","NEW"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
Now I need to take unique records based on the 2nd column (the mobile number), keeping only the record with the latest value in the 3rd column (the timestamp).
For example, for mobile number "918120197922":
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
it should select the 3rd record, as it has the latest timestamp (20150806125005). Please help.
Additional info:
Sorry for the inconsistency in the data; I have corrected it now.
Yes, the data is in order, which means the latest timestamp appears in the last rows.
One more thing: my file is larger than 1 GB, so is there an efficient way to do this? Would awk be faster than Perl in this case?

Use Text::CSV to process CSV files.
Hash the lines by the 2nd column, only keep the most recent one in the hash.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;
my $csv = 'Text::CSV'->new() or die 'Text::CSV'->error_diag;
my %hash;
open my $CSV, '<', '1.csv' or die $!;
while (my $row = $csv->getline($CSV)) {
    my ($number, $timestamp) = @$row[1, 2];
    # Store the row if the timestamp is more recent than the stored one.
    $hash{$number} = $row if $timestamp gt ($hash{$number}[2] || q());
}
$csv->eol("\n");
$csv->always_quote(1);
open my $OUT, '>', 'uniq.csv' or die $!;
for my $row (values %hash) {
$csv->print($OUT, $row);
}
close $OUT or die $!;
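
The hashing step can be tried on its own with rows that are already parsed. A minimal sketch in core Perl, assuming fixed-width timestamps so that string comparison orders them correctly; latest_per_key is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: for each distinct value in column $key_col keep the
# row with the greatest value in column $ts_col. Uses string comparison,
# which suits fixed-width timestamps such as YYYYMMDDhhmmss.
sub latest_per_key {
    my ($rows, $key_col, $ts_col) = @_;
    my %latest;
    for my $row (@$rows) {
        my $key = $row->[$key_col];
        $latest{$key} = $row
            if !exists $latest{$key}
            or $row->[$ts_col] gt $latest{$key}[$ts_col];
    }
    return \%latest;
}

my @rows = (
    ['MP', '918120197922', '20150806125001'],
    ['GJ', '919904303790', '20150806125002'],
    ['MP', '918120197922', '20150806125005'],
    ['MP', '918120197922', '20150806125004'],
);

my $latest = latest_per_key(\@rows, 1, 2);
print join(',', @{ $latest->{$_} }), "\n" for sort keys %$latest;
# prints:
# MP,918120197922,20150806125005
# GJ,919904303790,20150806125002
```

Only one row per phone number is held in the hash, so memory use grows with the number of distinct numbers rather than with the file size.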

If you know your data is ordered by timestamp, you can exploit this: read the rows backwards, and the task becomes printing the first occurrence of each phone number.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;
use constant PHONENUM_FIELD => 1;
my $filename = shift;
die "Usage: $0 <filename>\n" unless defined $filename;
open my $in, '-|', 'tac', $filename;
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );
my %seen;
while ( my $row = $csv->getline($in) ) {
$csv->print( *STDOUT, $row ) unless $seen{ $row->[PHONENUM_FIELD] }++;
}
If you would like the output in the same order as the input, you can write into tac as well:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;
use constant PHONENUM_FIELD => 1;
my $filename = shift;
die "Usage: $0 <filename>\n" unless defined $filename;
open my $in, '-|', 'tac', $filename;
open my $out, '|-', 'tac';
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );
my %seen;
while ( my $row = $csv->getline($in) ) {
$csv->print( $out, $row ) unless $seen{ $row->[PHONENUM_FIELD] }++;
}
1 GB should not be a problem on any decent hardware. On my old notebook it took 2m3.393s to process 29360128 rows (1.8 GB), which is more than 230k rows/s, but YMMV. Add always_quote => 1 to the $csv constructor parameters if you want all values quoted in the output.
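
The same reverse-and-deduplicate logic can be sketched without tac for data that fits in memory. This is an illustration only, in core Perl on already-parsed rows; last_occurrence_per_key is a hypothetical helper, and the rows are assumed to be sorted by ascending timestamp:

```perl
use strict;
use warnings;

# Hypothetical helper: rows are assumed to be sorted by ascending timestamp.
# Walking them in reverse and keeping the first row seen per key yields the
# latest row per key; the final reverse restores the input order.
sub last_occurrence_per_key {
    my ($rows, $key_col) = @_;
    my %seen;
    my @kept = grep { !$seen{ $_->[$key_col] }++ } reverse @$rows;
    return [ reverse @kept ];
}

my @rows = (
    ['MP', '918120197922', '20150806125001'],
    ['GJ', '919904303790', '20150806125002'],
    ['MP', '918120197922', '20150806125005'],
);

print join(',', @$_), "\n" for @{ last_occurrence_per_key(\@rows, 1) };
# prints:
# GJ,919904303790,20150806125002
# MP,918120197922,20150806125005
```

For a file of more than 1 GB the tac pipeline above is the better fit, since it never holds all rows at once.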

Use CSV_XS to read in a .CSV file and select specified columns

I would like to read in a .csv file using CSV_XS, then select the columns whose headers match those stored in an array, and output a new .csv file.
use strict;
use warnings;
use Text::CSV_XS;
my $csvparser = Text::CSV_XS->new () or die "".Text::CSV_XS->error_diag();
my $file;
my @headers;
foreach $file (@args){
    my @CSVFILE;
    my $csvparser = Text::CSV_XS->new () or die "".Text::CSV_XS->error_diag();
    for my $line (@csvfileIN) {
        $csvparser->parse($line);
        my @fields = $csvparser->fields;
        $line = $csvparser->combine(@fields);
    }
}

The following example parses a CSV file into a variable; you can then match, remove, or add lines in that variable and write it back to the same CSV file.
In this example I just remove one entry line from the CSV.
First, parse the CSV file.
use Text::CSV_XS qw( csv );
$parsed_file_array_of_hashesv = csv(
in => "$input_csv_filename",
sep => ';',
headers => "auto"
); # as array of hash
Second, once you have $parsed_file_array_of_hashesv, you can loop over that array and find the line you want to remove.
Then remove it using
splice ARRAY, OFFSET, LENGTH
which removes LENGTH elements starting at index OFFSET, that is, indexes OFFSET through OFFSET+LENGTH-1.
Let's assume index 0:
my @extracted_array = @$parsed_file_array_of_hashesv; # dereference the array reference
splice @extracted_array, 0, 1; # remove entry 0
$ref_removed_line_parsed = \@extracted_array; # reference to the array
Third, write the array back to the CSV file:
$current_metric_file = csv(
    in => $ref_removed_line_parsed, # only accepts a reference
    out => "$output_csv_filename",
    sep => ';',
    eol => "\n", # \r, \n, or \r\n or undef
    #headers => \@sorted_column_names, # only accepts a reference
    headers => "auto"
);
Notice that if you use \@sorted_column_names you can control the order of the columns:
my @sorted_column_names;
foreach my $name (sort {lc $a cmp lc $b} keys %{ $parsed_file_array_of_hashesv->[0] }) { # all hashes have the same column names, so we use the first one
    push(@sorted_column_names,$name);
}
That should write the CSV file without your line.
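
To see what splice does to an array of row hashes, here is a small core-Perl sketch. The three rows are made-up sample data in the shape that an array-of-hashes parse produces; nothing here touches a file.

```perl
use strict;
use warnings;

# Made-up sample rows, shaped like an array of hashes from a CSV parse.
my @rows = (
    { Id => 50010, Name => 'Grape' },
    { Id => 50020, Name => 'Strawberry' },
    { Id => 50030, Name => 'Banana' },
);

# splice ARRAY, OFFSET, LENGTH removes LENGTH elements starting at OFFSET
# and returns them; the array itself is modified in place.
my @removed = splice @rows, 0, 1;

print "removed: $removed[0]{Name}\n";
print "left:    ", join(',', map { $_->{Name} } @rows), "\n";
# prints:
# removed: Grape
# left:    Strawberry,Banana
```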

use open ":std", ":encoding(UTF-8)";
use Text::CSV_XS qw( );
# Name of columns to copy to new file.
my @col_names_out = qw( ... );
my $csv = Text::CSV_XS->new({ auto_diag => 2, binary => 1 });
for (...) {
    my $qfn_in = ...;
    my $qfn_out = ...;
    open(my $fh_in, "<", $qfn_in)
        or die("Can't open \"$qfn_in\": $!\n");
    # open the output file for writing (">"), not reading
    open(my $fh_out, ">", $qfn_out)
        or die("Can't create \"$qfn_out\": $!\n");
    $csv->column_names(@{ $csv->getline($fh_in) });
    $csv->say($fh_out, \@col_names_out);
    while (my $row = $csv->getline_hr($fh_in)) {
        $csv->say($fh_out, [ @$row{@col_names_out} ]);
    }
}

Parsing CSV with Text::CSV

I am trying to parse a file where the header row is row 8 and my data is in rows 9 to n. How can I use Text::CSV to do this? I am having trouble; my code is below:
my @cols = @{$csv->getline($io, 8)};
my $row = {};
$csv->bind_columns (\@{$row}{@cols});
while($csv->getline($io, 8)){
    my $ip_addr = $row->{'IP'};
}

use Text::CSV;
my $csv = Text::CSV->new( ) or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $io, "test.csv" or die "test.csv: $!";
my $array_ref = $csv->getline_all($io, 8);
my $record = "";
foreach $record (@$array_ref) {
    print "$record->[0] \n";
}
close $io or die "test.csv: $!";

Are you dead-set on using bind_columns? I think I see what you're trying to do, and it's notionally very creative, but if all you want is a way to reference the column by the header name, how about something like this:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1 } );
my (%header);
open my $io, "<", '/var/tmp/foo.csv' or die $!;
while (my $row = $csv->getline ($io)) {
    next unless $. > 7;
    my @fields = @$row;
    unless (%header) {
        $header{$fields[$_]} = $_ for 0..$#fields;
        next;
    }
    my $ip_addr = $fields[$header{'IP'}];
    print "$. => $ip_addr\n";
}
close $io;
Sample Input:
Test Data,,,
Trash,,,
Test Data,,,
Trash,,,
Beans,Joe,10.224.38.189,XYZ
Beans,Joe,10.224.38.190,XYZ
Beans,Joe,10.224.38.191,XYZ
Last Name,First Name,IP,Computer
Beans,Joe,10.224.38.192,XYZ
Beans,Joe,10.224.38.193,XYZ
Beans,Joe,10.224.38.194,XYZ
Beans,Joe,10.224.38.195,XYZ
Beans,Joe,10.224.38.196,XYZ
Beans,Joe,10.224.38.197,XYZ
Output:
9 => 10.224.38.192
10 => 10.224.38.193
11 => 10.224.38.194
12 => 10.224.38.195
13 => 10.224.38.196
14 => 10.224.38.197
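
A variation on the same idea, sketched in core Perl on rows that are already parsed: skip everything before the header line, then turn each data row into a hash keyed by header name. rows_after_header is a hypothetical helper, and the sample rows are a shortened version of the input above with the header on line 2.

```perl
use strict;
use warnings;

# Hypothetical helper: $rows holds every parsed line of the file and
# $header_line is the 1-based line number of the header row. Returns one
# hash ref per data row, keyed by header name.
sub rows_after_header {
    my ($rows, $header_line) = @_;
    my @names = @{ $rows->[ $header_line - 1 ] };
    my @records;
    for my $row (@{$rows}[ $header_line .. $#{$rows} ]) {
        my %record;
        @record{@names} = @$row;   # hash slice: header name => field
        push @records, \%record;
    }
    return \@records;
}

my @rows = (
    ['Trash', '', ''],
    ['Last Name', 'First Name', 'IP'],
    ['Beans', 'Joe', '10.224.38.192'],
    ['Beans', 'Joe', '10.224.38.193'],
);

print "$_->{IP}\n" for @{ rows_after_header(\@rows, 2) };
# prints:
# 10.224.38.192
# 10.224.38.193
```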

Perl - csv parsing - rearrange csv data when fields are dynamics

Using Perl, I need to parse and rearrange CSV files that have some dynamic fields (devices and their associated values).
Here is the original CSV (the header is shown for description only):
DISKBSIZE,sn_unknown,hostname,timestamp,origin-timestamp,sda,sda1,sda2,sda3,sdb,sdb1,sdb2,sdb3
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:47,T0001,0.0,0.0,0.0,0.0,18.0,0.0,18.0,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:49,T0002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:51,T0003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:53,T0004,0.0,0.0,0.0,0.0,369.8,0.0,369.8,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:55,T0005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I need it to be transformed into:
DISKBSIZE,sn_unknown,hostname,timestamp,origin-timestamp,device,value
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:47,T0001,sda,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:47,T0001,sda1,0.0
... and so on
Here is the sample code that generates the csv file based on original data:
if (((rindex $l,"DISKBUSY,") > -1)) {
#Open destination file
if( ! open(FILE,">>".$dstfile_DISKBUSY) ) {
exit(1);
}
(my @line) = split(",",$l);
my $section = "DISKBUSY";
my $write = $section.",".$SerialNumber.",".$hostnameT.",".
$timestamp.",".$line[1];
my $i = 2;
while ($i <= $#line) {
$write = $write.','.$line[$i];
$i = $i + 1;
}
print (FILE $write."\n");
close( FILE );
}
I need to rearrange it as described so that I can work with the data in a generic way, but the dynamic fields (device names) drive me crazy :-)
Many thanks for any help !

You can use Text::CSV:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
auto_diag => 1,
eol => "\n"
}) or die "Cannot use CSV: " . Text::CSV->error_diag();
open my $fh, '<', 'file.csv' or die $!;
my @columns = @{ $csv->getline($fh) };
my @device_columns = @columns[5..$#columns];
my @header = (@columns[0..4], "device", "value");
$csv->print(\*STDOUT, \@header);
while (my $row = $csv->getline($fh)) {
    foreach my $i (0..$#device_columns) {
        my @output = (@$row[0..4], $device_columns[$i], $row->[5+$i]);
        $csv->print(\*STDOUT, \@output);
    }
}
close $fh;
Output:
DISKBSIZE,sn_unknown,hostname,timestamp,origin-timestamp,device,value
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sda,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sda1,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sda2,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sda3,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sdb,18.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sdb1,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sdb2,18.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sdb3,0.0
(this is only the output for the first row of your input data)
Better solution
The following uses getline_hr to return each row in the input CSV as a hashref, which makes the code a bit cleaner:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Text::CSV;
my $csv = Text::CSV->new({
binary => 1,
auto_diag => 1,
eol => "\n"
}) or die "Cannot use CSV: " . Text::CSV->error_diag();
open my $fh, '<', 'file.csv' or die $!;
$csv->column_names($csv->getline($fh));
my @cols = ( $csv->column_names );
my @devices = splice @cols, 5;
my @header = ( @cols, "device", "value" );
$csv->print(\*STDOUT, \@header);
while (my $hr = $csv->getline_hr($fh)) {
    foreach my $device (@devices) {
        my @output = ( @$hr{@cols}, $device, $hr->{$device} );
        $csv->print(\*STDOUT, \@output);
    }
}
close $fh;
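
The reshaping itself does not depend on Text::CSV. A minimal core-Perl sketch of the wide-to-long step on one already-parsed row; wide_to_long is a hypothetical helper, and the short header used here is a simplified stand-in for the real one:

```perl
use strict;
use warnings;

# Hypothetical helper: the first $fixed columns are repeated on every output
# row; each remaining column becomes its own (name, value) row.
sub wide_to_long {
    my ($header, $row, $fixed) = @_;
    my @prefix = @{$row}[ 0 .. $fixed - 1 ];
    return [
        map { [ @prefix, $header->[$_], $row->[$_] ] } $fixed .. $#{$header}
    ];
}

my @header = qw(section host sda sdb);
my @row    = ('DISKBSIZE', 'host001', '0.0', '18.0');

print join(',', @$_), "\n" for @{ wide_to_long(\@header, \@row, 2) };
# prints:
# DISKBSIZE,host001,sda,0.0
# DISKBSIZE,host001,sdb,18.0
```

Because the device names are read from the header rather than hard-coded, the same helper works however many device columns a file has.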

Use the Text::CSV module.
You can assign header names with $csv->column_names(@column_names) and then use $csv->getline_hr to get each line as a hash reference keyed by your column names. This will make it much easier to parse your file.
You don't have to use Text::CSV to write back your file (although it makes sure your file is written correctly), but you should use it to parse your data.