Remove/Extract rows based on Unique/duplicate Id from a CSV file - perl

Depending on how you look at it, I need to either remove rows whose Id is unique or extract rows whose Id has duplicates (keeping all duplicates).
I don't have enough knowledge of Perl to accomplish this. I've found similar topics but didn't have much success; these are the examples I'm using: example 1, example 2 and example 3. In a previous problem someone showed me a solution using the List::MoreUtils module so that I could merge values with a common Id. That's not the case now: this one is about removing rows whose Id is unique. I could probably do this with List::MoreUtils, but I want to do it without. This is my dummy data (copied from another question's example data, since the data itself doesn't matter); here you can see what I'm after. Order is not important.
Before:
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla
101;Fruits;50010;Grape;500;Red;1
101;Fruits;50020;Strawberry;500;Red;1
201;Vegetables;60010;Carrot;500;White;1
101;Fruits;50060;Apple;1000;Red;1
101;Fruits;50030;Banana;1000;Green;1
101;Fruits;50060;Apple;500;Green;1
101;Fruits;50020;Strawberry;1000;Red;1
201;Vegetables;60010;Carrot;100;Purple;1
101;Fruits;50020;Strawberry;200;Red;1
After:
Cat_id;Cat_name;Id;Name;Amount;Colour;Bla
101;Fruits;50020;Strawberry;500;Red;1
201;Vegetables;60010;Carrot;500;White;1
101;Fruits;50060;Apple;1000;Red;1
101;Fruits;50060;Apple;500;Green;1
101;Fruits;50020;Strawberry;1000;Red;1
201;Vegetables;60010;Carrot;100;Purple;1
101;Fruits;50020;Strawberry;200;Red;1
You can see that the rows for Grape and Banana (Ids 50010 and 50030) have been removed, because only one entry exists for each.
This is my script. I'm struggling with the part where I select the non-unique Ids from the hash and output their rows (taking the Text::CSV_XS module into account). Can someone show me how to do this?
#!/usr/bin/perl -w
use strict;
use warnings;
use Text::CSV_XS;

my $inputfile  = shift || die "Give input and output names!\n";
my $outputfile = shift || die "Give output name!\n";

open (my $infile,  '<:encoding(iso-8859-1)', $inputfile)  or die "Sourcefile in use / not found: $!\n";
open (my $outfile, '>:encoding(UTF-8)',      $outputfile) or die "Outputfile in use: $!\n";

my $csv_in  = Text::CSV_XS->new({binary => 1, sep_char => ";", auto_diag => 1, always_quote => 1, eol => $/});
my $csv_out = Text::CSV_XS->new({binary => 1, sep_char => "|", auto_diag => 1, always_quote => 1, eol => $/});

my $header = $csv_in->getline($infile);
$csv_out->print($outfile, $header);

my %data;
while (my $elements = $csv_in->getline($infile)){
    my @columns = @{ $elements };
    my $id = $columns[2];
    push @{ $data{$id} }, \@columns;
}

for my $id ( sort keys %data ){ # Sort not important
    if @{ $data{$id} } > 1     # Here I have no idea anymore..
        $csv_out->print($outfile, \@columns); #
}

Rather than loading a hash with the entire dataset, I think I'd go ahead and read the file twice, loading a hash with just your ID values. This will definitely take longer, but as your file grows there may be disadvantages to having all of that data in memory.
That said, I did not use Text::CSV_XS, but this is a notional idea of what I had in mind.
my %count;
open (my $infile,  '<:encoding(iso-8859-1)', $inputfile)  or die;
open (my $outfile, '>:encoding(UTF-8)',      $outputfile) or die;

while (<$infile>) {
    next if $. == 1;
    my ($id) = (split /;/, $_, 4)[2];
    $count{$id}++;
}

seek $infile, 0, 0;
$. = 0;    # seek() does not reset $., and the header test below needs it

while (<$infile>) {
    my @fields = split /;/;
    print $outfile join '|', @fields if $count{$fields[2]} > 1 or $. == 1;
}

close $infile;
close $outfile;
The $. == 1 at the end is so you don't lose your header row.
-- EDIT --
#!/usr/bin/perl -w
use strict;
use warnings;
use Text::CSV_XS;

my $inputfile  = shift || die "Give input and output names!\n";
my $outputfile = shift || die "Give output name!\n";

open (my $infile,  '<:encoding(iso-8859-1)', $inputfile)  or die;
open (my $outfile, '>:encoding(UTF-8)',      $outputfile) or die;

my $csv_in  = Text::CSV_XS->new({binary => 1, sep_char => ";",
                                 auto_diag => 1, always_quote => 1, eol => $/});
my $csv_out = Text::CSV_XS->new({binary => 1, sep_char => "|",
                                 auto_diag => 1, always_quote => 1, eol => $/});

my ($count, %count) = (1);
while (my $elements = $csv_in->getline($infile)){
    $count{$$elements[2]}++;
}

seek $infile, 0, 0;

while (my $elements = $csv_in->getline($infile)){
    $csv_out->print($outfile, $elements)
        if $count{$$elements[2]} > 1 or $count++ == 1;
}

close $infile;
close $outfile;
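For completeness, the hash-of-arrays approach from the question can also be finished in a single pass, without re-reading the file. A minimal sketch with in-memory rows (no Text::CSV_XS; the Id is assumed to be column 2, as in the sample data):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few rows from the sample data, already split on ';'.
my @rows = (
    [qw(101 Fruits 50010 Grape 500 Red 1)],
    [qw(101 Fruits 50020 Strawberry 500 Red 1)],
    [qw(101 Fruits 50020 Strawberry 1000 Red 1)],
);

# Group the rows by Id (column 2), exactly as in the question.
my %data;
push @{ $data{ $_->[2] } }, $_ for @rows;

# Keep only Ids that occur more than once; unique Ids are dropped.
my @kept;
for my $id (sort keys %data) {
    push @kept, @{ $data{$id} } if @{ $data{$id} } > 1;
}

print join(';', @$_), "\n" for @kept;
```

With a real file you would fill %data inside the getline loop as in the question, then run the final loop in place of the part marked "no idea anymore", printing each stored row with $csv_out->print.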

Related

Sorting CSV file column value based on column headers

Hi, I am a newbie at Perl scripting and I need help implementing logic to sort a CSV file's columns based on the column headers.
Example:
S.NO,NAME,S2,S5,S3,S4,S1
1,aaaa,88,99,77,55,66
2,bbbb,66,77,88,99,55
3,cccc,55,44,77,88,66
4,dddd,77,55,66,88,99
Now I want to sort this file as below:
s.no,s2,s4,s5,s1,s0,name => that's how I want it. I define an order of headers (like s.no,name,s1,s2,s3,s4,s5), and the respective whole-column values should also move along with the header exchange. How do I do this in Perl?
The required output is like the following:
S.NO,NAME,S1,S2,S3,S4,S5
1,aaaaaaa,66,88,77,55,99
2,bbbbbbb,55,66,88,77,99
3,ccccccc,66,55,77,88,44
4,ddddddd,99,77,66,88,55
Or whatever order I want in the column headers, like below:
S.NO,NAME,S5,S4,S3,S2,S1 -> as per my requirement I need to reorder my column headers, and their respective column values as well.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $file = 'a1.csv';
my $size = 3;
my @files;

my $csv = Text::CSV->new ({ binary => 1, auto_diag => 1, sep_char => ';' });

open my $in, "<:encoding(utf8)", $file or die "$file: $!";
while (my $row = $csv->getline($in)) {
    if (not @files) {
        my $file_counter = int @$row / $size;
        $file_counter++ if @$row % $size;
        for my $i (1 .. $file_counter) {
            my $outfile = "output$i.csv";
            open my $out, ">:encoding(utf8)", $outfile or die "$outfile: $!";
            push @files, $out;
        }
    }
    my @fields = @$row;
    foreach my $i (0 .. $#files) {
        my $from = $i*$size;
        my $to   = $i*$size+$size-1;
        $to = $to <= $#fields ? $to : $#fields;
        my @data = @fields[$from .. $to];
        $csv->print($files[$i], \@data);
        print {$files[$i]} "\n";
    }
}
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
use Text::CSV qw();

my @headers = qw(s.no name s1 s2 s3 s4 s5);

my $csv_in  = Text::CSV->new({binary => 1, auto_diag => 1});
my $csv_out = Text::CSV->new({binary => 1, auto_diag => 1});

open my $in,  '<:encoding(UTF-8)', 'a1.csv';
open my $out, '>:encoding(UTF-8)', 'output1.csv';

$csv_in->header($in);
$csv_out->say($out, [@headers]);
while (my $row = $csv_in->getline_hr($in)) {
    $csv_out->say($out, [$row->@{@headers}]);
}
The handy Text::AutoCSV module lets you rearrange the column order as a one-liner:
$ perl -MText::AutoCSV -e 'Text::AutoCSV->new(in_file=>"in.csv",out_file=>"out.csv",out_fields=>["SNO","NAME","S1","S2","S3","S5"])->write()'
$ cat out.csv
s.no,name,s1,s2,s3,s5
1,aaaa,66,55,77,99
2,bbbb,55,99,88,77
3,cccc,66,88,77,44
4,dddd,99,88,66,55
I'm not sure what your actual desired order of fields is because you have two and both of them include columns that aren't in the sample input file (It has two s2 columns; is one of them supposed to be s4?), but you should get the idea. Column names have to be all caps with special characters like . removed, but the actual names are used for the output.

How to take only latest uniq record based on any column

I am writing a script in Perl but have got stuck on one part. Below is a sample of my CSV files.
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"GJ","919904303790","20150806125002","prepaid","prepaid","2G","3G"
"MH","919921990805","20150806125003","prepaid","prepaid","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MUM","919904303790","20150806125005","prepaid","prepaid","2G","3G"
"MUM","918652624178","20150806125005","","prepaid","","2G","NEW"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
Now I need to take unique records on the basis of the 2nd column (i.e. the mobile number), but considering only the latest value of the 3rd column (i.e. the timestamp).
E.g. for mobile number "918120197922":
"MP","918120197922","20150806125001","prepaid","prepaid","3G","2G"
"MP","918120197922","20150806125004","prepaid","prepaid","2G"
"MP","918120197922","20150806125005","prepaid","prepaid","2G","3G"
it should select the 3rd record, as it has the latest value of the timestamp (20150806125005). Please help.
Additional Info:
Sorry for the inconsistency in the data; I have rectified it now.
Yes, the data is in order, which means the latest timestamp will appear in the latest rows.
One more thing: my file is more than 1 GB in size, so is there a way to do this efficiently? Will awk work faster than Perl in this case? Please help.
Use Text::CSV to process CSV files.
Hash the lines by the 2nd column, only keep the most recent one in the hash.
#!/usr/bin/perl
use warnings;
use strict;
use Text::CSV;

my $csv = 'Text::CSV'->new() or die 'Text::CSV'->error_diag;

my %hash;
open my $CSV, '<', '1.csv' or die $!;
while (my $row = $csv->getline($CSV)) {
    my ($number, $timestamp) = @$row[1, 2];
    # Store the row if the timestamp is more recent than the stored one.
    $hash{$number} = $row if $timestamp gt ($hash{$number}[2] || q());
}

$csv->eol("\n");
$csv->always_quote(1);
open my $OUT, '>', 'uniq.csv' or die $!;
for my $row (values %hash) {
    $csv->print($OUT, $row);
}
close $OUT or die $!;
If you know your data is ordered by timestamp you can exploit this and read them backwards and transform your task into a problem to output the first occurrence of each phone number.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;

use constant PHONENUM_FIELD => 1;

my $filename = shift;
die "Usage: $0 <filename>\n" unless defined $filename;

open my $in, '-|', 'tac', $filename;
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );
my %seen;
while ( my $row = $csv->getline($in) ) {
    $csv->print( *STDOUT, $row ) unless $seen{ $row->[PHONENUM_FIELD] }++;
}
If you would like to have output in same order as input you can write into tac as well:
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;
use Text::CSV_XS;

use constant PHONENUM_FIELD => 1;

my $filename = shift;
die "Usage: $0 <filename>\n" unless defined $filename;

open my $in,  '-|', 'tac', $filename;
open my $out, '|-', 'tac';
my $csv = Text::CSV_XS->new( { binary => 1, auto_diag => 1, eol => $/ } );
my %seen;
while ( my $row = $csv->getline($in) ) {
    $csv->print( $out, $row ) unless $seen{ $row->[PHONENUM_FIELD] }++;
}
1 GB should not be a problem on any decent hardware. On my old notebook it took 2m3.393s to process 29,360,128 rows and 1.8 GB. That is more than 230k rows/s, but YMMV. Add always_quote => 1 to the $csv constructor parameters if you want all values quoted in the output.
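For data that fits in memory, the same reverse-and-keep-first-occurrence idea can be sketched in pure Perl without tac (a sketch: rows are assumed already parsed, with the phone number in column 1):

```perl
use strict;
use warnings;

my @rows = (
    [ 'MP', '918120197922', '20150806125001' ],
    [ 'GJ', '919904303790', '20150806125002' ],
    [ 'MP', '918120197922', '20150806125005' ],
);

my %seen;
# Walk the rows last-to-first, keep the first occurrence of each number
# (i.e. its latest record), then restore the original order.
my @latest = reverse grep { !$seen{ $_->[1] }++ } reverse @rows;

print join(',', @$_), "\n" for @latest;
```

This preserves input order like the second tac variant above, at the cost of holding every row in memory, so it only suits files well under the 1 GB mentioned.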

Parsing CSV with Text::CSV

I am trying to parse a file where the header row is at row 8, and rows 9-n contain my data. How can I use Text::CSV to do this? I am having trouble; my code is below:
my @cols = @{$csv->getline($io, 8)};
my $row = {};
$csv->bind_columns (\@{$row}{@cols});
while($csv->getline($io, 8)){
    my $ip_addr = $row->{'IP'};
}
use Text::CSV;

my $csv = Text::CSV->new() or die "Cannot use CSV: " . Text::CSV->error_diag();
open my $io, '<', 'test.csv' or die "test.csv: $!";
my $array_ref = $csv->getline_all($io, 8);
foreach my $record (@$array_ref) {
    print "$record->[0] \n";
}
close $io or die "test.csv: $!";
Are you dead-set on using bind_columns? I think I see what you're trying to do, and it's notionally very creative, but if all you want is a way to reference the column by the header name, how about something like this:
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new ( { binary => 1 } );
my %header;

open my $io, "<", '/var/tmp/foo.csv' or die $!;
while (my $row = $csv->getline ($io)) {
    next unless $. > 7;
    my @fields = @$row;
    unless (%header) {
        $header{$fields[$_]} = $_ for 0..$#fields;
        next;
    }
    my $ip_addr = $fields[$header{'IP'}];
    print "$. => $ip_addr\n";
}
close $io;
Sample Input:
Test Data,,,
Trash,,,
Test Data,,,
Trash,,,
Beans,Joe,10.224.38.189,XYZ
Beans,Joe,10.224.38.190,XYZ
Beans,Joe,10.224.38.191,XYZ
Last Name,First Name,IP,Computer
Beans,Joe,10.224.38.192,XYZ
Beans,Joe,10.224.38.193,XYZ
Beans,Joe,10.224.38.194,XYZ
Beans,Joe,10.224.38.195,XYZ
Beans,Joe,10.224.38.196,XYZ
Beans,Joe,10.224.38.197,XYZ
Output:
9 => 10.224.38.192
10 => 10.224.38.193
11 => 10.224.38.194
12 => 10.224.38.195
13 => 10.224.38.196
14 => 10.224.38.197

perl hash mapping/retrieval issues with split and select columns

Perl find and replace multiple (huge) strings in one shot
P.S. This question is related to the answer to the question above.
When I try to replace this code:
Snippet-1
open my $map_fh, '<', 'map.csv' or die $!;
my %replace = map { chomp; split /,/ } <$map_fh>;
close $map_fh;
with this code:
Snippet-2
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
even though the key exists (as shown in the Dumper output), the exists check doesn't return the value for the key.
For the same input file it works perfectly with just split alone (Snippet-1), whereas it returns nothing when I select specific columns after the split (Snippet-2).
Is there some integer/string datatype mix-up happening here?
Input Mapping File
483329,Buffalo
483330,Buffalo
483337,Buffalo
Script Output
$VAR1 = {
'483329' => 'Buffalo',
'46546' => 'Chicago_CW',
'745679' => 'W. Washington',
};
1 search is ENB
2 search is 483329 **expected Buffalo here**
3 search is 483330
4 search is 483337
Perl Code
open my $map_fh, '<', $MarketMapFile or die $!;
if ($MapSelection =~ /eNodeBID/i) {
my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
use Data::Dumper;
print Dumper(\%replace);
}
close $map_fh;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, eol => $/,quote_space => 0 });
my $tmpCSVFile = $CSVFile."tmp";
open my $in_fh, '<', $CSVFile or die $!;
open my $out_fh, '>', $tmpCSVFile or die $!;
my $cnt=1;
while (my $row = $csv->getline($in_fh)) {
my $search = $row->[5];
$search =~ s/[^[:print:]]+//g;
if ($MapSelection =~ /eNodeBID/i) {
$search =~ s/(...)-(...)-//g;
$search =~ s/\(M\)//g;
}
my $match = (exists $replace{$search}) ? $replace{$search} : undef;
print "\n$cnt search is $search ";
if (defined($match)) {
$match =~ s/[^[:print:]]+//g;
print "and match is $match";
}
push #$row, $match;
#print " match is $match";
$csv->print($out_fh, $row);
$cnt++;
}
# untie %replace;
close $in_fh;
close $out_fh;
You have a problem of scope. Your code:
if ($MapSelection =~ /eNodeBID/i) {
    my %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
    use Data::Dumper;
    print Dumper(\%replace);
}
declares %replace within the if block. Move it outside so that it can also be seen by later code:
my %replace;
if ($MapSelection =~ /eNodeBID/i) {
    %replace = map { chomp; (split /,/)[0,1] } <$map_fh>;
    use Data::Dumper;
    print Dumper(\%replace);
}
Putting use strict and use warnings at the top of your code helps you find these kinds of issues.
Also, you can just use my $match = $replace{$search} since it's equivalent to your ?: operation.
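A tiny illustration of why the ternary is redundant (the hash contents here are made up for the example): reading a plain hash never autovivifies, so a missing key simply yields undef, which is exactly what the ?: version produces:

```perl
use strict;
use warnings;

my %replace = ( '483329' => 'Buffalo' );

my $hit  = $replace{'483329'};   # 'Buffalo'
my $miss = $replace{'999999'};   # undef, and the key is still absent

# exists $replace{$k} ? $replace{$k} : undef  gives the same two results.
```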
Always include use strict; and use warnings; at the top of EVERY Perl script. If you had done that and maintained the good practice of declaring your variables, you would've gotten the error:
Global symbol "%replace" requires explicit package name at
That would've let you know there was a scoping issue in your code. One way to avoid it is to use a ternary in your initialization of %replace:
my %replace = ($MapSelection =~ /eNodeBID/i)
            ? map { chomp; (split /,/)[0,1] } <$map_fh>
            : ();

Perl - csv parsing - rearrange csv data when fields are dynamic

Using Perl, I need to parse and rearrange CSV files that have some dynamic fields (devices and associated values).
Here is the original CSV (the header is here for description only):
DISKBSIZE,sn_unknown,hostname,timestamp,origin-timestamp,sda,sda1,sda2,sda3,sdb,sdb1,sdb2,sdb3
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:47,T0001,0.0,0.0,0.0,0.0,18.0,0.0,18.0,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:49,T0002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:51,T0003,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:53,T0004,0.0,0.0,0.0,0.0,369.8,0.0,369.8,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:55,T0005,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
I need it to be transformed into:
DISKBSIZE,sn_unknown,hostname,timestamp,origin-timestamp,device,value
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:47,T0001,sda,0.0
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:47,T0001,sda1,0.0
... and so on
Here is the sample code that generates the csv file based on original data:
if (((rindex $l,"DISKBUSY,") > -1)) {
    # Open destination file
    if( ! open(FILE,">>".$dstfile_DISKBUSY) ) {
        exit(1);
    }
    (my @line) = split(",",$l);
    my $section = "DISKBUSY";
    my $write = $section.",".$SerialNumber.",".$hostnameT.",".
                $timestamp.",".$line[1];
    my $i = 2;
    while ($i <= $#line) {
        $write = $write.','.$line[$i];
        $i = $i + 1;
    }
    print (FILE $write."\n");
    close( FILE );
}
I need to rearrange it as described so I can work with the data in a generic way, but the dynamic fields (device names) are driving me crazy :-)
Many thanks for any help!
You can use Text::CSV:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({
    binary    => 1,
    auto_diag => 1,
    eol       => "\n"
}) or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, '<', 'file.csv' or die $!;

my @columns        = @{ $csv->getline($fh) };
my @device_columns = @columns[5..$#columns];

my @header = (@columns[0..4], "device", "value");
$csv->print(\*STDOUT, \@header);

while (my $row = $csv->getline($fh)) {
    foreach my $i (0..$#device_columns) {
        my @output = (@$row[0..4], $device_columns[$i], $row->[5+$i]);
        $csv->print(\*STDOUT, \@output);
    }
}
close $fh;
Output:
DISKBSIZE,sn_unknown,hostname,timestamp,origin-timestamp,device,value
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sda,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sda1,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sda2,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sda3,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sdb,18.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sdb1,0.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sdb2,18.0
DISKBSIZE,sn_unknown,host001,"19-FEB-2014 20:55:47",T0001,sdb3,0.0
(this is only the output for the first row of your input data)
Better solution
The following uses getline_hr to return each row in the input CSV as a hashref, which makes the code a bit cleaner:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use Text::CSV;

my $csv = Text::CSV->new({
    binary    => 1,
    auto_diag => 1,
    eol       => "\n"
}) or die "Cannot use CSV: " . Text::CSV->error_diag();

open my $fh, '<', 'file.csv' or die $!;

$csv->column_names($csv->getline($fh));
my @cols    = ( $csv->column_names );
my @devices = splice @cols, 5;

my @header = ( @cols, "device", "value" );
$csv->print(\*STDOUT, \@header);

while (my $hr = $csv->getline_hr($fh)) {
    foreach my $device (@devices) {
        my @output = ( @$hr{@cols}, $device, $hr->{$device} );
        $csv->print(\*STDOUT, \@output);
    }
}
close $fh;
Use the Text::CSV module.
You can assign header names with $csv->column_names(@column_names) and then use $csv->getline_hr to get each line as a hash reference keyed by your column names. This will make it much easier to parse your file.
You don't have to use Text::CSV to write your file back out (although it makes sure the file is written correctly), but you should use it to parse your data.
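A short sketch of that column_names/getline_hr flow, reading from an in-memory handle for brevity (assumes Text::CSV is installed; the columns are a subset of the question's data):

```perl
use strict;
use warnings;
use Text::CSV;

my $data = <<'CSV';
DISKBSIZE,sn_unknown,hostname,timestamp,origin-timestamp,sda,sda1
DISKBSIZE,sn_unknown,host001,19-FEB-2014 20:55:47,T0001,0.0,18.0
CSV

open my $fh, '<', \$data or die $!;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

# The first line becomes the hash keys for every subsequent row.
$csv->column_names( $csv->getline($fh) );

my @out;
while ( my $hr = $csv->getline_hr($fh) ) {
    push @out, "$hr->{hostname} $hr->{sda1}";  # fields by name, not index
}
print "$_\n" for @out;
```

Referring to fields by name this way is what makes the device columns manageable even when their number and names vary from file to file.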