Using Parse::CSV to limit splits - perl

I am trying to use Parse::CSV to parse a simple CSV file with a header and 2 columns. The second column may contain commas, but I want to ignore them. Is there any way to limit how many times it splits on commas? Here is what I have so far:
#!/usr/bin/perl
use Parse::CSV;
my $csv = Parse::CSV->new(file => 'file.csv');
while (my $row = $csv->fetch) {
print $row->[0] . "\t" . $row->[1] . "\n";
}
Here is an example of what my data looks like:
1234,text1,text2
5678,text3
90,text4,text5
This would return
1234 text1,text2
5678 text3
90 text4,text5

If you're really wed to Parse::CSV, you can do this using a filter:
use strict;
use warnings;
use 5.010;
use Parse::CSV;
my $parser = Parse::CSV->new(
file => 'input.csv',
filter => sub { return [ shift @$_, join(',', @$_) ] }
);
while ( my $row = $parser->fetch ) {
say join("\t", @$row);
}
die $parser->errstr if $parser->errstr;
Output:
1234 text1,text2
5678 text3
90 text4,text5
Note that performance will be poor because Parse::CSV is splitting the columns for you, but then you immediately join them back together again.
However, since it appears that you're not working with a true CSV (columns containing the delimiter aren't quoted or escaped in any way), why not just use split with a third argument to specify the maximum number of fields?
use strict;
use warnings;
use 5.010;
open my $fh, '<', 'input.csv' or die $!;
while (<$fh>) {
chomp;
my @fields = split(',', $_, 2);
say join("\t", @fields);
}
close $fh;

Related

Perl: Read columns and convert to array

I am new to Perl and am trying to read a file with columns and create arrays from it.
I have a file with the following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I want to create an array for each unique item in the first column, with its values taken from the second column.
eg:
@A = (15,20,33)
@B = (20,45)
@C = (32,78)
I tried the following code, which only prints the two columns:
use strict;
use warnings;
my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";
my %seen;
while (<FILE>)
{
chomp;
my $line = $_;
my @elements = split (" ", $line);
my $row_name = join "\t", @elements[0,1];
print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks
Firstly some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use the empty file input operator (<>), which returns data from the files named in @ARGV without you having to open them. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. Far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';
my %data; # I'd give this a better name if I knew what your data was
while (<>) {
chomp;
my ($letter, $number) = split; # splits $_ on whitespace by default
push @{ $data{$letter} }, $number;
}
# Walk the hash to see what we've got
for (sort keys %data) {
say "$_ : @{ $data{$_} }";
}
Change the loop to be something like:
while (my $line = <FILE>)
{
chomp($line);
my @elements = split (" ", $line);
push(@{$seen{$elements[0]}}, $elements[1]);
}
This will create/append a list of each item as it is found, and result in a hash where the keys are the left items, and the values are lists of the right items. You can then process or reassign the values as you wish.
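Putting that together with the question's sample data, a complete runnable sketch might look like the following (the input is embedded in a string here purely so the example is self-contained; in the real script it would come from the file):

```perl
use strict;
use warnings;

# Sample input from the question, held in a string so the example
# runs on its own; replace this with your filehandle in practice.
my $input = <<'END';
A 15
A 20
A 33
B 20
B 45
C 32
C 78
END

open my $fh, '<', \$input or die $!;

my %seen;    # key = letter, value = reference to an array of numbers
while (my $line = <$fh>) {
    chomp $line;
    my @elements = split ' ', $line;
    push @{ $seen{ $elements[0] } }, $elements[1];
}
close $fh;

# Prints e.g. "A: 15 20 33"
print "$_: @{ $seen{$_} }\n" for sort keys %seen;
```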

Reformat Dates in Perl (for later use in MySQL)

I am writing a Perl script to open a .csv file, make some changes, sort it on four fields, then write it back to a new file. Found out that because this data will then be used to load a MySQL table that I also need to reformat the Date variables. Currently, Dates are in the file as 00/00/0000 and for MySQL, need to have them formatted as 0000-00-00. Right now, I simply tried to do it for one field, although I actually need to do it on three Date fields for each line from the .csv file.
This script is running - but it is not reformatting the Date field I'm trying to test this on.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
#my $filename = '/swpkg/shared/batch_processing/mistints/mistints.csv';
my $filename = 'tested.csv';
open my $FH, $filename
or die "Could not read from $filename <$!>, program halting.";
# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
#print "Field Names:\n", Dumper(@fields), $/;
print Dumper(@fields), $/;
my @data;
# Read the lines one by one.
while($line = <$FH>) {
# split the fields, concatenate the first three fields,
# and add it to the beginning of each line in the file
chomp($line);
my @fields = split(/,/, $line);
unshift @fields, join '_', @fields[0..2];
push @data, \@fields;
my $in_date = $fields[14];
my $db_date = join '-', reverse split /\D/, $in_date;
}
close $FH;
print "Unsorted:\n", Dumper(@data); #, $/;
@data = sort {
$a->[0] cmp $b->[0] ||
$a->[20] cmp $b->[20] ||
$a->[23] cmp $b->[23] ||
$a->[26] cmp $b->[26]
} @data;
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
#print $OFH Dumper(@data);
print $OFH join(',', @$_), $/ for @data;
close $OFH;
#print "Sorted:\n", Dumper(@data);
#print "Sorted:", Dumper(@data);
exit;
The two lines I added to this script (which are not working) are the my $in_date and my $db_date lines. Now I will also need to reformat two fields (at the end of each line) that are DATETIME, i.e. 10/23/2015 10:47, where I will only need to reformat the date within that field, and I'm not even sure where to begin tackling that one.
And please go easy since I'm a noob with Perl.
EDIT - SORRY, had to re-edit because I didn't notice the first part of my script had not copied.
Rather than using a bunch of string functions, it's better to use the Time::Piece module to parse and reformat date-time values. It has strptime and strftime methods to do this for you. This short program shows the reformatting of both date-time formats that you mention. ymd is a convenience method, and is equivalent to strftime('%Y-%m-%d')
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $in_date = '01/02/2003';
my $db_date = Time::Piece->strptime($in_date, '%m/%d/%Y')->ymd;
say "$in_date -> $db_date";
$in_date = '01/02/2003 04:05';
$db_date = Time::Piece->strptime($in_date, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M');
say "$in_date -> $db_date";
output
01/02/2003 -> 2003-01-02
01/02/2003 04:05 -> 2003-01-02 04:05
Update
If you prefer, you could write a subroutine that takes the original date and its format string, together with the desired format. Like this
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $in_date = '01/02/2003';
my $db_date = date_from_to($in_date, '%m/%d/%Y', '%Y-%m-%d');
say "$in_date -> $db_date";
$in_date = '01/02/2003 04:05';
$db_date = date_from_to($in_date, '%m/%d/%Y %H:%M', '%Y-%m-%d %H:%M');
say "$in_date -> $db_date";
sub date_from_to {
my ($date, $from, $to) = @_;
Time::Piece->strptime($date, $from)->strftime($to);
}
The output is identical to that of the program above
Update
Regarding your comment, your code should look like this
$_ = join '-', (split /\//)[2,0,1] for @fields[14, 20, 23];
$_ = Time::Piece->strptime($_,'%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
push @data, \@fields;
But I would prefer to see some consistency in the way the date fields are handled, like this
$_ = Time::Piece->strptime($_, '%m/%d/%Y')->strftime('%Y-%m-%d') for @fields[14,20,23];
$_ = Time::Piece->strptime($_, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
push @data, \@fields;

Two csv files: Change one csv by the other and pull out that line

I have two CSV files. The first is a list file, it contains the ID and names. For example
1127100,Acanthocolla cruciata
1127103,Acanthocyrta haeckeli
1127108,Acanthometra fusca
The second is what I want to exchange and extract the line by the first number if a match is found. The first column of numbers correspond in each file. For example
1127108,1,0.60042
1127103,1,0.819671
1127100,2,0.50421,0.527007
10207,3,0.530422,0.624466
So I want to end up with CSV file like this
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
I tried Perl, but opening two files at once proved messy. So I tried converting one of the CSV files to a string and parsing it that way, but it didn't work. Then I read about grep and other one-liners, but I am not familiar with them. Would this be possible with grep?
This is the Perl code I tried
use strict;
use warnings;
open my $csv_score, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open my $csv_list, '<', "$ARGV[1]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open my $out, ">$ARGV[0]_final.txt" or die qq{Failed to open for output: $!\n};
my $string = <$csv_score>;
while ( <$csv_list> ) {
my ($find, $replace) = split /,/;
$string =~ s/$find/$replace/g;
if ($string =~ m/^$replace/){
print $out $string;
}
}
close $csv_score;
close $csv_list;
close $out;
The general purpose text processing tool that comes with all UNIX installations is named awk:
$ awk -F, -v OFS=, 'NR==FNR{m[$1]=$2;next} $1=m[$1]' file1 file2
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
Your code was failing because you only read the first line from the $csv_score file, and you tried to print $string every time it is changed. You also failed to remove the newline from the end of the lines from your $csv_list file. If you fix those things then it looks like this
use strict;
use warnings;
open my $csv_score, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open my $csv_list, '<', "$ARGV[1]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open my $out, ">$ARGV[0]_final.txt" or die qq{Failed to open for output: $!\n};
my $string = do {
local $/;
<$csv_score>;
};
while ( <$csv_list> ) {
chomp;
my ( $find, $replace ) = split /,/;
$string =~ s/$find/$replace/g;
}
print $out $string;
close $csv_score;
close $csv_list;
close $out;
output
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
10207,3,0.530422,0.624466
However that's not a safe way of doing things, as IDs may be found elsewhere than at the start of a line
I would build a hash out of the $csv_list file like this, which also makes the program more concise
use strict;
use warnings;
use v5.10.1;
use autodie;
my %ids;
{
open my $fh, '<', $ARGV[1];
while ( <$fh> ) {
chomp;
my ($id, $name) = split /,/;
$ids{$id} = $name;
}
}
open my $in_fh, '<', $ARGV[0];
open my $out_fh, '>', "$ARGV[0]_final.txt";
while ( <$in_fh> ) {
s{^(\d+)}{$ids{$1} // $1}e;
print $out_fh $_;
}
The output is identical to that of the first program above
The problem with the code as written is that you only do this once:
my $string = <$csv_score>;
This reads one line from $csv_score and you don't ever use the rest.
I would suggest that you need to:
Read the first file into a hash
Iterate the second file, and do a replace on the first column.
Using Text::CSV is generally a good idea for processing CSV, but it doesn't seem to be necessary for your example.
So:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;
my $csv = Text::CSV->new( { binary => 1 } );
my %replace;
while ( my $row = $csv->getline( \*DATA ) ) {
last if $row->[0] =~ m/NEXT/;
$replace{ $row->[0] } = $row->[1];
}
print Dumper \%replace;
my $search = join( "|", map {quotemeta} keys %replace );
$search = qr/($search)/;
while ( my $row = $csv->getline( \*DATA ) ) {
$row->[0] =~ s/^($search)$/$replace{$1}/;
$csv->print( \*STDOUT, $row );
print "\n";
}
__DATA__
1127100,Acanthocolla cruciata
1127103,Acanthocyrta haeckeli
1127108,Acanthometra fusca
NEXT
1127108,1,0.60042
1127103,1,0.819671
1127100,2,0.50421,0.527007
10207,3,0.530422,0.624466
Note - this still prints that last line of your source content:
"Acanthometra fusca ",1,"0.60042 "
"Acanthocyrta haeckeli ",1,"0.819671 "
"Acanthocolla cruciata ",2,0.50421,"0.527007 "
(Your data contained whitespace, so Text::CSV wraps it in quotes)
If you want to discard that, then you could test if the replace actually occurred:
if ( $row->[0] =~ s/^($search)$/$replace{$1}/ ) {
$csv->print( \*STDOUT, $row );
print "\n";
}
(And you can, of course, keep on using split /,/ if you're sure you won't have any of the wacky things that CSV normally supports.)
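For completeness, here is roughly what that split-based approach could look like. This is a sketch that assumes, as in the sample data, that no field ever contains a comma; both inputs are embedded as strings for illustration, standing in for the two files:

```perl
use strict;
use warnings;

# The ID => name list (first file) and the score rows (second file),
# embedded here so the example is self-contained.
my $list = <<'END';
1127100,Acanthocolla cruciata
1127103,Acanthocyrta haeckeli
1127108,Acanthometra fusca
END

my $scores = <<'END';
1127108,1,0.60042
1127103,1,0.819671
1127100,2,0.50421,0.527007
10207,3,0.530422,0.624466
END

# Build the ID => name lookup.
my %name_for;
for my $line (split /\n/, $list) {
    my ($id, $name) = split /,/, $line, 2;
    $name_for{$id} = $name;
}

# Translate IDs, skipping rows whose ID has no name (like 10207).
my @out;
for my $line (split /\n/, $scores) {
    my ($id, $rest) = split /,/, $line, 2;
    push @out, "$name_for{$id},$rest" if exists $name_for{$id};
}
print "$_\n" for @out;
```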
I would like to provide a very different approach.
Let's say you are way more comfortable with databases than with Perl's data structures. You can use DBD::CSV to turn your CSV files into kind of relational databases. It uses Text::CSV under the hood (hat tip to @Sobrique). You will need to install it from CPAN as it's not bundled in the default DBI distribution though.
use strict;
use warnings;
use Data::Printer; # for p
use DBI;
my $dbh = DBI->connect( "dbi:CSV:", undef, undef, { f_ext => '.csv' } );
$dbh->{csv_tables}->{names} = { col_names => [qw/id name/] };
$dbh->{csv_tables}->{numbers} = { col_names => [qw/id int float/] };
my $sth_select = $dbh->prepare(<<'SQL');
SELECT names.name, numbers.int, numbers.float
FROM names
JOIN numbers ON names.id = numbers.id
SQL
# column types will be silently discarded
$dbh->do('CREATE TABLE result ( name CHAR(255), int INTEGER, float INTEGER )');
my $sth_insert =
$dbh->prepare('INSERT INTO result ( name, int, float ) VALUES ( ?, ?, ? ) ');
$sth_select->execute;
while (my @res = $sth_select->fetchrow_array ) {
p @res;
$sth_insert->execute(@res);
}
What this does is set up column names for the two tables (your CSV files) as those do not have a first row with names. I made the names up based on the data types. It will then create a new table (CSV file) named result and fill it by writing one row at a time.
At the same time it will output data (for debugging purposes) to STDERR through Data::Printer.
[
[0] "Acanthocolla cruciata",
[1] 2,
[2] 0.50421
]
[
[0] "Acanthocyrta haeckeli",
[1] 1,
[2] 0.819671
]
[
[0] "Acanthometra fusca",
[1] 1,
[2] 0.60042
]
The resulting file looks like this:
$ cat scratch/result.csv
name,int,float
"Acanthocolla cruciata",2,0.50421
"Acanthocyrta haeckeli",1,0.819671
"Acanthometra fusca",1,0.60042

How to get the maximum number of columns present in a file in perl

I have a test.csv file which has data something like this.
"a","usa","24-Nov-2011","100.98","Extra1","Extra2"
"B","zim","23-Nov-2011","123","Extra22"
"C","can","23-Nov-2011","123"
I want to fetch the maximum number of columns in this file (i.e. 6 in this case) and then store it in a variable.
Like
Variable=6
Can you provide me some suggestions on how to proceed.
Try using Text::CSV
Read each line, parse it with the module, and compare the number of fields to your running maximum.
#!/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
my $max = 0;
open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
my $count = scalar @$row;
$max = $count > $max ? $count : $max;
}
print "$max\n";
One of the main reasons given why people use split on a CSV file rather than Text::CSV is that Text::CSV isn't a standard Perl module, so it might not be available.
Then use Text::ParseWords. This is a standard module and should be readily available:
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);
use Text::ParseWords qw(quotewords);
my $keep = 0;
for my $line ( <DATA> ) {
chomp $line;
my @columns = quotewords('\s*,\s*', $keep, $line);
say "<" . join( "> <", #columns ) . ">";
}
__DATA__
"a","usa","24-Nov-2011","100.98","Extra1","Extra2"
"B","zim","23-Nov-2011","123","Extra22"
"C","can","23-Nov-2011","123"
"D","can, can, can","23-Nov-2011","123"
This produces:
<a> <usa> <24-Nov-2011> <100.98> <Extra1> <Extra2>
<B> <zim> <23-Nov-2011> <123> <Extra22>
<C> <can> <23-Nov-2011> <123>
<D> <can, can, can> <23-Nov-2011> <123>
Note that the commas inside the quotes didn't throw off the parsing. Now, there are no more excuses for using split.
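Applying this to the original question of finding the maximum column count, a sketch using quotewords could look like the following (the sample rows are embedded in a string for illustration):

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Sample rows from the question, embedded so the example runs on its own.
my $input = <<'END';
"a","usa","24-Nov-2011","100.98","Extra1","Extra2"
"B","zim","23-Nov-2011","123","Extra22"
"C","can","23-Nov-2011","123"
END

my $max = 0;
for my $line (split /\n/, $input) {
    # $keep = 0 strips the surrounding quotes from each field
    my @columns = quotewords('\s*,\s*', 0, $line);
    $max = @columns if @columns > $max;
}
print "Variable=$max\n";
```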

Dynamically Change the Key Value based on Delimiter in Perl

I'm reading from a CSV file and populating a Hash based on Key-Value Pairs.
The first column of the record is the key, and the rest of the record is the value. However, for some files I need to make the first two columns the key, with the rest of the record as the value. I have written it as below, checking the number of key columns with an if statement, but I wanted to know if there is a better way to do this.
use strict;
use warnings;
open my $fh, '<:encoding(utf8)', 'Sample.csv'
or die "Couldn't open Sample.csv";
my %hash;
my $KeyCols=2;
while (<$fh>) {
chomp;
if ($KeyCols==1) {
next unless /^(.*?),(.*)$/;
$hash{$1} = $2;
}
elsif ($KeyCols==2) {
next unless /^(.*?),(.*?),(.*)$/;
$hash{$1.$2} = $3;
}
}
Here is one way to allow for any number of key columns (not just 1 or 2), but it uses split instead of a regex:
use warnings;
use strict;
my %hash;
my $KeyCols = 2;
while (<DATA>) {
chomp;
my @cols = split /,/, $_, $KeyCols+1;
next unless @cols > $KeyCols;
my $v = pop @cols;
my $k = join '', @cols;
$hash{$k} = $v;
}
__DATA__
a,b,c,d,e,f
q,w,e,r,t,y
This is a self-contained code example.
A big assumption is that your CSV file does not contain commas in the data itself. You should be using a CSV parser such as Text::CSV anyway.
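For example, a Text::CSV version of the same $KeyCols idea might look like this. It is only a sketch: it assumes Text::CSV is installed, and it joins the key fields with a comma, which is my choice rather than anything from the question (the original code concatenated them with no separator):

```perl
use strict;
use warnings;
use Text::CSV;

# Sample rows embedded in a string so the example is self-contained.
my $data = <<'END';
a,b,c,d,e,f
q,w,e,r,t,y
END

open my $fh, '<', \$data or die $!;

my $csv = Text::CSV->new({ binary => 1 });
my $KeyCols = 2;
my %hash;

while (my $row = $csv->getline($fh)) {
    next unless @$row > $KeyCols;
    # First $KeyCols fields form the key; the rest is the value.
    my @key = splice @$row, 0, $KeyCols;
    $hash{ join ',', @key } = join ',', @$row;
}
close $fh;

print "$_ => $hash{$_}\n" for sort keys %hash;
```

Because Text::CSV does the field splitting, quoted fields containing commas would also be handled correctly, which the regex versions cannot do.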
Perhaps it is better to define variables in the first few lines of the code -- otherwise you have to jump all over the code to follow it.
You can define regex based on your $KeyCols and processing code will be same as before.
use strict;
use warnings;
use feature 'say';
my $KeyCols = 2;
my $fname = 'Sample.csv';
my %hash;
my $re;
if( $KeyCols == 2 ) {
$re = qr/^(.*?,.*?),(.*)$/
} else {
$re = qr/^(.*?),(.*)$/;
}
open my $fh, '<:encoding(utf8)', $fname
or die "Couldn't open $fname";
while (<$fh>) {
chomp;
next unless /$re/;
$hash{$1} = $2;
}
close $fh;