I am trying to use Parse::CSV to parse a simple CSV file with a header and two columns. The second column may contain commas, but I want to ignore them. Is there any way to limit how many times it splits on commas? Here is what I have so far:
#!/usr/bin/perl
use Parse::CSV;
my $csv = Parse::CSV->new(file => 'file.csv');
while (my $row = $csv->fetch) {
print $row->[0] . "\t" . $row->[1] . "\n";
}
Here is an example of what my data looks like:
1234,text1,text2
5678,text3
90,text4,text5
This would return
1234 text1,text2
5678 text3
90 text4,text5
If you're really wed to Parse::CSV, you can do this using a filter:
use strict;
use warnings;
use 5.010;
use Parse::CSV;
my $parser = Parse::CSV->new(
file => 'input.csv',
filter => sub { return [ shift @$_, join(',', @$_) ] }
);
while ( my $row = $parser->fetch ) {
say join("\t", @$row);
}
die $parser->errstr if $parser->errstr;
Output:
1234 text1,text2
5678 text3
90 text4,text5
Note that performance will be poor because Parse::CSV is splitting the columns for you, but then you immediately join them back together again.
However, since it appears that you're not working with a true CSV (columns containing the delimiter aren't quoted or escaped in any way), why not just use split with a third argument to specify the maximum number of fields?
use strict;
use warnings;
use 5.010;
open my $fh, '<', 'input.csv' or die $!;
while (<$fh>) {
chomp;
my @fields = split(',', $_, 2);
say join("\t", @fields);
}
close $fh;
Related
I am new to Perl and am trying to read a file with columns and create arrays from it.
I have a file with the following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I want to create an array for each unique item in the first column, with its values taken from the second column.
eg:
@A = (15,20,33)
@B = (20,45)
@C = (32,78)
I tried the following code, but it only prints the two columns:
use strict;
use warnings;
my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";
my %seen;
while (<FILE>)
{
chomp;
my $line = $_;
my @elements = split (" ", $line);
my $row_name = join "\t", @elements[0,1];
print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks
Firstly some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use the empty file input operator (<>), which returns data from the files named in @ARGV without you having to open them. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. Far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';
my %data; # I'd give this a better name if I knew what your data was
while (<>) {
chomp;
my ($letter, $number) = split; # splits $_ on whitespace by default
push @{ $data{$letter} }, $number;
}
# Walk the hash to see what we've got
for (sort keys %data) {
say "$_ : @{ $data{$_} }";
}
Change the loop to be something like:
while (my $line = <FILE>)
{
chomp($line);
my #elements = split (" ", $line);
push(@{$seen{$elements[0]}}, $elements[1]);
}
This will create/append a list of each item as it is found, and result in a hash where the keys are the left items, and the values are lists of the right items. You can then process or reassign the values as you wish.
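To make the resulting structure concrete, here is a small self-contained sketch of the same technique; the sub name group_pairs is made up for this example:

```perl
use strict;
use warnings;

# group_pairs: given "KEY VALUE" lines, return a hash ref mapping each
# key to an array ref of all its values, in input order.
sub group_pairs {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        my ($key, $value) = split ' ', $line;
        push @{ $seen{$key} }, $value;
    }
    return \%seen;
}

my $grouped = group_pairs('A 15', 'A 20', 'A 33', 'B 20', 'B 45', 'C 32', 'C 78');
for my $key (sort keys %$grouped) {
    print "$key: @{ $grouped->{$key} }\n";
}
# A: 15 20 33
# B: 20 45
# C: 32 78
```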
I am writing a Perl script to open a .csv file, make some changes, sort it on four fields, then write it back to a new file. I found out that, because this data will then be used to load a MySQL table, I also need to reformat the date fields. Currently, dates appear in the file as 00/00/0000, and for MySQL they need to be formatted as 0000-00-00. Right now I am trying to do this for just one field, although I actually need to do it on three date fields in each line of the .csv file.
This script is running - but it is not reformatting the Date field I'm trying to test this on.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
#my $filename = '/swpkg/shared/batch_processing/mistints/mistints.csv';
my $filename = 'tested.csv';
open my $FH, $filename
or die "Could not read from $filename <$!>, program halting.";
# Read the header line.
chomp(my $line = <$FH>);
my @fields = split(/,/, $line);
#print "Field Names:\n", Dumper(@fields), $/;
print Dumper(@fields), $/;
my @data;
# Read the lines one by one.
while($line = <$FH>) {
# split the fields, concatenate the first three fields,
# and add it to the beginning of each line in the file
chomp($line);
my @fields = split(/,/, $line);
unshift @fields, join '_', @fields[0..2];
push @data, \@fields;
my $in_date = $fields[14];
my $db_date = join '-', reverse split /\D/, $in_date;
}
close $FH;
print "Unsorted:\n", Dumper(@data); #, $/;
@data = sort {
$a->[0] cmp $b->[0] ||
$a->[20] cmp $b->[20] ||
$a->[23] cmp $b->[23] ||
$a->[26] cmp $b->[26]
} @data;
open my $OFH, '>', '/swpkg/shared/batch_processing/mistints/parsedMistints.csv';
#print $OFH Dumper(@data);
print $OFH join(',', @$_), $/ for @data;
close $OFH;
#print "Sorted:\n", Dumper(@data);
#print "Sorted:", Dumper(@data);
exit;
The two lines I added to this script (which are not working) are the my $in_date and my $db_date lines. Now I will also need to reformat two fields (at the end of each line) that are DATETIME, i.e. 10/23/2015 10:47, where I will only need to reformat the date within that field, and I'm not even sure where to begin tackling that one.
And please go easy since I'm a noob with Perl.
EDIT - SORRY, had to re-edit because I didn't notice the first part of my script had not copied.
Rather than using a bunch of string functions, it's better to use the Time::Piece module to parse and reformat date-time values. It has strptime and strftime methods to do this for you. This short program shows the reformatting of both date-time formats that you mention. ymd is a convenience method, and is equivalent to strftime('%Y-%m-%d')
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $in_date = '01/02/2003';
my $db_date = Time::Piece->strptime($in_date, '%m/%d/%Y')->ymd;
say "$in_date -> $db_date";
$in_date = '01/02/2003 04:05';
$db_date = Time::Piece->strptime($in_date, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M');
say "$in_date -> $db_date";
output
01/02/2003 -> 2003-02-01
01/02/2003 04:05 -> 2003-02-01 04:05
Update
If you prefer, you could write a subroutine that takes the original date and its format string, together with the desired format. Like this
use strict;
use warnings 'all';
use feature 'say';
use Time::Piece;
my $in_date = '01/02/2003';
my $db_date = date_from_to($in_date, '%m/%d/%Y', '%Y-%m-%d');
say "$in_date -> $db_date";
$in_date = '01/02/2003 04:05';
$db_date = date_from_to($in_date, '%m/%d/%Y %H:%M', '%Y-%m-%d %H:%M');
say "$in_date -> $db_date";
sub date_from_to {
my ($date, $from, $to) = @_;
Time::Piece->strptime($date, $from)->strftime($to);
}
The output is identical to that of the program above
Update
Regarding your comment, your code should look like this
$_ = join '-', (split /\//)[2,0,1] for @fields[14,20,23];
$_ = Time::Piece->strptime($_, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
push #data, \#fields;
But I would prefer to see some consistency in the way the date fields are handled, like this
$_ = Time::Piece->strptime($_, '%m/%d/%Y')->strftime('%Y-%m-%d') for @fields[14,20,23];
$_ = Time::Piece->strptime($_, '%m/%d/%Y %H:%M')->strftime('%Y-%m-%d %H:%M') for @fields[38,39];
push @data, \@fields;
I have two CSV files. The first is a list file, it contains the ID and names. For example
1127100,Acanthocolla cruciata
1127103,Acanthocyrta haeckeli
1127108,Acanthometra fusca
The second is what I want to exchange and extract the line by the first number if a match is found. The first column of numbers correspond in each file. For example
1127108,1,0.60042
1127103,1,0.819671
1127100,2,0.50421,0.527007
10207,3,0.530422,0.624466
So I want to end up with CSV file like this
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
I tried Perl, but opening two files at once proved messy. So I tried converting one of the CSV files to a string and parsing it that way, but that didn't work either. Then I read about grep and other one-liners, but I am not familiar with them. Would this be possible with grep?
This is the Perl code I tried
use strict;
use warnings;
open my $csv_score, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open my $csv_list, '<', "$ARGV[1]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open my $out, ">$ARGV[0]_final.txt" or die qq{Failed to open for output: $!\n};
my $string = <$csv_score>;
while ( <$csv_list> ) {
my ($find, $replace) = split /,/;
$string =~ s/$find/$replace/g;
if ($string =~ m/^$replace/){
print $out $string;
}
}
close $csv_score;
close $csv_list;
close $out;
The general purpose text processing tool that comes with all UNIX installations is named awk:
$ awk -F, -v OFS=, 'NR==FNR{m[$1]=$2;next} $1=m[$1]' file1 file2
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
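For readers new to awk, here is the same program expanded with comments; the file names list.csv and scores.csv are placeholders for your two input files:

```shell
# Expanded version of the one-liner above. list.csv holds ID,name pairs;
# scores.csv holds the data rows keyed by ID.
awk -F, -v OFS=, '
NR == FNR {        # NR==FNR is true only while reading the first file
    m[$1] = $2     # remember: ID -> name
    next           # do not fall through to the block below
}
$1 = m[$1]         # replace the ID with its name; the assignment also
                   # acts as the pattern, so rows with no match (empty
                   # name) are skipped rather than printed
' list.csv scores.csv
```

Note the assignment-as-pattern trick would also drop a row whose name happened to be the string "0"; if that is a concern, write an explicit pattern and action instead: `$1 in m { $1 = m[$1]; print }`.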
Your code was failing because you only read the first line from the $csv_score file, and you tried to print $string every time it is changed. You also failed to remove the newline from the end of the lines from your $csv_list file. If you fix those things then it looks like this
use strict;
use warnings;
open my $csv_score, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open my $csv_list, '<', "$ARGV[1]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open my $out, ">$ARGV[0]_final.txt" or die qq{Failed to open for output: $!\n};
my $string = do {
local $/;
<$csv_score>;
};
while ( <$csv_list> ) {
chomp;
my ( $find, $replace ) = split /,/;
$string =~ s/$find/$replace/g;
}
print $out $string;
close $csv_score;
close $csv_list;
close $out;
output
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
10207,3,0.530422,0.624466
However, that's not a safe way of doing things, as IDs may be found elsewhere than at the start of a line.
I would build a hash out of the $csv_list file like this, which also makes the program more concise
use strict;
use warnings;
use v5.10.1;
use autodie;
my %ids;
{
open my $fh, '<', $ARGV[1];
while ( <$fh> ) {
chomp;
my ($id, $name) = split /,/;
$ids{$id} = $name;
}
}
open my $in_fh, '<', $ARGV[0];
open my $out_fh, '>', "$ARGV[0]_final.txt";
while ( <$in_fh> ) {
s{^(\d+)}{$ids{$1} // $1}e;
print $out_fh $_;
}
The output is identical to that of the first program above
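The substitution line does two subtle things at once: /e evaluates the replacement as Perl code, and the defined-or operator // falls back to the original ID when the hash has no entry for it. A minimal standalone demonstration:

```perl
use strict;
use warnings;
use v5.10;    # enables say and the defined-or operator //

# /e evaluates the replacement as Perl code; // keeps the original ID
# when %ids has no entry for it.
my %ids = ( 1127108 => 'Acanthometra fusca' );

for my $line ( "1127108,1,0.60042", "10207,3,0.530422" ) {
    ( my $out = $line ) =~ s{^(\d+)}{$ids{$1} // $1}e;
    say $out;
}
# Acanthometra fusca,1,0.60042
# 10207,3,0.530422
```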
The problem with the code as written is that you only do this once:
my $string = <$csv_score>;
This reads one line from $csv_score and you don't ever use the rest.
I would suggest that you need to:
Read the first file into a hash
Iterate the second file, and do a replace on the first column.
Using Text::CSV is generally a good idea for processing CSV, but it doesn't seem to be strictly necessary for your example.
So:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;
my $csv = Text::CSV->new( { binary => 1 } );
my %replace;
while ( my $row = $csv->getline( \*DATA ) ) {
last if $row->[0] =~ m/NEXT/;
$replace{ $row->[0] } = $row->[1];
}
print Dumper \%replace;
my $search = join( "|", map {quotemeta} keys %replace );
$search = qr/$search/;
while ( my $row = $csv->getline( \*DATA ) ) {
$row->[0] =~ s/^($search)$/$replace{$1}/;
$csv->print( \*STDOUT, $row );
print "\n";
}
__DATA__
1127100,Acanthocolla cruciata
1127103,Acanthocyrta haeckeli
1127108,Acanthometra fusca
NEXT
1127108,1,0.60042
1127103,1,0.819671
1127100,2,0.50421,0.527007
10207,3,0.530422,0.624466
Note - this still prints that last line of your source content:
"Acanthometra fusca ",1,"0.60042 "
"Acanthocyrta haeckeli ",1,"0.819671 "
"Acanthocolla cruciata ",2,0.50421,"0.527007 "
(Your data contained whitespace, so Text::CSV wraps it in quotes)
If you want to discard that, then you could test if the replace actually occurred:
if ( $row->[0] =~ s/^($search)$/$replace{$1}/ ) {
$csv->print( \*STDOUT, $row );
print "\n";
}
(And you can of course, keep on using split /,/ if you're sure you won't have any of the whacky things that CSV supports normally).
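If the stray whitespace itself is the nuisance, one option (a sketch that works whether or not you use Text::CSV for the parsing step) is to trim each field after splitting:

```perl
use strict;
use warnings;

# trim: strip leading and trailing whitespace from a string.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+|\s+\z//g;
    return $s;
}

# Simulate fields parsed from a line with stray trailing whitespace.
my @fields = ( 'Acanthometra fusca ', '1', '0.60042 ' );
@fields = map { trim($_) } @fields;
print join( ',', @fields ), "\n";
# Acanthometra fusca,1,0.60042
```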
I would like to provide a very different approach.
Let's say you are far more comfortable with databases than with Perl's data structures. You can use DBD::CSV to turn your CSV files into a kind of relational database. It uses Text::CSV under the hood (hat tip to @Sobrique). You will need to install it from CPAN, though, as it is not bundled in the default DBI distribution.
use strict;
use warnings;
use Data::Printer; # for p
use DBI;
my $dbh = DBI->connect( "dbi:CSV:", undef, undef, { f_ext => '.csv' } );
$dbh->{csv_tables}->{names} = { col_names => [qw/id name/] };
$dbh->{csv_tables}->{numbers} = { col_names => [qw/id int float/] };
my $sth_select = $dbh->prepare(<<'SQL');
SELECT names.name, numbers.int, numbers.float
FROM names
JOIN numbers ON names.id = numbers.id
SQL
# column types will be silently discarded
$dbh->do('CREATE TABLE result ( name CHAR(255), int INTEGER, float INTEGER )');
my $sth_insert =
$dbh->prepare('INSERT INTO result ( name, int, float ) VALUES ( ?, ?, ? ) ');
$sth_select->execute;
while ( my @res = $sth_select->fetchrow_array ) {
p @res;
$sth_insert->execute(@res);
}
What this does is set up column names for the two tables (your CSV files) as those do not have a first row with names. I made the names up based on the data types. It will then create a new table (CSV file) named result and fill it by writing one row at a time.
At the same time it will output data (for debugging purposes) to STDERR through Data::Printer.
[
[0] "Acanthocolla cruciata",
[1] 2,
[2] 0.50421
]
[
[0] "Acanthocyrta haeckeli",
[1] 1,
[2] 0.819671
]
[
[0] "Acanthometra fusca",
[1] 1,
[2] 0.60042
]
The resulting file looks like this:
$ cat scratch/result.csv
name,int,float
"Acanthocolla cruciata",2,0.50421
"Acanthocyrta haeckeli",1,0.819671
"Acanthometra fusca",1,0.60042
I have a test.csv file which has data something like this.
"a","usa","24-Nov-2011","100.98","Extra1","Extra2"
"B","zim","23-Nov-2011","123","Extra22"
"C","can","23-Nov-2011","123"
I want to fetch the maximum number of columns in this file (i.e. 6 in this case) and then store it in a variable, like
Variable=6
Can you provide some suggestions on how to proceed?
Try using Text::CSV
Read each line through, parse through this module, and compare the number of fields to your variable.
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
my $max = 0;
open my $fh, "<:encoding(utf8)", "test.csv" or die "test.csv: $!";
while ( my $row = $csv->getline( $fh ) ) {
my $count = scalar @$row;
$max = $count > $max ? $count : $max;
}
close $fh;
print "$max\n";
One of the main reasons given why people use split on a CSV file rather than Text::CSV is that Text::CSV isn't a standard Perl module, so it might not be available.
Then use Text::ParseWords. This is a standard module and should be readily available:
#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);
use Text::ParseWords qw(quotewords);
my $keep = 0;
for my $line ( <DATA> ) {
chomp $line;
my @columns = quotewords( '\s*,\s*', $keep, $line );
say "<" . join( "> <", @columns ) . ">";
}
__DATA__
"a","usa","24-Nov-2011","100.98","Extra1","Extra2"
"B","zim","23-Nov-2011","123","Extra22"
"C","can","23-Nov-2011","123"
"D","can, can, can","23-Nov-2011","123"
This produces:
<a> <usa> <24-Nov-2011> <100.98> <Extra1> <Extra2>
<B> <zim> <23-Nov-2011> <123> <Extra22>
<C> <can> <23-Nov-2011> <123>
<D> <can, can, can> <23-Nov-2011> <123>
Note that the commas inside the quotes didn't throw off the parsing. Now, there are no more excuses for using split.
I'm reading from a CSV file and populating a Hash based on Key-Value Pairs.
The first column of the record is the key, and the rest of the record is the value. However, for some files I need to make the first two columns the key, with the rest of the record as the value. I have written it as below, branching on the number of key columns, but I wanted to know if there is a better way to do this.
use strict;
use warnings;
open my $fh, '<:encoding(utf8)', 'Sample.csv'
or die "Couldn't open Sample.csv";
my %hash;
my $KeyCols=2;
while (<$fh>) {
chomp;
if ($KeyCols==1) {
next unless /^(.*?),(.*)$/;
$hash{$1} = $2;
}
elsif ($KeyCols==2) {
next unless /^(.*?),(.*?),(.*)$/;
$hash{$1.$2} = $3;
}
}
Here is one way to allow for any number of key columns (not just 1 or 2), but it uses split instead of a regex:
use warnings;
use strict;
my %hash;
my $KeyCols = 2;
while (<DATA>) {
chomp;
my @cols = split /,/, $_, $KeyCols+1;
next unless @cols > $KeyCols;
my $v = pop @cols;
my $k = join '', @cols;
$hash{$k} = $v;
}
__DATA__
a,b,c,d,e,f
q,w,e,r,t,y
This is a self-contained code example.
A big assumption is that your CSV file does not contain commas in the data itself. You should be using a CSV parser such as Text::CSV anyway.
It is usually better to define variables in the first lines of the code; otherwise you have to jump all over the code to find them.
You can define the regex based on your $KeyCols, and the processing code will be the same as before.
use strict;
use warnings;
use feature 'say';
my $KeyCols = 2;
my $fname = 'Sample.csv';
my %hash;
my $re;
if( $KeyCols == 2 ) {
$re = qr/^(.*?,.*?),(.*)$/;
} else {
$re = qr/^(.*?),(.*)$/;
}
open my $fh, '<:encoding(utf8)', $fname
or die "Couldn't open $fname";
while (<$fh>) {
chomp;
next unless /$re/;
$hash{$1} = $2;
}
close $fh;
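Following the same idea, the regex can be built for any number of key columns instead of being hard-coded per case. This is a sketch; note that, unlike the $1.$2 version above, the key here keeps its internal commas:

```perl
use strict;
use warnings;

my $KeyCols = 3;    # hypothetical: three key columns
my $n = $KeyCols - 1;

# The first $n commas belong to the key; the next comma splits key
# from value. [^,]* keeps each key column from crossing a comma.
my $re = qr/^((?:[^,]*,){$n}[^,]*),(.*)$/;

my %hash;
for my $line ( 'a,b,c,d,e,f', 'q,w,e,r,t,y' ) {
    next unless $line =~ $re;
    $hash{$1} = $2;
}
print "$_ => $hash{$_}\n" for sort keys %hash;
# a,b,c => d,e,f
# q,w,e => r,t,y
```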