Parsing Tab Delimited File into an array - perl

I am attempting to read a CSV into an array in a way that I can access each column in a row. However when I run the following code with the goal of printing a specific column from each row, it only outputs empty lines.
#set command line arguments
my ($infi, $outdir, $idcol) = @ARGV;
#load file of data to get annotations for
open FILE, "<", $infi or die "Can't read file '$infi' [$!]\n";
my @data;
foreach my $row (<FILE>){
chomp $row;
my @cells = split /\t/, $row;
push @data, @cells;
}
#fetch genes
foreach (@data){
print "@_[$idcol]\n";
# print $geneadaptor->fetch_by_dbID($_[$idcol]);
}
With a test input of
a b c
1 2 3
d e f
4 5 6
I think the issue here isn't so much loading the file, but in treating the resulting array. How should I be approaching this problem?

First of all, you need to push @data, \@cells; otherwise all the fields are flattened into a single list instead of being kept as one array reference per row.
Then you need to use the loop variable in the second for loop.
foreach (@data){
print $_->[$idcol], "\n";
}
@_ is a completely different variable from $_ and is unpopulated here.
You should also consider using
while (my $row = <FILE>) { ... }
to read your file. It reads only a single line at a time whereas for will read the entire file into a list of lines before iterating over it.
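Putting those points together, a minimal sketch might look like the following. It reads from an in-memory string rather than a real file so it runs on its own; the sample data, the read_rows helper name and the choice of column 1 are assumptions for illustration, not part of the original script.

```perl
use strict;
use warnings;

# Read tab-delimited lines from a filehandle into an array of array refs.
sub read_rows {
    my ($fh) = @_;
    my @data;
    while (my $row = <$fh>) {
        chomp $row;
        push @data, [ split /\t/, $row ];   # one array ref per row
    }
    return @data;
}

# Sample input held in a string so the sketch runs without a file on disk.
my $sample = "a\tb\tc\n1\t2\t3\nd\te\tf\n4\t5\t6\n";
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my @data = read_rows($fh);
close $fh;

my $idcol = 1;                        # zero-based column index (assumed)
print $_->[$idcol], "\n" for @data;   # prints b, 2, e, 6 on separate lines
```

With a real file you would open $infi instead of \$sample and take $idcol from @ARGV as in the question.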

I recommend not parsing the file by hand and using the Text::CSV module instead.
use Text::CSV;
use Carp;
#set command line arguments
my ($infi, $outdir, $idcol) = @ARGV;
my $csv = Text::CSV->new({
sep_char => "\t"
});
open(my $fh, "<:encoding(UTF-8)", $infi) || croak "can't open $infi: $!";
# Uncomment if you need to skip header line
# <$fh>;
while (<$fh>) {
if ($csv->parse($_)) {
my @columns = $csv->fields();
print "$columns[0]\t$columns[1]\t$columns[2]\n";
} else {
my $err = $csv->error_input;
print "Failed to parse line: $err";
}
}
close $fh;

Related

Match column in 2 CSV files and display output in third file

I have 2 CSV files, file1.csv and file2.csv. For each row of file1, I have to take column 3 and look through column 3 of file2 for a match; when a match occurs, the complete matching rows (columns 1, 2 and 3) from file2.csv only should be written to a third CSV file. My code so far only fetches column 3 from both CSV files. How can I match column 3 of both files and display the matched rows? Please help.
File1:
Comp_Name,Date,Files
Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile26;
Component1,2013/04/25,/Com/src2;
File2:
Comp_name,Date,Files
Component1,2013/04/07,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component2,2013/04/23,/Com/src/folder1/folder2/newfile.txt;
Component3,2013/04/27,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/25,/Com/src2;
Output format:
Comp_Name,Date,Files
Component1,2013/04/07,/Com/src/folder1/folder2/newfile.txt;
Component2,2013/04/23,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component3,2013/04/27,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component1,2013/04/25,/Com/src2;
Code:
use strict;
use warnings;
my $file1 = "C:\\pick\\file1.csv";
my $file2 = "C:\\pick\\file2.csv";
my $file3 = "C:\\pick\\file3.csv";
my $type;
my $type1;
my @fields;
my @fields2;
open(my $fh, '<:encoding(UTF-8)', $file1) or die "Could not open file '$file1' $!"; #Throw error if file doesn't open
while (my $row = <$fh>) # reading each row till end of file
{
chomp $row;
@fields = split ",",$row;
$type = $fields[2];
print"\n$type";
}
open(my $fh2, '<:encoding(UTF-8)', $file2) or die "Could not open file '$file2' $!"; #Throw error if file doesn't open
while (my $row2 = <$fh2>) # reading each row till end of file
{
chomp $row2;
@fields2 = split ",",$row2;
$type1 = $fields2[2];
print"\n$type1";
foreach($type)
{
if ($type eq $type1)
{
print $row2;
}
}
}
There is no need to overcomplicate this. I would personally use a module such as Text::CSV_XS or, as mentioned in another answer, Tie::Array::CSV.
If you're having trouble using a module, the following is an alternative. You can modify it to suit your needs; I ran it on the data you supplied and got the results you want.
use strict;
use warnings;
open my $fh1, '<', 'file1.csv' or die "failed open: $!";
open my $fh2, '<', 'file2.csv' or die "failed open: $!";
open my $out, '>', 'file3.csv' or die "failed open: $!";
my %hash1 = map { $_ => 1 } <$fh1>;
my %hash2 = map { $_ => 1 } <$fh2>;
close $fh1;
close $fh2;
my @result =
map { join ',', $hash1{$_->[2]} ? () : $_->[0], $_->[1], $_->[2] }
sort { $a->[1] <=> $b->[1] || $a->[2] cmp $b->[2] || $a->[0] cmp $b->[0] }
map { s/\s*$//; [split /,/] } keys %hash2;
print $out "$_\n" for @result;
close $out;
__OUTPUT__
Comp_name,Date,Files
Component1,2013/04/07,/Com/src/folder1/folder2/newfile.txt;
Component2,2013/04/23,/Com/src/folder1/folder2/newfile.txt;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile24;
Component3,2013/04/27,/Com/src/folder1/folder2/testfile24;
Component1,2013/04/24,/Com/src/folder1/folder2/testfile25;
Component1,2013/04/25,/Com/src2;
This is a job for a hash (for example my %hash1).
Instead of repeatedly reopening the files, you can read their contents into hashes:
@fields = split ",",$row;
$type = $fields[2];
$hash1{$type} = $row;
I see you have duplicates too, so a plain assignment would overwrite the hash entry each time a key repeats.
Instead, you can store an array of values in each hash entry:
$hash1{$type} = [] unless $hash1{$type};
push @{$hash1{$type}}, $row;
Your next problem is how to traverse the arrays inside hashes
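One way to traverse such a hash of arrays is sketched below. It is only an illustration under a few assumptions: the rows are hard-coded samples rather than read from the files, the header lines are left out, and it indexes file2 (not file1) so that the file2 rows can be pulled out directly. The names %by_file, %seen and @matches are made up for this sketch.

```perl
use strict;
use warnings;

# Hypothetical sample rows standing in for the two files (headers omitted).
my @file1_rows = (
    'Component1,2013/04/01,/Com/src/folder1/folder2/newfile.txt;',
    'Component1,2013/04/25,/Com/src2;',
);
my @file2_rows = (
    'Component1,2013/04/07,/Com/src/folder1/folder2/newfile.txt;',
    'Component2,2013/04/23,/Com/src/folder1/folder2/newfile.txt;',
    'Component3,2013/04/27,/Com/src/folder1/folder2/testfile24;',
);

# Index file2 by column 3; each key holds an array ref of matching rows.
my %by_file;
for my $row (@file2_rows) {
    my @fields = split /,/, $row;
    push @{ $by_file{ $fields[2] } }, $row;   # autovivifies the array ref
}

# For each column-3 value in file1, collect every file2 row that shares it.
my @matches;
my %seen;
for my $row (@file1_rows) {
    my $type = (split /,/, $row)[2];
    next if $seen{$type}++;                   # skip duplicate keys in file1
    next unless exists $by_file{$type};
    push @matches, @{ $by_file{$type} };      # dereference to get the rows
}

print "$_\n" for @matches;
```

The push onto @{ $by_file{...} } creates the inner array on first use, so the explicit "= [] unless" initialisation is optional.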
Here is an example using my Tie::Array::CSV module. It uses some clever Perl tricks to represent each CSV file as a Perl array of arrayrefs. I use it to make an index of the first file, then to loop over the second file and finally to output to the third.
#!/usr/bin/env perl
use strict;
use warnings;
use Tie::Array::CSV;
tie my @file1, 'Tie::Array::CSV', 'file1' or die 'Cannot tie file1';
tie my @file2, 'Tie::Array::CSV', 'file2' or die 'Cannot tie file2';
tie my @output, 'Tie::Array::CSV', 'output' or die 'Cannot tie output';
# setup a match table from file1
my %match = map { ( $_->[-1] => 1 ) } @file1[1..$#file1];
#header
push @output, $file2[0];
# iterate over file2
for my $row ( @file2[1..$#file2] ) {
next unless $match{$row->[-1]}; # check for match
push @output, $row; # print to output if match
}
The output I get is different from yours, but I cannot figure out why your output does not include testfile25 and src2.

Parsing a CSV file and Hashing

I am trying to parse a CSV file to read in all the zip codes it contains. I am trying to create a hash where each key is a zip code and the value is the number of times it appears in the file. Then I want to print out the contents as Zip Code - Number. Here is the Perl script I have so far.
use strict;
use warnings;
my %hash = qw (
zipcode count
);
my $file = $ARGV[0] or die "Need CSV file on command line \n";
open(my $data, '<', $file) or die "Could not open '$file $!\n";
while (my $line = <$data>) {
chomp $line;
my @fields = split "," , $line;
if (exists($hash{$fields[2]})) {
$hash{$fields[1]}++;
}else {
$hash{$fields[1]} = 1;
}
}
my $key;
my $value;
while (($key, $value) = each(%hash)) {
print "$key - $value\n";
}
exit;
You don't say which column your zip code is in, but you are using the third field to check for an existing hash element, and then the second field to increment it.
There is no need to check whether a hash element already exists: Perl will happily create a non-existent hash element and increment it to 1 the first time you access it.
There is also no need to explicitly open any files passed as command line parameters: Perl will open them and read them if you use the <> operator without a file handle.
This reworking of your own program may work. It assumes the zip code is in the second column of the CSV. If it is anywhere else just change ++$hash{$fields[1]} appropriately.
use strict;
use warnings;
@ARGV or die "Need CSV file on command line \n";
my %counts;
while (my $line = <>) {
chomp $line;
my @fields = split /,/, $line;
++$counts{$fields[1]};
}
while (my ($key, $value) = each %counts) {
print "$key - $value\n";
}
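To see the "increment a non-existent element" behaviour in isolation, here is a small self-contained sketch. The sample lines and the count_column helper are made up for illustration, and the keys are sorted on output because each returns them in no particular order.

```perl
use strict;
use warnings;

# Count how often each value appears in the given zero-based column.
sub count_column {
    my ($col, @lines) = @_;
    my %counts;
    for my $line (@lines) {
        chomp $line;
        my @fields = split /,/, $line;
        next unless defined $fields[$col];   # skip short or blank lines
        ++$counts{ $fields[$col] };          # element springs into existence at 1
    }
    return %counts;
}

# Hypothetical sample data; the zip code is assumed to be in the second column.
my %counts = count_column(1, "a,12345,x\n", "b,90210,y\n", "c,12345,z\n");
print "$_ - $counts{$_}\n" for sort keys %counts;
```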
Sorry if this is off-topic, but if you're on a system with the standard Unix text processing tools, you could use this command to count the number of occurrences of each value in field #2, and not need to write any code.
cut -d, -f2 filename.csv | sort | uniq -c
which will generate something like this output, where the count is listed first, and the zipcode second:
12 12345
2 56789
34 78912
1 90210

Why does my Perl script say "Can't call method parse on an undefined value"?

I am new to Perl and still trying to figure out how to code in this language.
I am currently trying to split a long single string of csv into multiple lines.
Data example
a,b,c<br />x,y,x<br />
which I have so far managed to split up, adding quotes so it can go into a CSV file again later on:
"a,b,c""x,y,z"
By having the quotes it just signifies which sets of CSV are together as such.
The problem I am having is that when I try to create a CSV file, passing in the data as a string, I am getting an error
"Can't call method "parse" on an undefined variable.
When I print out the string which I am passing in, it is defined and holds data. I am hoping that this is something simple which I am doing wrong through lack of experience.
The CSV code which I am using is:
use warnings;
use Text::CSV;
use Data::Dumper;
use constant debug => 0;
use Text::CSV;
print "Running CSV editor......\n";
#my $csv = Text::CSV->new({ sep_char => ',' });
my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";
my $fileextension = substr($file, -4);
#If the file is a CSV file then read in the file.
if ($fileextension =~ m/csv/i)
{
print "Reading and formating: $ARGV[0] \n";
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
my @fields;
my $testline;
my $line;
while ($line = <$data>)
{
#Clears the white space at the end of the line.
chomp $line;
#Splits the line up and removes the <br />.
$testline = join "\" \" ", split qr{<br\s?/>}, $line;
#my $newStr = join $/, @lines;
#print $newStr;
my $q1 = "\"";
$testline = join "", $q1,$testline,$q1;
print "\n printing testline: \n $testline \n";
}
$input_string = $testline;
print "\n Testing input string line:\n $input_string";
if ($csv->parse ($input_string))
{
my @field = $csv->fields;
foreach my $col (0 .. $#field) {
my $quo = $csv->is_binary ($col) ? $csv->{quote_char} : "";
printf "%2d: %s%s%s\n", $col, $quo, $field[$col], $quo;#
}
}
else
{
print STDERR "parse () failed on argument: ",
$csv->error_input, "\n";
$csv->error_diag ();
}
#print $_,$/ for @lines;
print "\n Finished reading and formating: $ARGV[0] \n";
}else
{
print "Error: File is not a CSV file\n"
}
You did not create a Text::CSV object, but you try to use it.
"Can't call method "parse" on an undefined variable
This means that your $csv is not there, thus it does not have a method called parse. Simply create a Text::CSV object first, at the top of your code below all the use lines.
my $csv = Text::CSV->new;
Please take a look at the CPAN documentation of Text::CSV.
Also, did I mention you should use strict?

File manipulation in Perl

I have a simple .csv file that I want to extract data from and write to a new file.
I want to write a script that reads in a file, reads each line, then splits and restructures the columns in a different order; if a line in the .csv contains 'xxx', it should not be written to the output file.
I have already managed to read in a file and create a secondary file, but I am new to Perl and still trying to work out the commands. The following is a test script I wrote to get to grips with Perl, and I was wondering if I could alter it to do what I need:
open (FILE, "c1.csv") || die "couldn't open the file!";
open (F1, ">c2.csv") || die "couldn't open the file!";
#print "start\n";
sub trim($);
sub trim($)
{
my $string = shift;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
return $string;
}
$a = 0;
$b = 0;
while ($line=<FILE>)
{
chop($line);
if ($line =~ /xxx/)
{
$addr = $line;
$post = substr($line, length($line)-18,8);
}
$a = $a + 1;
}
print $b;
print " end\n";
Any help is much appreciated.
To manipulate CSV files it is better to use one of the available modules at CPAN. I like Text::CSV:
use Text::CSV;
my $csv = Text::CSV->new ({ binary => 1, empty_is_undef => 1 }) or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, "<", 'c1.csv' or die "ERROR: $!";
$csv->column_names('field1', 'field2');
while ( my $l = $csv->getline_hr($fh)) {
next if ($l->{'field1'} =~ /xxx/);
printf "Field1: %s Field2: %s\n", $l->{'field1'}, $l->{'field2'}
}
close $fh;
If you need to do this only once and won't need the program later, you can do it with a one-liner:
perl -F, -lane 'next if /xxx/; @n=map { s/(^\s*|\s*$)//g;$_ } @F; print join(",", (map{$n[$_]} qw(2 0 1)));'
Breakdown:
perl -F, -lane
     ^^^   ^ <- split lines at ',' and store fields into array @F
next if /xxx/; #skip lines that contain xxx
@n=map { s/(^\s*|\s*$)//g;$_ } @F;
#trim spaces from the beginning and end of each field
#and store the result into new array @n
print join(",", (map{$n[$_]} qw(2 0 1)));
#recombine array @n into new order - here 2 0 1
#join them with comma
#print
Of course, for repeated use or in a bigger project you should use a CPAN module. The above one-liner also has many caveats.
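If you would rather keep it as a small script, the same steps could be sketched roughly like this. It uses a plain split on commas, so it shares the one-liner's limitation with quoted fields that contain commas; the reorder_line name and the sample lines are assumptions for illustration.

```perl
use strict;
use warnings;

# Trim each field, reorder the columns, and return the rebuilt line.
# Returns nothing for lines that contain 'xxx'.
sub reorder_line {
    my ($line, @order) = @_;
    return if $line =~ /xxx/;
    chomp $line;
    my @fields = map { s/^\s+|\s+$//g; $_ } split /,/, $line, -1;
    return join ',', map { $fields[$_] // '' } @order;
}

for my $line ("a , b, c\n", "skip,xxx,me\n", "1,2,3\n") {
    my $out = reorder_line($line, 2, 0, 1);
    print "$out\n" if defined $out;    # prints c,a,b then 3,1,2
}
```

In a real script you would loop over the lines of c1.csv and print each result to the c2.csv filehandle instead of STDOUT.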

help merging perl code routines together for file processing

I need some perl help in putting these (2) processes/code to work together. I was able to get them working individually to test, but I need help bringing them together especially with using the loop constructs. I'm not sure if I should go with foreach..anyways the code is below.
Also, any best practices would be great too as I'm learning this language. Thanks for your help.
Here's the process flow I am looking for:
read a directory
look for a particular file
use the file name to strip out some key information to create a newly processed file
process the input file
create the newly processed file for each input file read (if i read in 10, I create 10 new files)
Part 1:
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
next if ($file =~ /^\.+$/);
#Get filename attributes
if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
print "$1\n";
print "$2\n";
print "$3\n";
}
print "$file\n";
}
Part 2:
use strict;
use Digest::MD5 qw(md5_hex);
#Create new file
open (NEWFILE, ">/backups/processed/foo$1.name.$2-foo_p$3.out") || die "cannot create file";
my $data = '';
my $line1 = <>;
chomp $line1;
my @heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ( "^A", "^E", "^D");
while (<>)
{
my $digest = md5_hex($data);
chomp;
my (@values) = split /,/;
my $extra = "__mykey__$sep1$digest$sep2" ;
$extra .= "$heading[$_]$sep1$values[$_]$sep2" for (0..scalar(@values));
$data .= "$extra$eorec";
print NEWFILE "$data";
}
#print $data;
close (NEWFILE);
You are using an old style of Perl programming. I recommend using functions and CPAN modules (http://search.cpan.org). Perl pseudocode:
use Modern::Perl;
# use...
sub get_input_files {
# return an array of files (@)
}
sub extract_file_info {
# takes the file name and returns an array of values (filename attrs)
}
sub process_file {
# reads the input file, takes the previous attribs and builds the output file
}
my @ifiles = get_input_files;
foreach my $ifile (@ifiles) {
my @attrs = extract_file_info($ifile);
process_file($ifile, @attrs);
}
Hope it helps
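As an illustration of how one of those stubs might be filled in, here is a possible extract_file_info built on the regex from the question. It is a sketch, not the only way to do it: the sample file names are invented, and the dot before csv is escaped here, which the original pattern does not do.

```perl
use strict;
use warnings;

# Pull the three variable parts out of a file name, or return an empty list
# if the name does not match the expected pattern.
sub extract_file_info {
    my ($file) = @_;
    return $file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+\.csv$/;
}

# Hypothetical file names for illustration.
for my $file ('foo123.name.abc-foo_p42.20240101.csv', 'notes.txt') {
    my @attrs = extract_file_info($file) or next;   # skip non-matching names
    print "$file -> @attrs\n";
}
```

Returning the captures from the sub avoids relying on $1, $2 and $3 still holding the right values later in the program.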
I've bashed your two code fragments together (making the second a sub that the first calls for each matching file) and, if I understood your description of the objective correctly, this should do what you want. Comments on style and syntax are inline:
#!/usr/bin/env perl
# - Never forget these!
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
# Parens on postfix "if" are optional; I prefer to omit them
next if $file =~ /^\.+$/;
if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
process_file($file, $1, $2, $3);
}
print "$file\n";
}
sub process_file {
my ($orig_name, $foo_x, $name_x, $p_x) = @_;
my $new_name = "/backups/processed/foo$foo_x.name.$name_x-foo_p$p_x.out";
# - From your description of the task, it sounds like we actually want to
# read from the found file, not from <>, so opening it here to read
# - Better to use lexical ("my") filehandle and three-arg form of open
# - "or" has lower operator precedence than "||", so less chance of
# things being grouped in the wrong order (though either works here)
# - Including $! in the error will tell why the file open failed
# - readdir returns bare file names, so prepend $target_dir to get a usable path
open my $in_fh, '<', "$target_dir$orig_name" or die "cannot read $orig_name: $!";
open(my $out_fh, '>', $new_name) or die "cannot create $new_name: $!";
my $data = '';
my $line1 = <$in_fh>;
chomp $line1;
my @heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ("^A", "^E", "^D");
while (<$in_fh>) {
chomp;
my $digest = md5_hex($data);
my (@values) = split /,/;
my $extra = "__mykey__$sep1$digest$sep2";
$extra .= "$heading[$_]$sep1$values[$_]$sep2"
for (0 .. scalar(@values));
# - Useless use of double quotes removed on next two lines
$data .= $extra . $eorec;
#print $out_fh $data;
}
# - Moved print to output file to here (where it will print the complete
# output all at once) rather than within the loop (where it will print
# all previous lines each time a new line is read in) to prevent
# duplicate output records. This could also be achieved by printing
# $extra inside the loop. Printing $data at the end will be slightly
# faster, but requires more memory; printing $extra within the loop and
# getting rid of $data entirely would require less memory, so that may
# be the better option if you find yourself needing to read huge input
# files.
print $out_fh $data;
# - $in_fh and $out_fh will be closed automatically when it goes out of
# scope at the end of the block/sub, so there's no real point to
# explicitly closing it unless you're going to check whether the close
# succeeded or failed (which can happen in odd cases usually involving
# full or failing disks when writing; I'm not aware of any way that
# closing a file open for reading can fail, so that's just being left
# implicit)
close $out_fh or die "Failed to close file: $!";
}
Disclaimer: perl -c reports that this code is syntactically valid, but it is otherwise untested.