Issues parsing a CSV file in perl using Text::CSV - perl

I'm trying to use Text::CSV to parse this CSV file. Here is how I am doing it:
open my $fh, '<', 'test.csv' or die "can't open csv";
my $csv = Text::CSV_XS->new({ sep_char => "\t", binary => 1, eol => "\n" });
$csv->column_names($csv->getline($fh));
while (my $row = $csv->getline_hr($fh)) {
    # use $row
}
Because the file has 169,252 rows (not counting the header line), I expect the loop to run that many times. However, it only runs 8 times and gives me 8 rows. I'm not sure what's happening, because the CSV seems like a normal file with \n as the line separator and \t as the field separator. If I loop through the file like this:
while (my $line = <$fh>) {
    my $fields = $csv->parse($line);
}
Then the loop goes through all rows.

Text::CSV_XS is silently failing with an error. If you put the following after your while loop:
my ($cde, $str, $pos) = $csv->error_diag ();
print "$cde, $str, $pos\n";
you can see whether there were errors parsing the file; here you get the output:
2034, EIF - Loose unescaped quote, 336
Which means the column:
GT New Coupe 5.0L CD Wheels: 18" x 8" Magnetic Painted/Machined 6 Speakers
contains unescaped quote characters (the " characters appear inside an unquoted field).
The Text::CSV perldoc states:
allow_loose_quotes
By default, parsing fields that have quote_char characters inside an unquoted field, like
1,foo "bar" baz,42
would result in a parse error. Though it is still bad practice to allow this format, we cannot help there are some vendors that make their applications spit out lines styled like this.
If you change your arguments to the creation of Text::CSV_XS to:
my $csv = Text::CSV_XS->new ({ sep_char => "\t", binary => 1,
eol=> "\n", allow_loose_quotes => 1 });
The problem goes away, well until row 105265, when Error 2023 rears its head:
2023, EIQ - QUO character not allowed, 406
Details of this error in the perldoc:
2023 "EIQ - QUO character not allowed"
Sequences like "foo "bar" baz",qu and 2023,",2008-04-05,"Foo, Bar",\n will cause this error.
Setting your quote character empty (setting quote_char => '' on your call to Text::CSV_XS->new()) does seem to work around this and allow processing of the whole file. However I would take time to check if this is a sane option with the CSV data.
TL;DR The long and short is that your CSV is not in the greatest format, and you will have to work around it.
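Putting the diagnostics and the quote_char workaround together, here is a minimal self-contained sketch. It writes its own tiny sample file with an embedded unescaped quote; adjust file names and options for your real data:

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Build a small tab-separated sample whose second field contains
# unescaped double quotes, like the problematic row in the question.
open my $out, '>', 'test.csv' or die "can't write test.csv: $!";
print $out "name\tdesc\n";
print $out "GT New Coupe\tWheels: 18\" x 8\" Magnetic\n";
close $out;

# quote_char => '' turns off quote handling entirely, so stray double
# quotes inside fields pass through verbatim. Check that this is sane
# for your data before relying on it.
my $csv = Text::CSV_XS->new({
    sep_char   => "\t",
    binary     => 1,
    quote_char => '',
});

open my $fh, '<', 'test.csv' or die "can't open csv: $!";
$csv->column_names($csv->getline($fh));

my @rows;
while (my $row = $csv->getline_hr($fh)) {
    push @rows, $row;
}
printf "parsed %d rows; desc = %s\n", scalar @rows, $rows[0]{desc};
```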

Related

Save a row to csv format

I have a set of rows from a DB that I would like to save to a csv file.
Taking into account that the data are ascii chars without any weird chars would the following suffice?
my $csv_row = join( ', ', @$row );
# save csv_row to file
My concern is whether that would create rows acceptable as CSV by any tool, without e.g. having to worry about quoting etc.
Update:
Is there any difference with this?
my $csv = Text::CSV->new ( { binary => 1, eol => "\n"} );
my $header = join (',', qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 ) );
$csv->print( $fh, [$header] );
foreach my $row ( @data ) {
$csv->print($fh, $row );
}
This gives me as a first line:
" COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4"
Please notice the double quotes and the rest of the rows are without any quotes.
What is the difference from my plain join? Also, do I need binary set?
The safest way is to write clean records with a comma separator. The simpler the better, especially with a format that has so much variation in real life. If needed, double-quote each field.
The true strength in using the module is for reading of "real-life" data. But it makes perfect sense to use it for writing as well, for a uniform approach to CSV. Also, options can then be set in a clear way, and the module can iron out some glitches in data.
The Text::CSV documentation tells us about binary option
Important Note: The default behavior is to accept only ASCII characters in the range from 0x20 (space) to 0x7E (tilde). This means that the fields can not contain newlines. If your data contains newlines embedded in fields, or characters above 0x7E (tilde), or binary data, you must set binary => 1 in the call to new. To cover the widest range of parsing options, you will always want to set binary.
I'd say use it. Since you write a file this may be it for options, along with eol (or use say method). But do scan the many useful options and review their defaults.
As for your header, the print method expects an array reference where each field is an element, not a single string with comma-separated fields. So it is wrong to say
my $header = join (',', qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)); # WRONG
$csv->print( $fh, [$header] );
since $header is a single string, which then becomes the sole element of the anonymous array reference created by [ ... ]. So it prints that string as the first (and only) field of the row, and since the field contains the separator , itself, the module double-quotes it. Instead, you should have
$csv->print($fh, [qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)]);
or better assign column names to @header and then do $csv->print($fh, \@header).
This is also an example of why it is good to use the module for writing – if a comma slips into an element of the array, supposed to be a single field, it is handled correctly by double-quoting.
A complete example
use warnings;
use strict;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new( { binary => 1, eol => "\n" } )
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag();
my $file = 'output.csv';
open my $fh_out, '>', $file or die "Can't open $file for writing: $!";
my @headers = qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 );
my @data = 1..4;
$csv->print($fh_out, \@headers);
$csv->print($fh_out, \@data);
close $fh_out;
which produces the file output.csv:
COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4
1,2,3,4
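To see the quoting behaviour discussed above in isolation, a small sketch using combine(), which builds a single CSV line from a list of fields:

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 });

# A field containing the separator is automatically double-quoted,
# and embedded quote characters are doubled.
$csv->combine('plain', 'has, comma', 'has "quotes"')
    or die "combine failed: " . $csv->error_diag();

print $csv->string(), "\n";
```

The same mechanism is what protects you when a comma slips into a database field you are writing out.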

replace NULL token in Hive CLI output

I am trying to deliver the output of the Hive CLI to my (closed source) application, and want to replace all "NULL" tokens with the empty string. This is because Hive returns NULL even for numeric fields, which the application raises exceptions for. I thought this would be a simple sed or perl regex, but I can't solve the problem so far.
Here's an example of the data record -
NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>2015-02-08
The perl code I tried is
my %replace = (
    "\tNULL\t" => "b",
    "^NULL\t"  => "a",
    "\tNULL\$" => "c",
);
my $regex = join "|", keys %replace;
#$regex = qr/$regex/;
my $filename = 'hout';
open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
    chomp $row;
    $row =~ s/($regex)/$replace{$1}/g;
    print "$row\n";
}
This is the output I get -
NULLbNULLbNULLbNULL<TAB>2015-02-08
In other words, in a stream of 'fields' delimited by a 'character', I want to replace any field that is equal to the string "NULL" with an empty string, so the delimiters surrounding the field (or start of line + delimiter, or delimiter + end of line) become adjacent.
Any guidance would be much appreciated! Thanks!
P.S. I don't need a perl solution per se; just any terse solution would be awesome (I tried sed as well, with similar results).
The root of your problem here is that your patterns overlap. There is a delimiter on either side of each 'NULL', and the substitution consumes the trailing delimiter as it moves on, so the next 'NULL' no longer has a leading delimiter left to match.
So something like this:
my $string = "NULL\tNULL\tNULL\tsome_value\tNULL\n";
print $string;
$string =~ s/(\A|\t)NULL(\t|\Z)/$1$2/g;
print $string;
$string =~ s/(\A|\t)NULL(\t|\Z)/$1$2/g;
print $string;
You need two passes to process it, because the pattern 'grabs' too much for the next iteration to match.
So with reference to: Matching two overlapping patterns with Perl
What you probably need is:
$string =~ s/(\A|\t)NULL(?=(\t|\Z))/$1/g;
You can use the same model if you're wanting to apply it to separate patterns.
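Putting that look-ahead pattern into a minimal runnable sketch:

```perl
use strict;
use warnings;

my $string = "NULL\tNULL\t42\tNULL\t2015-02-08\tNULL";

# (?=...) is a zero-width look-ahead: the trailing delimiter is checked
# but not consumed, so adjacent NULL fields still match in a single /g
# substitution pass.
$string =~ s/(\A|\t)NULL(?=(\t|\Z))/$1/g;
print "$string\n";
```

Every NULL field becomes empty while the tab delimiters (and the date field) are preserved.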
I do this a lot when creating output files from the hive CLI. It may be "simple", but I just pipe my output through this sed command:
hive -f test.hql | sed 's/NULL//g' > test.out
Works fine for me.

Skip bad CSV lines in Perl with Text::CSV

I have a script that is essentially still in testing.
I would like to use Text CSV to breakdown large quantities of CSV files dumped hourly.
These files can be quite large and of inconsistent quality.
Sometimes I'll get strange characters or data, but the usual issue is lines that just stop.
"Something", "3", "hello wor
The closed quote is my biggest hurdle. The script just breaks. The error goes to stderr and my while loop is broken.
while (my $row = $csv->getline($data))
The error I get is...
# CSV_PP ERROR: 2025 - EIQ - Loose unescaped escape
I can't seem to do any kind of error handling for this. If I enable allow_loose_escapes, all I get instead is a lot of errors, because it considers the subsequent new lines as part of the same row.
Allowing the loose escape is not the answer. It just makes your program ignore the error and try to incorporate the broken line with your other lines, as you also mentioned. Instead you can try to catch the problem, and check your $row for definedness:
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
    binary => 1,
    eol    => $/,
});
while (1) {
    my $row = $csv->getline(*DATA);
    $csv->eof and last;
    if (defined $row) {
        $csv->print(*STDOUT, $row);
    } else {
        say "==" x 10;
        print "Bad line, skipping\n";
        say $csv->error_diag();
        say "==" x 10;
    }
}
__DATA__
1,2,3,4
a,b,c,d
"Something", "3", "hello wor
11,22,33,44
For me this outputs:
1,2,3,4
a,b,c,d
====================
Bad line, skipping
2034EIF - Loose unescaped quote143
====================
11,22,33,44
If you want to save the broken lines, you can access them with $csv->error_input(), e.g.:
print $badlines $csv->error_input();
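Note that $badlines above is assumed to be an already-opened filehandle. A self-contained sketch of the same idea, using parse() on in-line data and a bad_lines.txt file (a name chosen for this sketch) for the rejects:

```perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });

# error_input() returns the raw input that failed to parse, so broken
# lines can be saved for later inspection.
open my $badlines, '>', 'bad_lines.txt' or die "bad_lines.txt: $!";

my @good;
for my $line ('1,2,3', '"broken, "line', '4,5,6') {
    if ($csv->parse($line)) {
        push @good, [ $csv->fields() ];
    } else {
        print $badlines $csv->error_input(), "\n";
    }
}
close $badlines;
print scalar(@good), " good lines\n";
```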

Trying to understand Perl split() output

I have a few lines of text that I'm trying to use Perl's split function to convert into an array. The problem is that I'm getting some unusual extra characters in the output, specifically the following string "\cM" (without the quotes). This string appears where there were line breaks in the original text; however, (I believe) those line breaks were removed in the text that I'm trying to split. Does anybody know what's going on with this phenomenon? I posted an example below. Thanks.
Here's the original plain text that I'm trying to split. I'm loading it from a file, in case that matters:
10b2obo12b2o2b$6b3obob3o8bob3o2b$2bobo10bo3b2obo4bo2b$2o4b2o5bo3b4obo
3b2o2b$2bob2o2bo4b3obo5b4obob$8bo4bo13b3o$2bob2o2bo4b3obo5b4obob$2o4b
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
Here is my Perl code that is supposed to do the splitting:
while (<$FH>) {
    chomp;
    $string .= $_;
    last if m/!$/;
}
@rows = split(qr/\$/, $string);
print; # a dummy line to provide a breakpoint for the debugger
This what the debugger outputs when it gets to the "print" line. The issue I'm trying to deal with appears in lines 3, 7, and 10:
DB<10> p $string
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
DB<11> x @rows
0 '10b2obo12b2o2b'
1 '6b3obob3o8bob3o2b'
2 '2bobo10bo3b2obo4bo2b'
3 "2o4b2o5bo3b4obo\cM3b2o2b"
4 '2bob2o2bo4b3obo5b4obob'
5 '8bo4bo13b3o'
6 '2bob2o2bo4b3obo5b4obob'
7 "2o4b\cM2o5bo3b4obo3b2o2b"
8 '2bobo10bo3b2obo4bo2b'
9 '6b3obob3o8bob3o2b'
10 "10b2obo12b2o!\cM"
You know, changing the file input separator would make this code a lot simpler.
$/ = '$';
my @rows = <$FH>;
chomp @rows;
print "@rows";
The debugger is probably using \cM to represent Ctrl-M which is also known as a carriage return (and sometimes \r or ^M). Text files from Windows use a CR-LF (carriage return, line feed) pair to represent the end of a line. If you read such a file on a Unix system, your chomp will strip off the Unix EOL (a single line feed) but leave the CR as is and you end up with stray CRs in your file.
For a file like you have you can just strip out all the trailing whitespace instead of using chomp:
while (defined(my $line = <$FH>)) {
    $line =~ s/\s+$//;
    $string .= $line;
    last if $line =~ /!$/;
}
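A minimal runnable sketch of this whitespace-stripping approach, with an in-memory filehandle standing in for the real DOS-line-ended file:

```perl
use strict;
use warnings;

# Simulate a file with CR-LF line endings (shortened sample data).
my $data = "10b2o\$6b3o\r\n2bobo\$2o4b!\r\n";
open my $FH, '<', \$data or die $!;

my $string = '';
while (defined(my $line = <$FH>)) {
    $line =~ s/\s+$//;          # strips \r\n, plain \n, and trailing spaces
    $string .= $line;
    last if $line =~ /!$/;
}
my @rows = split /\$/, $string;
print join("\n", @rows), "\n";
```

No stray \cM (carriage return) characters survive into the split rows.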
You don't say which OS you're on.
Check out binmode and what it has to say about \cM, and verify that the positions of the \cM characters coincide with the line endings of your input file:
http://perldoc.perl.org/functions/binmode.html
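As an alternative to stripping the CRs by hand, the :crlf I/O layer translates CR-LF to LF on input. A small sketch (again using an in-memory filehandle to simulate the file):

```perl
use strict;
use warnings;

my $data = "line one\r\nline two\r\n";
open my $FH, '<:crlf', \$data or die $!;

my @lines;
while (my $line = <$FH>) {
    chomp $line;                 # :crlf already translated \r\n to \n
    push @lines, $line;
}
print "@lines\n";
```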

Perl Text::CSV_XS Encoding Issues

I'm having issues with Unicode characters in Perl. When I receive data from the web, I often get sequences like √¢¬Ä¬ú or √¢¬Ç¬¨. The first one should be a quotation mark and the second the Euro symbol.
Now I can easily substitute in the correct values in Perl and print the corrected words to the screen, but when I try to output to a .CSV file all the substitutions I have done are for nothing and I get garbage in my .CSV file. (The quotes work, I'm guessing because it's such a common character.) Also Numéro will come out as Num√©ro. The examples are endless.
I wrote a small program to try to figure this issue out, but am not sure what the problem is. I read on another Stack Overflow thread that you can import the .CSV into Excel and choose UTF-8 encoding, but that option does not pop up for me. I'm wondering if I can just encode it into whatever Excel's native character set is (UTF-16BE???), or if there is another solution. I have tried many variations on this short program, and let me say again that it's just for testing out Unicode problems, not part of a legit program. Thanks.
use strict;
use warnings;
require Text::CSV_XS;
use Encode qw/encode decode/;
my $text = 'Numéro Numéro Numéro Orkos Capital SAS (√¢¬Ä¬úOrkos√¢¬Ä¬ù) 325M√¢¬Ç¬¨ in 40 companies headquartered';
print("$text\n\n\n");
$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;
print("$text\n\n\n");
my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV_XS->error_diag();
open my $OUTPUT, ">:encoding(utf8)", "unicode.csv" or die "unicode.csv: $!";
my @row = ($text);
$CSV->print($OUTPUT, \@row);
$OUTPUT->autoflush(1);
I've also tried these two lines to no avail:
$text = decode("Guess", $text);
$text = encode("UTF-16BE", $text);
First, your strings are encoded in MacRoman. When you interpret the second one as a byte sequence, it results in C3 A2 C2 82 C2 AC. This looks like UTF-8, and the decoded form is E2 82 AC. That again looks like UTF-8, and when you decode it you get €. So what you need to do is:
$step1 = decode("MacRoman", $text);
$step2 = decode("UTF-8", $step1);
$step3 = decode("UTF-8", $step2);
Don't ask me on which mysterious ways this encoding has been created in the first place. Your first character decodes as U+201C, which is indeed the LEFT DOUBLE QUOTATION MARK.
Note: If you are on a Mac, the first decoding step may be unnecessary since the encoding is only in the "presentation layer" (when you copied the Perl source into the HTML form and your browser did the encoding-translation for you) and not in the data itself.
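The effect of the two UTF-8 decode steps can be demonstrated on its own (leaving the MacRoman step aside) by first double-encoding the Euro sign and then reversing it:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Simulate the double-encoding: '€' (U+20AC) UTF-8 encoded twice
# yields the bytes C3 A2 C2 82 C2 AC.
my $bytes = encode('UTF-8', encode('UTF-8', "\x{20AC}"));

my $step1 = decode('UTF-8', $bytes);   # characters E2 82 AC
my $step2 = decode('UTF-8', $step1);   # back to the real Euro sign
printf "U+%04X\n", ord($step2);
```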
So I figured out the answer, the comment from Roland Illig helped me get there (thanks again!). Decoding more than once causes the wide characters error, and therefore should not be done.
The key here is decoding the UTF-8 Text and then encoding it in MacRoman. To send the .CSV files to my Windows friends I have to save it as .XLSX first so that the coding doesn't get all screwy again.
$text =~ s/“|”/"/sig;
$text =~ s/’s/'s/sig;
$text =~ s/√¢¬Ç¬¨/€/sig;
$text =~ s/√¢¬Ñ¬¢/®/sig;
$text =~ s/ / /sig;
$text = decode("UTF-8", $text);
print("$text\n\n\n");
my $CSV = Text::CSV_XS->new ({ binary => 1, eol => "\n" }) or die "Cannot use CSV: ".Text::CSV_XS->error_diag();
open my $OUTPUT, ">:encoding(MacRoman)", "unicode.csv" or die "unicode.csv: $!";
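For completeness, a minimal sketch of writing a row through the MacRoman-encoded handle, using a short sample string rather than the full text above:

```perl
use strict;
use warnings;
use Text::CSV_XS;
use Encode qw(decode);

# "Num\xC3\xA9ro" is the raw UTF-8 byte string for 'Numéro'.
my $text = decode('UTF-8', "Num\xC3\xA9ro");

my $CSV = Text::CSV_XS->new({ binary => 1, eol => "\n" })
    or die "Cannot use CSV: " . Text::CSV_XS->error_diag();

# The :encoding(MacRoman) layer converts characters to MacRoman bytes
# on the way out.
open my $OUTPUT, '>:encoding(MacRoman)', 'unicode.csv' or die "unicode.csv: $!";
$CSV->print($OUTPUT, [$text]);
close $OUTPUT;
```

The resulting file contains the single MacRoman byte 0x8E where the é was, instead of mojibake.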