Skip bad CSV lines in Perl with Text::CSV - perl

I have a script that is essentially still in testing.
I would like to use Text CSV to breakdown large quantities of CSV files dumped hourly.
These files can be quite large and of inconsistent quality.
Sometimes I'll get strange characters or data, but the usual issue is lines that just stop.
"Something", "3", "hello wor
The closed quote is my biggest hurdle. The script just breaks. The error goes to stderr and my while loop is broken.
While (my $row = $csv->getline($data))
The error I get is...
# CSV_PP ERROR: 2025 - EIQ - Loose unescaped escape
I can't seem to do any kind of error handling for this. If I enable allow_loose_escapes, all I get instead is a lot of errors, because it considers the subsequent new lines as part of the same row.

Allowing the loose escape is not the answer. It just makes your program ignore the error and try to incorporate the broken line with your other lines, as you also mentioned. Instead you can try to catch the problem, and check your $row for definedness:
use strict;
use warnings;
use Text::CSV;
use feature 'say';
my $csv = Text::CSV->new({
binary => 1,
eol => $/,
});
while (1) {
my $row = $csv->getline(*DATA);
$csv->eof and last;
if (defined $row) {
$csv->print(*STDOUT, $row);
} else {
say "==" x 10;
print "Bad line, skipping\n";
say $csv->error_diag();
say "==" x 10;
}
}
__DATA__
1,2,3,4
a,b,c,d
"Something", "3", "hello wor
11,22,33,44
For me this outputs:
1,2,3,4
a,b,c,d
====================
Bad line, skipping
2034EIF - Loose unescaped quote143
====================
11,22,33,44
If you want to save the broken lines, you can access them with $csv->error_input(), e.g.:
print $badlines $csv->error_input();

Related

Issues parsing a CSV file in perl using Text::CSV

I'm trying to use Text::CSV to parse this CSV file. Here is how I am doing it:
open my $fh, '<', 'test.csv' or die "can't open csv";
my $csv = Text::CSV_XS->new ({ sep_char => "\t", binary => 1 , eol=> "\n"});
$csv->column_names($csv->getline($fh));
while(my $row = $csv->getline_hr($fh)) {
# use row
}
Because the file has 169,252 rows (not counting the headers line), I expect the loop to run that many times. However, it only runs 8 times and gives me 8 rows. I'm not sure what's happening, because the CSV just seems like a normal CSV file with \n as the line separator and \t as the field separator. If I loop through the file like this:
while(my $line = <$fh>) {
my $fields = $csv->parse($line);
}
Then the loop goes through all rows.
Text::CSV_XS is silently failing with an error. If you put the following after your while loop:
my ($cde, $str, $pos) = $csv->error_diag ();
print "$cde, $str, $pos\n";
You can see if there were errors parsing the file and you get the output:
2034, EIF - Loose unescaped quote, 336
Which means the column:
GT New Coupe 5.0L CD Wheels: 18" x 8" Magnetic Painted/Machined 6 Speakers
has an unquoted escape string (there is no backslash before the ").
The Text::CSV perldoc states:
allow_loose_quotes
By default, parsing fields that have quote_char characters inside an unquoted field, like
1,foo "bar" baz,42
would result in a parse error. Though it is still bad practice to allow this format, we cannot help there are some vendors that make their applications spit out lines styled like this.
If you change your arguments to the creation of Text::CSV_XS to:
my $csv = Text::CSV_XS->new ({ sep_char => "\t", binary => 1,
eol=> "\n", allow_loose_quotes => 1 });
The problem goes away, well until row 105265, when Error 2023 rears its head:
2023, EIQ - QUO character not allowed, 406
Details of this error in the perldoc:
2023 "EIQ - QUO character not allowed"
Sequences like "foo "bar" baz",qu and 2023,",2008-04-05,"Foo, Bar",\n will cause this error.
Setting your quote character empty (setting quote_char => '' on your call to Text::CSV_XS->new()) does seem to work around this and allow processing of the whole file. However I would take time to check if this is a sane option with the CSV data.
TL;DR The long and short is that your CSV is not in the greatest format, and you will have to work around it.

Using Perl to find and fix errors in CSV files

I am dealing with very large amounts of data. Every now and then there is a slip up. I want to identify each row with an error, under a condition of my choice. With that I want the row number along with the line number of each erroneous row. I will be running this script on a handful of files and I will want to output the report to one.
So here is my example data:
File_source,ID,Name,Number,Date,Last_name
1.csv,1,Jim,9876,2014-08-14,Johnson
1.csv,2,Jim,9876,2014-08-14,smith
1.csv,3,Jim,9876,2014-08-14,williams
1.csv,4,Jim,9876,not_a_date,jones
1.csv,5,Jim,9876,2014-08-14,dean
1.csv,6,Jim,9876,2014-08-14,Ruzyck
Desired output:
Row#5,4.csv,4,Jim,9876,not_a_date,jones (this is an erroneous row)
The condition I have chosen is print to output if anything in the date field is not a date.
As you can see, my desired output contains the line number where the error occurred, along with the data itself.
After I have my output that shows the lines within each file that are in error, I want to grab that line from the untouched original CSV file to redo (both modified and original files contain the same amount of rows). After I have a file of these redone rows, I can omit and clean up where needed to prevent interruption of an import.
Folder structure will contain:
Modified: 4.txt
Original: 4.csv
I have something started here, written in Perl, which by the logic will at least return the rows I need. However I believe my syntax is a little off and I do not know how to plug in the other subroutines.
Code:
$count = 1;
while (<>) {
unless ($F[4] =~ /\d+[-]\d+[-]\d+/)
print "Row#" . $count++ . "," . "$_";
}
The code above is supposed to give me my erroneous rows, but to be able to extract them from the originals is beyond me. The above code also contains some syntax errors.
This will do as you ask.
Please be certain that none of the fields in the data can ever contain a comma , otherwise you will need to use Text::CSV to process it instead of just a simple split.
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'example.csv';
<$fh>; # Skip header
while (<$fh>) {
my #fields = split /,/;
if( $fields[4] !~ /^\d{4}-\d{2}-\d{2}$/ ) {
print "Row#$.,$_";
}
}
output
Row#5,4.csv,4,Jim,9876,not_a_date,jones
Update
If you want to process a number of files then you need this instead.
The close ARGV at the end of the loop is there so that the line counter $. is reset to
1 at the start of each file. Without it it just continues from 1 upwards across all the files.
You would run this like
rob#Samurai-U:~$ perl findbad.pl *.csv
or you could list the files individually, separated by spaces.
For the test I have created files 1.csv and 2.csv which are identical to your example data except that the first field of each line is the name of the file containing the data.
You may not want the line in the output that announces each file name, in which case you should replace the entire first if block with just next if $. == 1.
use strict;
use warnings;
#ARGV = map { glob qq{"$_"} } #ARGV; # For Windows
while (<>) {
if ($. == 1) {
print "\n\nFile: $ARGV\n\n";
next;
}
my #fields = split /,/;
unless ( $fields[4] =~ /^\d{4}-\d{2}-\d{2}$/ ) {
printf "Row#%d,%s", $., $_;
}
close ARGV if eof ARGV;
}
output
File: 1.csv
Row#5,1.csv,4,Jim,9876,not_a_date,jones
File: 2.csv
Row#5,2.csv,4,Jim,9876,not_a_date,jones

Perl - Need to append duplicates in a file and write unique value only

I have searched a fair bit and hope I'm not duplicating something someone has already asked. I have what amounts to a CSV that is specifically formatted (as required by a vendor). There are four values that are being delimited as follows:
"Name","Description","Tag","IPAddresses"
The list is quite long (and there are ~150 unique names--only 2 in the sample below) but it basically looks like this:
"2B_AppName-Environment","desc","tag","192.168.1.1"
"2B_AppName-Environment","desc","tag","192.168.22.155"
"2B_AppName-Environment","desc","tag","10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4"
"6G_ServerName-AltEnv","desc","tag","192.192.192.40"
"6G_ServerName-AltEnv","desc","tag","192.168.50.5"
I am hoping for a way in Perl (or sed/awk, etc.) to come up with the following:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
So basically, the resulting file will APPEND the duplicates to the first match -- there should only be one line per each app/server name with a list of comma-separated IP addresses just like what is shown above.
Note that the "Decription" and "Tag" fields don't need to be considered in the duplication removal/append logic -- let's assume these are blank for the example to make things easier. Also, in the vendor-supplied list, the "Name" entries are all already sorted to be together.
This short Perl program should suit you. It expects the path to the input CSV file as a parameter on the command line and prints the result to STDOUT. It keeps track of the appearance of new name fields in the #names array so that it can print the output in the order that each name first appears, and it takes the values for desc and tag from the first occurrence of each unique name.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (#names, %data);
while (my $row = $csv->getline(*ARGV)) {
my $name = $row->[0];
if ($data{$name}) {
$data{$name}[3] .= ','.$row->[3];
}
else {
push #names, $name;
$data{$name} = $row;
}
}
for my $name (#names) {
$csv->print(*STDOUT, $data{$name});
}
output
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Update
Here's a version that ignores any record that doesn't have a valid IPv4 address in the fourth field. I've used Regexp::Common as it's the simplest way to get complex regex patterns right. It may need installing on your system.
use strict;
use warnings;
use Text::CSV;
use Regexp::Common;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (#names, %data);
while (my $row = $csv->getline(*ARGV)) {
my ($name, $address) = #{$row}[0,3];
next unless $address =~ $RE{net}{IPv4};
if ($data{$name}) {
$data{$name}[3] .= ','.$address;
}
else {
push #names, $name;
$data{$name} = $row;
}
}
for my $name (#names) {
$csv->print(*STDOUT, $data{$name});
}
I would advise you to use a CSV parser like Text::CSV for this type of problem.
Borodin has already pasted a good example of how to do this.
One of the approaches that I'd advise you NOT to use are regular expressions.
The following one-liner demonstrates how one could do this, but this is a very fragile approach compared to an actual csv parser:
perl -0777 -ne '
while (m{^((.*)"[^"\n]*"\n(?:(?=\2).*\n)*)}mg) {
$s = $1;
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;
print $s
}' test.csv
Outputs:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Explanation:
Switches:
-0777: Slurp the entire file
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
while (m{^((.*)"[^"]*"\n(?:(?=\2).*\n)*)}mg): Separate text into matching sections.
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;: Join all ip addresses by a comma in matching sections.
print $s: Print the results.

new to Perl - CSV - find a string and print all numbers in that column

I've got a bunch of data in a CSV file, first row is all strings (all text and underscores), all subsequent rows are filled with numbers relating to said strings.
I'm trying to parse through the first line and find particular strings, remember which column that string was in, and then go through the rest of the file and get the data in the same column. I need to do this to three strings.
I've been using Text::CSV but I can't figure out how to get it to increment a counter until it finds the string in the first line and then go to the next line, get the data from that same column, etc. etc. Here's what I've tried so far:
while (<CSV>) {
if ($csv->parse($data)) {
my #field = $csv->fields;
my $count = 0;
for $column (#field) {
print ++$count, " => ", $column, "\n";
}
} else {
my $err = $csv->error_input;
print "Failed to parse line: $err";
}
}
Since $data is in line 1, it prints "1 $data" 25 times (# of lines in CSV file). How do I get it to remember which column it found $data in? Also, since I know all of the strings are in line 1, how do I get it to only parse through line 1, find all of the strings in #data, and then parse through the rest of the file, grabbing data from the necessary columns and putting it into a matrix or array of arrays?
Thanks for the help!
edit: I realized my questions were a bit poorly phrased. I don't know how to get the column number from CSV. How is this done?
Also, once I've got the column number, how do I tell it CSV to run through the subsequent lines and grab data from only that column?
Try something like this:
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({binary=>1});
my $thing_to_match = "blah";
my $matched_index;
my #stored_data = ();
while(my $row= $csv->getline(*DATA)) #grabs lines below __DATA__
#(near the end of the script)
{
my #fields = #$row;
#If we haven't found the matched index, yet, search for it.
if(not defined $matched_index)
{
foreach my $i(0..$#fields)
{
$matched_index = $i if($fields[$i] eq $thing_to_match);
}
}
#NOTE: We're pushing a *reference* to an array!
#Look at perldoc perldata
push #stored_data,\#fields;
}
die "Column for '$thing_to_match' not found!" unless defined $matched_index;
foreach my $row(#stored_data)
{
print $row->[$matched_index] . "\n";
}
__DATA__
stuff,more stuff,yet more stuff
"yes, this thing, is one item",blah,blarg
1,2,3
The output is:
more stuff
blah
2
I don't have time to write up a full example, but I wrote a module that might help you do this. Tie::Array::CSV uses some magic to make your csv file act like a Perl array of arrayrefs. In this way you can use your knowledge of Perl to interact with the file.
A word of warning though! One benefit of my module is that it is read/write. Since you only want read, be careful not to assign to it!

how to put a file into an array and save it in perl

Hello everyone I'm a beginner in perl and I'm facing some problems as I want to put my strings starting from AA to \ in to an array and want to save it. There are about 2000-3000 strings in a txt file starting from same initials i.e., AA to / I'm doing it by this way plz correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
if ( $line = m//\/g)
{
$flag = 1;
}
while ( $flag != 0)
{
for ($i = 0; $i <= 10000; $i++)
{ # Missing brace added by editor
$array[$i] = $line;
} # Missing brace added by editor
}
} # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;
my $recnum = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
print "$_\n";
$recnum++;
}
else
{
print "$recnum $_\n";
}
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my #data;
my $hashref = { };
my $nrecs = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
# End of group - save to data array and start new hash
$data[$nrecs++] = $hashref;
$hashref = { };
}
else
{
m/^([A-Z]+)\s+(.*)$/;
$hashref->{$1} = $2;
}
}
foreach my $i (0..$nrecs-1)
{
print "Record $i:\n";
foreach my $key (sort keys $data[$i])
{
print " $key = $data[$i]->{$key}\n";
}
}
print Data::Dumper->Dump([ \#data ], [ '#data' ]);
Sample output for example input:
Record 0:
AA = c0001
BB = afsfjgfjgjgjflffbg
CC = table
DD = hhhfsegsksgk
EB = jksgksjs
Record 1:
AA = e0002
BB = rejwkghewhgsejkhrj
CC = chair
DD = egrhjrhojohkhkhrkfs
VB = rkgjehkrkhkh;r
$#data = [
{
'EB' => 'jksgksjs',
'CC' => 'table',
'AA' => 'c0001',
'BB' => 'afsfjgfjgjgjflffbg',
'DD' => 'hhhfsegsksgk'
},
{
'CC' => 'chair',
'AA' => 'e0002',
'VB' => 'rkgjehkrkhkh;r',
'BB' => 'rejwkghewhgsejkhrj',
'DD' => 'egrhjrhojohkhkhrkfs'
}
];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out because you do not reset the flag whenever you find an break-line. You'll have to parse you input and exit the loop if necessary.
And second you never read any input within this loop and thus process the same $line over and over again.
You should not put the loop inside your code but instead you can use the following pattern (pseudo-code)
if flag != 0
append item to array
else
save array to file
start with new array
end
I believe what you want is to split the files content at \ though it's not too clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandlers read perlvar
#!perl
use strict;
use warnings;
my $content;
{
open my $fh, '<', 'test.txt';
local $/; # slurp mode
$content = <$fh>;
close $fh;
}
my #blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;
my #blocks;
{
open my $fh, '<', 'test.txt';
local $/ = '\\'; # seperate at \
#blocks = <$fh>;
close $fh;
}
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless, unless you change it. Because if I were to print the #data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):
Find a line that begins with "AA",
and save that into $line
Concatenate every new line from the
file into $line
When you find a line that begins with
a backslash \, stop concatenating
lines and save $line into #data.
Then, find the next line that begins
with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA$/, but then you need to watch out for whitespace at the beginning and end of line. So perhaps /^\s*\\\s*$/ and /^\s*AA\s*$/ instead.
The code:
use warnings;
use strict;
my $line="";
my #data;
while (<DATA>) {
if (/^AA/) {
$line = $_;
while (<DATA>) {
$line .= $_;
last if /^\\/;
}
}
push #data, $line;
}
use Data::Dumper;
print Dumper \#data;
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\