I have a set of rows from a DB that I would like to save to a csv file.
Taking into account that the data are ascii chars without any weird chars would the following suffice?
my $csv_row = join( ', ', @$row );
# save csv_row to file
My concern is whether that would create rows that any tool would accept as CSV, e.g. without my having to worry about quoting etc.
Update:
Is there any difference with this?
my $csv = Text::CSV->new ( { binary => 1, eol => "\n"} );
my $header = join (',', qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 ) );
$csv->print( $fh, [$header] );
foreach my $row ( @data ) {
$csv->print($fh, $row );
}
This gives me as a first line:
" COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4"
Please notice the double quotes and the rest of the rows are without any quotes.
How does this differ from my plain join? Also, do I need to set binary?
The safest way should be to write clean records with a comma separator. The simpler the better, especially with a format that has so much variation in real life. If needed, double-quote each field.
The true strength in using the module is for reading of "real-life" data. But it makes perfect sense to use it for writing as well, for a uniform approach to CSV. Also, options can then be set in a clear way, and the module can iron out some glitches in data.
The Text::CSV documentation says this about the binary option:
Important Note: The default behavior is to accept only ASCII characters in the range from 0x20 (space) to 0x7E (tilde). This means that the fields can not contain newlines. If your data contains newlines embedded in fields, or characters above 0x7E (tilde), or binary data, you must set binary => 1 in the call to new. To cover the widest range of parsing options, you will always want to set binary.
I'd say use it. Since you write a file this may be it for options, along with eol (or use say method). But do scan the many useful options and review their defaults.
As for your header, the print method expects an array reference where each field is an element, not a single string with comma-separated fields. So it is wrong to say
my $header = join (',', qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)); # WRONG
$csv->print( $fh, [$header] );
since $header is a single string, which is then made the sole element of the (anonymous) array reference created by [ ... ]. So it prints this string as the first field in the row, and since the string itself contains the separator character (,) it is also double-quoted. Instead, you should have
$csv->print($fh, [qw(COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4)]);
or better assign column names to @header and then do $csv->print($fh, \@header).
This is also an example of why it is good to use the module for writing – if a comma slips into an element of the array, supposed to be a single field, it is handled correctly by double-quoting.
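To make the quoting point concrete, here is a minimal core-Perl sketch of roughly what a CSV writer has to do that a plain join does not. The helpers csv_field and csv_line are hypothetical names made up for illustration (they are not part of Text::CSV), and they assume the common convention of wrapping a field in double quotes and doubling any embedded quotes. In real code, let the module do this.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a single field only when it needs it.
sub csv_field {
    my ($field) = @_;
    # Fields without separators, quotes, or line breaks pass through unchanged.
    return $field unless $field =~ /[",\r\n]/;
    # Double any embedded quotes, then wrap the whole field in quotes.
    (my $escaped = $field) =~ s/"/""/g;
    return qq{"$escaped"};
}

# Hypothetical helper: build one CSV line (without the line ending).
sub csv_line {
    my @fields = @_;
    return join ',', map { csv_field($_) } @fields;
}

my @row = ('Smith, John', 'said "hi"', 42);

my $naive  = join ',', @row;    # the comma inside the first field splits it in two
my $quoted = csv_line(@row);    # three fields stay three fields

print "$naive\n";     # Smith, John,said "hi",42
print "$quoted\n";    # "Smith, John","said ""hi""",42
```

A reader that splits the naive line on commas sees four fields instead of three, which is the kind of glitch the module irons out.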
A complete example
use warnings;
use strict;
use Text::CSV;
my $csv = Text::CSV->new ( { binary => 1, eol => "\n" } )
or die "Cannot use CSV: " . Text::CSV->error_diag();
my $file = 'output.csv';
open my $fh_out, '>', $file or die "Can't open $file for writing: $!";
my @headers = qw( COL_NAME1 COL_NAME2 COL_NAME3 COL_NAME4 );
my @data = 1..4;
$csv->print($fh_out, \@headers);
$csv->print($fh_out, \@data);
close $fh_out;
which produces the file output.csv
COL_NAME1,COL_NAME2,COL_NAME3,COL_NAME4
1,2,3,4
Related
I have a text file that contains multiple XML documents, one per line, that look like this:
<queryResponse><entity><devicesDTO><clearedAlarms>1</clearedAlarms><warningAlarms>0</warningAlarms></devicesDTO></entity></queryResponse>
<queryResponse><entity><devicesDTO><clearedAlarms>2</clearedAlarms><warningAlarms>2</warningAlarms></devicesDTO></entity></queryResponse>
I would like to convert each line to a csv:
clearedAlarms, warningAlarms
1, 0
2, 2
Here's what I have now; it only lets me parse a single XML file and output the csv. The file has actually changed now and I'm supposed to be reading a txt file that contains multiple xmls
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use XML::Simple;
#Elements, that I want see in my csv
my @Fields = qw{clearedAlarms warningAlarms};
open(my $out, '>', 'test.csv') or die "Output: $!\n";
print $out join(',', @Fields) . "\n";
my $xml = XMLin('test.xml', ForceArray => ['entity']);
foreach my $entity ( @{ $xml->{entity} } ) {
    print Dumper $entity;
}
foreach my $entity ( @{ $xml->{entity} } ) {
    print $out join( ',', @{ $entity->{devicesDTO} }{@Fields} ) . "\n";
}
As the Perl slogan goes: “There's More Than One Way To Do It!” If you don't want to use an XML module (as you mentioned, the file has changed and you are now supposed to read a txt file that contains multiple XML documents), you can use the File::Grep module (https://metacpan.org/pod/File::Grep), which finds matches to a pattern in a series of files, for the file operations, and Text::CSV_XS (https://metacpan.org/pod/Text::CSV_XS), which provides more CSV-related functions that you can use as your requirements dictate.
fmap BLOCK LIST
Performs a map operation on the files in LIST, using BLOCK as the mapping function. The results from BLOCK will be appended to the list that is returned at the end of the call.
csv This is a high-level function that aims at simple (user) interfaces. This can be used to read/parse a CSV file or stream (the default behavior) or to produce a file or write to a stream (define the out attribute).
use strict;
use warnings;
use File::Grep qw(fmap);
use Text::CSV_XS qw( csv );
use Data::Dumper;
# start with an empty array ref so the check below is safe when nothing matches
my $data = [];
my $csv_file='test_file.csv';
# my @result = fmap { <block> } file_name;
# replace *DATA with your file path.
# checking the pattern and extracting value
# pushing values to array to create array of array
fmap { (/<clearedAlarms>(.*?)<\/clearedAlarms><warningAlarms>(.*?)<\/warningAlarms>/ ? push(@$data,[$1,$2]) : () ) } *DATA;
if (@$data) {
# Write array of arrays as csv file
csv (in => $data, out => $csv_file, sep_char=> ",", headers => [qw( clearedAlarms warningAlarms )]);
} else {
print "\n data not found (provide proper message)\n";
}
__DATA__
<queryResponse><entity><devicesDTO><clearedAlarms>1</clearedAlarms><warningAlarms>0</warningAlarms></devicesDTO></entity></queryResponse>
<queryResponse><entity><devicesDTO><clearedAlarms>2</clearedAlarms><warningAlarms>2</warningAlarms></devicesDTO></entity></queryResponse>
Output (if you open the $csv_file file)
clearedAlarms,warningAlarms
1,0
2,2
Given the simplicity of the XML schema, this is easier to do with AnyData.
For instance:
#!/usr/bin/perl
# This script converts a XML file to CSV format.
# Load the AnyData XML to CSV conversion modules
use XML::Parser;
use XML::Twig;
use AnyData;
my $input_xml = "test.xml";
my $output_csv = "test.csv";
$flags->{record_tag} = 'ITEM';
adConvert( 'XML', $input_xml, 'CSV', $output_csv, $flags );
Would convert your data structure (XML) into:
clearedAlarms, warningAlarms
1, 0
2, 2
If your file structure is always the same, you indeed don't need an XML parser. But you don't need anything else either. You can treat that input file like a slightly more complex CSV file that has weird delimiters.
split uses a pattern to turn a string into a list of strings. By default it consumes whatever the delimiter pattern matches, so the delimiters disappear. We can use a pattern that looks like an XML tag as the delimiter. Note how I am using qr// with a delimiter that isn't the slash / to make this more readable, because it avoids escaping the optional slash / in the closing tag.
split qr{</?[^>]+>}, '<foo>123</foo>';
This will produce a data structure that looks like this (using Data::Printer to produce the output):
[
[0] "",
[1] 123
]
The first element is an empty string, which denotes the lack of any other characters before the first match of the delimiter pattern. We need to filter these out. That's easily done with grep.
grep { $_ ne q{} } split qr{</?[^>]+>}, '<foo>123</foo>';
Now our output is nice and clean.
[
[0] 123
]
All we now need to do is apply this to a full file. Since our data only contains a couple of numbers, there is no need to use Text::CSV in this case.
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
chomp;
say join ',', grep { $_ ne q{} } split qr{</?[^>]+>};
}
__DATA__
<queryResponse><entity><devicesDTO><clearedAlarms>1</clearedAlarms><warningAlarms>0</warningAlarms></devicesDTO></entity></queryResponse>
<queryResponse><entity><devicesDTO><clearedAlarms>2</clearedAlarms><warningAlarms>2</warningAlarms></devicesDTO></entity></queryResponse>
Please keep in mind that we can use a pattern match here because we don't actually parse the XML. Do not use regular expressions if you want to do any real XML parsing!
I have the following command in my perl script:
my @files = `find $basedir/ -type f -iname '$sampleid*.summary.csv'`; # there are multiple summary.csv files in my basedir. I store them in an array
my $summary = `tail -n 1 $files[0]`; # Each summary.csv contains a header line and a line with data. I fetch here the last line.
chomp($summary);
my @sp = split(/,/,$summary); # I split based on ','
my $gender = $sp[11]; # the values from column 11 are stored in $gender
my $qc = $sp[2]; # the values from column 2 are stored in $qc
Now, I'm experiencing the situation where my *summary.csv files don't have the same number of columns. They do all have 2 lines, where the first line represents the header.
What I want now is not storing the values from column 11 in gender, but I want to store the values from the column 'Gender' in $gender.
How can I achieve this?
First try at solution:
my %hash = ();
my $header = `head -n 1 $files[0]`; #reading the header
chomp ($header);
my @colnames = split (/,/,$header);
my $keyfield = $colnames[#here should be the column with the name 'Gender']
push @{ $hash{$keyfield} };
my $gender = $sp[$keyfield]
You will have to read the header line as well as the data to know what column holds which information. This is done easiest by writing actual Perl code instead of shelling out to various command line utilities. See further below for that solution.
Fixing your solution also requires a hash. You need to read the header line first, store the header fields in an array (as you've already done), and then read the data line. The data needs to be a hash, not an array. A hash is a map of keys and values.
# read the header and create a list of header fields
my $header = `head -n 1 $files[0]`;
chomp ($header);
my @colnames = split (/,/,$header);
# read the data line
my $summary = `tail -n 1 $files[0]`;
chomp($summary);
my %sp; # use a hash for the data, not an array
# use a hash slice to fill in the columns
@sp{@colnames} = split(/,/,$summary);
my $gender = $sp{Gender};
The tricky part here is this line.
@sp{@colnames} = split(/,/,$summary);
We have declared %sp as a hash, but we now access it with an @ sigil. That's because we are taking a hash slice, as indicated by the curly braces {}. The slice we take is all elements whose keys are the values in @colnames. There is more than one value, so the return value is not a scalar (with a $) any more. There is a list of return values, so the sigil turns to @. Now we use that list on the left hand side (that's called an LVALUE), and assign the result of the split to that list.
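Here is the same hash slice as a small self-contained sketch that runs without any files. The header and data strings (Sample, QC, Gender and their values) are made up for illustration and only stand in for the two lines of a summary.csv.

```perl
use strict;
use warnings;

# Made-up stand-ins for the header line and the data line of one file.
my $header  = 'Sample,QC,Gender';
my $summary = 'S1,PASS,F';

my @colnames = split /,/, $header;

my %sp;
# Hash slice: pair each column name with the value in the same position.
@sp{@colnames} = split /,/, $summary;

# The position of the column no longer matters, only its name.
my $gender = $sp{Gender};
my $qc     = $sp{QC};

print "Gender: $gender, QC: $qc\n";    # Gender: F, QC: PASS
```

If a file has extra columns or a different column order, the lookups by name keep working as long as the header contains the names you ask for.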
Doing it with modern Perl
The following program will use File::Find::Rule to replace your find command, and Text::CSV to read the CSV file. It grabs all the files, then opens one at a time. The header line will be read first, and fed into the Text::CSV object, so that it can then give back a hash reference, which you can use to access every field by name.
I've written it in a way that it will only read one line for each file, as you said there are only two lines per file. You can easily extend that to be a loop.
use strict;
use warnings;
use File::Find::Rule;
use Text::CSV;
my $sampleid;
my $basedir;
my $csv = Text::CSV->new(
{
binary => 1,
sep => ',',
}
) or die "Cannot use CSV: " . Text::CSV->error_diag;
my @files = File::Find::Rule->file()->name("$sampleid*.summary.csv")->in($basedir);
foreach my $file (@files) {
    open my $fh, '<', $file or die "Can't open $file: $!";
    # get the headers
    my @cols = @{ $csv->getline($fh) };
    $csv->column_names(@cols);
    # read the first data line
    my $row = $csv->getline_hr($fh);
    # do whatever you want with the row; the key is the column name from the header
    print "$file: ", $row->{Gender}, "\n";
}
Please note that I have not tested this program.
I'm trying to use Text::CSV to parse this CSV file. Here is how I am doing it:
open my $fh, '<', 'test.csv' or die "can't open csv";
my $csv = Text::CSV_XS->new ({ sep_char => "\t", binary => 1 , eol=> "\n"});
$csv->column_names($csv->getline($fh));
while(my $row = $csv->getline_hr($fh)) {
# use row
}
Because the file has 169,252 rows (not counting the headers line), I expect the loop to run that many times. However, it only runs 8 times and gives me 8 rows. I'm not sure what's happening, because the CSV just seems like a normal CSV file with \n as the line separator and \t as the field separator. If I loop through the file like this:
while(my $line = <$fh>) {
my $fields = $csv->parse($line);
}
Then the loop goes through all rows.
Text::CSV_XS is silently failing with an error. If you put the following after your while loop:
my ($cde, $str, $pos) = $csv->error_diag ();
print "$cde, $str, $pos\n";
You can see if there were errors parsing the file and you get the output:
2034, EIF - Loose unescaped quote, 336
Which means the column:
GT New Coupe 5.0L CD Wheels: 18" x 8" Magnetic Painted/Machined 6 Speakers
contains quote characters inside an unquoted field (the " after 18 and after 8 is not escaped).
The Text::CSV perldoc states:
allow_loose_quotes
By default, parsing fields that have quote_char characters inside an unquoted field, like
1,foo "bar" baz,42
would result in a parse error. Though it is still bad practice to allow this format, we cannot help there are some vendors that make their applications spit out lines styled like this.
If you change your arguments to the creation of Text::CSV_XS to:
my $csv = Text::CSV_XS->new ({ sep_char => "\t", binary => 1,
eol=> "\n", allow_loose_quotes => 1 });
The problem goes away, well until row 105265, when Error 2023 rears its head:
2023, EIQ - QUO character not allowed, 406
Details of this error in the perldoc:
2023 "EIQ - QUO character not allowed"
Sequences like "foo "bar" baz",qu and 2023,",2008-04-05,"Foo, Bar",\n will cause this error.
Setting your quote character empty (setting quote_char => '' on your call to Text::CSV_XS->new()) does seem to work around this and allow processing of the whole file. However I would take time to check if this is a sane option with the CSV data.
TL;DR The long and short is that your CSV is not in the greatest format, and you will have to work around it.
I am trying to deliver the output of Hive CLI to my (closed source) application, and want to replace all "NULL" tokens with the empty string. This is because Hive returns NULL even for numeric fields, for which the application raises exceptions. I thought this should be a simple sed or perl regex, but I can't solve the problem so far.
Here's an example of the data record -
NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>NULL<TAB>2015-02-08
The perl code I tried is
my %replace = (
"\tNULL\t" => "b",
"^NULL\t" => "a",
"\tNULL\$" => "c"
);
my $regex = join "|", keys %replace;
#$regex = qr/$regex/;
my $filename = 'hout';
open(my $fh, '<:encoding(UTF-8)', $filename)
or die "Could not open file '$filename' $!";
while (my $row = <$fh>) {
chomp $row;
$row =~ s/($regex)/$replace{$1}/g;
print "$row\n";
}
This is the output I get -
NULLbNULLbNULLbNULL<TAB>2015-02-08
In other words, in a stream of 'fields' delimited by a 'character', I want to replace any field that is equal to the string "NULL" with an empty string, so the delimiters surrounding the field (or start of line + delimiter, or delimiter + end of line) become adjacent.
Any guidance would be much appreciated! Thanks!
P.S. I don't need a perl solution per se; just any terse solution would be awesome (I tried sed as well with similar results)
The root of your problem here is that your patterns overlap. You have a delimiter either side of your 'NULL' which you then modify and move on.
So something like this:
my $string = "NULL\tNULL\tNULL\tsome_value\tNULL\n";
print $string;
$string =~ s/(\A|\t)NULL(\t|\Z)/$1$2/g;
print $string;
$string =~ s/(\A|\t)NULL(\t|\Z)/$1$2/g;
print $string;
You need two passes to process it, because the pattern 'grabs' too much for the next iteration to match.
So with reference to: Matching two overlapping patterns with Perl
What you probably need is:
$string =~ s/(\A|\t)NULL(?=(\t|\Z))/$1/g;
You can use the same model if you're wanting to apply it to separate patterns.
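As a runnable sketch of the difference, the following compares the two-sided pattern with the lookahead version on one made-up record. It uses \z on an already chomped string rather than \Z, which is an assumption about how the line is read; tabs are shown as <TAB> in the output to make them visible.

```perl
use strict;
use warnings;

# Made-up record, already chomped.
my $string = "NULL\tNULL\tNULL\tsome_value\tNULL";

# Two-sided pattern: each match consumes the trailing tab, so the
# NULL right after it has no leading delimiter left to match against.
(my $two_sided = $string) =~ s/(\A|\t)NULL(\t|\z)/$1$2/g;

# Lookahead: the trailing delimiter is checked but not consumed,
# so it is still available as the leading delimiter of the next field.
(my $lookahead = $string) =~ s/(\A|\t)NULL(?=\t|\z)/$1/g;

for my $result ($two_sided, $lookahead) {
    (my $shown = $result) =~ s/\t/<TAB>/g;
    print "$shown\n";
}
# <TAB>NULL<TAB><TAB>some_value<TAB>
# <TAB><TAB><TAB>some_value<TAB>
```

The second line has the same four tabs as the input, with every NULL field emptied in a single pass.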
I do this a lot when creating output files from the hive cli. It may be "simple", but I just pipe my output through this sed command:
hive -f test.hql | sed 's/NULL//g' > test.out
Works fine for me.
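One caveat worth checking against your own data: a global replacement also removes NULL when it is only part of a longer value. If that matters, a field-wise approach is one alternative. The sketch below is a hypothetical helper (blank_null_fields is a made-up name) that assumes tab-delimited input and blanks only fields that are exactly NULL.

```perl
use strict;
use warnings;

# Hypothetical helper: empty only the fields that are exactly "NULL".
sub blank_null_fields {
    my ($line) = @_;
    # The -1 limit keeps trailing empty fields, so the column count is preserved.
    return join "\t", map { $_ eq 'NULL' ? '' : $_ } split /\t/, $line, -1;
}

# Made-up record with a value that merely contains "NULL".
my $line = "NULL\tNULLABLE\tNULL\t2015-02-08";

(my $global = $line) =~ s/NULL//g;          # same effect as sed 's/NULL//g'
my $per_field = blank_null_fields($line);

for my $result ($global, $per_field) {
    (my $shown = $result) =~ s/\t/<TAB>/g;
    print "$shown\n";
}
# <TAB>ABLE<TAB><TAB>2015-02-08
# <TAB>NULLABLE<TAB><TAB>2015-02-08
```

If your data can never contain NULL inside a longer value, the plain global replacement is simpler and does the job.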
I download a CSV file from another server using perl script. After download I wish to check whether the file contains any corrupted data or not. I tried to use Encode::Detect::Detector to detect encoding but it returns 'undef' in both cases:
if the string is ASCII or
if the string is corrupted
So using the below program I can't differentiate between ASCII & Corrupted Data.
use strict;
use Text::CSV;
use Encode::Detect::Detector;
use XML::Simple;
use Encode;
require Encode::Detect;
my @rows;
my $init_file = "new-data-jp-2013-8-8.csv";
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
open my $fh, $init_file or die $init_file.": $!";
while ( my $row = $csv->getline( $fh ) ) {
my @fields = @$row; # get line into array
for (my $i=1; $i<=23; $i++){ # I already know that CSV file has 23 columns
if ((Encode::Detect::Detector::detect($fields[$i-1])) eq undef){
print "the encoding is undef in col".$i.
" where field is ".$fields[$i-1].
" and its length is ".length($fields[$i-1])." \n";
}
else {
my $string = decode("Detect", $fields[$i-1]);
print "this is string print ".$string.
" the encoding is ".Encode::Detect::Detector::detect($fields[$i-1]).
" and its length is ".length($fields[$i-1])."\n";
}
}
}
You have some bad assumptions about encodings, and some errors in your script.
foo() eq undef
does not make any sense. You cannot test for string equality against undef, as undef isn't a string. It does, however, stringify to the empty string. You should use warnings to get warning messages when you do such things. To test whether a value is undef or not, use defined:
unless(defined foo()) { .... }
The Encode::Detect::Detector module uses an object oriented interface. Therefore,
Encode::Detect::Detector::detect($foo)
is wrong. According to the docs, you should be doing
Encode::Detect::Detector->detect($foo)
You probably cannot do decoding on a field-by-field basis. Usually, one document has one encoding. You need to specify the encoding when opening the file handle, e.g.
use autodie;
open my $fh, "<:utf8", $init_file;
While CSV can support some degree of binary data (like encoded text), it isn't well suited for this purpose, and you may want to choose another data format.
Finally, ASCII data effectively does not need any de- or encoding. The undef result for encoding detection does make sense here. It cannot be asserted with certainty that a document was encoded as ASCII (as many encodings are a superset of ASCII), but given a certain document it can be asserted that it isn't valid ASCII (i.e. it has the 8th bit set) and must rather be in a more complex encoding like Latin-1 or UTF-8.
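As a rough sketch of that last point, the following uses only the core Encode module to tell apart bytes that are pure ASCII, bytes that decode cleanly as UTF-8, and everything else. The helper name classify_bytes and its three labels are made up for illustration, and this is a plausibility check rather than encoding detection: "other" only means "not ASCII and not valid UTF-8".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: classify a byte string as 'ascii', 'utf-8' or 'other'.
sub classify_bytes {
    my ($bytes) = @_;
    # No byte with the 8th bit set: nothing to decode.
    return 'ascii' unless $bytes =~ /[^\x00-\x7F]/;
    # decode() with a CHECK argument may modify its input, so work on a copy.
    my $copy = $bytes;
    my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 'utf-8' : 'other';
}

print classify_bytes("plain text"), "\n";      # ascii
print classify_bytes("caf\xC3\xA9"), "\n";     # utf-8
print classify_bytes("caf\xE9"), "\n";         # other
```

In the CSV case you would apply such a check to the raw bytes of the whole file (or each raw line) before parsing, since one document usually has one encoding.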