How do I handle embedded newlines in CSV files in Perl?

I'm reading a .csv file that was created in Excel with the first line being column headings. One column heading contains an embedded newline. I want to ignore that newline, but reading the file line by line like this:
while ( <IN> ) {
...
}
will treat it as a new line, which will break my code (which I haven't written yet). My approach was to read the first line into an array of column headings and process the rest of the lines differently.
Is there maybe a regex I can use somewhere in the while that ignores the newline unless it's the final newline?
Or should I be approaching this differently?

Use one of the Perl modules that handle CSV, such as Text::CSV_XS. Its documentation shows you how to handle embedded newlines. In general, you don't want to spend your time writing another CSV parser; get on with the more important parts of your task!
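Here is a minimal sketch of that approach (the file name data.csv and the tab-separated output are assumptions, not part of the question): binary => 1 is what lets Text::CSV_XS accept embedded newlines inside quoted fields, and getline() returns one complete record even when it spans several physical lines.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 permits embedded newlines (and other binary data) in quoted
# fields; auto_diag => 1 reports parse errors with a useful message
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

open my $in, '<', 'data.csv' or die "Cannot open data.csv: $!";

# the header record comes back as one array ref, embedded newline and all
my $headers = $csv->getline($in);

while (my $row = $csv->getline($in)) {
    print join("\t", @$row), "\n";   # $row is an array ref of field values
}
close $in;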

Related

How to export postgres data containing newlines to CSV without breaking records on several lines

I am trying to export data from PostgreSQL to CSV files, but when there are newlines in the text in the database, the exported records are broken across several lines, which makes the CSV file much harder to read, not to mention that most applications will fail to load it properly.
Here is how I export the data now:
PRESQL="\pset format unaligned
\pset fieldsep \",\"
\pset footer off
\o 'out.csv'
"
cat <(echo $PRESQL) $QUERYFILE | psql …
So far, so good, unless you have newlines in the text fields. Any hack that would allow me to generate a very simple-to-parse CSV file (with one record per line)?
It was a mistake to consider that a CSV can be forced to have one line per row. The RFC (RFC 4180) states clearly that fields containing newlines are to be enclosed in double quotes.
You can try the replace() or regexp_replace() functions.
The answer to the following SO question should give you an idea: How to remove carriage returns and new lines in Postgresql?
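If the export does not have to stay inside psql, another option in the same spirit is to let a real CSV writer do the quoting for you. Below is a rough Perl sketch using DBI with Text::CSV_XS; the database name, credentials, query, and output file name are all placeholders. Fields containing newlines come out double-quoted, which standard CSV readers handle correctly.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Text::CSV_XS;

# placeholder connection details
my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'password',
                       { RaiseError => 1 });
my $csv = Text::CSV_XS->new({ binary => 1, eol => "\n" });

my $sth = $dbh->prepare('SELECT * FROM mytable');   # placeholder query
$sth->execute;

open my $out, '>', 'out.csv' or die "Cannot open out.csv: $!";
while (my $row = $sth->fetchrow_arrayref) {
    # print() double-quotes any field containing commas, quotes, or newlines
    $csv->print($out, $row);
}
close $out;
$dbh->disconnect;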

Deleting empty fields in pipe delimited printout in perl?

I'm going through each line of a file, looking for a few specific things in each line with regexes, and I want to print the results so that each row of the output .csv file contains just those things (thing1|thing2|thing3|thing4|). But because it's going through line by line, I get things like
||||
then
|||thing4|
then
|thing1||thing3||
and I don't know how to delete the empty pipe-delimited fields to shove everything together. Help?
You could filter it afterwards with a regex that collapses runs of pipes:
$out =~ s/\|{2,}/|/g;
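Alternatively, you could avoid printing the empty slots in the first place by filtering the captures before joining them. A small sketch (the $thing variables stand in for whatever your regexes captured):
# keep only the captures that are defined and non-empty
my @fields = grep { defined && length } ($thing1, $thing2, $thing3, $thing4);

# the trailing "|" matches the thing1|thing2|thing3|thing4| format above
print join('|', @fields), "|\n";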

perl multiline find and replace, can't get newline working

I have a directory on a Linux host with several property files which I want to edit by replacing hardcoded values with placeholder tags. My goal was to have a Perl script which reads a delimited file containing an entry for each property file: the hardcoded value, the placeholder value, and the name of the file to edit.
For example, in file.prop I have these values set
<connection targetHostUrl="99.99.99.99"
targetHostPort="9999"
And I want to replace the values with tags as shown below
<connection targetHostUrl="TARGETHOST"
targetHostPort="PORT"
There will be several entries similar to this, so I have to match on the unique combination of IP and port, which means I need a multiline match.
To do this I wrote the following script. It takes the name of the input file, which is delimited with ||, fetches that file from the config directory, and reads in the values to get the hardcoded value, the tag, and the property file to edit. Then it reads in that property file, does the substitution, and writes the file out again.
#!/usr/bin/perl
my $config = $ARGV[0];
chomp $config;
my $filename = '/config/' . $config;
my ($hard,$tagg,$prop);
open(DATAFILE, $filename) or die "Could not open DATAFILE $filename.";
while (<DATAFILE>)
{
    chomp $_;
    ($hard,$tagg,$prop) = split('\|\|', $_);
    $* = 1;
    open(INPUT, "</properties/$prop") or die "Could not open INPUT $prop.";
    @input_array = <INPUT>;
    close(INPUT);
    $input_scalar = join("", @input_array);
    $input_scalar =~ s/$hard/$tagg/;
    open(OUTPUT, ">/properties/$prop") or die "Could not open OUTPUT $prop.";
    print(OUTPUT $input_scalar);
    close(OUTPUT);
}
close DATAFILE;
close DATAFILE;
Inside the config file I have the following entry
<connection targetHostUrl="99.99.999.99"(.|\n)*?targetHostPort="9999"||<connection targetHostUrl="TARGETHOST1"\n targetHostPort="PORT"||file.prop
My output is as shown below. It writes what I hoped would be a newline as a literal \n
<connection targetHostUrl="TARGETHOST"\n targetHostPort="PORT"
I can't find a way to get the \n interpreted as a newline. At first I thought, no problem, I'll just do a second substitution like
perl -i -pe 's/\\n/\n/o' $prop
and although this works, for some reason it puts ^M characters at the end of every line except the one I did the replacement on. I don't want to do a 3rd replace to strip them out.
I've searched and found other ways of doing the multiline search/replace but they all interpret the \n literally.
Any suggestions?
My output is as shown below. It writes what I hoped would be a newline as a literal \n
Why would it insert a newline when the string doesn't contain one?
I can't find a way to get the \n interpreted as a newline.
There isn't any. If you want to substitute a newline, you need to provide a newline.
If you used a proper CSV parser like Text::CSV_XS, you could put a newline in your data file.
Otherwise, you'll have to write some code to handle the escape sequences you want your code to handle.
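For example, a minimal way to support just the \n escape, reusing the $tagg and $input_scalar variables from the script above, is to expand it before doing the substitution:
$tagg =~ s/\\n/\n/g;   # turn the two literal characters \ and n into a real newline
$input_scalar =~ s/$hard/$tagg/;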
for some reason it puts ^M characters at the end of every line except the one I did the replacement on.
Quite the opposite. It removes it from the one line you did the replacement on.
That's how some programs represent a carriage return. You have a file with CRLF line endings. You could use dos2unix to convert it, or you could leave it as is, because XML doesn't care.

Convert a CSV file with embedded commas into a bash array by line efficiently

Normally, I do something like
IFS=','
columns=( $LINE )
where $LINE is a line from a csv file I'm reading.
However, how do I handle a csv file with embedded commas? I have to handle several hundred gigabytes of files, so everything needs to be done quickly, i.e., no multiple readings of a line and definitely no loops (last time I tried that, it slowed things down by several factors).
The general structure of the code is as follows
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Preferably, I need something like
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Any tips would be appreciated. Otherwise, I'll probably switch to using another language to handle this stuff.
Embedded commas are probably just the first obvious problem you encountered while parsing those CSV files.
Future problems that might pop up are:
embedded newline separator characters
embedded utf8 chars
special treatment for whitespaces, empty fields, spaces around commas, undef values
I generally tend to follow the philosophy that if there is a (reputable) module that parses some format you have to parse, you should use it instead of making a homebrew one.
I don't think there is such a thing for bash, but there are some for Perl. I'd go for Text::CSV_XS; since it's written in C, I expect it to be very fast.
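For instance, a short sketch that reads CSV records on standard input and re-emits each one as a single pipe-delimited line (the choice of | as output delimiter is an assumption; pick any character that never occurs in your data):
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# each getline() call consumes one complete CSV record, embedded commas
# and newlines included, no matter how many physical lines it spans
while (my $row = $csv->getline(\*STDIN)) {
    print join('|', @$row), "\n";
}
The bash loop can then read the converted stream with IFS='|' instead of IFS=','.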
You can use sed or something similar to convert the commas within quotes to some other sequence or punctuation. If you don't care about the stuff in quotes then you do not even need to change them back. You can do this on the whole file:
sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g' input.csv > intermediate.csv
or on each line:
line=$(echo "$line" | sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g')
This isn't a complete answer, but it's a possible approach.
Find a character that never occurs in the input file. Use a C program that parses the CSV file and prints lines to standard output with a different delimiter. Writing that program is left as an exercise, but I'm sure there's CSV-parsing C source code out there. Pipe the output of the C program into your script.
For example:
FILENAME=$1
new_c_program $FILENAME | while read LINE
do
IFS="|"
# no quote handling needed here; the C program already replaced the delimiters
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
A minor point: I'd pick a name other than $newline; newline suggests an end-of-line marker rather than an entire line.
Another minor point: you have a "Useless Use Of cat" in the code in your question. You could replace this:
cat $FILENAME | while read LINE
do
...
done
by this:
while read LINE
do
...
done < $FILENAME
But if you replace cat by the hypothetical C program I suggested, you still need the pipe.

Reading a large file in perl, record by record, with a dynamic record separator

I have a script that reads a large file line by line. The record separator ($/) that I would like to use is \n. The only problem is that the data on each line contains CRLF characters (\r\n), which the program should not consider the end of a line.
For example, here is a sample data file (with the newlines and CRLFs written out):
line1contents\n
line2contents\n
line3\r\ncontents\n
line4contents\n
If I set $/ = "\n", then it splits the third line into two lines. Ideally, I could just set $/ to a regex that matches \n and not \r\n, but I don't think that's possible. Another possibility is to read in the whole file, then use the split function to split on said regex. The only problem is that the file is too large to load into memory.
Any suggestions?
For this particular task, it sounds pretty straightforward to check your line ending and append the next line as necessary:
$/ = "\n";
...
while (<$input>) {
    # if the chunk ends in CRLF, the \n we just read is part of the data,
    # not a record end, so append the next physical line and check again
    while ( substr($_, -2) eq "\r\n" ) {
        $_ .= <$input>;
    }
    ...
}
This is the same logic used to support line continuation in a number of different programming contexts.
You are right that you can't set $/ to a regular expression.
dos2unix would put a UNIX newline character in for the "\r\n" and so wouldn't really solve the problem. I would use a regex that replaces all instances of "\r\n" with a space or tab character and save the results to a different file (since you don't want to split the line at those points). Then I would run your script on the new file.
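A one-liner sketch of that idea (in.txt and out.txt are placeholder names): because Perl reads the input in \n-delimited chunks, a \r\n can only ever appear at the end of a chunk, so the file is never loaded into memory all at once.
perl -pe 's/\r\n\z/ /' in.txt > out.txt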
Try using dos2unix on the file first, and then read in as normal.