Deleting empty fields in pipe delimited printout in perl? - perl

I'm going through each line of a file, looking for a few specific things in each line with a regex. I want to print them so that each row of the output .csv file contains just those things (thing1|thing2|thing3|thing4|), but because it goes through the file line by line, I get things like
||||
then
|||thing4|
then
|thing1||thing3||
and I don't know how to delete the empty pipe delimited areas to shove everything together. Help?

You could filter it afterwards with a regex that collapses each run of pipes into a single pipe:
$out =~ s/\|{2,}/|/g;
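If the goal is to drop the empty fields of each row entirely (including a leading pipe, which a plain substitution leaves in place), one option is to split the row, discard the empty pieces, and join what remains. This is only a sketch: `squeeze_pipes` is a hypothetical helper name, and it assumes the rows look like the ones in the question, with a trailing pipe after the last field.

```perl
use strict;
use warnings;

# Hypothetical helper: drop empty fields from one pipe-delimited row.
sub squeeze_pipes {
    my ($row) = @_;
    # split discards trailing empty fields; grep removes the remaining ones.
    my @fields = grep { length } split /\|/, $row;
    # Rebuild the row with a trailing pipe, as in the question's format.
    return @fields ? join('|', @fields) . '|' : '';
}

# Prints an empty line, then "thing4|", then "thing1|thing3|".
print squeeze_pipes($_), "\n" for '||||', '|||thing4|', '|thing1||thing3||';
```

A row that contains nothing but pipes comes back as an empty string, so it can be skipped before printing if blank rows are not wanted.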

Related

A perl Script to Delete trailing spaces in a csv file

I am new to Perl programming.
I have a CSV file with N fields, in which the Nth field has trailing spaces in every record. I want to remove all of these trailing spaces. Please help me with this.
I used this substitution in a loop, but it gave me an empty file:
s/\s+$//
Example File
123,ABCD,"AC,BD",21/12/2013
134,CDEF,"CD,BD,ED",23/11/2013
987,TGYH,"HY,-.FDDS",20/11/2013
Output
123,ABCD,"AC,BD",21/12/2013
134,CDEF,"CD,BD,ED",23/11/2013
987,TGYH,"HY,-.FDDS",20/11/2013
Please let me know if you need more details.
Thanks In Advance.
Your regex seems good. You could say:
perl -ple 's/\s+$//' filename
To save the changes in-place to the file, say:
perl -i -ple 's/\s+$//' filename
Steps to remove the leading and trailing spaces from each row of the CSV file:
Open the input file:
open(FILE, '<', $input_file) or die "Could not open $input_file: $!";
Loop over every row in the file. Inside the while loop, use a regex to trim the leading and trailing whitespace from the row, then print it:
while (my $row = <FILE>)
{
# Trim both ends; this also removes the newline, so add it back when printing.
$row =~ s/^\s+|\s+$//g;
print "$row\n";
}
close(FILE);
This should give you the results you are expecting.

perl multiline find and replace, can't get newline working

I have a directory on a Linux host with several property files which I want to edit by replacing hardcoded values with placeholder tags. My goal was to have a perl script which reads a delimited file that contains entries for each of the property files listing the hardcoded value, the placeholder value and the name of the file to edit.
For example, in file.prop I have these values set
<connection targetHostUrl="99.99.99.99"
targetHostPort="9999"
And I want to replace the values with tags as shown below
<connection targetHostUrl="TARGETHOST"
targetHostPort="PORT"
There will be several entries similar to this so I have to match on the unique combination of IP and PORT so I need a multiline match.
To do this I wrote the following script to take the input of the delimited filename, which is delimited with ||. I go get that file from the config directory and read in the values to get the hardcoded value, tag, and filename to edit. Then I read in that property file, do the substitution and then write it out again.
#!/usr/bin/perl
my $config = $ARGV[0];
chomp $config;
my $filename = '/config/' . $config;
my ($hard,$tagg,$prop);
open(DATAFILE, $filename) or die "Could not open DATAFILE $filename.";
while(<DATAFILE>)
{
chomp $_;
($hard,$tagg,$prop) = split('\|\|', $_);
$*=1;
open(INPUT,"</properties/$prop") or die "Could not open INPUT $prop.";
@input_array=<INPUT>;
close(INPUT);
$input_scalar=join("",@input_array);
$input_scalar =~ s/$hard/$tagg/;
open(OUTPUT,">/properties/$prop") or die "Could not open OUTPUT $prop.";
print(OUTPUT $input_scalar);
close(OUTPUT);
}
close DATAFILE;
Inside the config file I have the following entry
<connection targetHostUrl="99.99.999.99"(.|\n)*?targetHostPort="9999"||<connection targetHostUrl="TARGETHOST1"\n targetHostPort="PORT"||file.prop
My output is as shown below. It puts what I hoped to be a newline as a literal \n
<connection targetHostUrl="TARGETHOST"\n targetHostPort="PORT"
I can't find a way to get the \n taken as a newline. At first I thought, no problem, I'll just do a 2nd substitution like
perl -i -pe 's/\\n/\n/o' $prop
and although this works, for some reason it puts ^M characters at the end of every line except the one I did the replacement on. I don't want to do a 3rd replace to strip them out.
I've searched and found other ways of doing the multiline search/replace but they all interpret the \n literally.
Any suggestions?
My output is as shown below. It puts what I hoped to be a newline as a literal \n
Why would it insert a newline when the string doesn't contain one?
I can't find a way to get the \n taken as a newline.
There isn't any. If you want to substitute a newline, you need to provide a newline.
If you used a proper CSV parser like Text::CSV_XS, you could put a newline in your data file.
Otherwise, you'll have to write some code to handle the escape sequences you want your code to handle.
for some reason it puts ^M characters at the end of every line except the one I did the replacement on.
Quite the opposite. It removes it from the one line you did the replacement on.
That's how some programs represent a Carriage Return. Your file has CR LF line endings. You could use dos2unix to convert it, or you could leave it as is, because XML doesn't care.
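As an illustration of the "write some code to handle the escape sequences" option, here is a minimal sketch. `unescape` is a hypothetical helper, and the set of sequences it understands (`\n`, `\t` and `\\`) is an assumption; anything else is left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the two-character sequences \n, \t and \\
# in a string read from the config file into the characters they name.
sub unescape {
    my ($text) = @_;
    my %map = (n => "\n", t => "\t", '\\' => '\\');
    $text =~ s/\\([nt\\])/$map{$1}/g;
    return $text;
}

# The replacement text as it would arrive from the delimited config file.
my $tagg = unescape('<connection targetHostUrl="TARGETHOST"\n targetHostPort="PORT"');
print $tagg, "\n";
```

In the script from the question, this would presumably be applied to `$tagg` right after the `split`, before the substitution runs.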

Convert a CSV file with embedded commas into a bash array by line efficiently

Normally, I do something like
IFS=','
columns=( $LINE )
where $LINE is a line from a csv file I'm reading.
However, how do I handle a csv file with embedded commas? I have to handle several hundred gigs of file so everything needs to be done quickly, i.e., no multiple readings of a line, definitely no loops (last time I tried that slowed it down several factors).
The general structure of the code is as follows
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Preferably, I need something that goes
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Any tips would be appreciated. Otherwise, I'll probably switch to using another language to handle this stuff.
Embedded commas are probably just the first obvious problem you have run into while parsing those CSV files.
Other problems that might pop up later:
embedded newline separator characters
embedded UTF-8 characters
special treatment for whitespace, empty fields, spaces around commas, undef values
I generally follow the philosophy that if there is a (reputable) module that parses the format you have to parse, you should use it instead of writing a homebrew parser.
I don't think there is such a thing for bash, but there are some for Perl. I'd go for Text::CSV_XS. Since it is written in C, I expect it to be very fast.
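A rough sketch of what the loop from the question might look like with Text::CSV_XS, assuming the module is installed from CPAN. `rewrite_line` is a hypothetical helper, and lowercasing the second column is only a stand-in for the "affect columns changes here" step.

```perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag;

# Hypothetical helper: parse one CSV line, change a column, re-emit it.
sub rewrite_line {
    my ($line) = @_;
    $csv->parse($line) or die "Bad CSV line: $line";
    my @columns = $csv->fields;           # quoted commas stay inside their field
    $columns[1] = lc $columns[1];         # placeholder for the real column changes
    $csv->combine(@columns) or die "Cannot combine fields";
    return $csv->string;                  # re-quoted where needed
}

print rewrite_line('134,CDEF,"CD,BD,ED",23/11/2013'), "\n";
```

Parsing with `parse` works on one line at a time, so it fits a line-by-line loop, but it will not cope with fields that contain embedded newlines; for those, reading records with `getline` on the file handle is the usual approach.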
You can use sed or something similar to convert the commas within quotes to some other sequence or punctuation. If you don't care about the stuff in quotes, you don't even need to change them back. Note that the pattern below replaces only the first comma inside each quoted field and can misfire when two quoted fields are adjacent, so treat it as a rough approach. You can do this on the whole file:
sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g' input.csv > intermediate.csv
or on each line:
line=$(echo "$line" | sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g')
This isn't a complete answer, but it's a possible approach.
Find a character that never occurs in the input file. Use a C program that parses the CSV file and prints lines to standard output with a different delimiter. Writing that program is left as an exercise, but I'm sure there's CSV-parsing C source code out there. Pipe the output of the C program into your script.
For example:
FILENAME=$1
new_c_program $FILENAME | while read LINE
do
IFS="|"
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
A minor point: I'd pick a name other than $newline; newline suggests an end-of-line marker rather than an entire line.
Another minor point: you have a "Useless Use Of cat" in the code in your question. You could replace this:
cat $FILENAME | while read LINE
do
...
done
by this:
while read LINE
do
...
done < $FILENAME
But if you replace cat by the hypothetical C program I suggested, you still need the pipe.

perl split string on a single char, not on repeated char

I am using an existing perl script to process a text file output from a database query which I have no control over.
The data contains fields separated by '|', but some fields contain '||'. There are no empty fields. There may be spaces on either side of the field separator which I would also like to remove.
I cannot find a simple way to achieve this, apart from changing the '||' to something else and putting it back after the split, which seems a bit heavy going.
The file is substantial (typically up to about 100M).
Using split(/ *\| */, $line) works apart from the '||' character.
Any thoughts, please?
split /\s*(?<!\|)\|(?!\|)\s*/
You can use negative look-behind and look-ahead to ensure there are no | symbols around the | you're splitting on:
split / \s* (?<!\|) \| (?!\|) \s* /x
Look at using Text::CSV or Tie::Handle::CSV to run through the file. If the text file has been done properly fields that contain || will be quoted.
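To see the lookaround-based split in action, here is a small self-contained sketch. `split_fields` is a hypothetical wrapper and the sample line is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split on a lone '|' (optionally surrounded by
# whitespace) while leaving '||' inside a field untouched.
sub split_fields {
    my ($line) = @_;
    return split /\s*(?<!\|)\|(?!\|)\s*/, $line;
}

my @fields = split_fields('alpha | beta||gamma |delta');
print join(',', @fields), "\n";    # prints "alpha,beta||gamma,delta"
```

Whitespace at the very start or end of the line is not next to a separator, so it would still need to be trimmed separately.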

How do I handle embedded newlines in CSV files in Perl?

I'm reading a .csv file that was created in Excel with the first line being column headings. One column heading contains an embedded newline. I want to ignore that newline but reading it line-by-line like:
while ( <IN> ) {
...
}
will treat it as a new line which will break my code (which I haven't written yet). My approach was to read the first line into an array of column headings and process the rest of the lines differently.
Is there maybe a regex I can use somewhere in the while that ignores the newline unless it's the last new line?
Or should I be approaching this differently?
Use one of the Perl modules that handle CSV, such as Text::CSV_XS. Its documentation shows you how to handle embedded newlines. In general, you don't want to spend your time writing another CSV parser; get on with the more important parts of your task!
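A minimal sketch of that approach, assuming Text::CSV_XS is installed from CPAN. The sample data is made up and read from an in-memory string so the example is self-contained; with a real file you would open the file name instead.

```perl
use strict;
use warnings;
use Text::CSV_XS;

# Made-up sample standing in for the Excel export: the second heading
# contains an embedded newline inside quotes.
my $data = qq{id,"unit\nprice",name\n1,9.99,widget\n};

# binary => 1 allows quoted fields to contain newlines.
my $csv = Text::CSV_XS->new({ binary => 1 })
    or die "Cannot create parser: " . Text::CSV_XS->error_diag;

open my $in, '<', \$data or die "Cannot open in-memory data: $!";

# getline reads one CSV record, which may span several physical lines.
my $headings = $csv->getline($in);
s/\n/ / for @$headings;    # replace the embedded newline with a space

my @rows;
while (my $row = $csv->getline($in)) {
    push @rows, $row;
}
close $in;

print join('|', @$headings), "\n";
print join('|', @$_), "\n" for @rows;
```

The key difference from `while ( <IN> )` is that the module, not Perl's line reader, decides where a record ends, so a newline inside quotes never splits the heading row.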