Convert a CSV file with embedded commas into a bash array by line efficiently - perl

Normally, I do something like
IFS=','
columns=( $LINE )
where $LINE is a line from a csv file I'm reading.
However, how do I handle a csv file with embedded commas? I have to handle several hundred gigs of files, so everything needs to be done quickly, i.e., no multiple readings of a line and definitely no loops (last time I tried that, it slowed things down by several factors).
The general structure of the code is as follows
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Preferably, I need something that goes
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Any tips would be appreciated. Otherwise, I'll probably switch to using another language to handle this stuff.

Embedded commas are probably just the first obvious problem you've hit while parsing those CSV files.
Future problems that might pop up are:
embedded newline separator characters
embedded utf8 chars
special treatment for whitespace, empty fields, spaces around commas, undef values
I generally follow the philosophy that if there is a (reputable) module that parses a format you have to parse, use it instead of rolling your own.
I don't think there is such a thing for bash, but there are some for Perl. I'd go for Text::CSV_XS. Being written in C, I expect it to be very fast.
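For example, a minimal Text::CSV_XS version of the read-modify-write loop from the question might look like this (a sketch, not a drop-in replacement; the exact constructor options depend on your CSV dialect):
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

# binary => 1 allows embedded newlines and arbitrary bytes in fields;
# eol => "\n" makes $csv->print terminate each output record.
my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1, eol => "\n" });

open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
while (my $row = $csv->getline($fh)) {
    my @columns = @$row;
    # ... modify @columns here, as in the bash loop above ...
    $csv->print(\*STDOUT, \@columns);   # re-quotes fields that need it
}
close $fh;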

You can use sed or something similar to convert the commas within quotes to some other sequence or punctuation. If you don't care about the stuff in quotes then you do not even need to change them back. You can do this on the whole file:
sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g' input.csv > intermediate.csv
or on each line:
line=$(echo $line | sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g')
Be aware that this is a heuristic, not a CSV parser: it rewrites only the first embedded comma in each quoted field, and it can misfire when several quoted fields share a line.

This isn't a complete answer, but it's a possible approach.
Find a character that never occurs in the input file. Use a C program that parses the CSV file and prints lines to standard output with a different delimiter. Writing that program is left as an exercise, but I'm sure there's CSV-parsing C source code out there. Pipe the output of the C program into your script.
For example:
FILENAME=$1
new_c_program $FILENAME | while read LINE
do
IFS="|"
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
A minor point: I'd pick a name other than $newline; newline suggests an end-of-line marker rather than an entire line.
Another minor point: you have a "Useless Use Of cat" in the code in your question. You could replace this:
cat $FILENAME | while read LINE
do
...
done
by this:
while read LINE
do
...
done < $FILENAME
But if you replace cat by the hypothetical C program I suggested, you still need the pipe.

Related

Using sed, prepend line only once, if there's a match later in file content

I'd like to add a line on top of my output if my input file has a specific word.
However, if I'm just looking for a specific string then, as I understand it, it's too late: the first line has already been output and I can't prepend to it anymore.
Here's an example of input.
one
two
two
three
If I can find a line containing, say, the word two, I'd like to add a new line before the first line, for example FOUND. I want that line prepended only once, even if there are several matches.
So an input file without any two would remain unchanged, and the example file above would become:
FOUND
one
two
two
three
I know how to prepend with i\, but can't get the context right. From what I understood that would be around:
1{
/two/{ # This searches for "two" in the first line only; how do I look for it in the whole file?
1i\
FOUND
}
}
EDIT:
I know how to do it using other languages/methods, that's not my question.
Sed has advanced features to work on several lines at once, append/prepend lines and is not limited to substitution. I have a sed file already filled with expressions to modify a python source file, which is why I'd prefer to avoid using something else. I want to be able to add an import at the beginning of a file if a certain class is used.
A Perl solution:
perl -i.bak -0777 -pE 'say "FOUND" if /two/;' in_file
The Perl one-liner uses these command line flags:
-p : Loop over the input one line at a time, assigning it to $_ by default. Add print $_ after each loop iteration.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
-E : Tells Perl to look for code in-line, instead of in a file. Also enables all optional features. Here, enables say.
-0777 : Slurp files whole.
Because the whole file is in $_, say "FOUND" runs at most once, and -p prints the slurped contents after it, so FOUND ends up on the first line.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
sed is for doing s/old/new on individual strings; that's not what you're trying to do, so don't bother trying to use sed here. There are lots of ways to do this; this one is efficient, robust, and portable to all Unix systems:
$ grep -Fq 'two' file && echo "FOUND"; cat file
FOUND
one
two
two
three
To operate on a stream instead of (or in addition to) a file, without reading the whole input into memory (lines are only buffered up to the first match), and falling back to printing the buffer unchanged if there is no match at all:
awk 'f{print; next} {buf[NR]=$0} /two/{print "FOUND"; for (i=1;i<=NR;i++) print buf[i]; f=1} END{if (!f) for (i=1;i<=NR;i++) print buf[i]}'
e.g.:
$ cat file | awk 'f{print; next} {buf[NR]=$0} /two/{print "FOUND"; for (i=1;i<=NR;i++) print buf[i]; f=1} END{if (!f) for (i=1;i<=NR;i++) print buf[i]}'
FOUND
one
two
two
three
That awk script will also work using any awk in any shell on every Unix box.
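If you prefer Perl, a sketch of the same buffer-until-first-match idea as a one-liner:
perl -ne 'if ($f) { print; next }       # after the match, stream lines through
          push @buf, $_;                # buffer lines seen before any match
          if (/two/) { print "FOUND\n", @buf; $f = 1 }
          END { print @buf unless $f }  # no match at all: emit the buffer unchanged' file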

Removing text with command line?

I have a huge list of locations in this form in a text file:
ar,casa de piedra,Casa de Piedra,20,,-49.985133,-68.914673
gr,riziani,Ríziani,18,,39.5286111,20.35
mx,tenextepec,Tenextepec,30,,19.466667,-97.266667
Is there any way with command line to remove everything that isn't between the first and second commas? For example, I want my list to look like this:
casa de piedra
riziani
tenextepec
With Perl:
perl -F/,/ -ane 'print $F[1]."\n"' file
Use cut(1):
cut -d, -f2 inputfile
With perl:
perl -pe 's/^.*?,(.*?),.*/$1/' filename
Breakdown of the above code
perl - the command to use the perl programming language.
-pe - flags.
e means "run this as perl code".
p means:
Set $_ to the first line of the file (given by filename)
Run the -e code
Print $_
Repeat from step 1 with the next line of the file
what -p actually does behind the scenes is described in perldoc perlrun.
s/^.*?,(.*?),.*/$1/ is a regular expression:
s/pattern/replacement/ looks for pattern in $_ and replaces it with replacement
^ anchors the match at the start of the line
.*? means "as little of anything as possible" (non-greedy matching; it's more complicated than that, but that's outside the scope of this answer)
, is a comma (nothing special)
() capture whatever is in them and save it in $1
.* is another (slightly different) "anything" (this time it's more like "everything")
$1 is what we captured with ()
so the whole thing basically says to search in $_ for:
anything
a comma
anything (save this bit)
another comma
everything
and replace it with the bit it saved. This effectively keeps the stuff between the first and second commas, deletes the rest of the line, and leaves what was saved in $_.
filename is the name of your text file
To review, the code goes through your file line by line, applies the regular expression to extract your needed bit, and then prints it out.
If you want the result in a file, use this:
perl -pe 's/^.*?,(.*?),.*/$1/' filename > out.txt
and the result goes into a file named out.txt (created in the shell's current working directory). The > redirection tells the shell to write the command's output to a file instead of the screen.
Also, if it isn't crucial to use the command line, you can just import into Excel (it's in CSV format) and work with it graphically.
With awk:
$ awk -F ',' '{ print $2 }' file
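One caveat that applies to all of the above: they split on every comma, so a field containing a quoted comma would break them. If that can happen in your data (as in the first question above), a real CSV parser is safer; as a sketch:
perl -MText::CSV_XS -nE 'BEGIN { $csv = Text::CSV_XS->new({ binary => 1 }) }
                         chomp; $csv->parse($_) and say +($csv->fields)[1]' file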

perl multiline find and replace, can't get newline working

I have a directory on a Linux host with several property files which I want to edit by replacing hardcoded values with placeholder tags. My goal was to have a perl script which reads a delimited file that contains entries for each of the property files listing the hardcoded value, the placeholder value and the name of the file to edit.
For example, in file.prop I have these values set
<connection targetHostUrl="99.99.99.99"
targetHostPort="9999"
And I want to replace the values with tags as shown below
<connection targetHostUrl="TARGETHOST"
targetHostPort="PORT"
There will be several entries similar to this, so I have to match on the unique combination of IP and PORT, which means I need a multiline match.
To do this I wrote the following script, which takes the name of a ||-delimited config file as input. I read that file from the config directory to get the hardcoded value, the tag, and the name of the property file to edit. Then I read in that property file, do the substitution, and write it back out.
#!/usr/bin/perl
my $config = $ARGV[0];
chomp $config;
my $filename = '/config/' . $config;
my ($hard,$tagg,$prop);
open(DATAFILE, $filename) or die "Could not open DATAFILE $filename.";
while(<DATAFILE>)
{
chomp $_;
($hard,$tagg,$prop) = split('\|\|', $_);
$*=1;
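# NOTE: $* (the old multi-line match switch) is deprecated and was removed
# in modern perls; the /m and /s regex modifiers replace it.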
open(INPUT,"</properties/$prop") or die "Could not open INPUT $prop.";
@input_array=<INPUT>;
close(INPUT);
$input_scalar=join("",#input_array);
$input_scalar =~ s/$hard/$tagg/;
open(OUTPUT,">/properties/$prop") or die "Could not open OUTPUT $prop.";
print(OUTPUT $input_scalar);
close(OUTPUT);
}
close DATAFILE;
Inside the config file I have the following entry
<connection targetHostUrl="99.99.999.99"(.|\n)*?targetHostPort="9999"||<connection targetHostUrl="TARGETHOST1"\n targetHostPort="PORT"||file.prop
My output is as shown below. It puts what I hoped to be a newline as a literal \n
<connection targetHostUrl="TARGETHOST"\n targetHostPort="PORT"
I can't find a way to get the \n taken as a newline. At first I thought, no problem, I'll just do a 2nd substitution like
perl -i -pe 's/\\n/\n/o' $prop
and although this works, for some reason it puts ^M characters at the end of every line except the one I did the replacement on. I don't want to do a 3rd replace to strip them out.
I've searched and found other ways of doing the multiline search/replace but they all interpret the \n literally.
Any suggestions?
My output is as shown below. It puts what I hoped to be a newline as a literal \n
Why would it insert a newline when the string doesn't contain one?
I can't find a way to get the \n taken as a newline.
There isn't any. If you want to substitute a newline, you need to provide a newline.
If you used a proper CSV parser like Text::CSV_XS, you could put a newline in your data file.
Otherwise, you'll have to write some code to handle the escape sequences you want your code to handle.
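For example, a minimal way to honor just the \n escape from the config entry, right after the split (a sketch that handles only this one sequence):
$tagg =~ s/\\n/\n/g;    # decode the literal two characters \n into a real newline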
for some reason it puts ^M characters at the end of every line except the one I did the replacement on.
Quite the opposite. It removes it from the one line you did the replacement on.
That's how some programs represent a Carriage Return. You have a file with CR LF line endings. You could use dos2unix to convert it, or you could leave it as is, because XML doesn't care.

sed recipe: how to do stuff between two patterns that can be either on one line or on two lines?

Let's say we want to do some substitutions only between some patterns, let them be <a> and </a> for clarity... (all right, all right, they're start and end!.. Jeez!)
So I know what to do if start and end always occur on the same line: just design a proper regex.
I also know what to do if they're guaranteed to be on different lines and I don't care about anything in the line containing end and I'm also OK with applying all the commands in the line containing start before start: just specify the address range as /start/,/end/.
This, however, doesn't sound very useful. What if I need to do a smarter job, for instance, introduce changes inside a {...} block?
One thing I can think of is breaking the input on { and } before processing and putting it back together afterwards:
sed 's/\([{}]\)/\n\1\n/g' input | sed 'main stuff' | sed ':a;$!{N;ba};s/\n\([{}]\)\n/\1/g'
Another option is the opposite:
cat input | tr '\n' '#' | sed 'whatever; s/#/\n/g'
Both of these are ugly, mainly because the operations are not confined within a single command. The second one is even worse because one has to use some character or substring as a "newline holder" assuming it isn't present in the original text.
So the question is: are there better ways or can the above-mentioned ones be optimized? This is quite a regular task from what I read in recent SO questions, so I'd like to choose the best practice once and for all.
P.S. I'm mostly interested in pure sed solutions: can the job be done with one invocation of sed and nothing else? Please no awk, Perl, etc.: this is more of a theoretical question, not a "need the job done asap" one.
This might work for you:
# create multiline test data
cat <<\! >/tmp/a
> this
> this { this needs
> changing to
> that } that
> that
> !
sed '/{/!b;:a;/}/!{$q;N;ba};h;s/[^{]*{//;s/}.*//;s/this\|that/\U&/g;x;G;s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/' /tmp/a
this
this { THIS needs
changing to
THAT } that
that
# convert multiline test data to a single line
tr '\n' ' ' </tmp/a >/tmp/b
sed '/{/!b;:a;/}/!{$q;N;ba};h;s/[^{]*{//;s/}.*//;s/this\|that/\U&/g;x;G;s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/' /tmp/b
this this { THIS needs changing to THAT } that that
Explanation:
Read the data into the pattern space (PS). /{/!b;:a;/}/!{$q;N;ba}
Copy the data into the hold space (HS). h
Strip non-data from front and back of string. s/[^{]*{//;s/}.*//
Convert data e.g. s/this\|that/\U&/g
Swap to HS and append converted data. x;G
Replace old data with converted data. s/{[^}]*}\([^\n]*\)\n\(.*\)/{\2}\1/
EDIT:
A more complicated answer which I think caters for more than one block per line.
# slurp file into pattern space (PS)
:a
$! {
N
ba
}
# check for presence of \v if so quit with exit value 1
/\v/q1
# replace original newlines with \v's
y/\n/\v/
# append a newline to PS as a delimiter
G
# copy PS to hold space (HS)
h
# starting from right to left delete everything but blocks
:b
s/\(.*\)\({.*}\).*\n/\1\n\2/
tb
# delete any non-block details from the start of the file
s/.*\n//
# PS contains only block details
# do any block processing here e.g. uppercase this and that
s/th\(is\|at\)/\U&/g
# append PS to HS
H
# swap to HS
x
# replace each original block with its processed one from right to left
:c
s/\(.*\){.*}\(.*\)\n\n\(.*\)\({.*}\)/\1\n\n\4\2\3/
tc
# delete newlines
s/\n//g
# restore original newlines
y/\v/\n/
# done!
N.B. This uses GNU specific options but could be tweaked to work with generic sed's.

Reading a large file in perl, record by record, with a dynamic record separator

I have a script that reads a large file line by line. The record separator ($/) I would like to use is \n. The only problem is that the data on each line contains CRLF characters (\r\n), which the program should not consider the end of a line.
For example, here is a sample data file (with the newlines and CRLFs written out):
line1contents\n
line2contents\n
line3\r\ncontents\n
line4contents\n
If I set $/ = "\n", then it splits the third line into two lines. Ideally, I could just set $/ to a regex that matches \n and not \r\n, but I don't think that's possible. Another possibility is to read in the whole file, then use the split function to split on said regex. The only problem is that the file is too large to load into memory.
Any suggestions?
For this particular task, it sounds pretty straightforward to check your line ending and append the next line as necessary:
$/ = "\n";
...
while(<$input>) {
while( substr($_,-2) eq "\r\n" ) {
$_ .= <$input>;
}
...
}
This is the same logic used to support line continuation in a number of different programming contexts.
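Filled out into a self-contained sketch (with a guard against hitting end-of-file in the middle of a record):
#!/usr/bin/perl
use strict;
use warnings;

open my $input, '<', $ARGV[0] or die "Can't open $ARGV[0]: $!";
while (my $line = <$input>) {
    # a trailing CRLF means this physical line continues the record
    while (substr($line, -2) eq "\r\n") {
        my $next = <$input>;
        last unless defined $next;   # EOF mid-record: stop appending
        $line .= $next;
    }
    print $line;                     # one full logical record per iteration
}
close $input;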
You are right that you can't set $/ to a regular expression.
dos2unix would put a UNIX newline character in for the "\r\n" and so wouldn't really solve the problem. I would use a regex that replaces all instances of "\r\n" with a space or tab character and save the results to a different file (since you don't want to split the line at those points). Then I would run your script on the new file.
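That pre-processing pass can itself be a Perl one-liner; as a sketch (file names here are placeholders):
perl -pe 's/\r\n$/ /' original.txt > preprocessed.txt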
Try using dos2unix on the file first, and then read in as normal.