Perl split string on a single char, not on a repeated char

I am using an existing perl script to process a text file output from a database query which I have no control over.
The data contains fields separated by '|', but some fields contain '||'. There are no empty fields. There may be spaces on either side of the field separator which I would also like to remove.
I cannot find a simple way to achieve this, apart from changing the '||' to something else and putting it back after the split, which seems a bit heavy going.
The file is substantial (typically up to about 100M).
Using split(/ *\| */, $line) works except where the separator is '||'.
Any thoughts, please?

split /\s*(?<!\|)\|(?!\|)\s*/

You can use negative look-behind and look-ahead to ensure there are no | symbols around the | you're splitting on:
split / \s* (?<!\|) \| (?!\|) \s* /x
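A minimal sketch of that split in action (the sample line is made up):
my $line = 'foo | bar||baz | qux';
my @fields = split /\s*(?<!\|)\|(?!\|)\s*/, $line;
print join(',', @fields), "\n";   # foo,bar||baz,qux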

Look at using Text::CSV or Tie::Handle::CSV to run through the file. If the text file has been generated properly, fields that contain '||' will be quoted.
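A rough sketch of that route, assuming the export really does quote fields containing '||' (the file name is made up; sep_char, allow_whitespace and binary are standard Text::CSV options):
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => '|', allow_whitespace => 1, binary => 1 });
open my $fh, '<', 'export.txt' or die "open: $!";
while (my $row = $csv->getline($fh)) {
    my @fields = @$row;   # separators inside quoted fields are left alone
    # ... process @fields ...
}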

Related

rename multiple files: possible unintended interpolation

I'm using brew rename to rename multiple files...
file-24.png => file.png
file-48.png => file#2x.png
file-72.png => file#3x.png
the first one succeeds with
rename 's/-24//g' *
the second and third...
rename 's/-48/#2x/g' *
and getting Possible unintended interpolation of #2 in string at (eval 2) line 1...
escaping doesn't work:
rename 's/-48/\#2x/g' *
Other possible ways to rename multiple files like this are also welcome.
I don't know what "brew rename" is, but if it uses a normal regex substitution:
's/pattern/q(#replacement)/e'
This uses the /e modifier to evaluate the replacement side as code, where the q() operator (single quotes) is used to insert literal characters.
Another way is to use \x40 for # character
's/pattern/\x40replacement/'
or just escape it, use \# in the replacement.
This is suitable when there's just one character to deal with, like here. If there's more than that, then it's easier to single-quote the whole thing with q() (for which we need the /e flag).
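For example, the same trick in a plain one-off Perl snippet (the file name is made up), just to show that the replacement side is evaluated as code and q(#2x) yields the literal string:
my $name = 'file-48.png';
(my $new = $name) =~ s/-48/q(#2x)/e;
print "$new\n";   # file#2x.png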
Can't help but ask -- are you certain that you want to have # in a file name? That character gets interpreted in various ways by many tools. For instance, sticking that file name in a variable in a Perl script leads to no end of trouble. Why not simply file_at_2x.png?
This may be more of a curiosity, but if you have a lot of files you can rename them all with
's{ \-([0-9]+) }{ ($r = $1/24) > 1 && qq(_at_${r}x) || q() }ex'
This captures the number ([0-9]+) into $1. Then, it finds the ratio ($r = $1/24) and if that is >1 then (&& short-circuits) it replaces -number with _at_${r}x, otherwise (||) removes it by putting an empty string, q().
I use {}{} delimiters so that I may use / inside, and the /x modifier allows spaces inside, for readability.
Please test this carefully with (a copy of) your actual files, as always.
I know this question is old and maybe the version of rename that apt-get installs is slightly different or improved. However, escaping with a single backslash seems to work:
$ rename -n -v 's/-48/\#2x/g' *
rename(foo-48.txt, foo#2x.txt)

Why does split not return anything?

I have been trying to get this Perl split working for more than two hours. I don't see an error. Maybe some other eyes can look at it and see the issue. I am sure it's a silly one:
@versionsplit=split('.',"15.0.3");
print $versionsplit[0];
print $versionsplit[1];
print $versionsplit[2];
I just get an empty array. Any idea why?
You need:
@versionsplit=split(/\./,"15.0.3");
The first argument to split is a regular expression, not a string. And . is the regex symbol which means ‘match any character’. So all the characters in your input string were being treated as separators, and split wasn't finding anything between them to return.
the "." represents any character.You need to escape it for split function to recognise as a field separator.
change your line to
#versionsplit=split('\.',"15.0.3");
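A small side-by-side sketch of the two calls (values as in the question), which also shows why the broken one comes back empty: every character is treated as a separator, all the resulting fields are empty, and split drops trailing empty fields:
my @broken = split('.',  '15.0.3');   # ()
my @fixed  = split(/\./, '15.0.3');   # ('15', '0', '3')
print "@fixed\n";                      # 15 0 3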

Convert a CSV file with embedded commas into a bash array by line efficiently

Normally, I do something like
IFS=','
columns=( $LINE )
where $LINE is a line from a csv file I'm reading.
However, how do I handle a csv file with embedded commas? I have to handle several hundred gigs of files, so everything needs to be done quickly, i.e., no multiple readings of a line and definitely no loops (last time I tried that, it slowed things down by several factors).
The general structure of the code is as follows
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Preferably, I need something that goes
FILENAME=$1
cat $FILENAME | while read LINE
do
IFS=","
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
Any tips would be appreciated. Otherwise, I'll probably switch to using another language to handle this stuff.
Embedded commas are probably just the first obvious problem that you will encounter while parsing those CSV files.
Future problems that might pop up are:
embedded newline separator characters
embedded utf8 chars
special treatment for whitespaces, empty fields, spaces around commas, undef values
I generally tend to follow the philosophy that if there is a (reputable) module that parses some format you have to parse, use it instead of making a homebrew.
I don't think there is such a thing for bash, but there are some for Perl. I'd go for Text::CSV_XS. Being written in C, I expect it to be very fast.
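A rough sketch of how that might look, assuming a plain comma-separated file with quoted fields (binary => 1 allows embedded newlines and non-ASCII bytes; auto_diag => 1 reports parse errors automatically):
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });
open my $fh, '<', $ARGV[0] or die "open: $!";
while (my $row = $csv->getline($fh)) {
    # @$row holds the columns, embedded commas and quotes already handled
    print join('|', @$row), "\n";
}
close $fh;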
You can use sed or something similar to convert the commas within quotes to some other sequence or punctuation. If you don't care about the stuff in quotes then you do not even need to change them back. You can do this on the whole file:
sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g' input.csv > intermediate.csv
or on each line:
line=$(echo $line | sed 's/\("[^,"]*\),\([^"]*"\)/\1;\2/g')
This isn't a complete answer, but it's a possible approach.
Find a character that never occurs in the input file. Use a C program that parses the CSV file and prints lines to standard output with a different delimiter. Writing that program is left as an exercise, but I'm sure there's CSV-parsing C source code out there. Pipe the output of the C program into your script.
For example:
FILENAME=$1
new_c_program $FILENAME | while read LINE
do
IFS="|"
# code to tell bash to ignore if IFS is within an open quote
columns=( $LINE )
# affect columns changes here
newline="${columns[*]}"
echo "$newline"
done
A minor point: I'd pick a name other than $newline; newline suggests an end-of-line marker rather than an entire line.
Another minor point: you have a "Useless Use Of cat" in the code in your question. You could replace this:
cat $FILENAME | while read LINE
do
...
done
by this:
while read LINE
do
...
done < $FILENAME
But if you replace cat by the hypothetical C program I suggested, you still need the pipe.

Why aren't my nested lookarounds working correctly in my Perl substitution?

I have a Perl substitution which converts hyperlinks to lowercase:
's/(?<=<a href=")([^"]+)(?=")/\L$1/g'
I want the substitution to ignore any links whose href begins with a hash: for example, it should change the path in an ordinary link ("Foo Bar") to lowercase, but skip an in-page anchor link ("Bar") whose href starts with #.
Nesting lookaheads to instruct it to skip these links isn't working correctly for me. This is the one-liner I've written:
perl -pi -e 's/(?<=<a href=" (?! (?<=<a href="#) ) )([^"]+)(?=")/\L$1/g' *;
Could anyone hint to me where I have gone wrong with this substitution? It executes just fine, but does not do anything.
As near as I can tell, your initial regex will work just fine, if you add the condition that the first character in the link may not be a hash # or a double quote, e.g. [^#"]
s/(?<=<a href=")([^#"][^"]+)(?=")/\L$1/gi;
In the case where you have links that contain an anchor fragment but do not start with a hash, it becomes slightly more complicated:
s{(?<=<a href=")([^#"]+)(#[^"]+)*(?=")}{ lc($1) . ($2 // "") }gei;
We now have to evaluate the substitution, since otherwise we get undefined variable warnings when the optional anchor reference is not present.
You don't need look-arounds, from what I see
use 5.010;
...
s/<a \s+ href \s* = \s* "\K([^#"][^"]*)"/\L$1"/gx;
\K means "keep" everything before it. It amounts to a variable-length look-behind.
From perlre:
For various reasons \K may be significantly more efficient than the equivalent (?<=...) construct, and it is especially useful in situations where you want to efficiently remove something following something else in a string.
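A quick sanity check of the \K version on a made-up line of HTML:
my $html = '<a href="/Docs/INDEX.html">Docs</a> <a href="#TOP">Top</a>';
$html =~ s/<a \s+ href \s* = \s* "\K([^#"][^"]*)"/\L$1"/gx;
print "$html\n";
# <a href="/docs/index.html">Docs</a> <a href="#TOP">Top</a>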

Why does my Perl tr/// remove newlines?

I'm trying to clean up form input using the following Perl transliteration:
sub ValidateInput {
my $input = shift;
$input =~ tr/a-zA-Z0-9_#.:;',#$%&()\/\\{}[]?! -//cd;
return $input;
}
The problem is that this transliteration is removing embedded newline characters that users may enter into a textarea field which I want to keep as part of the string. Any ideas on how I can update this to stop it from removing embedded newline characters? Thanks in advance for your help!
I'm not sure what you are doing, but I suspect you are trying to keep all the characters between the space and the tilde in the ASCII table, along with some of the whitespace characters. I think most of your list condenses to a single range \x20-\x7e:
$string =~ tr/\x0a\x0d\x20-\x7e//cd;
If you want to knock out a character like " (although I suspect you really want it, since you allow the single quote), just adjust your range to skip over \x22:
$string =~ tr/\x0a\x0d\x20-\x21\x23-\x7e//cd;
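A minimal check of that whitelist on a made-up string, showing that CR and LF survive while other control characters (including the tab here) are stripped:
my $input = "keep this\r\nand this\tbut not this: \x00\x08\x7f";
(my $clean = $input) =~ tr/\x0a\x0d\x20-\x7e//cd;
print $clean, "\n";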
That's a bit of a byzantine way of doing it! If you add \012 it should keep the newlines.
$input =~ tr/a-zA-Z0-9_#.:;',#$%&()\/\\{}[]?! \012-//cd;
See Form content types.
application/x-www-form-urlencoded: Line breaks are represented as "CR LF" pairs (i.e., %0D%0A).
...
multipart/form-data: As with all MIME transmissions, "CR LF" (i.e., %0D%0A) is used to separate lines of data.
I do not know what you have in the database. Now you know what your script sees.
You are using CGI.pm, right?
Thanks for the help, guys! Ultimately I decided to process all the data in our database to remove the character that was causing the issue, so that any text submitted via our update form (and not changed by the user) would match what was in the database. Per your suggestions I also added a few additional allowed characters to the validation regex.