Why does my Perl tr/// remove newlines? - perl

I'm trying to clean up form input using the following Perl transliteration:
sub ValidateInput {
    my $input = shift;
    $input =~ tr/a-zA-Z0-9_@.:;',#$%&()\/\\{}[]?! -//cd;
    return $input;
}
The problem is that this transliteration is removing embedded newline characters that users may enter into a textarea field which I want to keep as part of the string. Any ideas on how I can update this to stop it from removing embedded newline characters? Thanks in advance for your help!

I'm not sure what you are doing, but I suspect you are trying to keep all the characters between the space and the tilde in the ASCII table, along with some of the whitespace characters. I think most of your list condenses to a single range \x20-\x7e:
$string =~ tr/\x0a\x0d\x20-\x7e//cd;
If you want to knock out a character like " (although I suspect you really want it since you allow the single quote), just adjust your range:
$string =~ tr/\x0a\x0d\x20-\x21\x23-\x7e//cd;
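To see why the newline has to be listed explicitly, here is a minimal sketch of the range-based version. The helper name and the sample string are made up for illustration; the point is only that with the c and d flags, anything not in the list is deleted, so \x0a and \x0d survive only because they are listed.

```perl
use strict;
use warnings;

# Hypothetical helper name; keeps LF, CR and printable ASCII (space through tilde).
sub keep_printable_and_newlines {
    my ($input) = @_;
    # c = complement the search list, d = delete matched characters,
    # so every character NOT listed here is removed.
    (my $clean = $input) =~ tr/\x0a\x0d\x20-\x7e//cd;
    return $clean;
}

# Sample input with a CR LF pair and two control characters (NUL and BEL).
my $raw   = "line one\r\nline two\x00\x07 ok";
my $clean = keep_printable_and_newlines($raw);
print $clean eq "line one\r\nline two ok" ? "newlines kept\n" : "unexpected result\n";
```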

That's a bit of a byzantine way of doing it! If you add \012 it should keep the newlines.
$input =~ tr/a-zA-Z0-9_@.:;',#$%&()\/\\{}[]?! \012-//cd;

See Form content types.
application/x-www-form-urlencoded: Line breaks are represented as "CR LF" pairs (i.e., %0D%0A).
...
multipart/form-data: As with all MIME transmissions, "CR LF" (i.e., %0D%0A) is used to separate lines of data.
I do not know what you have in the database, but now you know what your script sees.
You are using CGI.pm, right?
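If the mismatch is between CR LF pairs coming from the form and bare LF in the stored text, one option is to normalize line endings before comparing. This is only a sketch under that assumption; the helper name and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn CR LF pairs (and lone CRs) into a single LF.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\x0d\x0a?/\x0a/g;
    return $text;
}

# What a browser submits (CR LF) versus what might be stored (LF only).
my $submitted = "first\x0d\x0asecond\x0d\x0athird";
my $stored    = "first\x0asecond\x0athird";
print normalize_newlines($submitted) eq $stored ? "match\n" : "differ\n";
```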

Thanks for the help guys! Ultimately I decided to process all the data in our database to remove the character that was causing the issue so that any text that was submitted via our update form (and not changed by the user) would match what was in the database. Per your suggestions I also added a few additional allowed characters to the validation regex.

Related

Split function returns weird characters

I am facing a problem with a script I want to write. In short, I am connecting to a local database with DBI and executing some queries. This works just fine, and the values returned from select queries print correctly. But when I split, say, $firstName into an array and print out the array, I get weird characters. Note that all the fields in the table I am working with contain only Greek characters and use the utf8_general_ci collation. I played around with use utf8, use encoding, binmode, encode, etc., but split still returns weird characters, while before the split the whole Greek word was printed fine. I suppose this is due to some missing pragma about string encoding or something similar, but I really can't find the solution. Thanks in advance.
Here is the piece of code I am describing. Perl version is v5.14.2
@query = &DatabaseSubs::getStringFromDb();
print "$query[1]\n"; # prints the greek name fine
@chars = split('',$query[1]);
foreach $chr (@chars) {
    print "$chr \n"; # prints weird chars
}
And here is the output from print and foreach respectively.
By default, Perl assumes that you are working with single-byte characters. But you aren't: in UTF-8, the Greek characters that you are using are two bytes each. Therefore split is splitting your characters in half and you're getting strange characters.
You need to decode your bytes into characters as they come into your program. One way to do that would be like this.
use Encode;
my @query = map { decode_utf8($_) } DatabaseSubs::getStringFromDb();
(I've also removed the unnecessary and potentially confusing '&' from the subroutine call.)
Now @query contains properly decoded character strings and split will split into individual characters correctly(*).
But if you print one of these characters, you'll get a "wide character" warning. That's because Perl's I/O layer expects single-byte characters. You need to tell it to expect UTF8. You can do that like this:
binmode STDOUT, ':utf8';
There are other improvements that you could consider. For example, you could probably put the decoding into the getStringFromDb subroutine. I recommend reading perldoc perluniintro and perldoc perlunicode for more details.
(*) Yes, there's another whole level of pain lurking when you get into two-character graphemes, but let's ignore that for now.
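Putting the two steps together, a minimal self-contained sketch might look like this. The byte string stands in for whatever the database call returns (that part is an assumption, as is the helper name chars_of); it is the UTF-8 encoding of three Greek letters, so the script reports 3 characters rather than 6 bytes.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Stand-in for the bytes a database driver might hand back:
# the UTF-8 encoding of alpha, beta, gamma (3 characters, 6 bytes).
my $bytes = "\xce\xb1\xce\xb2\xce\xb3";

# Hypothetical helper: decode the octets, then split into characters.
sub chars_of {
    my ($octets) = @_;
    my $text = decode_utf8($octets);
    return split //, $text;
}

# Tell the output layer to encode characters as UTF-8 again,
# which avoids the "Wide character" warning.
binmode STDOUT, ':encoding(UTF-8)';

my @chars = chars_of($bytes);
print scalar(@chars), " characters\n";
print "$_\n" for @chars;
```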
Your data is in utf8, but perl doesn't know that, so each perl character is just one byte of the multibyte characters that are stored in the database.
You tell perl that the data is in fact utf8 with:
utf8::decode($query[1]);
(though most database drivers provide a way to automate this before you even see the data in your code). Once you've done this, split will properly operate on the actual characters. You probably then need to also set your output filehandle to expect utf8 characters, or it will try to downgrade them to an 8-bit encoding.
The issue is that split('', $word) is splitting on every byte, whereas in UTF-8 a single character can occupy several bytes. For ASCII characters (code points 0 to 127) this is fine, since each one is a single byte, but anything above 127 is represented as multiple bytes. You're essentially printing half of the character's encoding, which is why it looks like garbage.
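A quick way to check which situation you are in is to compare length before and after decoding. A small sketch, using a hard-coded two-byte sequence as a stand-in for a database value:

```perl
use strict;
use warnings;

# Greek small letter lambda (U+03BB) encoded as UTF-8 is the two bytes CE BB.
my $word = "\xce\xbb";
my $bytes_before = length $word;    # still undecoded, so this counts bytes

# utf8::decode works in place and returns false if the bytes are not valid UTF-8.
utf8::decode($word) or die "not valid UTF-8";
my $chars_after = length $word;     # now counts characters

print "before decode: $bytes_before, after decode: $chars_after\n";
```

If the two numbers differ, the string was still raw bytes when it reached split.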

Why does split not return anything?

I have been trying to get this Perl split working for more than 2 hours. I don't see an error. Maybe some other eyes can look at it and see the issue. I am sure it's a silly one:
@versionsplit=split('.',"15.0.3");
print $versionsplit[0];
print $versionsplit[1];
print $versionsplit[2];
I just get an empty array. Any idea why?
You need:
@versionsplit=split(/\./,"15.0.3");
The first argument to split is a regular expression, not a string. And . is the regex symbol which means ‘match any character’. So all the characters in your input string were being treated as separators, and split wasn't finding anything between them to return.
The "." represents any character. You need to escape it for the split function to recognise it as a field separator.
Change your line to:
@versionsplit=split('\.',"15.0.3");
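For comparison, here is a short sketch that runs the unescaped and escaped patterns side by side. The variable names are just for illustration; \Q...\E is another way to have the dot taken literally.

```perl
use strict;
use warnings;

my $version = "15.0.3";

my @unescaped = split /./,      $version;   # "." matches any character
my @escaped   = split /\./,     $version;   # "\." matches a literal dot
my @quoted    = split /\Q.\E/,  $version;   # \Q...\E quotes metacharacters

# Every character was a separator, and split drops trailing empty fields,
# so nothing is left in the first case.
print scalar(@unescaped), " fields without escaping\n";
print join(",", @escaped), "\n";
print join(",", @quoted), "\n";
```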

perl split string on a single char, not on repeated char

I am using an existing perl script to process a text file output from a database query which I have no control over.
The data contains fields separated by '|', but some fields contain '||'. There are no empty fields. There may be spaces on either side of the field separator which I would also like to remove.
I cannot find a simple way to achieve this, apart from changing the '||' to something else and putting it back after the split, which seems a bit heavy going.
The file is substantial (typically up to about 100M).
Using split(/ *\| */, $line) works apart from the '||' character.
Any thought please?
split /\s*(?<!\|)\|(?!\|)\s*/
you can use negative look-behind and look-ahead to ensure there are no | symbols around the | you're splitting on:
split / \s* (?<!\|) \| (?!\|) \s* /x
Look at using Text::CSV or Tie::Handle::CSV to run through the file. If the text file has been done properly fields that contain || will be quoted.
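A small sketch of the lookaround split on a made-up line (the helper name and the sample data are hypothetical). The lone pipes act as separators and lose their surrounding spaces, while the doubled pipe stays inside its field.

```perl
use strict;
use warnings;

# Hypothetical helper: split on a lone "|" and trim whitespace around it.
sub split_fields {
    my ($line) = @_;
    # (?<!\|) : the pipe is not preceded by another pipe
    # (?!\|)  : the pipe is not followed by another pipe
    return split / \s* (?<!\|) \| (?!\|) \s* /x, $line;
}

my @fields = split_fields("alpha | beta||gamma |delta");
print "$_\n" for @fields;
```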

Best way to print fixed columns table in Perl (using underscores instead of spaces)

I need to format database records into a table that a web forum can display properly (using bbcode). The forum in question does not respect spaces no matter which type of formatting tag I use but does have a monospace font, so I need to replace all spaces by underscores like this to keep everything aligned:
Field____Field____Field
Value____Value____Value
Value____Value____Value
Value____Value____Value
Value____Value____Value
I've looked into Perl formats and printf, but I can't figure out how to make the spaces and tabs into underscore using these methods. The text also have variable length, so I need the columns to be variable as well (can't hardcode fixed values).
Any help would be appreciated. Thanks!
It's a bit of a hack, but I would use sprintf after first replacing the spaces in my values with another character that cannot occur in those values (like ~). This can be done with a simple regex.
After sprintf, I would replace the padding spaces with underscores and turn the placeholder character in the values back into a space.
You don't need anything advanced, you just need to replace the spaces with underscore:
my $str = "Field Field Field";
$str =~ tr/ /_/;
print $str;
In case the values in your fields may contain tabs (or other space-like characters) you may want to do the following:
my $str = "Field Field\tContinued Field";
$str =~ s/\s/_/g;
print $str;
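Since the column widths are not known in advance, another option is to measure them first and pad with underscores directly, skipping sprintf altogether. This is a minimal sketch, not taken from the answers above: the helper name, the gap parameter, and the sample rows are all made up, and it assumes every row is an array reference with the same number of cells.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical helper: pad every cell to its column's width with underscores.
# $gap is the number of extra underscores between columns.
sub underscore_table {
    my ($gap, @rows) = @_;

    # First pass: find the widest cell in each column.
    my @width;
    for my $row (@rows) {
        for my $i (0 .. $#$row) {
            $width[$i] = max($width[$i] // 0, length($row->[$i]));
        }
    }

    # Second pass: build each line, padding all but the last column.
    my @lines;
    for my $row (@rows) {
        my $line = '';
        for my $i (0 .. $#$row) {
            (my $cell = $row->[$i]) =~ tr/ \t/_/;   # spaces and tabs inside a value
            $line .= $cell;
            $line .= '_' x ($width[$i] - length($cell) + $gap) if $i < $#$row;
        }
        push @lines, $line;
    }
    return @lines;
}

my @rows = (
    [ 'Name',  'City',     'Score' ],
    [ 'Alice', 'New York', '42'    ],
    [ 'Bob',   'Oslo',     '7'     ],
);
print "$_\n" for underscore_table(4, @rows);
```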

Perl - Hyphen and Minus

I have a method where I split terms bounded by whitespace. I want to remove the minus sign when it stands alone, like this:
$word =~ s/^\-$//;
The problem is that I cannot visually identify the difference between a minus and a hyphen (used for separating two words, for example). How can I be sure that I'm only removing the minus sign?
In the ASCII printable character set, the hyphen and minus are the same symbol (ASCII 45), so when you're just scanning printable ASCII text data, whether you remove it or not would really depend on the context. Also, hyphenated words shouldn't contain whitespace, and when used to set apart a phrase -- like this -- you'll usually find two consecutive dashes. So if you're finding the symbol on its own there's something unusual going on in the file.
To match the en-dash or em-dash characters in Windows-1252 encoded bytes, you'd search for \226 or \227 respectively (the byte values in octal; these are outside ASCII). In decoded Unicode text they are U+2013 and U+2014, written \x{2013} and \x{2014} in Perl.
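Since the look-alikes can't be told apart by eye, one way to be sure is to check the code point. A rough sketch, assuming the text has already been decoded into characters; the lookup table and helper name are made up for illustration and cover only a handful of dash-like characters.

```perl
use strict;
use warnings;

# A few look-alike dash characters, keyed by the character itself.
my %dash_name = (
    "\x{002D}" => 'HYPHEN-MINUS (the ASCII one)',
    "\x{2010}" => 'HYPHEN',
    "\x{2013}" => 'EN DASH',
    "\x{2014}" => 'EM DASH',
    "\x{2212}" => 'MINUS SIGN',
);

# Hypothetical helper: name the dash, or return undef for anything else.
sub identify_dash {
    my ($char) = @_;
    return $dash_name{$char};
}

# Print the code point next to the name, so only ASCII reaches the terminal.
for my $char ("-", "\x{2013}", "\x{2212}", "x") {
    printf "U+%04X %s\n", ord($char), identify_dash($char) // 'not a dash';
}
```

Note that s/^\-$// only ever matches the ASCII hyphen-minus (U+002D); none of the other characters in the table are touched by it.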
Try:
#!/usr/bin/env perl
use strict;
use warnings;
while( <DATA> ){
    if( m/(?<=[[:alpha:]])\-(?=[[:alpha:]])/ ){
        print "hyphen: $_";
    }elsif( m/\-/ ){
        print "minus: $_";
    }else{
        print "other: $_";
    }
}
__DATA__
this has hypenated-words.
this is a negative number: -2
some confusing-2 things
-to test it
title -- one-line description
When coding, use a suitable editor. There are many of them, use Google or ask fellow developers. Here's a selection of notepads:
Notepad++
Programmer's Notepad
Notepad2
These editors won't sell you a hyphen for a minus when you clearly hit the minus key on the keyboard. So in about eleven years of programming, I've never faced this problem thanks to using appropriate editing software for coding.