perl reading from database, fields have \0 \t as normal text

perl reading from database, fields have \0 \t as normal text - perl

Apologies from the outset. I cannot give code that I am using.
I am querying a database via DBI and using perl to print the output via fetchrow-array and print $variable
But the fields in the database contain \0 \t \r etc as part of the normal text.
When these fields are printed as via the variable and the print command, these \t \r \0 text characters are mistakenly printed as tab, newline, hex character. I see no way to tell print to ignore any character strings like this.
Any ideas?
Thanks.

Neither fetching data using DBI nor printing will convert \t into a tab. The only time Perl converts \t is if it's found in a double-quoted string literal[1], which is to say in a file passed to perl, do, require or use, or in a string passed to perl -e or eval EXPR.
If you have a tab, you are taking steps to convert \t to a tab, or it's actually a tab in the database.
This includes qx and the replacement expression of a substitution without /e.

Related

What does \x do in print

I would like to start by saying that I am not familiar with Perl. That being said, I came across this piece of code and I could not figure out what the \x was for in the code below. In addition, I was unsure why nothing was displayed when I ran the following:
perl -e 'print "\x7c\x8e\x04\x08"'

It's not about print: it's about string representation, in which codes represent characters from your character set. For more information you should read Quote and Quote-like Operators and Effects of Character Semantics
In your case the character code is in hex. You should look in your character set table, and you may need to convert to decimal first.

You said "I was unsure why nothing was displayed when I ran the following:"
perl -e 'print "\x7c\x8e\x04\x08"'
That command outputs 4 characters to STDOUT. Each of the characters is specified in hexadecimal. The "\x7c" part will output the vertical bar character |. The other three characters are control characters, so probably wouldn't produce any visible output. If you redirect output to a file, you will end up with a 4 byte file.
It's possible that you're not seeing the vertical bar character because it's being overwritten by your command prompt. Unlike the shell echo or Python's print, Perl's print function does not automatically append a newline to all output. If you want new lines, you can insert them in the string using \n.

\x signifies the start of a hexadecimal character notation.

Split function returns weird characters

I am facing a problem with a script I want to make. In short, I am connecting to a local database with dbi and execute some queries. While this works just fine, and as I print out the returned values from select queries and so on, when I split, say, the $firstName to an array and print out the array I get weird characters. Note that all the fields in the table I am working are containing only greek characters and are utf8_general_ci. I played around with use utf8, use encoding, binmode, encode etc but still the split function does return  weird characters while before the split the whole greek word was printed fine. I suppose this is due to some missing pragma about string encoding or something similar but really can't find out the solution. Thanks in advance.
Here is the piece of code I am describing. Perl version is v5.14.2
#query = &DatabaseSubs::getStringFromDb();
print "$query[1]\n"; # prints the greek name fine
#chars = split('',$query[1]);
foreach $chr (#chars) {
print "$chr \n"; # prints weird chars
}
And here is the output from print and foreach respectively.

By default, Perl assumes that you are working with single-byte characters. But you aren't, in UTF8 the Greek characters that you are using are two-bytes in size. Therefore split is splitting your characters in half and you're getting strange characters.
You need to decode your bytes into characters as they come into your program. One way to do that would be like this.
use Encode;
my #query = map { decode_utf8($_) } DatabaseSubs::getStringFromDb();
(I've also removed the unnecessary and potentially confusing '&' from the subroutine call.)
Now #query contains properly decode character strings and split will split into individual characters correctly(*).
But if you print one of these characters, you'll get a "wide character" warning. That's because Perl's I/O layer expects single-byte characters. You need to tell it to expect UTF8. You can do that like this:
binmode STDOUT, ':utf8';
There are other improvements that you could consider. For example, you could probably put the decoding into the getStringFromDb subroutine. I recommend reading perldoc perluniintro and perldoc perlunicode for more details.
(*) Yes, there's another whole level of pain lurking when you get into two-character graphemes, but let's ignore that for now.

Your data is in utf8, but perl doesn't know that, so each perl character is just one byte of the multibyte characters that are stored in the database.
You tell perl that the data is in fact utf8 with:
utf8::decode($query[1]);
(though most database drivers provide a way to automate this before you even see the data in your code). Once you've done this, split will properly operate on the actual characters. You probably then need to also set your output filehandle to expect utf8 characters, or it will try to downgrade them to an 8-bit encoding.

The issue is that split('', $word) splits on every byte where in utf8 you can have multi-byte characters. For characters with ASCII value less than 127, this is fine, but anything beyond 127 is represented as multiple bytes. You're essentially printing half the character's code, hence it looking like garbage.

Perl: Is quotemeta for regular expressions only? Is it safe for file names?

While answering this question regarding safe escaping of filename with spaces (and potentially other characters), one of the answers said to use Perl's built-in quotemeta function.
The documentation of quotemeta states:
quotemeta (and \Q ... \E ) are useful when interpolating strings
into regular expressions, because by default an interpolated variable
will be considered a mini-regular expression.
In the documentation for quotemeta, the only mention of its use is to escape all the characters other than /[A-Za-z_0-9]/ with a \ for use in a regex. It does not state the use for filenames. This does seem like a very pleasant, if undocumented, side effect however.
In a comment to Sinan Ünür answer to the earlier question, hobbs states:
shell escaping is different from
regexp escaping, and although I can't
come up with a situation where
quotemeta would give a truly unsafe
result, it's not meant for the task.
If you must escape, instead of
bypassing the shell, I suggest trying
String::ShellQuote which takes a more
conservative approach using sh single
quotes to defang everything except
single quotes themselves, and
backslashes for single quotes. – hobbs
Aug 13 '09 at 14:25
Is it safe -- completely -- to use quotemeta in place of more conservative file quoting like String::Shellquote? Is quotemeta utf8 or multibyte character safe?
I put together a test that is unclear. quotemeta works well, it seems, except for a file name or directory name with a \n, or \r in it. While rare, these characters are legal in Unix and I have seen them. Recall that certain characters, such as LF, CR and NUL cannot be escaped with \. I read my hard drive with 700k files with quotemeta and had no failures.
I have suspicion (though I have not demonstrated it yet) that quotemeta might fail with multibyte characters where one or more of the bytes falls into the ASCII range. For example,à can be encoded as one character (UTF8 C3 A0) or as two characters (U+0061 gives a u+0300 is a combining graves accent). The only demonstrated failure I have with quotemeta is with files with a \n or \r in the path that I created. I would be interested in other characters to put in nasty_names to test.
ShellQuote works perfectly on all file names except those terminated by a NUL when creating a file. I have never ever had a failure with it.
So what to use? Just to be clear: shell quoting is not something I do often, since I usually just use Perl open to open a pipe to a process. That method does not suffer the shell issues discussed. I am interested since I have seen quotemeta used often for file name escaping.
(Thanks to Ether I have added IPC::System::Simple)
Test file:
use strict; use warnings; use autodie;
use String::ShellQuote;
use File::Find;
use File::Path;
use IPC::System::Simple 'capturex';
my #nasty_names;
my $top_dir = '/Users/andrew/bin/pipetestdir/testdir';
my $sub_dir = "easy_to_remove_me";
my (#qfail, #sfail, #ipcfail);
sub wanted {
if ($File::Find::name) {
my $rtr;
my $exec1="ls ".quotemeta($File::Find::name);
my $exec2="ls ".shell_quote($File::Find::name);
my #exec3= ("ls", $File::Find::name);
$rtr=`$exec1`;
push #qfail, "$exec1"
if $rtr=~/^\s*$/ ;
$rtr=`$exec2`;
push #sfail, "$exec2"
if $rtr=~/^\s*$/ ;
$rtr = capturex(#exec3);
push #ipcfail, \#exec3
if $rtr=~/^\s*$/ ;
}
}
chdir($top_dir) or die "$!";
mkdir "$top_dir/$sub_dir";
chdir "$top_dir/$sub_dir";
push #nasty_names, "name with new line \n in the middle";
push #nasty_names, "name with CR \r in the middle";
push #nasty_names, "name with tab\tright there";
push #nasty_names, "utf \x{0061}\x{0300} combining diacritic";
push #nasty_names, "utf e̋ alt combining diacritic";
push #nasty_names, "utf e\x{cc8b} alt combining diacritic";
push #nasty_names, "utf άέᾄ greek";
push #nasty_names, 'back\slashes\\Not\\\at\\\\end';
push #nasty_names, qw|back\slashes\\IS\\\at\\\\end\\\\|;
sub create_nasty_files {
for my $name (#nasty_names) {
open my $fh, '>', $name ;
close $fh;
}
}
for my $dir (#nasty_names) {
chdir("$top_dir/$sub_dir");
mkpath($dir);
chdir $dir;
create_nasty_files();
}
find(\&wanted, $top_dir);
print "\nquotemeta failed on:\n", join "\n", #qfail;
print "\nShell Quote failed on:\n", join "\n", #sfail;
print "\ncapturex failed on:\n", join "\n", #ipcfail;
print "\n\n\n",
"Remove \"$top_dir/$sub_dir\" before running again...\n\n";

Quotemeta is safe under these assumptions:
Only non-alphanumeric characters have a special meaning.
If a non-alphanumeric character has a special meaning, putting a backslash in front of it will always make it non-special.
If a non-alphanumeric character doesn't have a special meaning, putting a backslash in front of it will do nothing.
The shell violates rules 2 and 3 no matter what quote context you use -- outside of quotes, backslash-newline doesn't generate newline; in double-quotes, backslash-punctuation puts a backslash into the output (outside of a certain list of punctuation); and in single-quotes, everything is literal and backslash doesn't even protect you against a closing single-quote.
I still recommend String::ShellQuote if you need to quote things for the shell. I also recommend avoiding letting the shell process your filenames entirely, if you can, by using LIST-form system/exec/open or IPC::Open2, IPC::Open3, or IPC::System::Simple.
As for things besides the shell... lots of different things violate one or more of the rules. For example, obsolete POSIX "basic" regexes and various kinds of editor regexes have punctuation characters that are non-special by default, but become special when preceded by backslash. Basically what I'm saying is, know the thing that you're feeding your data to very well, and escape properly. Only use quotemeta if it's an exact fit, or if you're using it for something that's not very important.

You could also use IPC::System::Simple capture() or capturex() (which I suggested in another answer on that first question), which will let you bypass the shell.
I added these lines to your script and found that no examples failed:
use IPC::System::Simple 'capturex';
...
my (#qfail, #sfail, #ipcfail);
...
my #exec3= ("ls", $File::Find::name);
...
$rtr = capturex(#exec3);
push #ipcfail, \#exec3
if $rtr=~/^\s*$/ ;
...
print "\ncapturex failed on:\n", join "\n", #ipcfail;
But in general, you should solve the actual problem, rather than attempting to find better band-aids. quotemeta is intended specifically to escape regular expression-significant characters, which as you have discovered is not a perfect overlap with the set of characters that are significant to the shell.

The following is a Unix-only solution; see https://stackoverflow.com/a/32161361/45375 for Windows support.
An alternative is this simple function, which should work robustly even with non-ASCII characters (assuming the correct encoding), as well as \n, and \r, but excluding NUL (see bottom).
sub quoteforsh { join ' ', map { "'" . s/'/'\\''/gr . "'" } #_ }
The function encloses each argument in single-quotes and, if multiple arguments were specified, separates them with spaces.
Single-quoted strings are used, because their contents is not subject to any interpretation in POSIX-like shells.
As such, however, you cannot even escape ' instances themselves, which requires the following workaround: every embedded ' instance is replaced with '\'' (sic), which effectively splits the input string into multiple single-quoted strings, with escaped ' instances - \' - spliced in - the shell then reassembles the string parts into a single string.
Example:
print quoteforsh 'I\'m here & wëll';
literally produces (including the enclosing single-quotes) 'I'\''m here & wëll', which, to the shell, are 3 contiguous strings - 'I', \', and '&well', which the shell then reassembles into a single string, which, after quote removal, yields I'm here & wëll.
OSX Unicode caveat: The HFS+ stores filenames in NFD (decomposed Unicode normal form - base letter followed by another character that is the associated diacritic), whereas Perl typically creates NFC (composed Unicode normal form - a single character identifies the accented letter).
When using literal filenames, this distinction doesn't matter (the system calls do the mapping), but when using globs, it does, and, unfortunately, you have to do your own translation between the two forms.
Support for NUL (0x0) chars.:
I don't think NUL chars. in filenames are a real-world concern:
Most POSIX-like shells (bash, dash, ksh) ignore NUL chars. on the command line - zsh being the only exception.
Even if that weren't an issue, according to Wikipedia, most Unix systems do not support NUL chars. in filenames.
Besides, trying to pass a literal with a NUL to Perl's system() function breaks the invocation, presumably, because the string passed to sh -c is cut off at the first NUL:
system "echo 'a\x{0}b'"; # BREAKS

What's the clearest way to replace trailing backslash \ with \n?

I want multi-line strings in java, so I seek a simple preprocessor to convert C-style multi-lines into single lines with a literal '\n'.
Before:
System.out.println("convert trailing backslashes\
this is on another line\
\
\
above are two blank lines\
But don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
After:
System.out.println("convert trailing backslashes\nthis is on another line\n\n\nabove are two blank lines\nBut don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
I thought sed would do it well, but sed is line-based, so replacing the '\' and the newline that follows it (effectively joining the two lines) is not very natural in sed. I adapted sredden79's oneliner to the following - it works, it's clever, but it's not clear:
sed ':a { $!N; s/\\\n/\\n/; ta }'
The substitute is of escaped literal backslash, newline with escaped literal backslash, n. :a is a label and ta is goto label if the substitute found a match; $ means the last line, and $! is the opposite (i.e. all lines but the last). N means to append the next line to the pattern space (thus making the \n character visible.)
EDIT here's a variation to keep compiler error line numbers etc accurate: it turns each extended line into "..."+\n (and handles the first and last lines of the String correctly):
sed ':a { $!N; s/\\\n/\\n"+\n"/; ta }'
giving:
System.out.println("convert trailing backslashes\n"+
"this is on another line\n"+
"\n"+
"\n"+
"above are two blank lines\n"+
"But don't convert non-trailing backslashes, like: \"\t\" and \'\\\'");
EDIT Actually, it would be better have Perl/Python style multi-line, where it starts and ends with a special code on one line (""" for python, I think).
Is there a simpler, saner, clearer way (maybe not using sed)?

Is there a simpler, saner, clearer way.
Forget the pre-processor, live with the limitation, complain about it (so that it will maybe be fixed in Java 7 or 8), and use an IDE to ease the pain.
Other alternatives (too troublesome I suppose, but still better than messing with the compilation process):
use a JVM-based language that does support here-docs
externalize the string into a resource file

A perl one-liner:
perl -0777 -pe 's/\\\n/\\n/g'
This will read either stdin or the file(s) named after it on the command line and write the output to stdout.
If you're using an editor that supports filtering, like vi or emacs, just filter your text through the above command and you're done:
If you're using Windows and have to worry about \r :
C:\> perl -0777 -pe "s/\\\r?\n/\\n/g"
although I think win32 Perl handles \r itself so this may be unnecessary.
The -0777 option is a special case of the -0 (that's a zero) option that defines the line or record separator. In this case, it means that we don't want any separator so read the entire file in as a single string.
The -pe option is a combination of -p (process line-by-line and print the result) and -e (next argument is (a line of) the program to execute)

A perl script to what you asked for.
while (<>) {
chomp;
print $_;
if (/\\$/) {
print "n";
} else {
print "\n";
}
}

sed 's/\x5c\x5c$/\x22\x5c\x5cn\x22/'
Hex for backslash and double quote is \x5c and \x22 respectively - it needs to be escaped so \x5c is doubled and the $ anchors to the end of the line.
Updated again per OP comment:
sed "{:a;N;\$!b a};s/\x5c\x5c\n/\x5c\x5cn/g"
The :a creates a label and the N appends a line to the pattern space, the b a branches back to the label :a except when its the last line $!;
After its all loaded - a single line substitution replaces all occurrences of a newline \n with a literal '\n' using the hex ascii code \x5c for the backslash.

What's the difference between single and double quotes in Perl?

I am just begining to learn Perl. I looked at the beginning perl page and started working.
It says:
The difference between single quotes and double quotes is that single quotes mean that their contents should be taken literally, while double quotes mean that their contents should be interpreted
When I run this program:
#!/usr/local/bin/perl
print "This string \n shows up on two lines.";
print 'This string \n shows up on only one.';
It outputs:
This string
shows up on two lines.
This string
shows up on only one.
Am I wrong somewhere?
the version of perl below:
perl -v
This is perl, v5.8.5 built for aix
Copyright 1987-2004, Larry Wall
Perl may be copied only under the terms of either the Artistic License or the
GNU General Public License, which may be found in the Perl 5 source kit.
Complete documentation for Perl, including FAQ lists, should be found on
this system using `man perl' or `perldoc perl'. If you have access to the
Internet, point your browser at http://www.perl.com/, the Perl Home Page.

I am inclined to say something is up with your shell/terminal, and whatever you are outputting to is interpreting the \n as a newline and that the problem is not with Perl.
To confirm: This Shouldn't Happen(TM) - in the first case I would expect to see a new line inserted, but with single quotes it ought to output literally the characters \n and not a new line.

In Perl, single-quoted strings do not expand backslash-escapes like \n or \t. The reason you're seeing them expanded is probably due to the nature of the shell that you're using, which is munging your output for some reason.

Everything you need to know about quoting and quote-like operators is in perlop.
To answer your specific question, double-quotes can turn certain sequences of literal characters into other characters. In your example, the double quotes turn the sequence of characters \ and n into the single character that represents a newline. In a single quoted string, that same literal sequence is just the literal \ and n characters.

By "interpreted", they mean that variable names and such will not be printed, but their values instead. \n is an escape sequence, so I'd think it would not be interpreted.

In addition to your O'Reilly link, a reference no less authoritative than the 'Programming Perl' book by Larry Wall, states that backslash interpolation does not occur in single quoted strings.
... much like Unix shell quotes: double quoted string literals are subject to
backslash and variable interpolation; single quoted strings are not
(except for \' and \\, so that you may ...)
Programing Perl, 2nd ed, 1996 page 16
So it would be interesting to see what your Perl does with
print 'Double backslash n: \\n';
As above, please show us the output from 'perl -v'.
And I believe I have confused the forum editor software, because that last Perl 'print' should have indented.

If you use the double quote it will be interpreted the \n as a newline.
But if you use the single quote it will not interpreted the \n as a newline.
For me it is working correctly.
file content
print "This string \n shows up on two lines.";
print 'This string \n shows up on only one.'

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse