What is the space character in Perl?

What is the space character in Perl? - perl

I want to have this output
"type":"test test"
I don't want to use space in the command not like " ".
I want to have a character that represents a single space, and I know that I can not use \s.
Is there some thing I can use?
print "\"type\":\"test(space character should be here )test\";

The best solution really is to use the space key directly:
print q("type":"test test"); # Changed delimiter for fewer escapes
But if you want to, you can also use
\x20, the ASCII code for “Space”, or
\N{SPACE}, the Unicode charname for the normal space (there are many more).
\N{U+0020}, the Unicode codepoint for the normal space.
…
Note that these only work in double-quoted strings, e.g. qq(...).
If you want to go the charname route, then you have to load the charnames module prior to Perl 5.16:
use charnames ':full';
Since 5.16, that module is loaded automatically once an \N{...} escape is found.
By default, the $" variable contains a single space – the contents of this variable are used to concatenate the contents of an array when it is interpolated.

Just use the space key:
print "\"type\":\"test test\"";
# ^

Related

Perl: quoting correctly all special characters [duplicate]

This question already has answers here:
How can I prevent Perl from interpreting double-backslash as single-backslash character?
(3 answers)
Closed 4 years ago.
I have this sample string, containing 2 backslashes. Please don't ask me for the source of the string, it is just a sample string.
my $string = "use Ppppp\\Ppppp;";
print $string;
Both, double quotes or quotes will print
use Ppppp\Ppppp;
Using
my $string = "\Quse Ppppp\\Ppppp;\E";
print $string;
will print
use\ Ppppp\\Ppppp\;
adding those extra backslashes to the output.
Is there a simple solution in perl to display the string "literally", without modifying the string like adding extra backslashes to escape?

I have this sample string, containing 2 backslashes. ...
my $string = "use Ppppp\\Ppppp;";
Sorry, but you're mistaken - that string only contains one backslash*, as \\ is a escape sequence in double-quoted (and single-quoted) strings that produces a single backslash. See also "Quote and Quote-like Operators" in perlop. If your string really does contain two backslashes, then you need to write "use Ppppp\\\\Ppppp;", or use a heredoc, as in:
chomp( my $string = <<'ENDSTR' );
use Ppppp\\Ppppp;
ENDSTR
If you want the string output as valid Perl source code (using its escaping), then you can use one of several options:
my $string = "use Ppppp\\Ppppp;";
# option 1
use Data::Dumper;
$Data::Dumper::Useqq=1;
$Data::Dumper::Terse=1;
print Dumper($string);
# option 2
use Data::Dump;
dd $string;
# option 3
use B;
print B::perlstring($string);
Each one of these will print "use Ppppp\\Ppppp;". (There are of course other modules available too. Personally I like Data::Dump. Data::Dumper is a core module.)
Using one of these modules is also the best way to verify what your $string variable really contains.
If that still doesn't fit your needs: A previous edit of your question said "How can I escape correctly all special characters including backslash?" - you'd have to specify a full list of which characters you consider special. You could do something like this, for example:
use 5.014; # for s///r
my $string = "use Ppppp\\Ppppp;";
print $string=~s/(?=[\\])/\\/gr;
That'll print $string with backslashes doubled, without modifying $string. You can also add more characters to the regex character class to add backslashes in front of those characters as well.
* Update: So I don't sound too pedantic here: of course the Perl source code contains two backslashes. But there is a difference between the literal source code and what the Perl string ends up containing, the same way that the string "Foo\nBar" contains a newline character instead of the two literal characters \ and n.
For the sake of completeness, as already discussed in the comments: \Q\E (aka quotemeta) is primarily meant for escaping any special characters that may be special to regular expressions (all ASCII characters not matching /[A-Za-z_0-9]/), which is why it is also escaping the spaces and semicolon.
Since you mention external files: If you are reading a line such as use Ppppp\\Ppppp; from an external file, then the Perl string will contain two backslashes, and if you print it, it will also show two backslashes. But if you wanted to represent that string as Perl source code, you have to write "use Ppppp\\\\Ppppp;" (or use one of the other methods from the question you linked to).

perl specify unicode character by name without putting name in all caps

So, this is sort of a cosmetic point, but is there an easy way to insert a unicode character by its name inside a Perl string and give the name "normal" casing?
Perl includes unicode literals that look up code points by name, as in the following:
"\N{GREEK SMALL LETTER ALPHA}"
I find something like the following easier to read:
"\N{Greek Small Letter Alpha}",
As far as I know there are no case minimal pairs when it comes to unicode character names. Is there a concise way to name the character that still triggers a compilation error very early in the process of executing a script if the character doesn't exist?
example compilation error with intentionally misspelled character name, this is the kind of check I don't want to give up.
$ echo '%[a]' | ./unicodify
Unknown charname 'GREK SMALL LETTER ALPHA' at ./unicodify line 10, within string
Execution of ./unicodify aborted due to compilation errors.
I'm trying to write a small utility to make it easier to enter unicode characters in text files by mnemonic names delimited by %[ and ].
Here's an extremely stripped down example that just replaces %[a] and %[b].
#! /usr/bin/env perl
use strict;
use warnings;
use utf8;
use open ':std' => ':utf8';
my %abbrevs = (
'a' => "\N{GREEK SMALL LETTER ALPHA}",
'b' => "\N{GREEK SMALL LETTER BETA}",
);
while (<>) {
chomp;
my $line = $_;
$line =~ s/(\%\[(.*?)\])/$abbrevs{$2}/g;
print "${line}\n";
}

Quote charnames,
Starting in Perl v5.16, any occurrence of \N{CHARNAME} sequences in a double-quotish string automatically loads this module with arguments :full and :short (described below) if it hasn't already been loaded with different arguments
On of those "different arguments" requests the use of loose matching.
$ perl -CSD -e'
use charnames ":loose";
CORE::say "\N{Greek Small Letter Alpha}";
'
α
LOOSE MATCHES
By specifying :loose, Unicode's loose character name matching rules are selected instead of the strict exact match used otherwise. That means that CHARNAME doesn't have to be so precisely specified. Upper/lower case doesn't matter (except with scripts as mentioned above), nor do any underscores, and the only hyphens that matter are those at the beginning or end of a word in the name (with one exception: the hyphen in U+1180 HANGUL JUNGSEONG O-E does matter). Also, blanks not adjacent to hyphens don't matter. The official Unicode names are quite variable as to where they use hyphens versus spaces to separate word-like units, and this option allows you to not have to care as much. The reason non-medial hyphens matter is because of cases like U+0F60 TIBETAN LETTER -A versus U+0F68 TIBETAN LETTER A. The hyphen here is significant, as is the space before it, and so both must be included.
:loose slows down look-ups by a factor of 2 to 3 versus :full, but the trade-off may be worth it to you. Each individual look-up takes very little time, and the results are cached, so the speed difference would become a factor only in programs that do look-ups of many different spellings, and probably only when those look-ups are through vianame() and string_vianame(), since \N{...} look-ups are done at compile time.
The module also provides the means for creating custom aliases.

Split function returns weird characters

I am facing a problem with a script I want to make. In short, I am connecting to a local database with dbi and execute some queries. While this works just fine, and as I print out the returned values from select queries and so on, when I split, say, the $firstName to an array and print out the array I get weird characters. Note that all the fields in the table I am working are containing only greek characters and are utf8_general_ci. I played around with use utf8, use encoding, binmode, encode etc but still the split function does return  weird characters while before the split the whole greek word was printed fine. I suppose this is due to some missing pragma about string encoding or something similar but really can't find out the solution. Thanks in advance.
Here is the piece of code I am describing. Perl version is v5.14.2
#query = &DatabaseSubs::getStringFromDb();
print "$query[1]\n"; # prints the greek name fine
#chars = split('',$query[1]);
foreach $chr (#chars) {
print "$chr \n"; # prints weird chars
}
And here is the output from print and foreach respectively.

By default, Perl assumes that you are working with single-byte characters. But you aren't, in UTF8 the Greek characters that you are using are two-bytes in size. Therefore split is splitting your characters in half and you're getting strange characters.
You need to decode your bytes into characters as they come into your program. One way to do that would be like this.
use Encode;
my #query = map { decode_utf8($_) } DatabaseSubs::getStringFromDb();
(I've also removed the unnecessary and potentially confusing '&' from the subroutine call.)
Now #query contains properly decode character strings and split will split into individual characters correctly(*).
But if you print one of these characters, you'll get a "wide character" warning. That's because Perl's I/O layer expects single-byte characters. You need to tell it to expect UTF8. You can do that like this:
binmode STDOUT, ':utf8';
There are other improvements that you could consider. For example, you could probably put the decoding into the getStringFromDb subroutine. I recommend reading perldoc perluniintro and perldoc perlunicode for more details.
(*) Yes, there's another whole level of pain lurking when you get into two-character graphemes, but let's ignore that for now.

Your data is in utf8, but perl doesn't know that, so each perl character is just one byte of the multibyte characters that are stored in the database.
You tell perl that the data is in fact utf8 with:
utf8::decode($query[1]);
(though most database drivers provide a way to automate this before you even see the data in your code). Once you've done this, split will properly operate on the actual characters. You probably then need to also set your output filehandle to expect utf8 characters, or it will try to downgrade them to an 8-bit encoding.

The issue is that split('', $word) splits on every byte where in utf8 you can have multi-byte characters. For characters with ASCII value less than 127, this is fine, but anything beyond 127 is represented as multiple bytes. You're essentially printing half the character's code, hence it looking like garbage.

Check characters inside string for their Unicode value

I would like to replace characters with certain Unicode values in a variable with dash. I have two ideas which might work, but I do not know how to check for the value of character:
1/ processing variable as string, checking every characters value and placing these characters in a new variable (replacing those characters which are invalid)
2/ use these magic :-)
$variable = s/[$char_range]/-/g;
char_range should be similar to [0-9] or [A-Z], but it should be values for utf-8 characters. I need range from 0x00 to 0x7F to be exact.

The following expression should replace anything that is not ASCII with a hyphen, which is (I think) what you want to do:
s/[\N{U+0080}-\N{U+FFFF}]/-/g

There's no such thing as UTF-8 characters. There are only characters that you encode into UTF-8. Even then, you don't want to make ranges outside of the magical ones that Perl knows about. You're likely to get more than you expect.
To get the ordinal value for a character, use ord:
use utf8;
my $code_number = ord '😸'; # U+1F638
say sprintf "%#x", $code_number;
However, I don't think that's what you need. It sounds like you want to replace characters in the ASCII range with a -. You can specify ranges of code numbers:
s/[\000-\177]/-/g; # in octal
s/[\x00-\x7f]/-/g; # in hexadecimal
You can specify wide character ordinal values in braces:
s/[\x80-\x{10ffff}]/-/g; # wide characters, replace non-ASCII in this case
When the characters have a common property, you can use that:
s/\p{ASCII}/-/g;
However, if you are replacing things character for character, you might want a transliteration:
$string =~ tr/\000-\177/-/;

Perl: Is quotemeta for regular expressions only? Is it safe for file names?

While answering this question regarding safe escaping of filename with spaces (and potentially other characters), one of the answers said to use Perl's built-in quotemeta function.
The documentation of quotemeta states:
quotemeta (and \Q ... \E ) are useful when interpolating strings
into regular expressions, because by default an interpolated variable
will be considered a mini-regular expression.
In the documentation for quotemeta, the only mention of its use is to escape all the characters other than /[A-Za-z_0-9]/ with a \ for use in a regex. It does not state the use for filenames. This does seem like a very pleasant, if undocumented, side effect however.
In a comment to Sinan Ünür answer to the earlier question, hobbs states:
shell escaping is different from
regexp escaping, and although I can't
come up with a situation where
quotemeta would give a truly unsafe
result, it's not meant for the task.
If you must escape, instead of
bypassing the shell, I suggest trying
String::ShellQuote which takes a more
conservative approach using sh single
quotes to defang everything except
single quotes themselves, and
backslashes for single quotes. – hobbs
Aug 13 '09 at 14:25
Is it safe -- completely -- to use quotemeta in place of more conservative file quoting like String::Shellquote? Is quotemeta utf8 or multibyte character safe?
I put together a test that is unclear. quotemeta works well, it seems, except for a file name or directory name with a \n, or \r in it. While rare, these characters are legal in Unix and I have seen them. Recall that certain characters, such as LF, CR and NUL cannot be escaped with \. I read my hard drive with 700k files with quotemeta and had no failures.
I have suspicion (though I have not demonstrated it yet) that quotemeta might fail with multibyte characters where one or more of the bytes falls into the ASCII range. For example,à can be encoded as one character (UTF8 C3 A0) or as two characters (U+0061 gives a u+0300 is a combining graves accent). The only demonstrated failure I have with quotemeta is with files with a \n or \r in the path that I created. I would be interested in other characters to put in nasty_names to test.
ShellQuote works perfectly on all file names except those terminated by a NUL when creating a file. I have never ever had a failure with it.
So what to use? Just to be clear: shell quoting is not something I do often, since I usually just use Perl open to open a pipe to a process. That method does not suffer the shell issues discussed. I am interested since I have seen quotemeta used often for file name escaping.
(Thanks to Ether I have added IPC::System::Simple)
Test file:
use strict; use warnings; use autodie;
use String::ShellQuote;
use File::Find;
use File::Path;
use IPC::System::Simple 'capturex';
my #nasty_names;
my $top_dir = '/Users/andrew/bin/pipetestdir/testdir';
my $sub_dir = "easy_to_remove_me";
my (#qfail, #sfail, #ipcfail);
sub wanted {
if ($File::Find::name) {
my $rtr;
my $exec1="ls ".quotemeta($File::Find::name);
my $exec2="ls ".shell_quote($File::Find::name);
my #exec3= ("ls", $File::Find::name);
$rtr=`$exec1`;
push #qfail, "$exec1"
if $rtr=~/^\s*$/ ;
$rtr=`$exec2`;
push #sfail, "$exec2"
if $rtr=~/^\s*$/ ;
$rtr = capturex(#exec3);
push #ipcfail, \#exec3
if $rtr=~/^\s*$/ ;
}
}
chdir($top_dir) or die "$!";
mkdir "$top_dir/$sub_dir";
chdir "$top_dir/$sub_dir";
push #nasty_names, "name with new line \n in the middle";
push #nasty_names, "name with CR \r in the middle";
push #nasty_names, "name with tab\tright there";
push #nasty_names, "utf \x{0061}\x{0300} combining diacritic";
push #nasty_names, "utf e̋ alt combining diacritic";
push #nasty_names, "utf e\x{cc8b} alt combining diacritic";
push #nasty_names, "utf άέᾄ greek";
push #nasty_names, 'back\slashes\\Not\\\at\\\\end';
push #nasty_names, qw|back\slashes\\IS\\\at\\\\end\\\\|;
sub create_nasty_files {
for my $name (#nasty_names) {
open my $fh, '>', $name ;
close $fh;
}
}
for my $dir (#nasty_names) {
chdir("$top_dir/$sub_dir");
mkpath($dir);
chdir $dir;
create_nasty_files();
}
find(\&wanted, $top_dir);
print "\nquotemeta failed on:\n", join "\n", #qfail;
print "\nShell Quote failed on:\n", join "\n", #sfail;
print "\ncapturex failed on:\n", join "\n", #ipcfail;
print "\n\n\n",
"Remove \"$top_dir/$sub_dir\" before running again...\n\n";

Quotemeta is safe under these assumptions:
Only non-alphanumeric characters have a special meaning.
If a non-alphanumeric character has a special meaning, putting a backslash in front of it will always make it non-special.
If a non-alphanumeric character doesn't have a special meaning, putting a backslash in front of it will do nothing.
The shell violates rules 2 and 3 no matter what quote context you use -- outside of quotes, backslash-newline doesn't generate newline; in double-quotes, backslash-punctuation puts a backslash into the output (outside of a certain list of punctuation); and in single-quotes, everything is literal and backslash doesn't even protect you against a closing single-quote.
I still recommend String::ShellQuote if you need to quote things for the shell. I also recommend avoiding letting the shell process your filenames entirely, if you can, by using LIST-form system/exec/open or IPC::Open2, IPC::Open3, or IPC::System::Simple.
As for things besides the shell... lots of different things violate one or more of the rules. For example, obsolete POSIX "basic" regexes and various kinds of editor regexes have punctuation characters that are non-special by default, but become special when preceded by backslash. Basically what I'm saying is, know the thing that you're feeding your data to very well, and escape properly. Only use quotemeta if it's an exact fit, or if you're using it for something that's not very important.

You could also use IPC::System::Simple capture() or capturex() (which I suggested in another answer on that first question), which will let you bypass the shell.
I added these lines to your script and found that no examples failed:
use IPC::System::Simple 'capturex';
...
my (#qfail, #sfail, #ipcfail);
...
my #exec3= ("ls", $File::Find::name);
...
$rtr = capturex(#exec3);
push #ipcfail, \#exec3
if $rtr=~/^\s*$/ ;
...
print "\ncapturex failed on:\n", join "\n", #ipcfail;
But in general, you should solve the actual problem, rather than attempting to find better band-aids. quotemeta is intended specifically to escape regular expression-significant characters, which as you have discovered is not a perfect overlap with the set of characters that are significant to the shell.

The following is a Unix-only solution; see https://stackoverflow.com/a/32161361/45375 for Windows support.
An alternative is this simple function, which should work robustly even with non-ASCII characters (assuming the correct encoding), as well as \n, and \r, but excluding NUL (see bottom).
sub quoteforsh { join ' ', map { "'" . s/'/'\\''/gr . "'" } #_ }
The function encloses each argument in single-quotes and, if multiple arguments were specified, separates them with spaces.
Single-quoted strings are used, because their contents is not subject to any interpretation in POSIX-like shells.
As such, however, you cannot even escape ' instances themselves, which requires the following workaround: every embedded ' instance is replaced with '\'' (sic), which effectively splits the input string into multiple single-quoted strings, with escaped ' instances - \' - spliced in - the shell then reassembles the string parts into a single string.
Example:
print quoteforsh 'I\'m here & wëll';
literally produces (including the enclosing single-quotes) 'I'\''m here & wëll', which, to the shell, are 3 contiguous strings - 'I', \', and '&well', which the shell then reassembles into a single string, which, after quote removal, yields I'm here & wëll.
OSX Unicode caveat: The HFS+ stores filenames in NFD (decomposed Unicode normal form - base letter followed by another character that is the associated diacritic), whereas Perl typically creates NFC (composed Unicode normal form - a single character identifies the accented letter).
When using literal filenames, this distinction doesn't matter (the system calls do the mapping), but when using globs, it does, and, unfortunately, you have to do your own translation between the two forms.
Support for NUL (0x0) chars.:
I don't think NUL chars. in filenames are a real-world concern:
Most POSIX-like shells (bash, dash, ksh) ignore NUL chars. on the command line - zsh being the only exception.
Even if that weren't an issue, according to Wikipedia, most Unix systems do not support NUL chars. in filenames.
Besides, trying to pass a literal with a NUL to Perl's system() function breaks the invocation, presumably, because the string passed to sh -c is cut off at the first NUL:
system "echo 'a\x{0}b'"; # BREAKS