Perl: specify a Unicode character by name without putting the name in all caps

So, this is sort of a cosmetic point, but is there an easy way to insert a unicode character by its name inside a Perl string and give the name "normal" casing?
Perl includes unicode literals that look up code points by name, as in the following:
"\N{GREEK SMALL LETTER ALPHA}"
I find something like the following easier to read:
"\N{Greek Small Letter Alpha}",
As far as I know there are no case minimal pairs when it comes to unicode character names. Is there a concise way to name the character that still triggers a compilation error very early in the process of executing a script if the character doesn't exist?
Here is an example compilation error with an intentionally misspelled character name. This is the kind of check I don't want to give up.
$ echo '%[a]' | ./unicodify
Unknown charname 'GREK SMALL LETTER ALPHA' at ./unicodify line 10, within string
Execution of ./unicodify aborted due to compilation errors.
I'm trying to write a small utility to make it easier to enter unicode characters in text files by mnemonic names delimited by %[ and ].
Here's an extremely stripped down example that just replaces %[a] and %[b].
#! /usr/bin/env perl
use strict;
use warnings;
use utf8;
use open ':std' => ':utf8';
my %abbrevs = (
    'a' => "\N{GREEK SMALL LETTER ALPHA}",
    'b' => "\N{GREEK SMALL LETTER BETA}",
);
while (<>) {
    chomp;
    my $line = $_;
    $line =~ s/(\%\[(.*?)\])/$abbrevs{$2}/g;
    print "${line}\n";
}

Quoting the charnames documentation:
Starting in Perl v5.16, any occurrence of \N{CHARNAME} sequences in a double-quotish string automatically loads this module with arguments :full and :short (described below) if it hasn't already been loaded with different arguments
One of those "different arguments" requests the use of loose matching.
$ perl -CSD -e'
use charnames ":loose";
CORE::say "\N{Greek Small Letter Alpha}";
'
α
LOOSE MATCHES
By specifying :loose, Unicode's loose character name matching rules are selected instead of the strict exact match used otherwise. That means that CHARNAME doesn't have to be so precisely specified. Upper/lower case doesn't matter (except with scripts as mentioned above), nor do any underscores, and the only hyphens that matter are those at the beginning or end of a word in the name (with one exception: the hyphen in U+1180 HANGUL JUNGSEONG O-E does matter). Also, blanks not adjacent to hyphens don't matter. The official Unicode names are quite variable as to where they use hyphens versus spaces to separate word-like units, and this option allows you to not have to care as much. The reason non-medial hyphens matter is because of cases like U+0F60 TIBETAN LETTER -A versus U+0F68 TIBETAN LETTER A. The hyphen here is significant, as is the space before it, and so both must be included.
:loose slows down look-ups by a factor of 2 to 3 versus :full, but the trade-off may be worth it to you. Each individual look-up takes very little time, and the results are cached, so the speed difference would become a factor only in programs that do look-ups of many different spellings, and probably only when those look-ups are through vianame() and string_vianame(), since \N{...} look-ups are done at compile time.
The module also provides the means for creating custom aliases.
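To tie this back to the question, here is a minimal sketch of the mnemonic table written with mixed-case names under :loose. The helper name expand_abbrevs and the choice to leave unknown mnemonics untouched are assumptions made for illustration, not part of the original script. A misspelled character name should still abort compilation, since \N{...} is resolved at compile time.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use charnames ':loose';   # allow mixed-case names inside \N{...}

# Mnemonic table; the names are looked up when the script is compiled.
my %abbrevs = (
    a => "\N{Greek Small Letter Alpha}",
    b => "\N{Greek Small Letter Beta}",
);

# Replace each %[name] with its character.
# Assumption: unknown mnemonics are left in the text unchanged.
sub expand_abbrevs {
    my ($line) = @_;
    $line =~ s{%\[(.*?)\]}{ exists $abbrevs{$1} ? $abbrevs{$1} : '%[' . $1 . ']' }ge;
    return $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print expand_abbrevs('%[a] and %[b], but not %[zz]'), "\n";
```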

Related

substituting spaces for underscores using lookaheads in perl

I have files with many lines of the following form:
word -0.15636028 -0.2953045 0.29853472 ....
(one word preceding several hundreds floats delimited by blanks)
Due to some errors out of my control, the word sometimes has spaces in it.
a bbb c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
which I wish to substitute by underscores so to get:
a_bbb_c -0.15636028 -0.2953045 0.29853472 .... (several hundreds floats)
I have tried the following substitution on each line:
s/\s(?=(\s-?\d\.\d+)+)/_/g;
So lookarounds are apparently not the solution.
I'd be grateful for any clues.
Your idea for the lookahead is fine, but the question is how to replace only spaces in the part matched before the lookahead, when they are mixed with other things (the words, that is).
One way is to capture what precedes the first float (given by lookahead), and in the replacement part run another regex on what's been captured, to replace spaces
s{ (.*?) (?=\s+-?[0-9]+\.[0-9]) }{ $1 =~ s/\s+/_/gr }ex
Notes
Modifier /e makes the replacement part be evaluated as code; any valid Perl code goes
With s{}{} delimiters we can use s/// ones in the replacement part's regex
The regex in the replacement part, which changes spaces to _ in the captured text, has the /r modifier so that it returns the modified string and leaves the original unchanged. Thus we aren't attempting to change $1 (it's read-only), and the modified string (being returned) is available as the replacement
Modifier /x allows use of spaces in patterns, for readability
Some assumptions must be made here. Most critical one is that the text to process is followed by a number in the given format, -?[0-9]+\.[0-9]+, and that there isn't such a number in the text itself. This follows the OP's sample and, more decidedly, the attempted solution
A couple of details with assumptions. (1) Leading digits are expected with [0-9]+\. -- if you can have numbers like .123 then use [0-9]*\. (2) The \s+ in the inner regex collapses multiple consecutive spaces into one _, so "a  b c" (two spaces after the a) becomes a_b_c (and not a__b_c)
In the lookahead I scoop up all spaces preceding the first float with \s+ -- and so they'll stay in front of the first float. This is as wanted with one space but with multiple ones it may be awkward
If they were included in the .*? capture (if the lookahead only has one space, \s) then we'd get an _ trailing the word(s). I thought that'd be more awkward. The ideal solution is to run another regex and clean that up, if such a case is possible and if it's a bother
An example
echo "a bbb c -0.15636028 -0.2953045" |
perl -wpe's{(.*?)(?=\s+-?[0-9]+\.[0-9])}{ $1 =~ s/\s+/_/gr }e'
prints
a_bbb_c -0.15636028 -0.2953045
Then to process all lines in a file you can do either
perl -wpe'...' file > new_file
and get a new_file with changes, or
perl -i.bak -wpe'...' file
to change the file in-place (that's -i), where .bak makes it save a backup.
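For use inside a script rather than on the command line, the same substitution can be wrapped in a function. This is only a sketch: the name join_leading_words is made up here, the ^ anchor is an addition so that only the start of the line is considered, and it assumes Perl 5.14 or later (for /r) and that no word in the name looks like a float.

```perl
use strict;
use warnings;

# Join the words that precede the first float with underscores.
# Assumes the line does not start with whitespace.
sub join_leading_words {
    my ($line) = @_;
    # Capture everything before the first float, then rewrite the
    # whitespace inside the captured text only.
    $line =~ s{ ^(.*?) (?=\s+-?[0-9]+\.[0-9]) }{ $1 =~ s/\s+/_/gr }ex;
    return $line;
}

print join_leading_words('a bbb c -0.15636028 -0.2953045'), "\n";
```

Lines without a float are returned unchanged, because the lookahead never matches.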
Would something like this work for you:
s/\s+/_/g;
s/_(-?\d+\.)/ $1/g;
Use a negative lookahead to replace any spaces not followed by a float:
echo "a bbb cc -0.123232 -0.3232" | perl -wpe 's/ +(?! *-?\d+\.)/_/g'
Assuming from your comments your files look like that:
name float1 float2 float3
a bbb c -0.15636028 -0.2953045 0.29853472
abbb c -0.15636028 -0.2953045 0.29853472
a bbbc -0.15636028 -0.2953045 0.29853472
ab bbc -0.15636028 -0.2953045 0.29853472
abbbc -0.15636028 -0.2953045 0.29853472
Since you said in comments that the first field may contain digits, you can't use a lookahead that searches for the first float to solve the problem. (You can nevertheless use a lookahead that counts the number of floats until the end of the line, but it isn't very handy.)
What I suggest is a solution based on the number of fields defined by the header line.
You can use the header line to know the number of fields and replace spaces at the beginning of the other lines until the number of fields is the same.
You can use the perl command line like awk:
perl -MEnglish -pae'$c=scalar @F if ($NR==1);for($i=0;$i<scalar(@F)-$c;$i++){s/\s+/_/}' file
The for loop counts the difference between the number of fields in the first row (stored in $c) and in the current line (given by scalar(@F) where @F is the fields array), and repeats the substitution.
The -a switch puts the perl command line in autosplit mode and -MEnglish makes the row number variable available as $NR (like the NR variable in awk).
It's possible to shorten it like this:
perl -pae'$c=@F if $.<2;$i=@F-$c;s/\s+/_/ while $i--' file
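The header-count idea can also be written as a small function, which may be easier to read than the one-liner. This is a sketch under assumptions: the name join_extra_fields is invented for illustration, the first line passed in is taken to be the header, and lines are assumed not to start with whitespace.

```perl
use strict;
use warnings;

# Replace leading runs of whitespace with underscores until each line
# has as many fields as the header (the first line).
sub join_extra_fields {
    my @lines = @_;
    my $expected;                          # field count of the header
    for my $line (@lines) {
        my @fields = split ' ', $line;
        $expected //= scalar @fields;      # set once, from the first line
        my $extra = @fields - $expected;   # surplus fields on this line
        $line =~ s/\s+/_/ for 1 .. $extra; # one substitution per surplus field
    }
    return @lines;
}

print "$_\n" for join_extra_fields(
    'name float1 float2 float3',
    'a bbb c -0.15636028 -0.2953045 0.29853472',
    'ab bbc -0.15636028 -0.2953045 0.29853472',
);
```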

Why can't I declare a variable name that consists of all digits?

Why can't I declare a variable name that consists of all digits, e.g. my $123;?
I know that $1 is a special variable related to regexes. But $a is also a special variable (related to sort), and I can declare it in my code like this:
my $a;
Why can I do that but not
my $1;
or
my $123;
This is documented in the perldoc called perlvar.
Variable names in Perl can have several formats. Usually, they must begin with a letter or underscore, in which case they can be arbitrarily long (up to an internal limit of 251 characters) and may contain letters, digits, underscores, or the special sequence :: or '. [...]
Perl variable names may also be a sequence of digits or a single punctuation or control character. These names are all reserved for special uses by Perl; for example, the all-digits names are used to hold data captured by backreferences after a regular expression match. [...]
Taking this into account, these names are valid:
$a
$foo
$_
$_123
$_foo_bar_baz_
$_____
However, that doesn't mean it makes sense having them.
It's always a good idea to name a variable after what it contains. It is very hard to guess what a variable called $99 might contain. Maybe it's the 99th of something, in which case it should be an array index. Or it could be 99, in which case you could use a constant (99).
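A quick way to see which declarations Perl accepts is to compile each one in a string eval, so that a rejected name is reported instead of aborting the script. The helper name declares_ok is made up for this sketch.

```perl
use strict;
use warnings;

# Return 1 if the declaration compiles, 0 if Perl rejects it.
sub declares_ok {
    my ($decl) = @_;
    return eval("$decl 1") ? 1 : 0;
}

for my $decl ('my $a;', 'my $_123;', 'my $1;', 'my $123;') {
    printf "%-10s %s\n", $decl, declares_ok($decl) ? 'compiles' : 'rejected';
}
```

After a rejected declaration, $@ holds the compiler's error message, which is useful if you want to print the reason as well.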

Split function returns weird characters

I am facing a problem with a script I want to make. In short, I am connecting to a local database with DBI and executing some queries. This works just fine, and the values returned from select queries print correctly. But when I split, say, $firstName into an array and print out the array, I get weird characters. Note that all the fields in the table I am working with contain only Greek characters and are utf8_general_ci. I played around with use utf8, use encoding, binmode, encode etc., but the split function still returns weird characters, while before the split the whole Greek word was printed fine. I suppose this is due to some missing pragma about string encoding or something similar, but I really can't find the solution. Thanks in advance.
Here is the piece of code I am describing. Perl version is v5.14.2
@query = &DatabaseSubs::getStringFromDb();
print "$query[1]\n"; # prints the greek name fine
@chars = split('',$query[1]);
foreach $chr (@chars) {
    print "$chr \n"; # prints weird chars
}
And here is the output from print and foreach respectively.
By default, Perl assumes that you are working with single-byte characters. But you aren't, in UTF8 the Greek characters that you are using are two-bytes in size. Therefore split is splitting your characters in half and you're getting strange characters.
You need to decode your bytes into characters as they come into your program. One way to do that would be like this.
use Encode;
my @query = map { decode_utf8($_) } DatabaseSubs::getStringFromDb();
(I've also removed the unnecessary and potentially confusing '&' from the subroutine call.)
Now @query contains properly decoded character strings and split will split into individual characters correctly(*).
But if you print one of these characters, you'll get a "wide character" warning. That's because Perl's I/O layer expects single-byte characters. You need to tell it to expect UTF8. You can do that like this:
binmode STDOUT, ':utf8';
There are other improvements that you could consider. For example, you could probably put the decoding into the getStringFromDb subroutine. I recommend reading perldoc perluniintro and perldoc perlunicode for more details.
(*) Yes, there's another whole level of pain lurking when you get into two-character graphemes, but let's ignore that for now.
Your data is in utf8, but perl doesn't know that, so each perl character is just one byte of the multibyte characters that are stored in the database.
You tell perl that the data is in fact utf8 with:
utf8::decode($query[1]);
(though most database drivers provide a way to automate this before you even see the data in your code). Once you've done this, split will properly operate on the actual characters. You probably then need to also set your output filehandle to expect utf8 characters, or it will try to downgrade them to an 8-bit encoding.
The issue is that split('', $word) splits on every byte, whereas in UTF-8 a character can span multiple bytes. For ASCII characters (code points below 128) this is fine, but anything from 128 up is represented as multiple bytes. You're essentially printing half of the character's encoding, hence it looks like garbage.
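The difference is easy to see with a hard-coded byte string standing in for the database value (the bytes below are the UTF-8 encoding of three Greek letters; the variable names are illustrative only). Splitting before decoding yields one piece per byte, splitting after decoding yields one piece per character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for alpha, beta, gamma, as they might arrive from the driver.
my $bytes = "\xce\xb1\xce\xb2\xce\xb3";

my @byte_pieces = split //, $bytes;          # splits between bytes
my $text        = decode('UTF-8', $bytes);   # bytes -> characters
my @char_pieces = split //, $text;           # splits between characters

binmode STDOUT, ':encoding(UTF-8)';          # avoid "Wide character" warnings
print scalar(@byte_pieces), " pieces before decoding\n";
print scalar(@char_pieces), " pieces after decoding: @char_pieces\n";
```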

What is the space character in Perl?

I want to have this output
"type":"test test"
I don't want to type a literal space in the command, as in " ".
I want to have a character that represents a single space, and I know that I cannot use \s.
Is there something I can use?
print "\"type\":\"test(space character should be here )test\"";
The best solution really is to use the space key directly:
print q("type":"test test"); # Changed delimiter for fewer escapes
But if you want to, you can also use
\x20, the ASCII code for “Space”, or
\N{SPACE}, the Unicode charname for the normal space (there are many more).
\N{U+0020}, the Unicode codepoint for the normal space.
…
Note that these only work in double-quoted strings, e.g. qq(...).
If you want to go the charname route, then you have to load the charnames module prior to Perl 5.16:
use charnames ':full';
Since 5.16, that module is loaded automatically once an \N{...} escape is found.
By default, the $" variable contains a single space – the contents of this variable are used to concatenate the contents of an array when it is interpolated.
Just use the space key:
print "\"type\":\"test test\"";
#                     ^
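As a quick check that the spellings mentioned in the answers all produce the same string, here is a small sketch. The hash and its labels are made up for illustration. The use charnames line is only needed before Perl 5.16, and is harmless afterwards.

```perl
use strict;
use warnings;
use charnames ':full';   # required for \N{SPACE} before Perl 5.16

# Each value should be the same nine-character string.
my %spellings = (
    'literal space' => "test test",
    '\x20'          => "test\x20test",
    '\N{SPACE}'     => "test\N{SPACE}test",
    '\N{U+0020}'    => "test\N{U+0020}test",
    '$"'            => 'test' . $" . 'test',   # list separator, a space by default
);

for my $label (sort keys %spellings) {
    my $same = $spellings{$label} eq 'test test' ? 'same' : 'different';
    print "$label: $same\n";
}
```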

Perl: Is quotemeta for regular expressions only? Is it safe for file names?

While answering this question regarding safe escaping of filename with spaces (and potentially other characters), one of the answers said to use Perl's built-in quotemeta function.
The documentation of quotemeta states:
quotemeta (and \Q ... \E ) are useful when interpolating strings
into regular expressions, because by default an interpolated variable
will be considered a mini-regular expression.
In the documentation for quotemeta, the only mention of its use is to escape all the characters other than /[A-Za-z_0-9]/ with a \ for use in a regex. It does not state the use for filenames. This does seem like a very pleasant, if undocumented, side effect however.
In a comment to Sinan Ünür answer to the earlier question, hobbs states:
shell escaping is different from
regexp escaping, and although I can't
come up with a situation where
quotemeta would give a truly unsafe
result, it's not meant for the task.
If you must escape, instead of
bypassing the shell, I suggest trying
String::ShellQuote which takes a more
conservative approach using sh single
quotes to defang everything except
single quotes themselves, and
backslashes for single quotes. – hobbs
Aug 13 '09 at 14:25
Is it safe -- completely -- to use quotemeta in place of more conservative file quoting like String::ShellQuote? Is quotemeta safe for UTF-8 or other multibyte characters?
I put together a test, but the results are unclear. quotemeta works well, it seems, except for a file name or directory name with a \n or \r in it. While rare, these characters are legal in Unix and I have seen them. Recall that certain characters, such as LF, CR and NUL, cannot be escaped with \. I read my hard drive with 700k files with quotemeta and had no failures.
I have a suspicion (though I have not demonstrated it yet) that quotemeta might fail with multibyte characters where one or more of the bytes falls into the ASCII range. For example, à can be encoded as one character (UTF-8 C3 A0) or as two characters (U+0061 is a, and U+0300 is a combining grave accent). The only demonstrated failure I have with quotemeta is with files with a \n or \r in the path that I created. I would be interested in other characters to put in nasty_names to test.
ShellQuote works perfectly on all file names except those terminated by a NUL when creating a file. I have never ever had a failure with it.
So what to use? Just to be clear: shell quoting is not something I do often, since I usually just use Perl open to open a pipe to a process. That method does not suffer the shell issues discussed. I am interested since I have seen quotemeta used often for file name escaping.
(Thanks to Ether I have added IPC::System::Simple)
Test file:
use strict; use warnings; use autodie;
use String::ShellQuote;
use File::Find;
use File::Path;
use IPC::System::Simple 'capturex';
my @nasty_names;
my $top_dir = '/Users/andrew/bin/pipetestdir/testdir';
my $sub_dir = "easy_to_remove_me";
my (@qfail, @sfail, @ipcfail);
sub wanted {
    if ($File::Find::name) {
        my $rtr;
        my $exec1="ls ".quotemeta($File::Find::name);
        my $exec2="ls ".shell_quote($File::Find::name);
        my @exec3= ("ls", $File::Find::name);
        $rtr=`$exec1`;
        push @qfail, "$exec1"
            if $rtr=~/^\s*$/ ;
        $rtr=`$exec2`;
        push @sfail, "$exec2"
            if $rtr=~/^\s*$/ ;
        $rtr = capturex(@exec3);
        push @ipcfail, \@exec3
            if $rtr=~/^\s*$/ ;
    }
}
chdir($top_dir) or die "$!";
mkdir "$top_dir/$sub_dir";
chdir "$top_dir/$sub_dir";
push @nasty_names, "name with new line \n in the middle";
push @nasty_names, "name with CR \r in the middle";
push @nasty_names, "name with tab\tright there";
push @nasty_names, "utf \x{0061}\x{0300} combining diacritic";
push @nasty_names, "utf e̋ alt combining diacritic";
push @nasty_names, "utf e\x{cc8b} alt combining diacritic";
push @nasty_names, "utf άέᾄ greek";
push @nasty_names, 'back\slashes\\Not\\\at\\\\end';
push @nasty_names, qw|back\slashes\\IS\\\at\\\\end\\\\|;
sub create_nasty_files {
    for my $name (@nasty_names) {
        open my $fh, '>', $name ;
        close $fh;
    }
}
for my $dir (@nasty_names) {
    chdir("$top_dir/$sub_dir");
    mkpath($dir);
    chdir $dir;
    create_nasty_files();
}
find(\&wanted, $top_dir);
print "\nquotemeta failed on:\n", join "\n", @qfail;
print "\nShell Quote failed on:\n", join "\n", @sfail;
print "\ncapturex failed on:\n", join "\n", @ipcfail;
print "\n\n\n",
    "Remove \"$top_dir/$sub_dir\" before running again...\n\n";
Quotemeta is safe under these assumptions:
1. Only non-alphanumeric characters have a special meaning.
2. If a non-alphanumeric character has a special meaning, putting a backslash in front of it will always make it non-special.
3. If a non-alphanumeric character doesn't have a special meaning, putting a backslash in front of it will do nothing.
The shell violates rules 2 and 3 no matter what quote context you use -- outside of quotes, backslash-newline doesn't generate newline; in double-quotes, backslash-punctuation puts a backslash into the output (outside of a certain list of punctuation); and in single-quotes, everything is literal and backslash doesn't even protect you against a closing single-quote.
I still recommend String::ShellQuote if you need to quote things for the shell. I also recommend avoiding letting the shell process your filenames entirely, if you can, by using LIST-form system/exec/open or IPC::Open2, IPC::Open3, or IPC::System::Simple.
As for things besides the shell... lots of different things violate one or more of the rules. For example, obsolete POSIX "basic" regexes and various kinds of editor regexes have punctuation characters that are non-special by default, but become special when preceded by backslash. Basically what I'm saying is, know the thing that you're feeding your data to very well, and escape properly. Only use quotemeta if it's an exact fit, or if you're using it for something that's not very important.
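As a sketch of the LIST-form approach recommended above, the following uses core Perl's list-form pipe open, which passes each argument to the program directly without involving a shell. The helper name capture_no_shell is hypothetical, and the example assumes a Unix-like system with an external printf program on the PATH.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command without a shell and return its output.
sub capture_no_shell {
    my @cmd = @_;
    # With more than three arguments, open runs the command directly.
    open(my $fh, '-|', @cmd) or die "cannot run $cmd[0]: $!";
    local $/;                          # read all output at once
    my $out = <$fh>;
    close $fh or die "$cmd[0] exited with status $?";
    return defined $out ? $out : '';
}

# A name containing a newline, an ampersand and single quotes.
my $name   = "odd name\nwith & 'quotes'";
my $echoed = capture_no_shell('printf', '%s', $name);
print $echoed eq $name ? "argument arrived intact\n" : "argument was altered\n";
```

Because no shell parses the command, there is nothing to quote or escape in the first place.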
You could also use IPC::System::Simple capture() or capturex() (which I suggested in another answer on that first question), which will let you bypass the shell.
I added these lines to your script and found that no examples failed:
use IPC::System::Simple 'capturex';
...
my (@qfail, @sfail, @ipcfail);
...
my @exec3= ("ls", $File::Find::name);
...
$rtr = capturex(@exec3);
push @ipcfail, \@exec3
    if $rtr=~/^\s*$/ ;
...
print "\ncapturex failed on:\n", join "\n", @ipcfail;
But in general, you should solve the actual problem, rather than attempting to find better band-aids. quotemeta is intended specifically to escape regular expression-significant characters, which as you have discovered is not a perfect overlap with the set of characters that are significant to the shell.
The following is a Unix-only solution; see https://stackoverflow.com/a/32161361/45375 for Windows support.
An alternative is this simple function, which should work robustly even with non-ASCII characters (assuming the correct encoding), as well as \n, and \r, but excluding NUL (see bottom).
sub quoteforsh { join ' ', map { "'" . s/'/'\\''/gr . "'" } @_ }
The function encloses each argument in single-quotes and, if multiple arguments were specified, separates them with spaces.
Single-quoted strings are used because their contents are not subject to any interpretation in POSIX-like shells.
As such, however, you cannot even escape ' instances themselves, which requires the following workaround: every embedded ' instance is replaced with '\'' (sic), which effectively splits the input string into multiple single-quoted strings, with escaped ' instances - \' - spliced in - the shell then reassembles the string parts into a single string.
Example:
print quoteforsh 'I\'m here & wëll';
literally produces (including the enclosing single-quotes) 'I'\''m here & wëll', which, to the shell, are 3 contiguous strings - 'I', \', and 'm here & wëll' - which the shell then reassembles into a single string, which, after quote removal, yields I'm here & wëll.
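To compare the two quoting styles on a name containing a newline, here is a small sketch that sends the quoted name through the shell and looks at what arrives. The function quoteforsh is a spelled-out equivalent of the one-liner above; through_shell is a made-up helper, and the example assumes a POSIX-like /bin/sh with printf available.

```perl
use strict;
use warnings;

# Spelled-out equivalent of the one-line quoteforsh.
sub quoteforsh {
    return join ' ', map {
        my $arg = $_;
        $arg =~ s/'/'\\''/g;       # each ' becomes '\''
        "'" . $arg . "'";
    } @_;
}

# Hypothetical helper: hand the quoted text to the shell via backticks
# and return whatever printf received as its argument.
sub through_shell {
    my ($quoted) = @_;
    my $cmd = "printf %s $quoted";
    return scalar `$cmd`;
}

my $name = "line one\nline two & it's";
my $via_single_quotes = through_shell(quoteforsh($name));
my $via_quotemeta     = through_shell(quotemeta($name));

print "single quotes: ", ($via_single_quotes eq $name ? "intact" : "altered"), "\n";
print "quotemeta:     ", ($via_quotemeta     eq $name ? "intact" : "altered"), "\n";
```

With quotemeta the newline is preceded by a backslash, which the shell treats as a line continuation and removes, so the two lines are joined.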
OSX Unicode caveat: HFS+ stores filenames in NFD (decomposed Unicode normal form: a base letter followed by another character that is the associated diacritic), whereas Perl typically creates NFC (composed Unicode normal form: a single character identifies the accented letter).
When using literal filenames, this distinction doesn't matter (the system calls do the mapping), but when using globs, it does, and, unfortunately, you have to do your own translation between the two forms.
Support for NUL (0x0) chars.:
I don't think NUL chars. in filenames are a real-world concern:
Most POSIX-like shells (bash, dash, ksh) ignore NUL chars. on the command line - zsh being the only exception.
Even if that weren't an issue, according to Wikipedia, most Unix systems do not support NUL chars. in filenames.
Besides, trying to pass a literal with a NUL to Perl's system() function breaks the invocation, presumably, because the string passed to sh -c is cut off at the first NUL:
system "echo 'a\x{0}b'"; # BREAKS