how unpack function will work in perl for this code $str =~ s/([^\w ])/'%'.unpack('H2', $1)/eg; - perl

i have a code in perl $str =~ s/([^\w ])/'%'.unpack('H2', $1)/eg; i am not undestanding what value will be stored in $str

Assuming $str is encoded using UTF-8, and assuming the code you provided is followed by $str =~ s/ /+/g, the result is a url-encoded string safe for use in URLs.
Specifically, the line of code in question replaces every non-word except spaces with a three character sequence starting with % and followed by two hex digits representing the character number.
For example,
foo's ⇒ foo%27s
20% ⇒ 20%25
A better solution would be to use uri_escape (for strings encoded using UTF-8) or uri_escape_utf8 (for strings of Unicode Code Points aka decoded strings) from URI::Escape.

Provided line of code modifies $str value according substitute rule set s/([^\w ])/'%'.unpack('H2', $1)/eg.
How does it work:
[^\w] - look at $str for character not \w known as complement to \w
\w - represents range [A-za-z0-9_], punctuation chars and Unicode marks see perlre
([^\w]) capture found character, 'store' it in $1
regex modifier e evaluates '%'.unpack('H2',$1) as substitution string
unpack('H2',$1) - unpack $1 with template 'H2' (hex representation of byte associated with $1)
take '%' and concatenate it with unpacked result
use result from step 6 as replacement string
regex modifier g instructs to make this operation for all occurrences in the $str
Without knowing initial $str value before this operation, impossible to evaluate final result.
If initial value is known then you can evaluate result by visiting https://regex101.com/ website.
Nothing could speak louder than sample code demonstrating transformation
use feature 'say';
$msg = "Date: Mar 6 2020, Msg: soon Alex's birthday";
$msg =~ s/([^\w ])/'%'.unpack('H2', $1)/eg;
say $msg;
Output
Date%3a Mar 6 2020%2c Msg%3a soon Alex%27s birthday
Following code demonstrates how "Hello World\n" will look as hex representation (for Dada).
use feature 'say';
my $msg = "Hello World!\n";
print $msg;
my $a = unpack('H*',$msg);
say $a;
Output
Hello World!
48656c6c6f20576f726c64210a

You could start by trying it out and seeing if that gives you a hint.
$ perl -E'$str = "&*("; $str =~ s/([^\w ])/"%".unpack('H2', $1)/eg; say $str'
%26%2a%28
So, we have a substitution operator that looks like this:
s/PATTERN/REPLACEMENT/OPTIONS
Our pattern is ([^\w ]) which means "match every individual character that isn't a 'word character' or a space and capture that character in $1.
The replacement string is "%".unpack('H2', $1). Which means "the character '%' followed by the result of running unpack('H2', $1). unpack() here is being used to convert characters to the hexadecimal equivalent of their ASCII code. "H" means "convert to hex" and "2" means produce two hex digits".
The options are /e which means "run this code and use the output as the replacement string" and /g which means "do this for every match in the input string".
Putting that all together, you have code that:
Looks for non-word characters
Converts them to their hexadecimal escape code
Replaces them in the string
Using URI::Escape is probably a better approach.

Related

How can I escape a string in Perl for LDAP searching?

I want to escape a string, per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by backslash.
Has to be done in Perl. I already know how to do it in Python, C++, Java, etc., but Perl if baffling.
Also, I cannot use Net::LDAP and I may not be able to add any new modules, so, I want to do it with basic Perl features.
Skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not codepoints. So, something that works with non-ASCII text too:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
sub valueencode ($) {
# Unpack format returns octets of UTF-8 encoded text
my #bytes = unpack "U0C*", $_[0];
sprintf '\%02x' x #bytes, #bytes;
}
say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples
Example:
$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87
Or an alternative using the vector flag:
use Encode qw/encode/;
sub valueencode ($) {
sprintf '\%*vx', "\\", encode('UTF-8', $_[0]);
}
Finally, a smarter version that only escapes ASCII characters when it has to (And multi-byte characters, even though upon a closer read of the RFC they don't actually need to be if they're valid UTF-8):
# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that input
# is a ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
my $encoded = "";
for my $byte (unpack 'U0C*', $_[0]) {
if (($byte >= 0x01 && $byte <= 0x27) ||
($byte >= 0x2B && $byte <= 0x5B) ||
($byte >= 0x5D && $byte <= 0x7F)) {
$encoded .= chr $byte;
} else {
$encoded .= sprintf '\%02x', $byte;
}
}
return $encoded;
}
This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
In short, one way is just as the question says: split the string into characters, get their ordinals then convert format to hex; then put it back together. I don't know how to get the \nn format so I'd make it 'by hand'. For instance
my $s = join '', map { sprintf '\%x', ord } split //, 'u1';
Or use vector flag %v to treat the string as a "vector" of integers
my $s = sprintf '\%*vx', '\\', 'u1';
With %v the string is broken up into numerical representation of characters, each is converted (%x), and they're joined back, with . between them. That (optional) * allows us to specify our string by which to join them instead, \ (escaped) here.
This can also be done with pack + unpack, see the link below. Also see that page if there is a wide range of input characters.†
See ord and sprintf, and for more pages like this one.
† If there is non-ASCII input then you may need to encode it so to get octets, if they are to escape (and not whole codepoints)
use Encode qw(encode);
my $s = sprintf '\%*vx', '\\', encode('UTF_8', $input);
See the linked page for more.

perl - matching at even positions and remove non-printable chars

I have a hex2string from database table dump that is like
"41424320202020200A200B000C"
what I want to do is to match at the even positions and detect the control chars that could break the string when printed.. i.e remove ascii null \x00, \n, \r, \f and \x80 to \xFF, etc..
I tried removing ascii null like
perl -e ' $x="41424320202020200A200B000C"; $x=~s/00//g; print "$x\n" '
but the result is incorrect as it removed 0 from trailing hex value of space \x20 and leading 0 of newline \x0A i.e 20 0A to 2A
414243202020202A2B0C
what i wanted is
414243202020202020
say unpack("H*", pack("H*", "41424320202020200A200B000C") =~ s/[^\t[:print:]]//arg);
or
my $hex = "41424320202020200A200B000C";
my $bytes = pack("H*", $hex);
$bytes =~ s/[^\t[:print:]]//ag;
$hex = unpack("H*", $bytes);
say $hex;
or
my $hex = "41424320202020200A200B000C";
my $bytes = pack("H*", $hex);
$bytes =~ s/[^\t\x20-\x7E]//g;
$hex = unpack("H*", $bytes);
say $hex;
Solutions using /a and /r require Perl 5.14+.
The above starts with the following string:
41424320202020200A200B000C
It is converted into the following using pack:
ABC␠␠␠␠␠␊␠␋␀␌
The substitution removes all non-ASCII and all non-printable characters except TAB, leaving us with the following:
ABC␠␠␠␠␠␠
It is converted into the following using unpack:
414243202020202020
This solution is not only shorter than the previous solutions, it is also faster because it allocates far fewer variables and only starts the regex match once.
detect the control chars that could break the string when printed.. i.e remove ascii null \x00, \n, \r, \f and \x80 to \xFF, etc..
Building on Hakon's answer (Which only strips out nul bytes, not all the other ones):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my $x="41424320202020200A200B000C";
say $x;
say grep { chr(hex($_)) =~ /[[:print:]\t]/ && hex($_) < 128 } unpack("(A2)*", $x);
gives you
41424320202020200A200B000C
414243202020202020
The character class [:print:] inside a character set matches all printable characters including space (but not control characters like newline and linefeed), and I added in tab as well. Then it also checks to make sure the byte is in the ASCII range (Since higher characters are still printable in many locales).
It is possible to work directly with the hex form of the characters, but it's far more complicated. I recommend against using this approach. This answer serves to illustrate why this solution wasn't proposed.
You wish to exclude all characters except the following:
ASCII printables (2016 to 7E16)
TAB (0916)
That means you wish to exclude the following characters:
0016 to 0816
0A16 to 1F16
7F16 to FF16
If we group these by leading digits, we get
0016 to 0816, 0A16 to 0F16
1016 to 1F16
7F16
8016 to FF16
We can therefore use the following:
$hex =~ s/\G(?:..)*?\K(?:0[0-8A-Fa-f]|7F|[189A-Fa-f].)//sg; # 5.10+
$hex =~ s/\G((?:..)*?)(?:0[0-8A-Fa-f]|7F|[189A-Fa-f].)/$1/sg; # Slower
You can try split the string into 2 bytes substrings using unpack:
my $x="41424320202020200A200B000C";
say $x;
say join '', grep { $_ !~ /00/} unpack "(A2)*", $x;
Output:
41424320202020200A200B000C
41424320202020200A200B0C

How to insert a colon between word and number

I want to insert a colon between word and number then add a new line after a number.
For example:
"cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011";
my expected output:
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
My attempt :
$Bday=~ /^([a-z]||\_)/:/^([0-9])/
print "\n";
#!/usr/bin/perl
use warnings;
use strict;
my $str = "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011";
$str =~ s/\s*([a-z_]+)((?: \d+)+)/$1:$2\n/g;
print $str;
produces your desired output from your sample input.
Edit: Note the use of the s operator for regular expression substitution. One of the many problems with your code is that you're not using that (IF your intent is to modify the string in place and not extract bits from it for further processing)
One more variant -
> cat test_perl.pl
#!/usr/bin/perl
use strict;
use warnings;
while ( "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011" =~ m/([a-z_]+)\s+([0-9 ]+)/g )
{
print "$1:$2\n";
}
> test_perl.pl
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
>
The original code $Bday=~ /^([a-z]||\_)/:/^([0-9])/ doesn't make much sense. Apart from missing a semicolon and having too many delimiters (matching patterns are of the format /.../ or m/.../ and replacing ones s/.../.../), it could never match anything.
([a-z]||\_) would match:
one lowercase ASCII letter (a through z);
an empty string (the space between the two |s; or
one underscore (escape with a backslash is superfluous).
To get it (or the corresponding subexpression for numbers) to match a sequence of one
or more of the characters, you need to follow it with a +.
^([0-9]) would fail to match unless it was at the beginning of the string. There it would match a single digit.
My solution (taking into account the later comments by the OP about having input such as cat[1] or dog3):
use strict;
use warnings;
my $bday = "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011 cat[1] 01012018 dog3 02012018";
# capture groups:
# $1------------------------\ $2-------------\
$bday =~ s/([A-Za-z][A-Za-z0-9_\[\]]*)\h+(\d+(?:\h+\d+)*)(?!\S)\s*/$1:$2\n/g;
print $bday;
will print out:
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
cat[1]:01012018
dog3:02012018
Breakdown:
[A-Za-z]: Begin with a letter.
[A-Za-z0-9_\[\]]*: Follow with zero or more letters, numbers, underscores and square brackets.
\h+: Separate with one or more horizontal whitespace.
\d+(?:\h+\d+)*: One sequence of digits (\d+) followed by zero or more sequences of horizontal whitespace and digits.
(?!\S): Can't be followed by non-whitespace.
\s*: Consume following whitespace (including line feeds; this allows the input to be separated on multiple lines, as long as a single entry is not spread on multiple lines. To get that, replace all the \h+ with \s+.).
The replace pattern will repeat (the /g modifier) sequentially in the source string as long as it matches, placing each heading-date record on its own line and then proceeding with the rest of the string.
Note that if your headers (dog etc.) might contain non-ASCII letters, use \pL or \p{XPosixAlpha} instead of [A-Za-z]:
$bday =~ s/\pL[\pL0-9_\[\]]*)\h+(\d+(?:\h+\d+)*)(?!\S)\s*/$1:$2\n/g;

Replace returns with spaces and commas with semicolons?

I want to be able to be able to replace all of the line returns (\n's) in a single string (not an entire file, just one string in the program) with spaces and all commas in the same string with semicolons.
Here is my code:
$str =~ s/"\n"/" "/g;
$str =~ s/","/";"/g;
This will do it. You don't need to use quotations around them.
$str =~ s/\n/ /g;
$str =~ s/,/;/g;
Explanation of modifier options for the Substitution Operator (s///)
e Forces Perl to evaluate the replacement pattern as an expression.
g Replaces all occurrences of the pattern in the string.
i Ignores the case of characters in the string.
m Treats the string as multiple lines.
o Compiles the pattern only once.
s Treats the string as a single line.
x Lets you use extended regular expressions.
You don't need to quote in your search and replace, only to represent a space in your first example (or you could just do / / too).
$str =~ s/\n/" "/g;
$str =~ s/,/;/g;
I'd use tr:
$str =~ tr/\n,/ ;/;

Perl unicode hash key lookup

I am so confused by the unicode and unicode in perl.
I got this hash from MySQL db.
my $hashFromDB = { "Ves\x{101}kha" => "some value" };
But I only know the key in this form of notation
my $key = "Ves\u0101kha";
How can I convert that \uXXXX notation to that \x{xxx} so that I can get the value with the key.
Thanks.
\x{} escape in Perl works almost exactly like \u escape in JS. You simply use code inside {}, exactly as in your first snippet:
my $key = "Ves\x{101}kha";
my $value = $hashFromDB->{$key};
If you have literal string with \u in it, which I assume means "\u followed with 4 hexadecimal digits to form codepoint number", just preprocess it with regexp that would replace such sequences with real characters with same code:
$key =~ s/\\u([a-fA-F0-9]{4})/chr(hex($1))/ge;
BTW, \u have different meaning in Perl - it upercases next symbol. You can check complete list of escapes in documentation.