How can I escape a string in Perl for LDAP searching?

I want to escape a string, per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by backslash.
It has to be done in Perl. I already know how to do it in Python, C++, Java, etc., but Perl is baffling.
Also, I cannot use Net::LDAP and I may not be able to add any new modules, so, I want to do it with basic Perl features.

Skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not codepoints. So, something that works with non-ASCII text too:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
sub valueencode ($) {
    # Unpack format returns octets of UTF-8 encoded text
    my @bytes = unpack "U0C*", $_[0];
    sprintf '\%02x' x @bytes, @bytes;
}
say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples
Example:
$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87
Or an alternative using the vector flag:
use Encode qw/encode/;
sub valueencode ($) {
    # The 0 flag and width 2 keep two hex digits for bytes below 0x10
    sprintf '\%0*v2x', "\\", encode('UTF-8', $_[0]);
}
Finally, a smarter version that escapes ASCII characters only when it has to. (It still escapes multi-byte characters, even though a closer read of the RFC shows they don't actually need to be escaped if they're valid UTF-8):
# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that input
# is an ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
    my $encoded = "";
    for my $byte (unpack 'U0C*', $_[0]) {
        if (($byte >= 0x01 && $byte <= 0x27) ||
            ($byte >= 0x2B && $byte <= 0x5B) ||
            ($byte >= 0x5D && $byte <= 0x7F)) {
            $encoded .= chr $byte;
        } else {
            $encoded .= sprintf '\%02x', $byte;
        }
    }
    return $encoded;
}
This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
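Since the RFC only requires NUL, the parentheses, the asterisk and the backslash to be escaped, a shorter variant can leave valid UTF-8 untouched. The following is a minimal sketch under that reading of the RFC. The name escape_filter_value is hypothetical, and the sketch assumes the input is a Perl character string rather than already-encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: escape only the five characters that MUST be
# escaped, and pass everything else through as UTF-8 octets.
sub escape_filter_value {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);    # character string -> octets
    $bytes =~ s/([\x00()*\\])/sprintf '\\%02x', ord $1/ge;
    return $bytes;                         # a byte string, not text
}

print escape_filter_value('a*(b)\\c'), "\n";    # prints a\2a\28b\29\5cc
```

The result is a byte string, so it should be written to the connection as-is and not encoded a second time.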

In short, one way is just as the question says: split the string into characters, get their ordinals, format each one as hex, and put it all back together. I don't know of a built-in for the \nn format, so I'd build it 'by hand'. For instance
my $s = join '', map { sprintf '\%02x', ord } split //, 'u1';
Or use the vector flag %v to treat the string as a "vector" of integers
my $s = sprintf '\%0*v2x', '\\', 'u1';
With %v the string is broken up into the numeric values of its characters, each one is formatted (as a zero-padded, two-digit %x here, so that values below 0x10 still get two hex digits), and the results are joined back together, with . between them by default. The (optional) * lets us supply our own joining string instead, here \ (escaped).
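To make the behaviour of the vector flag concrete, here is a small sketch that formats the same string several ways. The sample string is an assumption chosen to include a character below 0x10, and the flag order in the last format follows the `%0*v8b` example in the sprintf documentation.

```perl
use strict;
use warnings;

my $str = "u1\n";    # the newline has ordinal 10 (0x0a)

print sprintf('%vd',    $str), "\n";         # 117.49.10  (default "." separator)
print sprintf('%vx',    $str), "\n";         # 75.31.a
print sprintf('%*vx',   ':', $str), "\n";    # 75:31:a    (custom separator)
print sprintf('%0*v2x', ':', $str), "\n";    # 75:31:0a   (zero-padded to two digits)
```

The padded form matters whenever every character has to come out as exactly two hex digits.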
This can also be done with pack + unpack, see the link below. Also see that page if there is a wide range of input characters.†
See the documentation for ord and sprintf.
† If the input contains non-ASCII characters, you may need to encode it first to get octets, if it is the octets that should be escaped (and not whole code points)
use Encode qw(encode);
my $s = sprintf '\%0*v2x', '\\', encode('UTF-8', $input);
See the linked page for more.

Related

perl - matching at even positions and remove non-printable chars

I have a hex2string from database table dump that is like
"41424320202020200A200B000C"
What I want to do is match at the even positions and detect the control chars that could break the string when printed, i.e. remove ASCII NUL \x00, \n, \r, \f, \x80 to \xFF, etc.
I tried removing ascii null like
perl -e ' $x="41424320202020200A200B000C"; $x=~s/00//g; print "$x\n" '
but the result is incorrect: it removed the trailing 0 of the space's hex value (20) together with the leading 0 of the newline's (0A), so 20 0A became 2A
414243202020202A2B0C
What I wanted is
414243202020202020
say unpack("H*", pack("H*", "41424320202020200A200B000C") =~ s/[^\t[:print:]]//arg);
or
my $hex = "41424320202020200A200B000C";
my $bytes = pack("H*", $hex);
$bytes =~ s/[^\t[:print:]]//ag;
$hex = unpack("H*", $bytes);
say $hex;
or
my $hex = "41424320202020200A200B000C";
my $bytes = pack("H*", $hex);
$bytes =~ s/[^\t\x20-\x7E]//g;
$hex = unpack("H*", $bytes);
say $hex;
Solutions using /a and /r require Perl 5.14+.
The above starts with the following string:
41424320202020200A200B000C
It is converted into the following using pack:
ABC␠␠␠␠␠␊␠␋␀␌
The substitution removes all non-ASCII and all non-printable characters except TAB, leaving us with the following:
ABC␠␠␠␠␠␠
It is converted into the following using unpack:
414243202020202020
This solution is not only shorter than the previous solutions, it is also faster because it allocates far fewer variables and only starts the regex match once.
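The three steps can be wrapped in a small function, sketched below. The name strip_unprintable_hex is hypothetical. One detail the walkthrough does not show is that unpack "H*" produces lowercase hex digits, so the sketch uppercases the result on the assumption that the dump uses uppercase.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the pack, substitute, unpack steps.
sub strip_unprintable_hex {
    my ($hex) = @_;
    my $bytes = pack('H*', $hex);         # hex digits -> raw bytes
    $bytes =~ s/[^\t\x20-\x7E]//g;        # keep TAB and printable ASCII only
    return uc(unpack('H*', $bytes));      # back to hex, uppercased
}

print strip_unprintable_hex('41424320202020200A200B000C'), "\n";    # 414243202020202020
print strip_unprintable_hex('4A0A4B'), "\n";                        # 4A4B
```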
Quoting the question: "detect the control chars that could break the string when printed.. i.e remove ascii null \x00, \n, \r, \f and \x80 to \xFF, etc.."
Building on Hakon's answer (Which only strips out nul bytes, not all the other ones):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my $x="41424320202020200A200B000C";
say $x;
say grep { chr(hex($_)) =~ /[[:print:]\t]/ && hex($_) < 128 } unpack("(A2)*", $x);
gives you
41424320202020200A200B000C
414243202020202020
Inside a bracketed character class, [:print:] matches all printable characters including space (but not control characters such as newline and carriage return), and I added tab as well. The grep also checks that the byte is in the ASCII range (since higher characters are still printable in many locales).
It is possible to work directly with the hex form of the characters, but it's far more complicated. I recommend against using this approach. This answer serves to illustrate why this solution wasn't proposed.
You wish to exclude all characters except the following:
ASCII printables (0x20 to 0x7E)
TAB (0x09)
That means you wish to exclude the following characters:
0x00 to 0x08
0x0A to 0x1F
0x7F to 0xFF
If we group these by leading digits, we get
0x00 to 0x08, 0x0A to 0x0F
0x10 to 0x1F
0x7F
0x80 to 0xFF
We can therefore use the following:
$hex =~ s/\G(?:..)*?\K(?:0[0-8A-Fa-f]|7[Ff]|[189A-Fa-f].)//sg; # 5.10+
$hex =~ s/\G((?:..)*?)(?:0[0-8A-Fa-f]|7[Ff]|[189A-Fa-f].)/$1/sg; # Slower
You can try splitting the string into two-character substrings using unpack:
my $x="41424320202020200A200B000C";
say $x;
say join '', grep { $_ !~ /00/} unpack "(A2)*", $x;
Output:
41424320202020200A200B000C
41424320202020200A200B0C

Perl | Print ASCII, but backslashed other

I want to print the 95 printable ASCII symbols unchanged, but print the codes of all other characters.
How can I do this in pure Perl? With the unpack function? With a module?
print BackSlashed('test folder'); # expected test\040folder
print BackSlashed('test тестовая folder');
# expected test\040\321\202\320\265\321\201\321\202\320\276\320\262\320\260\321\217\040folder
print BackSlashed('НОВАЯ ПАПКА');
# expected \320\235\320\236\320\222\320\220\320\257\040\320\237\320\220\320\237\320\232\320\220
sub BackSlashed() {
my $str = shift;
.. backslashed code here...
return $str
}
You can use a regular expression substitution with an evaluated replacement part. In it, you need to convert each character to its numeric value first, and then output it in octal notation. There's a good explanation for it in this answer. Prepend an escaped backslash \ to get it to show up in the output.
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
The negated character class leaves basic ASCII letters and digits untouched and escapes everything else. If you want a different set, just change the character class.
Since your sample output has octets but you said your code has the use utf8 pragma, you need to convert Perl's representation of the string to the corresponding octet sequence before you run the substitution.
use utf8;
my $str = 'НОВАЯ ПАПКА';
print foo($str);
sub foo { # note that there are no () here!
my $str = shift;
utf8::encode($str);
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
return $str;
}
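The expected output in the question keeps punctuation and escapes only the space and the non-ASCII bytes. A sketch of that variant is below, assuming "unchanged" means visible ASCII, 0x21 to 0x7E. The name backslash_octal is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical variant: escape every byte outside visible ASCII (0x21-0x7E).
sub backslash_octal {
    my ($str) = @_;
    utf8::encode($str);    # character string -> UTF-8 octets, in place on the copy
    $str =~ s/([^\x21-\x7E])/sprintf '\\%03o', ord $1/ge;
    return $str;
}

print backslash_octal('test folder'), "\n";                          # test\040folder
print backslash_octal("test \x{442}\x{435}\x{441}\x{442}"), "\n";    # test\040\321\202\320\265\321\201\321\202
```

A literal backslash in the input is left alone here, so the output is ambiguous if the input can contain one. Removing \x5C from the kept range would escape it too.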

Reading unicode chars on the byte level

Suppose I wanted to detect unicode characters and encode them using \u notation. If I had to use a byte array, are there simple rules I can follow to detect groups of bytes that belong to a single character?
I am referring to UTF-8 bytes that need to be encoded for an ASCII-only receiver. At the moment, non-ASCII-Printable characters are stripped. s/[^\x20-\x7e\r\n\t]//g.
I want to improve this functionality to write \u0000 notation.
You need to have Unicode characters, so start by decoding your byte array.
use Encode qw( decode );
my $decoded_text = decode("UTF-8", $encoded_text);
Only then can you escape Unicode characters.
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
For example,
$ perl -CSDA -MEncode=decode -E'
my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
my $decoded_text = decode("UTF-8", $encoded_text);
say $decoded_text;
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
say $escaped_text;
'
Éric ♥ Perl
\u00C9ric \u2665 Perl
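One caveat: %04X produces five or six hex digits for code points above U+FFFF. If the receiver expects JavaScript or JSON style escapes (an assumption, the question does not say), such characters are usually written as a UTF-16 surrogate pair instead. A sketch with hypothetical helper names u_escape and escape_non_ascii:

```perl
use strict;
use warnings;

# Hypothetical helper: format one code point as \uXXXX, using a
# UTF-16 surrogate pair for code points above U+FFFF.
sub u_escape {
    my ($cp) = @_;
    return sprintf '\\u%04X', $cp if $cp <= 0xFFFF;
    my $v = $cp - 0x10000;
    return sprintf '\\u%04X\\u%04X', 0xD800 + ($v >> 10), 0xDC00 + ($v & 0x3FF);
}

# Expects a decoded character string, as in the answer above.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x0A\x20-\x5B\x5D-\x7E])/u_escape(ord $1)/ge;
    return $text;
}

print escape_non_ascii("\x{C9}ric \x{2665} Perl \x{1F600}"), "\n";
# prints \u00C9ric \u2665 Perl \uD83D\uDE00
```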

Convert utf-8 into html &...;

In Perl, how can I convert string containing utf-8 characters to HTML where such characters will be converted into &...; ?
First, split on an empty pattern to get a list of single characters. Then, map each character to itself, if it is ASCII, or its code, if it is not:
use Encode qw( decode_utf8 );
my $utf8_string = "\xE2\x80\x9C\x68\x6F\x6D\x65\xE2\x80\x9D";
my $unicode_string = decode_utf8($utf8_string);
my $html = join q(),
map { ord > 127 ? "&#" . ord . ";"
: $_
} split //, $unicode_string;
Just replace every character that is not printable low ASCII (that is, anything outside the \x20 - \x7F range) with its ord wrapped in the HTML entity syntax. Perl's substitution operator has the /e flag to indicate that the replacement should be evaluated as code.
use utf8;
my $str = "testТест"; # This is correct UTF-8 string right in the code
$str =~ s/([^\x20-\x7F])/"&#" . ord($1) . ";"/eg;
print $str;
# test&#1058;&#1077;&#1089;&#1090;
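If the result is going to be embedded in HTML text, the markup characters themselves probably need escaping as well, which neither snippet above does. A sketch under that assumption follows. The name to_html_entities and the choice of four named entities are illustrative only.

```perl
use strict;
use warnings;

my %named = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

# Hypothetical helper; expects a decoded character string.
sub to_html_entities {
    my ($text) = @_;
    $text =~ s/([&<>"])/$named{$1}/g;                     # markup characters first
    $text =~ s/([^\x20-\x7E])/'&#' . ord($1) . ';'/ge;    # then everything outside printable ASCII
    return $text;
}

print to_html_entities("a<b & \x{422}\x{435}\x{441}\x{442}"), "\n";
# prints a&lt;b &amp; &#1058;&#1077;&#1089;&#1090;
```

The order matters: replacing & first means the ampersands introduced by the numeric entities are not escaped again.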

Perl substr based on bytes

I'm using SimpleDB for my application. Everything goes well, except that a single attribute is limited to 1024 bytes. So for a long string I have to chop the string into chunks and save it.
My problem is that sometimes my string contains Unicode characters (Chinese, Japanese, Greek), and the substr() function counts characters, not bytes.
I tried the bytes pragma (use bytes) for byte semantics, and later
substr(encode_utf8($str), $start, $length), but neither helped.
Any help would be appreciated.
UTF-8 was engineered so that character boundaries are easy to detect. To split the string into chunks of valid UTF-8, you can simply use the following:
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;
Then either
# The saving code expects bytes.
store($_) for @utf8_chunks;
or
# The saving code expects decoded text.
store(decode_utf8($_)) for @utf8_chunks;
Demonstration:
$ perl -e'
use Encode qw( encode_utf8 );
# This character encodes to three bytes using UTF-8.
my $text = "\N{U+2660}" x 342;
my $utf8 = encode_utf8($text);
my @utf8_chunks = $utf8 =~ /\G(.{1,1024})(?![\x80-\xBF])/sg;
CORE::say(length($_)) for @utf8_chunks;
'
1023
3
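The same pattern can be wrapped in a function that takes the byte limit as a parameter, sketched below. The name utf8_chunks is hypothetical, and the limit is assumed to be at least 4 (the longest UTF-8 sequence), since a smaller limit could make the match stop early on a wide character.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical wrapper; $max_bytes is assumed to be at least 4.
sub utf8_chunks {
    my ($text, $max_bytes) = @_;
    my $utf8 = encode_utf8($text);
    # Take up to $max_bytes bytes, backing off until the next byte
    # is not a UTF-8 continuation byte (0x80-0xBF).
    my @chunks = $utf8 =~ /\G(.{1,$max_bytes})(?![\x80-\xBF])/sg;
    return @chunks;
}

my $text   = "\N{U+2660}" x 342;               # 342 characters, 1026 bytes
my @chunks = utf8_chunks($text, 1024);
print join(',', map { length } @chunks), "\n";    # 1023,3

# Decoding each chunk and joining them gives back the original text.
my $roundtrip = join '', map { decode_utf8($_) } @chunks;
print(($roundtrip eq $text) ? "ok\n" : "mismatch\n");    # ok
```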
substr operates on 1-byte characters unless the string has the UTF-8 flag on. So this will give you the first 1024 bytes of the UTF-8 encoding of a decoded string:
my $bytes = substr encode_utf8($str), 0, 1024;
although it does not necessarily split the string on a character boundary. To discard a partial character at the end, you can use:
$str = decode_utf8($bytes, Encode::FB_QUIET);