perl - matching at even positions and remove non-printable chars - perl

I have a hex2string from database table dump that is like
"41424320202020200A200B000C"
what I want to do is to match at the even positions and detect the control chars that could break the string when printed.. i.e remove ascii null \x00, \n, \r, \f and \x80 to \xFF, etc..
I tried removing ascii null like
perl -e ' $x="41424320202020200A200B000C"; $x=~s/00//g; print "$x\n" '
but the result is incorrect as it removed 0 from trailing hex value of space \x20 and leading 0 of newline \x0A i.e 20 0A to 2A
414243202020202A2B0C
what i wanted is
414243202020202020

say unpack("H*", pack("H*", "41424320202020200A200B000C") =~ s/[^\t[:print:]]//arg);
or
my $hex = "41424320202020200A200B000C";
my $bytes = pack("H*", $hex);
$bytes =~ s/[^\t[:print:]]//ag;
$hex = unpack("H*", $bytes);
say $hex;
or
my $hex = "41424320202020200A200B000C";
my $bytes = pack("H*", $hex);
$bytes =~ s/[^\t\x20-\x7E]//g;
$hex = unpack("H*", $bytes);
say $hex;
Solutions using /a and /r require Perl 5.14+.
The above starts with the following string:
41424320202020200A200B000C
It is converted into the following using pack:
ABC␠␠␠␠␠␊␠␋␀␌
The substitution removes all non-ASCII and all non-printable characters except TAB, leaving us with the following:
ABC␠␠␠␠␠␠
It is converted into the following using unpack:
414243202020202020
This solution is not only shorter than the previous solutions, it is also faster because it allocates far fewer variables and only starts the regex match once.

detect the control chars that could break the string when printed.. i.e remove ascii null \x00, \n, \r, \f and \x80 to \xFF, etc..
Building on Hakon's answer (Which only strips out nul bytes, not all the other ones):
#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
my $x="41424320202020200A200B000C";
say $x;
say grep { chr(hex($_)) =~ /[[:print:]\t]/ && hex($_) < 128 } unpack("(A2)*", $x);
gives you
41424320202020200A200B000C
414243202020202020
The character class [:print:] inside a character set matches all printable characters including space (but not control characters like newline and linefeed), and I added in tab as well. Then it also checks to make sure the byte is in the ASCII range (Since higher characters are still printable in many locales).

It is possible to work directly with the hex form of the characters, but it's far more complicated. I recommend against using this approach. This answer serves to illustrate why this solution wasn't proposed.
You wish to exclude all characters except the following:
ASCII printables (2016 to 7E16)
TAB (0916)
That means you wish to exclude the following characters:
0016 to 0816
0A16 to 1F16
7F16 to FF16
If we group these by leading digits, we get
0016 to 0816, 0A16 to 0F16
1016 to 1F16
7F16
8016 to FF16
We can therefore use the following:
$hex =~ s/\G(?:..)*?\K(?:0[0-8A-Fa-f]|7F|[189A-Fa-f].)//sg; # 5.10+
$hex =~ s/\G((?:..)*?)(?:0[0-8A-Fa-f]|7F|[189A-Fa-f].)/$1/sg; # Slower

You can try split the string into 2 bytes substrings using unpack:
my $x="41424320202020200A200B000C";
say $x;
say join '', grep { $_ !~ /00/} unpack "(A2)*", $x;
Output:
41424320202020200A200B000C
41424320202020200A200B0C

Related

How can I escape a string in Perl for LDAP searching?

I want to escape a string, per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by backslash.
Has to be done in Perl. I already know how to do it in Python, C++, Java, etc., but Perl if baffling.
Also, I cannot use Net::LDAP and I may not be able to add any new modules, so, I want to do it with basic Perl features.
Skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not codepoints. So, something that works with non-ASCII text too:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
sub valueencode ($) {
# Unpack format returns octets of UTF-8 encoded text
my #bytes = unpack "U0C*", $_[0];
sprintf '\%02x' x #bytes, #bytes;
}
say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples
Example:
$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87
Or an alternative using the vector flag:
use Encode qw/encode/;
sub valueencode ($) {
sprintf '\%*vx', "\\", encode('UTF-8', $_[0]);
}
Finally, a smarter version that only escapes ASCII characters when it has to (And multi-byte characters, even though upon a closer read of the RFC they don't actually need to be if they're valid UTF-8):
# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that input
# is a ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
my $encoded = "";
for my $byte (unpack 'U0C*', $_[0]) {
if (($byte >= 0x01 && $byte <= 0x27) ||
($byte >= 0x2B && $byte <= 0x5B) ||
($byte >= 0x5D && $byte <= 0x7F)) {
$encoded .= chr $byte;
} else {
$encoded .= sprintf '\%02x', $byte;
}
}
return $encoded;
}
This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
In short, one way is just as the question says: split the string into characters, get their ordinals then convert format to hex; then put it back together. I don't know how to get the \nn format so I'd make it 'by hand'. For instance
my $s = join '', map { sprintf '\%x', ord } split //, 'u1';
Or use vector flag %v to treat the string as a "vector" of integers
my $s = sprintf '\%*vx', '\\', 'u1';
With %v the string is broken up into numerical representation of characters, each is converted (%x), and they're joined back, with . between them. That (optional) * allows us to specify our string by which to join them instead, \ (escaped) here.
This can also be done with pack + unpack, see the link below. Also see that page if there is a wide range of input characters.†
See ord and sprintf, and for more pages like this one.
† If there is non-ASCII input then you may need to encode it so to get octets, if they are to escape (and not whole codepoints)
use Encode qw(encode);
my $s = sprintf '\%*vx', '\\', encode('UTF_8', $input);
See the linked page for more.

how unpack function will work in perl for this code $str =~ s/([^\w ])/'%'.unpack('H2', $1)/eg;

i have a code in perl $str =~ s/([^\w ])/'%'.unpack('H2', $1)/eg; i am not undestanding what value will be stored in $str
Assuming $str is encoded using UTF-8, and assuming the code you provided is followed by $str =~ s/ /+/g, the result is a url-encoded string safe for use in URLs.
Specifically, the line of code in question replaces every non-word except spaces with a three character sequence starting with % and followed by two hex digits representing the character number.
For example,
foo's ⇒ foo%27s
20% ⇒ 20%25
A better solution would be to use uri_escape (for strings encoded using UTF-8) or uri_escape_utf8 (for strings of Unicode Code Points aka decoded strings) from URI::Escape.
Provided line of code modifies $str value according substitute rule set s/([^\w ])/'%'.unpack('H2', $1)/eg.
How does it work:
[^\w] - look at $str for character not \w known as complement to \w
\w - represents range [A-za-z0-9_], punctuation chars and Unicode marks see perlre
([^\w]) capture found character, 'store' it in $1
regex modifier e evaluates '%'.unpack('H2',$1) as substitution string
unpack('H2',$1) - unpack $1 with template 'H2' (hex representation of byte associated with $1)
take '%' and concatenate it with unpacked result
use result from step 6 as replacement string
regex modifier g instructs to make this operation for all occurrences in the $str
Without knowing initial $str value before this operation, impossible to evaluate final result.
If initial value is known then you can evaluate result by visiting https://regex101.com/ website.
Nothing could speak louder than sample code demonstrating transformation
use feature 'say';
$msg = "Date: Mar 6 2020, Msg: soon Alex's birthday";
$msg =~ s/([^\w ])/'%'.unpack('H2', $1)/eg;
say $msg;
Output
Date%3a Mar 6 2020%2c Msg%3a soon Alex%27s birthday
Following code demonstrates how "Hello World\n" will look as hex representation (for Dada).
use feature 'say';
my $msg = "Hello World!\n";
print $msg;
my $a = unpack('H*',$msg);
say $a;
Output
Hello World!
48656c6c6f20576f726c64210a
You could start by trying it out and seeing if that gives you a hint.
$ perl -E'$str = "&*("; $str =~ s/([^\w ])/"%".unpack('H2', $1)/eg; say $str'
%26%2a%28
So, we have a substitution operator that looks like this:
s/PATTERN/REPLACEMENT/OPTIONS
Our pattern is ([^\w ]) which means "match every individual character that isn't a 'word character' or a space and capture that character in $1.
The replacement string is "%".unpack('H2', $1). Which means "the character '%' followed by the result of running unpack('H2', $1). unpack() here is being used to convert characters to the hexadecimal equivalent of their ASCII code. "H" means "convert to hex" and "2" means produce two hex digits".
The options are /e which means "run this code and use the output as the replacement string" and /g which means "do this for every match in the input string".
Putting that all together, you have code that:
Looks for non-word characters
Converts them to their hexadecimal escape code
Replaces them in the string
Using URI::Escape is probably a better approach.

Perl - Convert integer to text Char(1,2,3,4,5,6)

I am after some help trying to convert the following log I have to plain text.
This is a URL so there maybe %20 = 'space' and other but the main bit I am trying convert is the char(1,2,3,4,5,6) to text.
Below is an example of what I am trying to convert.
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
What I have tried so far is the following while trying to added into the char(in here) to convert with the chr($2)
perl -pe "s/(char())/chr($2)/ge"
All this has manage to do is remove the char but now I am trying to convert the number to text and remove the commas and brackets.
I maybe way off with how I am doing as I am fairly new to to perl.
perl -pe "s/word to remove/word to change it to/ge"
"s/(char(what goes in here))/chr($2)/ge"
Output try to achieve is
select -x1-Q-,-x2-Q-,-x3-Q-
Or
select%20-x1-Q-,-x2-Q-,-x3-Q-
Thanks for any help
There's too much to do here for a reasonable one-liner. Also, a script is easier to adjust later
use warnings;
use strict;
use feature 'say';
use URI::Escape 'uri_unescape';
my $string = q{select%20}
. q{char(45,120,49,45,81,45),char(45,120,50,45,81,45),}
. q{char(45,120,51,45,81,45)};
my $new_string = uri_unescape($string); # convert %20 and such
my #parts = $new_string =~ /(.*?)(char.*)/;
$parts[1] = join ',', map { chr( (/([0-9]+)/)[0] ) } split /,/, $parts[1];
$new_string = join '', #parts;
say $new_string;
this prints
select -x1-Q-,-x2-Q-,-x3-Q-
Comments
Module URI::Escape is used to convert percent-encoded characters, per RFC 3986
It is unspecified whether anything can follow the part with char(...)s, and what that might be. If there can be more after last char(...) adjust the splitting into #parts, or clarify
In the part with char(...)s only the numbers are needed, what regex in map uses
If you are going to use regex you should read up on it. See
perlretut, a tutorial
perlrequick, a quick-start introduction
perlre, the full account of syntax
perlreref, a quick reference (its See Also section is useful on its own)
Alright, this is going to be a messy "one-liner". Assuming your text is in a variable called $text.
$text =~ s{char\( ( (?: (?:\d+,)* \d+ )? ) \)}{
my #arr = split /,/, $1;
my $temp = join('', map { chr($_) } #arr);
$temp =~ s/^|$/"/g;
$temp
}xeg;
The regular expression matches char(, followed by a comma-separated list of sequences of digits, followed by ). We capture the digits in capture group $1. In the substitution, we split $1 on the comma (since chr only works on one character, not a whole list of them). Then we map chr over each number and concatenate the result into a string. The next line simply puts quotation marks at the start and end of the string (presumably you want the output quoted) and then returns the new string.
Input:
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
Output:
select%20"-x1-Q-","-x2-Q-","-x3-Q-"
If you want to replace the % escape sequences as well, I suggest doing that in a separate line. Trying to integrate both substitutions into one statement is going to get very hairy.
This will do as you ask. It performs the decoding in two stages: first the URI-encoding is decoded using chr hex $1, and then each char() function is translated to the string corresponding to the character equivalents of its decimal parameters
use strict;
use warnings 'all';
use feature 'say';
my $s = 'select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)';
$s =~ s/%(\d+)/ chr hex $1 /eg;
$s =~ s{ char \s* \( ( [^()]+ ) \) }{ join '', map chr, $1 =~ /\d+/g }xge;
say $s;
output
select -x1-Q-,-x2-Q-,-x3-Q-

Reading unicode chars on the byte level

Suppose I wanted to detect unicode characters and encode them using \u notation. If I had to use a byte array, are there simple rules I can follow to detect groups of bytes that belong to a single character?
I am referring to UTF-8 bytes that need to be encoded for an ASCII-only receiver. At the moment, non-ASCII-Printable characters are stripped. s/[^\x20-\x7e\r\n\t]//g.
I want to improve this functionality to write \u0000 notation.
You need to have Unicode characters, so start by decoding your byte array.
use Encode qw( decode );
my $decoded_text = decode("UTF-8", $encoded_text);
Only then can you escape Unicode characters.
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
For example,
$ perl -CSDA -MEncode=decode -E'
my $encoded_text = "\xC3\x89\x72\x69\x63\x20\xE2\x99\xA5\x20\x50\x65\x72\x6c";
my $decoded_text = decode("UTF-8", $encoded_text);
say $decoded_text;
( my $escaped_text = $decoded_text ) =~
s/([^\x0A\x20-\x5B\x5D-\x7E])/sprintf("\\u%04X", ord($1))/eg;
say $escaped_text;
'
Éric ♥ Perl
\u00C9ric \u2665 Perl

Convert utf-8 into html &...;

In Perl, how can I convert string containing utf-8 characters to HTML where such characters will be converted into &...; ?
First, split on an empty pattern to get a list of single characters. Then, map each character to itself, if it is ASCII, or its code, if it is not:
use Encode qw( decode_utf8 );
my $utf8_string = "\xE2\x80\x9C\x68\x6F\x6D\x65\xE2\x80\x9D";
my $unicode_string = decode_utf8($utf8_string);
my $html = join q(),
map { ord > 127 ? "&#" . ord . ";"
: $_
} split //, $unicode_string;
Just replace every symbol that is not printable and not low ASCII (that is, anything outside \x20 - \x7F region) with simple calculation of its ord + necessary HTML entity formatting. Perl regexp have /e flag to indicate that replacement should be treated as code.
use utf8;
my $str = "testТест"; # This is correct UTF-8 string right in the code
$str =~ s/([^[\x20-\x7F])/"&#" . ord($1) . ";"/eg;
print $str;
# testТест