I have an old legacy system that pulls PostgreSQL bytea data from a database (I can't change the code, as it's compiled). How do I convert bytea data in this format to an ASCII string?
Here is a sample of the data:
my($value) = "\x46726f6d3a20224d";
I found this example online to decode a hex string, but the input has to start with 0x, not \x. If I change the test string from \x to 0x, the example works. With \x, however, the \x46 is treated as a capital F when the regex match runs, and the rest fails to decode.
Here is the regex I found, which works with a string starting with 0x but not \x. Is it possible to decode this type of hex string somehow?
$value =~ s/0x(([0-9a-f][0-9a-f])+)/pack('H*', $1)/ie;
print $value, "\n";
Correct output when you use 0x on the input string:
From: "M
Incorrect output (not decoded) when using \x on the input string:
F726f6d3a20224d
Cheers, Mike
Assuming you actually have the string
\x46726f6d3a20224d
as produced by
my $value = "\\x46726f6d3a20224d";
Then all you need to do is replace the 0 with \\.
$value =~ s/\\x(([0-9a-f][0-9a-f])+)/pack('H*', $1)/ie;
Better (less repetition, and it avoids the slowdowns that case-insensitive matching incurs because of characters such as ß):
$value =~ s/\\x((?:[0-9a-fA-F]{2})+)/ pack( 'H*', $1 ) /e;
If you expect this to be the whole string, then I'd use
$value = pack( 'H*', $value =~ s/^\\x//r );
The following is faster, but loses some validation:
$value = pack( 'H*', substr( $value, 2 ) );
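The variants above can be combined into one small routine. The following is a minimal sketch, assuming the input is PostgreSQL's hex text form (a literal \x followed by an even number of hex digits); the helper name decode_bytea_hex is made up for illustration and is not part of any module.

```perl
use strict;
use warnings;

# Hypothetical helper: decode the hex text form of a bytea value.
# Returns undef when the input is not "\x" plus an even number of hex digits.
sub decode_bytea_hex {
    my ($text) = @_;
    return undef unless defined $text;
    return undef unless $text =~ /\A\\x((?:[0-9a-fA-F]{2})*)\z/;
    return pack 'H*', $1;    # each pair of hex digits becomes one byte
}

# Single quotes keep the backslash literal, as it would arrive from the database.
my $decoded = decode_bytea_hex('\x46726f6d3a20224d');
print defined($decoded) ? $decoded : '(not hex bytea)', "\n";   # From: "M
```

Anchoring the pattern at both ends keeps the validation that the substr variant gives up, at the cost of one regex match per value.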
Related
I want to escape a string, per RFC 4515. So, the string "u1" would be transformed to "\75\31", that is, the ordinal value of each character, in hex, preceded by backslash.
Has to be done in Perl. I already know how to do it in Python, C++, Java, etc., but Perl is baffling.
Also, I cannot use Net::LDAP and I may not be able to add any new modules, so, I want to do it with basic Perl features.
Skimming through RFC 4515, this encoding escapes the individual octets of multi-byte UTF-8 characters, not codepoints. So, something that works with non-ASCII text too:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw/say/;
sub valueencode ($) {
# Unpack format returns octets of UTF-8 encoded text
my @bytes = unpack "U0C*", $_[0];
sprintf '\%02x' x @bytes, @bytes;
}
say valueencode 'u1';
say valueencode "Lu\N{U+010D}i\N{U+0107}"; # Lučić, from the RFC 4515 examples
Example:
$ perl demo.pl
\75\31
\4c\75\c4\8d\69\c4\87
Or an alternative using the vector flag:
use Encode qw/encode/;
sub valueencode ($) {
sprintf '\%*vx', "\\", encode('UTF-8', $_[0]);
}
Finally, a smarter version that escapes ASCII characters only when it has to (and multi-byte characters too, even though, on a closer read of the RFC, they don't actually need to be if they're valid UTF-8):
# Encode according to RFC 4515 valueencoding grammar rules:
#
# Text is UTF-8 encoded. Bytes can be escaped with the sequence
# \XX, where the X's are hex digits.
#
# The characters NUL, LPAREN, RPAREN, ASTERISK and BACKSLASH all MUST
# be escaped.
#
# Bytes > 0x7F that aren't part of a valid UTF-8 sequence MUST be
# escaped. This version assumes there are no such bytes and that input
# is an ASCII or Unicode string.
#
# Single bytes and valid multibyte UTF-8 sequences CAN be escaped,
# with each byte escaped separately. This version escapes multibyte
# sequences, to give ASCII results.
sub valueencode ($) {
my $encoded = "";
for my $byte (unpack 'U0C*', $_[0]) {
if (($byte >= 0x01 && $byte <= 0x27) ||
($byte >= 0x2B && $byte <= 0x5B) ||
($byte >= 0x5D && $byte <= 0x7F)) {
$encoded .= chr $byte;
} else {
$encoded .= sprintf '\%02x', $byte;
}
}
return $encoded;
}
This version returns the strings 'u1' and 'Lu\c4\8di\c4\87' from the above examples.
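The same rules can also be expressed as a single substitution over the UTF-8 encoded bytes. This is only a sketch of an alternative to the loop above, not a replacement for it: the helper name escape_filter_value is invented here, and it assumes the input is a decoded character string.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: escape the bytes RFC 4515 says MUST be escaped
# (NUL, "(", ")", "*", "\") plus every byte above 0x7F, so the result
# is pure ASCII.
sub escape_filter_value {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);    # characters -> UTF-8 bytes
    $octets =~ s/([\x00()*\\\x80-\xFF])/sprintf '\\%02x', ord $1/ge;
    return $octets;
}

print escape_filter_value('u1'), "\n";
# u1
print escape_filter_value('Parens R Us (for all your parens)'), "\n";
# Parens R Us \28for all your parens\29
```

The second call reproduces one of the RFC's own examples, which is a quick way to sanity-check the character class.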
In short, one way is just as the question says: split the string into characters, get their ordinals, format each one as hex, then put it all back together. I don't know of a ready-made way to get the \nn format, so I'd build it 'by hand'. For instance
my $s = join '', map { sprintf '\%02x', ord } split //, 'u1';
Or use vector flag %v to treat the string as a "vector" of integers
my $s = sprintf '\%*vx', '\\', 'u1';
With %v the string is broken up into numerical representation of characters, each is converted (%x), and they're joined back, with . between them. That (optional) * allows us to specify our string by which to join them instead, \ (escaped) here.
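To make the join behaviour concrete, here is a small sketch contrasting the default join string with a custom one. The helper name backslash_hex is hypothetical and only wraps the sprintf call shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: hex ordinal of each character, with "\" written
# before the first one and "\" used as the join string between the rest.
sub backslash_hex {
    my ($text) = @_;
    return sprintf '\%*vx', '\\', $text;
}

print sprintf('%vd', 'u1'), "\n";    # 117.49  (default join string is ".")
print backslash_hex('u1'), "\n";     # \75\31
```

Note that the leading backslash comes from the format string itself, so an empty input would still produce a lone backslash.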
This can also be done with pack + unpack, see the link below. Also see that page if there is a wide range of input characters.†
See the documentation for ord and sprintf.
† If the input contains non-ASCII characters, you may need to encode it first to get octets, if it is the octets (and not whole code points) that are to be escaped
use Encode qw(encode);
my $s = sprintf '\%*vx', '\\', encode('UTF_8', $input);
See the linked page for more.
I need to increment a numeric value in a string:
my $str = "tool_v01.zip";
(my $newstr = $str) =~ s/\_v(\d+)\.zip$/ ($1++);/eri;
#(my $newstr = $str) =~ s/\_v(\d+)\.zip$/ ($1+1);/eri;
#(my $newstr = $str) =~ s/\_v(\d+)\.zip$/ $1=~s{(\d+)}{$1+1}/r; /eri;
print $newstr;
Expected output is tool_v02.zip
Note: the version number 01 may contain any number of leading zeroes
I don't think this question has anything to do with the /r modifier, but rather how to properly format the output. For that, I'd suggest sprintf:
my $newstr = $str =~ s{ _v (\d+) \.zip$ }
{ sprintf("_v%0*d.zip", length($1), $1+1 ) }xeri;
Or, replacing just the number with zero-width Lookaround Assertions:
my $newstr = $str =~ s{ (?<= _v ) (\d+) (?= \.zip$ ) }
{ sprintf("%0*d", length($1), $1+1 ) }xeri;
Note: With either of these solutions, something like tool_v99.zip would be altered to tool_v100.zip because the new sequence number cannot be expressed in two characters. If that's not what you want then you need to specify what alternative behaviour you require.
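A quick way to check the width handling and the rollover case described in the note is to wrap the lookaround version in a function. This is a sketch only; bump_version is a made-up name, and /r needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: increment the number in "..._vNN.zip", padding the
# result to the original field width. Names that do not match are
# returned unchanged.
sub bump_version {
    my ($name) = @_;
    return $name =~ s{ (?<= _v ) (\d+) (?= \.zip \z ) }
                     { sprintf('%0*d', length($1), $1 + 1) }xer;
}

print bump_version('tool_v01.zip'), "\n";      # tool_v02.zip
print bump_version('tool_v0009.zip'), "\n";    # tool_v0010.zip
print bump_version('tool_v99.zip'), "\n";      # tool_v100.zip
```

The `*` in `%0*d` takes the field width from the argument list, which is what preserves any number of leading zeroes.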
The bit you're missing is sprintf which works the same way as printf except rather than outputting the formatted string to stdout or a file handle, it returns it as a string. Example:
sprintf("%02d",3)
generates a string 03
Putting this into your regex, you can do the following. Rather than using /r, you can use a zero-width lookahead ((?=...)) to match the file suffix and replace just the matched number with the new value
s/(\d+)(?=\.zip$)/sprintf("%02d",$1+1)/ei
I am parsing an HTML document that contains a variable, for example var key = 0xa9. My regex extracts it fine, but the 0xa9 I get is stored in a variable as a string. Is there a cast function or anything similar to convert it to a number?
EDIT:
I don't think I explained myself well. This is what I have been trying to do:
$t = $t . chr ( ord(substr($e, $i, 1)) ^ $var); but $var = "0xa9" or similar. Because it is a string, the operation fails with the error Argument "0xc2" isn't numeric in bitwise xor (^) at. I want the exact same value, but not as a string: for this to work, $var needs to be like $var = 0xa9.
Try
print hex '0xAf'; # prints '175'
From perldoc
hex
Interprets EXPR as a hex string and returns the corresponding value.
(To convert strings that might start with either 0 , 0x , or 0b, see oct.)
If EXPR is omitted, uses $_ .
Please try this:
print hex $string
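Adding 0x0 to $var will not help: Perl's string-to-number conversion does not recognise the 0x prefix, so "0xa9" + 0x0 evaluates to 0 (with a warning). Convert the second operand of the XOR with hex instead:
$t = $t . chr ( ord(substr($e, $i, 1)) ^ hex($var));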
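For illustration, here is a self-contained sketch of the conversion and the XOR loop from the question. The helper name xor_with_hex_key and the sample inputs are assumptions made for the example, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical helper: XOR every character of $data with a key supplied
# as a hex string such as "0xa9" (the form scraped from the page).
sub xor_with_hex_key {
    my ($data, $key_text) = @_;
    my $key = hex $key_text;    # "0xa9" -> 169
    return join '', map { chr(ord($_) ^ $key) } split //, $data;
}

print hex('0xa9'), "\n";     # 169
print oct('0xa9'), "\n";     # 169 (oct also understands 0x and 0b prefixes)
print oct('0b101'), "\n";    # 5
print xor_with_hex_key('ABC', '0x20'), "\n";    # abc
```

Converting the key once, outside the loop, avoids repeating the hex call for every character.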
I was running a perl app which uses /opt/local/lib/perl5/5.12.4/darwin-thread-multi-2level/Encode.pm
and issues an error
Cannot decode string with wide characters at /opt/local/lib/perl5/5.12.4/darwin-thread-multi-2level/Encode.pm line 174.
Line 174 of Encode.pm reads
sub decode($$;$) {
my ( $name, $octets, $check ) = @_;
return undef unless defined $octets;
$octets .= '' if ref $octets;
$check ||= 0;
my $enc = find_encoding($name);
unless ( defined $enc ) {
require Carp;
Carp::croak("Unknown encoding '$name'");
}
my $string = $enc->decode( $octets, $check ); # line 174
$_[1] = $octets if $check and !ref $check and !( $check & LEAVE_SRC() );
return $string;
}
Any workaround?
encode takes a string of Unicode code points and serialises them into a string of bytes.
decode takes a string of bytes and deserialises them into Unicode code points.
That message means you passed a string containing one or more characters above 255 (non-bytes) to decode, which is obviously an incorrect argument.
>perl -MEncode -E"for (254..257) { say; decode('iso-8859-1', chr($_)); }"
254
255
256
Wide character in subroutine entry at .../Encode.pm line 176.
You ask for a workaround, but the bug is yours. Perhaps you are accidentally trying to decode something you already decoded?
I had a similar problem.
$enc->decode( $octets, $check ); expects octets.
So put Encode::_utf8_off($octets) before. It made it work for me.
That error message is saying that you have passed in a string that has already been decoded (and contains characters above codepoint 255). You can't decode it again.
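The bytes-versus-characters distinction can be checked directly. The sketch below is only an illustration: try_decode is a hypothetical wrapper that traps the exception so the failure can be observed, and it is not meant as a way to hide the underlying bug.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical wrapper: return the decoded string, or undef if decode dies.
sub try_decode {
    my ($encoding, $bytes) = @_;
    my $chars = eval { decode($encoding, $bytes) };
    return $chars;
}

my $octets = "\xC4\x8D";                     # UTF-8 bytes for U+010D
my $chars  = try_decode('UTF-8', $octets);   # one character
my $again  = encode('UTF-8', $chars);        # back to two bytes

printf "%d bytes -> %d character -> %d bytes\n",
    length($octets), length($chars), length($again);

# A second decode fails: $chars holds a character above 255, not bytes.
print "second decode failed\n" unless defined try_decode('UTF-8', $chars);
```

If the second message appears in a real program, the usual cause is that the data was already decoded somewhere upstream, for example by an I/O layer.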
In Perl, how can I convert string containing utf-8 characters to HTML where such characters will be converted into &...; ?
First, split on an empty pattern to get a list of single characters. Then, map each character to itself, if it is ASCII, or its code, if it is not:
use Encode qw( decode_utf8 );
my $utf8_string = "\xE2\x80\x9C\x68\x6F\x6D\x65\xE2\x80\x9D";
my $unicode_string = decode_utf8($utf8_string);
my $html = join q(),
map { ord > 127 ? "&#" . ord . ";"
: $_
} split //, $unicode_string;
Just replace every character that is not printable low ASCII (that is, anything outside the \x20-\x7F range) with an HTML numeric entity built from its ord. Perl regexes have an /e flag that makes the replacement be evaluated as code.
use utf8;
my $str = "testТест"; # This is correct UTF-8 string right in the code
$str =~ s/([^\x20-\x7F])/"&#" . ord($1) . ";"/eg;
print $str;
# test&#1058;&#1077;&#1089;&#1090;
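Both answers reduce to the same two steps: decode the bytes to characters, then substitute by code point. A minimal sketch combining them follows, assuming the input arrives as UTF-8 encoded bytes; the name to_numeric_entities is invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode UTF-8 bytes, then replace every character
# outside printable ASCII (0x20-0x7E) with a decimal numeric reference.
sub to_numeric_entities {
    my ($utf8_bytes) = @_;
    my $chars = decode('UTF-8', $utf8_bytes);
    $chars =~ s/([^\x20-\x7E])/'&#' . ord($1) . ';'/ge;
    return $chars;
}

print to_numeric_entities("\xE2\x80\x9Chome\xE2\x80\x9D"), "\n";
# &#8220;home&#8221;
```

This sketch also converts control characters such as newlines, and it leaves &, < and > untouched, so markup-significant characters would still need separate handling.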