Perl | Print ASCII, but backslashed other - perl

I want to print the 95 ASCII symbols unchanged, but print the codes of all other characters instead.
How can I do this in pure Perl? With the 'unpack' function? With a module?
print BackSlashed('test folder'); # expected test\040folder
print BackSlashed('test тестовая folder');
# expected test\040\321\202\320\265\321\201\321\202\320\276\320\262\320\260\321\217\040folder
print BackSlashed('НОВАЯ ПАПКА');
# expected \320\235\320\236\320\222\320\220\320\257\040\320\237\320\220\320\237\320\232\320\220
sub BackSlashed() {
my $str = shift;
.. backslashed code here...
return $str
}

You can use a regular expression substitution whose replacement part is evaluated as code (the /e modifier). In it, you need to convert each character to its numeric value with ord first, and then output that value in octal notation. There's a good explanation for it in this answer. Prepend an escaped backslash \ to get it to show up in the output.
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
The negated character class leaves only basic ASCII letters and digits unescaped. If you want to keep something else, just change the character class.
Since your sample output has octets but you said your code has the use utf8 pragma, you need to convert Perl's representation of the string to the corresponding octet sequence before you run the substitution.
use utf8;
my $str = 'НОВАЯ ПАПКА';
print foo($str);
sub foo { # note that there are no () here!
my $str = shift;
utf8::encode($str);
$str =~ s/([^a-zA-Z0-9])/sprintf "\\%03o", ord($1)/eg;
return $str;
}
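As a hedged variation on the substitution above, the character class can be widened so that all printable ASCII except the space and the backslash itself stays unchanged, which matches the expected output in the question. The name backslash_octal is hypothetical; this is a sketch under that assumption, not the answerer's code.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are UTF-8 source text

# Hypothetical helper (the name is an assumption, not from the answer).
sub backslash_octal {
    my ($str) = @_;
    utf8::encode($str);    # characters -> UTF-8 octets
    # Escape everything except printable ASCII \x21-\x7E; the backslash
    # (\x5C) is escaped too, so the output stays unambiguous.
    $str =~ s/([^\x21-\x5B\x5D-\x7E])/sprintf "\\%03o", ord $1/ge;
    return $str;
}

print backslash_octal('test folder'), "\n";    # test\040folder
print backslash_octal('НОВАЯ ПАПКА'), "\n";
```

Escaping the backslash as well is a design choice: without it, a literal \040 in the input could not be told apart from an escaped space.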

Related

Perl - Convert integer to text Char(1,2,3,4,5,6)

I am after some help trying to convert the following log I have to plain text.
This is a URL, so there may be %20 = 'space' and other escapes, but the main bit I am trying to convert is the char(1,2,3,4,5,6) to text.
Below is an example of what I am trying to convert.
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
What I have tried so far is the following, trying to pass what is inside the char(in here) to chr($2) for conversion:
perl -pe "s/(char())/chr($2)/ge"
All this has managed to do is remove the char, but now I am trying to convert the numbers to text and remove the commas and brackets.
I may be way off with how I am doing this, as I am fairly new to Perl.
perl -pe "s/word to remove/word to change it to/ge"
"s/(char(what goes in here))/chr($2)/ge"
The output I am trying to achieve is
select -x1-Q-,-x2-Q-,-x3-Q-
Or
select%20-x1-Q-,-x2-Q-,-x3-Q-
Thanks for any help
There's too much to do here for a reasonable one-liner. Also, a script is easier to adjust later.
use warnings;
use strict;
use feature 'say';
use URI::Escape 'uri_unescape';
my $string = q{select%20}
. q{char(45,120,49,45,81,45),char(45,120,50,45,81,45),}
. q{char(45,120,51,45,81,45)};
my $new_string = uri_unescape($string); # convert %20 and such
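my @parts = $new_string =~ /(.*?)(char.*)/;
# split into individual char(...) groups, then turn each group's numbers into characters
$parts[1] = join ',', map { join '', map { chr($_) } ($_ =~ /([0-9]+)/g) } split /\),/, $parts[1];
$new_string = join '', @parts;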
say $new_string;
this prints
select -x1-Q-,-x2-Q-,-x3-Q-
Comments
Module URI::Escape is used to convert percent-encoded characters, per RFC 3986
It is unspecified whether anything can follow the part with char(...)s, and what that might be. If there can be more after the last char(...), adjust the splitting into @parts, or clarify
In the part with char(...)s only the numbers are needed, which is what the regex in map extracts
If you are going to use regex you should read up on it. See
perlretut, a tutorial
perlrequick, a quick-start introduction
perlre, the full account of syntax
perlreref, a quick reference (its See Also section is useful on its own)
Alright, this is going to be a messy "one-liner". Assuming your text is in a variable called $text.
$text =~ s{char\( ( (?: (?:\d+,)* \d+ )? ) \)}{
my @arr = split /,/, $1;
my $temp = join('', map { chr($_) } @arr);
$temp =~ s/^|$/"/g;
$temp
}xeg;
The regular expression matches char(, followed by a comma-separated list of sequences of digits, followed by ). We capture the digits in capture group $1. In the substitution, we split $1 on the comma (since chr only works on one character, not a whole list of them). Then we map chr over each number and concatenate the result into a string. The next line simply puts quotation marks at the start and end of the string (presumably you want the output quoted) and then returns the new string.
Input:
select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)
Output:
select%20"-x1-Q-","-x2-Q-","-x3-Q-"
If you want to replace the % escape sequences as well, I suggest doing that in a separate line. Trying to integrate both substitutions into one statement is going to get very hairy.
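A minimal sketch of that two-statement approach, assuming the percent escapes are always % followed by exactly two hex digits. The helper name decode_logged_query is hypothetical, and the quoting of each decoded group follows the substitution above.

```perl
use strict;
use warnings;

# Hypothetical helper that keeps the two substitutions as separate statements.
sub decode_logged_query {
    my ($text) = @_;

    # Step 1: percent-decoding (%20 -> space); exactly two hex digits.
    $text =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

    # Step 2: char(n,n,...) -> quoted string of the corresponding characters.
    $text =~ s{char\(([\d,]*)\)}{
        '"' . join('', map { chr($_) } split /,/, $1) . '"'
    }ge;

    return $text;
}

print decode_logged_query('select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45)'), "\n";
# select "-x1-Q-","-x2-Q-"
```

Decoding the percent escapes first keeps each regular expression simple; the order would matter if the char(...) text itself could arrive percent-encoded.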
This will do as you ask. It performs the decoding in two stages: first the URI-encoding is decoded using chr hex $1, and then each char() function is translated to the string corresponding to the character equivalents of its decimal parameters
use strict;
use warnings 'all';
use feature 'say';
my $s = 'select%20char(45,120,49,45,81,45),char(45,120,50,45,81,45),char(45,120,51,45,81,45)';
$s =~ s/%([0-9A-Fa-f]{2})/ chr hex $1 /eg; # a percent escape is exactly two hex digits
$s =~ s{ char \s* \( ( [^()]+ ) \) }{ join '', map chr, $1 =~ /\d+/g }xge;
say $s;
output
select -x1-Q-,-x2-Q-,-x3-Q-

How can I prevent Perl from interpreting double-backslash as single-backslash character?

How can I print a single-quoted string containing double-backslash \\ characters as is, without Perl turning each pair into a single backslash \? I also don't want to alter the string by adding more escape characters.
my $string1 = 'a\\\b';
print $string1; #prints 'a\\b' (the leading \\ collapses to a single \)
my $string1 = 'a\\\\b';
#I know I can alter the string to escape each backslash
#but I want to keep string as is.
print $string1; #prints 'a\\b'
#I can also use single-quoted here document
#but unfortunately this would make my code syntactically look horrible.
my $string1 = <<'EOF';
a\\b
EOF
print $string1; #prints a\\b, with newline that could be removed with chomp
The only quoting construct in Perl that doesn't interpret backslashes at all is the single-quoted here document:
my $string1 = <<'EOF';
a\\\b
EOF
print $string1; # Prints a\\\b, with newline
Because here-docs are line-based, it's unavoidable that you will get a newline at the end of your string, but you can remove it with chomp.
Other techniques are simply to live with it and backslash your strings correctly (for small amounts of data), or to put them in a __DATA__ section or an external file (for large amounts of data).
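The here-document approach combined with chomp can be sketched as follows. Wrapping the assignment in chomp(...) is just one way to drop the trailing newline; the variable name is taken from the question.

```perl
use strict;
use warnings;

# The single-quoted here-document keeps every backslash literally;
# chomp removes the newline that the line-based syntax adds.
chomp(my $string1 = <<'EOF');
a\\\b
EOF

print length($string1), "\n";    # 5: 'a', three backslashes, 'b'
print "$string1\n";              # a\\\b
```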
If you are mildly crazy, and like the idea of using experimental software that mucks about with perl's internals to improve the aesthetics of your code, you can use the Syntax::Keyword::RawQuote module, on CPAN since this morning.
use syntax 'raw_quote';
my $string1 = r'a\\\b';
print $string1; # prints 'a\\\b'
Thanks to @melpomene for the inspiration.
Since the backslash interpolation happens in string literals, perhaps you could declare your literals using some other arbitrary symbol, then substitute them for something else later.
my $string = 'a!!!b';
$string =~ s{!}{\\}g;
print $string; #prints 'a\\\b'
Of course it doesn't have to be !, any symbol that does not conflict with a normal character in the string will do. You said you need to make a number of strings, so you could put the substitution in a function
sub bs {
$_[0] =~ s{!}{\\}gr
}
my $string = 'a!!!b';
print bs($string); #prints 'a\\\b'
P.S.
That function uses the non-destructive substitution modifier /r introduced in v5.14. If you are using an older version, then the function would need to be written like this
sub bs {
$_[0] =~ s{!}{\\}g;
return $_[0];
}
Or if you like something more readable
sub bs {
my $str = shift;
$str =~ s{!}{\\}g;
return $str;
}

How do I decode a double backslashed PERLQQ escaped string into Perl characters?

I read lines from a file which contains semi-utf8 encoding and I wish to convert it to Perl-internal representation for further operations.
file.in (plain ASCII):
MO\\xc5\\xbdN\\xc3\\x81
NOV\\xc3\\x81
These should translate to MOŽNÁ and NOVÁ.
I load the lines and upgrade them to proper utf8 notation, i.e. \\xc5\\xbd -> \x{00c5}\x{00bd}. Then I would like to take this upgraded $line and make Perl represent it internally:
for my $line (@lines) {
$line =~ s/x(..)/x{00$1}/g;
eval { $l = "$line"; };
}
Unfortunately, without success.
use File::Slurp qw(read_file);
use Encode qw(decode);
use Encode::Escape qw();
my $string =
decode 'UTF-8', # octets → characters
decode 'unicode-escape', # \x → octets
decode 'ascii-escape', # \\x → \x
read_file 'file.in';
Read from the bottom upwards.
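If Encode::Escape is not available, the same stages can be approximated with a plain substitution plus the core Encode module. This is only a sketch: it assumes each line contains nothing but \\xHH escapes (one or two literal backslashes, then x and two hex digits), and decode_escaped_line is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: escaped text -> octets -> Perl characters.
sub decode_escaped_line {
    my ($line) = @_;
    # One or two literal backslashes followed by xHH become a single octet.
    $line =~ s/\\{1,2}x([0-9A-Fa-f]{2})/chr hex $1/ge;
    # The octets are UTF-8, so decode them into characters.
    return decode('UTF-8', $line);
}

binmode STDOUT, ':encoding(UTF-8)';
# In single quotes, \\\\ is two literal backslashes, as in file.in.
print decode_escaped_line('MO\\\\xc5\\\\xbdN\\\\xc3\\\\x81'), "\n";    # MOŽNÁ
```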

Convert utf-8 into html &...;

In Perl, how can I convert string containing utf-8 characters to HTML where such characters will be converted into &...; ?
First, split on an empty pattern to get a list of single characters. Then, map each character to itself, if it is ASCII, or its code, if it is not:
use Encode qw( decode_utf8 );
my $utf8_string = "\xE2\x80\x9C\x68\x6F\x6D\x65\xE2\x80\x9D";
my $unicode_string = decode_utf8($utf8_string);
my $html = join q(),
map { ord > 127 ? "&#" . ord . ";"
: $_
} split //, $unicode_string;
Just replace every character outside the printable low-ASCII range (that is, anything outside the \x20 - \x7F region) with a numeric HTML entity built from its ord. Perl's substitution operator has the /e flag to indicate that the replacement should be evaluated as code.
use utf8;
my $str = "testТест"; # This is correct UTF-8 string right in the code
$str =~ s/([^\x20-\x7F])/"&#" . ord($1) . ";"/eg;
print $str;
# test&#1058;&#1077;&#1089;&#1090;
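Both answers boil down to the same mapping, which could be wrapped in a small function. The sketch below assumes the input is already a decoded character string (not raw UTF-8 bytes), and the name to_numeric_entities is hypothetical. It does not escape &, < or >, and it also turns control characters such as newlines into entities.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character outside printable ASCII
# (\x20-\x7E) with a decimal numeric character reference.
sub to_numeric_entities {
    my ($chars) = @_;
    $chars =~ s/([^\x20-\x7E])/sprintf '&#%d;', ord $1/ge;
    return $chars;
}

# "test" followed by the Cyrillic letters U+0422 U+0435 U+0441 U+0442
print to_numeric_entities("test\x{422}\x{435}\x{441}\x{442}"), "\n";
# test&#1058;&#1077;&#1089;&#1090;
```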

Perl - Unicode::String sub need to add/convert for Latin-9 support

Part 3 (Part 2 is here) (Part 1 is here)
Here is the Perl module I'm using: Unicode::String
How I'm calling it:
print "Euro: ";
print unicode_encode("€")."\n";
print "Pound: ";
print unicode_encode("£")."\n";
I would like it to return this format:
&#x20AC; # Euro
&#xA3; # Pound
The function is below:
sub unicode_encode {
shift() if ref( $_[0] );
my $toencode = shift();
return undef unless defined($toencode);
print "Passed: ".$toencode."\n";
Unicode::String->stringify_as("utf8");
my $unicode_str = Unicode::String->new();
my $text_str = "";
my $pack_str = "";
# encode Perl UTF-8 string into latin1 Unicode::String
# - currently only Basic Latin and Latin 1 Supplement
# are supported here due to issues with Unicode::String .
$unicode_str->latin1($toencode);
print "Latin 1: ".$unicode_str."\n";
# Convert to hex format ("U+XXXX U+XXXX ")
$text_str = $unicode_str->hex;
# Now, the interesting part.
# We must search for the (now hex-encoded)
# Unicode escape sequence.
my $pattern =
'U\+005[C|c] U\+0058 U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f]) U\+00([0-9A-Fa-f])([0-9A-Fa-f])';
# Replace escapes with entities (beginning of string)
$_ = $text_str;
if (/^$pattern/) {
$pack_str = pack "H8", "$1$2$3$4$5$6$7$8";
$text_str =~ s/^$pattern/\&#x$pack_str/;
}
# Replace escapes with entities (middle of string)
$_ = $text_str;
while (/ $pattern/) {
$pack_str = pack "H8", "$1$2$3$4$5$6$7$8";
$text_str =~ s/ $pattern/\;\&#x$pack_str/;
$_ = $text_str;
}
# Replace "U+" with "&#x" (beginning of string)
$text_str =~ s/^U\+/&#x/;
# Replace " U+" with ";&#x" (middle of string)
$text_str =~ s/ U\+/;&#x/g;
# Append ";" to end of string to close last entity.
# This last ";" at the end of the string isn't necessary in most parsers.
# However, it is included anyways to ensure full compatibility.
if ( $text_str ne "" ) {
$text_str .= ';';
}
return $text_str;
}
I need to get the same output but also need to support Latin-9 characters, and Unicode::String is limited to Latin-1. Any thoughts on how I can get around this?
I have a couple of other questions, and I think I have some understanding of Unicode and encodings, but I am also short on time.
Thanks to anyone who helps me out!
As you have been told already, Unicode::String is not an appropriate choice of module. Perl ships with a module called 'Encode' which can do everything you need.
If you have a character string in Perl like this:
my $euro = "\x{20ac}";
You can convert it to a string of bytes in Latin-9 like this:
use Encode qw(encode);
my $bytes = encode("iso8859-15", $euro);
The $bytes variable will now contain \xA4.
Or you can have Perl convert it automatically on output to a filehandle like this:
binmode(STDOUT, ":encoding(iso8859-15)");
You can refer to the documentation for the Encode module. And also, PerlIO describes the encoding layer.
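To tie this back to the entity format asked for in the question, here is a hedged sketch using only the core Encode module. It shows the round trip between a character and its Latin-9 octet, and builds a hexadecimal entity from the decoded character; the entity formatting is an assumption about the desired output, not something Encode provides.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $euro = "\x{20ac}";                       # EURO SIGN as a Perl character

# Character string -> Latin-9 octets
my $bytes = encode('iso-8859-15', $euro);
printf "%02X\n", ord $bytes;                 # A4

# Latin-9 octets -> character string -> hexadecimal HTML entity
my $chars  = decode('iso-8859-15', "\xA4");
my $entity = sprintf '&#x%X;', ord $chars;
print "$entity\n";                           # &#x20AC;
```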
I know you are determined to ignore this final piece of advice but I'll offer it one last time. Latin-9 is a legacy encoding. Perl can quite happily read Latin-9 data and convert it to UTF-8 on the fly (using binmode). You should not be writing more software that generates Latin-9 data; you should be migrating away from it.