Convert a character to and from its decimal, binary, octal, or hexadecimal representation in BASH / Shell (Unicode)

How do you convert a character to and from its decimal, binary, octal, or hexadecimal representation in BASH / Shell?

Convert a character from and to its decimal, binary, octal, or hexadecimal representations in BASH with printf and od
Some relevant documentation and Q&A:
od manual: https://www.gnu.org/software/coreutils/manual/html_node/od-invocation.html
GNU printf manual: https://www.gnu.org/software/coreutils/manual/html_node/printf-invocation.html
POSIX printf: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/printf.html
What is the difference between UTF-8 and Unicode?
How do I print an ASCII character by different code points in Bash?
How to print an octal value's corresponding UTF-8 character in bash?
Unicode char representations in BASH / shell: printf vs od
Convert binary, octal, decimal and hexadecimal values between each other in BASH / Shell
Convert a character from and to its decimal representation
single_ascii_char="A"
echo -n "$single_ascii_char" | od -A n -t d1
65
printf %d "'$single_ascii_char"
65
code=65
printf "\u$(printf %04x $code)\n" # use \u for up to 4 hexadecimal digits
A
printf "\U$(printf %08x $code)\n" # use \U for up to 8 hexadecimal digits
A
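The od variant extends naturally to whole strings, printing one decimal value per byte:
echo -n "ABC" | od -A n -t d1
  65  66  67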
single_unicode_char="😈"
printf %d "'$single_unicode_char"
128520
echo -n "$single_unicode_char" | iconv -t UTF-32LE | od -A n -t d # d or u, d4, u4, dI, dL; use UTF-32BE on a big-endian system
128520
code=128520
printf "\u$(printf %04x $code)\n" # use \u for up to 4 hexadecimal digits
ὠ8 # \u reads at most 4 hex digits, so \u1f608 is U+1F60 ("ὠ") followed by a literal "8"
printf "\U$(printf %08x $code)\n" # use \U for up to 8 hexadecimal digits
😈
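To map every character of a string to its code point, the printf trick can be combined with a character-wise read loop. A minimal sketch (bash in a UTF-8 locale; the variable names are arbitrary):
s="Aé😈"
while IFS= read -r -n1 c; do               # read -n1 reads one (multibyte) character
  [ -n "$c" ] && printf 'U+%04X\n' "'$c"   # skip the trailing newline from <<<
done <<< "$s"
U+0041
U+00E9
U+1F608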
Convert a character from and to its binary representation
single_ascii_char="A"
echo "obase=2; $(printf %d "'$single_ascii_char")" | bc
1000001
code="1000001"
printf "\u$(printf %04x $((2#$code)) )\n" # use \u for up to 4 hexadecimal digits
A
printf "\U$(printf %08x $((2#$code)) )\n" # use \U for up to 8 hexadecimal digits
A
single_unicode_char="😈"
echo "obase=2; $(printf %d "'$single_unicode_char")" | bc
11111011000001000
code="11111011000001000" # with or without leading 0s
printf "\u$(printf %04x $((2#$code)) )\n" # use \u for up to 4 hexadecimal digits
ὠ8
printf "\U$(printf %08x $((2#$code)) )\n" # use \U for up to 8 hexadecimal digits
😈
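bc can also read the binary back in via ibase, as an alternative to bash's $((2#...)) arithmetic; a sketch:
echo "ibase=2; $code" | bc
128520
printf "\U$(printf %08x "$(echo "ibase=2; $code" | bc)")\n"
😈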
Convert a character from and to its octal representation
single_ascii_char="A"
printf %o "'$single_ascii_char"
101
echo -n "$single_ascii_char" | od -A n -t o1
101
code="\101"
printf %b "$code\n"
A
printf "$code\n"
A
single_unicode_char="😈"
printf %o "'$single_unicode_char"
373010
echo -n "$single_unicode_char" | iconv -t UTF-32LE | od -A n -t o # or o4; use UTF-32BE on a big-endian system
00000373010
code="00000373010" # insert at least one leading 0 for printf to understand it's an octal
printf "\U$(printf %08x "$code")\n"
😈
echo -n "$single_unicode_char" | od -A n -t c # c or o1
360 237 230 210
code="\360\237\230\210"
printf %b "$code\n"
😈
printf "$code\n"
😈
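bash's ANSI-C quoting ($'...') understands the same octal escapes, so the bytes can also be emitted without printf:
echo $'\360\237\230\210' # $'\U0001F608' works here too in bash 4.2+
😈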
Convert a character from and to its hexadecimal representation
single_ascii_char="A"
printf %x "'$single_ascii_char"
41
echo -n "$single_ascii_char" | od -A n -t x1
41
code="41"
printf "\u$code\n" # use \u for up to 4 hexadecimal digits
A
printf "\U$code\n" # use \U for up to 8 hexadecimal digits
A
single_unicode_char="😈"
printf %x "'$single_unicode_char"
1f608
printf %X "'$single_unicode_char"
1F608
echo -n "$single_unicode_char" | iconv -t UTF-32LE | od -A n -t x # or x4; use UTF-32BE on a big-endian system
0001f608
code="1f608"
printf "\u$code\n" # use \u for up to 4 hexadecimal digits
ὠ8
printf "\U$code\n" # use \U for up to 8 hexadecimal digits
😈
printf %#x "'$single_unicode_char"
0x1f608
printf %#X "'$single_unicode_char"
0X1F608
code="0x1f608"
printf "\u$(printf %04x $code)\n" # use \u for up to 4 hexadecimal digits
ὠ8
printf "\U$(printf %08x $code)\n" # use \U for up to 8 hexadecimal digits
😈
echo -n "$single_unicode_char" | od -A n -t x1
f0 9f 98 88
code="\xf0\x9f\x98\x88"
printf %b "$code\n"
😈
printf "$code\n"
😈
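The whole round trip can be packaged into two small helpers. A sketch (the names ord and chr are my own; the \U escape needs bash 4.2+):
ord() { printf %d "'$1"; }                 # character -> decimal code point
chr() { printf "\U$(printf %08x "$1")"; }  # code point -> character
chr "$(ord 😈)"; echo
😈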

Related

What is the difference between :encoding(utf8) and :utf8? [duplicate]

So as you probably know, in Perl "utf8" means Perl's looser understanding of UTF-8 which allows characters that technically aren't valid code points in UTF-8. By contrast "UTF-8" (or "utf-8") is Perl's stricter understanding of UTF-8 which doesn't allow invalid code points.
I have a few usage questions related to this distinction:
Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?
What happens when you read and write files which were open'd using "UTF-8"? Does character substitution happen to bad characters or does something else happen?
What is the difference between using open with a layer like '>:utf8' and a layer like '>:encoding(utf8)'? Can both approaches be used with both 'utf8' and 'UTF-8'?
                    On read:                On read: outside of      On write: outside of
                    invalid encoding        Unicode, Unicode         Unicode, Unicode
                    (other than             nonchar, or Unicode      nonchar, or Unicode
                    sequence length)        surrogate                surrogate
:encoding(UTF-8)    Warns and Replaces      Warns and Replaces       Warns and Replaces
:encoding(utf8)     Warns and Replaces      Accepts                  Warns and Outputs
:utf8               Corrupt scalar          Accepts                  Warns and Outputs
(This is the state in Perl 5.26.)
Note that :encoding(UTF-8) actually decodes using utf8, then checks if the resulting character is in the acceptable range. This reduces the number of error messages for bad input, so it's good.
(Encoding names are case-insensitive.)
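For reference, these layers are selected per filehandle at open() time, or globally via the open pragma as in the tests below. A minimal sketch (the file names are placeholders):
$ perl -e '
    open my $in,  "<:encoding(UTF-8)", "in.txt"  or die $!;  # strict decode on read
    open my $out, ">:encoding(UTF-8)", "out.txt" or die $!;  # strict encode on write
    print $out $_ while <$in>;
'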
Tests used to generate the above table:
On read
:encoding(UTF-8)
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":encoding(UTF-8)";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
utf8 "\xFFFF" does not map to Unicode.
utf8 "\xD800" does not map to Unicode.
utf8 "\x200000" does not map to Unicode.
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
:encoding(utf8)
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":encoding(utf8)";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
utf8 "\x80" does not map to Unicode.
E9 (internal: C3.A9, UTF8=1)
FFFF (internal: EF.BF.BF, UTF8=1)
D800 (internal: ED.A0.80, UTF8=1)
200000 (internal: F8.88.80.80.80, UTF8=1)
5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
:utf8
$ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
perl -MB -nle'
use open ":std", ":utf8";
my $sv = B::svref_2object(\$_);
printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
'
E9 (internal: C3.A9, UTF8=1)
FFFF (internal: EF.BF.BF, UTF8=1)
D800 (internal: ED.A0.80, UTF8=1)
200000 (internal: F8.88.80.80.80, UTF8=1)
Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
0 (internal: 80, UTF8=1)
On write
:encoding(UTF-8)
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print "\x{E9}\n";
print "\x{FFFF}\n";
print "\x{D800}\n";
print "\x{20_0000}\n";
' >a
Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
"\x{ffff}" does not map to utf8.
"\x{d800}" does not map to utf8.
"\x{200000}" does not map to utf8.
$ od -t c a
0000000 303 251 \n \ x { F F F F } \n \ x { D
0000020 8 0 0 } \n \ x { 2 0 0 0 0 0 } \n
0000040
$ cat a
é
\x{FFFF}
\x{D800}
\x{200000}
:encoding(utf8)
$ perl -e'
use open ":std", ":encoding(utf8)";
print "\x{E9}\n";
print "\x{FFFF}\n";
print "\x{D800}\n";
print "\x{20_0000}\n";
' >a
Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
$ od -t c a
0000000 303 251 \n 355 240 200 \n 370 210 200 200 200 \n
0000015
$ cat a
é
▒
▒
:utf8
Same results as :encoding(utf8).
Tested using Perl 5.26.
Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?
Perl strings are strings of 32-bit or 64-bit characters, depending on the build. utf8 can encode any 72-bit integer, so it is capable of encoding every character it can be asked to encode. In other words: no, with the looser "utf8" there are no unencodable characters, so no substitution takes place.
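A quick way to see the difference with Encode directly (a sketch; the expected bytes are shown below, and your Encode version may also print warnings):
$ perl -MEncode -e '
    my $s = "\x{D800}";                          # a surrogate: invalid in strict UTF-8
    printf "UTF-8: %vX\n", encode("UTF-8", $s);  # substituted with U+FFFD
    printf "utf8:  %vX\n", encode("utf8",  $s);  # encoded as-is
'
UTF-8: EF.BF.BD
utf8:  ED.A0.80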

How to Replace special Character in Unix Command

My source data contains special characters that are not in a readable format. Can anyone help with the below?
Source data:
Command tried:
sed 's/../t/g' test.txt > test2.txt
You can use tr to keep only printable characters:
tr -cd "[:print:]" <test.txt > test2.txt
This uses tr's delete option on the non-printable characters (the [:print:] criteria is negated by the -c option).
If you want to replace those special chars by something else (ex: X):
tr -c "[:print:]" "X" <test.txt > test2.txt
With sed, you could try the following to replace non-printable characters with X:
sed -r 's/[^[:print:]]/X/g' text.txt > test2.txt
It works on some input but fails on characters >127 (maybe because the one I tried renders as the printable ▒!) on my machine, whereas tr works perfectly.
Inline examples (printf to generate special chars + the filter + od to show the bytes):
$ printf "\x01ABC\x05\xff\xe0" | od -c
0000000 001 A B C 005 377 340
0000007
$ printf "\x01ABC\x05\xff\xe0" | sed "s/[^[:print:]]//g" | od -c
0000000 A B C 377 340
0000005
$ printf "\x01ABC\x05\xff\xe0" | tr -cd "[:print:]" | od -c
0000000 A B C
0000003
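One caveat: [:print:] does not include tabs or newlines, so tr -cd "[:print:]" also strips line breaks. Adding [:space:] keeps them (a sketch):
$ printf "A\tB\nC\n" | tr -cd "[:print:][:space:]" | od -c
0000000   A  \t   B  \n   C  \n
0000006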

Handle command line arguments with different radices in Perl

When used as literals in a Perl program, numbers are handled in the same way as in C: "0x" prefix means hexadecimal, "0" prefix means octal, and no prefix means decimal:
$ perl -E 'print 0x23 . "\n"'
35
$ perl -E 'print 023 . "\n"'
19
$ perl -E 'print 23 . "\n"'
23
I would like to pass command line arguments to Perl using the same notation. e.g. If I pass 23, I want to convert the string argument to a decimal value (23). If I pass 0x23, I want to convert to a hexadecimal value (35), and 023 would be converted to octal (19). Is there a built-in way to handle this? I am aware of hex() and oct(), but they interpret numbers with no prefix to be hex/oct respectively (not decimal). Following this convention, it seems that I want a dec() function, but I don't think that exists.
From http://answers.oreilly.com/topic/419-convert-binary-octal-and-hexidecimal-numbers-in-perl/:
print "Gimme an integer in decimal, binary, octal, or hex: ";
$num = <STDIN>;
chomp $num;
exit unless defined $num;
$num = oct($num) if $num =~ /^0/; # catches 077 0b10 0x20
printf "%d %#x %#o %#bn", ($num) x 4;

Difference between passing string "0x30" and hexadecimal number 0x30 to hex() function

print hex("0x30"); gives the correct hex to decimal conversion.
What does
print hex(0x30); mean?
The value it's giving is 72.
hex() takes a string argument, so due to Perl's weak typing it will read the argument as a string whatever you pass it.
The former is passing 0x30 as a string, which hex() then directly converts to decimal.
The latter is the hex number 0x30, which is 48 in decimal; it is passed to hex(), stringified to "48", interpreted as hex again, and converted to the decimal number 72. Think of it as doing hex(hex("0x30")).
You should stick with hex("0x30").
$ perl -e 'print 0x30';
48
$ perl -e 'print hex(0x30)';
72
$ perl -e 'print hex(30)';
48
$ perl -e 'print hex("0x30")';
48
$ perl -e 'print hex(hex(30))';
72
To expand on marcog's answer: From perldoc -f hex
hex EXPR: Interprets EXPR as a hex string and returns the corresponding value.
So hex really is for converting a hex string to a numeric value. By typing the literal 0x30 you have already created that numeric value.
perl -E '
say 0x30;
say hex("0x30");
say 0x48;
say hex(0x30);
say hex(hex("0x30"));'
gives
48
48
72
72
72
hex() parses a hex string and returns the appropriate integer.
So when you do hex(0x30), your numeric literal 0x30 gets evaluated first (0x30 is 48 in decimal); hex() then treats that scalar as the string "48" and converts it to a number, assuming the string is in hex. 0x48 == 72, which is where the 72 comes from.
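The safe pattern is to keep numbers as numbers and call hex() only on strings, e.g. a sprintf/hex round trip:
$ perl -E '
    my $n = 48;
    my $s = sprintf "0x%x", $n;  # number -> hex string "0x30"
    say hex($s);                 # hex string -> number
'
48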