How does UTF-16 and UTF-8 conversion happen?

I'm confused about how Unicode code points are converted to UTF-16, and I'm looking for someone who can explain it in the simplest way possible.
For a character like "𐒌" we get:
d801dc8c --> UTF-16
0001048c --> UTF-32
f090928c --> UTF-8
66700 --> decimal value
The UTF-16 hexadecimal value converts to "11011000 00000001 11011100 10001100" in binary, which is "3624000652" in decimal. So my question is: how do we get this hexadecimal value, and how can we convert it back to the real code point, "66700"?
The UTF-32 hexadecimal value converts to "00000000 00000001 00000100 10001100", which is "66700" in decimal, but the UTF-16 value doesn't convert back to "66700"; instead we get "3624000652".
How does the conversion actually happen?
For UTF-8, the 4-byte encoding looks like 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
But how does this work in UTF-16? If anyone can explain it in the simplest possible way, that would be a huge help; I've been searching for the past few days and haven't found an answer that makes sense to me.
The websites I used for the conversions were Branah.com and rapidtables.com.

how do we get this value
how can we convert it back to the real code point
about surrogate pairs, how do they work?
Study the algorithm for encoding to UTF-16:
my $U = 66_700; # code point
if ($U > 0xffff) {
my $U_prime = $U - 0x1_0000; # some intermediate value 0x0_0000 .. 0xF_FFFF
sprintf '%d', $U_prime; # 1164
sprintf '0x%04X', $U_prime; # 0x048C
sprintf '0b%020b', $U_prime; # 0b00000000010010001100
my $high_ten_bits = $U_prime >> 10; # range 0x000 .. 0x3FF
sprintf '0b%010b', $high_ten_bits; # 0b0000000001
my $low_ten_bits = $U_prime % 2**10; # range 0x000 .. 0x3FF
sprintf '0b%010b', $low_ten_bits; # 0b0010001100
my $W1 = $high_ten_bits + 0xD800; # high surrogate
sprintf '%d', $W1; # 55297
sprintf '0x%04X', $W1; # 0xD801
sprintf '0b%016b', $W1; # 0b1101100000000001
my $W2 = $low_ten_bits + 0xDC00; # low surrogate
sprintf '%d', $W2; # 56460
sprintf '0x%04X', $W2; # 0xDC8C
sprintf '0b%016b', $W2; # 0b1101110010001100
# finally emit the concatenation of W1 and W2
# your original arithmetic checks out:
($W1 << 16) + $W2 # 3624000652
}
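The encoding steps above can be condensed into a small function. This is only a sketch: `utf16_units` is a hypothetical helper name, not something from a module, and the range check is an addition that the walkthrough above does not show.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the UTF-16 code unit(s) for one code point.
sub utf16_units {
    my ($cp) = @_;
    die "not a Unicode scalar value\n"
        if $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF);
    return ($cp) if $cp <= 0xFFFF;      # BMP: one unit, no surrogates needed
    my $v = $cp - 0x1_0000;             # 20-bit intermediate value
    return (0xD800 + ($v >> 10),        # high surrogate: top ten bits
            0xDC00 + ($v & 0x3FF));     # low surrogate: bottom ten bits
}

my @units = utf16_units(66_700);
printf "%04X %04X\n", @units;           # D801 DC8C
```

Code points up to 0xFFFF come back as a single unit, which is why only characters outside the Basic Multilingual Plane show the "strange" D8xx/DCxx values.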
Reverse direction:
my @octets = (0xD8, 0x01, 0xDC, 0x8C);
my $W1 = ($octets[0] << 8) + $octets[1];
sprintf '%d', $W1; # 55297
sprintf '0x%04X', $W1; # 0xD801
sprintf '0b%016b', $W1; # 0b1101100000000001
my $W2 = ($octets[2] << 8) + $octets[3];
sprintf '%d', $W2; # 56460
sprintf '0x%04X', $W2; # 0xDC8C
sprintf '0b%016b', $W2; # 0b1101110010001100
my $high_ten_bits = $W1 - 0xD800;
sprintf '0b%010b', $high_ten_bits; # 0b0000000001
my $low_ten_bits = $W2 - 0xDC00;
sprintf '0b%010b', $low_ten_bits; # 0b0010001100
my $U_prime = ($high_ten_bits << 10) + $low_ten_bits;
sprintf '%d', $U_prime; # 1164
sprintf '0x%04X', $U_prime; # 0x048C
sprintf '0b%020b', $U_prime; # 0b00000000010010001100
my $U = $U_prime + 0x1_0000;
sprintf '%d', $U; # 66700
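As a compact version of the reverse direction, here is a sketch that combines a surrogate pair back into a code point and cross-checks the result with the core Encode module. `from_surrogates` is a hypothetical helper name, and the range checks are an assumption added for safety.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Hypothetical helper: combines a surrogate pair into one code point.
sub from_surrogates {
    my ($w1, $w2) = @_;
    die "not a high surrogate\n" unless $w1 >= 0xD800 && $w1 <= 0xDBFF;
    die "not a low surrogate\n"  unless $w2 >= 0xDC00 && $w2 <= 0xDFFF;
    return 0x1_0000 + (($w1 - 0xD800) << 10) + ($w2 - 0xDC00);
}

my $U = from_surrogates(0xD801, 0xDC8C);
printf "%d U+%04X\n", $U, $U;           # 66700 U+1048C

# Cross-check: let Encode decode the same four octets as big-endian UTF-16.
my $char = decode('UTF-16BE', pack('C4', 0xD8, 0x01, 0xDC, 0x8C));
printf "%d\n", ord $char;               # 66700
```

In real code you would normally just call Encode; the manual arithmetic is mainly useful for understanding what the bytes mean.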

Related

Perl: Separate the MSB and rest of the bits from a hex

I need to separate the MSB and the rest of the bits from a hex string. For example, I have a hex string a2, which is equivalent to 1010 0010. I want to separate out the MSB (1 in this case) and convert the rest of the number to decimal. I think I can do something like this:
$hex = 'a2';
$dec = hex($hex);
$bin = sprintf("%b", $dec);
$msb = substr $bin, 0, 1;
$rest = substr $bin, 1, 7;
$restDec = oct("0b" . $rest);
However, I do not like using strings for bit operations. Is there a better way of doing this?

Trivial using bitwise operators:
$msb = ($dec & 128) >> 7;
$rest = ($dec & 127);
Explanation:
Decimal 128 is 0x80 or 0b1000_0000, so using the bitwise "and" operator with 128 masks (sets to zero) all but the top bit, which we then shift down to the LSB where the result ends up being 0 or 1. In actuality you could dispense with the masking operation and just shift right but explicitly masking has two advantages:
It makes the intent crystal clear, and
works even if you inadvertently apply this to a number larger than 255.
Decimal 127 is 0x7F or 0b0111_1111 and bitwise "and-ing" this with $dec sets the MSB to zero while leaving alone the rest of the bits.
Additional note: Perl has hexadecimal numeric literals (0x...) and binary literals (0b...), so the above could also be written
$msb = ($dec & 0x80) >> 7;
$rest = ($dec & 0x7F);
Or even
$msb = ($dec & 0b10000000) >> 7;
$rest = ($dec & 0b01111111);
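Putting the masking together with the hex conversion from the question, a minimal sketch might look like this. `split_msb` is a hypothetical name, and it assumes the input is a single byte written as two hex digits.

```perl
use strict;
use warnings;

# Hypothetical helper: splits one byte, given as a hex string, into its
# top bit and the value of the remaining seven bits.
sub split_msb {
    my ($hex) = @_;
    my $dec = hex $hex;
    return (($dec & 0x80) >> 7, $dec & 0x7F);
}

my ($msb, $rest) = split_msb('a2');
print "$msb $rest\n";                   # 1 34
```

No intermediate binary string is needed; `hex` already gives a number that the bitwise operators can work on directly.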

How to do bitwise compare of two hex strings?

I have two strings representing hex numbers which I need to compare bitwise. Each hex number equates to 256 bits. I need to determine how many of the bits are different. How can I do this in Perl?
$hash1 = "7ff005f88270898ec31359b9ca80213165a318f149267e4c2f292a00e216e4ef";
$hash2 = "3fb40df88a78890e815251b1fb8021356da330f149266f453f292a11e216e4ee";
My question is similar to this question but I need to do it in Perl.

my $bytes1 = pack('H*', $hash1);
my $bytes2 = pack('H*', $hash2);
my $xor = unpack('B*', $bytes1 ^ $bytes2);
my $count = $xor =~ tr/1//;
pack('H*', ...) converts the hex strings into byte strings. The byte strings are then XORed and converted to a bit string with unpack('B*', ...). The tr operator is used to count the number of 1s (different bits) in the bit string.
Or, using the checksum trick described here:
my $bytes1 = pack('H*', $hash1);
my $bytes2 = pack('H*', $hash2);
my $count = unpack('%32B*', $bytes1 ^ $bytes2);

$hash1 =~ s/([a-f0-9][a-f0-9])/unpack('B*',pack('H*',$1))/egi;
$hash2 =~ s/([a-f0-9][a-f0-9])/unpack('B*',pack('H*',$1))/egi;
$count = ($hash1 ^ $hash2) =~ tr/\0//c;
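For reuse, the pack/XOR/count idea can be wrapped in a function. This is a sketch under the assumption that both inputs are hex strings of the same length; `bit_diff` is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: number of differing bits between two hex strings
# of equal length (a Hamming distance on the underlying bytes).
sub bit_diff {
    my ($hex1, $hex2) = @_;
    die "length mismatch\n" unless length($hex1) == length($hex2);
    # String XOR of the packed bytes; '%32B*' sums the set bits.
    return unpack('%32B*', pack('H*', $hex1) ^ pack('H*', $hex2));
}

print bit_diff('ff', '0f'), "\n";       # 4
print bit_diff('a5', 'a5'), "\n";       # 0
```

Note that `^` only does a string XOR when both operands are strings, which is why the values are packed first rather than converted with `hex`.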

Perl pack/unpack and length of binary string

Consider this short example:
$a = pack("d",255);
print length($a)."\n";
# Prints 8
$aa = pack("ddddd", 255,123,0,45,123);
print length($aa)."\n";
# Prints 40
@unparray = unpack("d "x5, $aa);
print scalar(@unparray)."\n";
# Prints 5
print length($unparray[0])."\n";
# Prints 3
printf "%d\n", $unparray[0];
# Prints 255
# As a one-liner:
# perl -e '$a = pack("d",255); print length($a)."\n"; $aa = pack("ddddd", 255,123,0,45,123); print length($aa)."\n"; @unparray = unpack("d "x5, $aa); print scalar(@unparray)."\n"; print length($unparray[0])."\n"; printf "%d\n", $unparray[0];'
Now, I'd expect a double-precision float to be eight bytes, so the first length($a) is correct. But why is the length after the unpack (length($unparray[0])) reporting 3 - when I'm trying to go back the exact same way (double-precision, i.e. eight bytes) - and the value of the item (255) is correctly preserved?

By unpacking what you packed, you've gotten back the original values, and the first value is 255. The stringification of 255 is "255", which is 3 characters long, and that's what length tells you.
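A short sketch of the difference, using my own variable names: `length` on the packed string counts bytes, while `length` on the unpacked number counts the characters of its string form.

```perl
use strict;
use warnings;

my $packed     = pack 'd', 255;         # native double as a byte string
my ($unpacked) = unpack 'd', $packed;   # back to an ordinary number

print length($packed),   "\n";          # 8 on builds with 8-byte doubles
print length($unpacked), "\n";          # 3, the length of the string "255"
print $unpacked + 1,     "\n";          # 256, it still behaves as a number
```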

How do I unpack a double-precision value in Perl?

From this question:
bytearray - Perl pack/unpack and length of binary string - Stack Overflow
I've learned that @unparray = unpack("d "x5, $aa); in the snippet below results in string items in @unparray, not double-precision numbers (as I expected).
Is it possible to somehow obtain an array of double-precision values from the $aa bytestring in the snippet below?:
$a = pack("d",255);
print length($a)."\n";
# prints 8
$aa = pack("ddddd", 255,123,0,45,123);
print length($aa)."\n";
# prints 40
@unparray = unpack("d "x5, $aa);
print scalar(@unparray)."\n";
# prints 5
print length($unparray[0])."\n";
# prints 3
printf "%d\n", $unparray[0];
# prints 255
# one liner:
# perl -e '$a = pack("d",255); print length($a)."\n"; $aa = pack("ddddd", 255,123,0,45,123); print length($aa)."\n"; @unparray = unpack("d "x5, $aa); print scalar(@unparray)."\n"; print length($unparray[0])."\n" '
Many thanks in advance for any answers,
Cheers!

What makes you think it's not stored as a double?
use feature qw( say );
use Config qw( %Config );
use Devel::Peek qw( Dump );
my @a = unpack "d5", pack "d5", 255,123,0,45,123;
say 0+@a; # 5
Dump $a[0]; # NOK (floating point format)
say $Config{nvsize}; # 8 byte floats on this build

Sorry, but you've misunderstood hobbs' answer to your earlier question.
$unparray[0] is a double-precision floating-point value; but length is not like (say) C's sizeof operator, and doesn't tell you the size of its argument. Rather, it converts its argument to a string, and then tells you the length of that string.
For example, this:
my $a = 3.0 / 1.5;
print length($a), "\n";
will print this:
1
because it sets $a to 2.0, which gets stringified as 2, which has length 1.

Perl Decimal to Binary 32-bit then 8-bit

I've got a number (3232251030) that needs to be translated from Decimal to Binary.
Once I've gotten the binary, I need to separate 8-bits of it into digits, revealing an ip address.
Converting Decimal to Binary is simple:
sub dec2bin {
    my $str = unpack("B32", pack("N", shift));
    $str =~ s/^0+(?=\d)//; # otherwise you'll get leading zeros
    return $str;
}
sub bin2dec {
    return unpack("N", pack("B32", substr("0" x 32 . shift, -32)));
}
e.g.
$num = bin2dec('0110110'); # $num is 54
$binstr = dec2bin(54); # $binstr is 110110
Reference: http://www.perlmonks.org/?node_id=2664
So now I need to split the binary into groups of 8 digits and convert each group into a number; together these make up an IP address.
$num = dec2bin('3232251030');
($num is "11000000 10101000 00111100 10010110" in binary)
I need to split "11000000 10101000 00111100 10010110" into 8-bit groups and save the result as "192.168.60.150".
Care to advise? I'm looking into the split function for this.

You don't actually have to convert to a binary string, just a 32-bit integer:
print join '.', unpack('CCCC', pack('N', 3232251030));
will print 192.168.60.150

say join('.', unpack('C4', pack('N', 3232251030)));
and
use Socket qw( inet_ntoa );
say inet_ntoa(pack('N', 3232251030));
both output
192.168.60.150
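For completeness, the same pack/unpack pairing also works in the opposite direction. A minimal sketch, with `int_to_ip` and `ip_to_int` as hypothetical helper names and no validation of the dotted-quad input:

```perl
use strict;
use warnings;

# Hypothetical helpers for converting in both directions.
sub int_to_ip {
    my ($n) = @_;
    # 'N' packs a 32-bit big-endian integer; 'C4' reads it back as 4 bytes.
    return join '.', unpack('C4', pack('N', $n));
}

sub ip_to_int {
    my ($ip) = @_;
    # Pack the four octets as bytes, then read them as one 32-bit integer.
    return unpack('N', pack('C4', split(/\./, $ip)));
}

print int_to_ip(3232251030), "\n";          # 192.168.60.150
print ip_to_int('192.168.60.150'), "\n";    # 3232251030
```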
