How are surrogate pairs calculated? - unicode

If Unicode uses 17-bit code points, how are surrogate pairs calculated from the code points?

Unicode code points are scalar values which range from 0x000000 to 0x10FFFF. Thus they are 21-bit integers, not 17-bit.
Surrogate pairs are a mechanism of the UTF-16 form. This represents the 21-bit scalar values as one or two 16-bit code units.
Scalar values from 0x000000 to 0x00FFFF are represented as a single 16-bit code unit, from 0x0000 to 0xFFFF.
Code points from 0x00D800 to 0x00DFFF are reserved for surrogates; they are not scalar values and will never occur in a well-formed Unicode character string.
Scalar values from 0x010000 to 0x10FFFF are represented as two 16-bit code units. First 0x10000 is subtracted from the scalar value, leaving a 20-bit number; this subtraction is the bit of trickiness that squeezes the plane values 0x01-0x10 into four bits. The first code unit encodes the upper 10 bits of that number, as a code unit in the range 0xD800-0xDBFF. The second code unit encodes the lower 10 bits, as a code unit in the range 0xDC00-0xDFFF.
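For example, take U+10437: 0x10437 - 0x10000 = 0x0437. The upper 10 bits of that value are 0x0001, so the first code unit is 0xD800 + 0x0001 = 0xD801; the lower 10 bits are 0x0037, so the second code unit is 0xDC00 + 0x0037 = 0xDC37. U+10437 is therefore encoded as the surrogate pair 0xD801 0xDC37.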
This is explained in detail, with sample code, in the Unicode consortium's FAQ, UTF-8, UTF-16, UTF-32 & BOM. That FAQ refers to the section of the Unicode Standard which has even more detail.

If it is code you are after, here is how a single codepoint is encoded in UTF-16 and UTF-8 respectively.
A single codepoint to UTF-16 codeunits:
if (cp < 0x10000u)
{
    *out++ = static_cast<uint16_t>(cp);
}
else
{
    *out++ = static_cast<uint16_t>(0xd800u + (((cp - 0x10000u) >> 10) & 0x3ffu));
    *out++ = static_cast<uint16_t>(0xdc00u + ((cp - 0x10000u) & 0x3ffu));
}
A single codepoint to UTF-8 codeunits:
if (cp < 0x80u)
{
    *out++ = static_cast<uint8_t>(cp);
}
else if (cp < 0x800u)
{
    *out++ = static_cast<uint8_t>(((cp >> 6) & 0x1fu) | 0xc0u);
    *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else if (cp < 0x10000u)
{
    *out++ = static_cast<uint8_t>(((cp >> 12) & 0x0fu) | 0xe0u);
    *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
    *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
else
{
    *out++ = static_cast<uint8_t>(((cp >> 18) & 0x07u) | 0xf0u);
    *out++ = static_cast<uint8_t>(((cp >> 12) & 0x3fu) | 0x80u);
    *out++ = static_cast<uint8_t>(((cp >> 6) & 0x3fu) | 0x80u);
    *out++ = static_cast<uint8_t>((cp & 0x3fu) | 0x80u);
}
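For completeness, here is the inverse operation, combining a surrogate pair back into a codepoint. This is a minimal sketch of my own (not from the original answer); it assumes hi and lo have already been validated as a high and a low surrogate:
#include <cstdint>

uint32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    // hi is in [0xD800, 0xDBFF], lo is in [0xDC00, 0xDFFF]
    return 0x10000u
         + ((static_cast<uint32_t>(hi) - 0xd800u) << 10)
         + (static_cast<uint32_t>(lo) - 0xdc00u);
}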

Related

How to isolate leftmost bytes in integer

This has to be done in Perl:
I have integers on the order of e.g. 30_146_890_129 and 17_181_116_691 and 21_478_705_663.
These are supposedly made up of 6 bytes, where:
bytes 0-1 : value a
bytes 2-3 : value b
bytes 4-5 : value c
I want to isolate what value a is. How can I do this in Perl?
I've tried using the >> operator:
perl -e '$a = 330971351478 >> 16; print "$a\n";'
5050222
perl -e '$a = 17181116691 >> 16; print "$a\n";'
262163
But these numbers are not on the order of what I am expecting, more like 0-1000.
Bonus if I can also get values b and c but I don't really need those.
Thanks!
number >> 16 returns number shifted right by 16 bits, not the bits that were shifted out, as you seem to assume. To get the last 16 bits you might for example use number % 2**16 or number & 0xffff. To get b and c you can shift first and then take the last 16 bits, i.e.
$a = $number & 0xffff;
$b = ($number >> 16) & 0xffff;
$c = ($number >> 32) & 0xffff;
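For example, 17_181_116_691 is 0x0004_0013_0913, so with $number set to that value the three lines above give $a = 0x0913 = 2323, $b = 0x0013 = 19 and $c = 0x0004 = 4.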
If you have the 6 bytes, you don't need to convert them to a number first. You can use one of the following, depending on the order of the bytes (uppercase represents the most significant byte):
my ($num_c, $num_b, $num_a) = unpack('nnn', "\xCC\xcc\xBB\xbb\xAA\xaa");
my ($num_a, $num_b, $num_c) = unpack('nnn', "\xAA\xaa\xBB\xbb\xCC\xcc");
my ($num_c, $num_b, $num_a) = unpack('vvv', "\xcc\xCC\xbb\xBB\xaa\xAA");
my ($num_a, $num_b, $num_c) = unpack('vvv', "\xaa\xAA\xbb\xBB\xcc\xCC");
If you are indeed provided with a number (0xCCccBBbbAAaa), you can convert it to bytes, then extract the numbers you want from it as follows:
my ($num_c, $num_b, $num_a) = unpack('xxnnn', pack('Q>', $num));
Alternatively, you could also use an arithmetic approach like you attempted.
my $num_a = $num & 0xFFFF;
my $num_b = ( $num >> 16 ) & 0xFFFF;
my $num_c = $num >> 32;
While the previous two solutions required a Perl built to use 64-bit integers, the following will work with any build of Perl:
my $num_a = $num % 2**16;
my $num_b = ( $num / 2**16 ) % 2**16;
my $num_c = int( $num / 2**32 );
Let's look at ( $num >> 16 ) & 0xFFFF in detail.
Original number: 0x0000CCccBBbbAAaa
After shifting: 0x00000000CCccBBbb
After masking: 0x000000000000BBbb

How to write a unicode symbol in lua

How can I write a Unicode symbol in Lua? For example, I need to write the symbol with code point 9658.
When I write
string.char( 9658 );
I get an error. So how is it possible to write such a symbol?
Lua does not look inside strings. So, you can just write
mychar = "►"
(added in 2015)
Lua 5.3 introduced support for UTF-8 escape sequences:
The UTF-8 encoding of a Unicode character can be inserted in a literal string with the escape sequence \u{XXX} (note the mandatory enclosing brackets), where XXX is a sequence of one or more hexadecimal digits representing the character code point.
You can also use utf8.char(9658).
Here is an encoder for Lua that takes a Unicode code point and produces a UTF-8 string for the corresponding character:
do
  local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
  function utf8(decimal)
    if decimal<128 then return string.char(decimal) end
    local charbytes = {}
    for bytes,vals in ipairs(bytemarkers) do
      if decimal<=vals[1] then
        for b=bytes+1,2,-1 do
          local mod = decimal%64
          decimal = (decimal-mod)/64
          charbytes[b] = string.char(128+mod)
        end
        charbytes[1] = string.char(vals[2]+decimal)
        break
      end
    end
    return table.concat(charbytes)
  end
end
c=utf8(0x24) print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
c=utf8(0xA2) print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
c=utf8(0x20AC) print(c.." is "..#c.." bytes.") --> € is 3 bytes.
c=utf8(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.
Maybe this can help you:
function FromUTF8(pos)
  local mod = math.mod
  local function charat(p)
    local v = editor.CharAt[p]; if v < 0 then v = v + 256 end; return v
  end
  local v, c, n = 0, charat(pos), 1
  if c < 128 then v = c
  elseif c < 192 then
    error("Byte values between 0x80 to 0xBF cannot start a multibyte sequence")
  elseif c < 224 then v = mod(c, 32); n = 2
  elseif c < 240 then v = mod(c, 16); n = 3
  elseif c < 248 then v = mod(c, 8); n = 4
  elseif c < 252 then v = mod(c, 4); n = 5
  elseif c < 254 then v = mod(c, 2); n = 6
  else
    error("Byte values between 0xFE and 0xFF cannot start a multibyte sequence")
  end
  for i = 2, n do
    pos = pos + 1; c = charat(pos)
    if c < 128 or c > 191 then
      error("Following bytes must have values between 0x80 and 0xBF")
    end
    v = v * 64 + mod(c, 64)
  end
  return v, pos, n
end
To get broader support for Unicode string content, one approach is slnunicode which was developed as part of the Selene database library. It will give you a module that supports most of what the standard string library does, but with Unicode characters and UTF-8 encoding.

Convert 16bit colour to 32bit

I've got an 16bit bitmap image with each colour represented as a single short (2 bytes), I need to display this in a 32bit bitmap context. How can I convert a 2 byte colour to a 4 byte colour in C++?
The input format contains each colour in a single short (2 bytes).
The output format is 32bit RGB. This means each pixel has 3 bytes I believe?
I need to convert the short value into RGB colours.
Excuse my lack of knowledge of colours, this is my first adventure into the world of graphics programming.
Normally a 16-bit pixel is 5 bits of red, 6 bits of green, and 5 bits of blue data. The minimum-error solution (that is, for which the output color is guaranteed to be as close a match to the input colour) is:
red8bit = (red5bit << 3) | (red5bit >> 2);
green8bit = (green6bit << 2) | (green6bit >> 4);
blue8bit = (blue5bit << 3) | (blue5bit >> 2);
To see why this solution works, let's look at a red pixel. Our 5-bit red is some fraction fivebit/31. We want to translate that into a new fraction eightbit/255. Some simple arithmetic:
fivebit   eightbit
------- = --------
  31        255
Yields:
eightbit = fivebit * 8.226
Or closely (note the squiggly ≈):
eightbit ≈ (fivebit * 8) + (fivebit * 0.25)
That operation is a multiply by 8 and a divide by 4. Owch - both operations that might take forever on your hardware. Lucky thing they're both powers of two and can be converted to shift operations:
eightbit = (fivebit << 3) | (fivebit >> 2);
The same steps work for green, which has six bits per pixel, but you get an accordingly different answer, of course! The quick way to remember the solution is that you're taking the top bits off of the "short" pixel and adding them on at the bottom to make the "long" pixel. This method works equally well for any data set you need to map up into a higher resolution space. A couple of quick examples:
five bit space    eight bit space     error
     00000            00000000          0%
     11111            11111111          0%
     10101            10101010        0.02%
     00111            00111001       -1.01%
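For reference, here is the whole 16-bit-to-32-bit conversion using that bit-replication trick. This is a sketch of my own, assuming the usual rrrrrggggggbbbbb (RGB565) layout and a 0RGB output word:
#include <cstdint>

uint32_t rgb565_to_rgb888(uint16_t p)
{
    // split the 16-bit pixel into its 5/6/5-bit channels
    uint32_t r5 = (p >> 11) & 0x1f;
    uint32_t g6 = (p >> 5) & 0x3f;
    uint32_t b5 = p & 0x1f;

    // replicate the top bits into the low bits: (x << n) | (x >> m)
    uint32_t r8 = (r5 << 3) | (r5 >> 2);
    uint32_t g8 = (g6 << 2) | (g6 >> 4);
    uint32_t b8 = (b5 << 3) | (b5 >> 2);

    return (r8 << 16) | (g8 << 8) | b8; // 0RGB
}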
Common formats include BGR0, RGB0, 0RGB, and 0BGR. In the code below I have assumed 0RGB. Changing this is easy: just modify the shift amounts in the last line.
unsigned long rgb16_to_rgb32(unsigned short a)
{
    /* 1. Extract the red, green and blue values */
    /* from rrrr rggg gggb bbbb */
    unsigned long r = (a & 0xF800) >> 11;
    unsigned long g = (a & 0x07E0) >> 5;
    unsigned long b = (a & 0x001F);

    /* 2. Convert them to 0-255 range:
       There is more than one way. You can just shift them left:
       to 00000000 rrrrr000 gggggg00 bbbbb000
           r <<= 3;
           g <<= 2;
           b <<= 3;
       But that means your image will be slightly dark and
       off-colour as white 0xFFFF will convert to F8,FC,F8
       So instead you can scale by multiply and divide: */
    r = r * 255 / 31;
    g = g * 255 / 63;
    b = b * 255 / 31;
    /* This ensures 31/31 converts to 255/255 */

    /* 3. Construct your 32-bit format (this is 0RGB): */
    return (r << 16) | (g << 8) | b;

    /* Or for BGR0:
       return (r << 8) | (g << 16) | (b << 24);
    */
}
Multiply the three values (four, when you have an alpha channel) by 16 - that's it :)
You have a 16-bit color made of four 4-bit channels and want a 32-bit color made of four 8-bit channels. You are adding four bits per channel, and they should go on the right-hand side of each value, so shift each channel left by four bits (multiply by 16). Additionally, you could compensate a bit for the lost precision by adding 8 (the four added bits can hold 0-15, so 8 is the average).
Update: this only applies to colors that use 4 bits for each channel and have an alpha channel.
There are some questions about the color model, like: is it HSV or RGB?
If you wanna ready, fire, aim, I'd try this first.
#include <stdint.h>

uint32_t convert(uint16_t _pixel)
{
    uint32_t pixel;
    pixel = (uint32_t)_pixel;
    return ((pixel & 0xF000) << 16)
         | ((pixel & 0x0F00) << 12)
         | ((pixel & 0x00F0) << 8)
         | ((pixel & 0x000F) << 4);
}
This maps 0xRGBA -> 0xRRGGBBAA, or possibly 0xHSVA -> 0xHHSSVVAA, but it won't do 0xHSVA -> 0xRRGGBBAA.
I'm here long after the fight, but I actually had the same problem, with ARGB color instead, and none of the answers is quite right. Keep in mind that this answer addresses a slightly different situation, where we want to do this conversion:
AAAARRRRGGGGBBBB -> AAAAAAAARRRRRRRRGGGGGGGGBBBBBBBB
If you want to keep the same ratio for your color, you simply have to do a cross-multiplication: you want to convert a value x between 0 and 15 to a value between 0 and 255, therefore you want y = 255 * x / 15.
However, 255 = 15 * 17, and 17 is 16 + 1, so you now have y = 16 * x + x,
which is the same as doing a four-bit shift to the left and then adding the value again (or, more visually, duplicating the nibble: 0b1101 becomes 0b11011101).
Now that you have this, you can compute your whole number by doing:
a = v & 0b1111000000000000
r = v & 0b111100000000
g = v & 0b11110000
b = v & 0b1111
return b | b << 4 | g << 4 | g << 8 | r << 8 | r << 12 | a << 12 | a << 16
Moreover, as the lower bits won't have much effect on the final color, and if exactness isn't necessary, you can gain some performance by simply multiplying each component by 16:
return b << 4 | g << 8 | r << 12 | a << 16
(All the shift amounts look odd because we did not bother shifting the components down to bit 0 first.)

preventing overlong forms when parsing UTF-8

I have been working on another UTF-8 parser as a personal exercise, and while my implementation works quite well, and it rejects most malformed sequences (replacing them with U+FFFD), I can't seem to figure out how to implement rejection of overlong forms. Could anyone tell me how to do so?
Pseudocode:
let w = 0, // the number of continuation bytes pending
    c = 0, // the currently being constructed codepoint
    b,     // the current byte from the source stream
    valid(c) = (
        (c < 0x110000) &&
        ((c & 0xFFFFF800) != 0xD800) &&
        ((c < 0xFDD0) || (c > 0xFDEF)) &&
        ((c & 0xFFFE) != 0xFFFE))

for each b:
    if b < 0x80:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
            w = 0
        append U+b to output string
    else if b < 0xc0:
        if w == 0: // unwanted continuation byte
            append U+FFFD to output string
        else:
            c |= (b & 0x3f) << (--w * 6)
            if w == 0: // done
                if valid(c):
                    append U+c to output string
    else if b < 0xfe:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
        w = (b < 0xe0) ? 1 :
            (b < 0xf0) ? 2 :
            (b < 0xf8) ? 3 :
            (b < 0xfc) ? 4 : 5;
        c = (b & ((1 << (6 - w)) - 1)) << (w * 6); // ugly monstrosity
    else:
        append U+FFFD to output string

if w > 0: // end of stream and we're still waiting for continuation bytes
    append U+FFFD to output string
If you save the number of bytes you'll need (so you save a second copy of the initial value of w), you can compare the UTF32 value of the codepoint (I think you are calling it c) with the number of bytes that were used to encode it. You know that:
U+0000    - U+007F        1 byte
U+0080    - U+07FF        2 bytes
U+0800    - U+FFFF        3 bytes
U+10000   - U+1FFFFF      4 bytes
U+200000  - U+3FFFFFF     5 bytes
U+4000000 - U+7FFFFFFF    6 bytes
(and I hope I have done the right math on the left column! Hex math isn't my strong point :-) )
Just as a sidenote: I think there are some logic/formatting errors. In the if b < 0x80 branch you only handle if w > 0; what happens if w = 0 (so, for example, if you are decoding A)? And shouldn't you reset c when you find an illegal codepoint?
Once you have the decoded character, you can tell how many bytes it should have had if properly encoded just by looking at the highest bit set.
If the highest set bit's position is <= 7, the UTF-8 encoding requires 1 octet.
If the highest set bit's position is <= 11, the UTF-8 encoding requires 2 octets.
If the highest set bit's position is <= 16, the UTF-8 encoding requires 3 octets.
etc.
If you save the original w and compare it to these values, you'll be able to tell if the encoding was proper or overlong.
I had initially thought that if at any point in time after decoding a byte, w > 0 && c == 0, you have an overlong form. However, it's more complicated than that as Jan pointed out. The simplest answer is probably to have a table like xanatos has, only rejecting anything longer than 4 bytes:
if c < 0x80 && len > 1 ||
   c < 0x800 && len > 2 ||
   c < 0x10000 && len > 3 ||
   len > 4:
    append U+FFFD to output string
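If concrete code helps, here is a minimal C++ sketch of that check; the helper names are mine, not from the answers above, and it assumes modern UTF-8 where nothing legal is longer than 4 bytes:
#include <cstdint>

// smallest number of UTF-8 bytes that can encode this codepoint
int min_utf8_length(uint32_t cp)
{
    if (cp < 0x80u) return 1;
    if (cp < 0x800u) return 2;
    if (cp < 0x10000u) return 3;
    return 4;
}

// overlong (or simply too long to be legal) if more bytes were used
// than the minimum required for the decoded codepoint
bool is_overlong(uint32_t cp, int bytes_used)
{
    return bytes_used > 4 || bytes_used > min_utf8_length(cp);
}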

ObjC getting 64 bit number from stream

I am successfully passing a 64-bit number from an objC client to a Java client, but am unable to send one to an objC client.
Java Code
/*
 * Retrieve a double (64-bit) number from the stream.
 */
private double getDouble() throws IOException
{
    byte[] buffer = getBytes(8);
    long bits =
        ((long)buffer[0] & 0x0ff) |
        (((long)buffer[1] & 0x0ff) << 8) |
        (((long)buffer[2] & 0x0ff) << 16) |
        (((long)buffer[3] & 0x0ff) << 24) |
        (((long)buffer[4] & 0x0ff) << 32) |
        (((long)buffer[5] & 0x0ff) << 40) |
        (((long)buffer[6] & 0x0ff) << 48) |
        (((long)buffer[7] & 0x0ff) << 56);
    return Double.longBitsToDouble(bits);
}
objC code
/*
 * Retrieve a double (64-bit) number from the stream.
 */
- (double)getDouble
{
    NSRange dblRange = NSMakeRange(0, 8);
    char buffer[8];
    [stream getBytes:buffer length:8];
    [stream replaceBytesInRange:dblRange withBytes:NULL length:0];

    long long bits =
        ((long long)buffer[0] & 0x0ff) |
        (((long long)buffer[1] & 0x0ff) << 8) |
        (((long long)buffer[2] & 0x0ff) << 16) |
        (((long long)buffer[3] & 0x0ff) << 24) |
        (((long long)buffer[4] & 0x0ff) << 32) |
        (((long long)buffer[5] & 0x0ff) << 40) |
        (((long long)buffer[6] & 0x0ff) << 48) |
        (((long long)buffer[7] & 0x0ff) << 56);

    NSNumber *tempNum = [NSNumber numberWithLongLong:bits];
    NSLog(@"\n***********\nsizeof long long %d \n tempNum: %@\nbits %lld", sizeof(long long), tempNum, bits);
    return [tempNum doubleValue];
}
the result of NSLog is
sizeof long long 8
tempNum: -4616134021117358511
bits -4616134021117358511
the number should be : -1.012345
The problem is that I was converting the Java getDouble function to objC too literally. My middleware already takes the endian issues into account. The simple solution, if the target is little endian:
- (double)getDouble
{
    NSRange dblRange = NSMakeRange(0, 8);
    double swapped;
    [stream getBytes:&swapped length:8];
    [stream replaceBytesInRange:dblRange withBytes:NULL length:0];
    return swapped;
}
Thanks all for input - got a lot of experience and a little understanding from this exercise.
A double and a long long are not the same thing. A long long represents an integer, which has no fractional portion, and a double represents a floating-point number, which has a fractional portion. These two types have completely different ways of representing their values in memory. That is to say, if you were to look at the bits for a long long representing the number 4000 and compare those to the bits for a double representing the number 4000, they would be different.
So as Wevah notes, the first step is for you to use the proper double type, and the correct %f formatter in your call to NSLog().
I would add, though, that you also need to be careful to get your bytes in the native order for the machine your C code is running on. For a detailed description of what I'm referring to, see http://en.wikipedia.org/wiki/Endianness The short version is that different processors may represent numbers in different ways in memory, and you need to ensure in your code that once you get a pile of bytes from the network, you are putting the bytes in the right order for your processor before you attempt to interpret it as a number.
Luckily, this is a solved issue, and is easily accounted for by using the CFConvertFloat64SwappedToHost() function from CoreFoundation:
[stream getBytes:buffer length:8];
[stream replaceBytesInRange:dblRange withBytes:NULL length:0];
double myDouble = CFConvertFloat64SwappedToHost(*((CFSwappedFloat64 *)buffer));
NSNumber *tempNum = [NSNumber numberWithDouble:myDouble];
NSLog(@"\n***********\nsizeof double %d \n tempNum: %@\nbits %f", sizeof(double), tempNum, myDouble);
return [tempNum doubleValue];
You probably want to convert it to a double (possibly/probably via a union; see Jonathan's comment) and use the %f specifier.
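If you go the union/memcpy route instead of CFConvertFloat64SwappedToHost, a minimal C-style sketch might look like this. It is an illustration only (the function name is mine) and assumes the sender writes the eight bytes least-significant-byte first, as the Java code above does:
#include <stdint.h>
#include <string.h>

static double doubleFromLittleEndianBytes(const uint8_t bytes[8])
{
    // assemble the 64-bit pattern, least significant byte first
    uint64_t bits = 0;
    for (int i = 0; i < 8; i++) {
        bits |= (uint64_t)bytes[i] << (8 * i);
    }
    // reinterpret the bits as a double without violating aliasing rules
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}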