Preventing overlong forms when parsing UTF-8

I have been working on another UTF-8 parser as a personal exercise, and while my implementation works quite well, and it rejects most malformed sequences (replacing them with U+FFFD), I can't seem to figure out how to implement rejection of overlong forms. Could anyone tell me how to do so?
Pseudocode:
let w = 0, // the number of continuation bytes pending
    c = 0, // the currently being constructed codepoint
    b,     // the current byte from the source stream
    valid(c) = (
        (c < 0x110000) &&
        ((c & 0xFFFFF800) != 0xD800) &&
        ((c < 0xFDD0) || (c > 0xFDEF)) &&
        ((c & 0xFFFE) != 0xFFFE))
for each b:
    if b < 0x80:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
            w = 0
        append U+b to output string
    else if b < 0xc0:
        if w == 0: // unwanted continuation byte
            append U+FFFD to output string
        else:
            c |= (b & 0x3f) << (--w * 6)
            if w == 0: // done
                if valid(c):
                    append U+c to output string
    else if b < 0xfe:
        if w > 0: // premature ending to multi-byte sequence
            append U+FFFD to output string
        w = (b < 0xe0) ? 1 :
            (b < 0xf0) ? 2 :
            (b < 0xf8) ? 3 :
            (b < 0xfc) ? 4 : 5;
        c = (b & ((1 << (6 - w)) - 1)) << (w * 6); // ugly monstrosity
    else:
        append U+FFFD to output string
if w > 0: // end of stream and we're still waiting for continuation bytes
    append U+FFFD to output string

If you save the number of bytes you'll need (so you save a second copy of the initial value of w), you can compare the UTF32 value of the codepoint (I think you are calling it c) with the number of bytes that were used to encode it. You know that:
U+0000    - U+007F      1 byte
U+0080    - U+07FF      2 bytes
U+0800    - U+FFFF      3 bytes
U+10000   - U+1FFFFF    4 bytes
U+200000  - U+3FFFFFF   5 bytes
U+4000000 - U+7FFFFFFF  6 bytes
(and I hope I have done the right math on the left column! Hex math isn't my strong point :-) )
Just as a side note: I think there are some logic or formatting errors. In the b < 0x80 branch, under if w > 0, what happens if w = 0 (for example, if you are decoding A)? And shouldn't you reset c when you find an illegal codepoint?

Once you have the decoded character, you can tell how many bytes it should have had if properly encoded just by looking at the highest bit set.
If the highest set bit's position is <= 7, the UTF-8 encoding requires 1 octet.
If the highest set bit's position is <= 11, the UTF-8 encoding requires 2 octets.
If the highest set bit's position is <= 16, the UTF-8 encoding requires 3 octets.
If the highest set bit's position is <= 21, the UTF-8 encoding requires 4 octets, and so on.
If you save the original w and compare it to these values, you'll be able to tell if the encoding was proper or overlong.
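The highest-set-bit rule above can be sketched in Python as follows. This is only an illustration: the names min_utf8_length and is_overlong are made up for this example, and it assumes sequences longer than 4 bytes are rejected elsewhere.

```python
def min_utf8_length(code_point: int) -> int:
    """Return the shortest UTF-8 length (in bytes) that can encode code_point."""
    bits = code_point.bit_length()  # 1-based position of the highest set bit
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    if bits <= 21:
        return 4
    raise ValueError("code point does not fit in a 4-byte UTF-8 sequence")


def is_overlong(code_point: int, bytes_used: int) -> bool:
    """True if the sequence used more bytes than the code point needs."""
    return bytes_used > min_utf8_length(code_point)


# 0xC0 0xAF decodes to U+002F ("/") but uses 2 bytes, so it is overlong.
print(is_overlong(0x2F, 2))    # True
print(is_overlong(0x20AC, 3))  # False
```

Here bytes_used plays the role of the saved original w plus one (the lead byte).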

I had initially thought that if at any point in time after decoding a byte, w > 0 && c == 0, you have an overlong form. However, it's more complicated than that as Jan pointed out. The simplest answer is probably to have a table like xanatos has, only rejecting anything longer than 4 bytes:
if c < 0x80 && len > 1 ||
   c < 0x800 && len > 2 ||
   c < 0x10000 && len > 3 ||
   len > 4:
    append U+FFFD to output string
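Putting the table check into a complete decoder might look roughly like the Python sketch below. It is not the asker's exact algorithm: decode_utf8 is a hypothetical name, lead bytes 0xF8 and above are simply replaced (which rejects 5- and 6-byte forms), surrogates are rejected, and the noncharacter checks from the question's valid() are left out for brevity.

```python
REPLACEMENT = "\ufffd"


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, replacing malformed and overlong sequences with U+FFFD."""
    out = []
    pending = 0  # continuation bytes still expected
    length = 0   # total length of the sequence being decoded
    cp = 0       # code point under construction
    for b in data:
        if pending and not 0x80 <= b <= 0xBF:
            out.append(REPLACEMENT)  # sequence was cut short
            pending = 0
        if b < 0x80:
            out.append(chr(b))
        elif b < 0xC0:
            if pending == 0:
                out.append(REPLACEMENT)  # stray continuation byte
                continue
            cp = (cp << 6) | (b & 0x3F)
            pending -= 1
            if pending == 0:
                overlong = ((cp < 0x80 and length > 1) or
                            (cp < 0x800 and length > 2) or
                            (cp < 0x10000 and length > 3))
                bad = overlong or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF
                out.append(REPLACEMENT if bad else chr(cp))
        elif b < 0xF8:
            length = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
            pending = length - 1
            cp = b & (0x7F >> length)  # keep only the payload bits of the lead byte
        else:
            out.append(REPLACEMENT)  # 0xF8..0xFF never start a valid sequence
    if pending:
        out.append(REPLACEMENT)  # input ended mid-sequence
    return "".join(out)
```

With this sketch, decode_utf8(b"\xc0\xaf") yields a single U+FFFD instead of "/", while well-formed input passes through unchanged.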


Understanding CRC32 value as division remainder

I'm struggling with understanding the CRC algorithm. I've been reading this tutorial and, if I got it correctly, a CRC value is just the remainder of a division where the message serves as the dividend and the divisor is a predefined value, carried out in a special kind of polynomial arithmetic. It looked quite simple, so I tried implementing CRC-32:
public static uint Crc32Naive(byte[] bytes)
{
    uint poly = 0x04c11db7; // (Poly)
    uint crc = 0xffffffff; // (Init)
    foreach (var it in bytes)
    {
        var b = (uint)it;
        for (var i = 0; i < 8; ++i)
        {
            var prevcrc = crc;
            // load LSB from current byte into LSB of crc (RefIn)
            crc = (crc << 1) | (b & 1);
            b >>= 1;
            // subtract polynomial if we've just popped out 1
            if ((prevcrc & 0x80000000) != 0)
                crc ^= poly;
        }
    }
    return Reverse(crc ^ 0xffffffff); // (XorOut) (RefOut)
}
(where the Reverse function reverses bit order)
It is supposed to be analogous to following method of division (with some additional adjustments):
            1100001010
       _______________
10011 ) 11010110110000
        10011,,.,,....
        -----,,.,,....
         10011,.,,....
         10011,.,,....
         -----,.,,....
          00001.,,....
          00000.,,....
          -----.,,....
           00010,,....
           00000,,....
           -----,,....
            00101,....
            00000,....
            -----,....
             01011....
             00000....
             -----....
              10110...
              10011...
              -----...
               01010..
               00000..
               -----..
                10100.
                10011.
                -----.
                 01110
                 00000
                 -----
                  1110 = Remainder
For 0x00 the function returns 0xd202ef8d, which is correct, but for 0x01 it returns 0xd302ef8d instead of 0xa505df1b (I've been using this page to verify my results).
The result from my implementation makes more sense to me: incrementing the dividend by 1 should only change the remainder by 1, right? But it turns out that the result should look completely different. So apparently I am missing something obvious. What is it? How can changing the least significant digit of a dividend influence the result this much?
This is an example of a left shifting CRC that emulates division, with the CRC initialized = 0, and no complementing or reversing of the crc. The example code is emulating a division where 4 bytes of zeroes are appended to bytes[] ({bytes[],0,0,0,0} is the dividend, the divisor is 0x104c11db7, the quotient is not used, and the remainder is the CRC).
public static uint Crc32Naive(byte[] bytes)
{
    uint poly = 0x04c11db7; // (Poly is actually 0x104c11db7)
    uint crc = 0; // (Init)
    foreach (var it in bytes)
    {
        crc ^= ((uint)it) << 24; // xor next byte to upper 8 bits of crc
        for (var i = 0; i < 8; ++i) // cycle the crc 8 times
        {
            var prevcrc = crc;
            crc = (crc << 1);
            // subtract polynomial if we've just popped out 1
            if ((prevcrc & 0x80000000) != 0)
                crc ^= poly;
        }
    }
    return crc;
}
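To check the claim that this loop emulates a division, here is a small Python sketch that ports the function above and compares it with a literal XOR long division of the message followed by 32 zero bits. The function names are made up for this illustration.

```python
POLY_33_BIT = 0x104C11DB7  # full divisor, including the implicit top bit


def crc_left_shift(data: bytes) -> int:
    """Port of the left-shifting loop: init = 0, no reflection, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            top_bit_set = crc & 0x80000000
            crc = (crc << 1) & 0xFFFFFFFF
            if top_bit_set:
                crc ^= POLY_33_BIT & 0xFFFFFFFF
    return crc


def remainder_by_long_division(data: bytes) -> int:
    """Divide (message followed by 32 zero bits) by the polynomial using XOR."""
    dividend = int.from_bytes(data, "big") << 32
    for shift in range(dividend.bit_length() - 33, -1, -1):
        # Subtract (XOR) the divisor wherever the current leading bit is 1.
        if dividend & (1 << (shift + 32)):
            dividend ^= POLY_33_BIT << shift
    return dividend


for message in (b"\x01", b"abc", b"123456789"):
    assert crc_left_shift(message) == remainder_by_long_division(message)

print(hex(crc_left_shift(b"\x01")))  # 0x4c11db7
```

For the single byte 0x01 the dividend is 1 followed by 32 zero bits, so one subtraction of the divisor leaves 0x04c11db7 as the remainder.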
It's common to initialize the CRC to something other than zero, but it's not that common to post-complement the CRC, and I'm not aware of any CRC that does a post bit reversal of the CRC.
Another variation of CRC is one that right-shifts, normally used to emulate hardware where data is sent in bytes, least significant bit first.
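For comparison, a right-shifting version with the usual CRC-32 parameters (initial value 0xffffffff, bit-reversed polynomial 0xedb88320, final complement) can be sketched in Python like this. The name crc32_right_shift is made up here; zlib.crc32 from the standard library is used only as a cross-check.

```python
import zlib

REVERSED_POLY = 0xEDB88320  # 0x04c11db7 with its 32 bits reversed


def crc32_right_shift(data: bytes) -> int:
    """Bitwise CRC-32 that consumes each byte least significant bit first."""
    crc = 0xFFFFFFFF                 # initial value
    for byte in data:
        crc ^= byte                  # xor next byte into the low 8 bits
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ REVERSED_POLY
            else:
                crc >>= 1
    return crc ^ 0xFFFFFFFF          # final complement


print(hex(crc32_right_shift(b"\x00")))  # 0xd202ef8d
print(hex(crc32_right_shift(b"\x01")))  # 0xa505df1b
assert crc32_right_shift(b"123456789") == zlib.crc32(b"123456789")
```

Note that each input byte is XORed into the register before the shifting starts, rather than being shifted in one bit at a time behind it.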

Does PureScript support “format strings” like C / Java etc.?

I need to output a number with leading zeros and as six digits. In C or Java I would use "%06d" as a format string to do this. Does PureScript support format strings? Or how would I achieve this?
I don't know of any module that would support a printf-style functionality in PureScript. It would be very nice to have a type-safe way to format numbers.
In the meantime, I would write something like this:
import Data.String (length, fromCharArray)
import Data.Array (replicate)
-- | Pad a string with the given character up to a maximum length.
padLeft :: Char -> Int -> String -> String
padLeft c len str = prefix <> str
  where prefix = fromCharArray (replicate (len - length str) c)
-- | Pad a number with leading zeros up to the given length.
padZeros :: Int -> Int -> String
padZeros len num | num >= 0  = padLeft '0' len (show num)
                 | otherwise = "-" <> padLeft '0' len (show (-num))
Which produces the following results:
> padZeros 6 8
"000008"
> padZeros 6 678
"000678"
> padZeros 6 345678
"345678"
> padZeros 6 12345678
"12345678"
> padZeros 6 (-678)
"-000678"
Edit: In the meantime, I've written a small module that can format numbers in this way:
https://github.com/sharkdp/purescript-format
For your particular example, you would need to do the following:
If you want to format Integers:
> format (width 6 <> zeroFill) 123
"000123"
If you want to format Numbers
> format (width 6 <> zeroFill <> precision 1) 12.345
"0012.3"

How to write a unicode symbol in lua

How can I write a Unicode symbol in Lua? For example, I need to write the symbol with code point 9658.
When I write
string.char( 9658 );
I get an error. So how is it possible to write such a symbol?
Lua does not look inside strings. So, you can just write
mychar = "►"
(added in 2015)
Lua 5.3 introduced support for UTF-8 escape sequences:
The UTF-8 encoding of a Unicode character can be inserted in a literal string with the escape sequence \u{XXX} (note the mandatory enclosing brackets), where XXX is a sequence of one or more hexadecimal digits representing the character code point.
You can also use utf8.char(9658).
Here is an encoder for Lua that takes a Unicode code point and produces a UTF-8 string for the corresponding character:
do
  local bytemarkers = { {0x7FF,192}, {0xFFFF,224}, {0x1FFFFF,240} }
  function utf8(decimal)
    if decimal<128 then return string.char(decimal) end
    local charbytes = {}
    for bytes,vals in ipairs(bytemarkers) do
      if decimal<=vals[1] then
        for b=bytes+1,2,-1 do
          local mod = decimal%64
          decimal = (decimal-mod)/64
          charbytes[b] = string.char(128+mod)
        end
        charbytes[1] = string.char(vals[2]+decimal)
        break
      end
    end
    return table.concat(charbytes)
  end
end
c=utf8(0x24) print(c.." is "..#c.." bytes.") --> $ is 1 bytes.
c=utf8(0xA2) print(c.." is "..#c.." bytes.") --> ¢ is 2 bytes.
c=utf8(0x20AC) print(c.." is "..#c.." bytes.") --> € is 3 bytes.
c=utf8(0x24B62) print(c.." is "..#c.." bytes.") --> 𤭢 is 4 bytes.
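For readers who want to check the byte layout outside Lua, the same encoding steps can be sketched in Python and compared against the built-in encoder. The name encode_utf8 is made up for this illustration, and surrogate code points are not treated specially.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand."""
    if code_point < 0x80:
        return bytes([code_point])
    if code_point < 0x800:
        lead, continuation_count = 0xC0, 1
    elif code_point < 0x10000:
        lead, continuation_count = 0xE0, 2
    elif code_point < 0x110000:
        lead, continuation_count = 0xF0, 3
    else:
        raise ValueError("code point out of Unicode range")
    tail = []
    for _ in range(continuation_count):
        tail.append(0x80 | (code_point & 0x3F))  # low 6 bits go into 10xxxxxx
        code_point >>= 6
    # The remaining high bits fit into the lead byte.
    return bytes([lead | code_point] + tail[::-1])


print(encode_utf8(9658))  # b'\xe2\x96\xba'
assert encode_utf8(9658) == "►".encode("utf-8")
```

Code point 9658 is U+25BA, which needs three bytes: E2 96 BA.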
Maybe this can help you:
function FromUTF8(pos)
  local mod = math.mod
  local function charat(p)
    local v = editor.CharAt[p]; if v < 0 then v = v + 256 end; return v
  end
  local v, c, n = 0, charat(pos), 1
  if c < 128 then v = c
  elseif c < 192 then
    error("Byte values between 0x80 to 0xBF cannot start a multibyte sequence")
  elseif c < 224 then v = mod(c, 32); n = 2
  elseif c < 240 then v = mod(c, 16); n = 3
  elseif c < 248 then v = mod(c, 8); n = 4
  elseif c < 252 then v = mod(c, 4); n = 5
  elseif c < 254 then v = mod(c, 2); n = 6
  else
    error("Byte values between 0xFE and 0xFF cannot start a multibyte sequence")
  end
  for i = 2, n do
    pos = pos + 1; c = charat(pos)
    if c < 128 or c > 191 then
      error("Following bytes must have values between 0x80 and 0xBF")
    end
    v = v * 64 + mod(c, 64)
  end
  return v, pos, n
end
To get broader support for Unicode string content, one approach is slnunicode which was developed as part of the Selene database library. It will give you a module that supports most of what the standard string library does, but with Unicode characters and UTF-8 encoding.

Convert 16bit colour to 32bit

I've got an 16bit bitmap image with each colour represented as a single short (2 bytes), I need to display this in a 32bit bitmap context. How can I convert a 2 byte colour to a 4 byte colour in C++?
The input format contains each colour in a single short (2 bytes).
The output format is 32bit RGB. This means each pixel has 3 bytes I believe?
I need to convert the short value into RGB colours.
Excuse my lack of knowledge of colours, this is my first adventure into the world of graphics programming.
Normally a 16-bit pixel is 5 bits of red, 6 bits of green, and 5 bits of blue data. The minimum-error solution (that is, the one for which the output colour is guaranteed to be as close a match as possible to the input colour) is:
red8bit = (red5bit << 3) | (red5bit >> 2);
green8bit = (green6bit << 2) | (green6bit >> 4);
blue8bit = (blue5bit << 3) | (blue5bit >> 2);
To see why this solution works, let's look at a red pixel. Our 5-bit red is some fraction fivebit/31. We want to translate that into a new fraction eightbit/255. Some simple arithmetic:
fivebit   eightbit
------- = --------
  31        255
Yields:
eightbit = fivebit * 8.226
Or closely (note the squiggly ≈):
eightbit ≈ (fivebit * 8) + (fivebit * 0.25)
That operation is a multiply by 8 and a divide by 4. Owch - both operations that might take forever on your hardware. Lucky thing they're both powers of two and can be converted to shift operations:
eightbit = (fivebit << 3) | (fivebit >> 2);
The same steps work for green, which has six bits per pixel, but you get an accordingly different answer, of course! The quick way to remember the solution is that you're taking the top bits off of the "short" pixel and adding them on at the bottom to make the "long" pixel. This method works equally well for any data set you need to map up into a higher resolution space. A couple of quick examples:
five bit space    eight bit space    error
00000             00000000            0%
11111             11111111            0%
10101             10101101           +0.15%
00111             00111001           -1.01%
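The bit-replication trick is easy to try out. Below is a small Python sketch (the helper names expand_bits and rgb565_to_rgb888 are invented for this example, and the usual 5-6-5 layout rrrrrggg gggbbbbb is assumed):

```python
def expand_bits(value: int, src_bits: int) -> int:
    """Widen a src_bits-wide channel to 8 bits by repeating its top bits."""
    shift_up = 8 - src_bits
    return (value << shift_up) | (value >> (src_bits - shift_up))


def rgb565_to_rgb888(pixel: int) -> tuple:
    """Split a 16-bit 5-6-5 pixel and widen each channel to 8 bits."""
    red5 = (pixel >> 11) & 0x1F
    green6 = (pixel >> 5) & 0x3F
    blue5 = pixel & 0x1F
    return expand_bits(red5, 5), expand_bits(green6, 6), expand_bits(blue5, 5)


print(format(expand_bits(0b10101, 5), "08b"))  # 10101101
print(format(expand_bits(0b00111, 5), "08b"))  # 00111001
print(rgb565_to_rgb888(0xFFFF))                # (255, 255, 255)
```

Full-scale input maps to exactly 255 on every channel, which a plain left shift would not achieve.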
Common formats include BGR0, RGB0, 0RGB, 0BGR. In the code below I have assumed 0RGB. Changing this is easy, just modify the shift amounts in the last line.
unsigned long rgb16_to_rgb32(unsigned short a)
{
    /* 1. Extract the red, green and blue values */
    /* from rrrr rggg gggb bbbb */
    unsigned long r = (a & 0xF800) >> 11;
    unsigned long g = (a & 0x07E0) >> 5;
    unsigned long b = (a & 0x001F);
    /* 2. Convert them to 0-255 range:
       There is more than one way. You can just shift them left:
       to 00000000 rrrrr000 gggggg00 bbbbb000
           r <<= 3;
           g <<= 2;
           b <<= 3;
       But that means your image will be slightly dark and
       off-colour as white 0xFFFF will convert to F8,FC,F8
       So instead you can scale by multiply and divide: */
    r = r * 255 / 31;
    g = g * 255 / 63;
    b = b * 255 / 31;
    /* This ensures 31/31 converts to 255/255 */
    /* 3. Construct your 32-bit format (this is 0RGB): */
    return (r << 16) | (g << 8) | b;
    /* Or for BGR0:
    return (r << 8) | (g << 16) | (b << 24);
    */
}
Multiply the three (four, when you have an alpha layer) values by 16 - that's it :)
You have a 16-bit color and want to make it a 32-bit color. This gives you four times four bits, which you want to convert to four times eight bits. You're adding four bits, but you should add them to the right side of the values. To do this, shift them by four bits (multiply by 16). Additionally you could compensate a bit for inaccuracy by adding 8 (you're adding 4 bits, which has the value of 0-15, and you can take the average of 8 to compensate)
Update: This only applies to colors that use 4 bits for each channel and have an alpha channel.
There are some questions about the colour model: is it HSV or RGB?
If you wanna ready, fire, aim I'd try this first.
#include <stdint.h>
uint32_t convert(uint16_t _pixel)
{
    uint32_t pixel;
    pixel = (uint32_t)_pixel;
    return ((pixel & 0xF000) << 16)
         | ((pixel & 0x0F00) << 12)
         | ((pixel & 0x00F0) << 8)
         | ((pixel & 0x000F) << 4);
}
This maps 0xRGBA -> 0xR0G0B0A0 (each 4-bit channel lands in the high nibble of its byte in the 0xRRGGBBAA layout), or possibly 0xHSVA -> 0xH0S0V0A0, but it won't do 0xHSVA -> 0xRRGGBBAA.
I'm here long after the fight, but I actually had the same problem with ARGB color instead, and none of the answers are truly right: Keep in mind that this answer gives a response for a slightly different situation where we want to do this conversion:
AAAARRRRGGGGBBBB >>= AAAAAAAARRRRRRRRGGGGGGGGBBBBBBBB
If you want to keep the same ratio of your color, you simply have to do a cross-multiplication: You want to convert a value x between 0 and 15 to a value between 0 and 255: therefore you want: y = 255 * x / 15.
However, 255 = 15 * 17, and 17 is itself 16 + 1, so you now have y = 16 * x + x.
This is actually the same as doing a four-bit shift to the left and then adding the value again (or more visually, duplicating the value: 0b1101 becomes 0b11011101).
Now that you have this, you can compute your whole number by doing:
a = v & 0b1111000000000000
r = v & 0b111100000000
g = v & 0b11110000
b = v & 0b1111
return b | b << 4 | g << 4 | g << 8 | r << 8 | r << 12 | a << 12 | a << 16
Moreover, as the lower bits won't have much effect on the final color, if exactness isn't necessary you can gain some performance by simply multiplying each component by 16:
return b << 4 | g << 8 | r << 12 | a << 16
(All the left shifts values are strange because we did not bother doing a right shift before)
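As a quick check of the nibble-duplication idea, here is a Python sketch that widens each 4-bit channel by multiplying it by 17 and compares the result with the shift-only formula above. Both function names are made up for this example.

```python
def argb4444_to_argb8888(v: int) -> int:
    """Widen each 4-bit channel to 8 bits by multiplying it by 17 (0xN -> 0xNN)."""
    result = 0
    for channel in range(4):
        nibble = (v >> (4 * channel)) & 0xF
        result |= (nibble * 17) << (8 * channel)
    return result


def argb4444_to_argb8888_shifts(v: int) -> int:
    """Same conversion written with the masks and shifts from the answer."""
    a = v & 0b1111000000000000
    r = v & 0b111100000000
    g = v & 0b11110000
    b = v & 0b1111
    return b | b << 4 | g << 4 | g << 8 | r << 8 | r << 12 | a << 12 | a << 16


print(hex(argb4444_to_argb8888(0x1234)))  # 0x11223344
assert all(argb4444_to_argb8888(v) == argb4444_to_argb8888_shifts(v)
           for v in range(0x10000))
```

The exhaustive comparison over all 65536 inputs confirms that the two formulations agree.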

Three boolean values saved in one tinyint

probably a simple question but I seem to be suffering from programmer's block. :)
I have three boolean values: A, B, and C. I would like to save the state combination as an unsigned tinyint (max 255) into a database and be able to derive the states from the saved integer.
Even though there are only a limited number of combinations, I would like to avoid hard-coding each state combination to a specific value (something like if A=true and B=true has the value 1).
I tried to assign values to the variables so (A=1, B=2, C=3) and then adding, but I can't differentiate between A and B being true from i.e. only C being true.
I am stumped but pretty sure that it is possible.
Thanks
Binary maths, I think. Give each flag a value that's a power of 2 (1, 2, 4, 8, etc.), then you can use the 'bitwise and' operator & to determine the value.
Say A = 1, B = 2 , C= 4
00000111 => A B and C => 7
00000101 => A and C => 5
00000100 => C => 4
then to determine them :
if( val & 4 ) // same as if (C)
if( val & 2 ) // same as if (B)
if( val & 1 ) // same as if (A)
if((val & 4) && (val & 2) ) // same as if (C and B)
No need for a state table.
Edit: to reflect comment
If the tinyint has a maximum value of 255 => you have 8 bits to play with and can store 8 boolean values in there
binary math as others have said
encoding:
myTinyInt = A*1 + B*2 + C*4 (assuming you convert A,B,C to 0 or 1 beforehand)
decoding
bool A = (myTinyInt & 1) != 0 (& is the bitwise and operator in many languages; the parentheses matter because != binds tighter than & in C-like languages)
bool B = (myTinyInt & 2) != 0
bool C = (myTinyInt & 4) != 0
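The encode and decode steps can be tried end to end with a short Python sketch. The names pack_flags and unpack_flags are invented for this illustration; the resulting integer is what would be stored in the tinyint column.

```python
FLAG_A = 1  # bit 0
FLAG_B = 2  # bit 1
FLAG_C = 4  # bit 2


def pack_flags(a: bool, b: bool, c: bool) -> int:
    """Combine three booleans into one small integer (0..7)."""
    return (FLAG_A if a else 0) | (FLAG_B if b else 0) | (FLAG_C if c else 0)


def unpack_flags(value: int) -> tuple:
    """Recover the three booleans from the stored integer."""
    return bool(value & FLAG_A), bool(value & FLAG_B), bool(value & FLAG_C)


stored = pack_flags(True, False, True)
print(stored)                # 5
print(unpack_flags(stored))  # (True, False, True)
```

Every one of the eight combinations maps to a distinct value, so A and B together (3) can never be confused with C alone (4).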
I'll add that you should find a way to not use magic numbers. You can build masks into constants using the Left Logical/Bit Shift with a constant bit position that is the position of the flag of interest in the bit field. (Wow... that makes almost no sense.) An example in C++ would be:
#include <cstdint> // for uint8_t
enum Flags {
    kBitMask_A = (1 << 0),
    kBitMask_B = (1 << 1),
    kBitMask_C = (1 << 2),
};
uint8_t byte = 0; // byte = 0b00000000
byte |= kBitMask_A; // Set A, byte = 0b00000001
byte |= kBitMask_C; // Set C, byte = 0b00000101
if (byte & kBitMask_A) { // Test A, (0b00000101 & 0b00000001) = T
    byte &= ~kBitMask_A; // Clear A, byte = 0b00000100
}
In any case, I would recommend looking for Bitset support in your favorite programming language. Many languages will abstract the logical operations away behind normal arithmetic or "test/set" operations.
Need to use binary...
A = 1,
B = 2,
C = 4,
D = 8,
E = 16,
F = 32,
G = 64,
H = 128
This means A + B = 3 but C = 4. You'll never have two conflicting values. I've listed the maximum you can have for a single byte: 8 values (bits).