How do I iterate through a UTF-16 encoded string character by character? - unicode

I have a UTF-16 encoded string, theUTF16String. It contains double-byte characters. I would like to iterate through it Unicode character by Unicode character. I understand that the chunk expressions work on single-byte characters?
An example
We have the following string
abcαβɣ
We want to iterate through it and put each character on a line of its own in another container.

In LiveCode, there are two ways to get a character from a UTF-16 string. If the string is displayed in a field, you can do
select char 3 of fld 1
and if you have Russian or Polish text in the field, it will correctly select one character. However, this feature isn't very well developed in LiveCode and will fail with many Chinese, Japanese, Arabic and other languages. Therefore, it is better to use bytes for now:
select byte 5 to 6 of fld 1
The latter will also be compatible with future versions of LiveCode, while the former may not be.
Anyway, you have your string in a variable, which means you have to handle the string as bytes (you could use chars, but bytes and chars are dealt with in the same way in this case, because the data is in a variable). You can iterate through the variable in steps of two, i.e. one character at a time:
repeat with x = 1 to the number of bytes of theUTF16String step 2
  put byte x to x+1 of theUTF16String into myChar
  // do something with myChar here, e.g. reverse the bytes?
  put byte 2 of myChar & byte 1 of myChar after myNewString
end repeat
// myNewString now contains the entire theUTF16String in reverse byte order.
(You could do this in 3 lines instead of 4, but for the purpose of the example I have added a line that stores the bytes in var myChar).
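If it helps to see the same idea outside LiveCode, here is a minimal Python sketch (not part of the original answer) that steps through a UTF-16LE byte string two bytes at a time and swaps each byte pair; like the LiveCode loop above, it assumes every character fits in a single 2-byte code unit.
python
# A sketch of the same byte-pair iteration in Python; the variable names are
# illustrative. Like the LiveCode example, it assumes no surrogate pairs
# (i.e. no characters outside the Basic Multilingual Plane).
the_utf16_string = "abcαβɣ".encode("utf-16-le")

my_new_string = b""
for x in range(0, len(the_utf16_string), 2):
    my_char = the_utf16_string[x:x + 2]   # one 2-byte code unit
    my_new_string += my_char[::-1]        # reverse the byte order of this code unit

print(my_new_string)  # the original data with every code unit byte-swapped (i.e. UTF-16BE)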

Related

postgres: how to count multibyte emoji strings display length in UTF-8

Postgres (v11) counts the red heart ❤️ as two characters, and does the same for other multibyte UTF-8 chars with variation selectors. Does anyone know how I can get postgres to count true characters and not the bytes?
For example, I would like both of the examples below to return 1.
select length('❤️') = 2 (Unicode: 2764 FE0F)
select length('🏃‍♂️') = 4 (Unicode: 1F3C3 200D 2642 FE0F)
UPDATE
Thank you to the folks who pointed out that postgres is correctly counting the Unicode code points, and why and how this happens.
I don't see any option other than pre-processing the emoji strings against a table of official Unicode ranges, in Python or some such, to get the perceived length.
So one way to do this is to ignore all characters in the Variation Selectors range and decrement by 2 if you hit the General Punctuation range (which contains the Zero Width Joiner).
This could be converted into a postgres function.
python
"""
# For reference, these code pages apply to emojis
Name Range
Emoticons 1F600-1F64F
Supplemental_Symbols_and_Pictographs 1F900-1F9FF
Miscellaneous Symbols and Pictographs 1F300-1F5FF
General Punctuation 2000-206F
Miscellaneous Symbols 2600-26FF
Variation Selectors FE00-FE0F
Dingbats 2700-27BF
Transport and Map Symbols 1F680-1F6FF
Enclosed Alphanumeric Supplement 1F100-1F1FF
"""
emojis = "🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️"  # true count is 7, postgres length() returns 28
true_count = 0
for char in emojis:
    d = ord(char)
    char_type = None
    if 0x2000 <= d <= 0x206F: char_type = "GP"    # General Punctuation (contains the ZWJ U+200D)
    elif 0xFE00 <= d <= 0xFE0F: char_type = "VS"  # Variation Selector
    print(d, char_type)
    if char_type != "VS": true_count += 1  # count every code point except variation selectors
    if char_type == "GP": true_count -= 2  # the ZWJ glues two glyphs together: uncount it and one neighbour
print(true_count)
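For comparison, the third-party regex module (not mentioned in the original post) can match extended grapheme clusters directly with \X, which gives the perceived length without a hand-written range table:
python
# A sketch using the third-party "regex" module (pip install regex), assumed available.
import regex

emojis = "🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️"
graphemes = regex.findall(r"\X", emojis)  # \X matches one extended grapheme cluster
print(len(graphemes))  # 7 perceived characters, even though len(emojis) is 28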

Why the trailing 0x00 byte after BSON string (not Cstring/ename)?

Obviously, for a BSON cstring the trailing byte is used to determine the length of the string, so it is: (byte*) "\x00". Cstrings are used for regex patterns, regex options and enames, which are not long and not used in iterations, so a length prefix is not necessary. But then comes...
A BSON string is written as: int32 (byte*) "\x00"
with the specification as follows: The int32 is the number of bytes in the (byte*) + 1 (for the trailing '\x00'). The (byte*) is zero or more UTF-8 encoded characters.
But why the trailing zero byte? If we have the length of the UTF-8 encoded string, that is sufficient for the byte data workflow, and the 0x00 byte just adds an unneeded byte. Am I missing something?
The reasoning for both the length of the string and the null terminator is twofold: compatibility with existing C-style strings, and performance.
For performance, MongoDB needs to be able to quickly go to a specific field in a document without iterating through the whole BSON. This is especially important if you're looking for a field that is close to the end of a large (say 16 MB) document. With the length of the string encoded at the start of the string type, it can just skip that number of bytes and get to the next field. Otherwise, it would need to iterate over the whole string until it finds the end of the string.
For compatibility, MongoDB is written in C++, where strings are null terminated. It could cut off that null terminator to save one byte since the length is encoded, but getting that string out of BSON into a format usable by C++ would require tacking on that null again. This would need a specialized string handling routine whose only advantage is saving a single byte.
Overall, it was decided that "wasting" a single byte is an acceptable tradeoff.
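To make the skip-by-length argument concrete, here is a minimal Python sketch (not from the original answer, and not a full BSON parser) of how a BSON string value is laid out and how a reader can jump past it using the int32 prefix:
python
# A sketch of the BSON string layout described above; the helper names are illustrative.
import struct

def encode_bson_string(s):
    data = s.encode("utf-8")
    # int32 length = byte length of the UTF-8 data + 1 for the trailing 0x00
    return struct.pack("<i", len(data) + 1) + data + b"\x00"

def skip_bson_string(buf, offset):
    # Read the length prefix and jump past the string in one step,
    # instead of scanning byte by byte for a terminator.
    (length,) = struct.unpack_from("<i", buf, offset)
    return offset + 4 + length

blob = encode_bson_string("hello") + encode_bson_string("world")
print(skip_bson_string(blob, 0))  # 10: the offset where the second string starts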

Encode a Date and a four digit number into a string with max 8 characters

I have a datetime and a four-digit number and I need to encode them into an 8-character, case-insensitive ASCII string.
The four-digit number is not actually arbitrary; only about 20 or so specific values occur, of the form (2513, 2595, 2579, ...).
My current approach is to use Base36 encoding. Further, I have a dictionary for the four digit numbers that maps like this:
2513 -> '00'
2595 -> '01'
...
The first two characters of the resulting string are used for this. The remaining six characters are used for encoding a unix timestamp, with the seconds stripped (I only need minute resolution), in Base36.
So, (2513, 07.01.2015) maps to '000E3HEU'.
My question is, if anyone can think of an even more compact encoding?
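A minimal Python sketch of the scheme described above (the code table entries and helper names are illustrative, not from the original post):
python
# Sketch of the Base36 approach: 2 chars for the dictionary code, 6 chars for the
# unix timestamp in minutes. The table below only shows a few of the ~20 values.
CODE_TABLE = {2513: "00", 2595: "01", 2579: "02"}
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base36(n, width):
    digits = ""
    while n:
        n, r = divmod(n, 36)
        digits = ALPHABET[r] + digits
    return digits.rjust(width, "0")

def encode(code, unix_seconds):
    minutes = unix_seconds // 60                  # strip the seconds
    return CODE_TABLE[code] + to_base36(minutes, 6)

print(encode(2513, 1420629480))  # '000E3HEU' (2513 on 07.01.2015, assuming 11:18 UTC)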

Extract the first letter of a UTF-8 string with Lua

Is there any way to extract the first letter of a UTF-8 encoded string with Lua?
Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will return "?" rather than "Ø".
Is there a relatively simple UTF-8 parsing algorithm I could use on the string byte per byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?
Or is this way too complex, requiring a huge library, etc.?
You can easily extract the first letter from a UTF-8 encoded string with the following code:
function firstLetter(str)
return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
This works because a UTF-8 encoded code point either is a single byte from 0 to 127, or begins with a byte from 194 to 244 followed by one or more bytes from 128 to 191.
You can even iterate over UTF-8 code points in a similar manner:
for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
print(code)
end
Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
Lua 5.3 provides a UTF-8 library.
You can use utf8.codes to get each code point, and then use utf8.char to get the character:
local str = "ÆØÅ"
for _, c in utf8.codes(str) do
print(utf8.char(c))
end
This also works:
local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
print(w)
end
where utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", the pattern that matches one UTF-8 byte sequence.
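The same lead-byte idea carries over to other languages; here is a small Python sketch (not from the original answers) using an equivalent byte pattern:
python
# A sketch of the same pattern applied to a byte string in Python.
import re

def first_letter(data):
    # A UTF-8 sequence is a single byte 0x00-0x7F, or a lead byte 0xC2-0xF4
    # followed by continuation bytes 0x80-0xBF.
    m = re.match(rb"[\x00-\x7F\xC2-\xF4][\x80-\xBF]*", data)
    return m.group(0) if m else b""

print(first_letter("ÆØÅ".encode("utf-8")).decode("utf-8"))  # Æ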

Why does a base64 encoded string have an = sign at the end

I know what base64 encoding is and how to calculate base64 encoding in C#; however, I have seen several times that when I convert a string into base64, there is an = at the end.
A few questions came up:
Does a base64 string always end with =?
Why does an = get appended at the end?
Q: Does a base64 string always end with =?
A: No. (The word usb is base64 encoded into dXNi.)
Q: Why does an = get appended at the end?
A: As a short answer:
The last character (the = sign) is added only as padding in the final step of encoding a message whose length is not a multiple of 3 bytes.
You will not have an = sign if your string has a length that is a multiple of 3 characters, because Base64 encoding takes each group of three bytes (a character = 1 byte) and represents it as four printable characters in the ASCII standard.
Example:
(a) If you want to encode
ABCDEFG <=> [ABC] [DEF] [G]
Base64 deals with the first block (producing 4 characters) and the second (as they are complete). But for the third, it will add a double == in the output in order to complete the 4 needed characters. Thus, the result will be QUJD REVG Rw== (without spaces).
[ABC] => QUJD
[DEF] => REVG
[G] => Rw==
(b) If you want to encode ABCDEFGH <=> [ABC] [DEF] [GH]
similarly, it will add one = at the end of the output to get 4 characters.
The result will be QUJD REVG R0g= (without spaces).
[ABC] => QUJD
[DEF] => REVG
[GH] => R0g=
It serves as padding.
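For what it's worth, you can check the examples above quickly (shown in Python rather than C# purely for brevity):
python
import base64

print(base64.b64encode(b"usb").decode())       # dXNi          -> no padding
print(base64.b64encode(b"ABCDEFG").decode())   # QUJDREVGRw==  -> two '=' signs
print(base64.b64encode(b"ABCDEFGH").decode())  # QUJDREVGR0g=  -> one '=' sign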
A more complete answer is that a base64 encoded string doesn't always end with an =; it will only end with one or two = signs if they are required to pad the string out to the proper length.
From Wikipedia:
The final '==' sequence indicates that the last group contained only one byte, and '=' indicates that it contained two bytes.
Thus, this is some sort of padding.
It's defined in RFC 2045 as a special padding character used if fewer than 24 bits are available at the end of the data being encoded.
No.
To pad the Base64-encoded string to a multiple of 4 characters in length, so that it can be decoded correctly.
The equals sign (=) is used as padding in certain forms of base64 encoding. The Wikipedia article on base64 has all the details.
It's padding. From http://en.wikipedia.org/wiki/Base64:
In theory, the padding character is not needed for decoding, since the number of missing bytes can be calculated from the number of Base64 digits. In some implementations, the padding character is mandatory, while for others it is not used. One case in which padding characters are required is concatenating multiple Base64 encoded files.
http://www.hcidata.info/base64.htm
Encoding "Mary had" to Base 64
In this example we are using a simple text string ("Mary had") but the principle holds no matter what the data is (e.g. graphics file). To convert each 24 bits of input data to 32 bits of output, Base 64 encoding splits the 24 bits into 4 chunks of 6 bits. The first problem we notice is that "Mary had" is not a multiple of 3 bytes - it is 8 bytes long. Because of this, the last group of bits is only 4 bits long. To remedy this we add two extra bits of '0' and remember this fact by putting a '=' at the end. If the text string to be converted to Base 64 was 7 bytes long, the last group would have had 2 bits. In this case we would have added four extra bits of '0' and remember this fact by putting '==' at the end.
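Here is a small Python sketch of that 6-bit chunking (not part of the quoted page), using the standard Base64 alphabet:
python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

data = b"Mary had"                        # 8 bytes, so the last group holds only 2 bytes
bits = "".join(f"{b:08b}" for b in data)  # 64 bits
bits += "0" * ((-len(bits)) % 6)          # pad the final 4-bit chunk with two '0' bits
encoded = "".join(ALPHABET[int(bits[i:i+6], 2)] for i in range(0, len(bits), 6))
encoded += "=" * ((-len(encoded)) % 4)    # one '=' records that the last group had 2 bytes
print(encoded)                            # TWFyeSBoYWQ=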
= is a padding character. If the input stream has a length that is not a multiple of 3, the padding character will be added. This is required by the decoder: if no padding is present, the last byte would have an incorrect number of zero bits.
Better and deeper explanation here: https://base64tool.com/detect-whether-provided-string-is-base64-or-not/
The equals or double equals serves as padding. It's a stupid concept defined in RFC 2045 and it is actually superfluous. Any decent parser can encode and decode a base64 string without knowing about padding, by just counting up the number of characters and filling in the rest if the size isn't divisible by 3 or 4 respectively. This actually leads to difficulties every now and then, because some parsers expect padding while others blatantly ignore it. My MPU base64 decoder, for example, needs padding, but it receives a non-padded base64 string over the network. This leads to erroneous parsing, and I had to account for it myself.
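As a concrete illustration of "filling in the rest" yourself, here is a small Python sketch (the helper name is made up) that re-adds the stripped padding before decoding:
python
import base64

def decode_unpadded(s):
    missing = (-len(s)) % 4            # number of '=' chars needed to reach a multiple of 4
    return base64.b64decode(s + "=" * missing)

print(decode_unpadded("dXNi"))         # b'usb'
print(decode_unpadded("QUJDREVGRw"))   # b'ABCDEFG'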