How do I use length() for unicode characters?

When working in the Moovweb SDK, length("çãêá") is expected to return 4, but instead returns 8. How can I ensure that the length function works correctly when using Unicode characters?

This is a common issue: the length() function is using the wrong character set for the string. To fix it, set the charset_determined variable so the correct character set is in effect before the call to length(), like so in your Tritium code:
$charset_determined = "utf-8"
# your call to length() here

In Unicode, there is no single notion of "the length of a string" or "the number of characters"; that idea comes from ASCII thinking.
You can choose one of the following, depending on what exactly you need:
For cursor movement, text selection, and the like, use grapheme clusters.
For limiting the length of a string in input fields, file formats, protocols, or databases, measure the length in code units of some predetermined encoding. The reason is that any length limit is derived from the fixed amount of memory allocated for the string at a lower level, be it in memory, on disk, or in a particular data structure.
The size of the string as it appears on the screen is unrelated to the number of code points in the string. For that, you have to communicate with the rendering engine. A code point does not necessarily occupy one column, even in monospace fonts and terminals. POSIX takes this into account (wcwidth()).
There is more info at http://utf8everywhere.org
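To make the distinction concrete, here is a minimal Python sketch (Python is used purely for illustration; the same distinctions apply in Tritium or any other language). It shows why "length" depends on whether you count code units, code points, or grapheme clusters:

    import unicodedata

    s = "çãêá"                                     # four precomposed letters
    print(len(s.encode("utf-8")))                  # 8  -> UTF-8 code units (bytes)
    print(len(s))                                  # 4  -> code points (NFC form)
    print(len(unicodedata.normalize("NFD", s)))    # 8  -> code points after decomposition
    # Counting grapheme clusters (what a user perceives as "characters") needs
    # extra support, e.g. the third-party "regex" module: regex.findall(r"\X", s)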

Related

The list of unusual Unicode characters

Where can I get the complete list of all Unicode characters that don't behave as simple characters? Examples: character 0x0363 (won't print without another character before it), character 0x0084 (does weird things when printed). I need just a raw list of such unusual characters so I can replace them with something harmless and avoid unwanted output effects. Regular characters (those not in this list) should take exactly one character cell when printed (= cursor moves +1 to the right), should not depend on previous or next characters, and should not affect printing style in any way.
Edit because of multiple comments:
I have a Unicode string, usually consisting of "usual" characters like 0x20-0x7E or Cyrillic letters. There are also plenty of other Unicode characters that are usual and may safely be assumed to have strlen() = 1. The string is printed on the terminal and I need to know the resulting position of the cursor. I don't want to use a complex and unstable library for that; I want the simplest possible logic that works. Every problematic character may be replaced with U+FFFD or with something like "<U+0363>" (an ASCII string with its index instead of the character itself). I want a list of "possibly problematic" characters to replace. It is acceptable for this list to contain some non-problematic characters too, but not many.
There is no simple algorithm for this. You'll likely need a complex, but extremely stable library: libicu, or something based on it. Basically every other library that does this kind of work is based on libicu, which is maintained by the Unicode organization.
If you don't want to use the official library (or something based on their library), you'll need to parse the Unicode Character Database yourself. In particular, you need to look at Character Properties, and parse the files in the UCD.
I believe you're asking for Bidi_Class (i.e. "direction") to be Left_To_Right, Canonical_Combining_Class to be Not_Reordered, and Joining_Type to be Non_Joining.
You probably also want to check the General_Category and avoid M* (Marks) and C* (Other).
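As a rough illustration of those property checks, here is a Python sketch using the standard unicodedata module. It only approximates the rules above (Joining_Type, for instance, is not exposed by unicodedata), and the bidi allow-list deliberately includes neutral classes so ordinary ASCII punctuation and digits pass:

    import unicodedata

    def is_simple(ch: str) -> bool:
        """Heuristic: True if ch is probably safe to treat as one ordinary,
        non-combining, left-to-right character. Not a complete solution."""
        cat = unicodedata.category(ch)
        if cat.startswith(("M", "C")):          # Marks and Other (controls, format, ...)
            return False
        if unicodedata.combining(ch) != 0:      # Canonical_Combining_Class != Not_Reordered
            return False
        if unicodedata.bidirectional(ch) not in ("L", "EN", "ES", "ET", "CS", "WS", "ON"):
            return False                        # keep left-to-right and neutral classes only
        if ord(ch) > 0xFFFF:                    # optionally restrict to the BMP
            return False
        return True

    def sanitize(s: str) -> str:
        # Replace anything "unusual" with U+FFFD, as the question suggests.
        return "".join(ch if is_simple(ch) else "\uFFFD" for ch in s)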
A check like this should work for some emoji, but the whole approach will still break on a lot of emoji that look simple and are not. Most famously ❤️, which is two "characters", not one. You may want to filter out emoji entirely. As a simple starting point, you may want to restrict yourself to the Basic Multilingual Plane (BMP), which is code points 0000-FFFF. Anything above this range is, almost by definition, rare or unusual. The BMP does include some emoji, but most emoji (and all new emoji) are outside that range.
Remember that the glyphs for single characters can still have radically different widths, even in nominally fixed-width fonts. For example, 𒈙 (U+12219 CUNEIFORM SIGN LUGAL OPPOSING LUGAL) is a completely "normal" character in the sense you're describing. It is left-to-right. It doesn't depend on or influence the characters around it (it's non-combining and non-joining). Its "length in characters" is 1. Its glyph is also extremely wide in most fonts and breaks a lot of layouts. I don't know of anything in the Unicode database that would warn you about this, since "glyph width" is entirely a function of fonts, not characters, and Unicode explicitly does not consider fonts. (That said, most of the really problematic characters are outside the BMP. Probably the most common exception is DŽ, but many fixed-width fonts have a narrow glyph for it: DŽ.)
Let's write some cuneiform in a fixed-width font.
Normally, every character should line up with a character above.
Here: 𒈙. See how these characters don't align correctly?
Not only is it a very wide glyph, but its width is not even a multiple.
At least not in my font (Mac Safari 15.0).
But DŽ is ok.
Also remember that there are multiple ways to encode the same "character." For example, é can be a "simple" character (U+00E9), or it can be two characters (U+0065, U+0301). So in some cases é may print in your scheme, and in others it won't. I suspect this is fine for your problem, but if it isn't, you're going to need to apply a normalization form (likely NFC).
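A short Python illustration of that last point (unicodedata is in the standard library; NFC composes wherever a precomposed code point exists):

    import unicodedata

    composed   = "\u00e9"         # é as a single code point
    decomposed = "e\u0301"        # e + COMBINING ACUTE ACCENT
    print(composed == decomposed)                                   # False
    print(unicodedata.normalize("NFC", decomposed) == composed)     # True
    print(len(decomposed), len(unicodedata.normalize("NFC", decomposed)))  # 2 1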

How do I create a character set like ASCII?

I'm curious about how character sets were implemented in the past, and I'd like to know how I can implement a character set of my own.
ASCII (American Standard Code for Information Interchange) was the "original" character set, and it remains the basis for most text data. ASCII is actually a 7-bit code (the numeric values range from 0 to 127), with the most significant bit of a byte indicating whether the rest of the byte refers to ASCII (if zero) or to the current code page.
Extra (non-ASCII) characters were then added via these code pages, and the user's computer would load a specific code page to use. Unfortunately, this meant you needed to load the correct code page before viewing a file, or the wrong characters would appear.
We have since moved on, and most systems now use Unicode, typically with a variable-length encoding (rather than the single-byte characters used previously). Unicode can contain thousands upon thousands of characters, allowing a single encoding to cover what would have been multiple code pages under the old ASCII-plus-code-page scheme.
That's the brief history. As to how to create your own character set, I'm not sure what you are trying to achieve. You can create your own fonts, but if you're talking about an actual character set (i.e. characters that do not already exist), you'll have to get it added to a standard such as Unicode so that other computers can make use of your new characters, which would be a considerable amount of work (and I have no idea how you'd even go about it). It's worth considering, however, that almost every character in existence is already in Unicode, so you may want to review what's already been done before taking on a mammoth undertaking such as creating an entirely new character set.

dart, total available string characters?

I'm not familiar with character sets, or with whether languages pick them up from their environment or have them baked into the language itself. I want to make a simple number system in Dart with the largest possible base: just as hex uses 0-9a-f, I would give every single character a value in some specified ascending order, with lower case and upper case having different values, to get the largest possible base for my number system. I want to do this so I can send numbers as strings with as few characters as possible. So my question is: does Dart have a standard, baked-in character set that I can be certain will exist in every environment it runs in?
You should be able to use every value even if no concrete character is assigned to a code.
This would only be a problem when you try to display the character.
Some codes are control characters with special meaning (like 0x0000) which you should avoid.
More info here: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
If you want to transport the result over the internet using text protocols you may be limited to ASCII. In this case I suggest Base64 encoding.
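Here is a minimal sketch of the Base64 route in Python (illustration only; Dart's dart:convert library provides an equivalent base64Url codec, so the same approach carries over):

    import base64

    def int_to_b64(n: int) -> str:
        # Pack the integer into the fewest bytes that hold it, then Base64-encode.
        raw = n.to_bytes((n.bit_length() + 7) // 8 or 1, "big")
        return base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")

    def b64_to_int(s: str) -> int:
        padded = s + "=" * (-len(s) % 4)
        return int.from_bytes(base64.urlsafe_b64decode(padded), "big")

    print(int_to_b64(123456789))               # "B1vNFQ" -- 6 ASCII characters
    print(b64_to_int(int_to_b64(123456789)))   # 123456789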

Displaying Unicode Characters

I already searched for answers to this sort of question here, and have found plenty of them -- but I still have this nagging doubt about the apparent triviality of the matter.
I have read this very interesting and helpful article on the subject: http://www.joelonsoftware.com/articles/Unicode.html, but it left me wondering how one would go about identifying individual glyphs given a buffer of Unicode data.
My questions are:
How would I go about parsing a Unicode string, say UTF-8?
Assuming I know the byte order, what happens when I encounter the beginning of a glyph that is supposed to be represented by 6 bytes?
That is, if I interpreted the method of storage correctly.
This is all related to a text display system I am designing to work with OpenGL.
I am storing glyph data in display lists and I need to translate the contents of a string to a sequence of glyph indexes, which are then mapped to display list indices (since, obviously, storing the entire glyph set in graphics memory is not always practical).
Having to represent every string as an array of shorts would require a significant amount of storage, considering everything I need to display.
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
How would I go about parsing a Unicode string, say UTF-8?
I'm assuming that by "parsing", you mean converting to code points.
Often, you don't have to do that. For example, you can search for a UTF-8 string within another UTF-8 string without needing to care about what characters those bytes represent.
If you do need to convert to code points (UTF-32), then:
Check the first byte to see how many bytes are in the character.
Look at the trailing bytes of the character to ensure that they're in the range 80-BF. If not, report an error.
Use bit masking and shifting to convert the bytes to the code point.
Report an error if the byte sequence you got was longer than the minimum needed to represent the character.
Increment your pointer by the sequence length and repeat for the next character.
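Here is a minimal Python sketch of those steps (illustration only; in a C/OpenGL code base you would do the same bit masking on unsigned chars):

    def decode_utf8(data: bytes) -> list[int]:
        """Convert UTF-8 bytes to a list of code points, following the steps above.
        A sketch, not a complete validator (e.g. it does not reject surrogates)."""
        out = []
        i = 0
        while i < len(data):
            lead = data[i]
            if lead < 0x80:                      # 1-byte sequence (ASCII)
                cp, length = lead, 1
            elif 0xC2 <= lead <= 0xDF:           # 2-byte sequence
                cp, length = lead & 0x1F, 2
            elif 0xE0 <= lead <= 0xEF:           # 3-byte sequence
                cp, length = lead & 0x0F, 3
            elif 0xF0 <= lead <= 0xF4:           # 4-byte sequence
                cp, length = lead & 0x07, 4
            else:
                raise ValueError(f"invalid lead byte {lead:#x} at offset {i}")
            if i + length > len(data):
                raise ValueError("truncated sequence at end of input")
            for trail in data[i + 1:i + length]:
                if not 0x80 <= trail <= 0xBF:    # continuation bytes must be 80-BF
                    raise ValueError(f"bad continuation byte near offset {i}")
                cp = (cp << 6) | (trail & 0x3F)
            minimum = (0x00, 0x80, 0x800, 0x10000)[length - 1]
            if cp < minimum or cp > 0x10FFFF:    # reject overlong / out-of-range
                raise ValueError(f"overlong or out-of-range sequence at offset {i}")
            out.append(cp)
            i += length
        return out

    print([hex(cp) for cp in decode_utf8("é€😀".encode("utf-8"))])
    # ['0xe9', '0x20ac', '0x1f600']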
Additionally, it seems to me that 2 bytes per character simply isn't enough to represent every possible Unicode element.
It's not. Unicode was originally intended to be a fixed-width 16-bit encoding. It was later decided that 65,536 characters weren't enough, so UTF-16 was created, and Unicode was redefined to use code points between 0 and 1,114,111.
If you want a fixed-width encoding, you need 21 bits. But there aren't many languages that have a 21-bit integer type, so in practice you need 32 bits.
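The arithmetic, spelled out in Python:

    print(2 ** 16)                    # 65536: code points that fit in 16 bits
    print(0x10FFFF)                   # 1114111: the highest Unicode code point
    print((0x10FFFF).bit_length())    # 21: bits needed for a fixed-width encoding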
Well, I think this answers it:
http://en.wikipedia.org/wiki/UTF-8
Why it didn't show up the first time I went searching, I have no idea.

How should I handle digits from different sets of UNICODE digits in the same string?

I am writing a function that transliterates UNICODE digits into ASCII digits, and I am a bit stumped on what to do if the string contains digits from different sets of UNICODE digits. So for example, if I have the string "\x{2463}\x{24F6}" ("④⓶"). Should my function
return 42?
croak that the string contains mixed sets?
carp that the string contains mixed sets and return 42?
give the user an additional argument to specify one of the three above behaviours?
do something else?
Your current function appears to do #1.
I suggest that you should also write another function to do #4, but only when the requirement appears, and not before.
I'm sure Joel wrote about "premature implementation" in a blog article sometime recently, but I can't find it.
I'm not sure I see a problem.
You support numeric conversion from a range of scripts, which is to say, you are aware of the Unicode codepoints for their numeric characters.
If you find an unknown codepoint in your input data, it is an error.
It is up to you what you do in the event of an error; you may insert a space or underscore, or you may abort conversion. What you would do will depend on the environment in which your function executes; it is not something we can tell you.
My initial thought was #4; strictly based on the fact that I like options. However, I changed my mind, when I viewed your function.
The purpose of the function seems to be, simply, to get the resulting digits 0..9. Users may find it useful to send in mixed sets (a feature :). I'll use it.
If you ever have to handle input in bases greater than 10, you may end up having to treat many variants on the first 6 letters of the Latin alphabet ('ABCDEF') as digits in all their forms.
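For reference, here is a minimal sketch of the transliteration itself, implementing behaviour #1 and leaving the mixed-set policy to the caller. Python is used for illustration only; the question's \x{...} escapes suggest Perl, where similar numeric-value lookups are available via Unicode::UCD:

    import unicodedata

    def to_ascii_digits(s: str) -> str:
        """Behaviour #1: map every character with a single-digit numeric value
        to its ASCII digit, regardless of which digit set it came from."""
        out = []
        for ch in s:
            d = unicodedata.digit(ch, None)          # digit value, if the UCD defines one
            if d is None:
                n = unicodedata.numeric(ch, None)    # fall back to the general numeric value
                d = int(n) if n is not None and n == int(n) and 0 <= n <= 9 else None
            if d is None:
                raise ValueError(f"{ch!r} has no digit value")
            out.append(str(d))
        return "".join(out)

    print(to_ascii_digits("\u2463\u24f6"))   # "42" for "④⓶"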