How can I display a character above U+FFFF? Say I want to show U+1F384. "\u1F384" is interpreted as "\u1F38" followed by the character "4", and gives the error No glyph found for the ? (\u1F38) character. "\uD83C\uDF84" is interpreted as the character "\uD83C" followed by the character "\uDF84", and gives the error
No glyph found for the ? (\uD83C) character
No glyph found for the ? (\uDF84) character
Here is some example code to demonstrate:
PFont font;
void setup()
{
size(128, 128);
background(0);
font = loadFont("Symbola-8.vlw");
textFont(font, 8);
fill(255);
text("\u1F384", 10, 10);
}
void draw()
{
}
As stated by the tag, this is in the language Processing.
\U0001F384 gives the error unexpected char: 'U', presumably because Processing doesn't support 32-bit escapes in that format.
It doesn't really matter how it is displayed, the main problem is making a string contain a character whose decimal codepoint is greater than 65,535.
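Unfortunately, there does not appear to be a way to do this with text() and a .vlw font. The string itself can hold U+1F384 as the surrogate pair "\uD83C\uDF84", but, judging by the error messages above, glyphs are looked up one 16-bit char at a time, so each half of the pair is treated as a separate character.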
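For what it's worth, getting the code point into a string is the easy half in plain Java, which is what a Processing sketch compiles to. The following is only a minimal sketch using standard java.lang calls; the class name AstralString is made up for illustration, and it says nothing about whether text() can then draw the result with a given font.

```java
final class AstralString {
    // Builds a String holding a single code point; code points above
    // U+FFFF are stored as a surrogate pair (two UTF-16 code units).
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String tree = fromCodePoint(0x1F384);
        // Number of UTF-16 code units in the string.
        System.out.println(tree.length());
        // Number of Unicode code points in the string.
        System.out.println(tree.codePointCount(0, tree.length()));
        // The code point read back from index 0, in hexadecimal.
        System.out.println(Integer.toHexString(tree.codePointAt(0)));
        // Whether this matches the hand-written surrogate pair.
        System.out.println(tree.equals("\uD83C\uDF84"));
    }
}
```

The same string can be written directly as "\uD83C\uDF84"; Character.toChars just saves working out the surrogates by hand.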
You can now use '\u{1F384}' in an ECMAScript 6 compatible browser.
Since I updated Flutter and all libraries, I encounter a strange bug when decoding a list of bytes.
The app communicates with a Bluetooth device through the flutter_blue library like this:
import 'dart:convert';
var result = await characteristic.read(); // [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
return utf8.decode(result, allowMalformed: true);
The decoded string is displayed in a widget.
Previously I had no problem: the string appeared empty. But since the update, the string looks empty in the console but not in the widget, where I see several empty squares as characters. And the length of the string, even after calling trim, is 15, not 0.
I can't find any explanation for this change on the internet, nor how to solve the problem.
Have you ever run into this bug? Do you have a good solution?
Thanks
Edit:
The result is the same with allowMalformed = true, or with
new String.fromCharCodes(result)
I think there is a bug in Flutter when decoding only zeros.
The NUL character (the 0) is an allowed character in UTF-8. It looks like previous versions didn't implement the decoding correctly and ignored the NUL character. You should expect NUL to be present in the decoded string as a well-formed character.
The official docs also say:
If allowMalformed is true, the decoder replaces invalid (or unterminated) character sequences with the Unicode Replacement character U+FFFD (�).
So, a solution for this problem is to parse the resulting string and check/replace the NUL chars and/or the Unicode Replacement chars.
As noted in the edit to my question, there seems to be a problem with trailing zeros.
When converting byte values to characters, 32 is a space and 0 (NUL) used to be treated as an empty character, but that is no longer the case.
The solution I found is
var result = await characteristic.read();
var tempResult = List<int>.from(result);
tempResult.removeWhere((item) => item == 0);
return utf8.decode(tempResult, allowMalformed: true);
I copy the first list into a mutable list and remove all the zeros from it. That works perfectly.
This Wikipedia page lists the "values" of all the code points in the Unicode character block "Number Forms". This block includes vulgar fractions as well as Roman numeral characters.
I have found a solution for the vulgar fractions which involves normalizing them but is there a library which will just return me the value given any "Number Forms" Unicode code point? I am looking for something more elegant than a simple code-point-to-float map. Consider the following Java code:
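import java.lang.Character.UnicodeBlock;

public static float getValue(char theCharacter) {
    // Only Number Forms characters are handled; anything else gets a sentinel value.
    // The comparison is written this way round because UnicodeBlock.of may return null.
    if (UnicodeBlock.NUMBER_FORMS.equals(UnicodeBlock.of(theCharacter))) {
        return getValueOfNumericFormCharacter(theCharacter);
    }
    return Float.MIN_VALUE;
}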
I am looking for something elegant which implements getValueOfNumericFormCharacter(theCharacter). Is there something out there?
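Not a library, but one possible starting point with only the JDK: Character.getNumericValue already knows the Roman numerals, and it returns -2 for characters whose value is not a non-negative integer (such as fractions), at which point the normalization trick mentioned above can take over. The class and method names below are hypothetical, and this is just a sketch under those assumptions, not a complete treatment of the block.

```java
import java.text.Normalizer;

final class NumberFormValue {
    // Returns the numeric value of a Number Forms code point,
    // or NaN when the character is outside the block or no value can be derived.
    static double valueOf(int codePoint) {
        if (!Character.UnicodeBlock.NUMBER_FORMS.equals(Character.UnicodeBlock.of(codePoint))) {
            return Double.NaN;
        }
        int direct = Character.getNumericValue(codePoint);
        if (direct >= 0) {
            return direct; // integers, e.g. the Roman numerals
        }
        if (direct == -2) {
            // -2 signals a value that is not a non-negative int, e.g. a fraction.
            // NFKD turns a vulgar fraction into numerator, U+2044 FRACTION SLASH, denominator.
            String decomposed = Normalizer.normalize(
                    new String(Character.toChars(codePoint)), Normalizer.Form.NFKD);
            String[] parts = decomposed.split("\u2044");
            if (parts.length == 2) {
                try {
                    return Double.parseDouble(parts[0]) / Double.parseDouble(parts[1]);
                } catch (NumberFormatException e) {
                    return Double.NaN;
                }
            }
        }
        return Double.NaN;
    }

    public static void main(String[] args) {
        System.out.println(valueOf(0x216C)); // ROMAN NUMERAL FIFTY
        System.out.println(valueOf(0x2153)); // VULGAR FRACTION ONE THIRD
    }
}
```

The block check matters because getNumericValue also assigns values to plain Latin letters, which is probably not what you want here.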
How can I get the length (not number of bytes) of a string in its UTF-8 encoded form (PHP's mb_strlen(.., 'UTF-8') equivalent)?
I tried string.characters.count but it does not return the correct length for certain characters like an emoji.
Example:
let s = "✌🏿️"
print(s.characters.count) // prints 2, but should print 3.
You can access the UTF-8 encoding of a string with the .utf8 property. Use count on that to get the number of UTF-8 code units in the string:
let string = "\u{1f603}" // One of the smiley face emojis...
print(string.utf8.count) // prints "4"
Based on your edited question, what you are probably looking for is the number of UnicodeScalars used to encode the string. You access that with the unicodeScalars property:
let s = "✌🏿️"
print(s.unicodeScalars.count) // prints 3
The reason everyone is confused is because your original question asks for the length of the string in its UTF-8 encoded form. The answer that you actually wanted had nothing to do with the length of the string in its UTF-8 encoded form.
I think you are confused about the difference between Unicode "extended grapheme clusters", Unicode code points, and the various encodings (like UTF-8) that can be used to encode a Unicode code point.
A Character in Swift represents what Unicode calls an "extended grapheme cluster". That is to say, it is a single visual character, even if it is made up of multiple Unicode code points.
A Unicode code point is a single linguistic symbol that is given a numeric value (at most 21 bits; Swift stores it as a UInt32). Two or more Unicode code points can combine to create a single Character. In Swift, the Unicode code point is represented by the UnicodeScalar type.
When it comes time to store a string, or send it over the internet, or otherwise turn it into data that is represented by bytes, you have to decide how to encode it. There are all kinds of encodings, the most common is probably UTF-8, which encodes the string as a series of UInt8 values.
That's just a brief snippet of the difference between the three concepts. It is actually a really interesting subject and if you Google some of those terms, you will find a lot more good information.
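The same distinction shows up outside Swift. As a rough cross-check, here is the string from the question measured in Java, where length() counts UTF-16 code units rather than characters. The class and method names are made up for this sketch; the calls themselves are standard JDK.

```java
import java.nio.charset.StandardCharsets;

final class StringLengths {
    // Number of bytes in the UTF-8 encoded form.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of UTF-16 code units (what String.length() reports).
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of Unicode code points.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+270C, U+1F3FF (as a surrogate pair), U+FE0F
        String s = "\u270C\uD83C\uDFFF\uFE0F";
        System.out.println(utf8Bytes(s));  // 10
        System.out.println(utf16Units(s)); // 4
        System.out.println(codePoints(s)); // 3
    }
}
```

Three different "lengths" for one string, none of which is the number of visible characters.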
let str = "ačŘ"
print("str has \(str.characters.count) characters") // 3
print("and \(str.utf8.count) bytes as encoded in UTF-8") // 5
update (based on your notes)
let s = "✌🏿️"
let arr:[UInt8] = [226, 156, 140, 240, 159, 143, 191, 239, 184, 143]
var arrCchar = arr.map { (uint8) -> Int8 in
Int8(bitPattern: uint8)
}
arrCchar += [0] // to be null terminated
let str = String.fromCString(&arrCchar)
print(str) // Optional("✌🏿️")
s == str // TRUE !!!!
by characters
s.characters.forEach { (c) -> () in
let str = String(c)
print(str.utf8.map{$0}, "which represents character: ", c)
str.unicodeScalars.forEach({ (u) -> () in
print("composed from unicode scalar(s): ", u.debugDescription)
})
}
/*
[226, 156, 140] which represents character: ✌
composed from unicode scalar(s): "\u{270C}"
[240, 159, 143, 191, 239, 184, 143] which represents character: 🏿️
composed from unicode scalar(s): "\u{0001F3FF}"
composed from unicode scalar(s): "\u{FE0F}"
*/
Every character in Unicode can be represented by one or more Unicode scalars. A Unicode scalar is a unique 21-bit number (and name) for a character or modifier, such as U+0061 for LATIN SMALL LETTER A ("a"), or U+1F425 for FRONT-FACING BABY CHICK ("🐥").
When a Unicode string is written to a text file or some other storage, these unicode scalars are encoded in one of several Unicode-defined formats. Each format encodes the string in small chunks known as code units. These include the UTF-8 format (which encodes a string as 8-bit code units) and the UTF-16 format (which encodes a string as 16-bit code units).
(Copied from Apple's Swift programming guide.)
How can I extract the Unicode code point(s) of a given Character without first converting it to a String? I know that I can use the following:
let ch: Character = "A"
let s = String(ch).unicodeScalars
s[s.startIndex].value // returns 65
but it seems like there should be a more direct way to accomplish this using just Swift's standard library. The Language Guide sections "Working with Characters" and "Unicode" only discuss iterating through the characters in a String, not working directly with Characters.
From what I can gather in the documentation, they want you to get Character values from a String because it gives context. Is this Character encoded with UTF8, UTF16, or 21-bit code points (scalars)?
If you look at how a Character is defined in the Swift framework, it is actually an enum value. This is probably done due to the various representations from String.utf8, String.utf16, and String.unicodeScalars.
It seems they do not expect you to work with Character values but rather Strings and you as the programmer decide how to get these from the String itself, allowing encoding to be preserved.
That said, if you need to get the code points in a concise manner, I would recommend an extension like this:
extension Character
{
func unicodeScalarCodePoint() -> UInt32
{
let characterString = String(self)
let scalars = characterString.unicodeScalars
return scalars[scalars.startIndex].value
}
}
Then you can use it like so:
let char : Character = "A"
char.unicodeScalarCodePoint()
In summary, string and character encoding is a tricky thing when you factor in all the possibilities. In order to allow each possibility to be represented, they went with this scheme.
Also remember this is a 1.0 release, I'm sure they will expand Swift's syntactical sugar soon.
I think there are some misunderstandings about Unicode. Unicode itself is NOT an encoding: it does not transform characters (or grapheme clusters, the "characters" as a human reader perceives them) into any sort of binary sequence. Unicode is just a big table that collects the characters used by all the languages on Earth (unofficially, it also includes Klingon). Those characters are organized and indexed by code points (a 21-bit number in Swift, written like U+D55C). You can find the character you are looking for in the big Unicode table by using its code point.
Meanwhile, UTF-8, UTF-16, and UTF-32 are the actual encodings. Yes, there is more than one way to encode Unicode characters into binary sequences. Which one to use depends on the project you are working on, but most web pages are encoded in UTF-8 (you can check this one right now).
Concept 1: The Unicode code point is called the Unicode Scalar in Swift
A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do not include the Unicode surrogate pair code points, which are the code points in the range U+D800 to U+DFFF inclusive.
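To make the two ranges concrete, the definition can be written as a small predicate. This is an illustrative sketch in Java rather than Swift, and isUnicodeScalar is a made-up helper name, not a standard library method.

```java
final class ScalarCheck {
    // True for code points in U+0000...U+D7FF or U+E000...U+10FFFF,
    // i.e. every valid code point that is not a surrogate.
    static boolean isUnicodeScalar(int codePoint) {
        boolean isSurrogate = codePoint >= Character.MIN_SURROGATE
                && codePoint <= Character.MAX_SURROGATE;
        return Character.isValidCodePoint(codePoint) && !isSurrogate;
    }

    public static void main(String[] args) {
        System.out.println(isUnicodeScalar(0xD7FF));   // last scalar before the surrogates
        System.out.println(isUnicodeScalar(0xD800));   // first surrogate
        System.out.println(isUnicodeScalar(0x10FFFF)); // highest code point
    }
}
```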
Concept 2: The Code Unit is the abstract representation of the encoding.
Consider the following code snippet
let theCat = "Cat!🐱"
for char in theCat.utf8 {
    print("\(char) ", terminator: "") // each UTF-8 code unit, in decimal
}
print("")
for char in theCat.utf8 {
    print("\(String(char, radix: 2)) ", terminator: "") // the same UTF-8 code units, in binary
}
print("")
for char in theCat.utf16 {
    print("\(char) ", terminator: "") // each UTF-16 code unit, in decimal
}
print("")
for char in theCat.utf16 {
    print("\(String(char, radix: 2)) ", terminator: "") // the same UTF-16 code units, in binary
}
print("")
for char in theCat.unicodeScalars {
    print("\(char.value) ", terminator: "") // each Unicode scalar value (the UTF-32 code unit), in decimal
}
print("")
for char in theCat.unicodeScalars {
    print("\(String(char.value, radix: 2)) ", terminator: "") // the same scalar values, in binary
}
Abstract representation means: a code unit is written here as a base-10 (decimal) number, which is equal to the base-2 (binary) sequence of the encoding. The binary form is what the machine stores; the decimal code unit is for humans, since it is easier to read than a binary sequence.
Concept 3: The same character may be represented by different Unicode code point(s). It depends on which code points the grapheme cluster is constructed from (this is why I put "characters" in quotes at the beginning).
consider the following code snippet
let precomposed: String = "\u{D55C}"
let decomposed: String = "\u{1112}\u{1161}\u{11AB}"
print(precomposed.characters.count) // print "1"
print(decomposed.characters.count) // print "1" => Character != code point
print(precomposed) //print "한"
print(decomposed) //print "한"
The strings precomposed and decomposed are visually and linguistically equal, but they consist of different Unicode code points, and therefore of different code units even when encoded with the same encoding (see the following example)
for preCha in precomposed.utf16 {
print("\(preCha) ", terminator: "") //print 54620
}
print("")
for deCha in decomposed.utf16 {
print("\(deCha) ", terminator: "") //print 4370 4449 4523
}
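The two spellings are related by Unicode normalization, and that can be checked mechanically. Below is a small sketch in Java using the standard java.text.Normalizer; the class and method names are made up for illustration.

```java
import java.text.Normalizer;

final class HangulForms {
    // Canonical decomposition (NFD): splits a precomposed syllable into its jamo.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Canonical composition (NFC): recombines jamo into a precomposed syllable.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String precomposed = "\uD55C";
        String decomposed = "\u1112\u1161\u11AB";
        // A plain comparison looks at code units, so the two spellings differ.
        System.out.println(precomposed.equals(decomposed));
        // After normalizing both to the same form they compare equal.
        System.out.println(decompose(precomposed).equals(decomposed));
        System.out.println(compose(decomposed).equals(precomposed));
    }
}
```

Normalizing both sides to the same form before comparing is the usual way to treat such strings as equal in languages whose string comparison works on code units.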
Extra example
var word = "cafe"
print("the number of characters in \(word) is \(word.characters.count)")
word += "\u{301}"
print("the number of characters in \(word) is \(word.characters.count)")
Summary: code points, a.k.a. the position indexes of the characters in the Unicode table, are independent of the UTF-8, UTF-16 and UTF-32 encoding schemes.
Further Readings:
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/
https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html
I think the issue is that Character doesn't represent a Unicode code point. It represents a "Unicode grapheme cluster", which can consist of multiple code points.
Instead, UnicodeScalar represents a Unicode code point.
I agree with you, there should be a way to get the code directly from character. But all I can offer is a shorthand:
let ch: Character = "A"
for code in String(ch).utf8 { println(code) }
#1. Using Unicode.Scalar's value property
With Swift 5, Unicode.Scalar has a value property that has the following declaration:
A numeric representation of the Unicode scalar.
var value: UInt32 { get }
The following Playground sample code shows how to iterate over the unicodeScalars property of a Character and print the value of each Unicode scalar that composes it:
let character: Character = "A"
for scalar in character.unicodeScalars {
print(scalar.value)
}
/*
prints: 65
*/
As an alternative, you can use the sample code below if you only want to print the value of the first unicode scalar of a Character:
let character: Character = "A"
let scalars = character.unicodeScalars
let firstScalar = scalars[scalars.startIndex]
print(firstScalar.value)
/*
prints: 65
*/
#2. Using Character's asciiValue property
If what you really want is to get the ASCII encoding value of a character, you can use Character's asciiValue. asciiValue has the following declaration:
Returns the ASCII encoding value of this Character, if ASCII.
var asciiValue: UInt8? { get }
The Playground sample code below shows how to use asciiValue:
let character: Character = "A"
print(String(describing: character.asciiValue))
/*
prints: Optional(65)
*/
let character: Character = "П"
print(String(describing: character.asciiValue))
/*
prints: nil
*/
Have you tried:
import Foundation
let characterString: String = "abc"
var numbers: [Int] = Array<Int>()
for character in characterString.utf8 {
let stringSegment: String = "\(character)"
let anInt: Int = stringSegment.toInt()!
numbers.append(anInt)
}
numbers
Output:
[97, 98, 99]
It may also be only one Character in the String.
I'd like to retrieve the ANSI code value of a given character.
E.g. when I now get the int value of the trademark character, I get 8482.
Instead I would like to get 153, which is the value of the trademark character in codepage 1252.
Some help would be appreciated.
Jurgen
Found it myself:
Encoding ansiEncoding = Encoding.GetEncoding(1252);
byte[] bytes = ansiEncoding.GetBytes(c);
int code = bytes[0];