I was wondering what the best way is to convert a UTF-8 array or String to its base-2 representation (each UTF-8 value of each character converted to its base-2 representation). Since two values can represent the code for the same character, I suppose extracting values from the array and then converting them is not a valid method. So which one is? Thank you!
Here is a possible approach:
Enumerate the Unicode scalars of the string.
Convert each Unicode scalar back to a string, and enumerate its
UTF-8 encoding.
Convert each UTF-8 byte to a "binary string".
The last task can be done with the following generic method which
works for all unsigned integer types:
extension FixedWidthInteger where Self: UnsignedInteger {
    func toBinaryString() -> String {
        let s = String(self, radix: 2)
        // Pad with leading zeros up to the full bit width of the type.
        return String(repeating: "0", count: Self.bitWidth - s.count) + s
    }
}
// Example:
// UInt8(100).toBinaryString() = "01100100"
// UInt16.max.toBinaryString() = "1111111111111111"
Then the conversion to a UTF-8 binary representation can be
implemented like this:
func binaryUTF8Strings(_ string: String) -> [String] {
    return string.unicodeScalars.map {
        String($0).utf8.map { $0.toBinaryString() }.joined(separator: " ")
    }
}
Example usage:
for u in binaryUTF8Strings("H€llö 🇩🇪") {
print(u)
}
Output:
01001000
11100010 10000010 10101100
01101100
01101100
11000011 10110110
00100000
11110000 10011111 10000111 10101001
11110000 10011111 10000111 10101010
Note that "🇩🇪" is a single character (an "extended grapheme cluster")
but two Unicode scalars.
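In current Swift (5 and later) the three levels mentioned above can be read directly from the string's views. The following is only a minimal sketch, not the code from the answer: `paddedBinary` is a hypothetical helper name, and the flag is written with `\u{...}` escapes so the example does not depend on the file's encoding.

```swift
// Hypothetical helper: one UTF-8 byte as a zero-padded 8-digit binary string.
func paddedBinary(_ byte: UInt8) -> String {
    let digits = String(byte, radix: 2)
    return String(repeating: "0", count: 8 - digits.count) + digits
}

// German flag: two regional indicator scalars, U+1F1E9 and U+1F1EA.
let flag = "\u{1F1E9}\u{1F1EA}"

let characterCount = flag.count              // extended grapheme clusters
let scalarCount = flag.unicodeScalars.count  // Unicode scalars
let byteCount = flag.utf8.count              // UTF-8 code units

// One binary string per scalar, bytes separated by spaces.
let binaryPerScalar = flag.unicodeScalars.map { scalar in
    String(scalar).utf8.map(paddedBinary).joined(separator: " ")
}

print(characterCount, scalarCount, byteCount)  // 1 2 8
binaryPerScalar.forEach { print($0) }
```

The two printed binary lines should match the last two lines of the output shown above.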
The problem I have is that given a sequence of bytes, I want to determine its longest prefix which forms a valid Unicode character (extended grapheme cluster) assuming UTF8 encoding.
I am using Swift, so I would like to use Swift's built-in function(s) to do so. But these functions only decode a complete sequence of bytes. So I was thinking to convert prefixes of the byte sequence via Swift and take the last prefix that didn't fail and consists of 1 character only. Obviously, this might lead to trying out the entire sequence of bytes, which I want to avoid. A solution would be to stop trying out prefixes after 4 prefixes in a row failed. If the property asked in my question holds, this would then guarantee that all longer prefixes must also fail.
I find the Unicode Text Segmentation Standard unreadable, otherwise I would try to directly implement boundary detection of extended grapheme clusters...
After taking a long hard look at the specification for computing the boundaries of extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules,
it is clear that the rules for EGCs all describe when a code point may be appended to an existing EGC to form a longer EGC. From that fact alone, the answers to my two questions follow: 1) Yes, every non-empty prefix of the code points which form an EGC is also an EGC. 2) No, by adding a code point to a valid Unicode string you will not decrease its length in terms of the number of EGCs it consists of.
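As a small illustration of these two answers (one instance, not a proof; the proof is the shape of the rules themselves), the sketch below counts the characters in every scalar prefix of a cluster made of a base letter and two combining marks. The sample string and variable names are chosen for this example only.

```swift
// "e" followed by two combining marks: three scalars, one grapheme cluster.
let cluster = "e\u{301}\u{323}"
let scalars = Array(cluster.unicodeScalars)

// Character count of every non-empty prefix, measured in scalars.
let prefixCounts = (1...scalars.count).map { length -> Int in
    var view = String.UnicodeScalarView()
    view.append(contentsOf: scalars[0..<length])
    return String(view).count
}
print(prefixCounts)  // [1, 1, 1]

// Appending another scalar never lowers the character count.
let extended = cluster + "x"
print(cluster.count, extended.count)  // 1 2
```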
So, given this, the following Swift code will extract the longest Unicode character from the start of a byte sequence (or return nil if there is no valid Unicode character there):
import Foundation // provides String(bytes:encoding:)
func lex<S : Sequence>(_ input : S) -> (length : Int, out: Character)? where S.Element == UInt8 {
// This code works under three assumptions, all of which are true:
// 1) If a sequence of codepoints does not form a valid character, then appending codepoints to it does not yield a valid character
// 2) Appending codepoints to a sequence of codepoints does not decrease its length in terms of extended grapheme clusters
// 3) a codepoint takes up at most 4 bytes in an UTF8 encoding
var chars : [UInt8] = []
var result : String = ""
var resultLength = 0
func value() -> (length : Int, out : Character)? {
guard let character = result.first else { return nil }
return (length: resultLength, out: character)
}
var length = 0
var iterator = input.makeIterator()
while length - resultLength <= 4 {
guard let char = iterator.next() else { return value() }
chars.append(char)
length += 1
guard let s = String(bytes: chars, encoding: .utf8) else { continue }
guard s.count == 1 else { return value() }
result = s
resultLength = length
}
return value()
}
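The loop above relies on Foundation's String(bytes:encoding:) returning nil for a byte prefix that stops in the middle of a scalar. A minimal sketch of that behaviour, using the four UTF-8 bytes of U+1F431 as sample input (the variable names are only for this example):

```swift
import Foundation

// U+1F431 (cat face) encodes to four UTF-8 bytes: F0 9F 90 B1.
let bytes: [UInt8] = [0xF0, 0x9F, 0x90, 0xB1]

// Try every prefix; only the complete sequence is expected to decode.
let decodedPrefixes = (1...bytes.count).map { length -> String? in
    String(bytes: bytes[0..<length], encoding: .utf8)
}

for (index, decoded) in decodedPrefixes.enumerated() {
    print("prefix of \(index + 1) byte(s):", decoded ?? "nil")
}
```

For the same input, the lex function above should fail on the first three prefixes, accept the fourth, and then return a length of 4 once the input is exhausted.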
Trying to find the shortest / most compact way to write out ASCII characters in Swift into a single string. For example, in JavaScript you can do '\x00' for the decimal equivalent of 0 in ASCII, or you can write '\0', which is 2 characters shorter. So if you have a lot of these characters, that is 2x smaller file size.
Wondering how to write the ASCII characters 0-31 and 127 in Swift so they are minimal, into a single string. In JavaScript, that sort of looks like this:
'\0...\33abcdef...\127¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½...'
In general, you would use \u{x} where x is the hex value. In your case \u{0} through \u{1f} and \u{7f}.
As in C-based languages, Swift strings also support \0 for "null", \t for "tab", \n for "newline", and \r for "carriage return". Unlike C, Swift does not support \b or \f.
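A short sketch of both escape styles in one literal; the choice of characters is just for illustration. Reading the scalar values back confirms which code points the escapes produce.

```swift
// Named escapes where Swift has them, \u{...} for everything else.
let controls = "\0\t\n\r\u{1F}\u{7F}"

let values = controls.unicodeScalars.map { $0.value }
print(values)               // [0, 9, 10, 13, 31, 127]
print(controls.utf8.count)  // 6, one byte per ASCII character
```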
If you want to create a single String with all 128 ASCII characters then you can do:
let ascii = String(Array(0...127).map { Character(Unicode.Scalar($0)) })
If you have a lot of these characters, maybe put them in a Data object and then convert it to a string:
let data = Data(bytes: Array(0...31) + [127])
let text = String(data: data, encoding: .utf8)!
Based on your comment, you could do:
let tab = Data(bytes: [9])
let null = Data(bytes: [0])
let data = "abc".data(using: .utf8)! + tab + null + "morechars".data(using: .utf8)! + tab
I am creating an iPhone app and I need to convert a single digit number into an integer.
My code has a variable called char that has a type Character, but I need to be able to do math with it, therefore I think I need to convert it to a string, however I cannot find a way to do that.
In the latest Swift versions (at least in Swift 5) there is a more straightforward way of converting Character instances. Character has a property wholeNumberValue which tries to convert a character to Int and returns nil if the character does not represent an integer.
let char: Character = "5"
if let intValue = char.wholeNumberValue {
print("Value is \(intValue)")
} else {
print("Not an integer")
}
With a Character you can create a String. And with a String you can create an Int.
let char: Character = "1"
if let number = Int(String(char)) {
// use number
}
The String middleman type conversion isn't necessary if you use the unicodeScalars property of Swift 4.0's Character type.
let myChar: Character = "3"
myChar.unicodeScalars.first!.value - Unicode.Scalar("0")!.value // 3: UInt32
This uses a trick commonly seen in C code of subtracting the value of the char '0' literal to convert from ASCII values to decimal values. See this site for the conversions: https://www.asciitable.com
Also, there are some forced unwraps in my answer. To avoid those, you can validate that you have a decimal digit with CharacterSet.decimalDigits, and/or use guard let around the first property. You can also subtract 48 directly rather than converting '0' through Unicode.Scalar.
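One way to do the subtraction without any forced unwrap is sketched below. It assumes Swift 5 (for asciiValue), and `asciiDigitValue` is a hypothetical helper name, not part of the standard library. Unlike wholeNumberValue, it accepts only the ASCII digits "0" through "9".

```swift
// Hypothetical helper: value of an ASCII digit "0"..."9", nil for anything else.
func asciiDigitValue(_ character: Character) -> Int? {
    guard let ascii = character.asciiValue, (48...57).contains(ascii) else {
        return nil
    }
    return Int(ascii) - 48  // 48 is the ASCII code of "0"
}

let three = asciiDigitValue("3")       // Optional(3)
let letter = asciiDigitValue("a")      // nil

// CIRCLED DIGIT FOUR is numeric in Unicode but not an ASCII digit.
let circledFour: Character = "\u{2463}"
let strict = asciiDigitValue(circledFour)   // nil
let lenient = circledFour.wholeNumberValue  // Optional(4)
```

Which of the two behaviours is wanted depends on whether non-ASCII digits should count as input.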
How can I extract the Unicode code point(s) of a given Character without first converting it to a String? I know that I can use the following:
let ch: Character = "A"
let s = String(ch).unicodeScalars
s[s.startIndex].value // returns 65
but it seems like there should be a more direct way to accomplish this using just Swift's standard library. The Language Guide sections "Working with Characters" and "Unicode" only discuss iterating through the characters in a String, not working directly with Characters.
From what I can gather in the documentation, they want you to get Character values from a String because it gives context. Is this Character encoded with UTF8, UTF16, or 21-bit code points (scalars)?
If you look at how a Character is defined in the Swift framework, it is actually an enum value. This is probably done due to the various representations from String.utf8, String.utf16, and String.unicodeScalars.
It seems they do not expect you to work with Character values but rather Strings and you as the programmer decide how to get these from the String itself, allowing encoding to be preserved.
That said, if you need to get the code points in a concise manner, I would recommend an extension like such:
extension Character
{
func unicodeScalarCodePoint() -> UInt32
{
let characterString = String(self)
let scalars = characterString.unicodeScalars
return scalars[scalars.startIndex].value
}
}
Then you can use it like so:
let char : Character = "A"
char.unicodeScalarCodePoint()
In summary, string and character encoding is a tricky thing when you factor in all the possibilities. In order to allow each possibility to be represented, they went with this scheme.
Also remember this is a 1.0 release, I'm sure they will expand Swift's syntactical sugar soon.
I think there are some misunderstandings about Unicode. Unicode itself is NOT an encoding; it does not transform grapheme clusters (or "Characters" as a human reader sees them) into any sort of binary sequence. Unicode is essentially a big table that collects the characters used by the world's writing systems (unofficially, there is even a mapping for Klingon). The entries are organized and indexed by code points (a 21-bit number in Swift, written like U+D55C). You can look up a character in the Unicode table by its code point.
Meanwhile, UTF-8, UTF-16 and UTF-32 are the actual encodings. Yes, there is more than one way to encode Unicode characters into binary sequences. Which one to use depends on the project you are working on, but most web pages are encoded in UTF-8 (you can check that right now).
Concept 1: A Unicode code point is represented by a Unicode scalar in Swift
A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do not include the Unicode surrogate pair code points, which are the code points in the range U+D800 to U+DFFF inclusive.
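The range rule can be checked with the failable Unicode.Scalar initializer that takes a UInt32; a small sketch, with values picked only to hit the boundaries described above:

```swift
let cat = Unicode.Scalar(UInt32(0x1F431))        // inside the valid range
let surrogate = Unicode.Scalar(UInt32(0xD800))   // surrogate code point: nil
let tooLarge = Unicode.Scalar(UInt32(0x110000))  // beyond U+10FFFF: nil

// The code points on either side of the surrogate block are valid scalars.
let beforeSurrogates = Unicode.Scalar(UInt32(0xD7FF))
let afterSurrogates = Unicode.Scalar(UInt32(0xE000))

print(cat != nil, surrogate == nil, tooLarge == nil)  // true true true
```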
Concept 2: The Code Unit is the abstract representation of the encoding.
Consider the following code snippet
let theCat = "Cat!🐱"
for char in theCat.utf8 {
print("\(char) ", terminator: "") //Code Unit of each grapheme cluster for the UTF-8 encoding
}
print("")
for char in theCat.utf8 {
print("\(String(char, radix: 2)) ", terminator: "") //Encoding of each grapheme cluster for the UTF-8 encoding
}
print("")
for char in theCat.utf16 {
print("\(char) ", terminator: "") //Code Unit of each grapheme cluster for the UTF-16 encoding
}
print("")
for char in theCat.utf16 {
print("\(String(char, radix: 2)) ", terminator: "") //Encoding of each grapheme cluster for the UTF-16 encoding
}
print("")
for char in theCat.unicodeScalars {
print("\(char.value) ", terminator: "") //Code Unit of each grapheme cluster for the UTF-32 encoding
}
print("")
for char in theCat.unicodeScalars {
print("\(String(char.value, radix: 2)) ", terminator: "") //Encoding of each grapheme cluster for the UTF-32 encoding
}
Abstract representation means: a code unit is written as a base-10 (decimal) number that has the same value as the base-2 (binary) sequence of the encoding. The encoding is made for machines; the code unit is meant for humans, because it is easier to read than a binary sequence.
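For reference, here are the code units of the same string collected into arrays, as a compact sketch. The cat emoji is written as the escape \u{1F431} so the example does not depend on the file's encoding.

```swift
let theCat = "Cat!\u{1F431}"

let utf8Units = Array(theCat.utf8)    // 8-bit code units
let utf16Units = Array(theCat.utf16)  // 16-bit code units
let scalarValues = theCat.unicodeScalars.map { $0.value }  // 21-bit scalar values

print(utf8Units)     // [67, 97, 116, 33, 240, 159, 144, 177]
print(utf16Units)    // [67, 97, 116, 33, 55357, 56369]
print(scalarValues)  // [67, 97, 116, 33, 128049]
print(theCat.count)  // 5
```

The emoji takes four code units in UTF-8, two in UTF-16 (a surrogate pair) and one scalar, while the character count stays at 5.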
Concept 3: A character may be made up of different Unicode code points. It depends on which code points the grapheme cluster is constructed from (this is why I put "Characters" in quotes at the beginning).
consider the following code snippet
let precomposed: String = "\u{D55C}"
let decomposed: String = "\u{1112}\u{1161}\u{11AB}"
print(precomposed.count) // prints "1"
print(decomposed.count) // prints "1" => Character != code point
print(precomposed) // prints "한"
print(decomposed) // prints "한"
The strings precomposed and decomposed are visually and linguistically equal, but they have different Unicode code points, and therefore different code units when they are encoded with the same encoding (see the following example)
for preCha in precomposed.utf16 {
print("\(preCha) ", terminator: "") // prints 54620
}
print("")
for deCha in decomposed.utf16 {
print("\(deCha) ", terminator: "") //print 4370 4449 4523
}
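Swift's String comparison treats the two forms as equal (canonical equivalence), even though the underlying scalars differ. A minimal sketch with the same two strings:

```swift
let precomposed = "\u{D55C}"
let decomposed = "\u{1112}\u{1161}\u{11AB}"

// String equality uses canonical equivalence.
let sameText = (precomposed == decomposed)

// The scalar views are compared element by element and differ.
let sameScalars = precomposed.unicodeScalars.elementsEqual(decomposed.unicodeScalars)

print(sameText, sameScalars)       // true false
print(Array(precomposed.utf16))    // [54620]
print(Array(decomposed.utf16))     // [4370, 4449, 4523]
```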
Extra example
var word = "cafe"
print("the number of characters in \(word) is \(word.count)")
word += "\u{301}"
print("the number of characters in \(word) is \(word.count)")
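To see what changes and what does not when the combining accent is appended, this sketch records the character, scalar and UTF-8 counts before and after. The tuple layout is just a convenience for the example.

```swift
var word = "cafe"
let before = (word.count, word.unicodeScalars.count, word.utf8.count)

word += "\u{301}"  // COMBINING ACUTE ACCENT attaches to the final "e"
let after = (word.count, word.unicodeScalars.count, word.utf8.count)

print(before)  // (4, 4, 4)
print(after)   // (4, 5, 6)
```

The character count stays at 4 because the accent joins the existing grapheme cluster; only the scalar and byte counts grow.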
Summary: Code points, a.k.a. the position indexes of characters in Unicode, are independent of the UTF-8, UTF-16 and UTF-32 encoding schemes.
Further Readings:
http://www.joelonsoftware.com/articles/Unicode.html
http://kunststube.net/encoding/
https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html
I think the issue is that Character doesn't represent a Unicode code point. It represents a "Unicode grapheme cluster", which can consist of multiple code points.
Instead, UnicodeScalar represents a Unicode code point.
I agree with you, there should be a way to get the code directly from character. But all I can offer is a shorthand:
let ch: Character = "A"
for code in String(ch).utf8 { print(code) }
#1. Using Unicode.Scalar's value property
With Swift 5, Unicode.Scalar has a value property that has the following declaration:
A numeric representation of the Unicode scalar.
var value: UInt32 { get }
The following Playground sample code shows how to iterate over the unicodeScalars property of a Character and print the value of each Unicode scalar that composes it:
let character: Character = "A"
for scalar in character.unicodeScalars {
print(scalar.value)
}
/*
prints: 65
*/
As an alternative, you can use the sample code below if you only want to print the value of the first unicode scalar of a Character:
let character: Character = "A"
let scalars = character.unicodeScalars
let firstScalar = scalars[scalars.startIndex]
print(firstScalar.value)
/*
prints: 65
*/
#2. Using Character's asciiValue property
If what you really want is to get the ASCII encoding value of a character, you can use Character's asciiValue. asciiValue has the following declaration:
Returns the ASCII encoding value of this Character, if ASCII.
var asciiValue: UInt8? { get }
The Playground sample code below shows how to use asciiValue:
let character: Character = "A"
print(String(describing: character.asciiValue))
/*
prints: Optional(65)
*/
let character: Character = "Л"
print(String(describing: character.asciiValue))
/*
prints: nil
*/
Have you tried:
import Foundation
let characterString: String = "abc"
var numbers: [Int] = Array<Int>()
for character in characterString.utf8 {
let stringSegment: String = "\(character)"
let anInt: Int = Int(stringSegment)!
numbers.append(anInt)
}
numbers
Output:
[97, 98, 99]
It may also be only one Character in the String.
Like some other emoji characters, the 0x0001F1E9 0x0001F1EA combination (German flag) is represented as a single character on screen although it is really two different Unicode character points combined. Is it represented as one or two different characters in Swift?
let flag = "\u{1f1e9}\u{1f1ea}"
then flag is 🇩🇪 .
For more regional indicator symbols, see:
http://en.wikipedia.org/wiki/Regional_Indicator_Symbol
Support for "extended grapheme clusters" has been added to Swift in the meantime.
Iterating over the characters of a string produces a single character for
the "flags":
let string = "Hi🇩🇪!"
for char in string {
print(char)
}
Output:
H
i
🇩🇪
!
Swift 3 implements Unicode in its String struct. In Unicode, all flags are pairs of Regional Indicator Symbols. So, 🇩🇪 is actually 🇩 followed by 🇪 (try copying the two and pasting them next to each other!).
When two or more Regional Indicator Symbols are placed next to each other, they form an "Extended Grapheme Cluster", which means they're treated as one character. This is why "🇪🇺 = 🇫🇷🇪🇸🇩🇪...".characters gives you ["🇪🇺", " ", "=", " ", "🇫🇷🇪🇸🇩🇪", ".", ".", "."] in Swift 3. (Swift 4 and later follow the newer Unicode segmentation rules, which group Regional Indicator Symbols in pairs, so the same run is split into three separate flags.)
If you want to see every single Unicode code point (AKA "scalar"), you can use .unicodeScalars, so that "Hi🇩🇪!".unicodeScalars gives you ["H", "i", "🇩", "🇪", "!"]
tl;dr
🇩🇪 is one character (in both Swift and Unicode), which is made up of two code points (AKA scalars). Don't forget these are different!
See Also
Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?
The Swift Programming Language (Swift 3.1) - Strings and Characters - Unicode
With Swift 5, you can iterate over the unicodeScalars property of a flag emoji character in order to print the Unicode scalar values that compose it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(scalar)
}
/*
prints:
🇮
🇹
*/
If you combine those scalars (that are Regional Indicator Symbols), you get a flag emoji:
let italianFlag = "🇮" + "🇹"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Each Unicode.Scalar instance also has a property value that you can use in order to display a numeric representation of it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(scalar.value)
}
/*
prints:
127470
127481
*/
You can create Unicode scalars from those numeric representations then associate them into a string:
let scalar1 = Unicode.Scalar(127470)
let scalar2 = Unicode.Scalar(127481)
let italianFlag = String(scalar1!) + String(scalar2!)
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
If needed, you can use Unicode.Scalar's escaped(asASCII:) method in order to display a string representation of the Unicode scalars (using ASCII characters):
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(scalar.escaped(asASCII: true))
}
/*
prints:
\u{0001F1EE}
\u{0001F1F9}
*/
let italianFlag = "\u{0001F1EE}\u{0001F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
String's init(_:radix:uppercase:) may also be relevant to convert the scalar value to a hexadecimal value:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
print(String(scalar.value, radix: 16, uppercase: true))
}
/*
prints:
1F1EE
1F1F9
*/
let italianFlag = "\u{1F1EE}\u{1F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Swift doesn't tell you what the internal representation of a String is. You interact with a String as a collection of Characters (extended grapheme clusters):
for character in "Dog!🐶" {
    print(character)
}
// prints D, o, g, !, 🐶
If you want to work with a string as a sequence of UTF-8 or UTF-16 code units, use its utf8 or utf16 properties. See Strings and Characters in the docs.