Include a UTF8 character literal in a [UInt8] array or Data - swift

I would like something similar to:
let a = ["v".utf8[0], 1, 2]
The closest I have figured out is:
let a = [0x76, 1, 2]
and
"v".data(using: String.Encoding.utf8)! + [1, 2]
Note: Either [UInt8] or Data is an acceptable type.

String's UTF8View is not indexed by Int, but by its own String.UTF8View.Index type. Therefore, to include the first byte of a given string's UTF-8 sequence in your array literal, you can use its first property instead:
let a = ["v".utf8.first!, 1, 2] // [118, 1, 2]
If there's more than one byte in the sequence, you can concatenate the UTF-8 bytes with an array literal simply by using the + operator:
let a = "😀".utf8 + [1, 2] // [240, 159, 152, 128, 1, 2]
Also note that your example to concatenate a [UInt8] to a Data could be shortened slightly to:
let a = "v".data(using: .utf8)! + [1, 2] // Data with bytes [0x76, 0x1, 0x2]

There is a specific UInt8 initializer (introduced in Swift 2.2+):
let a = [UInt8(ascii: "v"), 1, 2]

(Some addenda to the already posted answers, regarding UnicodeScalar values in particular.)
In your question you've used a literal "v" as the base instance to be converted to UInt8; we don't really know if this is a String or e.g. a UnicodeScalar in your actual use case. The accepted answer shows some neat approaches in case you are working with a String instance.
In case you happen to be working with a UnicodeScalar instance (rather than a String), one answer has already mentioned the init(ascii:) initializer of UInt8. You should take care, however, to verify that the UnicodeScalar instance used in this initializer indeed fits within the ASCII character encoding; the majority of UnicodeScalar values will not (which will lead to a runtime exception for this initializer). You can use e.g. the isASCII property of UnicodeScalar to verify this prior to using the initializer.
let ucScalar: UnicodeScalar = "z"
var a = [UInt8]()
if ucScalar.isASCII {
    a = [UInt8(ascii: ucScalar), 1, 2]
} else {
    // ... unexpected, but not a runtime error
}
Another approach, in case you'd like to encode the full UnicodeScalar into UInt8 format (even for UnicodeScalar values that cannot be single-byte ASCII encoded), is using the encode(_:into:) method of UTF8:
let ucScalar: UnicodeScalar = "z"
var bytes: [UTF8.CodeUnit] = []
UTF8.encode(ucScalar, into: { bytes.append($0) })
bytes += [1, 2]
print(bytes) // [122, 1, 2]
// ...
let ucScalar: UnicodeScalar = "\u{03A3}" // Σ
var bytes: [UTF8.CodeUnit] = []
UTF8.encode(ucScalar, into: { bytes.append($0) })
bytes += [1, 2]
print(bytes) // [206, 163, 1, 2]
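Equivalently (a small variation of mine, not part of the original answer), you can let String do the UTF-8 encoding and take the bytes from its utf8 view:
let sigma: UnicodeScalar = "\u{03A3}" // Σ
let sigmaBytes = Array(String(Character(sigma)).utf8) + [1, 2]
print(sigmaBytes) // [206, 163, 1, 2]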

Related

If a sequence of code points forms a Unicode character, does every non-empty prefix of that sequence also form a valid character?

The problem I have is that given a sequence of bytes, I want to determine its longest prefix which forms a valid Unicode character (extended grapheme cluster) assuming UTF8 encoding.
I am using Swift, so I would like to use Swift's built-in functions to do so. But these functions only decode a complete sequence of bytes. So I was thinking of converting prefixes of the byte sequence via Swift and taking the last prefix that didn't fail and consists of exactly one character. Obviously, this might lead to trying out the entire sequence of bytes, which I want to avoid. A solution would be to stop trying prefixes after 4 prefixes in a row have failed. If the property asked about in my question holds, this would guarantee that all longer prefixes must also fail.
I find the Unicode Text Segmentation Standard unreadable, otherwise I would try to directly implement boundary detection of extended grapheme clusters...
After taking a long hard look at the specification for computing the boundaries of extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules, it is obvious that the rules for EGCs all have the shape of describing when it is allowed to append a code point to an existing EGC to form a longer EGC. From that fact alone the answers to my two questions follow: 1) Yes, every non-empty prefix of a sequence of code points that forms an EGC is also an EGC. 2) No, adding a code point to a valid Unicode string will not decrease its length in terms of the number of EGCs it consists of.
So, given this, the following Swift code will extract the longest Unicode character from the start of a byte sequence (or return nil if there is no valid Unicode character there):
import Foundation

func lex<S : Sequence>(_ input : S) -> (length : Int, out : Character)? where S.Element == UInt8 {
    // This code works under three assumptions, all of which are true:
    // 1) If a sequence of codepoints does not form a valid character, then appending codepoints to it does not yield a valid character
    // 2) Appending codepoints to a sequence of codepoints does not decrease its length in terms of extended grapheme clusters
    // 3) A codepoint takes up at most 4 bytes in a UTF-8 encoding
    var chars : [UInt8] = []
    var result : String = ""
    var resultLength = 0
    func value() -> (length : Int, out : Character)? {
        guard let character = result.first else { return nil }
        return (length: resultLength, out: character)
    }
    var length = 0
    var iterator = input.makeIterator()
    while length - resultLength <= 4 {
        guard let char = iterator.next() else { return value() }
        chars.append(char)
        length += 1
        guard let s = String(bytes: chars, encoding: .utf8) else { continue }
        guard s.count == 1 else { return value() }
        result = s
        resultLength = length
    }
    return value()
}
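A quick usage sketch (my own example, not from the original answer): feeding lex the UTF-8 bytes of a string that starts with a flag emoji yields the flag as the longest leading character, together with the number of bytes it occupies:
let bytes = Array("🇩🇪!".utf8) // [240, 159, 135, 169, 240, 159, 135, 170, 33]
if let result = lex(bytes) {
    print(result.length, result.out) // 8 🇩🇪
}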

Swift: how to convert readLine() input " [-5,20,8...] " to an Int array

I already ran a search today and found a similar issue here, but it doesn't fully fix my issue. In my case, I want to convert the readLine() input string "[3,-1,6,20,-5,15]" to an Int array [3,-1,6,20,-5,15].
I'm doing an online coding quest from a website, which requires reading the test case from readLine().
For example, if I input [1,3,-5,7,22,6,85,2] in the console, then I need to convert it to an Int array. After that I can deal with the algorithm part to solve the quest. Well, I think it is not wise to limit the input to readLine(), but I simply can't do anything about that :(
My code is below. It can deal with a positive array containing only numbers smaller than 10, but for the array [1, -3, 22, -6, 5,6,7,8,9] it gives nums as [1, 3, 2, 2, 6, 5, 6, 7, 8, 9]. So how can I correctly convert the readLine() input?
print("please give the test array with S length")
if let numsInput = readLine() {
    let nums = numsInput.compactMap { Int(String($0)) }
    print("nums: \(nums)")
}
Here is a one-liner to convert the input into an array of integers. Of course, you might want to split this up into separate steps if some validation is needed:
let numbers = input
    .trimmingCharacters(in: .whitespacesAndNewlines)
    .dropFirst()
    .dropLast()
    .split(separator: ",")
    .compactMap { Int($0) }
The dropFirst/dropLast calls can be replaced with a regular-expression replacement:
.replacingOccurrences(of: "[\\[\\]]", with: "", options: .regularExpression)
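For completeness, here is how the full regex-based variant could look (my own sketch; input stands for the readLine() result, and the extra per-element trimming is only needed if the input contains spaces after the commas):
import Foundation

let numbers = input
    .trimmingCharacters(in: .whitespacesAndNewlines)
    .replacingOccurrences(of: "[\\[\\]]", with: "", options: .regularExpression)
    .split(separator: ",")
    .compactMap { Int($0.trimmingCharacters(in: .whitespaces)) }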
Use the split method to get a sequence of substrings from the input string; note that the enclosing brackets still have to be stripped (for example with dropFirst()/dropLast()), otherwise Int fails to parse the first and last elements:
let nums = numsInput.dropFirst().dropLast().split(separator: ",").compactMap { Int($0) }

How to get a character from its ASCII value?

I want to get letters from their ASCII code in Swift 3. I would do it like this in Java:
for (int i = 65; i < 75; i++)
{
    System.out.print((char) i);
}
Which would log letters from A to J.
Now I tried this in Swift :
let s = String(describing: UnicodeScalar(i))
Instead of only getting the letter, I get this :
Optional("A")
Optional("B")
Optional("C")
...
What am I doing wrong? Thanks for your help.
UnicodeScalar has a few failable initialisers for integer types that can represent values that aren't valid Unicode code points. Therefore you'll need to unwrap the UnicodeScalar? returned, as the initialiser will return nil in the case that you pass an invalid code point.
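If you do keep a wider integer type, a minimal sketch of that unwrapping route (my own example, using the failable UInt32 initialiser) could look like this:
for i in 65..<75 {
    // UnicodeScalar.init?(_: UInt32) is failable, so unwrap before printing
    if let scalar = UnicodeScalar(UInt32(i)) {
        print(Character(scalar), terminator: "")
    }
}
// prints: ABCDEFGHIJ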
However, given that you're dealing exclusively with ASCII characters, you can simply annotate i as a UInt8 and take advantage of the fact that UnicodeScalar has a non-failable initialiser for a UInt8 input (as it will always represent a valid code point):
for i: UInt8 in 65..<75 {
    print(UnicodeScalar(i))
}

UTF8 to Base 2 Representation Swift

I was wondering what is the best way to convert a UTF-8 array or String to its base-2 representation (each UTF-8 value of each character to its base-2 representation). Since you could have two values representing the code for the same character, I suppose extracting values from the array and then converting them is not a valid method. So which one is? Thank you!
Here is a possible approach:
Enumerate the unicode scalars of the string.
Convert each unicode scalar back to a string, and enumerate its UTF-8 encoding.
Convert each UTF-8 byte to a "binary string".
The last task can be done with the following generic method which works for all unsigned integer types:
extension FixedWidthInteger where Self: UnsignedInteger {
    func toBinaryString() -> String {
        let s = String(self, radix: 2)
        return String(repeating: "0", count: Self.bitWidth - s.count) + s
    }
}
// Example:
// UInt8(100).toBinaryString() = "01100100"
// UInt16.max.toBinaryString() = "1111111111111111"
Then the conversion to a UTF-8 binary representation can be implemented like this:
func binaryUTF8Strings(_ string: String) -> [String] {
    return string.unicodeScalars.map {
        String($0).utf8.map { $0.toBinaryString() }.joined(separator: " ")
    }
}
Example usage:
for u in binaryUTF8Strings("H€llö 🇩🇪") {
    print(u)
}
Output:
01001000
11100010 10000010 10101100
01101100
01101100
11000011 10110110
00100000
11110000 10011111 10000111 10101001
11110000 10011111 10000111 10101010
Note that "🇩🇪" is a single character (an "extended grapheme cluster") but two unicode scalars.
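As a small illustration of that note (my own addition, not from the original answer), comparing the character count with the unicode scalar count of the example string:
let s = "H€llö 🇩🇪"
print(s.count)                // 7 characters
print(s.unicodeScalars.count) // 8 unicode scalars (the flag is two scalars)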

How is the 🇩🇪 character represented in Swift strings?

Like some other emoji characters, the 0x0001F1E9 0x0001F1EA combination (German flag) is represented as a single character on screen although it is really two different Unicode code points combined. Is it represented as one or two different characters in Swift?
let flag = "\u{1f1e9}\u{1f1ea}"
then flag is 🇩🇪.
For more regional indicator symbols, see:
http://en.wikipedia.org/wiki/Regional_Indicator_Symbol
Support for "extended grapheme clusters" has been added to Swift in the meantime. Iterating over the characters of a string produces a single character for the "flags":
let string = "Hi๐Ÿ‡ฉ๐Ÿ‡ช!"
for char in string.characters {
print(char)
}
Output:
H
i
🇩🇪
!
Swift 3 implements Unicode in its String struct. In Unicode, all flags are pairs of Regional Indicator Symbols. So, 🇩🇪 is actually 🇩 followed by 🇪 (try copying the two and pasting them next to each other!).
When two or more Regional Indicator Symbols are placed next to each other, they form an "Extended Grapheme Cluster", which means they're treated as one character. This is why "🇪🇺 = 🇫🇷🇪🇸🇩🇪...".characters gives you ["🇪🇺", " ", "=", " ", "🇫🇷🇪🇸🇩🇪", ".", ".", "."].
If you want to see every single Unicode code point (AKA "scalar"), you can use .unicodeScalars, so that "Hi🇩🇪!".unicodeScalars gives you ["H", "i", "🇩", "🇪", "!"].
tl;dr
🇩🇪 is one character (in both Swift and Unicode), which is made up of two code points (AKA scalars). Don't forget these are different! 🙂
See Also
Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings?
The Swift Programming Language (Swift 3.1) - Strings and Characters - Unicode
With Swift 5, you can iterate over the unicodeScalars property of a flag emoji character in order to print the Unicode scalar values that compose it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(scalar)
}
/*
prints:
🇮
🇹
*/
If you combine those scalars (that are Regional Indicator Symbols), you get a flag emoji:
let italianFlag = "🇮" + "🇹"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Each Unicode.Scalar instance also has a property value that you can use in order to display a numeric representation of it:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(scalar.value)
}
/*
prints:
127470
127481
*/
You can create Unicode scalars from those numeric representations and then combine them into a string:
let scalar1 = Unicode.Scalar(127470)
let scalar2 = Unicode.Scalar(127481)
let italianFlag = String(scalar1!) + String(scalar2!)
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
If needed, you can use Unicode.Scalar's escaped(asASCII:) method in order to display a string representation of the Unicode scalars (using ASCII characters):
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(scalar.escaped(asASCII: true))
}
/*
prints:
\u{0001F1EE}
\u{0001F1F9}
*/
let italianFlag = "\u{0001F1EE}\u{0001F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
String's init(_:radix:uppercase:) may also be relevant for converting the scalar value to a hexadecimal value:
let emoji: Character = "🇮🇹"
for scalar in emoji.unicodeScalars {
    print(String(scalar.value, radix: 16, uppercase: true))
}
/*
prints:
1F1EE
1F1F9
*/
let italianFlag = "\u{1F1EE}\u{1F1F9}"
print(italianFlag) // prints: 🇮🇹
print(italianFlag.count) // prints: 1
Swift doesn't tell you what the internal representation of a String is. You interact with a String as a collection of Character values (extended grapheme clusters):
for character in "Dog!🐶" {
    print(character)
}
// prints D, o, g, !, 🐶
If you want to work with a string as a sequence of UTF-8 or UTF-16 code points, use its utf8 or utf16 properties. See Strings and Characters in the docs.
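For example (a small addition of mine, not from the original answer), the utf8 and utf16 views of the same string expose the individual code units:
let string = "Dog!🐶"
print(Array(string.utf8))  // [68, 111, 103, 33, 240, 159, 144, 182]
print(Array(string.utf16)) // [68, 111, 103, 33, 55357, 56374]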