Strange String.unicodeScalars and CharacterSet behaviour - swift

I'm trying to use a Swift 3 CharacterSet to filter characters out of a String but I'm getting stuck very early on. A CharacterSet has a method called contains
func contains(_ member: UnicodeScalar) -> Bool
Test for membership of a particular UnicodeScalar in the CharacterSet.
But testing this doesn't produce the expected behaviour.
let characterSet = CharacterSet.capitalizedLetters
let capitalAString = "A"
if let capitalA = capitalAString.unicodeScalars.first {
print("Capital A is \(characterSet.contains(capitalA) ? "" : "not ")in the group of capital letters")
} else {
print("Couldn't get the first element of capitalAString's unicode scalars")
}
I'm getting Capital A is not in the group of capital letters yet I'd expect the opposite.
Many thanks.

CharacterSet.capitalizedLetters
returns a character set containing the characters in Unicode General Category Lt aka "Letter, titlecase". That are
"Ligatures containing uppercase followed by lowercase letters (e.g., Dž, Lj, Nj, and Dz)" (compare Wikipedia: Unicode character property or
Unicode® Standard Annex #44 – Table 12. General_Category Values).
You can find a list here: Unicode Characters in the 'Letter, Titlecase' Category.
You can also use the code from
NSArray from NSCharacterset to dump the contents of the character
set:
extension CharacterSet {
func allCharacters() -> [Character] {
var result: [Character] = []
for plane: UInt8 in 0...16 where self.hasMember(inPlane: plane) {
for unicode in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
if let uniChar = UnicodeScalar(unicode), self.contains(uniChar) {
result.append(Character(uniChar))
}
}
}
return result
}
}
let characterSet = CharacterSet.capitalizedLetters
print(characterSet.allCharacters())
// ["Dž", "Lj", "Nj", "Dz", "ᾈ", "ᾉ", "ᾊ", "ᾋ", "ᾌ", "ᾍ", "ᾎ", "ᾏ", "ᾘ", "ᾙ", "ᾚ", "ᾛ", "ᾜ", "ᾝ", "ᾞ", "ᾟ", "ᾨ", "ᾩ", "ᾪ", "ᾫ", "ᾬ", "ᾭ", "ᾮ", "ᾯ", "ᾼ", "ῌ", "ῼ"]
What you probably want is CharacterSet.uppercaseLetters which
Returns a character set containing the characters in Unicode General Category Lu and Lt.

Related

Swift. How to get the previous character?

For example: I have character "b" and I what to get "a", so "a" is the previous character.
let b: Character = "b"
let a: Character = b - 1 // Compilation error
It's actually pretty complicated to get the previous character from Swift's Character type because Character is actually comprised of one or more Unicode.Scalar values. Depending on your needs you could restrict your efforts to just the ASCII characters. Or you could support all characters comprised of a single Unicode scalar. Once you get into characters comprised of multiple Unicode scalars (such as the flag Emojis or various skin toned Emojis) then I'm not even sure what the "previous character" means.
Here is a pair of methods added to a Character extension that can handle ASCII and single-Unicode scalar characters.
extension Character {
var previousASCII: Character? {
if let ascii = asciiValue, ascii > 0 {
return Character(Unicode.Scalar(ascii - 1))
}
return nil
}
var previousScalar: Character? {
if unicodeScalars.count == 1 {
if let scalar = unicodeScalars.first, scalar.value > 0 {
if let prev = Unicode.Scalar(scalar.value - 1) {
return Character(prev)
}
}
}
return nil
}
}
Examples:
let b: Character = "b"
let a = b.previousASCII // Gives Optional("a")
let emoji: Character = "😆"
let previous = emoji.previousScalar // Gives Optional("😅")

Swift - How to check a string not included punctuations and numbers

I want to check a string to be able to understand that string is suitable for using as a display name in the app. Below block looks only for english characters. How can I cover all language letters? Also all punctuations and numbers won't be allowed.
func isSuitableForDisplayName(inputString: String) -> Bool {
let mergedString = inputString.stringByRemovingWhitespaces
let characterset = CharacterSet(charactersIn: "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
if mergedString.rangeOfCharacter(from: characterset.inverted) != nil {
return false
} else {
return true
}
}
You can use CharacterSet.letters, which contains all the characters in the Unicode categories L and M.
Category M includes combining marks. If you don't want those, use:
CharacterSet.letters.subtracting(.nonBaseCharacters)
Also, your way of checking whether a string contains only the characters in a character set is quite weird. I would do something like this:
return mergedString.trimmingCharacters(in: CharacterSet.letters) == ""

How to get all characters of the font with CTFontCopyCharacterSet() in Swift?

How does one get all characters of the font with CTFontCopyCharacterSet() in Swift? ... for macOS?
The issue occured when implementing the approach from an OSX: CGGlyph to UniChar answer in Swift.
func createUnicodeFontMap() {
// Get all characters of the font with CTFontCopyCharacterSet().
let cfCharacterSet: CFCharacterSet = CTFontCopyCharacterSet(ctFont)
//
let cfCharacterSetStr = "\(cfCharacterSet)"
print("CFCharacterSet: \(cfCharacterSet)")
// Map all Unicode characters to corresponding glyphs
var unichars = [UniChar](…NYI…) // NYI: lacking unichars for CFCharacterSet
var glyphs = [CGGlyph](repeating: 0, count: unichars.count)
guard CTFontGetGlyphsForCharacters(
ctFont, // font: CTFont
&unichars, // characters: UnsafePointer<UniChar>
&glyphs, // UnsafeMutablePointer<CGGlyph>
unichars.count // count: CFIndex
)
else {
return
}
// For each Unicode character and its glyph,
// store the mapping glyph -> Unicode in a dictionary.
// ... NYI
}
What to do with CFCharacterSet to retrieve the actual characters has been elusive. Autocompletion of the cfCharacterSet instance offers show no relavant methods.
And the Core Foundation > CFCharacterSet appears have methods for creating another CFCharacterSet, but not something the provides an array|list|string of unichars to be able to create a mapped dictionary.
Note: I'm looking for a solution which is not specific to iOS as in Get all available characters from a font which uses UIFont.
CFCharacterSet is toll-free bridged with the Cocoa Foundation counterpart NSCharacterSet, and can be bridged to the corresponding Swift value type CharacterSet:
let charset = CTFontCopyCharacterSet(ctFont) as CharacterSet
Then the approach from NSArray from NSCharacterSet can be used to enumerate all Unicode scalar values of that character set (including non-BMP points, i.e. Unicode scalar values greater than U+FFFF).
The CTFontGetGlyphsForCharacters() expects non-BMP characters as surrogate pair, i.e. as an array of UTF-16 code units.
Putting it together, the function would look like this:
func createUnicodeFontMap(ctFont: CTFont) -> [CGGlyph : UnicodeScalar] {
let charset = CTFontCopyCharacterSet(ctFont) as CharacterSet
var glyphToUnicode = [CGGlyph : UnicodeScalar]() // Start with empty map.
// Enumerate all Unicode scalar values from the character set:
for plane: UInt8 in 0...16 where charset.hasMember(inPlane: plane) {
for unicode in UTF32Char(plane) << 16 ..< UTF32Char(plane + 1) << 16 {
if let uniChar = UnicodeScalar(unicode), charset.contains(uniChar) {
// Get glyph for this `uniChar` ...
let utf16 = Array(uniChar.utf16)
var glyphs = [CGGlyph](repeating: 0, count: utf16.count)
if CTFontGetGlyphsForCharacters(ctFont, utf16, &glyphs, utf16.count) {
// ... and add it to the map.
glyphToUnicode[glyphs[0]] = uniChar
}
}
}
}
return glyphToUnicode
}
You can do something like this.
let cs = CTFontCopyCharacterSet(font) as NSCharacterSet
let bitmapRepresentation = cs.bitmapRepresentation
The format of the bitmap is defined in the reference page for CFCharacterSetCreateWithBitmapRepresentation

How to get unicode code point(s) representation of character/string in Swift?

As a generic solution, how can we get the unicode code point/s for a character or a string in Swift?
Consider the following:
let A: Character = "A" // "\u{0041}"
let Á: Character = "Á" // "\u{0041}\u{0301}"
let sparklingHeart = "💖" // "\u{1F496}"
let SWIFT = "SWIFT" // "\u{0053}\u{0057}\u{0049}\u{0046}\u{0054}"
If I am not mistaking, the desired function might return an array of strings, for instance:
extension Character {
func getUnicodeCodePoints() -> [String] {
//...
}
}
A.getUnicodeCodePoints()
// the output should be: ["\u{0041}"]
Á.getUnicodeCodePoints()
// the output should be: ["\u{0041}", "\u{0301}"]
sparklingHeart.getUnicodeCodePoints()
// the output should be: ["\u{1F496}"]
SWIFT.getUnicodeCodePoints()
// the output should be: ["\u{0053}", "\u{0057}", "\u{0049}", "\u{0046}", "\u{0054}"]
Any more suggested elegant approach would be appreciated.
Generally, the unicodeScalars property of a String returns a collection
of its unicode scalar values. (A Unicode scalar value is any
Unicode code point except high-surrogate and low-surrogate code points.)
Example:
print(Array("Á".unicodeScalars)) // ["A", "\u{0301}"]
print(Array("💖".unicodeScalars)) // ["\u{0001F496}"]
Up to Swift 3 there is no way to access
the unicode scalar values of a Character directly, it has to be
converted to a String first (for the Swift 4 status, see below).
If you want to see all Unicode scalar values as hexadecimal numbers
then you can access the value property (which is a UInt32 number)
and format it according to your needs.
Example (using the U+NNNN notation for Unicode values):
extension String {
func getUnicodeCodePoints() -> [String] {
return unicodeScalars.map { "U+" + String($0.value, radix: 16, uppercase: true) }
}
}
extension Character {
func getUnicodeCodePoints() -> [String] {
return String(self).getUnicodeCodePoints()
}
}
print("A".getUnicodeCodePoints()) // ["U+41"]
print("Á".getUnicodeCodePoints()) // ["U+41", "U+301"]
print("💖".getUnicodeCodePoints()) // ["U+1F496"]
print("SWIFT".getUnicodeCodePoints()) // ["U+53", "U+57", "U+49", "U+46", "U+54"]
print("🇯🇴".getUnicodeCodePoints()) // ["U+1F1EF", "U+1F1F4"]
Update for Swift 4:
As of Swift 4, the unicodeScalars of a Character can be
accessed directly,
see SE-0178 Add unicodeScalars property to Character. This makes the conversion to a String
obsolete:
let c: Character = "🇯🇴"
print(Array(c.unicodeScalars)) // ["\u{0001F1EF}", "\u{0001F1F4}"]

Get numbers characters from a string [duplicate]

This question already has answers here:
Filter non-digits from string
(12 answers)
Closed 6 years ago.
How to get numbers characters from a string? I don't want to convert in Int.
var string = "string_1"
var string2 = "string_20_certified"
My result have to be formatted like this:
newString = "1"
newString2 = "20"
Pattern matching a String's unicode scalars against Western Arabic Numerals
You could pattern match the unicodeScalars view of a String to a given UnicodeScalar pattern (covering e.g. Western Arabic numerals).
extension String {
var westernArabicNumeralsOnly: String {
let pattern = UnicodeScalar("0")..."9"
return String(unicodeScalars
.flatMap { pattern ~= $0 ? Character($0) : nil })
}
}
Example usage:
let str1 = "string_1"
let str2 = "string_20_certified"
let str3 = "a_1_b_2_3_c34"
let newStr1 = str1.westernArabicNumeralsOnly
let newStr2 = str2.westernArabicNumeralsOnly
let newStr3 = str3.westernArabicNumeralsOnly
print(newStr1) // 1
print(newStr2) // 20
print(newStr3) // 12334
Extending to matching any of several given patterns
The unicode scalar pattern matching approach above is particularly useful extending it to matching any of a several given patterns, e.g. patterns describing different variations of Eastern Arabic numerals:
extension String {
var easternArabicNumeralsOnly: String {
let patterns = [UnicodeScalar("\u{0660}")..."\u{0669}", // Eastern Arabic
"\u{06F0}"..."\u{06F9}"] // Perso-Arabic variant
return String(unicodeScalars
.flatMap { uc in patterns.contains{ $0 ~= uc } ? Character(uc) : nil })
}
}
This could be used in practice e.g. if writing an Emoji filter, as ranges of unicode scalars that cover emojis can readily be added to the patterns array in the Eastern Arabic example above.
Why use the UnicodeScalar patterns approach over Character ones?
A Character in Swift contains of an extended grapheme cluster, which is made up of one or more Unicode scalar values. This means that Character instances in Swift does not have a fixed size in the memory, which means random access to a character within a collection of sequentially (/contiguously) stored character will not be available at O(1), but rather, O(n).
Unicode scalars in Swift, on the other hand, are stored in fixed sized UTF-32 code units, which should allow O(1) random access. Now, I'm not entirely sure if this is a fact, or a reason for what follows: but a fact is that if benchmarking the methods above vs equivalent method using the CharacterView (.characters property) for some test String instances, its very apparent that the UnicodeScalar approach is faster than the Character approach; naive testing showed a factor 10-25 difference in execution times, steadily growing for growing String size.
Knowing the limitations of working with Unicode scalars vs Characters in Swift
Now, there are drawbacks using the UnicodeScalar approach, however; namely when working with characters that cannot represented by a single unicode scalar, but where one of its unicode scalars are contained in the pattern to which we want to match.
E.g., consider a string holding the four characters "Café". The last character, "é", is represented by two unicode scalars, "e" and "\u{301}". If we were to implement pattern matching against, say, UnicodeScalar("a")...e, the filtering method as applied above would allow one of the two unicode scalars to pass.
extension String {
var onlyLowercaseLettersAthroughE: String {
let patterns = [UnicodeScalar("1")..."e"]
return String(unicodeScalars
.flatMap { uc in patterns.contains{ $0 ~= uc } ? Character(uc) : nil })
}
}
let str = "Cafe\u{301}"
print(str) // Café
print(str.onlyLowercaseLettersAthroughE) // Cae
/* possibly we'd want "Ca" or "Caé"
as result here */
In the particular use case queried by from the OP in this Q&A, the above is not an issue, but depending on the use case, it will sometimes be more appropriate to work with Character pattern matching over UnicodeScalar.
Edit: Updated for Swift 4 & 5
Here's a straightforward method that doesn't require Foundation:
let newstring = string.filter { "0"..."9" ~= $0 }
or borrowing from #dfri's idea to make it a String extension:
extension String {
var numbers: String {
return filter { "0"..."9" ~= $0 }
}
}
print("3 little pigs".numbers) // "3"
print("1, 2, and 3".numbers) // "123"
import Foundation
let string = "a_1_b_2_3_c34"
let result = string.components(separatedBy: CharacterSet.decimalDigits.inverted).joined(separator: "")
print(result)
Output:
12334
Here is a Swift 2 example:
let str = "Hello 1, World 62"
let intString = str.componentsSeparatedByCharactersInSet(
NSCharacterSet
.decimalDigitCharacterSet()
.invertedSet)
.joinWithSeparator("") // Return a string with all the numbers
This method iterate through the string characters and appends the numbers to a new string:
class func getNumberFrom(string: String) -> String {
var number: String = ""
for var c : Character in string.characters {
if let n: Int = Int(String(c)) {
if n >= Int("0")! && n < Int("9")! {
number.append(c)
}
}
}
return number
}
For example with regular expression
let text = "string_20_certified"
let pattern = "\\d+"
let regex = try! NSRegularExpression(pattern: pattern, options: [])
if let match = regex.firstMatch(in: text, options: [], range: NSRange(location: 0, length: text.characters.count)) {
let newString = (text as NSString).substring(with: match.range)
print(newString)
}
If there are multiple occurrences of the pattern use matches(in..
let matches = regex.matches(in: text, options: [], range: NSRange(location: 0, length: text.characters.count))
for match in matches {
let newString = (text as NSString).substring(with: match.range)
print(newString)
}