NSAttributedString and emojis: issue with positions and lengths - swift

I'm coloring some parts of a text coming from an API (think "#mention" as on Twitter) using NSAttributedString.
The API gives me the text and an array of entities representing the parts of the text that are mentions (or links, tags, etc) which should be colored.
But sometimes, the coloration is offset because of emojis.
For example, with this text:
"#ericd Some text. #apero"
the API gives:
[
{
"text" : "ericd",
"len" : 6,
"pos" : 0
},
{
"text" : "apero",
"len" : 6,
"pos" : 18
}
]
which I successfully translate to an NSAttributedString using NSRange:
for m in entities.mentions {
let r = NSMakeRange(m.pos, m.len)
myAttributedString.addAttribute(NSForegroundColorAttributeName, value: someValue, range: r)
}
We see that "pos": 18 is correct, this is where "#apero" starts. The colored parts are "#ericd" and "#apero", as expected.
But when some specific combinations of emojis are used in the text, the API's positions do not translate well to NSAttributedString and the coloration is offset:
"#ericd Some text. 😺✌🏻 #apero"
gives:
[
{
"text" : "ericd",
"len" : 6,
"pos" : 0
},
{
"text" : "apero",
"len" : 6,
"pos" : 22
}
]
"pos": 22: the API author states that this is correct, and I understand their point of view.
Unfortunately, NSAttributedString does not agree, my coloration is off:
The last characters of the second mention are not colored (because "pos" is too small because of the emojis?).
As you might have already guessed, I cannot in any way change the way the API behaves, I have to adapt on client side.
Except that... I have no idea what to do. Should I try to detect what kind of emojis are in the text and manually amend the position of mentions when there's a problematic emoji? But what would be the criteria to detect which emoji shifts the position and which doesn't? And how to decide how much offset I need? Maybe the problem is caused by NSAttributedString?
I understand that this is related to the emojis length once composed compared to their length as discrete characters, but... well... I'm lost (sigh).
Note that I've tried to implement a solution similar to this stuff because my API is compatible with this one, but it only worked partially, some emojis were still breaking the indexes:

A Swift String provides different "views" on its contents.
A good overview is given in "Strings in Swift 2" in the Swift Blog:
characters is a collection of Character values, or extended grapheme clusters.
unicodeScalars is a collection of Unicode scalar values.
utf8 is a collection of UTF-8 code units.
utf16 is a collection of UTF-16 code units.
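To make the difference between these views concrete, here is a small sketch that counts the question's example string in three of them. The emoji are written as \u{...} escapes only so that the exact scalars are unambiguous; the Character count is left out because grapheme breaking can vary with the Swift and Unicode version.

```swift
// The question's example text: 😺 (U+1F63A), ✌ (U+270C), 🏻 (U+1F3FB).
let text = "#ericd Some text. \u{1F63A}\u{270C}\u{1F3FB} #apero"

let scalarCount = text.unicodeScalars.count  // Unicode scalars
let utf16Count = text.utf16.count            // UTF-16 code units (what NSString and NSRange count)
let utf8Count = text.utf8.count              // UTF-8 code units (bytes)

print(scalarCount, utf16Count, utf8Count)    // prints "28 30 36"
```

The two extra UTF-16 units compared with the scalar count come from the two emoji outside the Basic Multilingual Plane, each of which needs a surrogate pair.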
As it turned out in the discussion, pos and len from your API
are indices into the Unicode scalars view.
On the other hand, the addAttribute() method of NSMutableAttributedString takes an NSRange, i.e. the range corresponding
to indices of the UTF-16 code units in an NSString.
String provides methods to "translate" between indices of the
different views (compare NSRange to Range<String.Index>):
let text = "#ericd Some text. 😺✌🏻 #apero"
let pos = 22
let len = 6
// Compute String.UnicodeScalarView indices for first and last position:
let from32 = text.unicodeScalars.index(text.unicodeScalars.startIndex, offsetBy: pos)
let to32 = text.unicodeScalars.index(from32, offsetBy: len)
// Convert to String.UTF16View indices:
let from16 = from32.samePosition(in: text.utf16)
let to16 = to32.samePosition(in: text.utf16)
// Convert to NSRange by computing the integer distances:
let nsRange = NSRange(location: text.utf16.distance(from: text.utf16.startIndex, to: from16),
length: text.utf16.distance(from: from16, to: to16))
This NSRange is what you need for the attributed string:
let attrString = NSMutableAttributedString(string: text)
attrString.addAttribute(NSForegroundColorAttributeName,
value: UIColor.red,
range: nsRange)
Update for Swift 4 (Xcode 9): In Swift 4, the standard library
provides methods to convert between Swift String ranges and NSString
ranges, therefore the calculations simplify to
let text = "#ericd Some text. 😺✌🏻 #apero"
let pos = 22
let len = 6
// Compute String.UnicodeScalarView indices for first and last position:
let fromIdx = text.unicodeScalars.index(text.unicodeScalars.startIndex, offsetBy: pos)
let toIdx = text.unicodeScalars.index(fromIdx, offsetBy: len)
// Compute corresponding NSRange:
let nsRange = NSRange(fromIdx..<toIdx, in: text)
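For a whole array of entities, the same conversion could be wrapped in a helper that also guards against offsets outside the text. This is only a sketch: the Entity struct and the nsRange(of:in:) function are hypothetical names, and only pos and len are taken from the API described in the question.

```swift
import Foundation

// Hypothetical model of one API entity; only `pos` and `len` come from the question.
struct Entity {
    let pos: Int
    let len: Int
}

/// Converts a (pos, len) pair counted in Unicode scalars into an NSRange
/// counted in UTF-16 code units. Returns nil if the offsets fall outside `text`.
func nsRange(of entity: Entity, in text: String) -> NSRange? {
    let scalars = text.unicodeScalars
    guard entity.pos >= 0, entity.len >= 0,
          let from = scalars.index(scalars.startIndex, offsetBy: entity.pos,
                                   limitedBy: scalars.endIndex),
          let to = scalars.index(from, offsetBy: entity.len,
                                 limitedBy: scalars.endIndex)
    else { return nil }
    return NSRange(from..<to, in: text)
}

// 😺 (U+1F63A), ✌ (U+270C), 🏻 (U+1F3FB), written as escapes to be unambiguous.
let text = "#ericd Some text. \u{1F63A}\u{270C}\u{1F3FB} #apero"
let entities = [Entity(pos: 0, len: 6), Entity(pos: 22, len: 6)]

for entity in entities {
    if let range = nsRange(of: entity, in: text) {
        print(range.location, range.length)   // prints "0 6", then "24 6"
    }
}
```

Each resulting range can then be passed to addAttribute as in the code above; an entity that yields nil is simply skipped.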

Related

How to shift a string's Range?

I have the Range of a word and its enclosing sentence within a big long String. After extracting that sentence into its own String, I'd like to know the position of the word within it.
If we were dealing with integer indexes, I would just subtract the sentence's starting index from the word's range and I'd be done. For example, if the word was in characters 10–12 and its sentence started at character 8, then I'd have a new word range of 2–4.
Here's what I've got, ready to copy&paste to a Playground:
// The Setup (this is just to get easy testing values, no need for feedback on this part)
let bigLongString = "A beginning is the time for taking the most delicate care that the balances are correct. This every sister of the Bene Gesserit knows."
let sentenceInString = bigLongString.range(of: "This every sister of the Bene Gesserit knows.")!
let wordInString = bigLongString.range(of: "sister")!
let sentence = String(bigLongString[sentenceInString])
// The Code In Question
let wordInSentence = ??? // Something that shifts the `wordInString` range
// The Test (again, just for testing. it should read "This every *sister* of the Bene Gesserit knows.")
print(sentence.replacingCharacters(in: wordInSentence,
with: "*\(sentence[wordInSentence])*"))
Also, note that wordInString may refer to any instance of a given word, not just the first one. (So, re-finding the word in sentence, i.e., sentence.range(of: "sister"), won't do the trick here unfortunately.) The range needs to be shifted somehow.
Thanks for reading!
EDIT:
Introducing a slightly more complicated bigLongString seems to be an issue with the solution I posted. E.g.,
let bigLongString = "Really…? Thought I had it."
let sentenceInString = bigLongString.range(of: "Thought I had it.")!
let wordInString = bigLongString.range(of: "I")!
This can get kinda tricky, depending on precisely what you need to do.
NSRange
Firstly, as you may have noticed, Range<String.Index> and NSRange are different.
Range<String.Index> is how Swift represents ranges of indices in native Swift.Strings. It's an opaque type that's only usable by the String APIs that consume it. It understands Swift strings as collections of Swift.Characters, which represent what Unicode calls "extended grapheme clusters".
NSRange is the older range representation, used by Objective-C to represent ranges in Foundation.NSStrings. It's an open container, containing a "start" location and a length. Importantly, NSRange and NSString both work in terms of UTF-16 code units.
Because NSRange and NSString expose so many of their internals, they haven't undergone the same migration from utf16 to utf8 that Swift.String underwent. A migration that most people probably didn't even notice, since Swift.String guarded its implementation details much more than NSString did.
NSRange is more amenable to the kinds of simple operations you might be looking for. You can offset the start location just like you describe. However, you need to be careful that the resulting range doesn't start or end in the middle of an extended grapheme cluster. In that case, slicing could lead to a substring with invalid Unicode characters (for example, you might accidentally cut an e away from its accent; the accent modifier isn't valid on its own without the e).
Bridging back and forth between NSRange and Range<String.Index> is possible, but can be error prone if you're not careful. For that reason, I suggest you try to minimize conversions, by trying to either exclusively use NSRange, or Range<String.Index>, but not mix the two too much.
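As a rough illustration of that bridging, the sketch below converts a Range<String.Index> to an NSRange and back with the Foundation initializers NSRange(_:in:) and Range(_:in:). The string reuses the example from the question's edit; the variable names are only illustrative.

```swift
import Foundation

let s = "Really\u{2026}? Thought I had it."   // \u{2026} is the ellipsis character
let swiftRange = s.range(of: "Thought")!

// Range<String.Index> -> NSRange (UTF-16 offsets)
let nsRange = NSRange(swiftRange, in: s)
print(nsRange.location, nsRange.length)   // prints "9 7"

// NSRange -> Range<String.Index>; this initializer is failable
let roundTrip = Range(nsRange, in: s)
print(roundTrip == swiftRange)            // prints "true"

// An NSRange that does not fit inside the string converts to nil
let outOfBounds = Range(NSRange(location: 100, length: 1), in: s)
print(outOfBounds == nil)                 // prints "true"
```

Because the conversion back is failable, every hop from NSRange to Range<String.Index> adds an optional to handle, which is one more reason to stay in a single representation where possible.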
replacingCharacters(in:with:)
I suspect you're only using this as example way of consuming wordInSentence, but it's still worth noting that:
Foundation.NSString.replacingCharacters(in:with:) (https://developer.apple.com/documentation/foundation/nsstring/1412937-replacingoccurrences) is an NSString API that's imported onto Swift.String when Foundation is imported. The NSString method takes an NSRange. If you're dealing with Range<String.Index>, you should use its Swift-native counterpart, Swift.String.replaceSubrange(_:with:).
Substring is your friend
Don't fight it; unless you absolutely need sentence to be a String, keep it as a Substring for the duration of these short-lived processing actions. Not only does this save you a copy of the string's contents, but it also makes it so that the indices can be shared between the slice and the parent string. This is valid:
let sentence = bigLongString[sentenceInString]
print(sentence[wordInString])
or even just: bigLongString[sentenceInString][wordInString] or bigLongString[wordInString]
Shifting around
I couldn't find a native solution for this, so I rolled my own. I could definitely be missing something simpler, but here's what I came up with:
import Foundation
struct SubstringOffset {
let offset: String.IndexDistance
let parent: String
init(of substring: Substring, in parent: String) {
self.offset = parent.distance(from: parent.startIndex, to: substring.startIndex)
self.parent = parent
}
func convert(indexInParent: String.Index, toIndexIn newString: String) -> String.Index {
let distance = parent.distance(from: parent.startIndex, to: indexInParent)
let distanceInNewString = distance - offset
return newString.index(newString.startIndex, offsetBy: distanceInNewString)
}
func convert(rangeInParent: Range<String.Index>, toRangeIn newString: String) -> Range<String.Index> {
let newLowerBound = self.convert(indexInParent: rangeInParent.lowerBound, toIndexIn: newString)
let span = self.parent.distance(from: rangeInParent.lowerBound, to: rangeInParent.upperBound)
let newUpperBound = newString.index(newLowerBound, offsetBy: span)
return newLowerBound ..< newUpperBound
}
}
// The Setup (this is just to get easy testing values, no need for feedback on this part)
let bigLongString = "Really…? Thought I had it."
let sentenceInString = bigLongString.range(of: "Thought I had it.")!
let wordInString = bigLongString.range(of: "I")!
var sentence: String = String(bigLongString[sentenceInString])
let offset = SubstringOffset(of: bigLongString[sentenceInString], in: bigLongString)
// The Code In Question
let wordInSentence: Range<String.Index> = offset.convert(rangeInParent: wordInString, toRangeIn: sentence)
sentence.replaceSubrange(wordInSentence, with: "*\(sentence[wordInSentence])*")
print(sentence)
OK, this is what I've come up with. It appears to work OK for both examples in the question.
We use the String instance method distance(from:to:) to get the distance between the bigLongString start and the sentence start. (Analogous to the "8" in the question.) Then the word range is shifted back by this amount by shifting the upper and lower bounds separately, and then reforming them into a Range.
let wordStartInSentence = bigLongString.distance(from: sentenceInString.lowerBound,
to: wordInString.lowerBound)
let wordEndInSentence = bigLongString.distance(from: sentenceInString.lowerBound,
to: wordInString.upperBound)
let wordStart = sentence.index(sentence.startIndex, offsetBy: wordStartInSentence)
let wordEnd = sentence.index(sentence.startIndex, offsetBy: wordEndInSentence)
let wordInSentence = wordStart..<wordEnd
EDIT: Updated answer to work for the more complicated bigLongString example (and coincidentally also reduce the "string walking," especially when bigLongString is very big).
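The same steps can be packaged as a small function and checked against the trickier example from the question. This is a sketch only; the name shifted(_:outOf:in:into:) and its parameter labels are made up for illustration.

```swift
import Foundation

/// Hypothetical helper: takes `word` and `sentence`, both ranges in `parent`,
/// and returns the range of the word inside `copy`, a standalone String
/// created from `parent[sentence]`.
func shifted(_ word: Range<String.Index>,
             outOf sentence: Range<String.Index>,
             in parent: String,
             into copy: String) -> Range<String.Index> {
    // Character distances measured from the start of the sentence, not of `parent`.
    let startOffset = parent.distance(from: sentence.lowerBound, to: word.lowerBound)
    let endOffset = parent.distance(from: sentence.lowerBound, to: word.upperBound)
    let start = copy.index(copy.startIndex, offsetBy: startOffset)
    let end = copy.index(copy.startIndex, offsetBy: endOffset)
    return start..<end
}

let bigLongString = "Really\u{2026}? Thought I had it."   // \u{2026} is the ellipsis
let sentenceInString = bigLongString.range(of: "Thought I had it.")!
let wordInString = bigLongString.range(of: "I")!
let sentence = String(bigLongString[sentenceInString])

let wordInSentence = shifted(wordInString, outOf: sentenceInString,
                             in: bigLongString, into: sentence)
let marked = sentence.replacingCharacters(in: wordInSentence,
                                          with: "*\(sentence[wordInSentence])*")
print(marked)   // prints "Thought *I* had it."
```

Measuring from the sentence start rather than from the start of the big string keeps the walk proportional to the sentence length.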

How to determine the display count of a Swift String?

I've reviewed questions such as Get the length of a String and Why are emoji characters like 👩‍👩‍👧‍👦 treated so strangely in Swift strings? but neither covers this specific question.
This all started when trying to apply skin tone modifiers to Emoji characters (see Add skin tone modifier to an emoji programmatically). This led to wondering what happens when you apply a skin tone modifier to a regular character such as "A".
Examples:
let tonedThumbsUp = "👍" + "🏻" // 👍🏻
let tonedA = "A" + "🏾" // A🏾
I'm trying to detect that second case. The count of both of those strings is 1. And the unicodeScalars.count is 2 for both.
How do I determine if the resulting string appears as a single character when displayed? In other words, how can I determine if the skin tone modifier was applied to make a single character or not?
I've tried a few ways to dump information about the string but none give the desired result.
func dumpString(_ str: String) {
print("Raw:", str, str.count)
print("Scalars:", str.unicodeScalars, str.unicodeScalars.count)
print("UTF16:", str.utf16, str.utf16.count)
    print("UTF8:", str.utf8, str.utf8.count) // was str.utf16.count, which printed the UTF-16 length twice
    print("Range:", str.startIndex, str.endIndex)
    print("First/Last:", str.first == str.last, str.first, str.last)
}
dumpString("A🏽")
dumpString("\u{1f469}\u{1f3fe}")
Results (Range lines omitted):
Raw: A🏽 1
Scalars: A🏽 2
UTF16: A🏽 3
UTF8: A🏽 5
First/Last: true Optional("A🏽") Optional("A🏽")
Raw: 👩🏾 1
Scalars: 👩🏾 2
UTF16: 👩🏾 4
UTF8: 👩🏾 8
First/Last: true Optional("👩🏾") Optional("👩🏾")
What happens if you print 👍🏻 on a system that doesn't support the Fitzpatrick modifiers? You get 👍 followed by whatever the system uses for an unknown character placeholder.
So I think to answer this, you must consult your system's typesetter. For Apple platforms, you can use Core Text to create a CTLine and then count the line's glyph runs. Example:
import Foundation
import CoreText
func test(_ string: String) {
let richText = NSAttributedString(string: string)
let line = CTLineCreateWithAttributedString(richText as CFAttributedString)
let runs = CTLineGetGlyphRuns(line) as! [CTRun]
print(string, runs.count)
}
test("👍" + "🏻")
test("A" + "🏾")
test("B\u{0300}\u{0301}\u{0302}" + "🏾")
Output from a macOS playground in Xcode 10.2.1 on macOS 10.14.6 Beta (18G48f):
👍🏻 1
A🏾 2
B̀́̂🏾 2
I think it might be possible to reason about this by looking to see whether the modifier is present and if so whether it has increased the character count.
So for example:
let tonedThumbsUp = "👍" + "🏻"
let tonedA = "A" + "🏻"
tonedThumbsUp.count // 1
tonedThumbsUp.unicodeScalars.count // 2
tonedA.count // 2
tonedA.unicodeScalars.count // 2
let c = "\u{1F3FB}"
tonedThumbsUp.contains(c) // true
tonedA.contains(c) // true
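Okay, so they both contain a modifier character, and they both contain two Unicode scalars, but one is count 1 and the other is count 2. Surely that's a useful distinction. Note, though, that the question reports a count of 1 for both strings, so this result evidently depends on the Swift and Unicode version in use.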
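A different angle, not taken from either answer, is to ask the Unicode properties of the scalars directly instead of relying on the character count. The sketch below assumes Swift 5 or later, where Unicode.Scalar.Properties exposes isEmojiModifierBase and isEmojiModifier; the function name is made up. It only says whether the sequence is a valid base-plus-modifier pair, not whether the current font actually renders it as one glyph.

```swift
/// Sketch: true when the string is exactly an emoji modifier base followed by
/// an emoji modifier, according to the Unicode properties of its scalars.
func isModifiedEmoji(_ string: String) -> Bool {
    let scalars = Array(string.unicodeScalars)
    guard scalars.count == 2 else { return false }
    return scalars[0].properties.isEmojiModifierBase
        && scalars[1].properties.isEmojiModifier
}

print(isModifiedEmoji("\u{1F44D}\u{1F3FB}"))   // thumbs up + skin tone: prints "true"
print(isModifiedEmoji("A\u{1F3FE}"))           // "A" + skin tone: prints "false"
```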

Detecting Cursor position in UITextView that contains emojis returns the wrong position in swift 4

I'm using this code for detecting the cursor's position in a UITextView :
if let selectedRange = textView.selectedTextRange {
let cursorPosition = textView.offset(from: textView.beginningOfDocument, to: selectedRange.start)
print("\(cursorPosition)")
}
I put this inside textViewDidChange to detect the cursor position each time the text changes.
It works fine, but when I insert emojis, textView.text.count differs from the cursor position. Since Swift 4 each emoji counts as one character, but the cursor position seems to be counted differently.
So how can I get the exact cursor position that matches the character count of the text?
Long story short: When using Swift with String and NSRange use this extension for Range conversion
extension String {
    /// The `NSRange` (in UTF-16 code units) that covers the whole string.
    /// Built from the string's own bounds, so it stays correct even when the
    /// Character count and the Unicode scalar count differ.
    var range: NSRange {
        return NSRange(startIndex..<endIndex, in: self)
    }
}
Let's take a deeper look:
let myStr = "Wéll helló ⚙️"
myStr.count // 12
myStr.unicodeScalars.count // 13
myStr.utf8.count // 19
myStr.utf16.count // 13
In Swift 4 a string is a collection of characters (a composite character like ö or an emoji counts as one character). The UTF-8 and UTF-16 views are collections of UTF-8 and UTF-16 code units respectively.
Your problem is that textView.text.count counts collection elements (an emoji or a composite character counts as one element), while NSRange counts indexes of UTF-16 code units. The difference is illustrated in the snippet above.
More here:
Strings And Characters
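If the goal is a cursor position that agrees with text.count, one option is to translate the UTF-16 offset into a character offset. The sketch below assumes that the value returned by textView.offset(from:to:) is a UTF-16 offset, as described above; the function name is hypothetical.

```swift
/// Sketch: converts an offset counted in UTF-16 code units into an offset
/// counted in Characters. Returns nil when the offset lies outside `text`.
func characterOffset(forUTF16Offset offset: Int, in text: String) -> Int? {
    let utf16 = text.utf16
    guard offset >= 0,
          let index = utf16.index(utf16.startIndex, offsetBy: offset,
                                  limitedBy: utf16.endIndex)
    else { return nil }
    return text.distance(from: text.startIndex, to: index)
}

// "Wéll helló ⚙️" written with escapes: é, ó, gear (U+2699) + variation selector (U+FE0F).
let myStr = "W\u{E9}ll hell\u{F3} \u{2699}\u{FE0F}"

// A cursor at the very end is UTF-16 offset 13, which is character offset 12.
print(characterOffset(forUTF16Offset: 13, in: myStr) ?? -1)   // prints "12"
```

In a text view a cursor normally sits on a character boundary, so offsets that fall inside a grapheme cluster are not handled specially here.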

UTF8 String length and indices in Go vs Swift

I have apps in Go and Swift which process strings, such as finding substrings and their indices. At first it worked nicely even with multi-byte characters (e.g. emojis), using Go's utf8.RuneCountInString() and Swift's native String.
But there are some UTF8 characters that break the string length and indices for substrings, e.g. a string "Lorem 😂😃✌️🤔 ipsum":
Go's utf8.RuneCountInString("Lorem 😂😃✌️🤔 ipsum") returns 17 and the start index of ipsum is 12.
Swift's "Lorem 😂😃✌️🤔 ipsum".count returns 16 and the start index of ipsum is 11.
Using Swift String's utf8, utf16 or casting to NSString also gives different lengths and indices. There are also other emojis composed from multiple other emojis, like 👨‍👩‍👧‍👦, which give even funnier numbers.
This is with Go 1.8 and Swift 4.1.
Is there any way to get the same string lengths and substrings' indices with same values with Go and Swift?
EDIT
I created a Swift String extension based on @MartinR's great answer:
extension String {
func runesRangeToNSRange(from: Int, to: Int) -> NSRange {
let length = to - from
let start = unicodeScalars.index(unicodeScalars.startIndex, offsetBy: from)
let end = unicodeScalars.index(start, offsetBy: length)
let range = start..<end
return NSRange(range, in: self)
}
}
In Swift a Character is an “extended grapheme cluster,” and each of "😂", "😃", "✌️", "🤔", "👨‍👩‍👧‍👦" counts as a single character.
I have no experience with Go, but as I understand it from Strings, bytes, runes and characters in Go,
a “rune” is a Unicode code point, which essentially corresponds to a UnicodeScalar in Swift.
In your example, the difference comes from "✌️" which
counts as a single Swift character, but is built from two Unicode scalars:
print("✌️".count) // 1
print("✌️".unicodeScalars.count) // 2
Here is an example how you can compute the length and offsets in
terms of Unicode scalars:
let s = "Lorem 😂😃✌️🤔 ipsum"
print(s.unicodeScalars.count) // 17
if let idx = s.range(of: "ipsum") {
print(s.unicodeScalars.distance(from: s.startIndex, to: idx.lowerBound)) // 12
}
As you can see, this gives the same numbers as in your example from Go.
A rune in Go identifies a specific Unicode code point; that does not necessarily mean it maps 1:1 to visually distinct characters. Some characters may be made up of multiple runes/code points, therefore counting runes may not give you what you'd expect from a visual inspection of the string. I don't know what "some text".count actually counts in Swift so I can't offer any comparison there.

NSString.rangeOfString returns unusual result with non-latin characters

I need to get the range of two words in a string, for example:
ยัฟิแก ไฟหก
(this is literally me typing PYABCD WASD) - it's a nonsensical test since I don't speak Thai.
//Find all the ranges of each word
var words: [String] = []
var ranges: [NSRange] = []
//Convert to nsstring first because otherwise you get stuck with Ranges and Strings.
let nstext = backgroundTextField.stringValue as NSString //contains "ยัฟิแก ไฟหก"
words = nstext.componentsSeparatedByString(" ")
var nstextLessWordsWeHaveRangesFor = nstext //if you have two identical words this prevents just getting the first word's range
for word in words
{
let range:NSRange = nstextLessWordsWeHaveRangesFor.rangeOfString(word)
Swift.print(range)
ranges.append(range)
//create a string the same length as word
var fillerString:String = ""
for i in 0..<word.characters.count{
//for var i=0;i<word.characters.count;i += 1{
Swift.print("i: \(i)")
fillerString = fillerString.stringByAppendingString(" ")
}
//remove duplicate words / letters so that we get correct range each time.
if range.length <= nstextLessWordsWeHaveRangesFor.length
{
nstextLessWordsWeHaveRangesFor = nstextLessWordsWeHaveRangesFor.stringByReplacingCharactersInRange(range, withString: fillerString)
}
}
outputs:
(0,6)
(5,4)
Those ranges are overlapping.
This causes problems down the road where I'm trying to use NSLayoutManager.enumerateEnclosingRectsForGlyphRange since the ranges are inconsistent.
How can I get the correct range (or in this specific case, non-overlapping ranges)?
Swift String characters describe "extended grapheme clusters", and NSString
uses UTF-16 code units, therefore the length of a string differs
depending on which representation you use.
For example, the first character "ยั" is actually the combination
of "ย" (U+0E22) with the diacritical mark " ั" (U+0E31).
That counts as one String character, but as two NSString characters.
As a consequence, indices change when you replace the word with
spaces.
The simplest solution is to stick to one, either String or NSString
(if possible). Since you are working with NSString, changing
for i in 0..<word.characters.count {
to
for i in 0..<range.length {
should solve the problem. The creation of the filler string
can be simplified to
//create a string the same length as word
let fillerString = String(count: range.length, repeatedValue: Character(" "))
Removing nstextLessWordsWeHaveRangesFor solves the issue (at the bottom starting with range.length <= nstextLessWordsWeHaveRangesFor.length). The modification of that variable is changing the range and giving unexpected output. Here is the result when the duplicate word removal is removed:
var words: [String] = []
let nstext = "ยัฟิแก ไฟหก" as NSString
words = nstext.componentsSeparatedByString(" ")
for word in words {
let range = nstext.rangeOfString(word)
print(range)
}
Output is: (0,6) and (7,4)
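Dropping the replacement step means identical words would both report the first occurrence again. One way to keep distinct ranges without modifying the string is to start each search where the previous match ended, staying in NSString (UTF-16) units throughout. This is a sketch in current Swift syntax rather than the Swift 2 syntax used above, and the function name wordRanges(in:) is made up.

```swift
import Foundation

/// Sketch: returns the NSRange of every space-separated word, in order.
/// Each search starts after the previous match, so repeated words get
/// distinct, non-overlapping ranges.
func wordRanges(in text: String) -> [NSRange] {
    let ns = NSString(string: text)
    var ranges: [NSRange] = []
    var searchStart = 0
    for word in ns.components(separatedBy: " ") where !word.isEmpty {
        let searchRange = NSRange(location: searchStart, length: ns.length - searchStart)
        let found = ns.range(of: word, options: .literal, range: searchRange)
        guard found.location != NSNotFound else { continue }
        ranges.append(found)
        searchStart = found.location + found.length
    }
    return ranges
}

// The Thai test string from the question, written as escapes to be unambiguous.
let thai = "\u{0E22}\u{0E31}\u{0E1F}\u{0E34}\u{0E41}\u{0E01} \u{0E44}\u{0E1F}\u{0E2B}\u{0E01}"
for range in wordRanges(in: thai) {
    print(range.location, range.length)   // prints "0 6", then "7 4"
}
```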