Index distance to FileHandle pointer and characters encoding in Swift 4

I have this function to return (and seek) a FileHandle pointer at a specific word:
func getFilePointerIndex(atWord word: String, inFile file: FileHandle) -> UInt64? {
    let offset = file.offsetInFile
    if let str = String(data: file.readDataToEndOfFile(), encoding: .utf8) {
        if let range = str.range(of: word) {
            let intIndex = str.distance(from: str.startIndex, to: range.lowerBound)
            file.seek(toFileOffset: offset + UInt64(intIndex))
            return UInt64(intIndex) + offset
        }
    }
    return nil
}
When applied to some UTF-8 text files, it yields offsets far from the location of the word passed in. I suspect the character encoding (variable-width characters) is the cause, since seek(toFileOffset:) works with byte offsets into the file's data rather than character counts.
Any idea how to fix it?

let intIndex = str.distance(from: str.startIndex, to: range.lowerBound)
measures the distance in Characters, i.e. "extended Unicode grapheme clusters". For example, the character "€" is stored as the three bytes 0xE2 0x82 0xAC in UTF-8 encoding, but counts as a single Character.
To measure the distance in UTF-8 code units, use
let intIndex = str.utf8.distance(from: str.utf8.startIndex, to: range.lowerBound)
See also "Strings in Swift 2" on the Swift blog for an overview of grapheme clusters and the different views of a Swift string.
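Putting this together, the function from the question could be adapted roughly as follows. This is only a sketch: the name filePointerIndex(atWord:inFile:) is chosen here for illustration, and it assumes the file really is valid UTF-8 and, like the original, reads the rest of the file into memory.

```swift
import Foundation

// Hypothetical variant of the question's function. The distance is measured
// in UTF-8 code units (bytes), which is the unit seek(toFileOffset:) expects.
func filePointerIndex(atWord word: String, inFile file: FileHandle) -> UInt64? {
    let offset = file.offsetInFile
    guard let str = String(data: file.readDataToEndOfFile(), encoding: .utf8),
          let range = str.range(of: word) else {
        return nil
    }
    // Byte distance from the start of the decoded text to the match.
    let byteDistance = str.utf8.distance(from: str.utf8.startIndex, to: range.lowerBound)
    let target = offset + UInt64(byteDistance)
    file.seek(toFileOffset: target)
    return target
}
```

As in the original, the file pointer is left at the end of the file when the word is not found.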

Related

In Swift 5, how do I display the raw bits returned from range.UpperBound or range.LowerBound in a readable way (with ints)?

I am starting to experiment in Swift Playgrounds to familiarize myself with the language. To repeat my question: In Swift 5, how do I display the raw bits returned from range.UpperBound or range.LowerBound in a readable way (with ints)?
As an example, let's say I have a string var myStr = "Hello World!" and I was looking to see if a 'substring' exists in myStr in order. If it does, then I would like to print the indices of that found range:
var myStr = "Hello World!"
if let rangeFound = myStr.range(of: "ello") {
    print(rangeFound) // I'm getting: Index(_rawBits: 65536)..<Index(_rawBits: 327680)
    print("Found ello from \(rangeFound.lowerBound) to \(rangeFound.upperBound)")
    // prints: Found ello from Index(_rawBits: 65536) to Index(_rawBits: 327680)
}
Instead of printing the raw bits I would like to print the readable indices ..."from 1 to 4"
I am not trying to use these numbers in another range, only trying to print the readable indices. Thanks

A Unicode character can consist of more than one byte, so an index cannot simply be an Int.
A workaround to get integer indices is a conversion to NSRange, whose location and length count UTF-16 code units:
import Foundation

let myStr = "Hello World!"
if let rangeFound = myStr.range(of: "ello") {
    let nsRange = NSRange(rangeFound, in: myStr)
    print("Found ello from \(nsRange.location) to \(nsRange.location + nsRange.length - 1)")
}
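If the numbers should count Characters rather than UTF-16 code units, the distance from startIndex gives the same kind of integer. A small sketch; the helper name characterOffsets(of:in:) is made up for this example.

```swift
import Foundation

// Hypothetical helper: returns the Character offsets of the first and last
// Character of the first occurrence of `substring` in `text`.
func characterOffsets(of substring: String, in text: String) -> (first: Int, last: Int)? {
    guard let found = text.range(of: substring), !found.isEmpty else { return nil }
    let first = text.distance(from: text.startIndex, to: found.lowerBound)
    let last = text.distance(from: text.startIndex, to: found.upperBound) - 1
    return (first, last)
}

if let offsets = characterOffsets(of: "ello", in: "Hello World!") {
    print("Found ello from \(offsets.first) to \(offsets.last)")
    // prints: Found ello from 1 to 4
}
```

For plain ASCII text both approaches give the same numbers; they differ once the string contains emoji or other characters that need more than one UTF-16 code unit.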

Find number in string and return location and length

let myStr = "I have 4.34 apples."
I need the location range and the length, because I'm using NSRange(location:, length:) to bold the number 4.34
extension String {
    func findNumbersAndBoldThem() -> NSAttributedString {
        //the code
    }
}

My suggestion is also based on regular expression but there is a more convenient way to get NSRange from Range<String.Index>
import Foundation

let myStr = "I have 4.34 apples."
if let range = myStr.range(of: "\\d+\\.\\d+", options: .regularExpression) {
    let nsRange = NSRange(range, in: myStr)
    print(nsRange)
}
If you want to detect integer and floating point values use the pattern
"\\d+(\\.\\d+)?"
The parentheses and the trailing question mark indicate that the decimal point and the fractional digits are optional.
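To collect every number in the string rather than only the first one, NSRegularExpression can be used with the same pattern. The following is a sketch, not the asker's findNumbersAndBoldThem(); the function name numberRanges(in:) is an assumption made for this example.

```swift
import Foundation

// Hypothetical helper: returns the NSRange of every integer or decimal
// number in `text`, using the pattern suggested above.
func numberRanges(in text: String) -> [NSRange] {
    guard let regex = try? NSRegularExpression(pattern: "\\d+(\\.\\d+)?") else {
        return []
    }
    // NSRegularExpression works on UTF-16 offsets, so search the whole string as an NSRange.
    let wholeText = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: wholeText).map { $0.range }
}

for range in numberRanges(in: "I have 4.34 apples.") {
    print(range.location, range.length)   // prints: 7 4
}
```

Each returned range could then be passed to NSMutableAttributedString's addAttribute(_:value:range:) together with a bold font to produce the attributed string.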

Detecting Cursor position in UITextView that contains emojis returns the wrong position in swift 4

I'm using this code for detecting the cursor's position in a UITextView :
if let selectedRange = textView.selectedTextRange {
    let cursorPosition = textView.offset(from: textView.beginningOfDocument, to: selectedRange.start)
    print("\(cursorPosition)")
}
I put this in textViewDidChange to detect the cursor position each time the text changes.
It works fine, but when I insert emojis, textView.text.count differs from the cursor position. Since Swift 4 each emoji counts as one character, but the cursor position seems to be counted differently.
So how can I get the exact cursor position that matches the character count of the text?

Long story short: when using a Swift String together with NSRange, use this extension for the range conversion:
extension String {
    /// The `NSRange` (counted in UTF-16 code units) that covers the whole string
    var range: NSRange {
        return NSRange(startIndex..<endIndex, in: self)
    }
}
Let's take a deeper look:
let myStr = "Wéll helló ⚙️"
myStr.count // 12
myStr.unicodeScalars.count // 13
myStr.utf8.count // 19
myStr.utf16.count // 13
In Swift 4 a string is a collection of Characters (a composite character like ö or an emoji counts as one Character). The UTF-8 and UTF-16 views are collections of UTF-8 and UTF-16 code units, respectively.
Your problem is that textView.text.count counts collection elements (an emoji or a composite character counts as one element), whereas NSRange counts UTF-16 code units. The difference is illustrated in the snippet above.
More here:
Strings And Characters
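One way to turn such an offset into a Character count is sketched below, under the assumption that the cursor offset is expressed in UTF-16 code units as described above. The function name characterCount(upToUTF16Offset:in:) is hypothetical.

```swift
// Hypothetical helper: counts the Characters that precede a UTF-16 offset.
// Out-of-range offsets are clamped to the bounds of the string.
func characterCount(upToUTF16Offset utf16Offset: Int, in text: String) -> Int {
    let clamped = max(0, min(utf16Offset, text.utf16.count))
    // Decode the UTF-16 prefix back into a String and count its Characters.
    let prefix = text.utf16.prefix(clamped)
    return String(decoding: prefix, as: UTF16.self).count
}

let sample = "a\u{1F600}b"                                    // "a😀b"
print(sample.utf16.count)                                     // 4
print(characterCount(upToUTF16Offset: 3, in: sample))         // 2
```

In the question's code, cursorPosition would be passed as the offset and textView.text as the string.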

Swift CharacterSet subtract

Why is characters1 not empty?
var characters1 = CharacterSet.decimalDigits
let characters2 = CharacterSet(charactersIn: "01234567890")
characters1.subtract(characters2)
print(characters1.isEmpty)
Here everything is OK
var characters1 = CharacterSet(charactersIn: "9876543210")
let characters2 = CharacterSet(charactersIn: "0123456789")
characters1.subtract(characters2)
print(characters1.isEmpty)

From the docs:
Informally, this set is the set of all characters used to represent the decimal values 0 through 9. These characters include, for example, the decimal digits of the Indic scripts and Arabic.
Therefore, CharacterSet.decimalDigits does not contain only "9876543210"; it also includes the numerals of the Indic scripts (and other scripts).
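The difference can be checked with a single non-ASCII digit. A short sketch; the Arabic-Indic digit three (U+0663) is picked here only as an example of such a numeral.

```swift
import Foundation

let arabicIndicThree: Unicode.Scalar = "\u{0663}"   // "٣", ARABIC-INDIC DIGIT THREE
let asciiDigits = CharacterSet(charactersIn: "0123456789")

print(CharacterSet.decimalDigits.contains(arabicIndicThree))  // true
print(asciiDigits.contains(arabicIndicThree))                  // false

// When only ASCII digits are wanted, build the set explicitly
// instead of starting from .decimalDigits.
var remaining = asciiDigits
remaining.subtract(CharacterSet(charactersIn: "9876543210"))
print(remaining.isEmpty)                                       // true
```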

NSString.rangeOfString returns unusual result with non-latin characters

I need to get the range of two words in a string, for example:
ยัฟิแก ไฟหก
(this is literally me typing PYABCD WASD) - it's a nonsensical test since I don't speak Thai.
//Find all the ranges of each word
var words: [String] = []
var ranges: [NSRange] = []
//Convert to nsstring first because otherwise you get stuck with Ranges and Strings.
let nstext = backgroundTextField.stringValue as NSString //contains "ยัฟิแก ไฟหก"
words = nstext.componentsSeparatedByString(" ")
var nstextLessWordsWeHaveRangesFor = nstext //if you have two identical words this prevents just getting the first word's range
for word in words
{
    let range:NSRange = nstextLessWordsWeHaveRangesFor.rangeOfString(word)
    Swift.print(range)
    ranges.append(range)
    //create a string the same length as word
    var fillerString:String = ""
    for i in 0..<word.characters.count{
        //for var i=0;i<word.characters.count;i += 1{
        Swift.print("i: \(i)")
        fillerString = fillerString.stringByAppendingString(" ")
    }
    //remove duplicate words / letters so that we get correct range each time.
    if range.length <= nstextLessWordsWeHaveRangesFor.length
    {
        nstextLessWordsWeHaveRangesFor = nstextLessWordsWeHaveRangesFor.stringByReplacingCharactersInRange(range, withString: fillerString)
    }
}
outputs:
(0,6)
(5,4)
Those ranges are overlapping.
This causes problems down the road where I'm trying to use NSLayoutManager.enumerateEnclosingRectsForGlyphRange since the ranges are inconsistent.
How can I get the correct range (or in this specific case, non-overlapping ranges)?

Swift String characters describe "extended grapheme clusters", whereas NSString uses UTF-16 code units, so the length of a string differs depending on which representation you use.
For example, the first character "ยั" is actually the combination of "ย" (U+0E22) with the diacritical mark " ั" (U+0E31). That counts as one String character, but as two NSString characters.
As a consequence, indices change when you replace the word with spaces.
The simplest solution is to stick to one, either String or NSString (if possible). Since you are working with NSString, changing
for i in 0..<word.characters.count {
to
for i in 0..<range.length {
should solve the problem. The creation of the filler string can be simplified to
//create a string the same length as word
let fillerString = String(count: range.length, repeatedValue: Character(" "))
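In current Swift the same idea might look like the sketch below. String(repeating:count:) takes the place of the Swift 2 initializer used above, and the helper name filler(for:) is invented for illustration.

```swift
import Foundation

let firstCharacter = "\u{0E22}\u{0E31}"                    // "ยั": U+0E22 followed by U+0E31
print(firstCharacter.count)                                // 1 (Swift Characters)
print(firstCharacter.utf16.count)                          // 2 (UTF-16 code units)
print(NSString(string: firstCharacter).length)             // 2 (the unit NSRange counts in)

// Hypothetical helper: a run of spaces with the same UTF-16 length as `range`,
// so that replacing the range does not shift any later indices.
func filler(for range: NSRange) -> String {
    return String(repeating: " ", count: range.length)
}
```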

Removing nstextLessWordsWeHaveRangesFor (together with the block at the bottom starting with range.length <= nstextLessWordsWeHaveRangesFor.length) solves the issue. Modifying that variable shifts the ranges and produces the unexpected output. Here is the code with the duplicate-word handling removed:
var words: [String] = []
let nstext = "ยัฟิแก ไฟหก" as NSString
words = nstext.componentsSeparatedByString(" ")
for word in words {
    let range = nstext.rangeOfString(word)
    print(range)
}
Output is: (0,6) and (7,4)
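Without the filler step, two identical words would both report the range of the first occurrence, which is what the question's code tried to avoid. One alternative, not taken from either answer, is to start each search where the previous match ended. A sketch in current Swift syntax, with a hypothetical function name:

```swift
import Foundation

// Hypothetical helper: returns one NSRange per word, searching forward from
// the end of the previous match so that repeated words get distinct ranges.
func wordRanges(in text: String, separator: String = " ") -> [NSRange] {
    let nsText = NSString(string: text)
    var ranges: [NSRange] = []
    var searchStart = 0
    for word in text.components(separatedBy: separator) where !word.isEmpty {
        // Only look at the part of the text after the previous match.
        let searchRange = NSRange(location: searchStart, length: nsText.length - searchStart)
        let found = nsText.range(of: word, options: [.literal], range: searchRange)
        guard found.location != NSNotFound else { continue }
        ranges.append(found)
        searchStart = found.location + found.length
    }
    return ranges
}
```

All locations and lengths stay in UTF-16 code units, so the result can be used directly with NSString and NSLayoutManager APIs.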
