Reliable function to get position of substring in string in Swift

This is working well for English:
public static func posOf(needle: String, haystack: String) -> Int {
return haystack.distance(from: haystack.startIndex, to: (haystack.range(of: needle)?.lowerBound)!)
}
But for foreign characters the returned value is always too small. For example "का" is considered one unit instead of 2.
posOf(needle: "काम", haystack: "वह बीना की खुली कोयला खदान में काम करता था।") // 21
I later use the 21 in NSRange(location:length:), where it needs to be 31 for the NSRange to work properly.

A Swift String is a collection of Characters, and each Character
represents an "extended Unicode grapheme cluster".
NSString is a collection of UTF-16 code units.
Example:
print("का".characters.count) // 1
print(("का" as NSString).length) // 2
Swift String ranges are represented as Range<String.Index>,
and NSString ranges are represented as NSRange.
Your function counts the number of Characters from the start
of the haystack to the start of the needle, and that is different
from the number of UTF-16 code units.
If you need an "NSRange compatible"
character count, then the easiest method would be to use the
range(of:) method of NSString:
let haystack = "वह बीना की खुली कोयला खदान में काम करता था।"
let needle = "काम"
if let range = haystack.range(of: needle) {
let pos = haystack.distance(from: haystack.startIndex, to: range.lowerBound)
print(pos) // 21
}
let nsRange = (haystack as NSString).range(of: needle)
if nsRange.location != NSNotFound {
print(nsRange.location) // 31
}
Alternatively, use the utf16 view of the Swift string to
count UTF-16 code units:
if let range = haystack.range(of: needle) {
let lower16 = range.lowerBound.samePosition(in: haystack.utf16)
let pos = haystack.utf16.distance(from: haystack.utf16.startIndex, to: lower16)
print(pos) // 31
}
(See for example
NSRange to Range<String.Index> for more methods to convert between Range<String.Index>
and NSRange).
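On Swift 4 and later, the same UTF-16 offset can be obtained without casting to NSString. The following is a minimal sketch, assuming Foundation is available; utf16Offset(of:in:) is a hypothetical helper name, not part of any library:

```swift
import Foundation

// Hypothetical helper: returns the UTF-16 offset of the first occurrence
// of `needle` in `haystack`, or nil if it does not occur.
func utf16Offset(of needle: String, in haystack: String) -> Int? {
    guard let range = haystack.range(of: needle) else { return nil }
    // NSRange(_:in:) converts a Range<String.Index> into UTF-16 offsets.
    return NSRange(range, in: haystack).location
}

let sentence = "वह बीना की खुली कोयला खदान में काम करता था।"
if let offset = utf16Offset(of: "काम", in: sentence) {
    print(offset) // 31
}
```

Returning an optional avoids the force unwrap in the original function, which crashes when the needle is not found.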

Related

Why does swift substring with range require a special type of Range

Consider this function to build a string of random characters:
func makeToken(length: Int) -> String {
let chars: String = "abcdefghijklmnopqrstuvwxyz0123456789!?##$%ABCDEFGHIJKLMNOPQRSTUVWXYZ"
var result: String = ""
for _ in 0..<length {
let idx = Int(arc4random_uniform(UInt32(chars.characters.count)))
let idxEnd = idx + 1
let range: Range = idx..<idxEnd
let char = chars.substring(with: range)
result += char
}
return result
}
This throws an error on the substring method:
Cannot convert value of type 'Range<Int>' to expected argument
type 'Range<String.Index>' (aka 'Range<String.CharacterView.Index>')
I'm confused why I can't simply provide a Range with 2 integers, and why it's making me go the roundabout way of making a Range<String.Index>.
So I have to change the Range creation to this very over-complicated way:
let idx = Int(arc4random_uniform(UInt32(chars.characters.count)))
let start = chars.index(chars.startIndex, offsetBy: idx)
let end = chars.index(chars.startIndex, offsetBy: idx + 1)
let range: Range = start..<end
Why isn't it good enough for Swift for me to simply create a range with 2 integers and the half-open range operator? (..<)
Quite the contrast to "swift", in javascript I can simply do chars.substr(idx, 1)

I suggest converting your String to [Character] so that you can index it easily with Int:
func makeToken(length: Int) -> String {
let chars = Array("abcdefghijklmnopqrstuvwxyz0123456789!?##$%ABCDEFGHIJKLMNOPQRSTUVWXYZ".characters)
var result = ""
for _ in 0..<length {
let idx = Int(arc4random_uniform(UInt32(chars.count)))
result += String(chars[idx])
}
return result
}

Swift takes great care to provide a fully Unicode-compliant, type-safe, String abstraction.
Indexing a given Character, in an arbitrary Unicode string, is far from a trivial task. Each Character is a sequence of one or more Unicode scalars that (when combined) produce a single human-readable character. In particular, hiding all this complexity behind a simple Int based indexing scheme might result in the wrong performance mental model for programmers.
Having said that, you can always convert your string to an Array<Character> once for easy (and fast!) indexing. For instance:
let chars: String = "abcdefghijklmnop"
var charsArray = Array(chars.characters)
...
let resultingString = String(charsArray)
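For the token generator from the question, newer toolchains make the array approach shorter still. This is only a sketch, assuming Swift 4.2 or later (where randomElement() is available); the alphabet is copied from the question:

```swift
// Sketch assuming Swift 4.2 or later (for randomElement()).
func makeToken(length: Int) -> String {
    // Convert once so that each pick is an O(1) array lookup.
    let chars = Array("abcdefghijklmnopqrstuvwxyz0123456789!?##$%ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    // The array is non-empty, so randomElement() cannot return nil here.
    return String((0..<length).map { _ in chars.randomElement()! })
}

print(makeToken(length: 12))
```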

Get numbers characters from a string [duplicate]

How can I get the numeric characters from a string? I don't want to convert them to Int.
var string = "string_1"
var string2 = "string_20_certified"
My result has to be formatted like this:
newString = "1"
newString2 = "20"

Pattern matching a String's unicode scalars against Western Arabic Numerals
You could pattern match the unicodeScalars view of a String to a given UnicodeScalar pattern (covering e.g. Western Arabic numerals).
extension String {
var westernArabicNumeralsOnly: String {
let pattern = UnicodeScalar("0")..."9"
return String(unicodeScalars
.flatMap { pattern ~= $0 ? Character($0) : nil })
}
}
Example usage:
let str1 = "string_1"
let str2 = "string_20_certified"
let str3 = "a_1_b_2_3_c34"
let newStr1 = str1.westernArabicNumeralsOnly
let newStr2 = str2.westernArabicNumeralsOnly
let newStr3 = str3.westernArabicNumeralsOnly
print(newStr1) // 1
print(newStr2) // 20
print(newStr3) // 12334
Extending to matching any of several given patterns
The Unicode scalar pattern matching approach above is particularly useful when extended to match any of several given patterns, e.g. patterns describing different variations of Eastern Arabic numerals:
extension String {
var easternArabicNumeralsOnly: String {
let patterns = [UnicodeScalar("\u{0660}")..."\u{0669}", // Eastern Arabic
"\u{06F0}"..."\u{06F9}"] // Perso-Arabic variant
return String(unicodeScalars
.flatMap { uc in patterns.contains{ $0 ~= uc } ? Character(uc) : nil })
}
}
This could be used in practice e.g. if writing an Emoji filter, as ranges of unicode scalars that cover emojis can readily be added to the patterns array in the Eastern Arabic example above.
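The multi-pattern idea can also be written as one reusable function that takes the ranges as a parameter. This is a rough sketch assuming Swift 4.2 or later; keepingScalars(in:) and the three range constants are hypothetical names introduced for illustration:

```swift
extension String {
    // Hypothetical helper: keeps only the Unicode scalars that fall in
    // at least one of the given closed ranges.
    func keepingScalars(in ranges: [ClosedRange<Unicode.Scalar>]) -> String {
        var kept = String.UnicodeScalarView()
        for scalar in unicodeScalars where ranges.contains(where: { $0.contains(scalar) }) {
            kept.append(scalar)
        }
        return String(kept)
    }
}

let westernDigits: ClosedRange<Unicode.Scalar> = "0"..."9"
let easternDigits: ClosedRange<Unicode.Scalar> = "\u{0660}"..."\u{0669}"
let persoDigits: ClosedRange<Unicode.Scalar> = "\u{06F0}"..."\u{06F9}"

print("a_1_b_2_3_c34".keepingScalars(in: [westernDigits])) // 12334
```

An emoji filter would pass the relevant emoji scalar ranges in the same way.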
Why use the UnicodeScalar patterns approach over Character ones?
A Character in Swift consists of an extended grapheme cluster, which is made up of one or more Unicode scalar values. This means that Character instances in Swift do not have a fixed size in memory, so random access to a character within a collection of sequentially (contiguously) stored characters is not available in O(1), but rather in O(n).
Unicode scalars in Swift, on the other hand, are fixed-size values (each fits in a single UTF-32 code unit), so walking them requires no grapheme-cluster segmentation. Now, I'm not entirely sure whether this is the reason for what follows, but a fact is that when benchmarking the methods above against equivalent methods using the CharacterView (.characters property) for some test String instances, it's very apparent that the UnicodeScalar approach is faster than the Character approach; naive testing showed a factor 10-25 difference in execution times, steadily growing with String size.
Knowing the limitations of working with Unicode scalars vs Characters in Swift
Now, there are drawbacks to the UnicodeScalar approach, however; namely when working with characters that cannot be represented by a single unicode scalar, but where one of their unicode scalars is contained in the pattern we want to match.
E.g., consider a string holding the four characters "Café". The last character, "é", is represented by two unicode scalars, "e" and "\u{301}". If we were to implement pattern matching against, say, UnicodeScalar("a")..."e", the filtering method as applied above would allow one of the two unicode scalars to pass.
extension String {
var onlyLowercaseLettersAthroughE: String {
let patterns = [UnicodeScalar("a")..."e"]
return String(unicodeScalars
.flatMap { uc in patterns.contains{ $0 ~= uc } ? Character(uc) : nil })
}
}
let str = "Cafe\u{301}"
print(str) // Café
print(str.onlyLowercaseLettersAthroughE) // ae
/* possibly we'd want "a" or "aé"
as result here */
In the particular use case queried by the OP in this Q&A, the above is not an issue, but depending on the use case, it will sometimes be more appropriate to work with Character pattern matching over UnicodeScalar.
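A Character-based filter keeps each grapheme cluster intact, at the cost of the speed advantage described above. This is a minimal sketch, assuming Swift 4 or later (where String is itself a collection of Characters); lettersAthroughE is a hypothetical name:

```swift
extension String {
    // Hypothetical property: keeps only whole Characters in "a"..."e".
    var lettersAthroughE: String {
        let pattern: ClosedRange<Character> = "a"..."e"
        // Each element is a whole Character, so "e\u{301}" is tested as one unit.
        return filter { pattern.contains($0) }
    }
}

let cafe = "Cafe\u{301}"
print(cafe.lettersAthroughE) // a
```

Here "é" is rejected as a whole because it sorts after "e", so no stray "e" is left behind.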

Edit: Updated for Swift 4 & 5
Here's a straightforward method that doesn't require Foundation:
let newstring = string.filter { "0"..."9" ~= $0 }
or borrowing from @dfri's idea to make it a String extension:
extension String {
var numbers: String {
return filter { "0"..."9" ~= $0 }
}
}
print("3 little pigs".numbers) // "3"
print("1, 2, and 3".numbers) // "123"

import Foundation
let string = "a_1_b_2_3_c34"
let result = string.components(separatedBy: CharacterSet.decimalDigits.inverted).joined(separator: "")
print(result)
Output:
12334

Here is a Swift 2 example:
let str = "Hello 1, World 62"
let intString = str.componentsSeparatedByCharactersInSet(
NSCharacterSet
.decimalDigitCharacterSet()
.invertedSet)
.joinWithSeparator("") // Return a string with all the numbers

This method iterates through the string's characters and appends the digits to a new string:
class func getNumberFrom(string: String) -> String {
var number: String = ""
for c in string.characters {
// Int(String(c)) is non-nil only when c is a single digit
if let n: Int = Int(String(c)) {
// use <= so that the digit 9 is kept as well
if n >= 0 && n <= 9 {
number.append(c)
}
}
}
return number
}

For example with regular expression
let text = "string_20_certified"
let pattern = "\\d+"
let regex = try! NSRegularExpression(pattern: pattern, options: [])
if let match = regex.firstMatch(in: text, options: [], range: NSRange(location: 0, length: text.utf16.count)) {
let newString = (text as NSString).substring(with: match.range)
print(newString)
}
If there are multiple occurrences of the pattern, use matches(in:options:range:):
let matches = regex.matches(in: text, options: [], range: NSRange(location: 0, length: text.utf16.count))
for match in matches {
let newString = (text as NSString).substring(with: match.range)
print(newString)
}
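NSRegularExpression works in UTF-16 code units, so the search range has to be measured in the same units as the string's utf16 view. The sketch below, assuming Swift 4 or later with Foundation, lets NSRange(_:in:) and Range(_:in:) do the conversions; digitRuns(in:) is a hypothetical helper name:

```swift
import Foundation

// Hypothetical helper: returns every run of digits found in `text`.
func digitRuns(in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: "\\d+", options: []) else { return [] }
    // NSRange(_:in:) measures the string in UTF-16 code units.
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    return regex.matches(in: text, options: [], range: whole).compactMap { match -> String? in
        // Range(_:in:) converts the UTF-16 based NSRange back to String indices.
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

print(digitRuns(in: "string_20_certified")) // ["20"]
```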

Swift 3.0 iterate over String.Index range

The following was possible with Swift 2.2:
let m = "alpha"
for i in m.startIndex..<m.endIndex {
print(m[i])
}
a
l
p
h
a
With 3.0, we get the following error:
Type 'Range<String.Index>' (aka 'Range<String.CharacterView.Index>') does not conform to protocol 'Sequence'
I am trying to do a very simple operation with strings in swift -- simply traverse through the first half of the string (or a more generic problem: traverse through a range of a string).
I can do the following:
let s = "string"
var midIndex = s.index(s.startIndex, offsetBy: s.characters.count/2)
let r = Range(s.startIndex..<midIndex)
print(s[r])
But here I'm not really traversing the string. So the question is: how do I traverse through a range of a given string. Like:
for i in Range(s.startIndex..<s.midIndex) {
print(s[i])
}

You can traverse a string by using the indices property of the characters property like this:
let letters = "string"
let middle = letters.index(letters.startIndex, offsetBy: letters.characters.count / 2)
for index in letters.characters.indices {
// to traverse to half the length of string
if index == middle { break } // s, t, r
print(letters[index]) // s, t, r, i, n, g
}
From the documentation in section Strings and Characters - Counting Characters:
Extended grapheme clusters can be composed of one or more Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift do not each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string cannot be calculated without iterating through the string to determine its extended grapheme cluster boundaries.
emphasis is my own.
This will not work:
let secondChar = letters[1]
// error: subscript is unavailable, cannot subscript String with an Int

Another option is to use enumerated() e.g:
let string = "Hello World"
for (index, char) in string.characters.enumerated() {
print(char)
}
or for Swift 4 just use
let string = "Hello World"
for (index, char) in string.enumerated() {
print(char)
}

Use the following:
for i in s.characters.indices[s.startIndex..<s.endIndex] {
print(s[i])
}
Taken from Migrating to Swift 2.3 or Swift 3 from Swift 2.2

Iterating over characters in a string is cleaner in Swift 4:
let myString = "Hello World"
for char in myString {
print(char)
}

If you want to traverse over the characters of a String, then instead of explicitly accessing the indices of the String, you could simply work with the CharacterView of the String, which conforms to CollectionType, allowing you access to neat subsequencing methods such as prefix(_:) and so on.
/* traverse the characters of your string instance,
up to middle character of the string, where "middle"
will be rounded down for strings of an odd amount of
characters (e.g. 5 characters -> traverse through 2) */
let m = "alpha"
for ch in m.characters.prefix(m.characters.count/2) {
print(ch, ch.dynamicType)
} /* a Character
l Character */
/* round odd division up instead */
for ch in m.characters.prefix((m.characters.count+1)/2) {
print(ch, ch.dynamicType)
} /* a Character
l Character
p Character */
If you'd like to treat the characters within the loop as strings, simply use String(ch) above.
With regard to your comment below: if you'd like to access a range of the CharacterView, you could easily implement your own extension of CollectionType (specified for when Generator.Element is Character) making use of both prefix(_:) and suffix(_:) to yield a sub-collection given e.g. a half-open (from..<to) range:
/* for values to >= count, prefixed CharacterView will be suffixed until its end */
extension CollectionType where Generator.Element == Character {
func inHalfOpenRange(from: Int, to: Int) -> Self {
guard case let to = min(to, underestimateCount()) where from <= to else {
return self.prefix(0) as! Self
}
return self.prefix(to).suffix(to-from) as! Self
}
}
/* example */
let m = "0123456789"
// inHalfOpenRange(_:to:) returns a (sub-collection) CharacterView
for ch in m.characters.inHalfOpenRange(4, to: 8) {
print(ch)
} /* 4
5
6
7 */

One way to do this is:
let name = "nick" // The String which we want to print.
for i in 0..<name.count
{
// Subscripting with an Int (name[i]) is not allowed in Swift; an alternative is
let index = name.index(name.startIndex, offsetBy: i)
print(name[index])
}

Swift 4.2
Simply:
let m = "alpha"
for i in m.indices {
print(m[i])
}

Swift 4:
let mi: String = "hello how are you?"
for i in mi {
print(i)
}

To concretely demonstrate how to traverse through a range in a string in Swift 4, we can use the where filter in a for loop to filter its execution to the specified range:
func iterateStringByRange(_ sentence: String, from: Int, to: Int) {
let startIndex = sentence.index(sentence.startIndex, offsetBy: from)
let endIndex = sentence.index(sentence.startIndex, offsetBy: to)
for position in sentence.indices where (position >= startIndex && position < endIndex) {
let char = sentence[position]
print(char)
}
}
iterateStringByRange("string", from: 1, to: 3) will print t and r
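The where clause still visits every index of the string. Slicing first limits the loop to the requested range. The following is a sketch only, assuming Swift 4 or later and offsets that lie within the string; characters(of:from:to:) is a hypothetical helper name:

```swift
// Hypothetical helper; assumes 0 <= from <= to <= sentence.count.
func characters(of sentence: String, from: Int, to: Int) -> [Character] {
    let start = sentence.index(sentence.startIndex, offsetBy: from)
    let end = sentence.index(start, offsetBy: to - from)
    // Slicing yields a Substring that shares storage with `sentence`.
    return Array(sentence[start..<end])
}

for char in characters(of: "string", from: 1, to: 3) {
    print(char) // t, then r
}
```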

When iterating over the indices of characters in a string, you seldom only need the index. You probably also need the character at the given index. As specified by Paulo (updated for Swift 4+), string.indices will give you the indices of the characters. zip can be used to combine index and character:
let string = "string"
// Define the range to conform to your needs
let range = string.startIndex..<string.index(string.startIndex, offsetBy: string.count / 2)
let substring = string[range]
// If the range is in the type "first x characters", like until the middle, you can use:
// let substring = string.prefix(string.count / 2)
for (index, char) in zip(substring.indices, substring) {
// index is the index in the substring
print(char)
}
Note that using enumerated() will produce a pair of index and character, but the index is not the index of the character in the string. It is the index in the enumeration, which can be different.

Fit Swift string in database VARCHAR(255)

I'm trying to get a valid substring of at most 255 UTF8 code units from a Swift string (the idea is to be able to store it in a database VARCHAR(255) field).
The standard way of getting a substring is this:
let string: String = "Hello world!"
let startIndex = string.startIndex
let endIndex = string.startIndex.advancedBy(255, limit: string.endIndex)
let databaseSubstring1 = string[startIndex..<endIndex]
But obviously that would give me a string of 255 characters that may require more than 255 bytes in UTF8 representation.
For UTF8 I can write this :
let utf8StartIndex = string.utf8.startIndex
let utf8EndIndex = utf8StartIndex.advancedBy(255, limit: string.utf8.endIndex)
let databaseSubstringUTF8View = string.utf8[utf8StartIndex..<utf8EndIndex]
let databaseSubstring2 = String(databaseSubstringUTF8View)
But I run the risk of having half a character at the end, which means my UTF8View would not be a valid UTF8 sequence.
And as expected databaseSubstring2 is an optional string because the initializer can fail (it is defined as public init?(_ utf8: String.UTF8View)).
So I need some way of stripping invalid UTF8 code units at the end, or, if possible, a built-in way of doing what I'm trying to do here.
EDIT
Turns out that databases understand characters, so I should not try to count UTF8 code units, but rather how many characters the database will count in my string (which will probably depend on the database).
According to @OOPer, MySQL counts characters as UTF-16 code units. I have come up with the following implementation:
private func databaseStringForString(string: String, maxLength: Int = 255) -> String
{
// Start by clipping to 255 characters
let startIndex = string.startIndex
let endIndex = startIndex.advancedBy(maxLength, limit: string.endIndex)
var string = string[startIndex..<endIndex]
// Remove characters from the end one by one until we have no more than
// the maximum number of UTF-16 code units
while (string.utf16.count > maxLength) {
let startIndex = string.startIndex
let endIndex = string.endIndex.advancedBy(-1, limit: startIndex)
string = string[startIndex..<endIndex]
}
return string
}
The idea is to count UTF-16 code units, but remove whole characters from the end (that is, what Swift considers a character).
EDIT 2
Still according to @OOPer, PostgreSQL counts characters as Unicode scalars, so this should probably work:
private func databaseStringForString(string: String, maxLength: Int = 255) -> String
{
// Start by clipping to 255 characters
let startIndex = string.startIndex
let endIndex = startIndex.advancedBy(maxLength, limit: string.endIndex)
var string = string[startIndex..<endIndex]
// Remove characters from the end one by one until we have no more than
// the maximum number of Unicode scalars
while (string.unicodeScalars.count > maxLength) {
let startIndex = string.startIndex
let endIndex = string.endIndex.advancedBy(-1, limit: startIndex)
string = string[startIndex..<endIndex]
}
return string
}

As I wrote in my comment, you may need your databaseStringForString(_:maxLength:) to truncate your string to match the length limit of your DBMS, for example PostgreSQL with utf8 or MySQL with utf8mb4.
And I would write the same functionality as your EDIT 2:
func databaseStringForString(string: String, maxUnicodeScalarLength: Int = 255) -> String {
let start = string.startIndex
for index in start..<string.endIndex {
if string[start..<index.successor()].unicodeScalars.count > maxUnicodeScalarLength {
return string[start..<index]
}
}
return string
}
This may be less efficient, but a little bit shorter.
let s = "abc\u{1D122}\u{1F1EF}\u{1F1F5}" //->"abc𝄢🇯🇵"
let dbus = databaseStringForString(s, maxUnicodeScalarLength: 5) //->"abc𝄢"(=="abc\u{1D122}")
So, someone who works with MySQL with utf8(=utf8mb3) needs something like this:
func databaseStringForString(string: String, maxUTF16Length: Int = 255) -> String {
let start = string.startIndex
for index in start..<string.endIndex {
if string[start..<index.successor()].utf16.count > maxUTF16Length {
return string[start..<index]
}
}
return string
}
let dbu16 = databaseStringForString(s, maxUTF16Length: 4) //->"abc"
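For the original goal of a byte limit, the same "drop whole characters" idea can be applied to the UTF-8 view. This is a minimal sketch assuming Swift 4 or later; truncated(toUTF8Bytes:) is a hypothetical name, and whether your column limit is really measured in bytes depends on the database, as discussed above:

```swift
extension String {
    // Hypothetical helper: longest prefix made of whole Characters whose
    // UTF-8 encoding needs at most `maxBytes` bytes.
    func truncated(toUTF8Bytes maxBytes: Int) -> String {
        var result = ""
        var used = 0
        for character in self {
            let size = String(character).utf8.count
            // Stop before a Character that would exceed the budget,
            // so a multi-byte sequence is never cut in half.
            if used + size > maxBytes { break }
            used += size
            result.append(character)
        }
        return result
    }
}

let sample = "abc\u{1D122}\u{1F1EF}\u{1F1F5}"  // "abc𝄢🇯🇵", 15 bytes in UTF-8
print(sample.truncated(toUTF8Bytes: 10))       // abc𝄢
```

Because the result is built one Character at a time, it runs in a single pass and is always a valid string.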

NSCharacterSet.characterIsMember() with Swift's Character type

Imagine you've got an instance of Swift's Character type, and you want to determine whether it's a member of an NSCharacterSet. NSCharacterSet's characterIsMember method takes a unichar, so we need to get from Character to unichar.
The only solution I could come up with is the following, where c is my Character:
let u: unichar = ("\(c)" as NSString).characterAtIndex(0)
if characterSet.characterIsMember(u) {
dude.abide()
}
I looked at Character but nothing leapt out at me as a way to get from it to unichar. This may be because Character is more general than unichar, so a direct conversion wouldn't be safe, but I'm only guessing.
If I were iterating a whole string, I'd do something like this:
let s = myString as NSString
for i in 0..<countElements(myString) {
let u = s.characterAtIndex(i)
if characterSet.characterIsMember(u) {
dude.abide()
}
}
(Warning: The above is pseudocode and has never been run by anyone ever.) But this is not really what I'm asking.

My understanding is that unichar is a typealias for UInt16. A unichar is just a number.
I think that the problem that you are facing is that a Character in Swift can be composed of more than one unicode "characters". Thus, it cannot be converted to a single unichar value because it may be composed of two unichars. You can decompose a Character into its individual unichar values by casting it to a string and using the utf16 property, like this:
let c: Character = "a"
let s = String(c)
var codeUnits = [unichar]()
for codeUnit in s.utf16 {
codeUnits.append(codeUnit)
}
This will produce an array - codeUnits - of unichar values.
EDIT: Initial code had for codeUnit in s when it should have been for codeUnit in s.utf16
You can tidy things up and test for whether or not each individual unichar value is in a character set like this:
let char: Character = "\u{63}\u{20dd}" // This is a 'c' inside of an enclosing circle
for codeUnit in String(char).utf16 {
if NSCharacterSet(charactersInString: "c").characterIsMember(codeUnit) {
dude.abide()
} // dude will abide() for codeUnits[0] = "c", but not for codeUnits[1] = 0x20dd (the enclosing circle)
}
Or, if you are only interested in the first (and often only) unichar value:
if NSCharacterSet(charactersInString: "c").characterIsMember(String(char).utf16[0]) {
dude.abide()
}
Or, wrap it in a function:
func isChar(char: Character, inSet set: NSCharacterSet) -> Bool {
return set.characterIsMember(String(char).utf16[0])
}
let xSet = NSCharacterSet(charactersInString: "x")
isChar("x", inSet: xSet) // This returns true
isChar("y", inSet: xSet) // This returns false
Now make the function check for all unichar values in a composed character - that way, if you have a composed character, the function will only return true if both the base character and the combining character are present:
func isChar(char: Character, inSet set: NSCharacterSet) -> Bool {
var found = true
for ch in String(char).utf16 {
if !set.characterIsMember(ch) { found = false }
}
return found
}
let acuteA: Character = "\u{e1}" // An "a" with an accent
let acuteAComposed: Character = "\u{61}\u{301}" // Also an "a" with an accent
// A character set that includes both the composed and uncomposed unichar values
let charSet = NSCharacterSet(charactersInString: "\u{61}\u{301}\u{e1}")
isChar(acuteA, inSet: charSet) // returns true
isChar(acuteAComposed, inSet: charSet) // returns true (both unichar values were matched)
The last version is important. If your Character is a composed character you have to check for the presence of both the base character ("a") and the combining character (the acute accent) in the character set or you will get false positives.

I would treat the Character as a String and let Cocoa do all the work:
func charset(cset:NSCharacterSet, containsCharacter c:Character) -> Bool {
let s = String(c)
let ix = s.startIndex
let ix2 = s.endIndex
let result = s.rangeOfCharacterFromSet(cset, options: nil, range: ix..<ix2)
return result != nil
}
And here's how to use it:
let cset = NSCharacterSet.lowercaseLetterCharacterSet()
let c : Character = "c"
let ok = charset(cset, containsCharacter:c) // true

Do it all in a one liner:
validCharacterSet.contains(String(char).unicodeScalars.first!)
(Swift 3)

Due to changes in Swift 3.0, matt's answer no longer works, so here is a working version (as an extension):
private extension NSCharacterSet {
func containsCharacter(c: Character) -> Bool {
let s = String(c)
let ix = s.startIndex
let ix2 = s.endIndex
let result = s.rangeOfCharacter(from: self as CharacterSet, options: [], range: ix..<ix2)
return result != nil
}
}

Changes in Swift 3.0 mean you no longer need to bridge to NSCharacterSet; you can use Swift's native CharacterSet.
You could do something similar to Jiri's answer directly:
extension CharacterSet {
func contains(_ character: Character) -> Bool {
let string = String(character)
return string.rangeOfCharacter(from: self, options: [], range: string.startIndex..<string.endIndex) != nil
}
}
or do:
func contains(_ character: Character) -> Bool {
let otherSet = CharacterSet(charactersIn: String(character))
return self.isSuperset(of: otherSet)
}
Note: the above crashes and doesn't work due to https://bugs.swift.org/browse/SR-3667. Not sure CharacterSet gets the kind of love it needs.
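An alternative that avoids isSuperset(of:) is to test each Unicode scalar of the Character with CharacterSet.contains(_:). The following is only a sketch, assuming Swift 4.2 or later (for allSatisfy) with Foundation; isMember(_:of:) is a hypothetical helper name:

```swift
import Foundation

// Hypothetical helper: true only if every Unicode scalar that makes up
// `character` is a member of `set`.
func isMember(_ character: Character, of set: CharacterSet) -> Bool {
    return String(character).unicodeScalars.allSatisfy { set.contains($0) }
}

let composedA: Character = "\u{61}\u{301}" // "a" plus a combining acute accent
print(isMember(composedA, of: CharacterSet(charactersIn: "a\u{301}"))) // true
print(isMember(composedA, of: CharacterSet(charactersIn: "a")))        // false
```

Requiring every scalar to match follows the earlier advice about composed characters: a match on the base letter alone would be a false positive.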