German character ß uppercased to "SS" - Swift

I figured out that "ß" is converted to "SS" when using uppercased(). But I want to compare two strings for equality without case sensitivity.
So comparing "gruß" with "GRUß" should match, just as comparing "gru" with "Gru" does.
There is no uppercase "ß" in the German language! And since I don't know which other characters are missing in the corresponding languages, I cannot simply filter out all the characters that have no 1:1 uppercase counterpart.
What can I do?

Use caseInsensitiveCompare() instead of converting the strings
to upper or lowercase:
let s1 = "gruß"
let s2 = "GRUß"
let eq = s1.caseInsensitiveCompare(s2) == .orderedSame
print(eq) // true
This compares the strings in a case-insensitive way according to
the Unicode standard.
There is also localizedCaseInsensitiveCompare() which does
a comparison according to the current locale, and
s1.compare(s2, options: .caseInsensitive, locale: ...)
for a case-insensitive comparison according to an arbitrary given
locale.
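For example, a minimal sketch of all three approaches (the German locale identifier "de" is just an illustrative choice, not something prescribed by the API):

import Foundation

let s1 = "gruß"
let s2 = "GRUß"

// Case-insensitive comparison per the Unicode standard:
let eq = s1.caseInsensitiveCompare(s2) == .orderedSame

// Case-insensitive comparison according to the current locale:
let eqLocalized = s1.localizedCaseInsensitiveCompare(s2) == .orderedSame

// Case-insensitive comparison according to an explicitly given locale:
let eqGerman = s1.compare(s2, options: .caseInsensitive,
                          locale: Locale(identifier: "de")) == .orderedSame

print(eq, eqLocalized, eqGerman) // true true true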

Well, "GRUß" is non-sensical in the first place for the reasons you state. You can't throw arbitrary data at a computer and expect it to process it in a sane way :-) If you have to deal with invalid input, you should probably have a preprocessing phase which cleans up crap you know about.
Having said that, this works (Swift 3.0):
let grusz = "GRUß"
let gruss = "GRUSS"
if grusz.compare(gruss, options: .caseInsensitive) == .orderedSame {
    print("MATCHES")
}

Related

Solving words that match when printed but do not match when using .contains

A common issue many developers may face while building an app with localization (especially when it involves Arabic or another RTL-supported language) is that search does not behave as expected. An example of the issue:
print(listOfName.contains(inputName)); // prints false
To overcome this issue, I compared the encoded search strings (in both languages) with the encoded result strings, and only then realized that some invisible special characters had somehow been added. In my instance, the characters were RTL [226, 128, 143] and LTR [226, 128, 142]. Before encoding, the search and result strings looked identical. After identifying the extra characters, I did the following:
import 'dart:convert';

// Decode the invisible directional marks from their UTF-8 byte sequences:
var rightToLeftMark = utf8.decode([226, 128, 143]); // U+200F RIGHT-TO-LEFT MARK
var leftToRightMark = utf8.decode([226, 128, 142]); // U+200E LEFT-TO-RIGHT MARK

// Strip both marks from the search string before comparing:
inputName = inputName
    .replaceAll(rightToLeftMark, '')
    .replaceAll(leftToRightMark, '');
print(listOfName.contains(inputName)); // prints true
As mentioned earlier, in my case the extra characters were the RTL and LTR marks. In your case, you may find something entirely different.
You can look up what each encoded set of characters represents on this page: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=8192&number=128&utf8=dec
Flutter Version: 1.22.6
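Since this thread is otherwise about Swift, here is a minimal equivalent sketch in Swift (the names listOfName and inputName are placeholders, and it assumes the same invisible directional marks are the culprit):

import Foundation

let leftToRightMark = "\u{200E}" // LEFT-TO-RIGHT MARK
let rightToLeftMark = "\u{200F}" // RIGHT-TO-LEFT MARK

let listOfName = ["عمر", "ليلى"]
var inputName = "عمر\u{200F}" // contaminated with an invisible mark

inputName = inputName
    .replacingOccurrences(of: leftToRightMark, with: "")
    .replacingOccurrences(of: rightToLeftMark, with: "")

print(listOfName.contains(inputName)) // true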

Count leading tabs in Swift string

I need to count the number of leading tabs in a Swift string. I know there are fairly simple solutions (e.g. looping over the string until a non-tab character is encountered) but I am looking for a more elegant solution.
I have attempted to use a regex such as ^\\t* along with the .numberOfMatches method but this detects all the tab characters as one match. For example, if the string has three leading tabs then that method just returns 1. Is there a way to write a regex that treats each individual tab character as a single match?
Also open to other ways of approaching this without using a regex.
Here is a non-regex solution:
let count = someString.prefix(while: {$0 == "\t"}).count
You may use
\G\t
Here,
\G matches the position at the start of the string or at the end of the previous match, and
\t matches a single tab character.
Swift test:
let string = "\t\t123"
let regex = try! NSRegularExpression(pattern: "\\G\\t", options: [])
let numberOfOccurrences = regex.numberOfMatches(in: string, range: NSRange(string.startIndex..., in: string))
print(numberOfOccurrences) // => 2

Swift 3 replacingOccurrences exact

I'm having trouble with my replacingOccurrences function. I have a string like so:
let x = "john, johnny, johnney"
What I need to do is remove only "john"
So I have this code:
y = "john"
x = x.replacingOccurrences(of: y, with: " ", options: NSString.CompareOptions.literal, range: nil)
The problem I get is that this removes All instances of john... Also I thought about setting the range, however, that wont really solve my problem because the names could be in a different order.
Is there a CompareOptions func that I'm overlooking for "exact" or do I need to create a regular expression?
Sincerely,
Denis Angell
You should use a regular expression to match the exact word; you just need to put the desired word between \b word boundaries. Try it like this:
let x = "johnny, john, johnney"
let y = "john"
let z = x.replacingOccurrences(of: "\\b\(y)\\b", with: " ", options: .regularExpression) // "johnny, , johnney"
let x = "johnny &john johnney"
let y = "&john"
let z = x.replacingOccurrences(of: y + "\\b", with: " ", options: .regularExpression)
A regular expression is probably your best option, because there is no general rule for "exact" that would work in all languages. In English, and most Romance languages, you would define "exact" as having no alphabetical characters before or after the word to replace.
A regular expression can express that relatively simply for English, as there are no complex letter combinations and a-z covers pretty much all of it. In other languages you would have to worry about diacritical marks (French, other Latin-script languages), in Arabic you'd need to take into account the combined letter A (Aleph) at the beginning of some words, and so on.
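For instance, a quick check (a sketch; NSRegularExpression's ICU \b treats accented letters as word characters):

import Foundation

let text = "résumé sum"
// \b sees the letters around "sum" inside "résumé" as word characters, so no
// boundary exists there; only the standalone word is replaced:
let result = text.replacingOccurrences(of: "\\bsum\\b", with: "total",
                                       options: .regularExpression)
print(result) // résumé total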

Is there a clean way to specify character literals in Swift?

Swift seems to be trying to deprecate the notion of a string being composed of an array of atomic characters, which makes sense for many uses, but there's an awful lot of programming that involves picking through data structures that are ASCII for all practical purposes, particularly with file I/O. The absence of a built-in language feature to specify a character literal seems like a gaping hole, i.e. there is no analog of the C/Java/etc-esque:
String foo="a"
char bar='a'
This is rather inconvenient, because even if you convert your strings into arrays of characters, you can't do things like:
let ch:unichar = arrayOfCharacters[n]
if ch >= 'a' && ch <= 'z' {...whatever...}
One rather hacky workaround is to do something like this:
let LOWCASE_A = ("a" as NSString).characterAtIndex(0)
let LOWCASE_Z = ("z" as NSString).characterAtIndex(0)
if ch >= LOWCASE_A && ch <= LOWCASE_Z {...whatever...}
This works, but obviously it's pretty ugly. Does anyone have a better way?
Characters can be created from Strings as long as those Strings are only made up of a single character. And, since Character implements ExtendedGraphemeClusterLiteralConvertible, Swift will do this for you automatically on assignment. So, to create a Character in Swift, you can simply do something like:
let ch: Character = "a"
Then, you can use the contains method of an IntervalType (generated with the Range operators) to check if a character is within the range you're looking for:
if ("a"..."z").contains(ch) {
/* ... whatever ... */
}
Example:
let ch: Character = "m"
if ("a"..."z").contains(ch) {
println("yep")
} else {
println("nope")
}
Outputs:
yep
Update: As @MartinR pointed out, the ordering of Swift characters is based on Unicode Normalization Form D, which is not in the same order as ASCII character codes. In your specific case, there are more characters between a and z than in straight ASCII ("ä" for example). See @MartinR's answer here for more info.
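A quick way to see that this Character range is not an ASCII check (a small sketch based on the decomposed ordering described above):

let umlautA: Character = "ä"
// "a" < "ä" < "z" under Swift's Unicode-based Character ordering,
// even though "ä" is not an ASCII letter:
print(("a"..."z").contains(umlautA)) // true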
If you need to check whether a character is between two ASCII character codes, then you may need to do something like your original workaround. However, you'll also have to convert ch to a unichar and not a Character for it to work (see this question for more info on Character vs unichar):
let a_code = ("a" as NSString).characterAtIndex(0)
let z_code = ("z" as NSString).characterAtIndex(0)
let ch_code = (String(ch) as NSString).characterAtIndex(0)
if (a_code...z_code).contains(ch_code) {
    println("yep")
} else {
    println("nope")
}
Or, the even more verbose way without using NSString:
let startCharScalars = "a".unicodeScalars
let startCode = startCharScalars[startCharScalars.startIndex]
let endCharScalars = "z".unicodeScalars
let endCode = endCharScalars[endCharScalars.startIndex]
let chScalars = String(ch).unicodeScalars
let chCode = chScalars[chScalars.startIndex]
if (startCode...endCode).contains(chCode) {
    println("yep")
} else {
    println("nope")
}
Note: Both of those examples only work if the character only contains a single code point, but, as long as we're limited to ASCII, that shouldn't be a problem.
If you need C-style ASCII literals, you can just do this:
let chr = UInt8(ascii:"A") // == UInt8( 0x41 )
Or if you need 32-bit Unicode literals you can do this:
let unichr1 = UnicodeScalar("A").value // == UInt32( 0x41 )
let unichr2 = UnicodeScalar("é").value // == UInt32( 0xe9 )
let unichr3 = UnicodeScalar("😀").value // == UInt32( 0x1f600 )
Or 16-bit:
let unichr1 = UInt16(UnicodeScalar("A").value) // == UInt16( 0x41 )
let unichr2 = UInt16(UnicodeScalar("é").value) // == UInt16( 0xe9 )
All of these initializers will be evaluated at compile time, so it really is using an immediate literal at the assembly instruction level.
The feature you want was proposed for Swift 5.1, but the proposal was rejected for a few reasons:
Ambiguity
The proposal as written, in the current Swift ecosystem, would have allowed for expressions like 'x' + 'y' == "xy", which was not intended (the proper syntax would be "x" + "y" == "xy").
Amalgamation
The proposal was two in one.
First, it proposed a way to introduce single-quote literals into the language.
Second, it proposed that these would be convertible to numerical types to deal with ASCII values and Unicode codepoints.
These are both good proposals, and it was recommended that this be split into two and re-proposed. Those follow-up proposals have not yet been formalized.
Disagreement
It never reached consensus whether the default type of 'x' would be a Character or a Unicode.Scalar. The proposal went with Character, citing the Principle of Least Surprise, despite this lack of consensus.
You can read the full rejection rationale here.
Had it been accepted, the syntax would have looked like this:
let myChar = 'f' // Type is Character, value is solely the unicode U+0066 LATIN SMALL LETTER F
let myInt8: Int8 = 'f' // Type is Int8, value is 102 (0x66)
let myUInt8Array: [UInt8] = [ 'a', 'b', '1', '2' ] // Type is [UInt8], value is [ 97, 98, 49, 50 ] ([ 0x61, 0x62, 0x31, 0x32 ])
switch someUInt8 {
case 'a' ... 'f': return "Lowercase hex letter"
case 'A' ... 'F': return "Uppercase hex letter"
case '0' ... '9': return "Hex digit"
default: return "Non-hex character"
}
It also looks like you can use the following syntax:
Character("a")
This will create a Character from the specified single character string.
I have only tested this in Swift 4 and Xcode 10.1
Why exhume a 7-year-old post? Fun, I guess? Seriously though, I think I can add to the discussion.
It is not a gaping hole, or rather, it is a deliberate gaping hole that explicitly discourages conflating a string of text with a sequence of ASCII bytes.
You absolutely can pick apart a String. A String implements BidirectionalCollection and has many ways to manipulate the atoms. See: https://developer.apple.com/documentation/swift/string.
But you have to get used to the more generalized notion of a String. It can be picked apart from the user perspective, as a sequence of grapheme clusters, each (usually) with a visually separable appearance, or from the encoding perspective, which can be one of several (UTF-32, UTF-16, UTF-8).
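As a minimal illustration of those perspectives (a sketch; the explicit combining accent makes the counts deterministic):

let s = "e\u{301}" // "é" as LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
print(s.count)                // 1: one grapheme cluster (the user-perceived character)
print(s.unicodeScalars.count) // 2: two Unicode code points
print(s.utf16.count)          // 2: two UTF-16 code units
print(s.utf8.count)           // 3: three UTF-8 bytes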
At the risk of overanalyzing the wording of your question:
A data structure is conceptual, and independent of encoding in storage
A data structure encoded as an ASCII string is just one kind of ASCII string
By design the encoding of ASCII values 0-127 will have an identical encoding in UTF-8, so loading that stream with a UTF8 API is fine
A data structure encoded as a string where fields of the structure have UTF-8 Unicode string values is not an ASCII string, but a UTF-8 string itself
A string is either ASCII-encoded or not; "for practical purposes" isn't a meaningful qualifier. A UTF-8 database field where 99.99% of the text falls in the ASCII range (where encodings will match), but occasionally doesn't, will present some nasty bug opportunities.
Instead of a terse and low-level equivalence of fixed-width integers and English-only text, Swift has a richer API that forces more explicit naming of the involved categories and entities. If you want to deal with ASCII, there's a name (method) for that, and if you want to deal with human sub-categories, there's a name for that, too, and they're totally independent of one another. There is a strong move away from ASCII and the English-centric string handling model of C. This is factual, not evangelizing, and it can present an irksome learning curve.
(This is aimed at new-comers, acknowledging the OP probably has years of experience with this now.)
For what you're trying to do there, consider:
let foo = "abcDeé#¶œŎO!##"
foo.forEach { c in
    print((c.isASCII ? "\(c) is ascii with value \(c.asciiValue ?? 0); " : "\(c) is not ascii; ")
        + (c.isLetter ? "\(c) is a letter" : "\(c) is not a letter"))
}
a is ascii with value 97; a is a letter
b is ascii with value 98; b is a letter
c is ascii with value 99; c is a letter
D is ascii with value 68; D is a letter
e is ascii with value 101; e is a letter
é is not ascii; é is a letter
@ is ascii with value 64; @ is not a letter
¶ is not ascii; ¶ is not a letter
œ is not ascii; œ is a letter
Ŏ is not ascii; Ŏ is a letter
O is ascii with value 79; O is a letter
! is ascii with value 33; ! is not a letter
@ is ascii with value 64; @ is not a letter
# is ascii with value 35; # is not a letter

What does it mean that string and character comparisons in Swift are not locale-sensitive?

I started learning the Swift language and I am very curious: what does it mean that string and character comparisons in Swift are not locale-sensitive? Does it mean that all characters in Swift are stored as UTF-8?
(All code examples updated for Swift 3 now.)
Comparing Swift strings with < does a lexicographical comparison
based on the so-called "Unicode Normalization Form D" (which can be computed with
decomposedStringWithCanonicalMapping).
For example, the decomposition of
"ä" = U+00E4 = LATIN SMALL LETTER A WITH DIAERESIS
is the sequence of two Unicode code points
U+0061,U+0308 = LATIN SMALL LETTER A + COMBINING DIAERESIS
For demonstration purposes, I have written a small String extension which dumps the
contents of the String as an array of Unicode code points:
extension String {
    var unicodeData: String {
        return self.unicodeScalars.map {
            String(format: "%04X", $0.value)
        }.joined(separator: ",")
    }
}
Now let's take some strings and sort them with <:
let someStrings = ["ǟψ", "äψ", "ǟx", "äx"].sorted()
print(someStrings)
// ["a", "ã", "ă", "ä", "ǟ", "b"]
and dump the Unicode code points of each string (in original and decomposed
form) in the sorted array:
for str in someStrings {
    print("\(str) \(str.unicodeData) \(str.decomposedStringWithCanonicalMapping.unicodeData)")
}
The output
a 0061 0061
ã 00E3 0061,0303
ă 0103 0061,0306
ä 00E4 0061,0308
ǟ 01DF 0061,0308,0304
b 0062 0062
nicely shows that the comparison is done by a lexicographic ordering of the Unicode
code points in the decomposed form.
This is also true for strings of more than one character, as the following example
shows. With
let someStrings = ["ǟψ", "äψ", "ǟx", "äx"].sorted()
the output of above loop is
äx 00E4,0078 0061,0308,0078
ǟx 01DF,0078 0061,0308,0304,0078
ǟψ 01DF,03C8 0061,0308,0304,03C8
äψ 00E4,03C8 0061,0308,03C8
which means that
"äx" < "ǟx", but "äψ" > "ǟψ"
(which was at least unexpected for me). The decomposed forms explain it: the first difference pits the scalar following "a + combining diaeresis" against the combining macron U+0304, and "x" (U+0078) sorts before U+0304 while "ψ" (U+03C8) sorts after it.
Finally, let's compare this with a locale-sensitive ordering, for example Swedish:
let locale = Locale(identifier: "sv") // svenska
var someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"]
someStrings.sort {
    $0.compare($1, locale: locale) == .orderedAscending
}
print(someStrings)
// ["a", "ă", "ã", "b", "ä", "ǟ"]
As you see, the result is different from the Swift < sorting.
Changing the locale can change the alphabetical order: a comparison may treat characters as equivalent under one locale that are distinct under another (so a case-sensitive comparison can appear case-insensitive), and more generally, the relative order of two strings can differ between locales.
Lexicographical ordering and locale-sensitive ordering can be different. You can see an example of this in this question:
Sorting scala list equivalent to C# without changing C# order
In that specific case the locale-sensitive ordering placed _ before 1, whereas in a lexicographical ordering it's the opposite.
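A minimal Swift sketch of that difference (assuming the default "en" collation orders punctuation before digits, as in that question):

import Foundation

let items = ["_", "1"]

// Lexicographical (Unicode scalar) ordering: "1" (U+0031) < "_" (U+005F)
print(items.sorted()) // ["1", "_"]

// Locale-sensitive ordering: collation places "_" before "1"
let byLocale = items.sorted {
    $0.compare($1, locale: Locale(identifier: "en")) == .orderedAscending
}
print(byLocale) // ["_", "1"]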
Swift comparison uses lexicographical ordering.