Split a string on all characters except some with a regular expression - swift

I have to split a long string containing song lyrics into lines and then split each line into words. I'm going to hold this information in a two-dimensional array.
I've seen some similar questions, and they were solved using [NSRegularExpression](https://developer.apple.com/documentation/foundation/nsregularexpression),
but I can't seem to find a regular expression that means "everything except something", which is what I want to split on when splitting a string into words.
More specifically, I want to split on everything except alphanumerics, ' and -. In Java this regular expression is [^\\w'-]+
Below is the string, followed by my Swift attempt at this task (I just split on whitespace instead of actually splitting on words with "[^\w'-]+", as I can't figure out how to do it).
1 Is this the real life?
2 Is this just fantasy?
3 Caught in a landslide,
4 No escape from reality.
5
6 Open your eyes,
7 Look up to the skies and see,
8 I'm just a poor boy, I need no sympathy,
9 Because I'm easy come, easy go,
10 Little high, little low,
11 Any way the wind blows doesn't really matter to me, to me.
12
13 Mama, just killed a man,
(etc.)
let lines = s?.components(separatedBy: "\n")
var all_words = [[String]]()
for i in 0..<lines!.count {
let words = lines![i].components(separatedBy: " ")
let new_words = words.filter {$0 != ""}
all_words.append(new_words)
}

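I suggest using the reverse pattern, [\w'-]+, to match the strings you need instead of splitting on what you don't need, and collecting the results with a matches(for:in:) helper (a small function that runs NSRegularExpression over the string and returns each matched substring).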
Your code will look like:
for i in 0..<lines!.count {
let new_words = matches(for: "[\\w'-]+", in: lines![i])
all_words.append(new_words)
}
The following line of code:
print(matches(for: "[\\w'-]+", in: "11 Any way the wind blows doesn't really matter to me, to me."))
yields ["11", "Any", "way", "the", "wind", "blows", "doesn\'t", "really", "matter", "to", "me", "to", "me"].
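The matches(for:in:) helper is not defined anywhere in this thread, so here is a minimal sketch of what it could look like. The name and signature are simply taken from the calls above; the body is an assumption built on NSRegularExpression, not necessarily the version the answer had in mind.

```swift
import Foundation

// Hypothetical helper matching the calls above: returns every substring
// of `text` that matches `pattern`, in order of appearance.
func matches(for pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return []  // an invalid pattern is treated as "no matches"
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: fullRange).compactMap { result in
        // Convert the NSRange back to a Swift range before slicing.
        Range(result.range, in: text).map { String(text[$0]) }
    }
}

// Two lines of lyrics, split into lines and then into words.
let lyrics = "Is this the real life?\nI'm just a poor boy, I need no sympathy,"
let allWords = lyrics
    .components(separatedBy: "\n")
    .map { matches(for: "[\\w'-]+", in: $0) }
print(allWords)
```

Mapping over the lines also avoids the index loop and the force unwraps from the question.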

One simple solution is to replace the sequences with a special character first and then split on that character:
let words = string
.replacingOccurrences(of: "[^\\w'-]+", with: "|", options: .regularExpression)
.split(separator: "|")
print(words)
However, if you can, use the system function to enumerate words.
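As a rough sketch of that last suggestion, assuming "the system function" means enumerateSubstrings(in:options:_:) with the .byWords option: the words(in:) wrapper below is a hypothetical name. Word boundaries are decided by the system, so results may differ from the [\w'-]+ pattern, for example around hyphens.

```swift
import Foundation

// Hypothetical wrapper: collects the words the system tokenizer finds in `line`.
func words(in line: String) -> [String] {
    var result: [String] = []
    line.enumerateSubstrings(in: line.startIndex..<line.endIndex,
                             options: .byWords) { word, _, _, _ in
        if let word = word {
            result.append(word)
        }
    }
    return result
}

print(words(in: "Look up to the skies and see,"))
```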

Related

Count leading tabs in Swift string

I need to count the number of leading tabs in a Swift string. I know there are fairly simple solutions (e.g. looping over the string until a non-tab character is encountered) but I am looking for a more elegant solution.
I have attempted to use a regex such as ^\\t* along with the .numberOfMatches method but this detects all the tab characters as one match. For example, if the string has three leading tabs then that method just returns 1. Is there a way to write a regex that treats each individual tab character as a single match?
Also open to other ways of approaching this without using a regex.
Here is a non-regex solution
let count = someString.prefix(while: {$0 == "\t"}).count
You may use
\G\t
Here,
\G - matches a string start position or end of the previous match position, and
\t - matches a single tab.
Swift test:
let string = "\t\t123"
let regex = try! NSRegularExpression(pattern: "\\G\t", options: [])
let numberOfOccurrences = regex.numberOfMatches(in: string, range: NSRange(string.startIndex..., in: string))
print(numberOfOccurrences) // => 2
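Another option, sketched here as an assumption rather than something proposed in the thread, is to keep the original ^\t* pattern and read the length of its single match instead of counting matches. The leadingTabCount name is made up for the example.

```swift
import Foundation

// Hypothetical helper: the pattern ^\t* matches once, at the start of the
// string, and the length of that match is the number of leading tabs.
func leadingTabCount(_ text: String) -> Int {
    let regex = try! NSRegularExpression(pattern: "^\\t*")
    let fullRange = NSRange(text.startIndex..., in: text)
    // A tab is a single UTF-16 unit, so the NSRange length equals the tab count.
    return regex.firstMatch(in: text, range: fullRange)?.range.length ?? 0
}

print(leadingTabCount("\t\t\t123"))  // 3
print(leadingTabCount("123\t"))      // 0
```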

How to remove spaces from a string in Swift?

I have the need to remove leading and trailing spaces around a punctuation character.
For example: Hello , World ... I 'm a newbie iOS Developer.
And I'd like to have: Hello, World... I'm a newbie iOS Developer.
How can I do this? I tried to get components of the string and enumerate it by sentences. But that is not what I need
Rob's answer is great, but you can trim it down quite a lot by taking advantage of the \p{Po} regular expression class. Getting rid of the spaces around punctuation then becomes a single regular expression replace:
import Foundation
let input = "Hello , World ... I 'm a newbie iOS Developer."
let result = input.replacingOccurrences(of: "\\s*(\\p{Po}\\s?)\\s*",
with: "$1",
options: [.regularExpression])
print(result) // "Hello, World... I'm a newbie iOS Developer."
Rob's answer also tries to trim leading/trailing spaces, but your input doesn't have any of those. If you do care about that you can just call result.trimmingCharacters(in: .whitespacesAndNewlines) on the result.
Here's an explanation for the regular expression. Removing the double-escapes it looks like
\s*(\p{Po}\s?)\s*
It is made up of the following components:
\s* - Match zero or more whitespace characters (and throw them away)
(…) - Capturing group. Anything inside this group is preserved by the replacement (the $1 in the replacement refers to this group).
\p{Po} - Match a single character in the "Other_Punctuation" unicode category. This includes things like ., ', and …, but excludes things like ( or -.
\s? - Match a single optional whitespace character. This preserves the space after periods (or ellipses).
\s* - Once again, match zero or more whitespace characters (and throw them away). This is what collapses a run of several spaces after a comma, so that only the single space kept by \s? remains before World.
For Swift 3 or 4 you can use :
let trimmedString = string.trimmingCharacters(in: .whitespaces)
This is a really wonderful problem and a shame that it isn't easier to do in Swift today (someday it will be, but not today).
I kind of hate this code, but I'm getting on a plane for 20 hours, and don't have time to make it nicer. This may at least get you started using NSMutableString. It'd be nice to work in String, and Swift hates regular expressions, so this is kind of hideous, but at least it's a start.
import Foundation
let input = "Hello, World ... I 'm a newbie iOS Developer."
let adjustments = [
(pattern: "\\s*(\\.\\.\\.|\\.|,)\\s*", replacement: "$1 "), // ellipsis or period or comma has trailing space
(pattern: "\\s*'\\s*", replacement: "'"), // apostrophe has no extra space
(pattern: "^\\s+|\\s+$", replacement: ""), // remove leading or trailing space
]
let mutableString = NSMutableString(string: input)
for (pattern, replacement) in adjustments {
let re = try! NSRegularExpression(pattern: pattern)
re.replaceMatches(in: mutableString,
options: [],
range: NSRange(location: 0, length: mutableString.length),
withTemplate: replacement)
}
mutableString // "Hello, World... I'm a newbie iOS Developer."
Regular expressions can be very confusing when you first encounter them. A few hints at reading these:
The specific language Foundation uses is described by ICU.
Backslash (\) means "the next character is special" for a regex. But inside a Swift string, backslash means "the next character is special" for the string. So you have to double them all.
\s means "a whitespace character"
\s* means "zero or more whitespace characters"
\s+ means "one or more whitespace characters"
$1 means "the thing we matched in parentheses"
| means "or"
^ means "start of string"
$ means "end of string"
. means "any character" so to mean "an actual dot" you have to type "\\." in a Swift string.
Notice that I check for both "..." and "." in the same regular expression. You kind of have to do something like that, or else the "." will match three times inside the "...". Another approach would be to first replace "..." with "…" (the single ellipsis character, typed on a Mac by pressing Opt-;). Then "…" is a one-character punctuation. (You could also decide to re-expand all ellipsis back to dot-dot-dot at the end of the process.)
Something like this is probably how I'd do it in real life, get it done and shipped, but it may be worth the pain/practice to try to build this as a character-by-character state machine, walking one character at a time, and keeping track of your current state.
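For anyone who wants to try the state-machine route, here is a minimal sketch under a few assumptions: only ".", "," and "'" count as punctuation, runs of whitespace collapse to one space, and leading or trailing whitespace is dropped. The tidyPunctuation name and the two state flags are made up for illustration.

```swift
// Hypothetical single-pass cleaner: walks the input one character at a time
// and tracks two bits of state.
func tidyPunctuation(_ input: String) -> String {
    let punctuation: Set<Character> = [".", ",", "'"]
    var output = ""
    var pendingSpace = false     // whitespace was seen but not yet emitted
    var afterApostrophe = false  // whitespace right after an apostrophe is dropped

    for character in input {
        if character.isWhitespace {
            if !afterApostrophe { pendingSpace = true }
        } else if punctuation.contains(character) {
            pendingSpace = false  // discard whitespace before punctuation
            output.append(character)
            afterApostrophe = (character == "'")
        } else {
            // Emit one space for any run of whitespace between words.
            if pendingSpace && !output.isEmpty { output.append(" ") }
            pendingSpace = false
            afterApostrophe = false
            output.append(character)
        }
    }
    return output
}

print(tidyPunctuation("Hello , World ... I 'm a newbie iOS Developer."))
// Hello, World... I'm a newbie iOS Developer.
```

Unlike the regex version, this never inserts a space after punctuation; it only keeps one that was already there.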
You can try something like
string.replacingOccurrences(of: " ,", with: ",") for every punctuation...
Interesting problem; here's my stab at a non-Regex approach:
func correct(input: String) -> String {
typealias Correction = (punctuation: String, replacement: String)
let corrections: [Correction] = [
(punctuation: "...", replacement: "... "),
(punctuation: "'", replacement: "'"),
(punctuation: ",", replacement: ", "),
]
var transformed = input
for correction in corrections {
transformed = transformed
.components(separatedBy: correction.punctuation)
.map({ $0.trimmingCharacters(in: .whitespaces) })
.joined(separator: correction.replacement)
}
return transformed
}
let testInput = "Hello , World ... I 'm a newbie iOS Developer."
let testOutput = correct(input: testInput)
// Hello, World... I'm a newbie iOS Developer.
If you were doing this manually by processing character arrays, you would merely need to check the previous and next characters around each space. You can achieve the same result in a functional style with zip, filter and map:
let testInput = "Hello , World ... I 'm a newbie iOS Developer."
let punctuation = Set(".\',")
let previousNext = zip( [" "] + testInput, String(testInput.dropFirst()) + [" "] )
let filteredChars = zip(Array(previousNext),testInput)
.filter{ $1 != " "
|| !($0.0 != " " && punctuation.contains($0.1))
}
let filteredInput = String(filteredChars.map{$1})
print(testInput) // Hello , World ... I 'm a newbie iOS Developer.
print(filteredInput) // Hello, World... I'm a newbie iOS Developer.
Swift 4, 4.2 and 5
let str = " Akbar Code "
let trimmedString = str.trimmingCharacters(in: .whitespaces)

Swift 3 replacingOccurrences exact

I'm having trouble with my replacingOccurrences function. I have a string like so:
var x = "john, johnny, johnney"
What I need to do is remove only "john".
So I have this code:
let y = "john"
x = x.replacingOccurrences(of: y, with: " ", options: NSString.CompareOptions.literal, range: nil)
The problem is that this removes all instances of "john", including the ones inside "johnny" and "johnney". I also thought about setting the range, but that won't really solve my problem because the names could be in a different order.
Is there a CompareOptions func that I'm overlooking for "exact" or do I need to create a regular expression?
Sincerely,
Denis Angell
You should use a regular expression to match the exact word, you just need to put the desired word between \b. Try like this:
let x = "johnny, john, johnney"
let y = "john"
let z = x.replacingOccurrences(of: "\\b\(y)\\b", with: " ", options: .regularExpression) // "johnny,  , johnney"
// If the word starts with a non-word character such as &, a leading \b cannot match, so only the trailing \b is used:
let x2 = "johnny &john johnney"
let y2 = "&john"
let z2 = x2.replacingOccurrences(of: y2 + "\\b", with: " ", options: .regularExpression)
A regular expression is probably your best option because there is no general rule for "exact" that would work in all languages. In English, and in most Romance languages, you would define "exact" as having no alphabetical characters before or after the word to replace.
A regular expression can express that relatively simply for English, as there are no complex letter combinations and a-z is pretty much all of it. In other languages you would have to worry about diacritical marks (French and other Latin-script languages); in Arabic you'd need to take into account the combined letter A (Aleph) at the beginning of some words, and so on.
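One way to put both answers together is sketched below. It assumes that "exact" means "not adjacent to a Unicode letter, mark, digit or underscore", which is only one possible definition, and the replacingWholeWord name is hypothetical. The word is escaped with NSRegularExpression.escapedPattern(for:) so that characters like "." or "&" in it are taken literally, and lookarounds replace \b so that words starting with punctuation also work.

```swift
import Foundation

// Hypothetical helper: replaces `word` only where it is not part of a longer word.
func replacingWholeWord(_ word: String, in text: String, with replacement: String) -> String {
    // Assumption: a "word character" is any Unicode letter, mark, digit or underscore.
    let wordChar = "[\\p{L}\\p{M}\\p{N}_]"
    let pattern = "(?<!\(wordChar))"
        + NSRegularExpression.escapedPattern(for: word)
        + "(?!\(wordChar))"
    // Escape the replacement too, so "$" and "\" in it are not read as template syntax.
    let template = NSRegularExpression.escapedTemplate(for: replacement)
    return text.replacingOccurrences(of: pattern, with: template, options: .regularExpression)
}

print(replacingWholeWord("john", in: "johnny, john, johnney", with: ""))
// johnny, , johnney
```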

Remove substring from a string knowing first and last characters in Swift

Having a string like this:
let str = "In 1273, however, they lost their son in an accident;[2] the young Theobald was dropped by his nurse over the castle battlements.[3]"
I'm looking for a way to remove every pair of square brackets and anything between them.
I was trying using a String's method: replacingOccurrences(of:with:), but it requires the exact substring it needs to be removed, so it doesn't work for me.
You can use:
let updated = str.replacingOccurrences(of: "\\[[^\\]]+\\]", with: "", options: .regularExpression)
The regular expression (without the extra escapes required in a Swift string) is:
\[[^\]]+\]
The \[ and \] look for the characters [ and ]. They have a backslash to remove the normal special meaning of those characters in a regular expression.
The [^\]] means to match any character except the ] character. The + means match 1 or more.
You can create a while loop to get the lowerBound of the range of the first string and the upperBound of the range of the second string and create a range from that. Next just remove the subrange of your string and set the new startIndex for the search.
var str = "In 1273, however, they lost their son in an accident;[2] the young Theobald was dropped by his nurse over the castle battlements.[3]"
var start = str.startIndex
while let from = str.range(of: "[", range: start..<str.endIndex)?.lowerBound,
let to = str.range(of: "]", range: from..<str.endIndex)?.upperBound,
from != to {
str.removeSubrange(from..<to)
start = from
}
print(str) // "In 1273, however, they lost their son in an accident; the young Theobald was dropped by his nurse over the castle battlements."

Getting IndexOutOfBounds Exception while searching for a substring

I have a string like
var word = "banana"
and a sentence like var sent = "the monkey is holding a banana which is yellow"
var sent1 = "banana!!"
I want to search banana in sent and then write to a file in the following way:
the monkey is holding a
banana
which is yellow
I'm doing it in the following way:
var before = sent.substring(0, sent.indexOf(word))
var after = sent.substring(sent.indexOf(word) + word.length)
println(before)
println(after)
This works fine but when I do the same for sent1, then it gives me IndexOutOfBoundsException. I think it is because there is nothing before banana in sent1. How to deal with this?
You can split based on the word and you will get an array with everything before and after the word.
val search = sent.split(word)
search: Array[String] = Array("the monkey is holding a ", " which is yellow")
This works in the "banana!!" case:
"banana!!".split(word)
res5: Array[String] = Array("", !!)
Now you can write the three lines to a file in your favorite way:
println(search(0))
println(word)
println(search(1))
What if you had more than one occurrence of the word? .split understands regular expressions, so you could improve the previous solution with something like this:
sent
.split("\\s+(?=banana)|(?<=banana)\\s+")
.foreach(println)
\\s means a whitespace character
(?=<word>) means "followed by <word>"
(?<=<word>) means "preceded by <word>"
So, this would split your string into pieces, using any spaces either preceded or followed by the "banana", and not the word itself. The actual word ends up in the list, just like the other parts of the string, so you don't need to print it out explicitly
This regex trick is called "positive look-around" ( ?= is look-ahead, ?<= is look-behind) in case you are wondering.