I am trying to parse a string with a regex, and I am having trouble extracting all the information into substrings. I am almost done, but I am stuck at this point:
For a string like this:
[00/0/00, 00:00:00] User: This is the message text and any other stuff
I can parse Date, User and Message in Swift with this code:
let line = "[00/0/00, 00:00:00] User: This is the message text and any other stuff"
let result = line.match("(.+)\\s([\\S ]*):\\s(.*\n(?:[^-]*)*|.*)$")
extension String {
    func match(_ regex: String) -> [[String]] {
        let nsString = self as NSString
        // NSRange offsets are UTF-16 based, so use the NSString length rather than the Character count.
        let fullRange = NSRange(location: 0, length: nsString.length)
        return (try? NSRegularExpression(pattern: regex, options: []))?.matches(in: self, options: [], range: fullRange).map { match in
            (0..<match.numberOfRanges).map { match.range(at: $0).location == NSNotFound ? "" : nsString.substring(with: match.range(at: $0)) }
        } ?? []
    }
}
The resulting array is something like this:
[["[00/0/00, 00:00:00] User: This is the message text and any other stuff","[00/0/00, 00:00:00]","User","This is the message text and any other stuff"]]
Now my problem is this: if the message has a ':' in it, the resulting array no longer follows the same format, and that breaks the parsing function.
So I think I am missing some cases in the regex. Can anyone help me with this? Thanks in advance.
In the pattern, you are making use of parts that are very broad matches.
For example, .+ will first match until the end of the line, [\\S ]* will match either a non whitespace char or a space and [^-]* matches any char except a -
The reason it could potentially break is that the broad matches first match until the end of the string. As a single : is mandatory in your pattern, it will backtrack from the end of the string until it can match a : followed by a whitespace, and then tries to match the rest of the pattern.
Adding another : in the message part may cause the backtracking to stop earlier than you would expect, making the message group shorter.
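To make the backtracking effect concrete, here is a small sketch that runs the pattern from the question against a line whose message contains an extra ": ". The helper name groups(of:in:) and the sample line are made up for illustration; only the pattern comes from the question.

```swift
import Foundation

// Hypothetical helper: capture groups of the first match (index 0 is the whole match).
func groups(of pattern: String, in line: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(in: line, range: NSRange(line.startIndex..., in: line))
    else { return [] }
    return (0..<match.numberOfRanges).map { index in
        // A group that did not participate in the match is reported as "".
        Range(match.range(at: index), in: line).map { String(line[$0]) } ?? ""
    }
}

// Pattern copied from the question.
let original = "(.+)\\s([\\S ]*):\\s(.*\n(?:[^-]*)*|.*)$"

// Assumed input: same shape as the question's line, plus a second ": " in the message.
let broken = groups(of: original, in: "[00/0/00, 00:00:00] User: Note: hi")
print(broken)
```

Because the first group backtracks from the end of the string, it stops at the last place where ": " can still follow, so the user name ends up inside group 1 and the start of the message lands in group 2.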
You could make the pattern a bit more precise, so that the last part can also contain : without breaking the groups.
(\[[^][]*\])\s([^:]*):\s(.*)$
(\[[^][]*\]) Match the part from an opening till closing square bracket [...] in group 1
\s Match a whitespace char
([^:]*): Match any char except : in group 2, then match the expected :
\s(.*) Match a whitespace char, and capture 0+ times any char in group 3
$ End of string
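As a rough sketch of how the more precise pattern could be used from Swift: the function name parseLine and the sample line are hypothetical, and the brackets inside the character class are escaped here on the assumption that the ICU engine behind NSRegularExpression may treat an unescaped [ inside a class as the start of a nested set.

```swift
import Foundation

// Hypothetical helper: splits a chat line into date, user and message.
func parseLine(_ line: String) -> (date: String, user: String, message: String)? {
    // Same structure as the pattern above; [ and ] are escaped inside the class (assumption, see lead-in).
    let pattern = #"(\[[^\]\[]*\])\s([^:]*):\s(.*)$"#
    guard let regex = try? NSRegularExpression(pattern: pattern),
          let match = regex.firstMatch(in: line, range: NSRange(line.startIndex..., in: line)),
          let dateRange = Range(match.range(at: 1), in: line),
          let userRange = Range(match.range(at: 2), in: line),
          let messageRange = Range(match.range(at: 3), in: line)
    else { return nil }
    return (String(line[dateRange]), String(line[userRange]), String(line[messageRange]))
}

// Assumed input: the message itself contains a colon.
let parsed = parseLine("[00/0/00, 00:00:00] User: Note: this message has a colon")
print(parsed?.user ?? "no match", "|", parsed?.message ?? "no match")
```

Since group 2 cannot contain a colon, it ends at the first ": " after the bracketed date, and everything after that stays in the message group.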
I have an array of strings like:
"Foo", "Foo1", "Foo$", "$Foo", "1Foo", "1$", "20$", "1$Foo", "12$$", etc.
My required format is [any number without dots][must end with a single $ symbol] (that is, 1$ and 20$ from the above array).
I tried the following, but it does not work:
func isValidItem(_ item: String) -> Bool {
let pattern = #"^[0-9]$"#
return (item.range(of: pattern, options: .regularExpression) != nil)
}
Can someone help me with this? Also, please share some good links for learning regex patterns if you have any.
Thank you
You can use
func isValidItem(_ item: String) -> Bool {
let pattern = #"^[0-9]+\$\z"#
return (item.range(of: pattern, options: .regularExpression) != nil)
}
let arr = ["Foo", "Foo1", "Foo$", "$Foo", "1Foo", "1$", "20$", "1$Foo", "12$$"]
print(arr.filter {isValidItem($0)})
// => ["1$", "20$"]
Here,
^ - matches the start of the string (no multiline flag is set here, so it does not match at the start of every line)
[0-9]+ - one or more ASCII digits (note that Swift regex engine is ICU and \d matches any Unicode digits in this flavor, so [0-9] is safer if you need to only match digits from the 0-9 range)
\$ - a $ char
\z - the very end of string.
See the online regex demo ($ is used instead of \z since the demo is run against a single multiline string, hence the use of the m flag at regex101.com).
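The two notes above (\d versus [0-9], and $ versus \z) can be checked with a short sketch. The helper name matches(_:_:) and the sample inputs are assumptions for illustration; the behaviour shown relies on the ICU defaults, where \d covers Unicode decimal digits and $ may also match just before a final line terminator.

```swift
import Foundation

// Hypothetical helper: true when `pattern` matches somewhere in `item`.
func matches(_ item: String, _ pattern: String) -> Bool {
    return item.range(of: pattern, options: .regularExpression) != nil
}

// Two Arabic-Indic digits followed by a dollar sign (assumed test input).
let arabicIndic = "\u{0661}\u{0662}$"
let unicodeDigitsAccepted = matches(arabicIndic, #"^\d+\$\z"#)
let asciiOnlyAccepted = matches(arabicIndic, #"^[0-9]+\$\z"#)

// A valid item followed by a trailing line feed (assumed test input).
let trailingNewline = "1$\n"
let dollarAnchorAccepted = matches(trailingNewline, #"^[0-9]+\$$"#)
let zAnchorAccepted = matches(trailingNewline, #"^[0-9]+\$\z"#)

print(unicodeDigitsAccepted, asciiOnlyAccepted, dollarAnchorAccepted, zAnchorAccepted)
```

If the input can contain non-ASCII digits or trailing line breaks, [0-9] together with \z is the stricter combination.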
How can I match a line break in OCR text using regex?
For example I have this text:
"NAME
JESUS
LASTNAME"
I want to find a match with NAME and then get the next two lines
if (line.text.range(of: "^NAME+\\n", options: .regularExpression) != nil){
let name = line.text
print(name)
}
You can use a positive lookbehind to find NAME followed by a new line, and then match a line followed by any text that ends at a new line or at the end of the string: "(?s)(?<=NAME\n).*\n.*(?=$|\n)"
For more info about the regex above you can check this
Playground testing:
let str = "NAME\nJESUS\nLASTNAME"
let pattern = "(?s)(?<=NAME\n).*\n.*(?=$|\n)"
if let range = str.range(of: pattern, options: .regularExpression) {
let text = String(str[range])
print(text)
}
This will print
JESUS
LASTNAME
You can use
(?m)(?<=^NAME\n).*\n.*
See the regex demo. Details:
(?m) - a multiline option making ^ match start of a line
(?<=^NAME\n) - a positive lookbehind that matches a location that is immediately preceded by the start of a line, NAME and then a line feed char
.*\n.* - two subsequent lines (.* matches zero or more chars other than line break chars as many as possible).
See the Swift fiddle:
import Foundation
let line_text = "NAME\nJESUS\nLASTNAME"
if let rng = line_text.range(of: #"(?m)(?<=^NAME\n).*\n.*"#, options: .regularExpression) {
print(String(line_text[rng]))
}
// => JESUS
// LASTNAME
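The two patterns suggested above behave the same on the three-line sample, but they can differ once more lines follow. The sketch below uses a hypothetical five-line input and a made-up helper firstMatch(of:in:) to compare them: with (?s) the dot also matches line breaks, so the greedy .* can run past the second line.

```swift
import Foundation

// Hypothetical OCR output with extra lines after the two wanted ones.
let ocrText = "NAME\nJESUS\nLASTNAME\nAGE\n30"

// Hypothetical helper: text of the first regex match, or nil.
func firstMatch(of pattern: String, in input: String) -> String? {
    guard let range = input.range(of: pattern, options: .regularExpression) else { return nil }
    return String(input[range])
}

// (?s): the dot matches line breaks too, so the match extends to the last line.
let dotAllResult = firstMatch(of: #"(?s)(?<=NAME\n).*\n.*(?=$|\n)"#, in: ocrText)

// (?m): the dot stops at line breaks, so exactly two lines are matched.
let twoLinesResult = firstMatch(of: #"(?m)(?<=^NAME\n).*\n.*"#, in: ocrText)

print(dotAllResult ?? "no match")
print(twoLinesResult ?? "no match")
```

When only the next two lines are wanted, the variant without (?s) is the safer choice.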
Hello, I am having trouble using the non-capturing group feature of regex in NSRegularExpression.
Here's some code to capture matches:
func matches(for regex: String, in text: String) -> [String] {
do {
let regex = try NSRegularExpression(pattern: regex);
let results = regex.matches(in: text,
range: NSRange(text.startIndex..., in: text));
return results.map {
String(text[Range($0.range, in: text)!]);
};
} catch let error {
print("invalid regex: \(error.localizedDescription)")
return [];
};
};
Now on to the regex. I have a string of text in the form workcenter:WDO-POLD. It should be very easy to make this work, but the regex string ((?:workcenter:)(.{0,20})) does not return what I need.
I get no errors when running it, but I get back the same string that I put in. I am trying to retrieve the value that comes after workcenter:, which is the (.{0,20}) part.
The first problem is with your regular expression. You do not want the outer capture group. Change your regular expression to:
(?:workcenter:)(.{0,20}) <-- outer capture group removed
The next problem is with how you are doing the mapping. You are accessing the full range of the match and not the desired capture group. Since you have a generalized function for handling any regular expression, it's hard to deal with all possibilities but the following change solves your immediate example and should work with regular expressions that have no capture group as well as those with one capture group.
Update your mapping line to:
return results.map {
regex.numberOfCaptureGroups == 0 ?
String(text[Range($0.range, in: text)!]) :
String(text[Range($0.range(at: 1), in: text)!])
}
This checks how many capture groups are in your regular expression. If none, it returns the full match. But if there is 1 or more, it returns just the value of the first capture group.
You can also get your original mapping to work if you change your regular expression to:
(?<=workcenter:)(.{0,20})
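Putting the pieces of this answer together, a minimal sketch could look like the following. The function name firstGroupMatches is hypothetical, and compactMap is used instead of force unwrapping (a swapped-in choice, not part of the answer above) so that a group that did not participate in a match is skipped rather than crashing.

```swift
import Foundation

// Hypothetical variant of the question's function: returns group 1 when the
// pattern has capture groups, otherwise the whole match.
func firstGroupMatches(for pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let groupIndex = regex.numberOfCaptureGroups == 0 ? 0 : 1
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: fullRange).compactMap { match in
        // Range(_:in:) returns nil for a group that did not take part in the match.
        Range(match.range(at: groupIndex), in: text).map { String(text[$0]) }
    }
}

let viaGroup = firstGroupMatches(for: "(?:workcenter:)(.{0,20})", in: "workcenter:WDO-POLD")
let viaLookbehind = firstGroupMatches(for: "(?<=workcenter:).{0,20}", in: "workcenter:WDO-POLD")
print(viaGroup, viaLookbehind)
```

Both calls return the text after workcenter:, the first through capture group 1 and the second because the lookbehind keeps the prefix out of the match itself.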
There's a much simpler solution here.
You have a lot of extra groups. Remove the outermost and no need for the non-capture group. Just use workcenter:(.{0,20}). Then you can reference the desired capture group with $1.
And no need for NSRegularExpression in this case. Use a simple string replacement.
let str = "workcenter:WDO-POLD"
let res = str.replacingOccurrences(of: "workcenter:(.{0,20})", with: "$1", options: .regularExpression)
This gives WDO-POLD.
I am trying to determine whether an input string contains "n't" or "not".
For example, if the input were:
let part = "Hi, I can't be found!"
I want to find the presence of the negation.
I have tried input.contains, .range, and NSRegularExpression. All of these succeed in finding "not", but fail to find "n't". I have tried escaping the character as well.
//REGEX:
let negationPattern = "(?:n't|[Nn]ot)"
do {
let regex = try NSRegularExpression(pattern: negationPattern)
let results = regex.matches(in: part, range: NSRange(part.startIndex..., in: part))
print("results are \(results)")
negation = (results.count > 0)
} catch let error {
print("invalid regex: \(error.localizedDescription)")
}
//.CONTAINS
if part.contains("not") || part.contains("n't"){
print("negation present in part")
negation = true
}
//.RANGE (showing .regex option; also tried without)
if part.lowercased().range(of:"not", options: .regularExpression) != nil || part.lowercased().range(of:"n't", options: .regularExpression) != nil {
print("negation present in part")
negation = true
}
Here is a picture:
This is a bit tricky, and the screenshot is actually what gives it away: your regex pattern has a plain single quote in it, but the input text has a "smart" or "curly" apostrophe in it. The difference is subtle:
Regular: '
Smart: ’
Lots of text fields will automatically replace regular single quotes with "smart" apostrophes when they think it's appropriate. Your regex, however, only matches the plain single quote, as evidenced by this tiny test:
func isNegation(input text: String) -> Bool {
let negationPattern = "(?:n't|[Nn]ot)"
let regex = try! NSRegularExpression(pattern: negationPattern)
let matches = regex.matches(in: text,range: NSRange(text.startIndex..., in: text))
return matches.count > 0
}
for input in ["not", "n't", "n’t"] {
print("\"\(input)\" is negation: \(isNegation(input: input) ? "YES" : "NO")")
}
This prints:
"not" is negation: YES
"n't" is negation: YES
"n’t" is negation: NO
If you want to continue using a regex for this problem, you'll need to modify it to match this kind of punctuation character, and avoid assuming all your input text includes "plain" single quotes.
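One possible way to do that is to accept both apostrophe forms in a character class. This is only a sketch: the function name containsNegation is made up, and it assumes the only "smart" variant you need to handle is the right single quotation mark (U+2019).

```swift
import Foundation

// Hypothetical variant of isNegation that accepts both apostrophe forms.
func containsNegation(_ text: String) -> Bool {
    // ['\u{2019}] matches the plain apostrophe (U+0027) or the curly one (U+2019).
    let negationPattern = "(?:n['\u{2019}]t|[Nn]ot)"
    guard let regex = try? NSRegularExpression(pattern: negationPattern) else { return false }
    return regex.firstMatch(in: text, range: NSRange(text.startIndex..., in: text)) != nil
}

let plainQuote = containsNegation("Hi, I can't be found!")
let curlyQuote = containsNegation("Hi, I can\u{2019}t be found!")
let noNegation = containsNegation("Hi, I can be found!")
print(plainQuote, curlyQuote, noNegation)
```

Other apostrophe-like characters can be added to the same character class if your input sources produce them.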
I am stuck at getting a string from html body
<html><head>
<title>Uaeexchange Mobile Application</title></head><body>
<div id='ourMessage'>
49.40:51.41:50.41
</div></body></html>
I would like to get the string containing 49.40:51.41:50.41. I don't want to do it with string advance or index. Can I get this string in Swift by specifying that I need only numbers, dot (.) and colon (:)? I mean some numbers and some special characters.
I tried
let stringArray = response.componentsSeparatedByCharactersInSet(
NSCharacterSet.decimalDigitCharacterSet().invertedSet)
let newString = stringArray.joinWithSeparator("")
print("Trimmed\(newString)and count\(newString.characters.count)")
but this obviously trims away the dot and colon too. Any suggestions, friends?
The simple answer to your question is that you need to include "." & ":" in the set that you want to keep.
let response: String = "<html><head><title>Uaeexchange Mobile Application</title></head><body><div id='ourMessage'>49.40:51.41:50.41</div></body></html>"
var s: CharacterSet = CharacterSet.decimalDigits
s.insert(charactersIn: ".:")
let stringArray: [String] = response.components(separatedBy: s.inverted)
let newString: String = stringArray.joined(separator: "")
print("Trimmed '\(newString)' and count=\(newString.count)")
// "Trimmed '49.40:51.41:50.41' and count=17\n"
Without more information on what else your response might be, I can't really give a better answer, but fundamentally this is not a good solution. What if the response had been
<html><head><title>Uaeexchange Mobile Application</title></head><body>
<div id='2'>Some other stuff: like this</div>
<div id='ourMessage'>49.40:51.41:50.41</div>
</body></html>
Using a replace/remove solution to this is a hack, not an algorithm - it will work until it doesn't.
I think you should probably be looking for the <div id='ourMessage'> and reading from there to the next <, but again, we'd need more information on the specification of the format of the response.
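A minimal sketch of that idea, assuming the opening tag appears exactly as <div id='ourMessage'> and the message contains no nested tags; the function name extractMessage is hypothetical:

```swift
import Foundation

// Hypothetical helper: returns the text between <div id='ourMessage'> and the next "<".
func extractMessage(from html: String) -> String? {
    // Assumption: the opening tag is written exactly like this in the response.
    guard let open = html.range(of: "<div id='ourMessage'>") else { return nil }
    let rest = html[open.upperBound...]
    guard let close = rest.range(of: "<") else { return nil }
    return rest[..<close.lowerBound].trimmingCharacters(in: .whitespacesAndNewlines)
}

let sampleResponse = "<html><head><title>Uaeexchange Mobile Application</title></head><body>\n"
    + "<div id='2'>Some other stuff: like this</div>\n"
    + "<div id='ourMessage'>\n49.40:51.41:50.41\n</div>\n"
    + "</body></html>"
let message = extractMessage(from: sampleResponse)
print(message ?? "not found")
```

Unlike the strip-everything-else approach, the colon in the other div does not leak into the result, although this still depends on the tag being written exactly as assumed.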
I'd recommend using an HTML parser; nevertheless, this is a simple solution with a regular expression:
let extractedString = response.replacingOccurrences(of: "[^\\d:.]+", with: "", options: .regularExpression)
Or the positive regex search which is more code but also more reliable:
let pattern = ">\\s?([\\d:.]+)\\s?<"
let regex = try! NSRegularExpression(pattern: pattern)
// NSRange is measured in UTF-16 code units, so build it from the String itself.
let fullRange = NSRange(response.startIndex..., in: response)
if let match = regex.firstMatch(in: response, range: fullRange),
   let range = Range(match.range(at: 1), in: response) {
    // Group 1 holds only the digits, dots and colons between ">" and "<".
    let extractedString = String(response[range])
    print(extractedString)
}
While the simple (negative) regex search removes all characters that are not digits, dots or colons, the positive search also requires the closing (>) and opening (<) tag characters around the desired result, so a stray digit, dot or colon elsewhere doesn't match the pattern.
You can also use the String.replacingOccurrences() method in other ways, without regex, as follows:
import Foundation
var response: String = "<html><head><title>Uaeexchange Mobile Application</title></head><body><div id='ourMessage'>49.40:51.41:50.41</div></body></html>"
let charsNotToBeTrimmed = (0...9).map{String($0)} + ["." ,":"] // you can add any character you want here, that's the advantage
for i in response {
if !charsNotToBeTrimmed.contains(String(i)){
response = response.replacingOccurrences(of: String(i), with: "")
}
}
print(response)
Basically, this creates an array of the characters that should not be trimmed; any character that is not in that array gets removed in the for-loop.
But you have to be warned that what you're trying to do isn't quite right...