Match and extract href info using regex - swift

I am trying to write a regex in Swift that matches an anchor tag and extracts its href link information in several cases: with double quotes, with single quotes, and with no quotation marks around the URL.
A regex to match href and extract info <a href=https://www.google.com>Google</a>.
<a href="https://www.google.com">Google</a>
<a href='https://www.google.com'>Google</a>
I have found this regex, but it only works with double quotation:
<a href="([^"]+)">([^<]+)<\/a>
Result:
Match 1: <a href="https://www.google.com">Google</a>
Group 1: https://www.google.com
Group 2: Google
What I want is to match all three forms shown in the sample text.
Note: I know that regex shouldn't be used for parsing HTML, but I am using it for a very small use case so it's fine.

Assuming the anchor tags in the file you want to parse have no other attributes, you can use the following regex: /<a href=('|"|)([^'">]+)\1>([^<]+)<\/a>/$2 $3/gm.
It first captures a single quote, a double quote, or nothing, and \1 then requires that same captured text again before the closing >. You can try it live on regex101.
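For reference, the same pattern can be applied from Swift with NSRegularExpression. This is a minimal sketch rather than the answerer's code: extractLinks is a hypothetical helper name, and it keeps the answer's assumption that href is the only attribute on the tag.

```swift
import Foundation

// Hypothetical helper: returns (url, text) pairs for every simple anchor in `html`.
// Assumes `href` is the only attribute, as stated in the answer above.
func extractLinks(from html: String) -> [(url: String, text: String)] {
    // Group 1: opening quote (', " or nothing); \1 requires the same text to close.
    // Group 2: the URL. Group 3: the link text.
    let pattern = #"<a href=('|"|)([^'">]+)\1>([^<]+)</a>"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let fullRange = NSRange(html.startIndex..., in: html)
    return regex.matches(in: html, range: fullRange).compactMap { match -> (url: String, text: String)? in
        guard let urlRange = Range(match.range(at: 2), in: html),
              let textRange = Range(match.range(at: 3), in: html) else { return nil }
        return (url: String(html[urlRange]), text: String(html[textRange]))
    }
}

let samples = [
    "A regex to match href and extract info <a href=https://www.google.com>Google</a>.",
    "<a href=\"https://www.google.com\">Google</a>",
    "<a href='https://www.google.com'>Google</a>",
]
for sample in samples {
    for link in extractLinks(from: sample) {
        print("url: \(link.url), text: \(link.text)")
    }
}
```

Each of the three sample strings should yield the URL https://www.google.com and the text Google.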

The answer is already in the comments, but I am posting this since the approach is a bit different.
In Swift 5.7+ and iOS 16+ you can use RegexBuilder for this.
import RegexBuilder
let link1 = "A regex to match href and extract info <a href=https://www.google.com>Google</a>."
let link2 = "<a href=\"https://www.google.com\">Google</a>"
let link3 = "<a href='https://www.google.com'>Google</a>"
// Captures a URL of the form https://www.<word>.<word>
let regex = Regex {
    Capture {
        "https://www."
        ZeroOrMore(.word)
        "."
        ZeroOrMore(.word)
    }
}
if let result1 = try? regex.firstMatch(in: link1) {
    print("link: \(result1.output.1)")
}
if let result2 = try? regex.firstMatch(in: link2) {
    print("link: \(result2.output.1)")
}
if let result3 = try? regex.firstMatch(in: link3) {
    print("link: \(result3.output.1)")
}
This works well for the three strings above, but depending on your scenario you might need to change the implementation.

Related

How can I get the "src" string from RSS in Swift?

I have the RSS page with the html tag like this:
<description>
<![CDATA[
<a href='https://www.24h.com.vn/bong-da/psg-trao-than-dong-mbappe-sieu-luong-bong-chi-kem-messi-real-vo-mong-c48a1112120.html' title='PSG trao thần đồng Mbappe siêu lương bổng: Chỉ kém Messi, Real vỡ mộng'><img width='130' height='100' src='https://image.24h.com.vn/upload/4-2019/images/2019-12-27/1577463916-359-thumbnail.jpg' alt='PSG trao thần đồng Mbappe siêu lương bổng: Chỉ kém Messi, Real vỡ mộng' title='PSG trao thần đồng Mbappe siêu lương bổng: Chỉ kém Messi, Real vỡ mộng' /></a><br />PSG trong nỗ lực giữ chân “sát thủ” Kylian Mbappe, sẵn sàng tăng lương khổng lồ - một động thái nhằm xua đuổi Real Madrid.
]]>
</description>
Please help me get the value of src so I can show the image. I also tried the answer from "Getting img url from RSS feed swift", but it doesn't work. Here is my code to get src (it always ends up in the else branch, setting image = "nil"):
let regex: NSRegularExpression = try! NSRegularExpression(pattern: "<img.*?src=\"([^\"]*)\"", options: .caseInsensitive)
let range = NSMakeRange(0, description.count)
if let textCheck = regex.firstMatch(in: description, options: .withoutAnchoringBounds, range: range) {
    let text = (description as NSString).substring(with: textCheck.range(at: 1))
    image = text
} else {
    image = "nil"
}
Thanks for your help!
You need to change your regex to be able to match single-quotes as well, not just double quotes, since the html string you're trying to parse contains single quotes, not double quotes like the one in the linked Q&A.
let regex: NSRegularExpression = try! NSRegularExpression(pattern: "<img.*?src=[\"\']([^\"\']*)[\"\']", options: .caseInsensitive)
If you are sure you only need to match single quotes, you can simplify the pattern by replacing [\"\'] with \'. Currently, the regex pattern will match both single and double quotes.
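Put together, the fix could look like the sketch below. firstImageSource is a hypothetical helper name, not something from the question. As an additional assumption beyond the answer, the search range is built with NSRange(_:in:) so that it is measured in UTF-16 units, which is what NSRegularExpression expects. String.count counts characters and can be shorter than that for some text.

```swift
import Foundation

// Hypothetical helper: returns the first <img> src value, or nil if none is found.
func firstImageSource(in html: String) -> String? {
    // Same pattern as in the answer above, accepting single or double quotes.
    let pattern = #"<img.*?src=["']([^"']*)["']"#
    guard let regex = try? NSRegularExpression(pattern: pattern, options: .caseInsensitive) else {
        return nil
    }
    // NSRange built from String indices, so it covers the whole UTF-16 length.
    let fullRange = NSRange(html.startIndex..., in: html)
    guard let match = regex.firstMatch(in: html, range: fullRange),
          let srcRange = Range(match.range(at: 1), in: html) else {
        return nil
    }
    return String(html[srcRange])
}

let item = "<a href='https://www.24h.com.vn/'><img width='130' height='100' src='https://image.24h.com.vn/upload/4-2019/images/2019-12-27/1577463916-359-thumbnail.jpg' /></a>"
print(firstImageSource(in: item) ?? "nil")
```

Returning an optional instead of the string "nil" lets the caller decide what to do when no image is present.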

NSRegularExpressions - Non Capture Group not working

Hello, I am having trouble using the non-capturing group feature of regex in NSRegularExpression.
Here's some code to capture matches:
func matches(for regex: String, in text: String) -> [String] {
    do {
        let regex = try NSRegularExpression(pattern: regex)
        let results = regex.matches(in: text,
                                    range: NSRange(text.startIndex..., in: text))
        return results.map {
            String(text[Range($0.range, in: text)!])
        }
    } catch let error {
        print("invalid regex: \(error.localizedDescription)")
        return []
    }
}
Now on to the regex. I have a string of text in the form workcenter:WDO-POLD. It should be very easy to make this work, but the regex ((?:workcenter:)(.{0,20})) does not return what I need.
I get no errors when running it, but I get back the same string that I passed in. I am trying to retrieve the value that comes after workcenter:, which is the part matched by (.{0,20}).
The first problem is with your regular expression. You do not want the outer capture group. Change your regular expression to:
(?:workcenter:)(.{0,20}) <-- outer capture group removed
The next problem is with how you are doing the mapping. You are accessing the full range of the match and not the desired capture group. Since you have a generalized function for handling any regular expression, it's hard to deal with all possibilities but the following change solves your immediate example and should work with regular expressions that have no capture group as well as those with one capture group.
Update your mapping line to:
return results.map {
    regex.numberOfCaptureGroups == 0 ?
        String(text[Range($0.range, in: text)!]) :
        String(text[Range($0.range(at: 1), in: text)!])
}
This checks how many capture groups are in your regular expression. If none, it returns the full match. But if there is 1 or more, it returns just the value of the first capture group.
You can also get your original mapping to work if you change your regular expression to:
(?<=workcenter:)(.{0,20})
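Both suggestions can be checked with a variant of the question's function. The sketch below is one possible rewrite, not the answerer's exact code: it picks capture group 1 when the pattern has any capture groups and the whole match otherwise, and it skips a match whose group did not participate instead of force-unwrapping.

```swift
import Foundation

// Returns group 1 of each match if the pattern has capture groups, else the whole match.
func matches(for pattern: String, in text: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    let groupIndex = regex.numberOfCaptureGroups == 0 ? 0 : 1
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, range: fullRange).compactMap { match in
        // Range(_:in:) returns nil when the group did not take part in the match.
        Range(match.range(at: groupIndex), in: text).map { String(text[$0]) }
    }
}

let input = "workcenter:WDO-POLD"
print(matches(for: "(?:workcenter:)(.{0,20})", in: input))   // capture group variant
print(matches(for: "(?<=workcenter:).{0,20}", in: input))    // lookbehind variant
```

Both calls should print ["WDO-POLD"]. The non-capturing group only affects group numbering, not what the overall match covers, which is why the original mapping returned the whole input.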
There's a much simpler solution here.
You have a lot of extra groups. Remove the outermost and no need for the non-capture group. Just use workcenter:(.{0,20}). Then you can reference the desired capture group with $1.
And no need for NSRegularExpression in this case. Use a simple string replacement.
let str = "workcenter:WDO-POLD"
let res = str.replacingOccurrences(of: "workcenter:(.{0,20})", with: "$1", options: .regularExpression)
This gives WDO-POLD.

get substrings from string

I have the following string from a server:
I agree with the <a>((http://example.com)) Terms of Use</a> and I've read the <a>((http://example2.com)) Privacy</a>
now I want to show it like this in a label:
I agree with the Terms of Use and I've read the Privacy
I tried to cut off the ((http://example.com)) from the string and save it in another String. I need the link because the text should be clickable later.
I tried this to get the text that I want:
// the link:
let firstString = "(("
let secondString = "))"
let link = (text.range(of: firstString)?.upperBound).flatMap { substringFrom in
    (text.range(of: secondString, range: substringFrom..<text.endIndex)?.lowerBound).map { substringTo in
        String(text[substringFrom..<substringTo])
    }
}
// the new text
if let link = link {
    newString = text.replacingOccurrences(of: link, with: kEmptyString)
}
I got this from here: Swift Get string between 2 strings in a string
The problem with this is that it only removes the text inside the (( )) brackets; the brackets are still there. I tried to play with the offsets of the indexes, but this didn't change anything. Moreover, this solution only works if there's one link in the text. If there are multiple links, I think they should be stored and I have to loop through the text, but I don't know how to do this. I tried many things but I can't get it working. Is there maybe an easier way to do what I want?
You can use a regular expression to do a quick search replace.
let text = "I agree with the <a>((http://example.com)) Terms of Use</a> and I've read the <a>((http://example2.com)) Privacy</a>"
let resultStr = text.replacingOccurrences(of: "<a>\\(\\(([^)]*)\\)\\) ", with: "<a href=\"$1\">", options: .regularExpression, range: nil)
print(resultStr)
Output:
I agree with the <a href="http://example.com">Terms of Use</a> and I've read the <a href="http://example2.com">Privacy</a>
You can use something like this to get the links:
let s = "I agree with the ((http://example.com)) Terms of Use and I've read the ((http://example2.com)) Privacy"
let firstDiv = s.split(separator: "(") // ["I agree with the ", "http://example.com)) Terms of Use and I\'ve read the ", "http://example2.com)) Privacy"]
let mid = firstDiv[1] // http://example.com)) Terms of Use and I've read the
let link1 = mid.split(separator: ")")[0] // http://example.com
let link2 = firstDiv[2].split(separator: ")")[0] // http://example2.com
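If the text can contain any number of links, a regular expression can both collect every URL and strip the markers in one pass. The following is a sketch under that assumption. splitLinks is a hypothetical helper name, and it assumes the URLs themselves contain no ")" character.

```swift
import Foundation

// Hypothetical helper: removes every "((url)) " marker and returns the URLs in order.
func splitLinks(in text: String) -> (cleaned: String, links: [String]) {
    // "((" + anything except ")" + "))" + an optional trailing space.
    let pattern = #"\(\(([^)]*)\)\) ?"#
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return (text, []) }
    let fullRange = NSRange(text.startIndex..., in: text)
    // Group 1 of each match is the URL between the double parentheses.
    let links = regex.matches(in: text, range: fullRange).compactMap { match in
        Range(match.range(at: 1), in: text).map { String(text[$0]) }
    }
    let cleaned = regex.stringByReplacingMatches(in: text, options: [], range: fullRange, withTemplate: "")
    return (cleaned, links)
}

let serverText = "I agree with the <a>((http://example.com)) Terms of Use</a> and I've read the <a>((http://example2.com)) Privacy</a>"
let result = splitLinks(in: serverText)
print(result.cleaned)
print(result.links)
```

The links array keeps the URLs in the order they appear, so the first entry belongs to the first <a> element, the second to the next, and so on.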

How to replace a substring with a link(http) in swift 3?

I have a string that contains a substring starting with http, and I want to replace that substring, but I don't know where it ends. I want to scan until the next space and replace everything up to that point.
In other words, if my string contains http, I want to replace the text from http up to the next space.
Here is my example:
let string = "Hello.World everything is good http://www.google.com By the way its good"
This string can be dynamic. In the string above, http is present, so I want to replace "http://www.google.com" with "website".
So it would be
string = "Hello.World everything is good website By the way its good"
A possible solution is a regular expression.
The pattern searches for http:// or https:// followed by one or more non-whitespace characters up to a word boundary.
let string = "Hello.World everything is good http://www.google.com By the way its good"
let trimmedString = string.replacingOccurrences(of: "https?://\\S+\\b", with: "website", options: .regularExpression)
print(trimmedString)
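The word boundary matters when the link is followed by punctuation. As a small sketch, here is the same pattern wrapped in a hypothetical replaceLinks helper and applied to a second, made-up sentence:

```swift
import Foundation

// Hypothetical helper wrapping the pattern above.
// Note: `placeholder` is used as a regex template, so "$" and "\" in it are special.
func replaceLinks(in text: String, with placeholder: String = "website") -> String {
    text.replacingOccurrences(of: #"https?://\S+\b"#,
                              with: placeholder,
                              options: .regularExpression)
}

print(replaceLinks(in: "Hello.World everything is good http://www.google.com By the way its good"))
// Made-up example with a comma directly after the link:
print(replaceLinks(in: "See https://example.com/path, then stop."))
```

In the second sentence the comma after the link is not a word character, so the match ends before it and the comma stays in the result.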
Splitting the string into words, replacing the matching ones, and joining them back should solve this.
// split into array
let arr = string.components(separatedBy: " ")
// do checking and join
let newStr = arr.map { word in
    return word.hasPrefix("http") ? "website" : word
}.joined(separator: " ")
print(newStr)

How can we remove every character other than numbers, dot and colon in swift?

I am stuck at getting a string from an HTML body:
<html><head>
<title>Uaeexchange Mobile Application</title></head><body>
<div id='ourMessage'>
49.40:51.41:50.41
</div></body></html>
I would like to get the string containing 49.40:51.41:50.41. I don't want to do it with string advance or index. Can I get this string by specifying that I only need numbers, dots (.) and colons (:) in Swift? That is, some numbers and some special characters.
I tried
let stringArray = response.componentsSeparatedByCharactersInSet(
NSCharacterSet.decimalDigitCharacterSet().invertedSet)
let newString = stringArray.joinWithSeparator("")
print("Trimmed\(newString)and count\(newString.characters.count)")
but this obviously trims away the dots and colons too. Any suggestions?
The simple answer to your question is that you need to include "." & ":" in the set that you want to keep.
let response: String = "<html><head><title>Uaeexchange Mobile Application</title></head><body><div id='ourMessage'>49.40:51.41:50.41</div></body></html>"
var s: CharacterSet = CharacterSet.decimalDigits
s.insert(charactersIn: ".:")
let stringArray: [String] = response.components(separatedBy: s.inverted)
let newString: String = stringArray.joined(separator: "")
print("Trimmed '\(newString)' and count=\(newString.count)")
// "Trimmed '49.40:51.41:50.41' and count=17\n"
Without more information on what else your response might be, I can't really give a better answer, but fundamentally this is not a good solution. What if the response had been
<html><head><title>Uaeexchange Mobile Application</title></head><body>
<div id='2'>Some other stuff: like this</div>
<div id='ourMessage'>49.40:51.41:50.41</div>
</body></html>
Using a replace/remove solution to this is a hack, not an algorithm - it will work until it doesn't.
I think you should probably be looking for the <div id='ourMessage'> and reading from there to the next <, but again, we'd need more information on the specification of the format of the response.
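A minimal sketch of that idea follows. It assumes the message always sits directly inside <div id='ourMessage'> with no nested tags, and the function name message(in:marker:) is made up for this example.

```swift
import Foundation

// Hypothetical helper: returns the text between the marker tag and the next "<".
func message(in html: String, marker: String = "<div id='ourMessage'>") -> String? {
    // Find the opening tag, then read up to the start of the next tag.
    guard let open = html.range(of: marker) else { return nil }
    let rest = html[open.upperBound...]
    guard let close = rest.firstIndex(of: "<") else { return nil }
    return rest[..<close].trimmingCharacters(in: .whitespacesAndNewlines)
}

let html = """
<html><head><title>Uaeexchange Mobile Application</title></head><body>
<div id='2'>Some other stuff: like this</div>
<div id='ourMessage'>49.40:51.41:50.41</div>
</body></html>
"""
print(message(in: html) ?? "not found")
```

Unlike the character filter, this ignores the colon in the other div, because it only reads the content of the element it was asked for.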
I'd recommend using an HTML parser. Nevertheless, this is a simple solution with a regular expression:
let extractedString = response.replacingOccurrences(of: "[^\\d:.]+", with: "", options: .regularExpression)
Or the positive regex search which is more code but also more reliable:
let pattern = ">\\s?([\\d:.]+)\\s?<"
let regex = try! NSRegularExpression(pattern: pattern)
// NSRange(_:in:) measures the string in UTF-16 units, as NSRegularExpression expects
if let match = regex.firstMatch(in: response, range: NSRange(response.startIndex..., in: response)),
   let range = Range(match.range(at: 1), in: response) {
    let extractedString = String(response[range])
    print(extractedString)
}
The simple (negative) regex search removes every character that is not a digit, dot or colon. The positive search also requires the end of the opening tag (>) before and the start of the closing tag (<) after the desired result, so a stray digit, dot or colon elsewhere in the response does not match the pattern.
You can also use the String.replacingOccurrences() method in other ways, without regex, as follows:
import Foundation
var response: String = "<html><head><title>Uaeexchange Mobile Application</title></head><body><div id='ourMessage'>49.40:51.41:50.41</div></body></html>"
let charsNotToBeTrimmed = (0...9).map { String($0) } + [".", ":"] // you can add any character you want here, that's the advantage
// The loop iterates over a copy of the string, so mutating `response` inside it is safe
for i in response {
    if !charsNotToBeTrimmed.contains(String(i)) {
        response = response.replacingOccurrences(of: String(i), with: "")
    }
}
print(response)
Basically, this creates an array of the characters that should not be trimmed; any character that is not in it gets removed in the for loop.
But you have to be warned that what you're trying to do isn't quite right...