Capturing parentheses using regex on iPhone

Trying to get the URL out of some HTML I'm parsing (on iPhone) using 'capturing parentheses' to just group the bit I'm interested in.
I now have this:
NSString *imageHtml; //a string with some HTML in it
NSRegularExpression* innerRegex = [[NSRegularExpression alloc] initWithPattern:@"href=\"(.*?)\"" options:NSRegularExpressionCaseInsensitive|NSRegularExpressionDotMatchesLineSeparators error:nil];
NSTextCheckingResult* firstMatch = [innerRegex firstMatchInString:imageHtml options:0 range:NSMakeRange(0, [imageHtml length])];
[innerRegex release];
if(firstMatch != nil)
{
    newImage.detailsURL =
    NSLog(@"found url: %@", [imageHtml substringWithRange:firstMatch.range]);
}
The only thing it lists is the full match (so: href="http://tralalala.com" instead of http://tralalala.com).
How can I force it to only return my first capturing parentheses match?

Regex groups work by capturing the whole match in group 0; the capturing groups in the pattern then start at index 1. NSTextCheckingResult stores these groups as ranges, which you read with rangeAtIndex:. Since your regex contains one capturing group, the following will work.
NSString *imageHtml = @"href=\"http://tralalala.com\""; //a string with some HTML in it
NSRegularExpression* innerRegex = [[NSRegularExpression alloc] initWithPattern:@"href=\"(.*?)\"" options:NSRegularExpressionCaseInsensitive|NSRegularExpressionDotMatchesLineSeparators error:nil];
NSTextCheckingResult* firstMatch = [innerRegex firstMatchInString:imageHtml options:0 range:NSMakeRange(0, [imageHtml length])];
[innerRegex release];
if(firstMatch != nil)
{
    //The ranges of firstMatch will provide groups,
    //rangeAtIndex 1 = first grouping
    NSLog(@"found url: %@", [imageHtml substringWithRange:[firstMatch rangeAtIndex:1]]);
}
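The same idea can be wrapped in a small function. This is only a sketch: the name FirstHrefValue is made up for illustration, and it assumes the attribute value is always enclosed in double quotes, as in the question.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns the value of the first href="..." attribute
// in html, or nil when there is no match.
static NSString *FirstHrefValue(NSString *html) {
    if (html == nil) return nil;
    NSError *error = nil;
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"href=\"(.*?)\""
                                                  options:NSRegularExpressionCaseInsensitive
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return nil;
    }
    NSTextCheckingResult *match = [regex firstMatchInString:html
                                                    options:0
                                                      range:NSMakeRange(0, [html length])];
    if (match == nil) return nil;
    // Index 0 is the whole match; index 1 is the first pair of parentheses.
    NSRange groupRange = [match rangeAtIndex:1];
    if (groupRange.location == NSNotFound) return nil;
    return [html substringWithRange:groupRange];
}
```

Checking the group range against NSNotFound matters mostly for patterns with optional groups, where a group can be absent from an otherwise successful match.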

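You need a pattern something like this. The lookbehind and lookahead are not part of the match, so the range of the full match is already just the URL: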
(?<=href=\")(.*?)(?=\")

Related

regex to find hashtags in tweet not working correctly

I am trying to build a function that finds hashtags in tweets and surrounds them with an HTML <a> tag, so that I can link to them. Here is what I do.
NSError* error = nil;
NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:@"(?:\\s|\\A)[##]+([A-Za-z0-9-_]+)" options:0 error:&error];
NSArray* matches = [regex matchesInString:tweetText options:0 range:NSMakeRange(0, [tweetText length])];
for ( NSTextCheckingResult* match in matches )
{
    NSString* matchText = [tweetText substringWithRange:[match range]];
    NSString *matchText2 = [matchText stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
    NSString *search = [matchText2 stringByReplacingOccurrencesOfString:@"#"
                                                             withString:@""];
    NSString *searchHTML= [NSString stringWithFormat:@"<a href='https://twitter.com/search?q=%%23%@'>%@</a>",search,matchText];
    tweetText = [tweetText stringByReplacingOccurrencesOfString:matchText
                                                     withString:searchHTML];
    NSLog(@"match: %@", tweetText);
}
}
Before I execute this function, tweetText is run through another function that finds URLs, so the tweet can already contain something like this: <a href='http://google.be' target='_blank'>http://google.be</a>
Now it sometimes places another tag around those links, and not only around the hashtags.
Can somebody help me with this?
TIP
I am trying to port the following Java code to Objective-C:
String patternStr = "(?:\\s|\\A)[##]+([A-Za-z0-9-_]+)";
Pattern pattern = Pattern.compile(patternStr);
Matcher matcher = pattern.matcher(tweetText);
String result = "";
// Search for Hashtags
while (matcher.find()) {
    result = matcher.group();
    result = result.replace(" ", "");
    String search = result.replace("#", "");
    String searchHTML="<a href='http://search.twitter.com/search?q=" + search + "'>" + result + "</a>";
    tweetText = tweetText.replace(result,searchHTML);
}
EDIT
Gers, we kijken er al naar uit! “#GersPardoel: We zitten in België straks naar Genk!!<a href='<a href<a href='https://twitter.com/search?q=%23='http'>='http</a>s://twitter.com/search?q=%23https:/'>https:/</a>/twitter.com/search?q=%23engaan'> #engaan</a>” #GOS12 #genk #fb
The problem is that you're modifying your tweetText variable (tweetText = ...) while you're looping through the matches. Imagine what happens the next time the code enters the loop: the match ranges were computed against the original string, so substringWithRange: no longer lines up with the modified one. Try to rectify the problem, and if you're unable to, check the solution here: http://pastebin.com/DyQqtRzA
EDIT: Adding solution here:
NSError* error = nil;
NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:@"(?:\\s|\\A)[##]+([A-Za-z0-9-_]+)" options:0 error:&error];
NSArray* matches = [regex matchesInString:tweetText options:0 range:NSMakeRange(0, [tweetText length])];
NSString* processedString = [[tweetText copy] autorelease];
for ( NSTextCheckingResult* match in matches )
{
    NSString* matchText = [tweetText substringWithRange:[match range]];
    NSString *matchText2 = [matchText stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceCharacterSet]];
    NSString *search = [matchText2 stringByReplacingOccurrencesOfString:@"#"
                                                             withString:@""];
    NSString *searchHTML= [NSString stringWithFormat:@"<a href='https://twitter.com/search?q=%%23%@'>%@</a>",search,matchText];
    processedString = [processedString stringByReplacingOccurrencesOfString:matchText
                                                                 withString:searchHTML];
    NSLog(@"match: %@", processedString);
}
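Note that stringByReplacingOccurrencesOfString: replaces every occurrence of the matched text, so a hashtag that appears twice can still be wrapped more than once. One possible variant, sketched below, replaces by range instead and walks the matches from last to first so that earlier ranges stay valid. The function name LinkifyHashtags is hypothetical, and the pattern is simplified to a single leading # (an assumption, not the pattern from the question).

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: wraps every "#tag" in tweetText in a search link.
static NSString *LinkifyHashtags(NSString *tweetText) {
    NSError *error = nil;
    // Simplified pattern (assumption): whitespace or start of string, one '#',
    // then the tag itself as capture group 1.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"(?:\\s|\\A)#([A-Za-z0-9_-]+)"
                                                  options:0
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return tweetText;
    }
    NSArray *matches = [regex matchesInString:tweetText
                                      options:0
                                        range:NSMakeRange(0, [tweetText length])];
    NSMutableString *result = [NSMutableString stringWithString:tweetText];
    // Last match first: replacing near the end never shifts the ranges of
    // matches that start earlier in the string.
    for (NSTextCheckingResult *match in [matches reverseObjectEnumerator]) {
        NSRange tagRange = [match rangeAtIndex:1];
        NSString *tag = [tweetText substringWithRange:tagRange];
        // Extend the range one character to the left to cover the '#',
        // leaving any leading whitespace untouched.
        NSRange hashRange = NSMakeRange(tagRange.location - 1, tagRange.length + 1);
        NSString *link = [NSString stringWithFormat:
            @"<a href='https://twitter.com/search?q=%%23%@'>#%@</a>", tag, tag];
        [result replaceCharactersInRange:hashRange withString:link];
    }
    return result;
}
```

Because a # inside a URL is not preceded by whitespace, links that were inserted earlier (such as href='http://google.be') are left alone by this pattern.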

Match NSArray of characters Objective-C

I have to count the occurrences of a set of special characters in a string.
I thought of creating an array with all these characters (there are 20+) and a function that matches each of them.
I just need the total number of special characters in the string, so that I can do some math with it.
So in the example:
NSString *myString = @"My string #full# of speci@l ch@rs & symbols";
NSArray *myArray = [NSArray arrayWithObjects:@"#",@"@",@"&",nil];
The function should return 5.
Would it be easier to match the characters that are not in the array, take the string length, and output the difference between the original string and the one without special characters?
Is this the best solution?
NSString *myString = @"My string #full# of speci@l ch@rs & symbols";
//this also works when the string starts with consecutive special characters; the line below returns 8
//NSString *myString = @"#&#My string #full# of speci@l ch@rs & symbols";
NSArray *arr=[myString componentsSeparatedByCharactersInSet:[NSMutableCharacterSet characterSetWithCharactersInString:@"#@&"]];
NSLog(@"resulted string : %@ \n\n",arr);
NSLog(@"count of special characters : %lu \n\n",(unsigned long)([arr count]-1));
OUTPUT:
resulted string : (
"My string ",
full,
" of speci",
"l ch",
"rs ",
" symbols"
)
count of special characters : 5
You should use an NSRegularExpression; it's perfect for your scenario. You can create one like this:
NSError *error = NULL;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(#|&)" options:NSRegularExpressionCaseInsensitive error:&error];
NSUInteger numberOfMatches = [regex numberOfMatchesInString:string options:0 range:NSMakeRange(0, [string length])];
Caveat: I ripped the code from the Apple Developer site. And I'm no regex guru so you will have to tweak the pattern. But you get the gist.
You should look also at NSRegularExpression:
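- (NSUInteger)numberOfCharacters:(NSArray *)arr inString:(NSString *)str {
    // Build an alternation such as "(#|@|&)", escaping any regex metacharacters.
    NSMutableString *mutStr = [NSMutableString stringWithString:@"("];
    for (NSUInteger i = 0; i < [arr count]; i++) {
        [mutStr appendString:[NSRegularExpression escapedPatternForString:[arr objectAtIndex:i]]];
        if (i + 1 < [arr count]) [mutStr appendString:@"|"];
    }
    [mutStr appendString:@")"];
    NSRegularExpression *regEx = [NSRegularExpression regularExpressionWithPattern:mutStr options:NSRegularExpressionCaseInsensitive error:nil];
    // numberOfMatchesInString: returns a plain NSUInteger, not a pointer.
    NSUInteger occur = [regEx numberOfMatchesInString:str options:0 range:NSMakeRange(0, [str length])];
    // mutStr is autoreleased, so it must not be released here.
    return occur;
}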
Usage example:
NSString *myString = @"My string #full# of speci@l ch@rs & symbols";
NSArray *myArray = [NSArray arrayWithObjects:@"#",@"@",@"&",nil];
NSLog(@"%lu",(unsigned long)[self numberOfCharacters:myArray inString:myString]); // will print 5
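For single characters a regex is not strictly needed. As a minimal sketch (the function name CountCharactersInSet is made up for illustration), the string can be walked once and each UTF-16 unit tested with characterIsMember:. This assumes the special characters all fit in a single unichar, which is true for #, @ and &.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: counts how many UTF-16 units of string belong to set.
static NSUInteger CountCharactersInSet(NSString *string, NSCharacterSet *set) {
    NSUInteger count = 0;
    NSUInteger length = [string length];
    for (NSUInteger i = 0; i < length; i++) {
        if ([set characterIsMember:[string characterAtIndex:i]]) {
            count++;
        }
    }
    return count;
}
```

Usage would look like CountCharactersInSet(myString, [NSCharacterSet characterSetWithCharactersInString:@"#@&"]). Compared with splitting the string, this avoids allocating an array of substrings just to count separators.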

Stripping URLs from image data using NSRegularExpression

I want to strip image URLs from lots of differently formed HTML.
I have this already:
NSRegularExpression *regex = [[NSRegularExpression alloc]
initWithPattern:@"(?<=img src=\").*?(?=\")"
options:NSRegularExpressionCaseInsensitive error:nil];
This works fine if the HTML is formed like <img src="someurl.jpg" alt="" .../>, but this isn't always the case: sometimes there are other attributes before src, which the pattern doesn't pick up.
It's a difficult thing to do with regular expressions. You are generally better off using an XML parser and XPath. However, if the HTML isn't very valid (even after running it through TidyHTML), you can find that XPath just won't work very well.
If you must look for images using regular expressions, I would suggest something like:
<\\s*?img\\s+[^>]*?\\s*src\\s*=\\s*([\"\'])((\\\\?+.)*?)\\1[^>]*?>
So assuming you have rawHTML in a string with the same name, use:
NSRegularExpression* regex = [[NSRegularExpression alloc] initWithPattern:@"<\\s*?img\\s+[^>]*?\\s*src\\s*=\\s*([\"\'])((\\\\?+.)*?)\\1[^>]*?>" options:NSRegularExpressionCaseInsensitive error:nil];
NSArray *imagesHTML = [regex matchesInString:rawHTML options:0 range:NSMakeRange(0, [rawHTML length])];
[regex release];
If you want to get out the actual image URL from the source then I'd use something like (run over the output from previous regex):
(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?«»“”‘’]))
Yeah, I know, crazy! But you did ask :-)
Credit: That final regex is from John Gruber/Daring Fireball.
This is some code I've written in the past that returns an array of NSURL objects for the images it finds. I use it when trying (as a last resort) to get image URLs from very broken HTML:
- (NSArray *)extractSuitableImagesFromRawHTMLEntry:(NSString *)rawHTML {
NSMutableArray *images = [[NSMutableArray alloc] init];
if(rawHTML!=nil&&[rawHTML length]!=0) {
NSRegularExpression* regex = [[NSRegularExpression alloc] initWithPattern:@"<\\s*?img\\s+[^>]*?\\s*src\\s*=\\s*([\"\'])((\\\\?+.)*?)\\1[^>]*?>" options:NSRegularExpressionCaseInsensitive error:nil];
NSArray *imagesHTML = [regex matchesInString:rawHTML options:0 range:NSMakeRange(0, [rawHTML length])];
[regex release];
for (NSTextCheckingResult *image in imagesHTML) {
NSString *imageHTML = [rawHTML substringWithRange:image.range];
NSRegularExpression* regex2 = [[NSRegularExpression alloc] initWithPattern:@"(?i)\\b((?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?«»“”‘’]))" options:NSRegularExpressionCaseInsensitive error:nil];
NSArray *imageSource=[regex2 matchesInString:imageHTML options:0 range:NSMakeRange(0, [imageHTML length])];
[regex2 release];
NSString *imageSourceURLString=nil;
for (NSTextCheckingResult *result in imageSource) {
NSString *str=[imageHTML substringWithRange:result.range];
//DebugLog(@"url is %@",str);
if([str hasPrefix:@"http"]) {
//strip off any crap after file extension
//(the search options are bit flags, so they are combined with | rather than &&)
//find jpg
NSRange r1=[str rangeOfString:@".jpg" options:NSBackwardsSearch|NSCaseInsensitiveSearch];
if(r1.location==NSNotFound) {
//find jpeg
NSRange r2=[str rangeOfString:@".jpeg" options:NSBackwardsSearch|NSCaseInsensitiveSearch];
if(r2.location==NSNotFound) {
//find png
NSRange r3=[str rangeOfString:@".png" options:NSBackwardsSearch|NSCaseInsensitiveSearch];
if(r3.location==NSNotFound) {
break;
} else {
imageSourceURLString=[str substringWithRange:NSMakeRange(0, r3.location+r3.length)];
}
} else {
//jpeg was found
imageSourceURLString=[str substringWithRange:NSMakeRange(0, r2.location+r2.length)];
break;
}
} else {
//jpg was found
imageSourceURLString=[str substringWithRange:NSMakeRange(0, r1.location+r1.length)];
break;
}
}
}
if(imageSourceURLString==nil) {
//DebugLog(@"No image found.");
} else {
DebugLog(@"*** image found: %@", imageSourceURLString);
NSURL *imageURL=[NSURL URLWithString:imageSourceURLString];
if(imageURL!=nil) {
[images addObject:imageURL];
}
}
}
}
return [images autorelease];
}

Text extraction with NSRegularExpression

Given an NSString *test = @"...href="/functions?q=KEYWORD\x26amp...";
How can I extract the word KEYWORD from the string using NSRegularExpression?
I have tried the following NSRegularExpression on iOS SDK 4.2, but it is not able to find the text. Does the following code look okay?
NSRegularExpression *testRegex = [NSRegularExpression regularExpressionWithPattern:@"(?<=href=\"\\/functions\\?q=).+?(?=\\x26amp])" options:0 error:nil];
NSRange result = [testRegex rangeOfFirstMatchInString:test options:0 range:NSMakeRange(0, [test length])];
You have a stray "]" in your regex, right before the end, which is probably causing a problem. You also need four backslashes in the source to match one literal backslash in the input string (doubled once to escape it in the C string, then doubled again to escape it in the regex). I'd suggest two things. First, pass something in the error parameter and take a look at it in the debugger. Second, I'm not a big fan of lookahead/lookbehind expressions. I think this style is more readable:
NSString *regexStr = @"href=\"\\/functions\\?q=(.+?)\\\\x26amp";
NSError *error = nil;
NSRegularExpression *testRegex = [NSRegularExpression regularExpressionWithPattern:regexStr options:0 error:&error];
if( testRegex == nil ) NSLog( @"Error making regex: %@", error );
NSTextCheckingResult *result = [testRegex firstMatchInString:test options:0 range:NSMakeRange(0, [test length])];
NSRange range = [result rangeAtIndex:1];

NSRegularExpression and capture groups on iphone

I need a little kickstart on regex on the iphone.
Basically I have a list of dates in a private MediaWiki in the form of
*185 BC: SOME EVENT HERE
*2001: SOME OTHER EVENT MUCH LATER
I now want to parse that into an Object that has a NSDate property and a -say- NSString property.
I have this so far: (rawContentString contains the mediawiki syntax of the page)
NSString* regexString =@"\\*( *[0-9]{1,}.*): (.*)";
NSRegularExpressionOptions options = NSRegularExpressionCaseInsensitive;
NSError* error = NULL;
NSRegularExpression* regex = [NSRegularExpression regularExpressionWithPattern:regexString options:options error:&error];
if (error) {
    NSLog(@"%@", [error description]);
}
NSArray* results = [regex matchesInString:rawContentString options:0 range:NSMakeRange(0, [rawContentString length])];
for (NSTextCheckingResult* result in results) {
    NSString* resultString = [rawContentString substringWithRange:result.range];
    NSLog(@"%@",resultString);
}
Unfortunately, I think the regex is not working the way I hope, and I don't know how to capture the matched date and text.
Any help would be great.
BTW: is there by any chance a collection of regex patterns for MediaWiki syntax out there somewhere?
Thanks in advance
Heiko
My issue was that I was using matchesInString. I switched to firstMatchInString, which returns a single NSTextCheckingResult holding one range per capture group (read with rangeAtIndex:). The results in the array returned by matchesInString carry the same per-group ranges.
This is counterintuitive, but it worked.
I got the answer from http://snipplr.com/view/63340/
My code (to parse credit card track data):
NSRegularExpression *track1Pattern = [NSRegularExpression regularExpressionWithPattern:@"%.(.+?)\\^(.+?)\\^([0-9]{2})([0-9]{2}).+?\\?." options:NSRegularExpressionCaseInsensitive error:&error];
NSTextCheckingResult *result = [track1Pattern firstMatchInString:trackString options:NSMatchingReportCompletion range:NSMakeRange(0, trackString.length)];
self.cardNumber = [trackString substringWithRange: [result rangeAtIndex:1]];
self.cardHolderName = [trackString substringWithRange: [result rangeAtIndex:2]];
self.expirationMonth = [trackString substringWithRange: [result rangeAtIndex:3]];
self.expirationYear = [trackString substringWithRange: [result rangeAtIndex:4]];
As for the regex, I think something along these lines:
\*([ 0-9]{1,}.*):(.*)
should work better for what you need. You're not escaping the first *, and why is there a * in the first group statement?
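To tie this back to the original question, here is a rough sketch of reading both capture groups for every line of the list. The function name ParseTimelineEntries and the exact pattern are assumptions made for illustration; the pattern anchors each match to one line and stops the date group at the first colon. Turning the date text into an NSDate is left out, since values like "185 BC" need custom handling.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: returns an array of two-element arrays,
// each holding the date text and the event text of one "*DATE: EVENT" line.
static NSArray *ParseTimelineEntries(NSString *wikiText) {
    NSError *error = nil;
    // With NSRegularExpressionAnchorsMatchLines, ^ and $ match at line boundaries.
    // Group 1: digits plus anything up to the first colon. Group 2: rest of the line.
    NSRegularExpression *regex =
        [NSRegularExpression regularExpressionWithPattern:@"^\\*[ \\t]*([0-9]+[^:\\n]*):[ \\t]*(.*)$"
                                                  options:NSRegularExpressionAnchorsMatchLines
                                                    error:&error];
    if (regex == nil) {
        NSLog(@"Invalid pattern: %@", error);
        return [NSArray array];
    }
    NSMutableArray *entries = [NSMutableArray array];
    NSArray *matches = [regex matchesInString:wikiText
                                      options:0
                                        range:NSMakeRange(0, [wikiText length])];
    for (NSTextCheckingResult *match in matches) {
        // rangeAtIndex:0 would be the whole line; 1 and 2 are the groups.
        NSString *dateText = [wikiText substringWithRange:[match rangeAtIndex:1]];
        NSString *eventText = [wikiText substringWithRange:[match rangeAtIndex:2]];
        [entries addObject:[NSArray arrayWithObjects:dateText, eventText, nil]];
    }
    return entries;
}
```

Each NSTextCheckingResult in the array from matchesInString: exposes its groups through rangeAtIndex:, so the loop works the same way as the firstMatchInString: example, just once per line.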