iPhone - RegExp for URL validity

I have a chat view where users can send URLs to one another.
When a message contains a URL, I want to let the user tap the link and open it in a web view.
I'm using IFTweetLabel, which uses RegexKitLite.
Currently the only support available is for URLs that start with http/https.
I want to support links without the http prefix, for example www.nytimes.com, and even without the "www", such as nytimes.com (and a bunch of other extensions).
This is the http/https prefix regular expression:
@"([hH][tT][tT][pP][sS]?:\\/\\/[^ ,'\">\\]\\)]*[^\\. ,'\">\\]\\)])"
Can someone tell me the other regular expressions I need to meet these requirements?
I tried using this one, but adding it to Objective-C code generates a lot of issues.
Thanks

The following is John Gruber's URL-matching regex:
(?i)\b(?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’])
The following is a regex I came up with by blending a few other regexes I had around and a good chunk of Gruber's regex:
(?i)\b(?:(?:[a-z][\w\-]+://(?:\S+?(?::\S+?)?\@)?)|(?:(?:[a-z0-9\-]+\.)+[a-z]{2,4}))(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]*\)))*\))*(?<![\s`!()\[\]{};:'".,<>?«»“”‘’])
The following is a sample program that demonstrates, via RegexKitLite, what each regex matches against the sample text of:
Did you see
http://www.stackoverflow.com? Or
http://www.stackoverflow.com/?
And then there is
www.stackoverflow.com/, along with
www.stackoverflow.com/index.
Maybe something like stackoverflow.com
with extra stackoverflow.com? Or
"stackoverflow.com"?
Perhaps jobs.stackoverflow.com, or
'http://twitter.com/#!/CHOCKENBERRY',
the CHOCKLOCK!!
File
#file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook?
Maybe
http://www.yahoo.com/index///i.html!
http://www.yahoo.com/////xyz.html?!
The code:
#import <Foundation/Foundation.h>
#import "RegexKitLite.h"
int main(int argc, char *argv[]) {
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];
    NSString *urlRegex = @"(?i)\\b(?:(?:[a-z][\\w\\-]+://(?:\\S+?(?::\\S+?)?\\@)?)|(?:(?:[a-z0-9\\-]+\\.)+[a-z]{2,4}))(?:[^\\s()<>]+|\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]*\\)))*\\))*(?<![\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])";
    // John Gruber's URL matching regex from http://daringfireball.net/2010/07/improved_regex_for_matching_urls
    NSString *gruberURLRegex = @"(?i)\\b(?:[a-z][\\w-]+:(?:/{1,3}|[a-z0-9%])|www\\d{0,3}[.]|[a-z0-9.\\-]+[.][a-z]{2,4}/)(?:[^\\s()<>]+|\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\))+(?:\\(([^\\s()<>]+|(\\([^\\s()<>]+\\)))*\\)|[^\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])";
    NSString *urlString = @"Did you see http://www.stackoverflow.com? Or http://www.stackoverflow.com/?\n\nAnd then there is www.stackoverflow.com/, along with www.stackoverflow.com/index.\n\nMaybe something like stackoverflow.com with extra stackoverflow.com? Or \"stackoverflow.com\"?\n\nPerhaps jobs.stackoverflow.com, or 'http://twitter.com/#!/CHOCKENBERRY', the CHOCKLOCK!!\n\nFile #file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook?\n\nMaybe http://www.yahoo.com/index///i.html! http://www.yahoo.com/////xyz.html?!";
    NSLog(@"String :\n\n%@\n\n", urlString);
    NSLog(@"Matches: %@\n", [urlString componentsMatchedByRegex:urlRegex]);
    NSLog(@"Gruber URL Regex Matches: %@\n", [urlString componentsMatchedByRegex:gruberURLRegex]);
    [pool release]; pool = NULL;
    return(0);
}
Compile with:
shell% gcc -o url url.m RegexKitLite.m -framework Foundation -licucore
When run:
shell% ./url
2011-05-27 20:32:58.204 url[25520:903] String :
Did you see http://www.stackoverflow.com? Or http://www.stackoverflow.com/?
And then there is www.stackoverflow.com/, along with www.stackoverflow.com/index.
Maybe something like stackoverflow.com with extra stackoverflow.com? Or "stackoverflow.com"?
Perhaps jobs.stackoverflow.com, or 'http://twitter.com/#!/CHOCKENBERRY', the CHOCKLOCK!!
File #file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook?
Maybe http://www.yahoo.com/index///i.html! http://www.yahoo.com/////xyz.html?!
2011-05-27 20:32:58.211 url[25520:903] Matches: (
"http://www.stackoverflow.com",
"http://www.stackoverflow.com/",
"www.stackoverflow.com/",
"www.stackoverflow.com/index",
"stackoverflow.com",
"stackoverflow.com",
"stackoverflow.com",
"jobs.stackoverflow.com",
"http://twitter.com/#!/CHOCKENBERRY",
"file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook",
"http://www.yahoo.com/index///i.html",
"http://www.yahoo.com/////xyz.html"
)
2011-05-27 20:32:58.213 url[25520:903] Gruber URL Regex Matches: (
"http://www.stackoverflow.com",
"http://www.stackoverflow.com/",
"www.stackoverflow.com/",
"www.stackoverflow.com/index",
"http://twitter.com/#!/CHOCKENBERRY",
"file:///Users/johne/rkl/rkl.html#RegexKitLiteCookbook",
"http://www.yahoo.com/index///i.html",
"http://www.yahoo.com/////xyz.html"
)
EDIT 2011/05/27: Made a minor change to the regex to fix a problem where it wasn't matching ( ) parentheses correctly.
EDIT 2011/05/27: Found some additional corner cases that the regex above didn't handle well. Updated regex:
(?i)\b(?:[a-z][\w\-]+://(?:\S+?(?::\S+?)?\@)?)?(?:(?:(?<!:/|\.)(?:(?:[a-z0-9\-]+\.)+[a-z]{2,4}(?![a-z]))|(?<=://)/))(?:(?:[^\s()<>]+|\((?:[^\s()<>]+|(?:\([^\s()<>]*\)))*\))*)(?<![\s`!()\[\]{};:'".,<>?«»“”‘’])
... as an Obj-C string:
@"(?i)\\b(?:[a-z][\\w\\-]+://(?:\\S+?(?::\\S+?)?\\@)?)?(?:(?:(?<!:/|\\.)(?:(?:[a-z0-9\\-]+\\.)+[a-z]{2,4}(?![a-z]))|(?<=://)/))(?:(?:[^\\s()<>]+|\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]*\\)))*\\))*)(?<![\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])";
The OP also asked how to make sure the trailing TLD is "valid". Here's the same regex, in Obj-C string form, with all of the currently valid TLDs (as of 2011/05/27):
@"(?i)\\b(?:[a-z][\\w\\-]+://(?:\\S+?(?::\\S+?)?\\@)?)?(?:(?:(?<!:/|\\.)(?:(?:[a-z0-9\\-]+\\.)+(?:(ac|ad|ae|aero|af|ag|ai|al|am|an|ao|aq|ar|arpa|as|asia|at|au|aw|ax|az|ba|bb|bd|be|bf|bg|bh|bi|biz|bj|bm|bn|bo|br|bs|bt|bv|bw|by|bz|ca|cat|cc|cd|cf|cg|ch|ci|ck|cl|cm|cn|co|com|coop|cr|cu|cv|cx|cy|cz|de|dj|dk|dm|do|dz|ec|edu|ee|eg|er|es|et|eu|fi|fj|fk|fm|fo|fr|ga|gb|gd|ge|gf|gg|gh|gi|gl|gm|gn|gov|gp|gq|gr|gs|gt|gu|gw|gy|hk|hm|hn|hr|ht|hu|id|ie|il|im|in|info|int|io|iq|ir|is|it|je|jm|jo|jobs|jp|ke|kg|kh|ki|km|kn|kp|kr|kw|ky|kz|la|lb|lc|li|lk|lr|ls|lt|lu|lv|ly|ma|mc|md|me|mg|mh|mil|mk|ml|mm|mn|mo|mobi|mp|mq|mr|ms|mt|mu|museum|mv|mw|mx|my|mz|na|name|nc|ne|net|nf|ng|ni|nl|no|np|nr|nu|nz|om|org|pa|pe|pf|pg|ph|pk|pl|pm|pn|pr|pro|ps|pt|pw|py|qa|re|ro|rs|ru|rw|sa|sb|sc|sd|se|sg|sh|si|sj|sk|sl|sm|sn|so|sr|st|su|sv|sy|sz|tc|td|tel|tf|tg|th|tj|tk|tl|tm|tn|to|tp|tr|travel|tt|tv|tw|tz|ua|ug|uk|us|uy|uz|va|vc|ve|vg|vi|vn|vu|wf|ws|xn--0zwm56d|xn--11b5bs3a9aj6g|xn--3e0b707e|xn--45brj9c|xn--80akhbyknj4f|xn--90a3ac|xn--9t4b11yi5a|xn--clchc0ea0b2g2a9gcd|xn--deba0ad|xn--fiqs8s|xn--fiqz9s|xn--fpcrj9c3d|xn--fzc2c9e2c|xn--g6w251d|xn--gecrj9c|xn--h2brj9c|xn--hgbk6aj7f53bba|xn--hlcj6aya9esc7a|xn--j6w193g|xn--jxalpdlp|xn--kgbechtv|xn--kprw13d|xn--kpry57d|xn--lgbbat1ad8j|xn--mgbaam7a8h|xn--mgbayh7gpa|xn--mgbbh1a71e|xn--mgbc0a9azcg|xn--mgberp4a5d4ar|xn--o3cw4h|xn--ogbpf8fl|xn--p1ai|xn--pgbs0dh|xn--s9brj9c|xn--wgbh1c|xn--wgbl6a|xn--xkc2al3hye2a|xn--xkc2dl3a5ee0h|xn--yfro4i67o|xn--ygbi2ammx|xn--zckzah|xxx|ye|yt|za|zm|zw))(?![a-z]))|(?<=://)/))(?:(?:[^\\s()<>]+|\\((?:[^\\s()<>]+|(?:\\([^\\s()<>]*\\)))*\\))*)(?<![\\s`!()\\[\\]{};:'\".,<>?«»“”‘’])";

This will match both http://example.org and www.example.org.
@"(([hH][tT][tT][pP][sS]?:\\/\\/|www\\.)[^ ,'\">\\]\\)]*\\.[^\\. ,'\">\\]\\)]{2,6})"
Note that I added a capture group, so check the match result returned by the regex to make sure the right parts are re-inserted in the right place.
If you could post the entire code snippet, it would be easier.
RegExp explanation:
(
(
[hH][tT][tT][pP][sS]?:\/\/ # Match HTTP/http (and hTtP :)
| # OR
www\. # www<literal DOT>
)
[^ ,'\">\]\)]* # Match zero or more characters that are not a space, comma, apostrophe, quotation mark, greater-than sign, right square bracket, or right parenthesis
\. # Match <literal DOT>
[^\. ,'\">\]\)]{2,6} # Match 2-6 characters that are not a dot, space, comma, apostrophe, quotation mark, greater-than sign, right square bracket, or right parenthesis
)
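To see how this pattern behaves on sample input, here is a minimal sketch that uses C++'s std::regex rather than RegexKitLite (a substitution made only so the example runs without extra libraries; it would also compile in an Objective-C++ .mm file). The function name find_links is hypothetical, and the icase flag stands in for the [hH][tT][tT][pP] spelling.

```cpp
#include <regex>
#include <string>
#include <vector>

// Collects every substring of `text` matched by the prefix-based pattern above.
std::vector<std::string> find_links(const std::string& text) {
    // Same structure as the answer's pattern:
    // (http:// or https:// or www.) + body + "." + 2-6 trailing characters.
    static const std::regex link_re(
        R"re((https?://|www\.)[^ ,'">\])]*\.[^. ,'">\])]{2,6})re",
        std::regex::icase);
    std::vector<std::string> links;
    for (std::sregex_iterator it(text.begin(), text.end(), link_re), end; it != end; ++it) {
        links.push_back(it->str());
    }
    return links;
}
```

On "see http://example.org and www.example.org." this returns both links without the trailing period. Note that on "http://example.org/path" it stops at "http://example.org/pa", because the final {2,6} group caps how much may follow the last dot; whether that matters depends on the links your users send.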

You don't want to use a regular expression for this.
You want an NSDataDetector, and it'll find them all for you.

Related

wxWidgets wrong substring

I am trying to extract a substring out of some html code in wxWidgets but I can't get my method working properly.
content of to_parse:
[HTML CODE]
<html><head></head><body><font face="Segue UI" size=2 .....<font face="Segoe UI"size="2" color="#000FFF"><font face="#DFKai-SB" ... <b><u> the text </u></b></font></font></font></body></html>
[/HTML CODE] (sorry about the format)
wxString to_parse = SOStream.GetString();
size_t spos = to_parse.find_last_of("<font face=",wxString::npos);
size_t epos = to_parse.find_first_of("</font>",wxString::npos);
wxString retstring(to_parse.Mid(spos,epos));
wxMessageBox(retstring); // Output: always ---> tml>
As there are several font face tags in the HTML held in the to_parse variable, I would like to find the position of the last "<font face=" and the position of the first "</font>" closing tag.
For some reason, I always get the same output, which I did not expect: tml>
Can anyone spot the reason why?
The methods find_{last,first}_of() don't do what you seem to think they do. They behave the same way as the std::basic_string<> methods of the same name: they find the first (or last) occurrence of any single character from the string you pass to them; see the documentation.
If you want to search for a substring, use find().
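A minimal sketch of the difference, written against std::string since the answer notes that wxString follows the same semantics. The helper name between_last_and_first is hypothetical. With the question's input, find_last_of presumably lands on the 't' of the closing </html> (the last character that belongs to the set), and find_first_of started at npos finds nothing, which would explain the tml> output.

```cpp
#include <string>

// Returns the text from the last occurrence of `open` up to (not including)
// the first occurrence of `close` that follows it; empty string if either is missing.
std::string between_last_and_first(const std::string& text,
                                   const std::string& open,
                                   const std::string& close) {
    const std::size_t start = text.rfind(open);       // whole-substring search from the end
    if (start == std::string::npos) return "";
    const std::size_t end = text.find(close, start);  // whole-substring search going forward
    if (end == std::string::npos) return "";
    return text.substr(start, end - start);           // second argument is a length, not an end position
}
```

The same caveat about the second argument applies to wxString::Mid(), which takes a start position and a character count.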
Thank you for the answer. Yes you were right, I must have somehow been under the impression that Substring() / substr() / Mid() takes two wxStrings as parameters, which isn't the case.
wxString to_parse = SOStream.GetString();
to_parse = to_parse.Mid(to_parse.find("<p ")); // discards everything before "<p "
to_parse = to_parse.Remove(to_parse.find("</p>")); // removes "</p>" and everything after it
wxMessageBox(to_parse); // so we are left with everything between "<p" and "</p>"

NSRegularExpression omitting certain character

So I had the following regex:
@"(#|@)\\S+"
However, the \S here includes # and @ as well. How do I make this regex so that it's \S but not including # or @?
Basically I want a non-whitespace character excluding # and @.
Try (#|@)[^\s#@]+. [^\s#@] will match everything except whitespace characters, # and @.
And remember to double escape \ when put in the objective-c string literal.
To exclude a set of characters, put ^ at the start of a character class, for example [^#@].
You can use the following to strip those characters from a string:
NSString *string = @"Your String";
NSError *error = nil;
NSRegularExpression *regex = [NSRegularExpression regularExpressionWithPattern:@"(#|@|\\s)" options:NSRegularExpressionCaseInsensitive error:&error];
NSString *modifiedString = [regex stringByReplacingMatchesInString:string options:0
                                                             range:NSMakeRange(0, [string length]) withTemplate:@""];
As discussed on the other thread where you originally asked this question,
@"(#|@)\\w+"
will do what I think you want. If you really want every character except #, @, and whitespace, then
@"[#@][^#@\\s]+"
should do it. Both of these will take your string:
@"#baz#marroon#red#blue #big#cat#dog"
and if you use matchesInString:options:range: it gives you:
@"#baz"
@"#marroon"
@"#red"
@"#blue"
@"#big"
@"#cat"
@"#dog"
If this is not what you want, you should give us the input string and what you want as output, and we can tell you how to get it.
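The negated character class can be tried out quickly with the sketch below. It uses C++'s std::regex instead of NSRegularExpression (a substitution made so the example runs on its own); the function name extract_tags and the sample input are hypothetical, and # and @ are assumed to be the two marker characters.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every token that starts with '#' or '@' and runs until the next
// '#', '@', or whitespace character.
std::vector<std::string> extract_tags(const std::string& text) {
    static const std::regex tag_re(R"([#@][^#@\s]+)");
    std::vector<std::string> tags;
    for (std::sregex_iterator it(text.begin(), text.end(), tag_re), end; it != end; ++it) {
        tags.push_back(it->str());
    }
    return tags;
}
```

For "#baz@maroon#red @big#cat dog" this yields #baz, @maroon, #red, @big and #cat; the unmarked word dog is skipped.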

How to use regular expression in iPhone app to separate string by , (comma)

I have to read a .csv file that has three columns. While parsing the .csv file, I get the string in this format: Christopher Bass,\"Cry the Beloved Country Final Essay\",cbass@cgs.k12.va.us. I want to store the values of the three columns in an array, so I used the componentsSeparatedByString:@"," method. It successfully returns an array with three components:
Christopher Bass
Cry the Beloved Country Final Essay
cbass@cgs.k12.va.us
But when there is already a comma in the column value, like this:
Christopher Bass,\"Cry, the Beloved Country Final Essay\",cbass@cgs.k12.va.us
it separates the string into four components because there is a comma after "Cry":
Christopher Bass
Cry
the Beloved Country Final Essay
cbass@cgs.k12.va.us
So how can I handle this using a regular expression? I have the RegexKitLite classes, but which regular expression should I use? Please help!
Thanks
Any regular expression would probably run into the same problem. What you need is to sanitize your entries, either by escaping the commas or by quoting the strings this way: "My string". Otherwise you will have the same problem. Good luck.
For your example you would probably need to do something like:
\"Christopher Bass\",\"Cry\, the Beloved Country Final Essay\",\"cbass@cgs.k12.va.us\"
That way you could use a regexp or even the same method from the NSString class.
Not related at all, but the importance of sanitizing strings: http://xkcd.com/327/ hehehe.
How about this:
componentsSeparatedByRegex:@",\\\"|\\\","
This should split your string wherever " and , appear together in either order, resulting in a three-member array. This of course assumes that the second element in the string is always enclosed in quotation marks, and that the characters " and , never appear consecutively within the three components.
If either of these assumptions is incorrect, other methods to identify string components may be used, but it should be made clear that no generic solution exists. If the three component strings can contain " and , anywhere, not even a limited solution is possible in such cases:
Doe, John,\"\"Why Unescaped Strings Suck\", And Other Development Horror Stories\",Doe, John <john.doe@dev.null>
Hopefully there is nothing like the above in your CSV data. If there is, the data is basically unusable, and you should look into a better CSV exporter.
The regex you're searching for is: \\"(.*)\\"[ ^,]*|([^,]*),
In pseudocode: (('\"' && string_1 && '\"' && 0-n spaces) || string_2 except comma) && comma. In Obj-C:
NSString *str = @"Christopher Bass,\"Cry, the Beloved Country ,Final Essay\",cbass@cgs.k12.va.us,som";
NSString *regEx = @"\\\"(.*)\\\"[ ^,]*|([^,]*),";
NSMutableArray *split = [[str componentsSeparatedByRegex:regEx] mutableCopy];
[split removeObject:@""]; // both capture groups are always returned, so drop the empty one
NSLog(@"%@", split);
// OUTPUT:
2012-02-07 17:42:18.778 tmpapp[92170:c03] (
"Christopher Bass",
"Cry, the Beloved Country ,Final Essay",
"cbass@cgs.k12.va.us",
som
)
RegexKitLite will add both strings to the array, therefore you will end up with empty objects for your array. removeObject:#"" will delete those but if you need to maintain true empty values (eg. your source has val,,ue) you have to modify the code to the following:
str = [str stringByReplacingOccurrencesOfRegex:regEx withString:@"$1$2∏"];
NSArray *split = [str componentsSeparatedByString:@"∏"];
$1 and $2 are those two strings mentioned above, ∏ is in this case a character which will most likely never appear in normal text (and is easy to remember: option-shift-p).
The last part looks like it will never contain a comma. Neither will the first one as far as I can see...
What about splitting the string like this:
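NSArray *splitArr = [str componentsSeparatedByString:@","];
NSString *nameStr = [splitArr objectAtIndex:0];
NSString *emailStr = [splitArr lastObject];
// Collect the middle pieces and rejoin them with the commas that the split removed.
NSMutableArray *middle = [NSMutableArray array];
for (NSUInteger i = 1; i + 1 < [splitArr count]; ++i) {
    [middle addObject:[splitArr objectAtIndex:i]];
}
NSString *contentStr = [middle componentsJoinedByString:@","];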
This will use the first and last string as is, and combine the rest into the content.
Kind of a hack, but a name and an email address will never contain a comma, right?
Is the title guaranteed to have the quotation marks? And is it the only component that can have them? Because then componentsSeparatedByString:@"\"" should get you this:
Christopher Bass,
Cry, the Beloved Country Final Essay
,cbass@cgs.k12.va.us
Then use componentsSeparatedByString:@"," or substringFromIndex:/substringToIndex: to get rid of the two commas in the first and last components.
Here's a solution using substring:
NSString* input = @"Christopher Bass,\"Cry, the Beloved Country Final Essay\",cbass@cgs.k12.va.us";
NSArray* split = [input componentsSeparatedByString:@"\""];
NSString* part1 = [split objectAtIndex:0];
NSString* part2 = [split objectAtIndex:1];
NSString* part3 = [split objectAtIndex:2];
part1 = [part1 substringToIndex:[part1 length] - 1]; // drop the trailing comma
part3 = [part3 substringFromIndex:1];                // drop the leading comma
// Pass the strings as arguments rather than as the format string itself.
NSLog(@"%@", part1);
NSLog(@"%@", part2);
NSLog(@"%@", part3);
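The answers above all depend on where the quotation marks happen to fall. An alternative is a small state machine that tracks whether it is currently inside a quoted field. The following is a minimal sketch in plain C++ (usable from an Objective-C++ .mm file), not any of the RegexKitLite approaches above. The function name split_csv_line is hypothetical, and treating a doubled "" inside a quoted field as one literal quote is an assumption about the exporter.

```cpp
#include <string>
#include <vector>

// Splits one CSV line into fields. Commas inside double-quoted fields are kept,
// the surrounding quotes are dropped, and "" inside quotes becomes a single ".
std::vector<std::string> split_csv_line(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < line.size() && line[i + 1] == '"') {
                    current += '"';     // doubled quote: keep one literal quote
                    ++i;
                } else {
                    in_quotes = false;  // closing quote
                }
            } else {
                current += c;           // anything else, including commas, is data
            }
        } else if (c == '"') {
            in_quotes = true;           // opening quote
        } else if (c == ',') {
            fields.push_back(current);  // field separator
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);          // last field has no trailing comma
    return fields;
}
```

With the question's line this gives three fields, the second being "Cry, the Beloved Country Final Essay" with its comma intact, and an input such as val,,ue keeps its empty middle field.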

NSXMLParser and the "£" symbol

Hey all, slight problem when I read in an XML form.
NSXMLParser correctly sees the "£" symbol, but it prints out the Unicode escape, \U00a3.
I am just reading it into a string.
[pre_Currency appendString:[self cleanString:string]];
cleanString removes \n and \t, and I even added code that parses out the Unicode escape and replaces it with the "£" character.
Oddly enough, an NSLog here prints a "£" symbol, but when I add it to the dictionary in didEndElement,
[number setObject:[self cleanString:pre_Currency] forKey:@"pre_currency"];
it is added as a Unicode escape.
I can't understand why; searching Google turns up very little about parsing Unicode characters.
I don't know, but it might be useful to you to use the stringByReplacingOccurrencesOfString:withString: method:
NSString *specialChars = @"Your string with special characters.";
specialChars = [specialChars stringByReplacingOccurrencesOfString:@"\\U00a3" withString:@"£"];
Thanks
Yes, it happens for symbols and for text in languages other than English. I also got a reply here saying that we have to decode it as follows:
int c = ... /* your 4 text digit unicode ordinal converted to an integer */
charString = [NSString stringWithFormat:@"%C", c];
Original link - How to display Text in Arabic in UIlabel

How to get a part of a string from the main string in Objective-C

I have mainString, from which I need to get the part of the string that follows a keyword.
NSString *mainString = @"Hi how are you GET=dsjghdsghghdsjkghdjkhsg";
Now I need to get the string after the keyword "GET=".
Waiting for a reply.
Have a look at the NSString documentation.
Assuming your string really is so totally straightforward, you could do something like this:
NSArray *components = [mainString componentsSeparatedByString:@"GET="];
NSString *stringYouWant = [components objectAtIndex:1];
Obviously, this performs absolutely no error checking and makes a number of assumptions about the actual contents of mainString, but it should get you started.
Note, also, that the code is somewhat defensive in that it assumes that you are looking for GET= and not separating on =. Either way is a hack in terms of parsing, but... hey... hacks are sometimes the right answer.
You can use a regex via RegexKitLite:
NSString *mainString = @"Hi how are you GET=dsjghdsghghdsjkghdjkhsg";
NSString *matchedString = [mainString stringByMatching:@"GET=(.*)" capture:1L];
// matchedString == @"dsjghdsghghdsjkghdjkhsg";
The regex used, GET=(.*), basically says "Look for GET=, and then grab everything after that". The () specifies a capture group, which are useful for extracting just part of a match. Capture groups begin at 1, with capture group 0 being "the entire match". The part inside the capture group, .*, says "Match any character (the .) zero or more times (the *)".
If the string, in this case mainString, is not matched by the regex, then matchedString will be NULL.
You can get the location of the first occurrence of = and then take the substring of mainString from just after that location to the end of the string.
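The locate-then-substring idea can be sketched as follows, in plain C++ with std::string rather than NSString (usable from an Objective-C++ .mm file). The helper name text_after is hypothetical; it searches for the whole keyword instead of a bare = and reports a missing keyword instead of assuming one is present.

```cpp
#include <string>

// Copies everything that follows the first occurrence of `keyword` in `text`
// into `out`. Returns false, leaving `out` untouched, if the keyword is absent.
bool text_after(const std::string& text, const std::string& keyword, std::string& out) {
    const std::size_t pos = text.find(keyword);
    if (pos == std::string::npos) return false;
    out = text.substr(pos + keyword.size());  // skip past the keyword itself
    return true;
}
```

Called with the question's string and "GET=", it fills out with dsjghdsghghdsjkghdjkhsg.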