How can I remove the entire head tag from an HTML file with NSRegularExpression? Can someone give me a regex?
Thanks in advance,
Ph99Ph
There is none! HTML is a type-2 language and thus not parsable with a regular expression (type-3).
See this wiki article in case of doubt.
Lots of people use regex for parsing/editing HTML. This works quite well in simple cases but is utterly error prone.
This being said: You should have fairly reliable results with this regex:
<head>.+?</head>
This requires "." to also match line breaks. If it doesn't, then use this:
<head>(?:.|\n|\r)+?</head>
Again: This is error prone, don't do it.
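For illustration only, here is a minimal Python sketch of the same pattern (the question is about NSRegularExpression, so this is a stand-in, and strip_head is a hypothetical helper name). The DOTALL flag plays the role of making "." match line breaks:

```python
import re

# Non-greedy match from <head> to the first </head>.
# DOTALL lets "." match line breaks as well.
HEAD_RE = re.compile(r"<head>.+?</head>", re.DOTALL | re.IGNORECASE)

def strip_head(html: str) -> str:
    """Remove every literal <head>...</head> block (hypothetical helper)."""
    return HEAD_RE.sub("", html)

sample = "<html><head>\n<title>T</title>\n</head><body>Hi</body></html>"
print(strip_head(sample))  # <html><body>Hi</body></html>
```

Note that an opening tag with attributes, such as <head class="x">, is left untouched by this pattern, which is one of the ways the approach breaks.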
What you should use is an XML parser such as NSXMLParser.
Please see the accepted answer at RegEx match open tags except XHTML self-contained tags. Or any version of this exact same question posted each day since the beginning of Stack Overflow.
In short, you cannot reliably parse HTML with Regular Expressions. RegEx is simply not advanced enough because of the complexities of HTML.
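Use something like this:
// Normalize any opening head tag (attributes, stray spaces) to a plain <head>.
result = System.Text.RegularExpressions.Regex.Replace(result,
@"<( )*head([^>])*>", "<head>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Normalize the closing tag to a plain </head>.
result = System.Text.RegularExpressions.Regex.Replace(result,
@"(<( )*(/)( )*head( )*>)", "</head>",
System.Text.RegularExpressions.RegexOptions.IgnoreCase);
// Remove the whole block. Singleline lets "." match line breaks; ".*?" stops at the first </head>.
result = System.Text.RegularExpressions.Regex.Replace(result,
"(<head>).*?(</head>)", " ",
System.Text.RegularExpressions.RegexOptions.IgnoreCase |
System.Text.RegularExpressions.RegexOptions.Singleline);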
I have tried to minify a JSON file with str_replace(), but it doesn't work well the way I used it.
//I want to minify a json file with php.
//Here I am trying to replace ", {" with ",{"
//$result = preg_replace('/abc/', 'def', $string); # Replace all 'abc' with 'def'
$new = preg_replace('/, {/', ',{', $new); //doesn't work.. why?
As for the specific issue, { is a special character in regular expressions and you need to escape it. See the Meta-characters section of PCRE syntax in the PHP manual. So change the first argument to '/, \{/'. Never mind, as @Hugo demonstrated, it should work, and without telling us how your approach failed, we can't help more.
More importantly, this is terribly error-prone. What about a JSON string like ['hello, {name}']? Your attempt will incorrectly "minify" the part inside the quotes and turn it into ['hello,{name}']. Not a critical bug in this case, but it might be more severe in other cases. Handling string literals properly is a pain; the simplest way to actually minify JSON strings is json_encode(json_decode($json)), since PHP by default does not pretty print or put unnecessary whitespace into JSON.
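The decode-then-encode idea is not specific to PHP. As a rough sketch, here is the same round trip in Python (a stand-in for json_encode(json_decode($json)); minify_json is a hypothetical helper name). Because the text is parsed first, whitespace inside string values is left alone:

```python
import json

def minify_json(text: str) -> str:
    """Parse the JSON, then re-serialize it without optional whitespace."""
    # separators=(",", ":") drops the spaces Python adds by default.
    return json.dumps(json.loads(text), separators=(",", ":"), ensure_ascii=False)

pretty = '{"greeting": "hello, {name}", "items": [1, 2, 3]}'
print(minify_json(pretty))  # {"greeting":"hello, {name}","items":[1,2,3]}
```

The ", {" inside the string value survives, which is exactly what the plain text replacement gets wrong.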
And finally, maybe you don't really need to do this. If you are doing this to save HTTP traffic or something, just make sure your server gzips responses, caches properly, etc.
Managed to narrow the code down a lot more:
http://pastebin.com/J40Atm9m
Sorry to be a pain, but I really thought I had it cracked by using uri_escape in the GetQueryString subroutine. Now I'm really out of ideas; otherwise I wouldn't ask.
Any insights are much appreciated.
Martin
That is a lot of code. A reduced test case would be helpful.
Rather than read all of it, I'm going to assume that this is what you are doing:
You get raw data
You put raw data in a URI
You encode the URI for HTML
You put the encoded URI in the HTML
If so, then what you missed is this:
You need to encode the data for the URI.
HTML::Escape isn't supposed to escape "#" because "#" isn't unsafe for HTML. The problem is that you're not URI-escaping your data before you're putting it into a URI; use URI::Escape for that.
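The order of the two escaping steps can be sketched as follows. This is Python rather than Perl, purely to illustrate the idea; build_link, the example.com URL, and the q and page parameter names are all made up for the example:

```python
from html import escape
from urllib.parse import quote

def build_link(base: str, value: str) -> str:
    """Hypothetical helper: URI-escape the data first, HTML-escape the URI second."""
    # Step 1: percent-encode the raw data so "#", "&" and spaces are safe in a URI.
    uri = f"{base}?q={quote(value, safe='')}&page=1"
    # Step 2: HTML-escape the finished URI before placing it in an attribute.
    return f'<a href="{escape(uri, quote=True)}">search</a>'

print(build_link("http://example.com/search", "C# & friends"))
# <a href="http://example.com/search?q=C%23%20%26%20friends&amp;page=1">search</a>
```

The "#" becomes %23 in step 1; step 2 only touches the "&" that separates the two parameters.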
"artistName":"Travie McCoy", "collectionName":"Billionaire (feat. Bruno Mars) - Single", "trackName":"Billionaire (feat. Bruno Mars)",
I wish to get the artist name, so Travie McCoy, from within that code using a regex. Please note I am using RegexKitLite for the iPhone SDK, if this changes things.
Thanks
"?artistName"?\s*:\s*"([^"]*)("|$) should do the trick. It even handles some variations in the string:
White space before and after the :
artistName with and without the quotes
missing " at the end of the artist name if it is the last thing on the line
But there will be many more variations in the input you might encounter that this regex will not match.
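To see the pattern in action, here is a small Python sketch (the question uses RegexKitLite, so Python's re module is only a stand-in, and ARTIST_RE is a name chosen for the example). It runs against the sample line and against one of the variations listed above:

```python
import re

# Same pattern as above; group 1 holds the artist name.
ARTIST_RE = re.compile(r'"?artistName"?\s*:\s*"([^"]*)("|$)')

line = '"artistName":"Travie McCoy", "collectionName":"Billionaire (feat. Bruno Mars) - Single",'
match = ARTIST_RE.search(line)
print(match.group(1))  # Travie McCoy

# Variation: no quotes around the key, spaces around ":", closing quote missing.
variant = ARTIST_RE.search('artistName : "Bruno Mars')
print(variant.group(1))  # Bruno Mars
```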
Also you don’t want to use a regex for matching this for performance reasons. Right now you might only be interested in the artistName field. But some time later you will want information from the other fields. If you just change the field name in the regex you’ll have to match the whole string again. Much better to use a parser and transform the whole string into a dictionary where you can access the different fields easily. Parsing the whole string shouldn’t take much longer than matching the last key/value pair using a regex.
This looks like some kind of JSON, there are lots of good and complete parsers available. It isn’t hard to write one yourself though. You could write a simple recursive descent parser in a couple of hours. I think this is something every programmer should have done at least once.
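As a rough illustration of the parser route, here is a Python sketch using the standard json module. It assumes the sample is the inside of a JSON object, so it strips the trailing comma and wraps the fragment in braces; that assumption may not hold for the real input:

```python
import json

fragment = ('"artistName":"Travie McCoy", '
            '"collectionName":"Billionaire (feat. Bruno Mars) - Single", '
            '"trackName":"Billionaire (feat. Bruno Mars)",')

# Assumption: the fragment is the body of one JSON object.
# Drop the trailing comma and add the braces before parsing.
record = json.loads("{" + fragment.rstrip().rstrip(",") + "}")

print(record["artistName"])  # Travie McCoy
print(record["trackName"])   # Billionaire (feat. Bruno Mars)
```

Once the string is parsed, every field is a dictionary lookup and nothing needs to be matched again.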
\"?artistName\"?\s*:\s*\"([^\"]*)(\"|$)
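That's for Objective-C, where the quotes are escaped for a string literal. Inside a string literal each backslash in \s must also be doubled (\\s).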
I am parsing an HTML page. Let's say this page lists all players in a football team, and those who are seniors are in bold. I can't parse the file line by line and look for the strong tag because in my real example the pattern is much more complex and spans multiple lines.
Something like this:
<strong>Senior:</strong> John Smith
Junior: Joe Smith
<strong>Senior:</strong> Mike Johnson
and so on. How do I write a Perl regex to get the names of all seniors?
Thanks
The reason you're having difficulty writing a regex to do this is because it's the wrong tool for the job. You should use a real HTML parser like HTML::Parser, HTML::TokeParser, or HTML::TreeBuilder.
I can't give a specific example because I doubt that's exactly what your HTML looks like. Your sample appears to be missing some punctuation or additional tags.
You don't have to parse a file line by line -- you can read in the entire file at once, if it's small, or you can parse it paragraph by paragraph, using whatever separator you like.
The two magic things you need to do this are: 1. set the "line separator" variable, $/ (see perldoc perlvar), to something other than a newline, and 2. let . match newlines, so a pattern can span lines, with the /s modifier (see perldoc perlre).
Alternatively, you should use an HTML parser, which is what you would have to do if you are attempting to find things like nested tags.
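For a sense of what the parser route looks like, here is a sketch using Python's standard html.parser (not one of the Perl modules; SeniorCollector is a hypothetical class). It assumes the name is the first line of text following a <strong>Senior:</strong> label, which is a guess based on the sample in the question:

```python
from html.parser import HTMLParser

class SeniorCollector(HTMLParser):
    """Collect the text that follows each <strong>Senior:</strong> label."""

    def __init__(self):
        super().__init__()
        self.in_strong = False    # currently inside a <strong> element
        self.expect_name = False  # the last <strong> text was "Senior:"
        self.seniors = []

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.in_strong = True

    def handle_endtag(self, tag):
        if tag == "strong":
            self.in_strong = False

    def handle_data(self, data):
        if self.in_strong:
            self.expect_name = data.strip() == "Senior:"
        elif self.expect_name and data.strip():
            # Assumption: the name is the first line of text after the label.
            self.seniors.append(data.strip().splitlines()[0].strip())
            self.expect_name = False

page = ("<strong>Senior:</strong> John Smith\n"
        "Junior: Joe Smith\n"
        "<strong>Senior:</strong> Mike Johnson\n")
collector = SeniorCollector()
collector.feed(page)
collector.close()
print(collector.seniors)  # ['John Smith', 'Mike Johnson']
```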
You have to provide a specific example.
Perl regular expressions can occasionally be used for HTML parsing, but only when you know specifically what the page looks like and that it's not too complex.
If you don't know exactly or it is too complex, use the parsers that cjm links.
It's not clear from your example how the end of the senior name is going to be determined, but something like this:
my @seniors = $filecontents =~ m!<strong>Senior:</strong>\s*([^<]+)!g;
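The question of where a name ends matters in practice. Here is the same pattern tried in Python against the sample from the question (a stand-in for the Perl one-liner). The second pattern assumes a name ends at the end of its line, which is only a guess about the real page:

```python
import re

file_contents = (
    "<strong>Senior:</strong> John Smith\n"
    "Junior: Joe Smith\n"
    "<strong>Senior:</strong> Mike Johnson\n"
)

# Pattern as given: captures everything up to the next "<", junior line included.
up_to_tag = re.findall(r"<strong>Senior:</strong>\s*([^<]+)", file_contents)
print(up_to_tag)    # ['John Smith\nJunior: Joe Smith\n', 'Mike Johnson\n']

# Assumed variant: also stop at a newline, so only the rest of the line is captured.
up_to_eol = re.findall(r"<strong>Senior:</strong>\s*([^<\n]+)", file_contents)
print(up_to_eol)    # ['John Smith', 'Mike Johnson']
```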
Is there a method built in to NSString that tokenizes the string and searches the beginning of each token? the compare method seems to only do the beginning of a string, and using rangeOfString isn't really sufficient because it doesn't have knowledge of tokens. Right now I'm thinking the best way to do this is to call
[myString componentsSeparatedByString:@" "]
and then loop over the resulting array, calling compare on each component of the string. Is this built-in and I just missed it?
Using CFStringTokenizer for, um, tokenizing strings will be more robust than splitting on @" ", but searching the results is still left up to you.
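The search step itself is short. A minimal Python sketch of "split into tokens, then test the start of each token" (not Cocoa code; has_token_with_prefix is a hypothetical helper, and splitting on whitespace is the same simplification as splitting on a space):

```python
def has_token_with_prefix(text: str, prefix: str) -> bool:
    """Return True if any whitespace-separated token starts with prefix."""
    prefix = prefix.casefold()
    # casefold() makes the comparison case-insensitive.
    return any(token.casefold().startswith(prefix) for token in text.split())

print(has_token_with_prefix("The quick brown fox", "bro"))  # True
print(has_token_with_prefix("The quick brown fox", "row"))  # False
```

The second call returns False because "row" occurs only in the middle of a token, which is the case a plain substring search cannot tell apart.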
You may want to look into RegexKit Lite:
http://regexkit.sourceforge.net/#RegexKitLite/
Although it's a third party library, it's basically a very small (one class) wrapper built around the built-in fairly powerful regular expression engine.
This seems more useful here, since non-capturing groups can match around the token separators while a capturing group picks out the text you are looking for, with or without the remaining text between tokens. If you have not used regular expressions much before, you'll want to read a reference first; the syntax is cryptic but very powerful, and it lets you separate the pattern you match from the content you want to extract.
I'm also not sure you can use CFStringTokenizer on the iPhone since the iPhone specific doc set has no reference for it.