Safe to use Regex for this? (HTML) - iphone

I'm parsing some HTML, and I need to get all html in the body tag. My target string will always look something like this:
<body><div><img src="" />text etc</div></body>
However, I just need:
<div><img src="" />text etc</div>
My target string will always begin and end with those body tags. However, there is the often-repeated warning not to use Regex to parse HTML, and at the moment I do not have any viable alternative to Regex.
Question: Are there any safe Regex(s) to use in this case? Or should I just forget it?

You didn't show us what your regex is, but it's not as safe as using DOM parsing if it's as simple as:
<body>(.*?)</body>
...because it's possible that </body> is contained in an attribute string or comment. If you're willing to take that risk, then you'll be fine. There's no reason you shouldn't be able to use DOM parsing and just get the text of the body, though, except it would probably be less efficient.
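To make that risk concrete, here is a small JavaScript sketch of the pattern. The function name extractBodyWithRegex and the sample strings are made up for illustration, and [\s\S] is used in place of . because . does not match line breaks in JavaScript by default.

```javascript
// Hypothetical helper: the lazy pattern from the answer above, with [\s\S]
// substituted for . so that line breaks inside the body are matched too.
function extractBodyWithRegex(html) {
  const match = /<body>([\s\S]*?)<\/body>/.exec(html);
  return match === null ? null : match[1];
}

// The simple case from the question works as expected.
const simple = extractBodyWithRegex('<body><div><img src="" />text etc</div></body>');

// A </body> inside a comment ends the lazy match too early.
const truncated = extractBodyWithRegex('<body><!-- </body> --><p>x</p></body>');

console.log(simple);
console.log(truncated);
```

The second call returns only the text up to the first </body>, which is the failure mode described above.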
You could also skip the regex and just find the string indices of <body> and </body> and get the substring between them. That should be even faster.
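A minimal JavaScript sketch of the substring approach might look like this. The function name extractBodyBySubstring is hypothetical, and searching for the closing tag from the end with lastIndexOf is an assumption that relies on the question's guarantee that the string always ends with </body>.

```javascript
// Hypothetical helper: take everything between the first <body> and the
// last </body>. Returns null when either tag is missing or out of order.
function extractBodyBySubstring(html) {
  const openTag = '<body>';
  const closeTag = '</body>';
  const start = html.indexOf(openTag);
  const end = html.lastIndexOf(closeTag);
  if (start === -1 || end === -1 || end < start + openTag.length) {
    return null;
  }
  return html.slice(start + openTag.length, end);
}

console.log(extractBodyBySubstring('<body><div><img src="" />text etc</div></body>'));
console.log(extractBodyBySubstring('<body><!-- </body> --><p>x</p></body>'));
```

Because the closing tag is located from the end, a stray </body> inside a comment or attribute does not cut the result short.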
By the way, this is not really parsing the HTML; you're just extracting a substring from it.

It's fine to use a RegEx in this case.
Having said that, there are much easier ways to get the innerHTML of the body tag.
alert(document.body.innerHTML);
should give you exactly that with no RegEx...
or if you're using jQuery
$('body').html();

Related

Simpler way to render text as html

I may be overlooking a function. In order to render text such as <div>test</div> as html inside another tag, I would need several lines of code to name the outside tag, then set .innerHtml, then return the outside tag. Is there a shorter way? The conversions with .render are also confusing with this method.
ex.
val content = span(color := "blue").render
content.innerHtml = "<div>test</div>" // html is escaped
outsideTag.innerHtml = content.outerHtml
Assuming you're using Scalatags here, you may be looking for the raw() function...
I don't know scala.js that well, but as far as I understand it, a div tag is added to a span tag.
You should only add inline tags to other inline tags. So it's not a good idea to add a div to a span.
I think you can write:
outsideTag.innerHtml = "<div style='color: blue'>test</div>";

How do you use &nbsp; in GWT UiBinder XML? Can you escape it?

In my mark-up I want to add a space (&nbsp;) between elements without always having to use CSS to do so. If I put &nbsp; in my markup, GWT throws errors. Is there a way around it?
For example:
<g:Label>One&nbsp;</g:Label><g:Label>Two</g:Label>
Should show:
One Two
And not:
OneTwo
As documented here, you just have to add this to the top of your XML file and it will work!
<!DOCTYPE ui:UiBinder SYSTEM "http://dl.google.com/gwt/DTD/xhtml.ent">
Note that the GWT compiler won't actually visit this URL to fetch the file, because a copy of it is baked into the compiler. However, your IDE may fetch it.
Rather than use a Label, which to me shouldn't allow character entities at all, I use an HTML widget. In order to set the content, though, I find I have to do it as the HTML attribute, not the body content (note that the uppercase HTML is important here, since the set method is setHTML, not setHtml):
<g:HTML HTML="One&nbsp;" />

Convert links in blockquotes to plain text

So, I've been asking a lot of Xpath questions recently.
Sorry, but I've only just started using it, and I'm working on a kind of hard project.
You see, at the moment I'm parsing HTML like this (not a copy and paste, just an example):
<span id="no153434"></span>
<blockquote>Text here.<br/>More text.<br/>Some more text.</blockquote>
And I'm using
//span[starts-with(@id, 'no')]/following::*[1][name()='blockquote']//node()
To get the text inside.
It's working fine, although it's very frustrating. I need to manually check for <br/>, then manually combine the strings before and after the br, add a newline, and so on. But it still works. Until there is a link in the text, that is. Then the code is like this:
<span id="no153434"></span>
<blockquote>Text here.<br/>Text.<br/><font class = "unkfunc">linkhere</font></blockquote>
I have absolutely NO idea where to go from here, as the link is included as a completely separate item (twice) in the array. At least with the br I knew where it had to be moved to. Really contemplating giving up on this project after all this effort.
You can use this XPath to obtain the text inside the element: //span[starts-with(@id, 'no')]/following::*[1][name()='blockquote']//text()
You then receive the following result:
Text here.
Text.
linkhere
If you want only text nodes and br:
//span[starts-with(@id, 'no')]
  /following::*[1][name()='blockquote']
  //node()[count(.|..//text()) = count(..//text()) or name()='br']
returns
Text here.
<br />
Text.
<br />
linkhere
The answer is to not use XPath for this kind of work.
Got it working 1,000,000x easier with Objective-C-HTML-Parser.

How to use unescape() function inside JavaScript?

I have a JSP page in which I have JavaScript function that will be called when a link is clicked. Now, when the value reaches the JavaScript function, the apostrophe is encoded.
Example:
Name#039;s
Before # there is &, which originally should be:
Name's
I have used the unescape() decode function, but nothing seems to work. In the end, I had to delete the characters and add the apostrophe. Does anyone know a fix for this? Is it that JSP doesn't support encoding for &? When I was writing the same encode value in this page, it changed the symbol to the apostrophe, which is what I wanted in my code.
Built-in JavaScript functions such as unescape() and decodeURIComponent() have nothing to do with the string you are working on, because what you are looking to decode are HTML entities.
There is no built-in HTML entity decoder in JavaScript, but since you are working in a browser, if the string is considered safe, you may do the following (in jQuery, for example):
var str = $('<p />').html(str).text();
It basically inserts the string as HTML into a <p> element and then extracts the text within.
Edit: I just realized the JSP output you posted does not contain real HTML entities. To process the example given, add & before every #1234; to make it &#1234;, like this:
var str = $('<p />').html(str.replace(/#(\d+);/g, '&#$1;')).text();
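If no DOM or jQuery is available, the same repair-then-decode idea can be sketched in plain JavaScript. This is a different technique from the jQuery one above: it handles only decimal numeric references such as &#039;, not named entities like &amp;, and the function name decodeNumericEntities is hypothetical.

```javascript
// Hypothetical helper: restore the '&' that was dropped before '#NNN;',
// then turn each decimal numeric reference into its character.
function decodeNumericEntities(text) {
  // Add '&' only where '#NNN;' is not already preceded by one.
  const repaired = text.replace(/(?<!&)#(\d+);/g, '&#$1;');
  return repaired.replace(/&#(\d+);/g, (whole, digits) => {
    const codePoint = Number(digits);
    // Leave out-of-range values untouched instead of throwing.
    return codePoint <= 0x10ffff ? String.fromCodePoint(codePoint) : whole;
  });
}

console.log(decodeNumericEntities('Name#039;s'));
console.log(decodeNumericEntities('Name&#039;s'));
```

Note that any literal text of the form #123; is also treated as a damaged entity, so this is only reasonable when the input is known to come from the encoder in question.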

Zend Framework Filter Input StripTags and "<3"

I'm currently using Zend_Filter_StripTags in a commenting system, but stuff kinda breaks when '<3' is entered. StripTags doesn't seem to be smart enough to realize that it's not an HTML tag, and creating the filter as "new Zend_Filter_StripTags(array('3'))" doesn't seem to work either.
Should I pass the input through a regexp first, or is there a way to get Zend_Filter_StripTags to straighten up and fly right?
Ended up writing a Zend_Filter class that was basically a wrapper for HTMLPurifier. Works perfectly, because HTMLPurifier is a LOT smarter than striptags.
I'm not familiar with Zend much, but if you want stuff like <3 to be allowed, just do htmlspecialchars instead of strip_tags on it.
What you want is most likely Zend_Filter_HtmlEntities.
See: Zend_Filter_HtmlEntities
The problem with htmlspecialchars and Zend_Filter_HtmlEntities is that if you're trying to strip out all html tags ( like 'a' and 'img', etc ), then instead of stripping them, you end up with that markup in your output.
Take comments on a blog for example. If you use htmlspecialchars or Zend_Filter_HtmlEntities, in a comment where someone tries to use html to enter a link you end up with that markup showing up when you display the comment. But if you use strip_tags or Zend_Filter_StripTags you end up mangling the comment, as neither is smart enough to realize that '<3' isn't a tag, and just strips everything from '<3' until the end of the comment ( or until it finds '>' ).
It would be nice if Zend had something like HTMLPurifier, which actually checks and validates the input before stripping tags. This means that stuff like '<3' gets left alone, whereas a link such as '<a href="...">Awesome Site</a>' becomes 'Awesome Site'.
This is a problem I'm trying to work around, and at the moment it seems like I'm going to end up writing my own Zend_Filter class that's basically a wrapper for HTMLPurifier.
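The trade-off described above can be illustrated with a rough JavaScript sketch. Neither function is Zend's or PHP's actual implementation: naiveStripTags and escapeHtml are hypothetical stand-ins that only mimic the behaviour described in this thread (stripping from '<' to the next '>' or the end of input, versus escaping everything).

```javascript
// Hypothetical stand-in for a naive tag stripper: removes everything from
// '<' up to the next '>' or, if there is none, to the end of the input.
function naiveStripTags(text) {
  return text.replace(/<[^>]*(>|$)/g, '');
}

// Hypothetical stand-in for entity escaping: nothing is removed, but any
// markup shows up literally when the comment is displayed.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

const comment = 'I <3 this site';
console.log(naiveStripTags(comment)); // the rest of the comment is lost
console.log(escapeHtml(comment));     // the heart survives as &lt;3
```

A validating filter such as HTMLPurifier avoids both problems by deciding whether '<3' is really a tag before touching it.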