how to parse malformed HTML using HTML::Parser - perl

I am trying to parse an HTML document that contains a meta tag like this:
<meta name="id" content=""12345.this.is.a.sample:id:required.67890"#abc.com">
HTML::Parser returns an empty string ("") for the content attribute instead of the actual value. This is my start event handler:
sub start {
    my ($self, $tagname, $attr, $attrseq, $origtext) = @_;
    if ($tagname eq 'meta') {
        print "Meta found: ", $attr->{name}, $attr->{content}, "\n";
    }
}
Any ideas on how to get the required value?

I think a quote from Charles Babbage is appropriate here:
On two occasions I have been asked, — “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” […] I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
– Passages from the Life of a Philosopher (1864)
This is known as the Garbage-In, Garbage-Out (GIGO) principle. In your case, you have malformed HTML. If you feed that to an HTML parser, you'll necessarily get bogus output. The HTML standard is already quite lax to deal with all kinds of common errors, but your example is much more broken.
There is, of course, one solution: Don't treat your input as HTML, but as some derived format where your example happens to be legal input. You'd have to write a custom parser of your own or adapt an existing HTML parser to your needs, but that would work.
However, I think that fixing the source of the input would be easier than writing your own parser. All that is needed is the quotes inside the attribute to be escaped, or for the attribute to use single quotes:
<meta name="id" content="&quot;12345.this.is.a.sample:id:required.67890&quot;#abc.com">
<meta name="id" content='"12345.this.is.a.sample:id:required.67890"#abc.com'>

Okay, I found a way to get the content for this particular problem. The $origtext variable above receives the value of the text argspec identifier, which the documentation defines as:
Text causes the source text (including markup element delimiters) to be passed.
So, basically,
print $origtext;
would give me the source text as output:
<meta name="id" content=""12345.this.is.a.sample:id:required.67890"#abc.com">
I can then apply a regex to $origtext to extract the value I need.
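As a rough illustration of that last step, here is a minimal sketch of such a regex. It is written in JavaScript only to keep it easy to run; the pattern itself should port to Perl with little change. The function name extractMetaContent is hypothetical, and the pattern assumes that content is the last attribute in the tag, so it captures everything between content=" and the final double quote before the closing >.

```javascript
// Hypothetical helper: pull the raw content value out of the tag's source text.
// Assumption: content is the last attribute, so the greedy (.*) runs up to
// the final double quote that precedes the closing > (or />).
function extractMetaContent(rawTag) {
  const match = /\bcontent="(.*)"\s*\/?>\s*$/.exec(rawTag);
  return match ? match[1] : null;
}

const rawTag = '<meta name="id" content=""12345.this.is.a.sample:id:required.67890"#abc.com">';
console.log(extractMetaContent(rawTag));
```

For the tag in the question this yields the value with its inner quotes intact. If other attributes can follow content, the pattern would need to be adjusted.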

Related

How do you use &nbsp; in GWT UiBinder XML? Can you escape it?

In my markup I want to add a space (&nbsp;) between elements without always having to use CSS to do so. If I put &nbsp; in my markup, GWT throws errors. Is there a way around it?
For example:
<g:Label>One&nbsp;</g:Label><g:Label>Two</g:Label>
Should show:
One Two
And not:
OneTwo
As documented here, you just have to add this to the top of your XML file and it will work!
<!DOCTYPE ui:UiBinder SYSTEM "http://dl.google.com/gwt/DTD/xhtml.ent">
Note that the GWT compiler won't actually visit this URL to fetch the file, because a copy of it is baked into the compiler. However, your IDE may fetch it.
Rather than use a Label, which to me shouldn't allow character entities at all, I use an HTML widget. To set the content, though, I find I have to do it through the HTML attribute, not the body content. Note that the uppercase HTML matters here, since the setter is setHTML, not setHtml:
<g:HTML HTML="One&nbsp;" />

Safe to use Regex for this? (HTML)

I'm parsing some HTML, and I need to get all html in the body tag. My target string will always look something like this:
<body><div><img src="" />text etc</div></body>
However, I just need:
<div><img src="" />text etc</div>
My target string will always begin and end with those body tags. However, there is the repeated warning not to use regex to parse HTML, and I do not have any viable alternative to regex available at the moment.
Question: Are there any safe Regex(s) to use in this case? Or should I just forget it?
You didn't show us what your regex is, but it's not as safe as using DOM parsing if it's as simple as:
<body>(.*?)</body>
...because it's possible that </body> is contained in an attribute string or comment. If you're willing to take that risk, then you'll be fine. There's no reason you shouldn't be able to use DOM parsing and just get the text of the body, though, except it would probably be less efficient.
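To make that risk concrete, here is a small sketch in JavaScript (the function name lazyBodyMatch is made up for illustration). It uses [\s\S] instead of . so that the pattern also spans line breaks, which is an addition to the pattern shown above.

```javascript
// The lazy pattern stops at the first </body> it sees, wherever that is.
function lazyBodyMatch(html) {
  const match = /<body>([\s\S]*?)<\/body>/.exec(html);
  return match ? match[1] : null;
}

// Works for the simple case from the question.
console.log(lazyBodyMatch('<body><div><img src="" />text etc</div></body>'));

// Cut short by a </body> inside a comment: only '<!-- ' is captured.
console.log(lazyBodyMatch('<body><!-- </body> --><div>x</div></body>'));
```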
You could also skip the regex and just find the string indices of <body> and </body> and get the substring between them. That should be even faster.
By the way, this is not parsing the HTML; you're just extracting from the HTML
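The index-based approach could look roughly like the sketch below. The function name extractBodyInner is hypothetical, and the code assumes the opening tag is exactly <body> with no attributes, as in the question. It uses the last </body> in the string, so an earlier </body> inside a comment or attribute does not cut the result short.

```javascript
// Return everything between the first <body> and the last </body>,
// or null if the tags are missing or out of order.
function extractBodyInner(html) {
  const open = '<body>';
  const close = '</body>';
  const start = html.indexOf(open);
  const end = html.lastIndexOf(close);
  if (start === -1 || end === -1 || end < start + open.length) {
    return null;
  }
  return html.slice(start + open.length, end);
}

console.log(extractBodyInner('<body><div><img src="" />text etc</div></body>'));
```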
It's fine to use a RegEx in this case.
Having said that, there are much easier ways to get the innerHTML of the body tag.
alert(document.body.innerHTML);
should give you exactly that with no regex...
or, if you're using jQuery:
$('body').html();

How to use unescape() function inside JavaScript?

I have a JSP page in which I have JavaScript function that will be called when a link is clicked. Now, when the value reaches the JavaScript function, the apostrophe is encoded.
Example:
Name#039;s
Before # there is &, which originally should be:
Name's
I have used the unescape() decode function, but nothing seems to work. In the end, I had to delete the characters and add the apostrophe. Does anyone know a fix for this? Is it that JSP doesn't support encoding for &? When I was writing the same encode value in this page, it changed the symbol to the apostrophe, which is what I wanted in my code.
Built-in JavaScript functions such as unescape() and decodeURIComponent() have nothing to do with the string you are working on, because what you are trying to decode are HTML entities.
There is no built-in HTML entity decoder in JavaScript, but since you are working in a browser, and if the string is considered safe, you can do the following (in jQuery, for example):
var str = $('<p />').html(str).text();
It basically inserts the string as HTML into a <p> element and then extracts the text within.
Edit: I just realized the JSP output you posted is not a real HTML entity. To process the example given, add & before every #1234; so that it becomes &#1234;, then decode as before:
var str = $('<p />').html(str.replace(/#(\d+);/g, '&#$1;')).text();
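Where jQuery or a DOM is not available, decimal numeric references such as &#039; can also be decoded directly. The sketch below is a different technique from the jQuery one above, not a replacement for it: it handles only decimal numeric references, leaves named entities such as &amp; untouched, and the function name decodeNumericEntities is made up for illustration.

```javascript
// Replace each decimal numeric character reference (&#NNN;) with its character.
// References beyond the Unicode range are left unchanged.
function decodeNumericEntities(str) {
  return str.replace(/&#(\d+);/g, (match, digits) => {
    const codePoint = parseInt(digits, 10);
    return codePoint <= 0x10FFFF ? String.fromCodePoint(codePoint) : match;
  });
}

console.log(decodeNumericEntities('Name&#039;s')); // Name's
```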

encode url not encoding

I am working in a template in Movable Type and would like to do the following:
Twitter
It all works, but I'm worried that the current link, or a title inserted with an MT title tag, might not be valid in the browser address bar. I thought you could use encode_url="1", but it doesn't appear to encode my titles or links. For example, I have a title with spaces in it, and the resulting code still has the spaces in it. Also, in the example above, shouldn't the http:// be encoded in a special way? It isn't.
Am I doing something wrong here?
I just checked this code and it is outputting properly for me. I am using MT 4.34. I used the following template code in an index template:
<mt:Var name="url" value="http://google.com/hello I have spaces">
<mt:Entries lastn="1">
Permalink: <mt:EntryPermalink encode_url="1"><br />
Fake URL: <mt:Var name="url" encode_url="1">
</mt:Entries>
And I got the following output:
Permalink: http%3A%2F%2Fwww.capndesign.com%2Farchives%2F2010%2F09%2Fthe_big_picture_scenes_from_china.php
Fake URL: http%3A%2F%2Fgoogle.com%2Fhello%20I%20have%20spaces
So I would confirm that you're using a current version of MT (4.34 or 5.x) that supports this modifier, because spaces and special characters should definitely be replaced with percent-encoded sequences (as the output above shows, this is URL encoding, not HTML entities). I'd also try the code I provided above to see if you get the same output (except your permalink will obviously be different).
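For comparison, JavaScript's built-in encodeURIComponent produces the same percent-encoded string for the fake URL used above. This is only a way to check what correctly encoded output looks like; it is not a claim about how MT implements encode_url internally.

```javascript
// encodeURIComponent percent-encodes everything except letters, digits
// and the characters - _ . ! ~ * ' ( )
const fakeUrl = 'http://google.com/hello I have spaces';
const encoded = encodeURIComponent(fakeUrl);
console.log(encoded); // http%3A%2F%2Fgoogle.com%2Fhello%20I%20have%20spaces
```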

Ignore CDATA while xml parsing

I am new to iPhone development. I want to ignore the CDATA tag while parsing, because the parser treats the HTML tags inside it as text. Since I want to display only the content, I want my parser to ignore the CDATA tag. My source is:
[CDATA[<br /><p class="author"><span class="by">By: </span>By Sydney Ember</p><br><p>In the week since an </p>]].
Is there any way to ignore CDATA tag?
Is there any way to parse my source twice so it displays only the content?
Please give me some sample code. Please help me out. Thanks.
If you treat the CDATA content as XML instead of CDATA then your parser will throw an error (since your HTML is a weird mix of XHTML and HTML and is not well formed).
If you want to get the HTML, then parse the XML, extract the text content of the node, then parse that text as HTML.
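The two-step idea (get the node's text first, then treat that text as HTML) can be sketched as follows. This is JavaScript rather than iPhone code, and it makes several assumptions: the feed uses a well-formed <![CDATA[ ... ]]> section inside a hypothetical <description> element, the first step is done with plain string search to stand in for what an XML parser would hand back as the node's text, and the second step strips tags naively instead of using a real HTML parser. The function names are made up for illustration.

```javascript
// Step 1 (stand-in for the XML parser): return the text inside the first
// CDATA section, or null if there is none.
function cdataPayload(xml) {
  const open = '<![CDATA[';
  const close = ']]>';
  const start = xml.indexOf(open);
  if (start === -1) return null;
  const end = xml.indexOf(close, start + open.length);
  if (end === -1) return null;
  return xml.slice(start + open.length, end);
}

// Step 2 (naive stand-in for an HTML parser): drop tags, collapse whitespace.
function htmlToText(html) {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

const xml = '<description><![CDATA[<br /><p class="author"><span class="by">By: </span>' +
  'By Sydney Ember</p><br><p>In the week since an </p>]]></description>';
console.log(htmlToText(cdataPayload(xml)));
```

Tag stripping like this ignores entities, scripts, and tags containing >, so it is only suitable when the embedded HTML is as simple as the sample.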
There is no way to ignore the CDATA tag - it's part of the xml spec and parsers should honour it.
If you don't like the idea of this answer to your earlier question, you could get the contents of the CDATA section and parse it as XML again. However, this is highly not recommended! You don't know that the contents of the CDATA are going to be valid xml (they're probably not).
If you can 100% guarantee that the CDATA section always has the form you show above, you could probably use some string manipulation to get the data out (e.g. string replace '<span class="by">By: </span>' with ''), but again, this will almost certainly break if the CDATA contents change.
Where is the XML coming from? A better idea is to talk to the owner of the service and get them to send you, instead of a description full of HTML, something like:
<description>
<author>By Sydney Ember</author>
<text>In the week since an </text>
</description>