confused about xhtml5: no more `<?xml?>` and now mandatory `meta`? - encoding

I've been a longtime user of XHTML 1.0 Strict, and I'm now trying to switch to XHTML5 in my new projects.
I'm confused that <?xml version='1.0' encoding='utf-8'?> is no longer considered valid, for HTML5, by http://validator.w3.org/. Why is that? Isn't that what all xml documents are supposed to start with?
And when I remove the standard <?xml…, my document still doesn't validate: now it's missing the encoding. I don't like those meta tags, but are they now effectively mandatory, to specify the encoding, in order to be valid (X)HTML5?

An XML declaration is valid and validates in XHTML serialization of HTML5. The following rather minimal document validates:
<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title></title></head>
<body></body>
</html>
However, this only applies to XHTML serialization (XHTML syntax) of HTML5. In HTML serialization, it is not allowed. If you write the above document in a file and store it in a server that will send it with Content-Type: text/html (which normally happens if the filename ends with “.html”), then you get an error message:
Saw <?. Probable cause: Attempt to use an XML processing instruction in HTML.
(XML processing instructions are not supported in HTML.)
Here “HTML” means HTML serialization only.
Browsers do not care about an XML declaration in either syntax. In HTML syntax, it is just ignored, as a recoverable syntax error. In XHTML syntax, it does not matter, except for the encoding part.
Although XML 1.0 specification recommends (but does not require) an XML declaration, it would in practice matter (apart from the significance of encoding) only to software that is capable of processing different versions of XML. Browsers aren’t. And in addition to XML 1.0, there’s just XML 1.1, which is not used much. Besides, HTML5 is defined so that the XML version used in XHTML syntax is XML 1.0.
The encoding part may matter, but utf-8 is the default anyway for XML. If you use another encoding for some reason, then an XML declaration may be useful to prevent any conflicts. HTML5 CR says this in it discussion of encodings: “In XHTML, the XML declaration should be used for inline character encoding information, if necessary.” A meta tag cannot really help in XHTML when served with an XML content type, since the encoding has already been decided (by defaulting to UTF-8 or otherwise) when the tag is seen.
For HTML syntax, the <meta charset=...> tag may be used, but it is not needed for validity, and the encoding can be specified in HTTP headers (which override any meta tags). Using a meta tag may however be helpful, since a page might be saved locally, and then there won’t be any HTTP headers available when it is opened.

Related

HTML4.01 Strict ARIA Doctype for Validation via W3C Validator

Is there any <!doctype> for "HTML 4.01 Strict Markup + ARIA"?
Also am currently having HTML 4.01 Strict <!DOCTYPE>, I obviously get an error while validating my page with the W3C Validator.
Is there any solution to this problem?
Also when I use the Accessibilty Toolbar, I get the following error pointing to the line after the closing HTML tag:
Line 256, Column 1: character data is not allowed here
Content-Disposition: form-data; name="charset"
You have used character data somewhere it is not permitted to appear.
Mistakes that can cause this error include:
putting text directly in the body of the document without wrapping it in a container element (such as a <p>aragraph</p>), or
forgetting to quote an attribute value (where characters such as "%" and "/" are common, but cannot appear without surrounding quotes), or
using XHTML-style self-closing tags (such as <meta ... />) in HTML 4.01 or earlier. To fix, remove the extra slash ('/') character. For more information about the reasons for this, see Empty elements in SGML, HTML, XML, and XHTML.
There is no published DTD for HTML 4.01 + ARIA. It would be possible to write one, if only there were a stable, exact document that specifies the allowed ARIA attributes and the HTML 4.01 elements on which they may be used. Using WAI-ARIA in HTML is still a WD only, and calls itself “a practical guide”. And I’m not sure how it should be interpreted. I guess the simplest, and possibly the only realistic, approach would be to write a DTD that allows all ARIA attributes on all elements, with the same set of values for all elements, even thoughh this would violate many of the recommendations.
The other question seems to be unrelated, and should probably be asked as a separate question. And you should probably explain what accessibility toolbar you are referring to and what your HTML document contains.

How do you use in GWT UiBinder XML? Can you escape it?

In my mark-up I want to add a space ( ) between elements without always having to use CSS to do so. If I put in my markup, GWT throws errors. Is there a way around it?
For example:
<g:Label>One </g:Label><g:Label>Two</g:Label>
Should show:
One Two
And not:
OneTwo
As documented here, you just have to add this to the top of your XML file and it will work!
<!DOCTYPE ui:UiBinder SYSTEM "http://dl.google.com/gwt/DTD/xhtml.ent">
Note that the GWT compiler won't actually visit this URL to fetch the file, because a copy of it is baked into the compiler. However, your IDE may fetch it.
Rather than use a Label, which to me shouldn't allow character entities at all, I use an HTML widget. In order to set the content, though, I find I have to do it as the HTML attribute, not the body content (note that the uppercase HTML is important here, since the set method is setHTML, not setHtml)
<g:HTML HTML="One&nbsp;" />

Safari encodes already encoded URL on request

I do an HTTP GET request for a page using the following URL in Safari:
mysite.com/page.aspx?param=v%e5r
The page contains a form which posts back to itself.
The HTML form tag looks like this when output by asp.net:
<form method="post" action="page.aspx?param=v%u00e5r" id="aspnetForm" >
When Safari POSTs this back it somehow converts this URL to:
page.aspx?param=v%25u00e5r, i.e. it URL encodes the already URL encoded string, which is then double encoded and the output generated by this parameter is garbled (vår). I am able to get around this some places by URL decoding the parameter before printing it.
Firefox and even IE8 handles this fine. Is this a bug in WebKit or am I doing something wrong?
To summarise:
GET mysite.com/page.aspx?param=v%e5r
HTML: <form method="post" action="page.aspx?param=v%u00e5r" id="aspnetForm" >
POST mysite.com/page.aspx?param=v%25u00e5r
HTML: <form method="post" action="page.aspx?param=v%25u00e5r" id="aspnetForm" >
mysite.com/page.aspx?param=v%e5r
Whilst you can use encodings other than UTF-8 in the query part of a URL, it's inadvisable and will generally confuse a variety of scripts that assume UTF-8.
You really want to be producing forms in pages marked as being UTF-8, then accepting UTF-8 in your application and encoding the string vår (assuming that's what you mean) as param=v%C3%A5r.
page.aspx?param=v%u00e5r
Oh dear! That's very much wrong. %uXXXX is a JavaScript-escape()-style sequence only; it is wholly invalid to put in a URL. Safari is presumably trying to fix up the mistake by encoding the % that isn't followed by a two-digit hex sequence with a %25.
Is ASP.NET generating this? If so, that's highly disappointing. How are you creating the <form> tag? If you're encoding the parameter manually, maybe you need to specify an Encoding argument to HttpUtility.UrlEncode? ie. an Encoding.UTF8, or, if you really must have v%e5r, new Encoding(1252) (Windows code page 1252, Western European).

Facelets charset problem

In my earlier post there was a problem with JSF charset handling, but also the other part of the problem was MySQL connection parameters for inserting data into db. The problem was solved.
But, I migrated the same application from JSP to facelets and the same problem happened again. Characters from input fields are replaced when inserting to database (č is replaced with Ä), but data inserted into db from sql scripts with proper charset are displayed correctly. I'm still using registered filter and page templates are used with head meta tag as following:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">
If I insert into h:form tag the following attribute:
acceptcharset="iso-8859-2"
I get correct characters in Firefox, but not in IE7.
Is there anything else I should do to make it work?
Thanks in advance.
Add the following line to the filter:
response.setContentType("text/html;charset=ISO-8859-2");
Don't use acceptcharset attribute. IE has serious bugs with it.
Also, when you're using a <?xml?> declaration in top of Facelets XHTML page, ensure that it's using the desired charset or just remove the whole declaration, it's not strictly required.
<?xml version="1.0" encoding="ISO-8859-2"?>
i think you can see the implementation of org.springframework.web.filter.CharacterEncodingFilter
and you can start your tomcat by adding -Dfile.encoding=ISO-8859-2 as jvm arguments

Remove HTML entities

I am developing an application for iPhone.I need to remove html entities like ["<p>"] in a parsed xml response.Is there any direct way to remove all such entities.??
How is the data formatted? Are the HTML entities esacaped in the original XML, something like this:
<xml><content type="html"><p>A paragraph.</p></content></xml>
In this case you could just strip the tags with a regular expression.
Otherwise, I would suggest following the DTD of the XML file, and stripping all other tags under the assumption that they don't constitute part of the XML markup.