We are in the process of converting our Windows-1252 based webshop to Unicode. Unfortunately we currently have to use a middleware between the shop and the ERP which cannot handle UTF-8 (it will corrupt the characters).
We could use UTF-7 for passing the content through the middleware but I'd like to avoid having to convert all data before it enters and exits the middleware.
This is why I thought of using UTF-7 throughout. Is there a technical reason not to use UTF-7 on a website?
HTML5 forbids the support of UTF-7 by browsers:
Furthermore, authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU
encodings, which also fall into this category; these encodings were
never intended for use for Web content.
...
User agents must support the encodings defined in the WHATWG Encoding
standard. User agents should not support other encodings.
User agents must not support the CESU-8, UTF-7, BOCU-1 and SCSU
encodings. [CESU8] [UTF7] [BOCU1] [SCSU]
An extract from the list of character encodings supported by Firefox:
UTF-7 (Unicode): obsolete since Gecko 5.0; support removed for HTML5 compatibility.
Don't use UTF-7.
BTW, having middleware which supports UTF-7 but not UTF-8 looks strange. Maybe this middleware can handle the files as binary? In any case, your middleware might be a little too old to still be in use.
Current versions of Chrome, Firefox, and IE do not support UTF-7 at all (they render a UTF-7 encoded HTML document by displaying its source code as such, since they do not recognize any tags). This is a sufficient reason for not even considering the use of UTF-7 on the web.
Related
The Question: "Is supporting only the Unicode BMP sufficient to enable native Chinese / Japanese / Korean speakers to use an application in their native language?"
I'm most concerned with Japanese speakers right now, but I'm also interested in the answer for Chinese. If an application only supported characters on the BMP - would that make the application unusable for Chinese/Japanese speakers (i.e. the app did not allow data entry / display of supplemental characters)?
I'm not asking if the BMP is the only thing you would ever need for any kind of application (clearly not - especially for all the languages in the entire world). I'm asking, for CJK speakers, in a professional context, for a modern kind of ordinary app that deals with general free text entry (including names, places, etc.) - is the BMP generally enough?
Even if only supporting the BMP is not correct - would it be pretty close / "good enough"? Would the lack of supplemental characters in an application only be an occasional minor inconvenience; or would a Japanese speaker, for example, consider the application completely broken? Especially considering that they would always be able to work around the problem by spelling out problematic words with Hiragana/Katakana?
What about Chinese speakers who don't have a fallback option, would the lack of supplemental characters be considered a show-stopping problem?
I'm considering a general professional context here - not social or gaming stuff. As an example, there are a lot of emoticons on the supplemental planes - but I personally would not consider an English app that did not support Unicode emoticon characters to be "broken", at least for most professional use.
The application I'm dealing with right now is written in Java, but I think this question applies more generally. Knowing the answer will also help me (regardless of language) get a better handle on how much effort I'd have to go through with regard to font support.
EDIT
Clarification: by "supports only the BMP" - I intend that the application would handle supplemental characters gracefully.
Unsupported characters (including the BMP surrogate code blocks) would be dealt with similarly to how most applications deal with ASCII control codes and other undesirable characters - filtered/disallowed for data entry and "dealt with" for display if that were necessary (filtered out or replaced with the Unicode replacement character). A minimal sketch of what I mean is below.
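To illustrate, a rough Java sketch of the kind of filtering I mean (the class and method names are made up; anything outside the BMP, plus stray surrogates, gets replaced with U+FFFD):

    public final class BmpFilter {

        /** Replaces surrogate code points and anything above U+FFFF with U+FFFD. */
        static String restrictToBmp(String input) {
            return input.codePoints()
                    .map(cp -> (cp > 0xFFFF || (cp >= 0xD800 && cp <= 0xDFFF)) ? 0xFFFD : cp)
                    .collect(StringBuilder::new,
                             StringBuilder::appendCodePoint,
                             StringBuilder::append)
                    .toString();
        }
    }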
For people who might be looking for an actual answer to the actual question: the application that prompted this question is now in production allowing only characters from the BMP (actually a limited subset).
Multiple international customers using Korean language in production - Japanese going live soon. Chinese is in planning (I have my doubts that the BMP will be sufficient for that, but we'll see I guess).
It's fine - no reported issues related to unsupported characters.
But that's just anecdotal evidence, really. Just because my customers were fine with it - that doesn't mean yours will be. For context, customers of the app are international companies, hundreds of employees using the application to process hundreds of thousands of their customers.
Unfortunately CJK support in Unicode is broken. The BMP is not enough to properly support CJK, but worse than that even if you do implement full support for all Unicode pages it is still broken.
The basic problem is that they tried to merge characters from all three languages that look kinda similar but are not really the same. The result is that they only look right if you select the correct font to display them. For example, a particular character will only look right to a Chinese person if you render it with a Chinese font, and only look right to a Japanese person if you render it with a Japanese font.
There is no universal font. There is no way to determine which language a character is supposed to be from, so you have to somehow guess which font to use. You can try to examine the system language or some other hack like that. You can't support two languages in the same document unless you have additional metadata. If you get raw Unicode strings without any indication of what language they are in, you are screwed.
It's a total disaster. You need to talk to your clients to figure out their needs and how they indicate to their systems what font to use for broken Unicode characters.
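To make the metadata point concrete, here's a rough Java sketch of what choosing a font per language tag might look like. The font family names and the mapping itself are assumptions - use whatever CJK fonts are actually installed on the target system:

    import java.awt.Font;
    import java.util.Map;

    public final class CjkFontPicker {

        // Hypothetical mapping from BCP-47 language tags to installed font families.
        private static final Map<String, String> FAMILY_BY_LANG = Map.of(
                "ja",      "Noto Sans CJK JP",
                "ko",      "Noto Sans CJK KR",
                "zh-Hans", "Noto Sans CJK SC",
                "zh-Hant", "Noto Sans CJK TC");

        /** Picks a font for the given language tag, falling back to the default sans-serif. */
        static Font fontFor(String languageTag, int size) {
            String family = FAMILY_BY_LANG.getOrDefault(languageTag, Font.SANS_SERIF);
            return new Font(family, Font.PLAIN, size);
        }
    }

The point isn't the particular API; it's that without a language tag attached to the string, there is nothing to key such a mapping on.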
Edit: I also need to mention that some characters required for people's names are missing from Unicode. Later revisions are better, but of course you also need updated fonts to take advantage of them.
The majority of CJK code points are defined in the BMP; however, the rarer CJK Ideographs in the extension blocks (Extension B and later) are not - they live on the supplementary planes. So if you do not need to support those ideographs, then the BMP is fine; otherwise it is not.
However, I would consider any implementation that does not recognize and process UTF-16 surrogates to be broken, even if it does not handle the Unicode code points they represent.
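In Java terms, recognizing and processing surrogates simply means iterating by code point instead of by char. A small sketch (the supplementary ideograph in the sample string is arbitrary):

    public final class CodePointWalk {
        public static void main(String[] args) {
            // U+2000B (a CJK Extension B ideograph) as a surrogate pair, followed by U+8A9E.
            String s = "\uD840\uDC0B\u8A9E";

            System.out.println(s.length());                        // 3 UTF-16 code units
            System.out.println(s.codePointCount(0, s.length()));   // 2 code points

            for (int i = 0; i < s.length(); ) {
                int cp = s.codePointAt(i);
                System.out.printf("U+%04X%n", cp);
                i += Character.charCount(cp);                      // skip both surrogate halves
            }
        }
    }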
Unless you are a font developer or are developing an operating system, you should not care about that; let the OS layer deal with it.
Just implement proper Unicode support in your application and allow the operating system to deal with how the characters are typed and displayed.
If you are using custom fonts in your application, you may be in trouble.
In the end, to answer your question: no, Unicode support is not just the BMP, and you need to support full Unicode.
I've been tasked to look into moving an email function we have in place, which sends emails using uuencoding, to something else more widely accepted. I gather there have been issues where recipients are not receiving attached (.csv) files because they are uuencoded.
I'm guessing that we'd want to switch this to MIME encoding?
I wanted to get some suggestions, and perhaps some good starting places to look for something like this.
Yes, you'll want to switch to MIME. You should know, though, that MIME is not an encoding in the same way that UUEncode is an encoding. MIME is essentially an extension to the rfc822 message format.
You don't specify which language you plan to use, but I'd recommend looking at one of the two MIME libraries I've written, as they are among the (if not the) fastest and most RFC-compliant libraries out there.
If you plan to use C or C++, take a look at GMime.
If you plan to use C#, take a look at MimeKit.
The only other decent MIME libraries I can recommend are libetpan (a very low-level C API) and vmime (an all-in-one C++ library which does MIME, IMAP, SMTP, POP3, etc).
The only "advantage" that libetpan has over GMime is that it implements its own data structures library that it uses internally instead of doing what I did with GMime, which is to re-use a widely available library called GLib. GLib is available on every platform, though, so it seemed pointless for me to reinvent the wheel - plus GLib offers a ref-counted object system which I made heavy use of. For some reason I can't figure out, people get hung up on depending on GLib, complaining "omg, a dependency!" as if they weren't already adding a dependency on a MIME library...
Oh... I guess if you are using Java, you should probably look at using JavaMail.
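If JavaMail is the route you take, a rough sketch of sending a CSV as a MIME attachment might look like this (host, addresses, and file name are placeholders; JavaMail picks a suitable Content-Transfer-Encoding for each part):

    import java.io.File;
    import java.util.Properties;
    import javax.mail.Message;
    import javax.mail.Session;
    import javax.mail.Transport;
    import javax.mail.internet.InternetAddress;
    import javax.mail.internet.MimeBodyPart;
    import javax.mail.internet.MimeMessage;
    import javax.mail.internet.MimeMultipart;

    public final class CsvMailer {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("mail.smtp.host", "smtp.example.com");   // placeholder SMTP host
            Session session = Session.getInstance(props);

            MimeMessage message = new MimeMessage(session);
            message.setFrom(new InternetAddress("reports@example.com"));
            message.setRecipients(Message.RecipientType.TO, "someone@example.com");
            message.setSubject("Monthly report");

            MimeBodyPart text = new MimeBodyPart();
            text.setText("The report is attached.", "UTF-8");

            MimeBodyPart attachment = new MimeBodyPart();
            attachment.attachFile(new File("report.csv"));      // becomes a proper MIME part

            MimeMultipart multipart = new MimeMultipart();
            multipart.addBodyPart(text);
            multipart.addBodyPart(attachment);
            message.setContent(multipart);

            Transport.send(message);
        }
    }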
Beyond that, there are no other acceptable MIME libraries that I have ever seen. Seriously, 99% of them suffer from the same design and implementation flaws that I ranted about on a recent blog post. While the blog post is specifically about C# MIME parsers, the same holds true for all of the JavaScript, C, C++, Go, Python, Eiffel, etc implementations I've seen (and I've seen a lot of them).
For example, I was asked to look at a popular JavaScript MIME parser recently. The very first thing it did was to use strsplit() on the entire MIME message input string to split it by "\r\n". It then iterated through each of the lines strsplit()'ing again by ':', then it strsplit() address headers by ',', and so on... it literally screamed amateur hour. It was so bad that I could have cried (but I didn't, because I'm manly like that).
I am currently working on an application that supports multiple languages: English, Spanish, Russian, Polish, etc.
I have set up my SQL server database to have Unicode field types (nvarchar etc).
I am concerned now with setting the correct encoding on the HTML, text, XML files etc. I am aware that it needs to be a UTF encoding, but I'm not sure whether it should be UTF-8, UTF-16 or UTF-32. Could someone explain the difference and which encoding is the best to go with?
If this is about something that is supposed to use web browsers, as it seems, then UTF-8 is the only reasonable choice, since it’s the only encoding that is widely supported in browsers. Regarding the ways to set the encoding, check out the W3C page Character encodings.
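To make the difference concrete: all three are just different byte layouts for the same code points. A quick Java comparison of the sizes (UTF-32 isn't in StandardCharsets, but most JREs ship it as an extended charset):

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public final class EncodingSizes {
        public static void main(String[] args) {
            String s = "Zażółć 日本語";   // arbitrary mixed Latin / Polish / Japanese sample

            System.out.println("UTF-8:  " + s.getBytes(StandardCharsets.UTF_8).length + " bytes");
            System.out.println("UTF-16: " + s.getBytes(StandardCharsets.UTF_16).length + " bytes");
            System.out.println("UTF-32: " + s.getBytes(Charset.forName("UTF-32")).length + " bytes");
        }
    }

ASCII-heavy content comes out smallest in UTF-8, which is one more reason it is the sensible default for the web.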
I have decided to develop a (Unicode) spell checker for my final year project for a south Asian language. I want to develop it as a plugin or a web service. But I need to decide a suitable development platform for it. (This will not just check for a dictionary file, morphological analysis / generation modules (a stemmer) will also be used).
Would JavaScript be able to handle such processing with a fair response time?
Will I be able to process a large dictionary on client side?
Are there any better suggestions that you can make?
Javascript is not up to the task, at least not by itself; its Unicode support is too primitive, and in many parts, actually missing. For example, Javascript has no support for Unicode grapheme clusters.
If you use Java, then make sure you use the ICU libraries so that you can get all the whizbang Unicode properties you’ll need for text segmentation. The place where Java’s native Unicode processing breaks down is in its regex library, which is why Android JNIs over to the ICU C/C++ regex library. There are a lot of NLP tools written for Java, some of which you might find handy. Most of these that I am aware of though are for English or at least Western languages.
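For example, grapheme-cluster segmentation with ICU4J (the com.ibm.icu:icu4j artifact) looks roughly like this; the sample text is an arbitrary Devanagari word, but any script works:

    import com.ibm.icu.text.BreakIterator;
    import com.ibm.icu.util.ULocale;

    public final class GraphemeSegments {
        public static void main(String[] args) {
            String text = "क्षत्रिय";   // arbitrary sample with combining marks and conjuncts

            // A character (grapheme cluster) break iterator for a given locale.
            BreakIterator clusters = BreakIterator.getCharacterInstance(new ULocale("hi"));
            clusters.setText(text);

            int start = clusters.first();
            for (int end = clusters.next(); end != BreakIterator.DONE; start = end, end = clusters.next()) {
                System.out.println(text.substring(start, end));
            }
        }
    }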
If you are willing to run part of your computation server-side via CGI instead of just client-side action, you are no longer bound by language choice. For example, you might combine Javascript on the client with Perl on the server, whose Unicode support is even better than Java’s. How that would meld together and how to get the performance and behavior you would want depends on just what you actually want to do.
Perl also has quite a good number of industry-standard NLP modules widely available for it, most of which already know to use Unicode, since like Java, Perl uses Unicode internally.
A brief slide presentation on using NLP tools in Perl for certain sorts of morphological analysis, namely stemming and lemmatization, is available here. The presentation is known to work under Safari, Firefox, or Chrome, but not so well under Opera or Microsoft’s Internet Explorer.
I am not aware of any tools specifically targeting Asian languages, although Perl does support UAX#11 (East Asian Width) and UAX#14 (Unicode Line Breaking) via the Unicode::LineBreak module from CPAN, and Perl does come with a fully-compliant collation module (implementing UTS#10, the Unicode Collation Algorithm) by way of the standard Unicode::Collate module, with locale support available from the also-standard Unicode::Collate::Locale module, where many Asian locales are supported. If you are using CJK languages, you may want access to the Unihan database, available via the Unicode::Unihan module from CPAN. Even more fundamentally, Perl has native support for Unicode extended grapheme clusters by way of its \X metacharacter in its built-in regex engine, which neither Java nor Javascript provides.
All this is the sort of thing you are likely to need, and find terribly lacking, in Javascript.
I am building a mobile app (a hybrid mobile web app, but with a native shell) with most users on the iPhone (some on the BlackBerry) and am wondering whether it should be written in HTML5 or XHTML?
Any insight would be great.
tl;dr: Use HTML5, because text/html XHTML is parsed as HTML5, and proper XHTML can fail spectacularly.
Current browsers don't actually support HTML4 or XHTML/1.x any more. They treat all documents as HTML5/XHTML5 (e.g. <video> will work even if you set an HTML4 or XHTML/1.x DOCTYPE).
Your choice isn't really between HTML5 and XHTML, but between the text/html and XML parsing modes, and between the quirks and standards rendering modes. Browser engines are not aligned with versions of W3C specs.
The real choices are:
quirks vs standards mode. "Quirks" is an emulation of IE5 bugs and its box model. Quirks mode bites if you fail to include a DOCTYPE or use one of the obsolete DOCTYPEs (like HTML4 Transitional).
The obvious choice is to enable standards mode by putting (any) modern DOCTYPE in every document.
text/html vs application/xhtml+xml. The XML mode enables XHTML features that weren't in HTML (such as namespaces and self-closing syntax on all elements) and, most importantly, enables draconian error handling.
NB: it's not possible to enable XML mode from within a document. The only way to enable it is via a real Content-Type HTTP header (for this purpose <meta> and DOCTYPE are ignored!) - see the sketch below.
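A minimal sketch of that distinction, assuming a Java servlet backend (javax.servlet API; the markup is a placeholder): only the Content-Type header decides which parser the browser uses.

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class PageServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            // Forgiving HTML parser; switch to "application/xhtml+xml" and the browser
            // uses the draconian XML parser instead, regardless of DOCTYPE or <meta>.
            resp.setContentType("text/html; charset=UTF-8");
            resp.getWriter().write("<!DOCTYPE html><html><head><title>Demo</title></head><body></body></html>");
        }
    }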
The XML mode was supposed to be the best for mobiles in times of WAP and XHTML Basic, but in practice it turned out to be a fantasy!
If you use application/xhtml+xml mode your page will be completely inaccessible to many users of GSM connections!
There's some proxy software used by major mobile operators, at least in the UK and Poland where I've tested it, which injects invalid HTML into everything that looks HTML-like, including properly served XHTML documents.
This means that your well-formed, perfect XHTML will be destroyed in transit and the user will see only an XML parse error on their side. The user won't be able to notify you about the problem, and since the markup is mangled outside your server, it isn't something you could fix.
That's how all XML-mode (correctly served XHTML) pages look on O2 UK:
(the page renders fine when loaded via Wi-Fi or VPN that prevents mobile operator from screwing up the markup)
HTML5 and XHTML are not exclusive choices. You can use both at once (XHTML 5) or you can use neither (HTML 4).
I wouldn't author documents to [X]HTML5 yet as the standard is not yet finished, never mind any implementations. The “HTML5” features we have available in some browsers are generally scripting extensions that don't affect HTML at a markup level at all.
My understanding is that neither the iPhone nor the BlackBerry fully supports HTML 5 yet. So unless you need some specific HTML 5 features, I would stick with XHTML.
Pick any of them. XHTML is just an XML-language serialisation of HTML, so in reality, it's just DOM nodes encoded in a different way. (Maybe I could create a JSON-serialised version of HTML?) Really, the choice of SGML or XML serialisation depends on whether or not the device supports it. Apple uses WebKit, which fully supports XHTML.
Remember to send your XHTML as application/xhtml+xml or it won't be treated as XHTML!
Oh... and one other thing. All browsers that I know of support XHTML except IE.