Multibyte encoding in Java - encoding

I have no idea how to add multibyte encoding support, and I have very little knowledge of multibyte languages.
I am working on a search engine, and my application scans code in all programming languages.
Some source code might contain CJK text in its comments.
For simplicity's sake, I use Java source code as the sample, and my application is also written in Java.
First, I want to write test cases that check whether the to-be-indexed source code contains CJK text and whether it is encoded correctly by my application.
I want my tests to fail if support is not included, so that it can be added in the future.
But I have no idea how to test this:
how to enter CJK characters in the unit-test input samples, and what the output would look like in the Java application console.
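To make it concrete, here is roughly the kind of test I would like to end up with; I am not even sure this is the right way to get CJK text into a test (the Indexer class and its index/containsToken methods below just stand in for my real API):

    import static org.junit.Assert.assertTrue;
    import org.junit.Test;

    public class CjkIndexingTest {

        @Test
        public void indexesCjkCommentFromJavaSource() {
            // "\u4F60\u597D\u4E16\u754C" is the same as writing 你好世界 directly;
            // the escape form keeps this test file pure ASCII regardless of how it is saved.
            String cjkComment = "\u4F60\u597D\u4E16\u754C";
            String javaSource = "// " + cjkComment + "\npublic class Sample {}\n";

            Indexer indexer = new Indexer();          // placeholder for my indexer
            indexer.index("Sample.java", javaSource); // placeholder indexing call

            // Should fail until CJK support is actually implemented.
            assertTrue(indexer.containsToken(cjkComment));
        }
    }

As for the console, I assume System.out.println(cjkComment) will only display readable CJK if the console's charset can represent it, but I am not sure.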

The presence of a Byte Order Mark might be of use, but a BOM is optional. There are other methods for determining the encoding when UTF is used. This may be of use: Java: How to determine the correct charset encoding of a stream.
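If you do want to check for a BOM yourself, here is a minimal sketch, assuming the caller wraps its stream in a PushbackInputStream with a pushback buffer of at least 4 bytes (the UTF-32 charsets are not in StandardCharsets but ship with the usual Oracle/OpenJDK runtimes):

    import java.io.IOException;
    import java.io.PushbackInputStream;
    import java.nio.charset.Charset;

    public final class BomSniffer {

        /** Returns the charset indicated by a BOM, or null if no BOM is recognized. */
        public static Charset sniff(PushbackInputStream in) throws IOException {
            byte[] b = new byte[4];
            int read = in.read(b, 0, 4); // for brevity, assume one read returns the first bytes
            if (read <= 0) return null;

            Charset detected = null;
            int bomLength = 0;
            if (read >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
                detected = Charset.forName("UTF-8");    bomLength = 3;
            } else if (read >= 4 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE && b[2] == 0 && b[3] == 0) {
                detected = Charset.forName("UTF-32LE"); bomLength = 4; // must be tested before UTF-16LE
            } else if (read >= 4 && b[0] == 0 && b[1] == 0 && (b[2] & 0xFF) == 0xFE && (b[3] & 0xFF) == 0xFF) {
                detected = Charset.forName("UTF-32BE"); bomLength = 4;
            } else if (read >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
                detected = Charset.forName("UTF-16BE"); bomLength = 2;
            } else if (read >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
                detected = Charset.forName("UTF-16LE"); bomLength = 2;
            }
            // Push back everything that was not part of the BOM so the caller can read the content.
            in.unread(b, bomLength, read - bomLength);
            return detected;
        }
    }

Remember that a missing BOM tells you nothing: most UTF-8 files have none.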

Related

Is it possible to represent characters beyond ASCII in DataMatrix 2D barcode? (unicode?)

The DataMatrix article on Wikipedia mentions that it supports only ASCII by default. It also mentions a special mode for Base256 encoding, which should be able to represent arbitrary byte values.
However, all the barcode generator libraries I have tried so far (OnBarcode and BarcodeLib) only accept data entered as a string and show errors for characters beyond ASCII. There is also no way to enter a byte[], which would be required for Base256 mode.
Is there a barcode generator library that supports Base256 mode? (preferably commercial library with support)
Converting the Unicode string to Base64 and decoding it from Base64 after the data is scanned would be one approach, but is there anything else?
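For reference, the Base64 workaround I have in mind would look roughly like this (writeBarcode stands in for whatever barcode API ends up being used):

    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class Base64BarcodePayload {
        public static void main(String[] args) {
            String original = "Grüße, 世界"; // arbitrary Unicode payload

            // Before generating the barcode: UTF-8 bytes -> ASCII-only Base64 string.
            String encoded = Base64.getEncoder()
                    .encodeToString(original.getBytes(StandardCharsets.UTF_8));
            // writeBarcode(encoded); // placeholder call into the barcode library

            // After scanning: Base64 string -> UTF-8 bytes -> original text.
            String decoded = new String(Base64.getDecoder().decode(encoded),
                    StandardCharsets.UTF_8);
            System.out.println(decoded.equals(original)); // true
        }
    }

The obvious downside is the roughly 33% size overhead of Base64 inside the barcode.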
It is possible, although it has some pitfalls:
1) It depends on which language you're writing your app in (there are different bindings for different DataMatrix libraries across programming languages).
For example, there is a fairly common library in the *nix world called [libdmtx][1] (almost all barcode scanners/generators on Maemo/MeeGo/Tizen, some Windows Phone apps, KDE tools, and so on use it). As far as I have tested, it encodes and decodes messages containing Unicode just fine, but it doesn't properly mark the encoded message ("Hey, other readers, this is Unicode!"), so other libraries such as [ZXing][2], as well as many proprietary scanners, decode those Unicode messages as ASCII.
As far as I discussed with the [ZXing][2] author, the proper mark would probably be an ECI segment (a byte with decimal value 241 as the first codeword, followed by a byte with decimal value 26, for UTF-8). However, that is a theoretical solution, based on the one used for QR codes, and it is not standardized in any way for DataMatrix (and neither [libdmtx][1] nor [ZXing][2] yet supports encoding with such markings, although there are some steps in that direction).
So, TL;DR: If you plan to use the generated codes (with Unicode messages) only between apps that you're writing, you can freely use [libdmtx][1] for both encoding and decoding on both sides and it will work fine :) If not, try to look for [ZXing][2] ports in your language (and make sure the port supports encoding).
1: github.com/dmtx/libdmtx
2: github.com/zxing/zxing

Antlr generated lexer hangs on unicode character of "supplementary plane" (antlr 3.4)

I'm parsing PHP code using an ANTLR grammar and the ANTLR Ruby target. One of the source files I have to parse actually contains translations, some of them making heavy use of Unicode characters. The grammar seems to hang on one character from the "supplementary plane", namely U+10430.
I had a similar problem in the past due to the fact that the Ruby ANTLR target is quite old and was not Unicode compliant (well, Ruby was not, at the time). We had to bump getMaxCharValue in RubyTarget.java from 0xFF (ASCII) to 0xFFFF (Unicode) to solve it. Now it seems that even this set is insufficient. Unicode states that characters outside this range may be represented using two UTF-16 code units, but how does ANTLR manage this? Would bumping getMaxCharValue again help (it did once, but I'm no fan of the "try" approach)?
Thanks !
The reference Java target for ANTLR can only parse characters in a supplementary plane by using a UTF-16 surrogate pair in the grammar and a UTF-16 encoding for your input stream. Other targets are created by members of the community and may or may not (as you saw with the Ruby target) support the same range of characters.
Since there is no way to represent anything past 0xFFFE in the grammar itself, you'll be limited to the UTF-16 encoding even if you modify a target to support characters above 0xFF.
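For reference, this is what the supplementary-plane character U+10430 from your input looks like when split into a UTF-16 surrogate pair; a small Java sketch (this only illustrates the encoding, ANTLR itself is not involved):

    public class SurrogatePairDemo {
        public static void main(String[] args) {
            int codePoint = 0x10430; // a supplementary-plane code point (Deseret block)

            // Outside the BMP, a code point becomes two UTF-16 code units.
            char[] units = Character.toChars(codePoint);
            System.out.printf("high surrogate: U+%04X%n", (int) units[0]); // U+D801
            System.out.printf("low  surrogate: U+%04X%n", (int) units[1]); // U+DC30

            // And back again: the pair decodes to the original code point.
            int roundTripped = Character.toCodePoint(units[0], units[1]);
            System.out.println(roundTripped == codePoint); // true
        }
    }

So in the grammar and in the input stream, U+10430 has to appear as the pair \uD801\uDC30.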

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can work effectively on internationalizing a project at work. I have just started, and I would very much like to know from your expertise whether I've understood these concepts correctly. So far, here is the dumbed-down version (for my own understanding) of what I've gathered from the web:
Character Encodings -> Sets of rules that tell the OS how to store characters, e.g. ISO-8859-1, Windows-1252, UTF-8, UCS-2, UTF-16. These rules are also called code pages/character sets, which map individual characters to numbers. Apparently Unicode handles this a bit differently than the others: instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs. [http://www.joelonsoftware.com/articles/Unicode.html]
Fonts -> These are implementations of character encodings. They are files of different formats (TrueType, OpenType, PostScript) that contain a mapping from each character in an encoding to a number.
Glyphs -> These are the visual representations of the characters, stored in the font files.
And based on the above understanding, here are my questions:
1) For the OS to understand an encoding, should it be installed separately? Or would installing a font that supports the encoding suffice? Is it okay to use the analogy of a network protocol, say TCP, for an encoding, since it is just a set of rules? (Which of course begs the question: how does the OS understand these network protocols when I do not install them? :-p)
2) Will a font always have the complete implementation of a code page, or just part of it? Is there a tool that I can use to see each character in a font (.TTF file)? [The Windows font viewer shows what a style of the font looks like but doesn't give information about the list of characters in the font file.]
3) Does a font file support multiple encodings? Is there a way to know which encoding(s) a font supports?
I apologize for asking so many questions, but I have had these on my mind for some time and I couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with a book that will save you a great deal of time (you won't become an i18n expert overnight, you know):
Developing International Software - it might be 8 years old, but it is still worth every cent you're going to spend on it. The programming examples may be Windows-oriented (C++ and .NET), but the i18n and L10n knowledge is really there. A colleague of mine once said that it saved him about 2 years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting it all out - Michael Kaplan's blog, often on i18n support on Windows platform
Global by design - John Yunker is actively posting bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid I am not aware of many up-to-date resources on that topic (that is, publicly available ones). The only current resource I know of is the Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably also need something related to i18n in JavaScript. Unfortunately, the support is rather poor, but there are a few libraries which help deal with the problem. The most notable examples would be the Dojo Toolkit and Globalize.
The former is a bit heavy, although it supports many aspects of i18n; the latter is lightweight, but unfortunately a lot is missing. If you choose to use Globalize, you might be interested in Jukka's latest book:
Going Global with JavaScript & Globalize.js - I have read this, and as far as I can tell it is great. It doesn't cover the topics you were originally asking about, but it is still worth reading, even just for the hands-on examples of how to use Globalize.
Apparently Unicode handles this a bit differently than the others: instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs.
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
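The same example in Java, which uses UTF-16 code units as its char type, so all three non-trivial levels are easy to show (a quick sketch):

    import java.nio.charset.StandardCharsets;

    public class EncodingLevelsDemo {
        public static void main(String[] args) {
            String clef = new String(Character.toChars(0x1D11E)); // U+1D11E MUSICAL SYMBOL G CLEF

            // CCS level: the abstract character is identified by its code point.
            System.out.printf("code point: U+%X%n", clef.codePointAt(0)); // U+1D11E

            // CEF level: in UTF-16 this code point maps to two 16-bit code units.
            for (int i = 0; i < clef.length(); i++) {
                System.out.printf("code unit %d: %04X%n", i, (int) clef.charAt(i)); // D834, DD1E
            }

            // CES level: UTF-16LE serializes each code unit least-significant byte first.
            for (byte b : clef.getBytes(StandardCharsets.UTF_16LE)) {
                System.out.printf("%02X ", b & 0xFF); // 34 D8 1E DD
            }
            System.out.println();
        }
    }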
1) For the OS to understand an encoding, should it be installed separately?
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
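In Java, for example, this conversion happens entirely at the application level through the Charset API, with no OS involvement; a minimal sketch:

    import java.nio.charset.StandardCharsets;

    public class CharsetConversionDemo {
        public static void main(String[] args) {
            String text = "Grüße";

            // Encode the same characters in two different encodings.
            byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // one byte per character here
            byte[] utf8   = text.getBytes(StandardCharsets.UTF_8);      // 'ü' and 'ß' take two bytes each

            System.out.println(latin1.length + " vs " + utf8.length);   // 5 vs 7

            // Decode back; the application, not the OS, chooses the charset.
            System.out.println(new String(latin1, StandardCharsets.ISO_8859_1).equals(text)); // true
            System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text));        // true
        }
    }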
2) Will a font always have the complete implementation of a code page or just part of it?
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
I'll provide you with short answers to your questions.
It's generally not the OS that supports an encoding but the applications. Encodings are used to convert a stream of bytes into a list of characters. For example, in C#, reading UTF-8 data will automatically turn it into UTF-16 if you tell it to treat the bytes as a string.
No matter what encoding you use, C# will simply use UTF-16 internally, and when you want to, for example, print a string from a foreign encoding, it will convert it to UTF-16 first, then look up the corresponding characters in the character tables (fonts) and show the glyphs.
I don't recall ever seeing a complete font. I don't have much experience with working with fonts either, so I cannot give you an answer for this one.
The answer to this one is in #1, but a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding you'll be fine.
Bonus answer: On "how does the OS understand network protocols it doesn't know?": again it's not the OS that handles them but the application. As long as the OS knows where to redirect the traffic (which application) it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to allow the OS to know where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

how important is it that libraries treat utf16xx and utf32xx as equal peers to utf8?

Does any significant interchange take place in formats other than ASCII/UTF-8? Are there any fields where UTF-16xx and UTF-32xx are used heavily? I ask as a writer of multiple libraries that work on Unicode text, and the burden of supporting all five major variants is quite high compared to the perceived utility.
Windows and Java both treat Unicode as UTF-16 internally, and Python uses UTF-16 or UTF-32 depending on the platform. So more than just UTF-8 is important for these. These are just the cases I'm most familiar with; I'm sure there are others.
So, in my opinion, if you have a Unicode library, you should support UTF-16 and UTF-32. (I can't believe UTF-32 is too difficult, since there's no special processing involved besides byte ordering. Although, I'm not a Unicode library author :) )
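For what it's worth, in Java all of these forms are available through the standard Charset API (the UTF-32 charsets are not listed in StandardCharsets but ship with the usual Oracle/OpenJDK runtimes); a small sketch:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class UtfVariantsDemo {
        public static void main(String[] args) {
            String text = "A\u00E9\u4E2D\uD834\uDD1E"; // A, é, 中, and one supplementary character (𝄞)

            System.out.println(text.getBytes(StandardCharsets.UTF_8).length);      // 10 bytes
            System.out.println(text.getBytes(StandardCharsets.UTF_16LE).length);   // 10 bytes (5 code units)
            System.out.println(text.getBytes(Charset.forName("UTF-32LE")).length); // 16 bytes (4 code points)
        }
    }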
One important point is XML: it can come in pretty much any encoding imaginable, but UTF-8 is by far the most common.
However, the XML spec says this:
All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode
So if your application/library handles XML in any way it must support UTF-16 at least in that portion. Note that a conforming parser that converts the data to UTF-8 for processing would be enough here.
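A conforming Java XML parser already does that for you: it detects UTF-16 from the BOM and the XML declaration and hands your code ordinary strings. A minimal sketch:

    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class Utf16XmlDemo {
        public static void main(String[] args) throws Exception {
            String xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?><greeting>\u4F60\u597D</greeting>";
            byte[] utf16 = xml.getBytes(StandardCharsets.UTF_16); // big-endian with BOM

            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(utf16));

            System.out.println(doc.getDocumentElement().getTextContent()); // the CJK text, 你好
        }
    }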
When it comes to interchange, I guess you are right that UTF-8 is prevalent. Some cases of using UTF-16 are various binary protocols such as DCOM, Java RMI and (maybe???) CORBA.
As for UTF-32 I've never heard of a case where it is used for interchange.

What are some common character encodings that a text editor should support?

I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.
What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.
As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
I don't know about encodings, but make sure it can support the multiple different line ending standards! (\n vs \r\n)
If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/
Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx
There is no way you can reliably detect an encoding. The best thing you could do is something like IE does and depend on letter distributions in different languages, as well as standard characters for a language. But that's a long shot at best.
I would advise getting your hands on some large library of character sets (check out projects like iconv) and make all of those available to the user. But don't bother auto-detecting. Simply allow the user to select his preference of a default charset, which itself would be UTF-8 by default.
Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for Western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB18030, and remember there are Japanese, Russian, and Greek users too, who all have their own encodings besides UTF-8-encoded Unicode.
As for detection, most encodings are not safely detectable. In some encodings, certain byte values are simply invalid. In UTF-8, any byte value can occur, but not every sequence of byte values. In practice, however, you would not do the decoding yourself but use an encoding/decoding library, try to decode, and catch errors. So why not support all encodings that this library supports?
You could also develop heuristics, like decoding with a specific encoding and then testing the result for strange characters or character combinations, or for the frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors. The cleverest solution is not always the best, especially if people are used to other programs.
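In Java, the "try to decode and catch errors" approach mentioned above can be sketched with a strict CharsetDecoder (by default decoding silently replaces bad bytes, so the decoder has to be told to report them):

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;

    public class StrictDecodeCheck {

        /** Returns true if the bytes form a valid sequence in the given charset. */
        public static boolean decodesCleanly(byte[] data, Charset charset) {
            try {
                charset.newDecoder()
                       .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of replacing
                       .onUnmappableCharacter(CodingErrorAction.REPORT)
                       .decode(ByteBuffer.wrap(data));
                return true;
            } catch (CharacterCodingException e) {
                return false;
            }
        }

        public static void main(String[] args) {
            byte[] bytes = {(byte) 0xC3, (byte) 0x28}; // 0xC3 needs a continuation byte in UTF-8
            System.out.println(decodesCleanly(bytes, Charset.forName("UTF-8")));      // false
            System.out.println(decodesCleanly(bytes, Charset.forName("ISO-8859-1"))); // true
        }
    }

As the example shows, a clean decode in a permissive single-byte encoding like ISO-8859-1 proves nothing, which is exactly why such heuristics are never safe.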
UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.
1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If no BOM, check if file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If valid UTF-8, use UTF-8.
3) If it's not Unicode, it's probably the current platform's default codepage.
4) Some encodings are easy to detect; for example, Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83, indicating hiragana and katakana.
5) Give the user the option to change the encoding if the program's guess turns out to be wrong.
Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.
Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.