Is it possible to represent characters beyond ASCII in a DataMatrix 2D barcode? (Unicode?)

The DataMatrix article on Wikipedia mentions that it supports only ASCII by default. It also mentions a special mode for Base256 encoding, which should be able to represent arbitrary byte values.
However, all the barcode generator libraries I have tried so far (OnBarcode and BarcodeLib) only accept the data as a string and report errors for characters beyond ASCII. There is also no way to enter a byte[], which would be required for Base256 mode.
Is there a barcode generator library that supports Base256 mode? (preferably commercial library with support)
Converting the Unicode string to Base64 before encoding, and decoding it from Base64 after the data is scanned, would be one approach, but is there anything else?
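A minimal sketch of that Base64 round trip in Java (the class name is just for illustration; any language with a Base64 codec works the same way):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class BarcodePayload {
    public static void main(String[] args) {
        String original = "Rémy 日本語";

        // Before generating the barcode: UTF-8 bytes -> Base64 ASCII text,
        // which any generator that only accepts ASCII strings can handle.
        String ascii = Base64.getEncoder()
                .encodeToString(original.getBytes(StandardCharsets.UTF_8));

        // After scanning: Base64 ASCII text -> UTF-8 bytes -> String.
        String decoded = new String(Base64.getDecoder().decode(ascii),
                StandardCharsets.UTF_8);

        System.out.println(decoded.equals(original)); // true
    }
}
```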

It is possible, although it has some pitfalls:
1) It depends on the language you're writing your app in (there are different bindings for different DataMatrix libraries across programming languages).
For example, there is a pretty common library in the *nix world (almost all barcode scanners/generators on Maemo/MeeGo/Tizen, some Windows Phone apps, KDE tools, and so on use it) called [libdmtx][1]. As far as I have tested, it encodes and decodes messages containing Unicode just fine, but it doesn't properly mark the encoded message ("Hey, other readers, this is Unicode!"), so other libraries such as [ZXing][2], as well as many proprietary scanners, decode those Unicode messages as ASCII.
As far as I have discussed with the [ZXing][2] author, the proper mark would probably be an ECI segment (byte 241 as the first codeword, followed by byte 26 for UTF-8). That is only a theoretical solution, though, modeled on the corresponding one for QR codes, and not standardized in any way for DataMatrix (and neither [libdmtx][1] nor [ZXing][2] yet supports encoding with such markings, although there are some steps in that direction).
So, TL;DR: if you plan to use the generated codes (with Unicode messages) only between apps that you're writing yourself, you can freely use [libdmtx][1] for both encoding and decoding on both sides and it will work fine :) If not, look for [ZXing][2] ports in your language (and make sure the port supports encoding).
1: github.com/dmtx/libdmtx
2: github.com/zxing/zxing
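For instance, here is a sketch of encoding with the Java version of [ZXing][2]. This is an illustration, not a guarantee: whether the DataMatrix writer honors the CHARACTER_SET hint (and whether it emits an ECI segment) depends on the ZXing version, as discussed above.

```java
import com.google.zxing.BarcodeFormat;
import com.google.zxing.EncodeHintType;
import com.google.zxing.MultiFormatWriter;
import com.google.zxing.WriterException;
import com.google.zxing.common.BitMatrix;

import java.util.EnumMap;
import java.util.Map;

public class DataMatrixDemo {
    public static void main(String[] args) throws WriterException {
        Map<EncodeHintType, Object> hints = new EnumMap<>(EncodeHintType.class);
        // Ask the writer to use UTF-8; support for this hint in the
        // DataMatrix writer varies across ZXing versions.
        hints.put(EncodeHintType.CHARACTER_SET, "UTF-8");

        BitMatrix matrix = new MultiFormatWriter().encode(
                "Rémy Hubscher", BarcodeFormat.DATA_MATRIX, 200, 200, hints);
        System.out.println(matrix.getWidth() + "x" + matrix.getHeight());
    }
}
```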

Related

How to handle Unicode text with C/C++ servlets/handlers in G-Wan Web Server?

I'm planning to write a web application using C/C++ servlets/handlers for the G-WAN web/app server. I would like my application to work with multiple languages, including multi-byte characters, and hence am wondering how I should handle this in G-WAN servlets.
The xbuf_t structure seems to use char* as its underlying storage buffer for building the HTTP response; and since char is a single byte, I would like to know how this affects text with Unicode or multi-byte characters. I'm a bit reluctant to add heavy Unicode libraries like the IBM Unicode library (ICU) and the like.
Could someone explain how others deal with this situation and, if required, what options are available for handling Unicode, preferably with as few and as small dependencies as possible?
The server response (called reply in the servlet examples) can contain binary data, so this is of course possible. There are examples that dynamically send pictures (GIF, PNG, JSON, etc.), so there's no limit to what you can send as a reply.
Without UNICODE, you would use xbuf_xcat(), which acts like sprintf() with a dynamically growing buffer (the server reply).
What you should do is simply build your UNICODE reply (with your favorite UNICODE library; ANSI C and almost all languages have one) and then copy it into the reply buffer with xbuf_ncat().
Of course, you can also use xbuf_ncat() on the fly for each piece of data you build, rather than once for the whole buffer at the end of your servlet. Your choice.
Note that using UTF-8 may be (depending on your application) a better choice than UNICODE, because then most of your text can still go through xbuf_xcat() (which is faster than a buffer copy).
You will then only need to call xbuf_ncat() for the non-ASCII characters.
The xbuf_xxx() functions could be modified to support UTF-8/UNICODE (with a flag telling which encoding is used, for example), but this will be for later.
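A sketch of that workflow, written in Java purely for illustration (G-WAN's actual API is the C xbuf_* family described above; the growing byte buffer here merely stands in for the reply):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ReplyBuffer {
    public static void main(String[] args) throws IOException {
        // Growing byte buffer standing in for the servlet's reply buffer.
        ByteArrayOutputStream reply = new ByteArrayOutputStream();

        // Build the (possibly non-ASCII) text with normal string tools...
        String body = "<p>Grüße, Rémy!</p>";

        // ...then copy its UTF-8 bytes into the reply in one go, which is
        // the xbuf_ncat() step the answer describes.
        reply.write(body.getBytes(StandardCharsets.UTF_8));

        System.out.println(reply.size() + " bytes queued");
    }
}
```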

Why are there different encoding types?

This is a noob question, but I want to know why there are different encoding types and what their differences are (e.g. ASCII, UTF-8 and UTF-16, Base64, etc.).
There are many reasons, I believe, but the main point is: "How many characters do you need to display (encode)?" If you live in the US, for example, you can get pretty far with ASCII. But in many countries we need characters like ä, å, ü etc. (If SO were ASCII-only, or you tried to read this text as ASCII-encoded text, you'd see weird characters in the places of ä, å and ü.) Think also of China, Japan, Thailand and other "exotic" countries. Those weird figures on photos you may have seen around the world just might be letters, not pretty pictures.
As for the differences between encoding types, you need to see their specifications. Here's something on UTF-8:
http://www.unicode.org/standard/standard.html
http://www.utf-8.com/
http://en.wikipedia.org/wiki/UTF-8#Compared_to_other_multi-byte_encodings
I'm not familiar with UTF-16. Here's some information about the differences:
http://en.wikipedia.org/wiki/Unicode
http://en.wikipedia.org/wiki/Unicode_plane
Base64 is used when there is a need to encode binary data that must be stored and transferred over media designed to deal with textual data. If you've ever built some sort of email system with PHP, you've probably encountered Base64.
http://en.wikipedia.org/wiki/Base64
http://www.phpeveryday.com/articles/PHP-Email-Using-Embedded-Images-in-HTML-Email-P113.html
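A small Java illustration of that point: arbitrary binary bytes in, plain ASCII out (the sample bytes are just the PNG file signature; Base64.getMimeEncoder wraps its output at 76 columns, as email transports expect):

```java
import java.util.Base64;

public class MimeAttachment {
    public static void main(String[] args) {
        // Sample binary data: the 8-byte PNG file signature.
        byte[] binary = {(byte) 0x89, 'P', 'N', 'G', '\r', '\n', 0x1A, '\n'};

        // Encode to Base64; the result contains only ASCII characters
        // and can travel safely through text-only channels.
        String ascii = Base64.getMimeEncoder().encodeToString(binary);
        System.out.println(ascii); // iVBORw0KGgo=
    }
}
```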
In short: to support localization of computer programs' user interfaces into many different languages. (Programming languages themselves still mainly consist of characters found in the ASCII encoding, although it's possible, for example in Java, to use UTF-8 in variable names; and a source code file is usually stored as something other than ASCII-encoded text, for example as UTF-8.)
In short, vol. 2: whenever different people try to solve a problem from a specific point of view (or even without one, if that's even possible), the results may be quite different. A quote from Joel's Unicode article (linked below): "Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255."
Thanks to Joachim and tchrist for all the info and discussion. Here are two articles I just read. (Both links are on the page I linked to earlier.) I'd forgotten most of the stuff from Joel's article since I last read it a few years back. It's a good introduction to the subject, I hope. Mark Davis goes a little deeper.
http://www.joelonsoftware.com/articles/Unicode.html
http://www.icu-project.org/docs/papers/forms_of_unicode/
The real reason why there are so many variants is that the Unicode consortium came along too late.
In The Beginning, memory and storage were expensive, and using more than 8 (or sometimes only 7) bits of memory to store a single character was considered excessive. Thus pretty much all text was stored using 7 or 8 bits per character. Clearly, 8 bits are not enough memory to represent the characters of all human languages; it's barely enough to represent most characters used in a single language (and for some languages even that's not possible). Therefore, many different character encodings were designed to allow different languages (English, German, Greek, Russian, ...) to encode their texts in 8 bits per character. After all, a single text file (and usually even a single computer system) will only ever be used in a single language, right?
This led to a situation where there was no single, agreed-upon mapping of characters to numbers of any kind. Many different, incompatible solutions were produced, and no real central control existed. Some computer systems used ASCII, others EBCDIC (or, more precisely, one of the many variations of EBCDIC), ISO 8859-* (or one of its many derivatives), or any of a big list of encodings that are hardly ever heard of now.
Finally, the Unicode Consortium stepped up to the task to produce that single mapping (together with lots of auxiliary data that's useful but outside of the bounds of this answer).
When the Unicode consortium finally produced a fairly comprehensive list of characters that a computer might represent (together with a number of encoding schemes to encode them to binary data, depending on your concrete needs), the other character encoding schemes were already widely used. This slowed down the adoption of Unicode and its encodings (UTF-8, UTF-16) considerably.
These days, if you want to represent text, your best bet is to use one of the few encodings that can represent all Unicode characters. UTF-8 and UTF-16 together should suffice for 99% of all use cases; UTF-32 covers almost all the others. And just to be clear: all the UTF-* encodings can encode all valid Unicode characters. But because UTF-8 and UTF-16 are variable-width encodings, they might not be ideal for all use cases. Unless you need to interact with a legacy system that can't handle those encodings, there is rarely a reason to choose anything else these days.
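A small Java demonstration of those width differences, assuming the JRE provides the optional UTF-32 charset (mainstream JDKs do):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Widths {
    public static void main(String[] args) {
        // One 1-, 2-, 3- and 4-byte character in UTF-8, respectively.
        String s = "aä€𝄞";
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length);    // 10
        System.out.println(s.getBytes(StandardCharsets.UTF_16LE).length); // 10
        System.out.println(s.getBytes(Charset.forName("UTF-32")).length); // 16
    }
}
```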
The main reason is to be able to show more characters. When the internet was in its infancy, no one really planned ahead, thinking that one day there would be people using it from all countries and all languages around the world. So a small character set was good enough. Gradually it was revealed to be limited and English-centric, hence the demand for bigger character sets.

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can effectively work on internationalizing a project at work. I have just started and would very much like to know from your expertise whether I've understood these concepts correctly. So far, here is the dumbed-down version (for my understanding) of what I've gathered from the web:
Character encodings -> Sets of rules that tell the OS how to store characters. E.g., ISO 8859-1, MS Windows-1252, UTF-8, UCS-2, UTF-16. These rules are also called code pages/character sets, which map individual characters to numbers. Apparently Unicode handles this a bit differently than the others, i.e., instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs. [http://www.joelonsoftware.com/articles/Unicode.html]
Fonts -> These are implementations of character encodings. They are files of different formats (TrueType, OpenType, PostScript) that contain a mapping from each character in an encoding to a number.
Glyphs -> These are the visual representations of the characters stored in font files.
And based on the above understanding I have the below questions,
1) For the OS to understand an encoding, should it be installed separately? Or would installing a font that supports the encoding suffice? Is it okay to use the analogy of a network protocol, say TCP, for an encoding, since it too is just a set of rules? (Which of course begs the question: how does the OS understand these network protocols when I do not install them? :-p)
2) Will a font always have the complete implementation of a code page, or just part of it? Is there a tool I can use to see each character in a font (.TTF file)? [The Windows font viewer shows what a style of the font looks like but gives no information about the list of characters in the font file.]
3) Does a font file support multiple encodings? Is there a way to know which encoding(s) a font supports?
I apologize for asking so many questions, but I have had these on my mind for some time and couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail, but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with a book that will save you a great deal of time (you won't become an i18n expert overnight, you know):
Developing International Software - it might be 8 years old, but it is still worth every cent you're going to spend on it. The programming examples may relate to Windows (C++ and .NET), but the i18n and L10n knowledge is really there. A colleague of mine once said that it saved him about two years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting It All Out - Michael Kaplan's blog, often about i18n support on the Windows platform
Global by design - John Yunker actively posts bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid that I am not aware of many up-to-date resources on that topic (publicly available ones, that is). The only current resource I know of is the Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably also need something related to i18n in JavaScript. Unfortunately, the support is rather poor, but there are a few libraries which help deal with the problem. The most notable examples would be the Dojo Toolkit and Globalize.
The former is a bit heavy, although it supports many aspects of i18n; the latter is lightweight, but unfortunately a lot of stuff is missing. If you choose to use Globalize, you might be interested in Jukka's latest book:
Going Global with JavaScript & Globalize.js - I have read it, and as far as I can tell, it is great. It doesn't cover the topics you were originally asking about, but it is still worth reading, even just for the hands-on examples of how to use Globalize.
Apparently Unicode handles this a bit differently than the others, i.e., instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs.
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
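These levels can be observed directly; here is a short Java sketch reproducing the 𝄞 example above (Java strings use UTF-16, so the CEF code units are visible as chars):

```java
import java.nio.charset.StandardCharsets;

public class EncodingModel {
    public static void main(String[] args) {
        String clef = "𝄞"; // U+1D11E MUSICAL SYMBOL G CLEF

        // CCS level: a single code point.
        System.out.printf("U+%X%n", clef.codePointAt(0)); // U+1D11E

        // CEF level: two UTF-16 code units (a surrogate pair).
        for (char c : clef.toCharArray())
            System.out.printf("%04X ", (int) c); // D834 DD1E
        System.out.println();

        // CES level: four bytes in UTF-16LE.
        for (byte b : clef.getBytes(StandardCharsets.UTF_16LE))
            System.out.printf("%02X ", b & 0xFF); // 34 D8 1E DD
    }
}
```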
1) For the OS to understand an encoding, should it be installed separately?
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
2) Will a font always have the complete implementation of a code page or just part of it?
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
I'll provide you with short answers to your questions.
It's generally not the OS that supports an encoding but the applications. Encodings are used to convert a stream of bytes into a list of characters. For example, in C#, reading a UTF-8 stream will automatically turn the text into UTF-16 if you tell it to treat it as a string.
No matter what encoding you use, C# will simply use UTF-16 internally, and when you want to, for example, print a string from a foreign encoding, it will convert it to UTF-16 first, then look up the corresponding characters in the character tables (fonts) and show the glyphs.
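The same round trip sketched in Java rather than C# (Java strings are also UTF-16 internally, so the behavior matches the description above):

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    public static void main(String[] args) {
        byte[] utf8Input = {(byte) 0xC3, (byte) 0xA9}; // "é" in UTF-8

        // Decoding turns the byte stream into the language's internal
        // UTF-16 string representation.
        String s = new String(utf8Input, StandardCharsets.UTF_8);

        System.out.println(s);          // é
        System.out.println(s.length()); // 1 UTF-16 code unit
    }
}
```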
I don't recall ever seeing a complete font. I don't have much experience working with fonts either, so I cannot give you an answer for this one.
The answer to this one is in #1, but in short: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding, you'll be fine.
Bonus answer, on "how does the OS understand network protocols it doesn't know?": again, it's not the OS that handles them but the applications. As long as the OS knows where to redirect the traffic (i.e., to which application), it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to let the OS know where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

Unicode Encoding and decoding issues in QRCode

I am trying to generate a UTF-8 QR code so that I can encode accents and Unicode characters.
To test it, I am using several decoding solutions:
http://zxing.org/w/decode.jspx - The zxing project also used in Android
http://www.drhu.org/QRCode/QRDecoder.php - a PHP Decoder
http://zbar.sf.net - The ZBar bar code reader, an open-source C project for embedded use
All of them always give me the same result.
You can try this image; it works well with Unicode characters.
But if I try to use ZXing or the Google Chart API to generate the QR code, I cannot decode it correctly.
I have tried this :
http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=SHIFT_JIS&chl=R%C3%A9my+Hubscher
http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=ISO-8859-1&chl=R%C3%A9my+Hubscher
http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=R%C3%A9my+Hubscher
But all without success.
Do you know how I can do this? Do you know which encoding is used for the working image?
The solution that came up is to encode the text in UTF-8 and add a BOM to specify that the string is actually UTF-8.
Here it works:
http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%EF%BB%BFR%C3%A9my+Hubscher
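A minimal Java sketch of how such a link can be built: prefix the payload with U+FEFF (the BOM) and percent-encode its UTF-8 bytes (the URL and name are the ones from the question):

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QrUrl {
    public static void main(String[] args) throws Exception {
        // U+FEFF is the byte order mark; it becomes %EF%BB%BF in the URL.
        String payload = "\uFEFF" + "Rémy Hubscher";
        String url = "http://chart.apis.google.com/chart?cht=qr&chs=200x200"
                + "&choe=UTF-8&chl="
                + URLEncoder.encode(payload, StandardCharsets.UTF_8.name());
        System.out.println(url);
    }
}
```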
Heuristics used by QR decoders often fail; the BOM does not help
Most QR decoders use heuristics to automatically detect the character encoding, even if it is specified explicitly inside the QR code via the ECI extension.
It turned out that the BOM helped your decoder, but for most decoders the BOM does not help. As an example of a decoder that cannot display a proper UTF-8 string, take a Xiaomi phone with MIUI Global v11.0.3 (with its native scanner application). This phone cannot correctly show the UTF-8 QR code produced by the link in your original question; here is how it showed it: R閙y Hubscher. With the BOM (using the link from your subsequent message) it showed: ?R閙y Hubscher (it just rendered the BOM character as ?). But if you add a Chinese character like 日 before the string instead of the BOM, Xiaomi will show the string correctly. Here is the link: chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%E6%97%A5R%C3%A9my%20Hubscher
Xiaomi correctly displays the string 日Rémy Hubscher from a QR code generated by this link.
Another example is "QR code reader & QR code Scanner", an Android app by TWMobile. It properly decoded the QR codes from all the links you provided, so you did not have to use a BOM to make the TWMobile scanner display the strings properly.
Why do QR decoders use heuristics to detect the character set even though these heuristics frequently fail, as shown in your case? As you know, there are 4 modes of storing text in a QR code: (1) numeric, (2) alphanumeric, (3) 8-bit, and (4) Kanji. So the QR code standard does not inherently support UTF-8. To use UTF-8 encoding (instead of the default ISO-8859-1 or JIS8) in the 8-bit mode, the implementation has to insert an ECI (Extended Channel Interpretation) before that string.

ECI is an optional, additional feature of a QR code. The good news is that it was defined in the earliest QR code standard, at least as of 2000. ECI enables data encoding using character sets other than the default. It also enables other data interpretations (e.g. compacted data using defined compression schemes) or other industry-specific requirements to be encoded. The ECI protocol is defined in a specification developed by AIM, Inc., which is not available for free but can be purchased for a fee. Unfortunately, not all QR decoders can handle the ECI protocol, even for something as basic as changing the default encoding to UTF-8. And even for the default encodings, ISO-8859-1 (for the 8-bit mode) and Shift_JIS (for the Kanji mode), decoders still use heuristics to determine the character set, because some applications that encode QR codes may not support ECI or may specify an incorrect character set.
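As an illustration of the encoder side, here is a hedged Java sketch using ZXing's QR writer; passing a non-default CHARACTER_SET hint is what makes ZXing insert an ECI segment before the 8-bit data, at least in recent versions. Whether a given reader honors it is, as explained above, another matter.

```java
import com.google.zxing.BarcodeFormat;
import com.google.zxing.EncodeHintType;
import com.google.zxing.WriterException;
import com.google.zxing.common.BitMatrix;
import com.google.zxing.qrcode.QRCodeWriter;

import java.util.EnumMap;
import java.util.Map;

public class QrWithEci {
    public static void main(String[] args) throws WriterException {
        Map<EncodeHintType, Object> hints = new EnumMap<>(EncodeHintType.class);
        // A charset other than the default ISO-8859-1 triggers ECI insertion.
        hints.put(EncodeHintType.CHARACTER_SET, "UTF-8");

        BitMatrix qr = new QRCodeWriter().encode(
                "Rémy Hubscher", BarcodeFormat.QR_CODE, 200, 200, hints);
        System.out.println(qr.getWidth() + "x" + qr.getHeight());
    }
}
```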
Conclusion
Because of the heuristics used to automatically detect the character set, QR decoders often fail to display the string properly, even when the correct encoding is explicitly specified via ECI, as it was in your case; and the BOM character did not help either, as shown in the Xiaomi example. You found a solution in your reply, but it did not help for Xiaomi: some QR decoders use heuristic algorithms so dumb that even a BOM does not help.
Although the BOM did help with your QR decoder, a better solution would be to stop using error-prone QR decoders that apply heuristics even when the character encoding is explicitly specified via ECI.
Find a better QR decoder if yours cannot properly decode the text without a BOM. The encoder that you used (via the links) is fine.

Unicode, UTF, ASCII, ANSI format differences

What is the difference between the Unicode, UTF8, UTF7, UTF16, UTF32, ASCII, and ANSI encodings?
In what way are these helpful for programmers?
Going down your list:
"Unicode" isn't an encoding, although unfortunately, a lot of documentation imprecisely uses it to refer to whichever Unicode encoding that particular system uses by default. On Windows and Java, this often means UTF-16; in many other places, it means UTF-8. Properly, Unicode refers to the abstract character set itself, not to any particular encoding.
UTF-16: 2 bytes per "code unit". This is the native format of strings in .NET, and generally in Windows and Java. Values outside the Basic Multilingual Plane (BMP) are encoded as surrogate pairs. These used to be needed relatively rarely, but now many consumer applications need to be aware of non-BMP characters in order to support emojis (see the sketch after this list).
UTF-8: Variable length encoding, 1-4 bytes per code point. ASCII values are encoded as ASCII using 1 byte.
UTF-7: Usually used for mail encoding. Chances are if you think you need it and you're not doing mail, you're wrong. (That's just my experience of people posting in newsgroups etc - outside mail, it's really not widely used at all.)
UTF-32: Fixed width encoding using 4 bytes per code point. This isn't very efficient, but makes life easier outside the BMP. I have a .NET Utf32String class as part of my MiscUtil library, should you ever want it. (It's not been very thoroughly tested, mind you.)
ASCII: Single byte encoding only using the bottom 7 bits. (Unicode code points 0-127.) No accents etc.
ANSI: There's no one fixed ANSI encoding - there are lots of them. Usually when people say "ANSI" they mean "the default locale/codepage for my system" which is obtained via Encoding.Default, and is often Windows-1252 but can be other locales.
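As promised after the UTF-16 entry above, here is a short Java sketch of a surrogate pair in action (the emoji is just one example of a non-BMP character; .NET strings behave the same way):

```java
public class Surrogates {
    public static void main(String[] args) {
        String smiley = "😀"; // U+1F600, outside the BMP

        // One code point, but two UTF-16 code units: a surrogate pair.
        System.out.println(smiley.length());                             // 2
        System.out.println(smiley.codePointCount(0, smiley.length()));   // 1
        System.out.println(Character.isHighSurrogate(smiley.charAt(0))); // true
    }
}
```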
There's more on my Unicode page, along with tips for debugging Unicode problems.
The other big resource, of course, is unicode.org, which contains more information than you'll ever be able to work your way through - possibly the most useful bit is the code charts.
Some reading to get you started on character encodings: Joel on Software:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
By the way - ASP.NET has nothing to do with it. Encodings are universal.