How to handle Unicode text with C/C++ servlets/handlers in the G-WAN Web Server?

I'm planning to write a web application using C/C++ servlets/handlers for the G-WAN web/app server. I would like my application to work with multiple languages, including multibyte characters, and hence I am wondering how I should handle this in G-WAN servlets.
The xbuf_t structure seems to use a char* as its underlying storage buffer for building the HTTP response; and since char is a single byte, I would like to know how this affects text with Unicode or multi-byte characters. I'm a bit reluctant to add heavy Unicode libraries like the IBM Unicode library (ICU) and the like.
Could someone explain to me how others deal with this situation and, if required, what options are available for handling Unicode, preferably with as few and as small dependencies as possible?

The server response (called reply in the servlet examples) can contain binary data, so this is of course possible. There are examples that dynamically send pictures (GIF, PNG), JSON, etc., so there's no limit to what you can send as a reply.
Without Unicode, you are using xbuf_xcat(), which acts like sprintf() with a dynamically growing buffer (the server reply).
What you should do is simply build your Unicode reply (with your favorite Unicode library; ANSI C and almost all languages have one) and then copy it into the reply buffer with xbuf_ncat().
Of course, you can also use xbuf_ncat() on the fly for each piece of data you build, rather than once for the whole buffer at the end of your servlet. Your choice.
Note that using UTF-8 may be (it depends on your application) a better choice than wide-character Unicode, because then most of your text can go through xbuf_xcat() (which is faster than a buffer copy).
You will only need to call xbuf_ncat() for the non-ASCII characters.
The xbuf_xxx() functions could be modified to support UTF-8/Unicode (with a flag telling which encoding is used, for example), but that will come later.
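To make this concrete, here is a minimal sketch of the UTF-8 approach described above. It assumes the xbuf_xcat()/xbuf_ncat() signatures and the get_reply() helper used in the G-WAN servlet examples, so treat it as an illustration rather than a drop-in servlet:

// hello_utf8.c - sketch of a G-WAN servlet building a UTF-8 reply
// (assumes the gwan.h API shipped with the servlet examples)
#include "gwan.h"

int main(int argc, char *argv[])
{
    xbuf_t *reply = get_reply(argv);   // the dynamically growing reply buffer

    // ASCII-only parts can go through the printf-like xbuf_xcat()
    xbuf_xcat(reply, "<html><body><p>Hello, ");

    // non-ASCII text is appended as raw UTF-8 bytes with an explicit length,
    // so high bytes are copied verbatim ("world" in Chinese, UTF-8 encoded)
    static char utf8_name[] = "\xE4\xB8\x96\xE7\x95\x8C";
    xbuf_ncat(reply, utf8_name, sizeof(utf8_name) - 1);

    xbuf_xcat(reply, "</p></body></html>");
    return 200;                        // HTTP status code
}

Remember to also declare charset=utf-8 in the Content-Type header of the response so browsers interpret the bytes correctly.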

Related

Why is Base64 used "only" to encode binary data?

I have seen many resources about the usage of Base64 on today's internet. As I understand it, all of those resources seem to spell out a single use case in different ways: encode binary data in Base64 to avoid it getting misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate servers/systems/routers on the path to the client. Why would intermediate servers/systems/routers need to interpret something that they receive? Are there any examples of such systems which may corrupt/wrongly interpret the data they receive, on today's internet?
Why do we only fear that binary data will be corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by the same logic, any text characters that do not belong to the Base64 alphabet could be corrupted/misinterpreted. Why, then, is Base64 used only to encode binary data? Extending the same idea, when we use a browser, are JavaScript and HTML files transferred in Base64 form?
There are two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time" when some systems took ASCII seriously and only ever considered (and transferred) 7 bits out of any 8-bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these have a similar effect when transferring binary (i.e. non-text) data over them: they try to interpret the binary data as textual data in a character encoding that obviously doesn't make sense (since there is no character encoding in binary data) and, as a consequence, modify the data in an unfixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
As to your second question, the answer is two-fold:
textual data often comes with some kind of metadata (either an HTTP header or a meta tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might mean using quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least: Base64 has a serious drawback, namely that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
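To illustrate that 3-bytes-in, 4-characters-out mapping (and where the ~33% overhead comes from), here is a small, self-contained C sketch of a Base64 encoder; it is not tied to any particular library and is only meant to show the mechanics:

#include <stdio.h>
#include <string.h>

/* The 64-character ASCII alphabet: every 3 input bytes (24 bits) are split
   into four 6-bit indices into this table, hence the ~33% size increase. */
static const char tbl[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

static void b64_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i, o = 0;
    for (i = 0; i + 2 < len; i += 3) {                /* full 3-byte groups */
        unsigned v = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = tbl[(v >> 6) & 63];
        out[o++] = tbl[v & 63];
    }
    if (len - i == 1) {                               /* 1 leftover byte */
        unsigned v = in[i] << 16;
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = '=';
        out[o++] = '=';
    } else if (len - i == 2) {                        /* 2 leftover bytes */
        unsigned v = (in[i] << 16) | (in[i + 1] << 8);
        out[o++] = tbl[(v >> 18) & 63];
        out[o++] = tbl[(v >> 12) & 63];
        out[o++] = tbl[(v >> 6) & 63];
        out[o++] = '=';
    }
    out[o] = '\0';
}

int main(void)
{
    const unsigned char data[] = { 0xFF, 0x00, 0xAB };  /* arbitrary binary bytes */
    char out[8];
    b64_encode(data, sizeof data, out);
    printf("%s\n", out);  /* prints "/wCr": pure 7-bit ASCII, safe for transit */
    return 0;
}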
One of the uses of Base64 is to send email.
Mail servers used a terminal-style channel to transmit data. It was also common to have translations, e.g. "\r\n" into a single "\n" and vice versa. Note: there was also no guarantee that 8 bits could be used (the email standard is old, and it also allowed non-"internet" email, with ! instead of @). Also, systems might not be fully ASCII.
Also, a line containing just "." is considered the end of the body, and mbox files use a line starting with "From " to mark the start of a new mail (so such lines in the body must be escaped as ">From"); so even when the 8-bit flag became common in mail servers, the problems were not totally solved.
Base64 was a good way to remove all these problems: the content is just sent as characters that all servers must know, and the problem of encoding/decoding requires only sender and receiver agreement (and the right programs), without worrying about the many relay servers in between. Note: all "\r", "\n", etc. are just ignored.
Note: you can also use Base64 to encode strings in URLs (usually the URL-safe variant that replaces + and / with - and _), without worrying about how web browsers will interpret them. You may also see Base64 in configuration files (e.g. to include icons): specially crafted images cannot then be misinterpreted as configuration. In short, Base64 is handy for encoding binary data into protocols which were not designed for binary data.

Can someone explain the sequence of events that occurs in the encoding/decoding process?

I'm trying to solidify my understanding of encoding and decoding. I'm not sure how the sequence of events works in different settings:
When I type on my computer, is the computer (or whatever program I'm in) automatically decoding my letters into UTF-8 (or whatever encoding is used)?
When I save a file, is it automatically saved using the encoding standard that was used to decode my text? Let's say I send that document or dataset over to someone: am I sending a bunch of 1s and 0s to them? And is their decoder then decoding it based on whatever default or encoding standard they specify?
How do code points play into this? Does my computer also have a default code point dictionary it uses?
If the above is true, how do I find out what kind of decoding/encoding my computer/program is using?
Sorry if this isn't clear, or if I'm misunderstanding/using terminology incorrectly.
There are a few ways that this can work, but here is one possibility.
First, yes, in a way, the computer "decodes" each letter you type into some encoding. Each time you press a key on your keyboard, you close a circuit, which signals to other hardware in your computer (e.g., a keyboard controller) that a key was pressed. This hardware then populates a buffer with information about the keyboard event (key up, key down, key repeat) and sends an interrupt to the CPU.
When the CPU receives the interrupt, it jumps to a hardware-defined location in memory and begins executing the code it finds there. This code often will examine which device sent the interrupt and then jump to some other location that has code to handle an interrupt sent by the particular device. This code will then read a "scan code" from the buffer on the device to determine which key event occurred.
The operating system then processes the scan code and delivers it to the application that is waiting for keyboard input. One way it can do this is by populating a buffer with the UTF-8-encoded character that corresponds to the key (or keys) that was pressed. The application would then read the buffer when it receives control back from the operating system.
To answer your second question, we first have to remember what happens as you enter data into your file. As you type, your application receives the letters (perhaps UTF-8-encoded, as mentioned above) corresponding to the keys that you press. Now, your application will need to keep track of which letters it has received so that it can later save the data you've entered to a file. One way that it can do this is by allocating a buffer when the program is started and then copying each character into the buffer as it is received. If the characters are delivered from the OS UTF-8-encoded, then your application could simply copy those bytes to the other buffer. As you continue typing, your buffer will continue to be populated by the characters that are delivered by the OS.
When it's time to save your file, your application can ask the OS to write the contents of the buffer to a file or to send them over the network. Device drivers for your disk or network interface know how to send this data to the appropriate hardware device. For example, to write to a disk, you may have to write your data to a buffer, write to a register on the disk controller to signal that the data in the buffer should be written to the disk, and then repeatedly read from another register on the disk controller to check whether the write is complete.
Third, Unicode defines a code point for each character. Each code point can be encoded in more than one way. For example, the code point U+004D ("Latin capital letter M") can be encoded in UTF-8 as 0x4D, in UTF-16 as 0x004D, or in UTF-32 as 0x0000004D (see Table 3-4 in The Unicode Standard). If you have data in memory, then it is encoded using some encoding, and there are libraries available that can convert from one encoding to another.
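As an example of such a conversion, here is a small sketch using POSIX iconv() (available through glibc or GNU libiconv) to re-encode a UTF-8 string as UTF-16LE and print the resulting bytes; the encoding names are the ones commonly accepted by iconv, and error handling is kept minimal:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    /* "M" (U+004D) and the euro sign (U+20AC) encoded in UTF-8 */
    char in[] = "M\xE2\x82\xAC";
    char out[32];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out;

    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");   /* to, from */
    if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    iconv_close(cd);

    /* Expected: 4D 00 AC 20 -- each code point as a little-endian 16-bit unit */
    for (char *p = out; p < outp; p++)
        printf("%02X ", (unsigned char)*p);
    printf("\n");
    return 0;
}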
Finally, you can find out how your computer processes keyboard input by examining the device drivers. You could start by looking at some Linux drivers, as many are open source. Each program, however, can encode and decode data however it chooses to. You would have to examine the code for each individual program to understand how its encoding and decoding works.
This is a complex question, partly because it depends on many things.
When I type on my computer, is the computer (or whatever program I'm in) automatically decoding my letters into UTF-8 (or whatever encoding is used)?
This is very complex. Some programs read the raw keyboard codes (e.g. games), but most programs use operating system services to interpret keyboard codes (taking into account the various keyboard layouts, but also modifying the result according to Shift, Control, etc.).
So which encoding you get depends on the operating system and on the program. For terminal programs, the locale of the process also includes the encoding of stdin/stdout (standard input and standard output). For graphical interfaces, you may get a different encoding (according to the system encoding).
But note that UTF-8 is an encoding, so "decoding my letters into UTF-8" is really encoding.
When I save a file, is it automatically saved using the encoding standard that was used to decode my text? Let's say I send that document or dataset over to someone: am I sending a bunch of 1s and 0s to them? And is their decoder then decoding it based on whatever default or encoding standard they specify?
This is the complex part. Many systems and computer languages are old, so they were designed with just one system encoding, e.g. the C language. So there is not really any decoding: programs use the encoding directly and hard-code the fact that the letter A has a specific value. For computers, only the numeric value matters. Only when data is printed are things interpreted, and in a complex way (fonts, character size, ligatures, line breaking, ...). [And if you use string functions, you explicitly tell the program to treat the numbers as a string of characters.]
Some languages (and HTML: you view a page generated by an external machine, so the system encoding is no longer the same) introduced a decoding step: internally, a program has one single way to represent a string (e.g. as Unicode code points). But to get such a uniform format, strings need to be decoded (and in exchange we can now handle different encodings, instead of being restricted to the encoding of the system).
If you save a file, it will contain a sequence of bytes. To interpret them (also known as decoding) you need to know which encoding the file uses. In general you should either know it or provide out-of-band information ("the following file is UTF-8"), e.g. in HTTP headers, in the file extension, in the field definition of a database, etc. Some systems (Microsoft Windows) use a BOM (byte order mark) to distinguish between UTF-16LE, UTF-16BE, UTF-8 and the old system encoding (some people call it ANSI, but it is not ANSI, and it could be any of many different code pages).
The decoder usually should know the encoding; otherwise it either uses a default or guesses. HTML defines a list of steps to perform to get an estimate. The BOM method above can help, and some tools guess by looking for common combinations of characters (in various languages). But this is still guesswork: without a BOM or out-of-band data we can only estimate, and we often get it wrong.
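As a concrete illustration of the BOM method, here is a small C sketch (the detect_bom() helper is hypothetical, written only for this example) that inspects the first bytes of a buffer; remember that the absence of a BOM proves nothing, since the data could still be UTF-8, a legacy code page, etc.:

#include <stdio.h>
#include <string.h>

/* Report which BOM, if any, starts the buffer. */
static const char *detect_bom(const unsigned char *buf, size_t len)
{
    if (len >= 3 && !memcmp(buf, "\xEF\xBB\xBF", 3)) return "UTF-8 (with BOM)";
    if (len >= 2 && !memcmp(buf, "\xFF\xFE", 2))     return "UTF-16LE";
    if (len >= 2 && !memcmp(buf, "\xFE\xFF", 2))     return "UTF-16BE";
    return "unknown (no BOM)";
}

int main(void)
{
    const unsigned char sample[] = "\xEF\xBB\xBFhello";    /* UTF-8 BOM + text */
    printf("%s\n", detect_bom(sample, sizeof sample - 1)); /* UTF-8 (with BOM) */
    return 0;
}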
How do code points play into this? Does my computer also have a default code point dictionary it uses?
Code points are the basis of Unicode. Every "character" has a code point: a fixed number, with a description. This is abstract. In UTF-32 you use that same number for the encoding (as 32-bit integers); in all other encodings there is a function (or a map) from code points to encoded values (and the way back). A code point is just a numeric value which describes the semantics (the meaning) of a character. To transmit such information we usually need an encoding (or just an escape sequence, e.g. "U+FEFF" represents, as text, the BOM character).
If the above is true, how do I find out what kind of decoding/encoding my computer/program is using?
Nobody can give a single answer: your computer will use a lot of encodings.
macOS, Unix, POSIX systems: modern systems (and non-root accounts) will probably use UTF-8. Root will probably use just ASCII (7-bit).
Windows: internally it often uses UTF-16. The output depends on the program, but it nearly always uses an 8-bit encoding (so not UTF-16). Windows can read and write several encodings. You can ask the system for the default encoding (but programs can still write UTF-8 or another encoding if they want). The terminal and its settings can give different programs different default encodings.
For this reason, if you program on Windows, you should explicitly save files as UTF-8 (my recommendation), possibly with a BOM (but if you need interoperability with non-Windows machines, skip the BOM; in that case you just have to know that such files must be UTF-8).
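A program that wants to follow this recommendation can simply write the three UTF-8 BOM bytes before its UTF-8 content. Here is a minimal sketch in portable C (the file name and text are just examples):

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("example.txt", "wb");   /* "wb" so no newline translation */
    if (!f) { perror("fopen"); return 1; }

    fwrite("\xEF\xBB\xBF", 1, 3, f);        /* UTF-8 BOM: EF BB BF */

    const char text[] = "h\xC3\xA9llo\n";   /* "hello" with an e-acute, UTF-8 encoded */
    fwrite(text, 1, sizeof text - 1, f);

    fclose(f);
    return 0;
}

Skip the three BOM bytes if the file must interoperate with tools that reject a BOM.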

Is it possible to represent characters beyond ASCII in DataMatrix 2D barcode? (unicode?)

The DataMatrix article on Wikipedia mentions that it supports only ASCII by default. It also mentions a special mode for Base256 encoding, which should be able to represent arbitrary byte values.
However, all the barcode generator libraries that I have tried so far expect the data to be entered as a string and show errors for characters beyond ASCII (Onbarcode and Barcodelib). There is also no way to enter a byte[], which would be required for Base256 mode.
Is there a barcode generator library that supports Base256 mode? (preferably commercial library with support)
Converting the unicode string into Base64 and decoding from base64 after the data is scanned would be one approach, but is there anything else?
It is possible, although it has some pitfalls:
1) It depends on which language you're writing your app in (there are different bindings for different DataMatrix libraries across programming languages).
For example, there is a pretty common library in the *nix-related environment (almost all barcode scanners/generators on Maemo/MeeGo/Tizen, some WinPhone apps, KDE tools, and so on use it) called [libdmtx][1]. As far as I have tested, it encodes and decodes messages containing Unicode just fine, but it doesn't properly mark the encoded message ("Hey, other readers, it is Unicode here!"), so other libraries, such as [ZXing][2], as well as many proprietary scanners, decode those Unicode messages as ASCII.
As far as I have discussed with the [ZXing][2] author, a proper mark would probably be an ECI segment (a 0d241 byte as the first codeword, followed by a "0d26" byte for UTF-8). However, that is a theoretical solution, based on the one for QR codes, and not standardized in any way for DataMatrix (and neither [libdmtx][1] nor [ZXing][2] yet supports encoding with such markings, although there are some steps in that direction).
So, TL;DR: if you plan to use the generated codes (with Unicode messages) only between apps that you're writing, you can freely use [libdmtx][1] for both encoding and decoding on both sides and it will work fine :) If not, try to look for [ZXing][2] ports in your language (and make sure that the port supports encoding).
1: github.com/dmtx/libdmtx
2: github.com/zxing/zxing

Understanding the terms - Character Encodings, Fonts, Glyphs

I am trying to understand this stuff so that I can work effectively on internationalizing a project at work. I have just started and would very much like to know from your expertise whether I've understood these concepts correctly. So far, here is the dumbed-down version (for my understanding) of what I've gathered from the web:
Character Encodings -> A set of rules that tell the OS how to store characters. E.g. ISO8859-1, MSWIN1252, UTF-8, UCS-2, UTF-16. These rules are also called code pages/character sets, which map individual characters to numbers. Apparently Unicode handles this a bit differently from the others, i.e., instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs. [ http://www.joelonsoftware.com/articles/Unicode.html ]
Fonts -> These are implementations of character encodings. They are files of different formats (TrueType, OpenType, PostScript) that contain a mapping for each character in an encoding to a number.
Glyphs -> These are the visual representations of the characters stored in the font files.
And based on the above understanding, I have the questions below:
1) For the OS to understand an encoding, should it be installed separately? Or would installing a font that supports an encoding suffice? Is it okay to use the analogy of a network protocol, say TCP, for an encoding, since it is just a set of rules? (Which of course begs the question: how does the OS understand these network protocols when I do not install them? :-p)
2) Will a font always have the complete implementation of a code page, or just part of it? Is there a tool that I can use to see each character in a font (.TTF file)? [The Windows font viewer shows what a style of the font looks like but doesn't give information about the list of characters in the font file.]
3) Does a font file support multiple encodings? Is there a way to know which encoding(s) a font supports?
I apologize for asking so many questions, but I have had these on my mind for some time and I couldn't find any site that is simple enough for my understanding. Any help/links for understanding this stuff would be most welcome. Thanks in advance.
If you want to learn more, of course I can point you to some resources:
Unicode, writing systems, etc.
The best source of information would probably be this book by Jukka:
Unicode Explained
If you were to follow the link, you'd also find these books:
CJKV Information Processing - deals with Chinese, Japanese, Korean and Vietnamese in detail but to me it seems quite hard to read.
Fonts & Encodings - personally I haven't read this book, so I can't tell you if it is good or not. Seems to be on topic.
Internationalization
If you want to learn about i18n, I can mention countless resources. But let's start with a book that will save you a great deal of time (you won't become an i18n expert overnight, you know):
Developing International Software - it might be 8 years old, but it is still worth every cent you're going to spend on it. The programming examples may be Windows-oriented (C++ and .NET), but the i18n and L10n knowledge is really there. A colleague of mine once said that it saved him about 2 years of learning. As far as I can tell, he wasn't overstating.
You might be interested in some blogs or web sites on the topic:
Sorting it all out - Michael Kaplan's blog, often on i18n support on Windows platform
Global by design - John Yunker is actively posting bits of i18n knowledge to this site
Internationalization (I18n), Localization (L10n), Standards, and Amusements - also known as i18nguy, the web site where you can find more links, tutorials and stuff.
Java Internationalization
I am afraid that I am not aware of many up-to-date resources on that topic (that is, publicly available ones). The only current resource I know of is the Java Internationalization trail. Unfortunately, it is fairly incomplete.
JavaScript Internationalization
If you are developing web applications, you probably also need something related to i18n in JavaScript. Unfortunately, the support is rather poor, but there are a few libraries which help deal with the problem. The most notable examples would be the Dojo Toolkit and Globalize.
The former is a bit heavy, although it supports many aspects of i18n; the latter is lightweight, but unfortunately a lot of stuff is missing. If you choose to use Globalize, you might be interested in Jukka's latest book:
Going Global with JavaScript & Globalize.js - I have read this and, as far as I can tell, it is great. It doesn't cover the topics you were originally asking about, but it is still worth reading, even just for the hands-on examples of how to use Globalize.
Apparently Unicode handles this a bit differently from the others, i.e., instead of a direct mapping from a number (code point) to a glyph, it maps the code point to an abstract "character" which might be represented by different glyphs.
In the Unicode Character Encoding Model, there are 4 levels:
Abstract Character Repertoire (ACR) — The set of characters to be encoded.
Coded Character Set (CCS) — A one-to-one mapping from characters to integer code points.
Character Encoding Form (CEF) — A mapping from code points to a sequence of fixed-width code units.
Character Encoding Scheme (CES) — A mapping from code units to a serialized sequence of bytes.
For example, the character 𝄞 is represented by the code point U+1D11E in the Unicode CCS, the two code units D834 DD1E in the UTF-16 CEF, and the four bytes 34 D8 1E DD in the UTF-16LE CES.
In most older encodings like US-ASCII, the CEF and CES are trivial: Each character is directly represented by a single byte representing its ASCII code.
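A short C sketch can walk the same character through those levels, computing the UTF-16 surrogate pair for U+1D11E (the CEF step) and then serializing it as UTF-16LE bytes (the CES step):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t cp = 0x1D11E;                  /* CCS: the code point U+1D11E */

    /* CEF: UTF-16 encodes code points above U+FFFF as a surrogate pair */
    uint32_t v = cp - 0x10000;
    uint16_t units[2] = {
        (uint16_t)(0xD800 + (v >> 10)),     /* high surrogate: D834 */
        (uint16_t)(0xDC00 + (v & 0x3FF))    /* low surrogate:  DD1E */
    };
    printf("code units: %04X %04X\n", units[0], units[1]);

    /* CES: UTF-16LE serializes each code unit least-significant byte first */
    unsigned char bytes[4] = {
        (unsigned char)(units[0] & 0xFF), (unsigned char)(units[0] >> 8),
        (unsigned char)(units[1] & 0xFF), (unsigned char)(units[1] >> 8)
    };
    printf("bytes: %02X %02X %02X %02X\n",
           bytes[0], bytes[1], bytes[2], bytes[3]);   /* 34 D8 1E DD */
    return 0;
}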
1) For the OS to understand an encoding, should it be installed separately?
The OS doesn't have to understand an encoding. You're perfectly free to use a third-party encoding library like ICU or GNU libiconv to convert between your encoding and the OS's native encoding, at the application level.
2) Will a font always have the complete implementation of a code page or just part of it?
In the days of 7-bit (128-character) and 8-bit (256-character) encodings, it was common for fonts to include glyphs for the entire code page. It is not common today for fonts to include all 100,000+ assigned characters in Unicode.
I'll provide you with short answers to your questions.
It's generally not the OS that supports an encoding but the applications. Encodings are used to convert a stream of bytes into lists of characters. For example, in C#, reading a UTF-8 string will automatically make it UTF-16 if you tell it to treat it as a string.
No matter what encoding you use, C# will simply use UTF-16 internally, and when you want to, for example, print a string from a foreign encoding, it will convert it to UTF-16 first, then look up the corresponding characters in the character tables (fonts) and show the glyphs.
I don't recall ever seeing a complete font. I don't have much experience with working with fonts either, so I cannot give you an answer for this one.
The answer to this one is in #1, but as a short summary: fonts are usually encoding-independent, meaning that as long as the system can convert the input encoding to the font encoding, you'll be fine.
Bonus answer: on "how does the OS understand network protocols it doesn't know?": again, it's not the OS that handles them but the applications. As long as the OS knows where to redirect the traffic (to which application), it really doesn't need to care about the protocol. Low-level protocols usually do have to be installed, to allow the OS to know where to send the data.
This answer is based on my understanding of encodings, which may be wrong. Do correct me if that's the case!

Character Encoding Issue

I'm using an API that processes my files and presents optimized output, but some special characters are not preserved, for example:
Input: äöü
Output: äöü
How do I fix this? What encoding should I use?
Many thanks for your help!
It really depends on what processing you are doing to your data. But in general, one powerful technique is to convert it to UTF-8 (with iconv, for example) and pass it through ASCII-capable APIs or functions. In general, if those functions don't mess with data they don't understand as ASCII, then the UTF-8 is preserved; that's a nice property of UTF-8.
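For what it's worth, the exact symptom in the question can be reproduced by taking the UTF-8 bytes of "äöü" and treating them as ISO-8859-1. The sketch below does that with POSIX iconv(); the fix is simply to read the data as UTF-8 in the first place, or to apply the reverse conversion:

#include <stdio.h>
#include <string.h>
#include <iconv.h>

int main(void)
{
    char in[] = "\xC3\xA4\xC3\xB6\xC3\xBC";   /* "äöü" encoded in UTF-8 */
    char out[64];
    char *inp = in, *outp = out;
    size_t inleft = strlen(in), outleft = sizeof out - 1;

    /* Pretend the bytes were ISO-8859-1 text and re-encode *that* as UTF-8 */
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t)-1 ||
        iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1) {
        perror("iconv");
        return 1;
    }
    iconv_close(cd);
    *outp = '\0';

    printf("%s\n", out);   /* prints the familiar mojibake: Ã¤Ã¶Ã¼ */
    return 0;
}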
I am not sure what language you're using, but things like this occur when there is a mismatch between the encoding of the content when it was entered and the encoding used when it is read back in.
So you might want to specify exactly which encoding to use when reading the data. You may have to experiment with the actual encoding you need, e.g.:
string.getBytes("UTF-8")
string.getBytes("UTF-16")
string.getBytes("UTF-16LE")
string.getBytes("UTF-16BE")
etc...
Also, do some research on the system this data is coming from. For example, web services from ASP.NET deliver content as UTF-16LE, but Java uses UTF-16BE encoding. When these two systems talk to each other with extended characters, they might not understand each other in exactly the same way.