Unicode Encoding and decoding issues in QRCode - unicode

I am trying to generate a UTF-8 QR code so that I can encode accents and other Unicode characters.
To test it, I am using several decoders:
http://zxing.org/w/decode.jspx - The zxing project also used in Android
http://www.drhu.org/QRCode/QRDecoder.php - a PHP Decoder
http://zbar.sf.net - The ZBar bar code reader - OpenSource and C project for embedded
All of them always give me the same result.
You can try this image, which works well with Unicode characters.
But if I use zxing or the Google Chart API to generate the QR code, I cannot decode it correctly.
I have tried these:
http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=SHIFT_JIS&chl=R%C3%A9my+Hubscher
http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=ISO-8859-1&chl=R%C3%A9my+Hubscher
http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=R%C3%A9my+Hubscher
But none of them works.
Do you know how I can do this? Do you know which encoding is used for the working image?

The solution I found is to encode the text in UTF-8 and add a BOM to indicate that the string is actually UTF-8.
Here it works:
http://chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%EF%BB%BFR%C3%A9my+Hubscher
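As a minimal sketch of how such a link can be built, the Python snippet below prepends the BOM (U+FEFF) to the text and percent-encodes the result as UTF-8. The function name `qr_chart_url` and its parameters are hypothetical helpers for illustration; only the query parameters already shown in the links above are used.

```python
from urllib.parse import quote_plus

BOM = "\ufeff"  # U+FEFF; its UTF-8 encoding is the byte sequence EF BB BF


def qr_chart_url(text: str, size: int = 200, with_bom: bool = True) -> str:
    """Build a chart URL like the ones above (hypothetical helper)."""
    payload = (BOM + text) if with_bom else text
    # quote_plus percent-encodes the UTF-8 bytes and turns spaces into '+'.
    return (
        "http://chart.apis.google.com/chart"
        f"?cht=qr&chs={size}x{size}&choe=UTF-8&chl={quote_plus(payload)}"
    )


url = qr_chart_url("Rémy Hubscher")
print(url)
```

Passing `with_bom=False` yields the same URL without the `%EF%BB%BF` prefix, which makes it easy to compare how a given decoder treats both variants.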

Heuristics used by QR decoders often fail; BOM does not help
Most QR decoders use heuristics to detect the character encoding automatically, even if it is specified explicitly inside the QR code via the ECI extension.
It turned out that the BOM helped your decoder. But for most decoders, the BOM does not help. As an example of a decoder that cannot display a proper UTF-8 string, take a Xiaomi phone with MIUI Global v11.0.3 (with its native scanner application). This phone cannot correctly show the UTF-8 QR code produced by the link in your original question. Here is what it showed: R閙y Hubscher. With the BOM (using the link from your subsequent message) it showed this: ?R閙y Hubscher (it just showed the BOM character as ?). But if you add a Chinese character like 日 before the string instead of the BOM, Xiaomi will show the string correctly. Here is the link: chart.apis.google.com/chart?cht=qr&chs=200x200&choe=UTF-8&chl=%E6%97%A5R%C3%A9my%20Hubscher
Xiaomi correctly displays the string 日Rémy Hubscher from a QR code generated by this link.
Another example is the “QR code reader & QR code Scanner” Android app by TWMobile. It properly decoded the QR codes from all the links that you have provided, so you do not need the BOM to make the TWMobile scanner display the strings properly.
Why do QR decoders always use heuristics to detect the character set even though these heuristics frequently fail, as shown in your case? As you know, there are 4 modes of storing text in a QR code: (1) numeric, (2) alphanumeric, (3) 8-bit, and (4) Kanji. So the QR code standard does not inherently support UTF-8. To use UTF-8 (instead of the default “ISO-8859-1” or “JIS8”) in the 8-bit string, the implementation has to insert an ECI (Extended Channel Interpretation) before that string.
ECI is an optional, additional feature of a QR code. The good news is that it has been part of the QR code standard since at least the 2000 edition. ECI enables data encoding using character sets other than the default. It also enables other data interpretations (e.g. compacted data using defined compression schemes) or other industry-specific requirements to be encoded. The ECI protocol is defined in a specification developed by AIM, Inc.; it is not available for free but can be purchased.
Unfortunately, not all QR decoders can handle the ECI protocol, even for something as basic as changing the default encoding to UTF-8. And even for a default encoding like “ISO-8859-1” (for the 8-bit string mode) or “Shift_JIS” (for Kanji mode), decoders still use heuristics to determine the character set, because some applications that encode QR codes may not support ECI or may specify an incorrect character set.
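To make the difference concrete, here is a rough Python sketch of how a decoder could treat an 8-bit segment: trust the ECI charset when one is declared, and guess only when it is absent. The function `decode_byte_segment` and its fallback order are assumptions for illustration, not the logic of any particular scanner.

```python
from typing import Optional

UTF8_BOM = b"\xef\xbb\xbf"


def decode_byte_segment(payload: bytes, eci_charset: Optional[str] = None) -> str:
    """Decode the bytes of an 8-bit QR segment (hypothetical helper)."""
    # A declared ECI charset is treated as authoritative here.
    if eci_charset is not None:
        return payload.decode(eci_charset)
    # No ECI: fall back to guessing. A UTF-8 BOM is a strong hint.
    if payload.startswith(UTF8_BOM):
        return payload[len(UTF8_BOM):].decode("utf-8")
    try:
        # Strict UTF-8 decoding rejects most non-UTF-8 byte sequences.
        return payload.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 accepts any byte sequence, so it is the last resort.
        return payload.decode("iso-8859-1")


utf8_bytes = "Rémy Hubscher".encode("utf-8")
print(decode_byte_segment(utf8_bytes, "utf-8"))
# The same bytes read with the wrong charset produce mojibake:
print(utf8_bytes.decode("iso-8859-1"))
```

The last line shows why guessing is fragile: the same bytes are valid under more than one charset, so a heuristic that picks the wrong one still "succeeds" and silently displays garbage.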
Conclusion
Because they rely on heuristics to detect the character set automatically, QR decoders often fail to display the string properly, even when the correct encoding is explicitly specified via ECI, as it was in your case; and the BOM character did not help, as shown in the Xiaomi example. You found a solution in your reply, but it did not help on the Xiaomi phone. Some QR decoders use heuristic algorithms so crude that even a BOM does not help.
Although the BOM did help with your QR decoder, a better solution would be to stop using error-prone QR decoders that apply heuristics even when the character encoding is explicitly specified via ECI.
If a decoder cannot properly decode the text without a BOM, find a better QR decoder. The encoder that you used (via the links) is OK.

Related

Is encoding/decoding not needed when systems using different Unicode encodings communicate?

I just wonder.
When two systems use different Unicode encodings (one UTF-8, the other UTF-32), don't they need to encode or decode?
I think it's necessary. But in Visual Studio it is just called "Unicode".
So if it's necessary (the other application has to encode or decode), then it is not really a standard, is it?
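As a small illustration of the situation described in this question, the Python sketch below shows that the same character has different byte sequences in UTF-8 and UTF-32, so bytes produced by one system have to be transcoded before the other can read them. The helper name `utf8_to_utf32le` is hypothetical.

```python
def utf8_to_utf32le(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-32LE bytes (hypothetical helper)."""
    # Decode to abstract code points first, then encode in the target form.
    return data.decode("utf-8").encode("utf-32-le")


as_utf8 = "é".encode("utf-8")      # two bytes: C3 A9
as_utf32 = "é".encode("utf-32-le")  # four bytes: E9 00 00 00

print(as_utf8.hex(), as_utf32.hex())
print(utf8_to_utf32le(as_utf8) == as_utf32)
```

Both byte sequences represent the same code point, U+00E9; only the serialization differs.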

What happens if you set your integration package to Unicode?

I'm importing data from flat files (text files). I do not know which encoding they will use; it may be Unicode, or it may be ASCII. What happens if I just choose "Unicode string [DT_WSTR]" (or Unicode data) in my integration package? Would it be able to read ASCII without issues? I am using SSIS 2012.
What happens if I just choose "Unicode string [DT_WSTR]" (or Unicode data) in my integration package? Would it be able to read ASCII without issues?
The encoding that Microsoft misleadingly calls “Unicode” is actually UTF-16LE, an encoding based around two-byte code units.
UTF-16LE is not compatible with ASCII (or any of the locale-specific ANSI code pages), so if you read a file that is actually encoded in an ASCII superset as UTF-16LE, you will get unreadable nonsense.
There's no magic ‘do the right thing’ option for reading characters from files; you have to know what encoding was used to create them. If you can see an encoded Byte Order Mark at the front of the data, that usually allows you to make a good guess, but otherwise you're on your own.
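The following Python sketch illustrates both points outside of SSIS: ASCII bytes misread as UTF-16LE turn into unrelated characters, and a Byte Order Mark can be used to make an educated guess. The function `sniff_bom` is a hypothetical helper, and the result is only a guess, not a guarantee.

```python
import codecs
from typing import Optional

ascii_bytes = "Hello!".encode("ascii")
# Every pair of ASCII bytes is read as one 16-bit code unit.
misread = ascii_bytes.decode("utf-16-le")
print(len(ascii_bytes), len(misread))


def sniff_bom(data: bytes) -> Optional[str]:
    """Guess a codec name from a leading BOM, or None (hypothetical helper)."""
    candidates = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        # UTF-32 is checked before UTF-16: its LE BOM starts with the UTF-16 LE BOM.
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, codec_name in candidates:
        if data.startswith(bom):
            return codec_name
    return None


print(sniff_bom("hi".encode("utf-16")))  # Python's "utf-16" codec writes a BOM
print(sniff_bom(b"hi"))                  # no BOM: the encoding stays unknown
```

When `sniff_bom` returns None, the encoding has to come from somewhere else, such as the file's producer or its documentation.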

Is it possible to represent characters beyond ASCII in DataMatrix 2D barcode? (unicode?)

The DataMatrix article on Wikipedia mentions that it supports only ASCII by default. It also mentions a special mode for Base256 encoding, which should be able to represent arbitrary byte values.
However, all the barcode generator libraries that I have tried so far (Onbarcode and Barcodelib) only accept data as a string and show errors for characters beyond ASCII. There is also no way to pass a byte[], which would be required for Base256 mode.
Is there a barcode generator library that supports Base256 mode? (preferably commercial library with support)
Converting the unicode string into Base64 and decoding from base64 after the data is scanned would be one approach, but is there anything else?
It is possible, although it has some pitfalls:
1) It depends on which language you're writing your app in (there are different bindings for different DM libraries across programming languages).
For example, there is a pretty common library in *nix-related environments called [libdmtx][1] (almost all barcode scanners/generators on Maemo/MeeGo/Tizen, some WinPhone apps, KDE tools, and so on use it). As far as I tested, it encodes and decodes messages containing Unicode just fine, but it doesn't properly mark the encoded message ("Hey, other readers, it is Unicode here!"), so other libraries, such as [ZXing][2], as well as many proprietary scanners, decode those Unicode messages as ASCII.
As far as I discussed with the [ZXing][2] author, the proper marker would probably be an ECI segment (byte 0d241 as the first codeword, followed by byte 0d26 for UTF-8). However, that is a theoretical solution, modeled on the one for QR codes and not standardized in any way for DataMatrix, and neither [libdmtx][1] nor [ZXing][2] supports encoding with such markers yet, although there are some steps in that direction.
So, TL;DR: if you plan to use the generated codes (with Unicode messages) only between apps that you're writing, you can freely use [libdmtx][1] for both encoding and decoding on both sides and it will work fine :) If not, try to look for a [zxing][2] port for your language (and make sure that port supports encoding).
1: github.com/dmtx/libdmtx
2: github.com/zxing/zxing
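The Base64 fallback mentioned in the question can be sketched in a few lines of Python. It keeps the barcode payload pure ASCII, at the cost of a larger symbol, and only works when you control both the generating and the scanning application. The helper names below are hypothetical.

```python
import base64


def to_barcode_payload(text: str) -> str:
    """Encode arbitrary Unicode text as an ASCII-only payload (hypothetical helper)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_barcode_payload(payload: str) -> str:
    """Recover the original text after scanning (hypothetical helper)."""
    return base64.b64decode(payload).decode("utf-8")


payload = to_barcode_payload("Rémy 日本")
print(payload)
print(from_barcode_payload(payload))
```

Any generator that accepts plain ASCII strings can encode `payload`; a generic scanner app will show the Base64 text rather than the original message.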

In Corona SDK how to reverse a unicode string?

I know that Lua does not fully support Unicode, but is there a workaround for this problem?
string.reverse does not work with Unicode, so the following example will not work:
print(string.reverse("أحمد"))
Any help with that?
Corona SDK seems to use UTF-8 as its encoding.
If you want to reverse all Unicode code points in a string, instead of all bytes, you can use this code:
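-- Reverse a UTF-8 string by code point rather than by byte.
function utf8reverse(str)
  -- Reverse the bytes inside each multibyte sequence, then reverse the whole string.
  return str:gsub("([\194-\244][\128-\191]+)", string.reverse):reverse()
end
print(utf8reverse("أحمد"))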
The trick is as follows: in UTF-8, a multibyte code point always starts with a byte 11xx xxxx, followed by one or several bytes 10xx xxxx. The first step is to reverse the bytes within each multibyte code point; the second step is to reverse all bytes of the string, which puts the bytes of each code point back in their original order.
Note: when a Unicode character is composed of several code points (for example, a base letter followed by a combining mark), this simple trick will not work. Full support would require a large Unicode database.
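For readers who want to check the two-step trick on raw bytes, here is the same idea as a Python sketch. Python strings are already sequences of code points, so this exists only to mirror the byte-level logic; `utf8_reverse` is a hypothetical name, and the input is assumed to be valid UTF-8.

```python
import re

# A lead byte (0xC2-0xF4) followed by one or more continuation bytes (0x80-0xBF).
_MULTIBYTE = re.compile(rb"[\xC2-\xF4][\x80-\xBF]+")


def utf8_reverse(data: bytes) -> bytes:
    """Reverse UTF-8 bytes code point by code point (hypothetical helper)."""
    # Step 1: reverse the bytes inside each multibyte sequence.
    inner = _MULTIBYTE.sub(lambda m: m.group(0)[::-1], data)
    # Step 2: reverse everything; each sequence ends up in its original byte order.
    return inner[::-1]


text = "أحمد"
reversed_text = utf8_reverse(text.encode("utf-8")).decode("utf-8")
print(reversed_text == text[::-1])
```

As with the Lua version, combining sequences are not kept together: the result is the code points in reverse order, nothing more.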

How to send text in UTF-8 using Indy TIdTCPServer in c++ builder

My J2ME client application reads text from an input stream using UTF-8:
reader = new InputStreamReader(in,"UTF-8");
When a client connects, my server sends text using this statement:
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text,TEncoding::UTF8);
but the resulting text shows weird characters like ?????????????????????????? ?????????????
What am I doing wrong?
The same happens when I try to send a UTF-8 encoded data file like this:
AContext->Connection->IOHandler->WriteFile("c:\\fids.xml");
The result is the same!
Indy 10 fully supports UTF-8 encoding. I have worked with its TIdFTP component myself and successfully uploaded Unicode text files. From what I can make of it, either:
Your connection/transfer type is set to ftASCII rather than ftBinary.
Your J2ME applet/host platform does not support UTF-8.
'?' characters occur when data is going through a Unicode-to-Ansi conversion to an Ansi charset that does not support the Unicode characters being converted.
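The effect is easy to reproduce in isolation. The Python sketch below is only an illustration of the general mechanism, not of Indy's code path: encoding text to a charset that lacks the characters replaces each of them with '?', while UTF-8 round-trips cleanly.

```python
text = "أحمد"

# Lossy: ASCII has no Arabic letters, so each code point becomes b"?".
lossy = text.encode("ascii", errors="replace")
print(lossy)

# Lossless: UTF-8 can represent every Unicode code point.
round_trip = text.encode("utf-8").decode("utf-8")
print(round_trip == text)
```

Once the '?' bytes are on the wire, the receiver cannot recover the original text, so the conversion has to be fixed on the sending side.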
What version of C++Builder are you using? In versions prior to CB2009, you should tell Indy the encoding of the AnsiString data that you are passing in. Indy defaults to ASCII (i.e. TIdTextEncoding::ASCII) for most String-based operations. That can be overridden when needed, either with the optional AAnsiEncoding parameters, the TIdIOHandler::DefAnsiEncoding property, or the global Idglobal::GIdDefaultAnsiEncoding setting. If you do not specify the correct encoding, the AnsiString data may not be converted to Unicode correctly before it is converted to UTF-8. For example:
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text, TIdTextEncoding_UTF8, TIdTextEncoding_Default);
Or:
AContext->Connection->IOHandler->DefAnsiEncoding = TIdTextEncoding_Default;
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text, TIdTextEncoding_UTF8);
You can optionally also use the TIdIOHandler::DefStringEncoding property if you do not want to specify the UTF-8 encoding on every call:
AContext->Connection->IOHandler->DefStringEncoding = TIdTextEncoding_UTF8;
AContext->Connection->IOHandler->WriteLn(cxMemo1->Text);
Now, with that said, the fact that WriteFile() is also sending data that J2ME is not handling correctly tells me that Indy is not the root of the issue. WriteFile() simply dumps the raw file data as-is to the connection without any interpretation at all. If you send a UTF-8 encoded file, then UTF-8 encoded octets will be sent to J2ME.
I suggest you use a packet sniffer, such as Wireshark, to verify the data that Indy is sending. That will tell you for sure whether Indy is really at fault or not.
PS: notice that in the examples above I use Indy's TIdTextEncoding macros instead of TEncoding directly. This is because Indy's TIdTextEncoding logic works around some bugs in Embarcadero's TEncoding classes. Also, we are going to phase out direct support for TEncoding in Indy 11 and expand on TIdTextEncoding so that Indy has more control than Embarcadero offers.