In Corona SDK, how to reverse a Unicode string?

I know that Lua does not fully support Unicode; however, there should be a workaround for this problem.
string.reverse does not work with Unicode, so the following example fails:
print(string.reverse("أحمد"))
Any help on that?

Corona SDK seems to use UTF-8 as its encoding.
If you want to reverse all Unicode code points in a string, instead of all bytes, you can use this code:
function utf8reverse(str)
    return str:gsub("([\194-\244][\128-\191]+)", string.reverse):reverse()
end
print(utf8reverse("أحمد"))
The trick is as follows: in UTF-8, a multibyte code point always starts with a byte of the form 11xx xxxx, followed by one or more bytes of the form 10xx xxxx. The gsub first reverses the bytes within each multibyte code point, and then the whole string is reversed, which puts those bytes back in the right order.
Note: when a Unicode character is composed of several code points (for example, a base letter plus combining marks), this simple trick will not work. Full support would require a large Unicode database to deal with such cases.
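For example, here is a minimal illustration of that caveat using the utf8reverse above (the "é" is deliberately written in its decomposed form, U+0065 followed by the combining acute accent U+0301):
-- "é" as two code points: "e" (U+0065) + combining acute accent (U+0301, bytes 0xCC 0x81)
local decomposed = "e\204\129x"
-- utf8reverse keeps each code point intact, but after reversing, the combining
-- accent follows the "x" instead of the "e", so the result renders as "x́e".
print(utf8reverse(decomposed))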

Related

Is it possible to represent characters beyond ASCII in DataMatrix 2D barcode? (unicode?)

The DataMatrix article on Wikipedia mentions that it supports only ASCII by default. It also mentions a special mode for Base256 encoding, which should be able to represent arbitrary byte values.
However, all the barcode generator libraries that I have tried so far only accept the data as a string and show errors for characters beyond ASCII (Onbarcode and Barcodelib). There is also no way to enter a byte[], which would be required for Base256 mode.
Is there a barcode generator library that supports Base256 mode? (preferably commercial library with support)
Converting the Unicode string to Base64 and decoding it from Base64 after the data is scanned would be one approach, but is there anything else?
It is possible, although it has some pitfalls:
1) It depends on which language you're writing your app in (there are different bindings for different DataMatrix libraries across programming languages).
For example, there is a fairly common library in the *nix world (almost all barcode scanners/generators on Maemo/MeeGo/Tizen, some WinPhone apps, KDE thingies, and so on use it) called [libdmtx][1]. As far as I have tested, it encodes and decodes messages containing Unicode just fine, but it doesn't properly mark the encoded message ("Hey, other readers, this is Unicode!"), so other libraries such as [ZXing][2], as well as many proprietary scanners, decode those Unicode messages as ASCII.
As far as I have discussed with the [ZXing][2] author, the proper mark would probably be an ECI segment (byte 241 as the first codeword, followed by byte 26 for UTF-8). However, that is a theoretical solution, based on the one for QR codes, and it is not standardized in any way for DataMatrix (and neither [libdmtx][1] nor [ZXing][2] supports encoding with such markings yet, although there are some steps in that direction).
So, TL;DR: if you plan to use the generated codes (with Unicode messages) only between apps that you are writing, you can freely use [libdmtx][1] for both encoding and decoding on both sides and it will work fine :) If not, look for [ZXing][2] ports in your language (and make sure that the port supports encoding).
  [1]: https://github.com/dmtx/libdmtx
  [2]: https://github.com/zxing/zxing

Get code point of character in Lua?

I've done it before, but I'm not certain how and I have since lost the source files.
How do I get the code point of a character in Lua? Or, at least, a unique value for a character?
In Lua 5.3, you can get the code point of a UTF-8 string with utf8.codepoint.
print(utf8.codepoint("瑞"))
--29790
For ASCII strings it's easy:
local char_code = string.byte("A",1);
-- char_code now contains 65
For UTF-8 (assuming that's how you're representing data), it gets tricky. Either use a third-party library like slnunicode, or you'll have to write your own function to parse the UTF-8 bytes.
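As a rough sketch of what such a function has to do (it assumes the input is valid UTF-8 and that i points at the first byte of a character; there is no error handling):
local function utf8_codepoint(s, i)
    i = i or 1
    local b = s:byte(i)
    if b < 0x80 then                     -- 0xxxxxxx: plain ASCII
        return b
    elseif b < 0xE0 then                 -- 110xxxxx 10xxxxxx
        return (b % 0x20) * 0x40 + (s:byte(i + 1) % 0x40)
    elseif b < 0xF0 then                 -- 1110xxxx 10xxxxxx 10xxxxxx
        return (b % 0x10) * 0x1000 + (s:byte(i + 1) % 0x40) * 0x40 + (s:byte(i + 2) % 0x40)
    else                                 -- 11110xxx followed by three continuation bytes
        return (b % 0x08) * 0x40000 + (s:byte(i + 1) % 0x40) * 0x1000
             + (s:byte(i + 2) % 0x40) * 0x40 + (s:byte(i + 3) % 0x40)
    end
end
print(utf8_codepoint("瑞"))  -- 29790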
Your Lua install may already contain the ValidateUnicodeString extension, which allows this to work:
local char_code = string.utf8code("ٱ");
-- char_code now contains 1649
(That example contains an Arabic Alef Wasla, which may not display correctly in your local font)
There are several answers that may give you what you want (if you limit yourself to UTF-8):
Splitting a multibyte string
Iterating over UTF8 code points
Reversing a UTF8 string

Japanese mojibake detection

I want to know if there is a way to detect mojibake (invalid) characters by their byte range. (As a simple example, detecting valid ASCII characters is just checking whether their byte values are less than 128.) Given the old customized character sets, such as JIS, EUC and, of course, Unicode, is there a way to do this?
The immediate interest is in a C# project, but I'd like to find a language/platform-independent solution as much as possible, so I could use it in C++, Java, PHP or whatever.
Arrigato
Detecting 文字化け (mojibake) by byte range is very difficult.
As you know, most Japanese characters are multi-byte. In the case of Shift-JIS (one of the most popular encodings in Japan), the first-byte range of a Japanese character is 0x81 to 0x9F or 0xE0 to 0xEF, and the second byte has its own range. In addition, ASCII characters may be mixed into Shift-JIS text, which makes byte-range detection even harder.
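As a rough sketch of what such a byte-range check looks like (Lua is used here purely for illustration; the second-byte range 0x40 to 0x7E / 0x80 to 0xFC is the usual Shift-JIS one, and single-byte half-width katakana, 0xA1 to 0xDF, is ignored for brevity):
local function is_sjis_lead(b)
    return (b >= 0x81 and b <= 0x9F) or (b >= 0xE0 and b <= 0xEF)
end

local function could_be_sjis(s)
    local i = 1
    while i <= #s do
        local b = s:byte(i)
        if b < 0x80 then
            i = i + 1                            -- ASCII mixed into the text is allowed
        elseif is_sjis_lead(b) and i < #s then
            local b2 = s:byte(i + 1)
            if (b2 >= 0x40 and b2 <= 0x7E) or (b2 >= 0x80 and b2 <= 0xFC) then
                i = i + 2                        -- plausible double-byte character
            else
                return false
            end
        else
            return false
        end
    end
    return true
end
A check like this only tells you that the bytes could be Shift-JIS; as noted above, it cannot prove that they are.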
In Java, you can detect invalid characters with java.nio.charset.CharsetDecoder.
What you're trying to do here is character encoding auto-detection, as performed by Web browsers. So you could use an existing character encoding detection library, like the universalchardet library in Mozilla; it should be straightforward to port it to the platform of your choice.
For example, using Mark Pilgrim's Python 3 port of the universalchardet library:
>>> chardet.detect(bytes.fromhex('83828357836f8350'))
{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
>>> chardet.detect(bytes.fromhex('e383a2e382b8e38390e382b1'))
{'confidence': 0.938125, 'encoding': 'utf-8'}
But it's not 100% reliable!
>>> chardet.detect(bytes.fromhex('916d6f6a6962616b6592'))
{'confidence': 0.6031748712523237, 'encoding': 'ISO-8859-2'}
(Exercise for the reader: what encoding was this really?)
This is not a direct answer to the question, but I've had luck using the ftfy Python package to automatically detect/fix mojibake:
https://github.com/LuminosoInsight/python-ftfy
https://pypi.org/project/ftfy/
https://ftfy.readthedocs.io/en/latest/
>>> import ftfy
>>> print(ftfy.fix_encoding("(à¸‡'âŒ£')à¸‡"))
(ง'⌣')ง
It works surprisingly well for my purposes.
I don't have the time and/or priority to follow up on this for the moment, but I think that, if the source is known to be Unicode, then by using these charts and building on some of the work done here, some headway can be made into the issue. Likewise, for Shift-JIS, this chart can be helpful.

Scintilla Supports Unicode? What about SCI_GETCHARAT?

Does Scintilla really support Unicode? If so, why does SCI_GETCHARAT return a char value (casted to LRESULT)?
From the SCI_SETCODEPAGE docs...
Code page SC_CP_UTF8 (65001) sets Scintilla into Unicode mode with the document treated as a sequence of characters expressed in UTF-8. The text is converted to the platform's normal Unicode encoding before being drawn by the OS and thus can display Hebrew, Arabic, Cyrillic, and Han characters.
You will have to examine the byte you retrieve with SCI_GETCHARAT(pos) and, depending on the top bits of that, maybe read SCI_GETCHARAT(pos+1) and beyond in order to get the Unicode code point. (See here.)
Edit:
For some C++ code that does this, see below (search for SciMoz::GetWCharAt):
http://vacuproj.googlecode.com/svn/trunk/npscimoz/npscimoz/oldsrc/trunk.nsSciMoz.cxx
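The gist of that lead-byte check, as a rough sketch (the linked SciMoz::GetWCharAt does the full job in C++; this only shows how the top bits of the byte decide how many more bytes you need to read):
-- How many bytes the UTF-8 sequence starting with this lead byte occupies
-- (nil means the byte is a continuation byte, i.e. not a valid start).
local function utf8_sequence_length(lead)
    if lead < 0x80 then return 1            -- 0xxxxxxx: ASCII
    elseif lead < 0xC0 then return nil      -- 10xxxxxx: continuation byte
    elseif lead < 0xE0 then return 2        -- 110xxxxx
    elseif lead < 0xF0 then return 3        -- 1110xxxx
    else return 4                           -- 11110xxx
    end
end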
It was a long time ago, but if I remember well, Scintilla is not a native Unicode application. Still, it has some Unicode support.
First, the function should really be named SCI_GETBYTEAT, because it returns a byte from the UTF-8 internal buffer.
Also, the application has Unicode support for keyboard input, so it has some Unicode support :)

Can I get a single canonical UTF-8 string from a Unicode string?

I have a twelve-year-old Windows program. As may be obvious to the knowledgeable, it was designed for ASCII characters, not Unicode. Most of it has been converted, but there's one spot that still needs to be changed over. There is a serious constraint on it though: the exact same ASCII byte sequence MUST be created by different encoders, some of which will be operating on non-Windows systems.
I'm trying to determine whether UTF-8 will do the trick or not. I've heard in passing that different UTF-8 sequences can come up with the same Unicode string, which would be a problem here.
So the question is: given a Unicode string, can I expect a single canonical UTF-8 sequence to be generated by any standards-conforming implementation of a converter? Or are there multiple possibilities?
Any given Unicode string will have only one representation in UTF-8.
I think the confusion here is that there are multiple ways in Unicode to get the same visual output for some languages. Not to mention that Unicode has several characters that have no visual representation.
But this has nothing to do with UTF-8; it's a property of Unicode itself. The encoding of a given Unicode string as UTF-8 is a purely mechanical process, and it's perfectly reversible.
The conversion rules are here:
http://en.wikipedia.org/wiki/UTF-8
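As a small illustration of how mechanical those rules are, here is a sketch (in Lua, for illustration only) covering code points up to U+FFFF; the 4-byte case follows the same pattern:
local function encode_utf8(cp)
    if cp < 0x80 then
        return string.char(cp)
    elseif cp < 0x800 then
        return string.char(0xC0 + math.floor(cp / 0x40), 0x80 + cp % 0x40)
    elseif cp < 0x10000 then
        return string.char(0xE0 + math.floor(cp / 0x1000),
                           0x80 + math.floor(cp / 0x40) % 0x40,
                           0x80 + cp % 0x40)
    end
end
print(encode_utf8(0x745E) == "瑞")  -- true: same code point, same bytes, every time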
As John already said, there is only one standards-conforming UTF-8 representation.
But the tricky point is "standards-conforming".
Older encoders are usually unable to properly convert UTF-16 because of surrogates.
Java is one notable case of those non-conforming converters (it will produce two 3-byte sequences instead of one 4-byte sequence).
MySQL had problems until recently, and I am not sure about the current status.
Now, you will only have problems with code points that need surrogates, meaning those above U+FFFF. If your application survived without Unicode for a long time, it means you never needed to move such "esoteric" characters :-)
But it is good to get things right from the get-go.
Try using standards-conforming encoders and you will be fine.
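To make the surrogate issue concrete: a code point above U+FFFF, such as U+1F600, is a single 4-byte sequence in conforming UTF-8, while a non-conforming (CESU-8 style) encoder that encodes the two UTF-16 surrogates D83D and DE00 separately produces six bytes. A small illustration (in Lua, just to show the byte counts):
local conforming = "\240\159\152\128"            -- U+1F600 as real UTF-8: F0 9F 98 80
local cesu8_style = "\237\160\189\237\184\128"   -- surrogates D83D, DE00 encoded separately: ED A0 BD ED B8 80
print(#conforming, #cesu8_style)                 -- 4    6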