Trouble parsing € with Barcode4J MessagePatternUtil - unicode

I'm trying to insert a € in the Message and/or Pattern for a barcode using
org.krysalis.barcode4j.tools.MessagePatternUtil
in Barcode4J, but I'm just getting hieroglyphics. Does anyone know how to do this?

Well, it seems MessagePatternUtil works on a byte-by-byte basis, so it can't handle multi-byte characters like the € (Euro sign); see the small sketch below.
I've posted an improved version that can handle Unicode here:
https://github.com/DanskerDave/barcode4j/blob/master/src/main/java/org/krysalis/barcode4j/tools/MessagePatternUtil.java
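For illustration only (this is not the Barcode4J code, and the class name below is made up), here is a small Java sketch of the difference: read byte by byte and the € falls apart into three UTF-8 bytes that mean nothing on their own, whereas iterating by code point keeps it as the single character U+20AC.

import java.nio.charset.StandardCharsets;

public class EuroDemo {
    public static void main(String[] args) {
        String message = "Price: 5€";

        // Byte by byte: the € is three UTF-8 bytes (E2 82 AC), none of which
        // is a meaningful character on its own.
        for (byte b : message.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();

        // Code point by code point: the € stays one unit, U+20AC.
        message.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
        System.out.println();
    }
}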

Related

How to interpret emoji in Web API

I am trying to intercept emoji and replace them with corresponding text. I left the default encoding on the Web API (UTF-8 / UTF-16 respectively).
How can I convert an emoji like 😉 to something like U+1F609?
Here is something that helped me out, although it is in Perl. It lets you encode and decode emoji, which should be what you're looking for: https://metacpan.org/pod/Encode::JP::Emoji
This is quite an old post, and even though I'm not on the project anymore, I still want to share my findings for future reference in case someone else has the same problem.
What I ended up doing was to create a dictionary with the emoji's UTF code point combination as the key and the replacement text as the value. One piece of advice: make sure the longest combinations (some emoji consist of 4 or even 5 code points) come first, because otherwise some emoji never get matched. Not the perfect, future-proof solution I was hoping for, but it worked for us and shipped into production, where it has been running since 2017.
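As a rough sketch of that dictionary approach in Java (the class name and the mappings below are invented for illustration, not the actual table from that project): keep the emoji sequences as keys and apply the longest keys first, so multi-code-point emoji are not shadowed by the shorter sequences they contain. The last line also shows how to turn an emoji into the U+1F609 style notation asked about above.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmojiReplacer {

    // Example mappings only; a real table would be much larger.
    private static final Map<String, String> EMOJI_TO_TEXT = new LinkedHashMap<>();
    static {
        EMOJI_TO_TEXT.put("\uD83D\uDC4D\uD83C\uDFFB", ":thumbs-up-light:"); // 👍🏻 thumbs up + skin tone
        EMOJI_TO_TEXT.put("\uD83D\uDC4D", ":thumbs-up:");                   // 👍
        EMOJI_TO_TEXT.put("\uD83D\uDE09", ":wink:");                        // 😉 U+1F609
    }

    public static String replaceEmoji(String input) {
        // Longest sequences first, otherwise the plain 👍 would match inside 👍🏻
        // and the skin-tone variant would never be reached.
        List<String> keys = new ArrayList<>(EMOJI_TO_TEXT.keySet());
        keys.sort((a, b) -> Integer.compare(b.length(), a.length()));

        String result = input;
        for (String emoji : keys) {
            result = result.replace(emoji, EMOJI_TO_TEXT.get(emoji));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(replaceEmoji("Nice work \uD83D\uDC4D\uD83C\uDFFB \uD83D\uDE09"));

        // Converting an emoji to U+XXXX notation, e.g. 😉 -> U+1F609:
        "\uD83D\uDE09".codePoints().forEach(cp -> System.out.printf("U+%X%n", cp));
    }
}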

Lectora Chinese characters that are manipulated using Variables are not coming in UTF-8

In Lectora Inspire V11, while building a Chinese course, all Traditional Chinese characters that are assigned directly come through in UTF-8 properly. Only the variables that are manipulated do not come through in UTF-8, which results in junk characters. Does anybody have a workaround?
You are probably better off asking for help at Lectora's community website, https://community.trivantis.com/connect/

In Corona SDK, how to reverse a Unicode string?

I know that Lua does not fully support Unicode; however, there should be a workaround to solve this problem.
string.reverse will not work with Unicode, so the following example will not work:
print(string.reverse("أحمد"))
Any help on that?
Corona SDK seems to be using UTF-8 as its encoding.
If you want to reverse all Unicode code points in a string, instead of all bytes, you can use this code:
function utf8reverse(str)
    -- first reverse the bytes inside each multi-byte UTF-8 sequence,
    -- then reverse the whole string, which puts the lead byte of each sequence back in front
    return str:gsub("([\194-\244][\128-\191]+)", string.reverse):reverse()
end
print(utf8reverse("أحمد"))
The trick is as follows: in UTF-8, a multibyte code point always starts with a byte of the form 11xx xxxx, followed by one or more bytes of the form 10xx xxxx. The first step reverses the bytes within each multibyte code point, and the final reverse of the whole string then puts them back in the right order.
Note: when a Unicode character is composed of several code points (combining characters, for example), this simple trick will not work. Full support would require a large Unicode database to deal with such cases.

Japanese mojibake detection

I want to know if there is a way to detect mojibake (invalid) characters by their byte range. (As a simple example, detecting valid ASCII characters is just a matter of checking whether their byte values are less than 128.) Given the older customized character sets, such as JIS, EUC and, of course, Unicode, is there a way to do this?
The immediate interest is in a C# project, but I'd like a solution that is as language/platform independent as possible, so I could use it in C++, Java, PHP or whatever.
Arigato (thanks)
Detecting 文字化け (mojibake) by byte range is very difficult.
As you know, most Japanese characters are multi-byte. In the case of Shift-JIS (one of the most popular encodings in Japan), the first byte of a Japanese character is in the range 0x81 to 0x9F or 0xE0 to 0xEF, and the second byte has a different range. In addition, ASCII characters may be mixed into Shift-JIS text, which makes detection difficult.
In Java, you can detect invalid characters with java.nio.charset.CharsetDecoder.
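A minimal sketch of that idea (the class name is made up; this is a hand-rolled check, not part of any particular library): configure the decoder to report malformed or unmappable input instead of substituting it, and decoding then fails for byte sequences that are invalid in the candidate charset.

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class MojibakeCheck {

    // True if the bytes form a valid sequence in the given charset.
    static boolean isValid(byte[] data, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)        // don't silently replace bad bytes
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The Shift-JIS sample bytes used in the chardet example further down: 83 82 83 57 83 6F 83 50
        byte[] data = {(byte) 0x83, (byte) 0x82, (byte) 0x83, (byte) 0x57,
                       (byte) 0x83, (byte) 0x6F, (byte) 0x83, (byte) 0x50};

        System.out.println("valid UTF-8?     " + isValid(data, "UTF-8"));     // false
        System.out.println("valid Shift_JIS? " + isValid(data, "Shift_JIS")); // true
    }
}

Keep in mind that single-byte charsets such as ISO-8859-1 accept any byte sequence, so a successful decode only tells you the bytes could be that encoding, not that they are.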
What you're trying to do here is character encoding auto-detection, as performed by Web browsers. So you could use an existing character encoding detection library, like the universalchardet library from Mozilla; it should be straightforward to port it to the platform of your choice.
For example, using Mark Pilgrim's Python 3 port of the universalchardet library:
>>> chardet.detect(bytes.fromhex('83828357836f8350'))
{'confidence': 0.99, 'encoding': 'SHIFT_JIS'}
>>> chardet.detect(bytes.fromhex('e383a2e382b8e38390e382b1'))
{'confidence': 0.938125, 'encoding': 'utf-8'}
But it's not 100% reliable!
>>> chardet.detect(bytes.fromhex('916d6f6a6962616b6592'))
{'confidence': 0.6031748712523237, 'encoding': 'ISO-8859-2'}
(Exercise for the reader: what encoding was this really?)
This is not a direct answer to the question, but I've had luck using the ftfy Python package to automatically detect/fix mojibake:
https://github.com/LuminosoInsight/python-ftfy
https://pypi.org/project/ftfy/
https://ftfy.readthedocs.io/en/latest/
>>> import ftfy
>>> print(ftfy.fix_encoding("(ง'âŒ£')ง"))
(ง'⌣')ง
It works surprisingly well for my purposes.
I don't have the time or priority to follow up on this for the moment, but I think that, if the source is known to be Unicode, some headway can be made into the issue by using these charts and following on some of the work done here. Likewise, for Shift-JIS, using this chart can be helpful.

Detect presence of a specific charset

I need a way to detect whether a file contains characters from a certain charset.
Specifically, I want to detect the presence of UTF-8-encoded Cyrillic characters in a series of files. Is there a tool to do this?
Thanks
If you are looking for a ready-made solution, you might want to try Enca.
However, if you only want to detect the presence of what could possibly be decoded as UTF-8 Cyrillic characters (without any complete UTF-8 validity check), you just have to grep for something like /(\xD0[\x81\x90-\xBF]|\xD1[\x80-\x8F\x91]){n,}/ (this exact regexp matches n consecutive UTF-8-encoded Russian Cyrillic characters). For an additional check that the whole file contains only valid UTF-8 data, you can use something like isutf8(1). (See the sketch below for the same range check expressed in code.)
Both methods have their good and bad sides and may sometimes give wrong results.
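If you would rather do the check in code than with grep, here is a rough Java sketch (the class name and the run length of 3 are arbitrary choices): decode the file as UTF-8 and look for runs of the Russian Cyrillic code points that the byte regexp above encodes.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class CyrillicScan {

    // Same set as the byte regexp above: А-Я / а-я (U+0410-U+044F) plus Ё (U+0401) and ё (U+0451).
    // {3,} requires at least three in a row, to cut down on false positives; tune as needed.
    private static final Pattern RUSSIAN_RUN =
            Pattern.compile("[\\u0401\\u0410-\\u044F\\u0451]{3,}");

    static boolean containsRussianCyrillic(Path file) throws IOException {
        // Decode as UTF-8; malformed bytes become U+FFFD and simply will not match.
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return RUSSIAN_RUN.matcher(text).find();
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + containsRussianCyrillic(Paths.get(name)));
        }
    }
}

Extend the character class to U+0400-U+04FF if you want the whole Cyrillic block rather than just the Russian subset.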
IIRC the ICU library has code that does character set detection, though it's basically a best-effort guess.
Edit: I did remember correctly; check out this paper / tutorial.