Does Apache Tika do character set conversion? - unicode

I'm using org.apache.tika.Tika.parseToString() to convert documents into plain text (i.e., unformatted text) files. My application potentially needs to convert documents that don't use a Unicode character set. For instance, some documents may be encoded in the Chinese GB2312 character set. It would be great if Tika re-coded the output into UTF-8. This would require Tika to reference a mapping between many different character sets and Unicode in order to convert the characters.
Does Tika convert the non-Unicode character set text into Unicode as the output of parseToString()? There are a lot of character sets out there so I would be impressed if Tika did this for more than a few character sets.
Update: I was able to create a couple different files with some non-Latin charsets (GB2312 (Chinese) and KOI8-R (Russian)). Tika.parseToString() couldn't even detect the charset or encoding. I opened an issue on the Tika bug tracker here: https://issues.apache.org/jira/browse/TIKA-1262

When talking about character sets in Apache Tika, you need to consider two kinds of files differently. One kind is essentially plain text; the other covers more complex types (including binary ones).
With the more complex files, Tika mostly uses third party libraries, and these libraries are responsible for returning Java Strings. The exact way of doing that will depend on the file format in question - sometimes the file format will include encoding information, other times it'll be fixed in what it supports. Either way, Tika gets Java Strings, and returns to you a Java String. How you choose to encode that for output is up to you. (For Windows users especially, check the encoding of your terminal, and the font used. There've been lots of "Tika Encoding Problems" which were actually people failing to correctly set the default Java encoding on output, or failing to have a Unicode capable terminal!)
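To illustrate the point about output encoding: a Java String has no byte encoding of its own, so the charset has to be named at the moment the text is written out. The following is a minimal JDK-only sketch, not Tika code. The class and method names are made up for illustration, and the sample text stands in for whatever parseToString() returned.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; the names are illustrative and not part of Tika.
class Utf8Output {
    // Writes the text as UTF-8 bytes, independent of the platform default encoding.
    static void writeUtf8(OutputStream out, String text) {
        try {
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            writer.write(text);
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Two CJK characters, written with escapes so the source file stays ASCII.
        writeUtf8(buffer, "\u4E2D\u6587");
        System.out.println(buffer.size() + " bytes written");
    }
}
```

The same method works with a FileOutputStream. Whether the result then displays correctly still depends on the terminal or viewer being set to UTF-8.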
With plain text files, there's no encoding information in the file, all we have is a bunch of bytes. Here, Apache Tika uses one of a number of EncodingDetector instances to do the detection. These use hints, n-grams, language detection etc, to try to work out the most likely encoding of the file based on information given, pattern of bytes in the file etc.
The definition of EncodingDetector is held in the Tika-Core jar, but most of the implementations are held in the Tika-Parsers jar (and loaded by the service loader method, just like Detectors and Parsers). The main ones are here in SVN. If you check there, you'll see the main list of encodings that Tika can detect.
One final thing - the encoding detection is only performed on files that are text files, it isn't done on the binary type files. Depending on how you call Tika, you might need to tweak that and/or provide a hint that it's a text file, so that the EncodingDetector logic gets triggered.

This answer actually comes from a JIRA user on the Tika project. https://issues.apache.org/jira/browse/TIKA-1262
It turns out that if you tell Tika that the file extension is '.txt' it will treat the file as plain text, attempt to detect the encoding, and decode the content into a Unicode string.
An easy way to do this is to pass an empty Metadata object to TikaInputStream.get(). This will fill out the resourceName field of the Metadata object. Then pass this object to parseToString(). With the resourceName field set to a file name that ends with .txt, the parser knows to treat this file as plain text and will run encoding detection to try to discover how to decode the file. The string returned from parseToString() is a Java UTF-16 String object. When written to a file you can see that it is Unicode and uses the UCS charset.
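// The Metadata object receives the file name (resourceName) from TikaInputStream.get()
Tika tika = new Tika();
Metadata metadata = new Metadata();
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata);
// With a .txt resource name, parseToString() treats the input as plain text and runs encoding detection
String contents = tika.parseToString(reader, metadata);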
So far this has worked for text files using GB2312/GB18030 and KOI8-R. This is the expected behavior and it's perfect! I don't know what other charsets/encodings it can handle.

Related

What happens if you set your integration package to Unicode?

I'm importing data from flat-files (text files). I do not know which encoding they will use, it may be unicode, or it may be ASCII. What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues? I am using SSIS 2012.
What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues?
The encoding that Microsoft misleadingly call “Unicode” is actually UTF-16LE, an encoding based around two-byte code units.
UTF-16LE is not compatible with ASCII (or any of the locale-specific ANSI code pages), so if you read a file that is actually encoded in an ASCII superset you will get unreadable nonsense.
There's no magic ‘do the right thing’ option for reading characters from files, you have to know what encoding was used to create them. If you can see an encoded Byte Order Mark on the front of the data that usually allows you to make a good guess, but otherwise you're on your own.
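As a rough sketch of the Byte Order Mark check described above, the following JDK-only example looks at the first bytes of the data and returns the charset they imply, or null when no BOM is present. It is written in Java purely for illustration (SSIS has no such API); the class and method names are made up, and UTF-32 BOMs are deliberately ignored.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; handles only UTF-8 and UTF-16 BOMs.
class BomSniffer {
    // Returns the charset implied by a leading BOM, or null if there is none.
    static Charset fromBom(byte[] data) {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF && (data[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE && (data[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] ascii = "AB".getBytes(StandardCharsets.US_ASCII);
        // No BOM, so the sniffer cannot tell us anything about this data.
        System.out.println(fromBom(ascii));
        // Misreading the two ASCII bytes as UTF-16LE fuses them into one code unit, U+4241.
        String misread = new String(ascii, StandardCharsets.UTF_16LE);
        System.out.println(Integer.toHexString(misread.charAt(0)));
    }
}
```

The second half of main shows the "unreadable nonsense" case: two ASCII letters decoded as UTF-16LE become a single unrelated character.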

Displaying Chinese characters on a form from an INI File

My plugin reads the control caption text from an INI file (ANSI as UTF-8 encoding) in order to display multiple languages. Key point being it is a plugin, I have no control nor ability to change this INI file format or file type.
They are currently being read into my plugin with TINIFile.ReadString and stored as a string. I can modify this (data type, read method, etc) as needed.
The main application reads from its own application language files that are UCS-2 Little Endian encoded as a TXT file. These display fine when the language is changed, even when the Windows OS is kept in English (in other words no OS locale changes need to be made for the application to switch display languages).
My plugin's form cannot display Asian characters (Chinese, Japanese, Korean, etc). English language is fine.
I have tried various fonts, using various combinations of AnsiString, String, etc. What am I missing to be able to display Asian characters on the form? I have not found a similar question to what I'm trying to do specifically with how my language text is being read into the plugin.
If the .INI file reader does not interpret the contents of the values, and allows all values through transparently, then you need to map the strings into one with the correct locale.
There is a similar question at Delphi 2010: how do I convert a UTF8-encoded PAnsiChar to a UnicodeString? that explains how to do the conversion. You may need to extract the contents into a RawByteString to avoid the implicit conversions.

What charset to use to store russian text into javascript files as an array

I am creating a coldfusion page, that takes language translation data stored in a table in my database, and makes static js files for each language pairing of english to ___ etc...
I am now starting to work on russian, I was able to get the other languages to work fine..
However, when it saves the file, all the text looks like question marks. Even when I run my translation app, the text for just that language looks like all ?????
I have tried writing it via cffile as utf-8 or ISO-8859-1 but neither seems to get it to display properly.
Any suggestions?
Have you tried ISO-8859-5? I believe it's the encoding that "should" be used for Russian.
By all means do use UTF-8 over any other encoding type. You need to make sure that:
your cfm templates were written to disk with UTF-8 encoding (notepad++ handles that nicely, and so does Eclipse or the new ColdFusion Builder)
your database was created with the proper codepage for nvarchar (and varchar) datatypes
your database connection handles UTF-8
How to go about the last two items depends on your database back-end. Coldfusion is quite agnostic in that regard, as it will happily use any jdbc driver that you may need.
When working in a multi-character set environment, character set conversion issues can occur and it can be difficult to determine where the conversion issue occurred.
There are two categories into which conversion issues can be placed. The first involves sending data in the wrong format to the client API. Although this cannot happen with Unicode APIs, it is possible with all other client APIs and results in garbage data.
The second category of issue involves a character that does not have an equivalent in the final character set, or in one of the intermediate character sets. In this case, a substitution character is used. This is called lossy conversion and can happen with any client API. You can avoid lossy conversions by configuring the database to use UTF-8 for the database character set.
The advantage of UTF-8 over any other encoding is that you can handle any number of languages in the same database / client.
I can't personally reproduce this problem at all. Is the ColdFusion template that is making the call itself UTF-8? (with or without a BOM it matters not for Russian). In any case UTF-8 is absolutely what you should be using. Make sure you get a UTF-8 compliant editor. Which is most things on Mac. On Windows you could use Scite or GVim.
The correct encoding to use in a .js file is whatever encoding the parent page is in. Whilst there are methods to serve JavaScript using a different encoding to the page including it, they don't work on all browsers.
So make sure your web page is being saved and served in an encoding that contains the Russian characters, and then save the .js file using the same encoding. That will be either:
ISO-8859-5. A single-byte encoding with Cyrillic in the high bytes, similar to Windows code page 1251. cp1251 will be the default encoding when you save in a text editor from a Russian install of Windows;
or UTF-8. A multi-byte encoding that contains every character. All modern websites should be using UTF-8.
(ISO-8859-1 is Western European and does not include any Cyrillic. It is similar to code page 1252, the default on a Western Windows install. It's of no use to you.)
So, best is to save both the cf template and the js file as UTF-8, and add <cfprocessingdirective pageencoding="utf-8"> if CF doesn't pick it up automatically.
If you can't control the encoding of the page that includes the script (for example because it's a third party), then you can't use any non-ASCII characters directly. You would have to use JavaScript string literal escapes instead:
var translation_ru= {
launchMyCalendar: '\u0417\u0430\u043f\u0443\u0441\u043a \u041c\u043e\u0439 \u043a\u0430\u043b\u0435\u043d\u0434\u0430\u0440\u044c'
};
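If the translations are generated by a server-side script, the escapes can be produced mechanically. The sketch below is one possible way to do it in Java (which ColdFusion can call into), not something taken from the question. The class and method names are invented for illustration, and the function escapes every UTF-16 code unit outside printable ASCII, plus the single quote and backslash, so the result can sit inside a single-quoted JavaScript literal.

```java
// Hypothetical helper; names are illustrative only.
class JsEscaper {
    // Returns an ASCII-only version of the text, using backslash-u escapes for everything else.
    static String toAsciiLiteral(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean plain = c >= 0x20 && c < 0x7F && c != '\'' && c != '\\';
            if (plain) {
                sb.append(c);
            } else {
                // Four lowercase hex digits per UTF-16 code unit, as JavaScript expects.
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First word of the caption above, written with Java escapes to keep this file ASCII.
        String word = "\u0417\u0430\u043f\u0443\u0441\u043a";
        System.out.println("'" + toAsciiLiteral(word) + "'");
    }
}
```

Because both Java and JavaScript strings are sequences of UTF-16 code units, characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which JavaScript reads back correctly.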
when it saves to file it is "·ÐßãáÚ ¼ÞÙ ÚÐÛÕÝÔÐàì" so the charset is wrong
Looks like the file was saved as ISO-8859-5 and then read as cp1252, the default codepage on a Western server: the bytes in that garbled string are the ISO-8859-5 codes for the Russian text (text saved as cp1251 would have come out starting "Çàïóñê" instead).
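One way to check which pair of encodings produced a given garbling is to reproduce it. The sketch below (Java, JDK charsets only, with made-up class and method names) encodes the first word of the caption with a candidate source charset and decodes the bytes as windows-1252. With ISO-8859-5 as the source it yields "·ÐßãáÚ", matching the start of the garbled string quoted above; with windows-1251 it yields "Çàïóñê".

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic helper; names are illustrative only.
class MojibakeCheck {
    // Encodes the text with one charset and decodes the resulting bytes with another.
    static String misread(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        // First word of the Russian caption, written with escapes to keep this file ASCII.
        String word = "\u0417\u0430\u043f\u0443\u0441\u043a";
        System.out.println(misread(word, "ISO-8859-5", "windows-1252"));
        System.out.println(misread(word, "windows-1251", "windows-1252"));
    }
}
```

What the console actually shows depends on its own encoding, so comparing the strings in code is more reliable than reading them off the terminal.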
I also just found out that my text editor of choice, textpad, doesn't support unicode.
Yes, that was my reason for no longer using it too. EmEditor (commercial) and Notepad++ (open-source) are good replacements.

What encoding does InstallShield expect non-latin-alphabet string table entries to use?

I work on an app that gets distributed via a single installer containing multiple localizations. The build process includes a script that updates the .ism string table with translations for each supported language.
This works fine for languages like French and German. But when testing the installer in, e.g., Japanese, the text shows up as a series of squares. It's unlikely to be a font problem, since the InstallShield-supplied strings show up fine; only the string table entries are mangled. So the problem seems to be that the strings are in the wrong encoding.
The .ism is in XML format, with UTF-8 declared as its encoding, so I assumed the strings needed to be UTF-8 encoded as well. Do they actually need to use the encoding of the target platform? Is there any concern, then, about targets having different encodings, e.g. Chinese systems using one GB encoding versus another? What is the right thing to do here?
Edit: Using InstallShield 2009, since there is apparently a difference between that and 2010.
In InstallShield 2009 and earlier, the encoding is a base-64 encoding of the binary string in the ANSI encoding specific to the language in question (e.g. CP932 for Japanese). In InstallShield 2010 and later, it will still accept that or use UTF-8, depending on other columns in that table.
Thanks (up-voted his answer) go to Michael Urman, for pointing us in the right direction. But this is the actual working (with InstallShield 2009) algorithm, reverse-engineered by a co-worker:
Start with a unicode (multi-byte-character) string
Write out the length as the encoded-length field in the ism-file
Encode the string as UTF-16-little-endian
Base-64 using the uuencode dictionary, except with ` (back-tick) instead of spaces.
Write the result to the ism-file, escaping XML entities
Be aware that base-64ing using the uuencode dictionary is not the same as using the uuencode algorithm. Standard uuencode produces a set of newline-separated lines, including a header, footers and one or more data lines, each of which begins with a length-character. If you're implementing this using a uuencode codec, you'll need to strip all of that off.
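The encoding steps above can be sketched as follows. This is only an interpretation of the description, not verified against InstallShield: it assumes the "uuencode dictionary" maps each 6-bit value v to the character with code v + 32 (back-tick for 0), that a final partial group is padded with zero bytes, and it leaves out the length field and the XML escaping. The class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the described scheme; not an InstallShield API.
class IsmStringEncoder {
    // Maps a 6-bit value to the uuencode dictionary, with back-tick standing in for space.
    private static char sixBitChar(int v) {
        return v == 0 ? '`' : (char) (v + 32);
    }

    // Encodes the text as UTF-16LE, then packs every 3 bytes into 4 dictionary characters.
    static String encode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_16LE);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < bytes.length; i += 3) {
            int b0 = bytes[i] & 0xFF;
            // Assumption: missing bytes in the last group are treated as zero.
            int b1 = i + 1 < bytes.length ? bytes[i + 1] & 0xFF : 0;
            int b2 = i + 2 < bytes.length ? bytes[i + 2] & 0xFF : 0;
            out.append(sixBitChar(b0 >> 2));
            out.append(sixBitChar(((b0 & 0x03) << 4) | (b1 >> 4)));
            out.append(sixBitChar(((b1 & 0x0F) << 2) | (b2 >> 6)));
            out.append(sixBitChar(b2 & 0x3F));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Shi"));
    }
}
```

Note that the output alphabet includes characters such as &, < and ", which is why the last step in the list escapes XML entities before writing to the .ism file.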
I'm also trying to figure this out...
I've inherited some InstallShield 12 (which is pre-2009) projects with string table entries containing characters outside the range of base64 'target' characters.
For example, one of the Japanese strings is:
4P!H&$9!O'<4!R&\=!E&,=``#$(80!C&L=0!P"00!G`&4`;#!T`)(PI##S,+DPR##\,.LP5S!^,%DP`C
After much searching I happened upon Base85 encoding, which looks much closer to being plausible, but have not yet verified this to be the solution.

Is there a standard encoding for NEEDED entries in ELF?

I'm trying to make some of my code a bit more friendly to non-pure-ASCII systems and was wondering whether a particular character encoding is used for NEEDED entries in ELF binaries. Or is it unstandardized and based on the creating system's filesystem encoding (or even just the raw bytes that were passed to whatever created the binary)? If so, is there any place in the binary that specifies the encoding? I don't think assuming the current system's encoding would work very well for my usage. Or are non-ASCII names pretty much banned, or is it something else?
ELF format specifies NEEDED fields as "null-terminated string" and does not say more about the encoding, which pretty much implies 8-bit ASCII string.
I personally don't see any point in complicating an executable file format specification in a way that adds no value to the final product or the development process: the user won't see library names, so they won't care about their localization. You may try to use UTF-8, but the actual file system encoding is not guaranteed to be UTF-8. To be sure, you need to know how your target linker handles those strings.
As far as I know, the standard Unix way of dealing with non-ASCII characters is to encode them as UTF-8.